Comparison of different rating scales for the use in Delphi studies: different scales lead to different consensus and show different test-retest reliability

Table 3 Inter-individual comparison of rating scales

Statistic	three-point scale	five-point scale	nine-point scale	Sensitivity analysis: five-point scale^a	Sensitivity analysis: nine-point scale^a
Overall^b
Changes in 2nd survey (in %)	12.48	24.73	32.26	20.96	8.57
Class imbalance^c 1st survey (in %)	79.16	64.93	63.62	64.93	88.25
Test-retest agreement (in %)	87.52	75.27	67.74	79.04	91.43
Weighted kappa [95% CI]	0.63 [0.62; 0.64]	0.47 [0.07; 0.86]	0.78 [0.78; 0.78]	0.54 [0.50; 0.58]	0.58 [0.55; 0.62]
Mean [range] over the 19 proposed treatment goals
Changes in 2nd survey (in %)	12.60 [2.41; 25.61]	24.75 [16.05; 38.82]	32.43 [17.07; 55.13]	20.96 [16.05; 28.24]	8.69 [0.00; 24.00]
Class imbalance^c 1st survey (in %)	0.80 [49.38; 95.35]	66.05 [35.71; 83.13]	63.46 [21.25; 81.18]	68.65 [45.78; 83.13]	88.19 [37.50; 100.00]
Test-retest agreement (in %)	87.40 [74.39; 97.59]	75.25 [61.18; 83.95]	67.57 [44.87; 82.93]	79.04 [71.76; 83.95]	91.31 [76.00; 100.00]
Weighted kappa	0.55 [0.18; 0.87]	0.44 [0.29; 0.62]	0.61 [0.17; 0.81]	0.49 [0.35; 0.67]	0.40 [0.00; 0.80]

^aRating scale mapped onto three categories
^bOverall refers to the total ratings of all participants for all treatment goals, i.e., the number of participants times 19 goals times the ratings on the respective scale (five-point/nine-point scale)
^cClass imbalance is expressed as the percentage of ratings in the most frequently used rating category (e.g., in the first survey, the rating categories main goal/secondary goal/no goal accounted for 79%/11%/10% of all participants’ ratings of all goals; hence, the imbalance is 79%)

ISSN: 1471-2288