Table 6 Consensus reached on standards for preferred statistical methods for reliability

From: COSMIN Risk of Bias tool to assess the quality of studies on reliability or measurement error of outcome measurement instruments: a Delphi study

Statistical methods standards, each rated on a four-point scale: very good, adequate, doubtful, inadequate.

7. For continuous scores: was an Intraclass Correlation Coefficient (ICC)a calculated? 28/35 (80%) (R2b)

Very good: ICC calculated; the model or formula was described, and matches the study designc and the data. 30/35 (86%) (R2)

Adequate: ICC calculated but model or formula was not described or does not optimally match the study designc, OR Pearson or Spearman correlation coefficient calculated WITH evidence provided that no systematic difference between measurements has occurred.

Doubtful: Pearson or Spearman correlation coefficient calculated WITHOUT evidence provided that no systematic difference between measurements has occurred, 25/35 (71%) (R2), OR WITH evidence provided that a systematic difference between measurements has occurred. 25/34 (74%) (R2)

Inadequate: (no standard)

8. For ordinal scores: was a (weighted) Kappa calculated? 26/36 (72%) (R2)

Very good: Kappa calculated; the weighting scheme was described, and matches the study design and the data. 27/36 (75%) (R3d)

Adequate: Kappa calculated, but weighting scheme not described or does not optimally match the study design. 19/36 (53%) (R3)

Doubtful: (no standard)

Inadequate: (no standard)

9. For dichotomous/nominal scores: was Kappa calculated for each category against the other categories combined? 23/33 (70%) (R3)

Very good: Kappa calculated for each category against the other categories combined.

Adequate, doubtful, inadequate: (no standard)
a. Generalizability and Decision coefficients are ICCs
b. R2: consensus reached in round 2
c. Based on panelists' suggestions, the steering committee decided after round 3 to use the term 'study design' instead of 'reviewer-constructed research question'
d. R3: consensus reached in round 3
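To make item 7 concrete, here is a minimal pure-Python sketch of one common ICC model — two-way random effects, absolute agreement, single measurement, often written ICC(2,1). The table does not prescribe this particular model; it only requires that whatever model or formula is used be described and match the study design. The function name and data layout are illustrative, not from the source.

```python
def icc_2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `scores` is a list of rows, one per subject, each holding one score per
    rater. Illustrative sketch only: COSMIN asks that the chosen ICC model be
    reported and match the study design, not that this specific model be used.
    """
    n = len(scores)      # number of subjects
    k = len(scores[0])   # number of raters
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(scores[i][j] for i in range(n)) / n for j in range(k)]

    # Mean squares from the two-way ANOVA decomposition.
    ms_rows = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_cols = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    ms_err = sum(
        (scores[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n) for j in range(k)
    ) / ((n - 1) * (k - 1))

    # Shrout-Fleiss form of ICC(2,1).
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )
```

This sketch also shows why the table treats a Pearson or Spearman coefficient without evidence of no systematic difference as merely doubtful: with a constant offset between two raters (e.g. rater 2 always scores one point higher), the Pearson correlation is exactly 1, yet the absolute-agreement ICC above falls below 1 because the systematic difference counts as disagreement.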
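Items 8 and 9 call for a (weighted) Kappa. A minimal self-contained sketch of Cohen's kappa with optional linear or quadratic disagreement weights follows; the function name and the weighting choices are illustrative assumptions — the table requires only that the weighting scheme used be described and match the study design.

```python
def cohen_kappa(rater1, rater2, categories, weights=None):
    """Cohen's kappa for two raters; weights: None, "linear", or "quadratic".

    Illustrative sketch: COSMIN requires the weighting scheme to be described
    and to match the study design, not any particular scheme.
    """
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}
    n = len(rater1)

    # Observed cross-classification counts.
    obs = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater1, rater2):
        obs[idx[a]][idx[b]] += 1

    # Expected counts under chance agreement (independent marginals).
    row = [sum(obs[i]) for i in range(k)]
    col = [sum(obs[i][j] for i in range(k)) for j in range(k)]
    exp = [[row[i] * col[j] / n for j in range(k)] for i in range(k)]

    def w(i, j):
        # Disagreement weight between category ranks i and j.
        if weights == "quadratic":
            return ((i - j) / (k - 1)) ** 2
        if weights == "linear":
            return abs(i - j) / (k - 1)
        return 0.0 if i == j else 1.0  # unweighted kappa

    disagree_obs = sum(w(i, j) * obs[i][j] for i in range(k) for j in range(k))
    disagree_exp = sum(w(i, j) * exp[i][j] for i in range(k) for j in range(k))
    return 1.0 - disagree_obs / disagree_exp
```

Item 9's per-category standard can then be read as collapsing a nominal scale to one category versus all others and computing unweighted kappa on the result, e.g. `cohen_kappa([x == 'a' for x in rater1], [x == 'a' for x in rater2], [False, True])`, repeated for each category.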