Table 2 Inter-rater agreement (percentage agreement) and reliability (kappa coefficients) of the items from the COSMIN checklist (COSMIN step 3)

From: Inter-rater agreement and reliability of the COSMIN (COnsensus-based Standards for the selection of health status Measurement Instruments) Checklist

| Item nr | Item | N (minus articles with 1 rating)a | % agreement | N | Kappa |
| --- | --- | --- | --- | --- | --- |
| **Box A. Internal consistency (n = 195)b** | | | | | |
| A1 | Does the scale consist of effect indicators, i.e. is it based on a reflective model? | 185 | 82 | 193 | 0.06 |
| *Design requirements* | | | | | |
| A2c | Was the percentage of missing items given? | 183 | 87 | 190 | 0.48 |
| A3c | Was there a description of how missing items were handled? | 180 | 90 | 187 | 0.54 |
| A4 | Was the sample size included in the internal consistency analysis adequate? | 177 | 87 | 185 | 0.06d |
| A5c | Was the unidimensionality of the scale checked? i.e. was factor analysis or IRT model applied? | 180 | 92 | 187 | 0.69 |
| A6 | Was the sample size included in the unidimensionality analysis adequate? | 166 | 79 | 178 | 0.27 |
| A7 | Was an internal consistency statistic calculated for each (unidimensional) (sub)scale separately? | 179 | 85 | 187 | 0.31d |
| A8c | Were there any important flaws in the design or methods of the study? | 174 | 86 | 179 | 0.22d |
| *Statistical methods* | | | | | |
| A9 | for Classical Test Theory (CTT): Was Cronbach's alpha calculated? | 179 | 93 | 187 | 0.27d,e |
| A10 | for dichotomous scores: Was Cronbach's alpha or KR-20 calculated? | 151 | 91 | 165 | 0.17d,e |
| A11 | for IRT: Was a goodness of fit statistic at a global level calculated? e.g. χ², reliability coefficient of estimated latent trait value (index of (subject or item) separation) | 154 | 93 | 167 | 0.46d,e |
| **Box B. Reliability (n = 141)b** | | | | | |
| *Design requirements* | | | | | |
| B1c | Was the percentage of missing items given? | 129 | 87 | 140 | 0.39 |
| B2c | Was there a description of how missing items were handled? | 125 | 91 | 137 | 0.43d |
| B3 | Was the sample size included in the analysis adequate? | 127 | 77 | 139 | 0.35 |
| B4c | Were at least two measurements available? | 129 | 98 | 140 | 0.72d |
| B5 | Were the administrations independent? | 129 | 73 | 139 | 0.18 |
| B6c | Was the time interval stated? | 125 | 94 | 136 | 0.50d |
| B7 | Were patients stable in the interim period on the construct to be measured? | 126 | 75 | 138 | 0.24 |
| B8 | Was the time interval appropriate? | 125 | 84 | 137 | 0.45 |
| B9 | Were the test conditions similar for both measurements? e.g. type of administration, environment, instructions | 127 | 83 | 138 | 0.30 |
| B10c | Were there any important flaws in the design or methods of the study? | 117 | 77 | 129 | 0.08 |
| *Statistical methods* | | | | | |
| B11 | for continuous scores: Was an intraclass correlation coefficient (ICC) calculated? | 119 | 86 | 133 | 0.59e |
| B12 | for dichotomous/nominal/ordinal scores: Was kappa calculated? | 111 | 81 | 127 | 0.32e |
| B13 | for ordinal scores: Was a weighted kappa calculated? | 111 | 83 | 127 | 0.42e |
| B14 | for ordinal scores: Was the weighting scheme described? e.g. linear, quadratic | 108 | 81 | 124 | 0.35e |
| **Box D. Content validity (n = 83)b** | | | | | |
| *Design requirements* | | | | | |
| D1 | Was there an assessment of whether all items refer to relevant aspects of the construct to be measured? | 62 | 79 | 83 | 0.33 |
| D2 | Was there an assessment of whether all items are relevant for the study population? (e.g. age, gender, disease characteristics, country, setting) | 62 | 76 | 83 | 0.46 |
| D3 | Was there an assessment of whether all items are relevant for the purpose of the measurement instrument? (discriminative, evaluative, and/or predictive) | 62 | 66 | 83 | 0.21 |
| D4 | Was there an assessment of whether all items together comprehensively reflect the construct to be measured? | 62 | 66 | 83 | 0.15 |
| D5c | Were there any important flaws in the design or methods of the study? | 58 | 76 | 78 | 0.13 |
| **Box E. Structural validity (n = 118)b** | | | | | |
| E1 | Does the scale consist of effect indicators, i.e. is it based on a reflective model? | 99 | 78 | 116 | 0f |
| *Design requirements* | | | | | |
| E2c | Was the percentage of missing items given? | 95 | 87 | 110 | 0.41 |
| E3c | Was there a description of how missing items were handled? | 93 | 91 | 109 | 0.55 |
| E4 | Was the sample size included in the analysis adequate? | 94 | 87 | 109 | 0.56d |
| E5c | Were there any important flaws in the design or methods of the study? | 89 | 84 | 103 | 0.27 |
| *Statistical methods* | | | | | |
| E6 | for CTT: Was exploratory or confirmatory factor analysis performed? | 92 | 90 | 106 | 0.51d,e |
| E7 | for IRT: Were IRT tests for determining the (uni-)dimensionality of the items performed? | 62 | 87 | 80 | 0.39e,f |
| **Box F. Hypotheses testing (n = 170)b** | | | | | |
| *Design requirements* | | | | | |
| F1c | Was the percentage of missing items given? | 158 | 87 | 168 | 0.41 |
| F2c | Was there a description of how missing items were handled? | 159 | 92 | 169 | 0.60d |
| F3 | Was the sample size included in the analysis adequate? | 157 | 84 | 167 | 0.12d |
| F4 | Were hypotheses regarding correlations or mean differences formulated a priori (i.e. before data collection)? | 158 | 74 | 168 | 0.42 |
| F5 | Was the expected direction of correlations or mean differences included in the hypotheses? | 159 | 75 | 169 | 0.26e |
| F6 | Was the expected absolute or relative magnitude of correlations or mean differences included in the hypotheses? | 159 | 82 | 168 | 0.29e |
| F7c | for convergent validity: Was an adequate description provided of the comparator instrument(s)? | 125 | 83 | 136 | 0.30 |
| F8c | for convergent validity: Were the measurement properties of the comparator instrument(s) adequately described? | 124 | 81 | 135 | 0.35 |
| F9c | Were there any important flaws in the design or methods of the study? | 131 | 81 | 145 | 0.17 |
| *Statistical methods* | | | | | |
| F10 | Were design and statistical methods adequate for the hypotheses to be tested? | 150 | 78 | 161 | 0.00d,e,f |
| **Box G. Cross-cultural validity (n = 33)b** | | | | | |
| *Design requirements* | | | | | |
| G1c | Was the percentage of missing items given? | 25 | 88 | 32 | 0.52 |
| G2c | Was there a description of how missing items were handled? | 22 | 82 | 30 | 0.32 |
| G3 | Was the sample size included in the analysis adequate? | 26 | 81 | 33 | 0.23 |
| G4c | Were both the original language in which the HR-PRO instrument was developed, and the language in which the HR-PRO instrument was translated described? | 28 | 89 | 33 | 0.34d |
| G5c | Was the expertise of the people involved in the translation process adequately described? e.g. expertise in the disease(s) involved, expertise in the construct to be measured, expertise in both languages | 28 | 86 | 33 | 0.46 |
| G6 | Did the translators work independently from each other? | 28 | 89 | 33 | 0.61 |
| G7 | Were items translated forward and backward? | 28 | 100 | 33 | 1.00 |
| G8c | Was there an adequate description of how differences between the original and translated versions were resolved? | 28 | 86 | 33 | 0.50 |
| G9c | Was the translation reviewed by a committee (e.g. original developers)? | 25 | 88 | 31 | 0.56 |
| G10c | Was the HR-PRO instrument pre-tested (e.g. cognitive interviews) to check interpretation, cultural relevance of the translation, and ease of comprehension? | 21 | 90 | 29 | 0.61 |
| G11c | Was the sample used in the pre-test adequately described? | 28 | 79 | 32 | 0f |
| G12 | Were the samples similar for all characteristics except language and/or cultural background? | 26 | 81 | 31 | 0.41 |
| G13c | Were there any important flaws in the design or methods of the study? | 26 | 85 | 31 | 0.42 |
| *Statistical methods* | | | | | |
| G14 | for CTT: Was confirmatory factor analysis performed? | 27 | 74 | 32 | 0.03e,f |
| G15 | for IRT: Was differential item functioning (DIF) between language groups assessed? | 13 | 77 | 23 | 0.28e,f |
| **Box H. Criterion validity (n = 57)b** | | | | | |
| *Design requirements* | | | | | |
| H1c | Was the percentage of missing items given? | 35 | 91 | 56 | 0.59d |
| H2c | Was there a description of how missing items were handled? | 35 | 97 | 56 | 0.79d |
| H3 | Was the sample size included in the analysis adequate? | 35 | 69 | 54 | 0.06 |
| H4 | Can the criterion used or employed be considered as a reasonable 'gold standard'? | 37 | 62 | 57 | 0f |
| H5c | Were there any important flaws in the design or methods of the study? | 33 | 79 | 54 | 0.10 |
| *Statistical methods* | | | | | |
| H6 | for continuous scores: Were correlations, or the area under the receiver operating curve calculated? | 37 | 78 | 56 | 0.16e |
| H7 | for dichotomous scores: Were sensitivity and specificity determined? | 29 | 83 | 47 | 0.28e,f |
| **Box I. Responsiveness (n = 79)b** | | | | | |
| *Design requirements* | | | | | |
| I1c | Was the percentage of missing items given? | 71 | 82 | 76 | 0.14d |
| I2c | Was there a description of how missing items were handled? | 73 | 92 | 77 | 0.36d |
| I3 | Was the sample size included in the analysis adequate? | 72 | 72 | 76 | 0.40 |
| I4c | Was a longitudinal design with at least two measurements used? | 73 | 100 | 78 | 1.00d |
| I5c | Was the time interval stated? | 73 | 89 | 78 | 0.25d |
| I6c | If anything occurred in the interim period (e.g. intervention, other relevant events), was it adequately described? | 72 | 78 | 75 | 0.17 |
| I7c | Was a proportion of the patients changed (i.e. improvement or deterioration)? | 70 | 97 | 73 | 0.32d |
| *Design requirements for hypotheses testing (for constructs for which a gold standard was not available)* | | | | | |
| I8 | Were hypotheses about changes in scores formulated a priori (i.e. before data collection)? | 65 | 69 | 72 | 0.35 |
| I9 | Was the expected direction of correlations or mean differences of the change scores of HR-PRO instruments included in these hypotheses? | 60 | 78 | 65 | 0.19e |
| I10 | Were the expected absolute or relative magnitude of correlations or mean differences of the change scores of HR-PRO instruments included in these hypotheses? | 61 | 90 | 66 | 0.05d,e |
| I11c | Was an adequate description provided of the comparator instrument(s)? | 56 | 70 | 63 | 0f |
| I12c | Were the measurement properties of the comparator instrument(s) adequately described? | 56 | 80 | 63 | 0.06 |
| I13c | Were there any important flaws in the design or methods of the study? | 63 | 71 | 68 | 0.03 |
| *Statistical methods* | | | | | |
| I14 | Were design and statistical methods adequate for the hypotheses to be tested? | 63 | 73 | 67 | 0.21e,f |
| *Design requirements for comparison to a gold standard (for constructs for which a gold standard was available)* | | | | | |
| I15 | Can the criterion for change be considered as a reasonable 'gold standard'? | 21 | 67 | 28 | 0f |
| I16c | Were there any important flaws in the design or methods of the study? | 12 | 67 | 21 | 0f |
| *Statistical methods* | | | | | |
| I17 | for continuous scores: Were correlations between change scores, or the area under the receiver operating characteristic (ROC) curve calculated? | 28 | 79 | 39 | 0.47e,f |
| I18 | for dichotomous scales: Were sensitivity and specificity (changed versus not changed) determined? | 28 | 79 | 37 | 0.15e |
| **Box J. Interpretability (n = 42)b** | | | | | |
| J1c | Was the percentage of missing items given? | 22 | 95 | 41 | 0.80 |
| J2c | Was there a description of how missing items were handled? | 21 | 76 | 41 | 0.19 |
| J3 | Was the sample size included in the analysis adequate? | 23 | 74 | 41 | 0f |
| J4c | Was the distribution of the (total) scores in the study sample described? | 23 | 74 | 41 | 0.08 |
| J5c | Was the percentage of the respondents who had the lowest possible (total) score described? | 20 | 95 | 40 | 0.84 |
| J6c | Was the percentage of the respondents who had the highest possible (total) score described? | 21 | 90 | 41 | 0.70 |
| J7c | Were scores and change scores (i.e. means and SD) presented for relevant (sub)groups? e.g. for normative groups, subgroups of patients, or the general population | 21 | 76 | 41 | 0.05 |
| J8c | Was the minimal important change (MIC) or the minimal important difference (MID) determined? | 19 | 89 | 40 | 0.26d |
| J9c | Were there any important flaws in the design or methods of the study? | 21 | 71 | 41 | 0f |

a. When calculating percentage agreement, articles that were scored only once on the particular item were not taken into account.
b. Number of times a box was evaluated.
c. Dichotomous item.
d. Item with low dispersal, i.e. more than 75% of the raters who responded to the item rated the same response category.
e. Combined kappa coefficient calculated because of a nominal response scale in a one-way design.
f. Negative variance component in the calculation of kappa was set at 0.
g. Sample sizes of the Generalisability box are much higher than those of the other items, because scores of the items on the Generalisability box were combined across all measurement properties.

Values printed in bold indicate kappa > 0.70 or % agreement > 80%.
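For readers who want to reproduce the two statistics reported in this table, the following is a minimal illustrative sketch of percentage agreement and Cohen's kappa for a pair of raters scoring one item. It is not the exact procedure used in the study (footnote e describes a combined kappa for a one-way design, which this simple two-rater form does not implement); the function names and example data are hypothetical.

```python
from collections import Counter

def percent_agreement(r1, r2):
    """Fraction of articles on which the two raters gave the same rating."""
    assert len(r1) == len(r2) and len(r1) > 0
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

def cohens_kappa(r1, r2):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(r1)
    p_o = percent_agreement(r1, r2)                  # observed agreement
    c1, c2 = Counter(r1), Counter(r2)                # marginal counts per rater
    categories = set(r1) | set(r2)
    p_e = sum((c1[k] / n) * (c2[k] / n) for k in categories)  # chance agreement
    if p_e == 1.0:
        return 0.0  # degenerate case: no variation in either rater's scores
    return (p_o - p_e) / (1 - p_e)

# Hypothetical ratings of 10 articles on a dichotomous item ('y'/'n')
rater1 = ['y', 'y', 'y', 'y', 'y', 'y', 'n', 'n', 'n', 'n']
rater2 = ['y', 'y', 'y', 'y', 'y', 'n', 'n', 'n', 'n', 'y']
print(percent_agreement(rater1, rater2))  # 0.8
print(cohens_kappa(rater1, rater2))       # ~0.58
```

The gap between the two numbers illustrates why the table reports both: with skewed marginals (footnote d), high raw agreement can coexist with a much lower chance-corrected kappa.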