Table 2 Inter-rater agreement (percentage agreement) and reliability (kappa coefficients) of the items from the COSMIN checklist (COSMIN step 3)

From: Inter-rater agreement and reliability of the COSMIN (COnsensus-based Standards for the selection of health status Measurement Instruments) Checklist

| Item nr | Item | N (minus articles with 1 rating)a | % agreement | N | Kappa |
| --- | --- | --- | --- | --- | --- |
| **Box A. Internal consistency (n = 195)b** | | | | | |
| A1 | Does the scale consist of effect indicators, i.e. is it based on a reflective model? | 185 | 82 | 193 | 0.06 |
| *Design requirements* | | | | | |
| A2c | Was the percentage of missing items given? | 183 | 87 | 190 | 0.48 |
| A3c | Was there a description of how missing items were handled? | 180 | 90 | 187 | 0.54 |
| A4 | Was the sample size included in the internal consistency analysis adequate? | 177 | 87 | 185 | 0.06d |
| A5c | Was the unidimensionality of the scale checked? i.e. was factor analysis or IRT model applied? | 180 | 92 | 187 | 0.69 |
| A6 | Was the sample size included in the unidimensionality analysis adequate? | 166 | 79 | 178 | 0.27 |
| A7 | Was an internal consistency statistic calculated for each (unidimensional) (sub)scale separately? | 179 | 85 | 187 | 0.31d |
| A8c | Were there any important flaws in the design or methods of the study? | 174 | 86 | 179 | 0.22d |
| *Statistical methods* | | | | | |
| A9 | for Classical Test Theory (CTT): Was Cronbach's alpha calculated? | 179 | 93 | 187 | 0.27d,e |
| A10 | for dichotomous scores: Was Cronbach's alpha or KR-20 calculated? | 151 | 91 | 165 | 0.17d,e |
| A11 | for IRT: Was a goodness of fit statistic at a global level calculated? e.g. χ², reliability coefficient of estimated latent trait value (index of (subject or item) separation) | 154 | 93 | 167 | 0.46d,e |
| **Box B. Reliability (n = 141)b** | | | | | |
| *Design requirements* | | | | | |
| B1c | Was the percentage of missing items given? | 129 | 87 | 140 | 0.39 |
| B2c | Was there a description of how missing items were handled? | 125 | 91 | 137 | 0.43d |
| B3 | Was the sample size included in the analysis adequate? | 127 | 77 | 139 | 0.35 |
| B4c | Were at least two measurements available? | 129 | 98 | 140 | 0.72d |
| B5 | Were the administrations independent? | 129 | 73 | 139 | 0.18 |
| B6c | Was the time interval stated? | 125 | 94 | 136 | 0.50d |
| B7 | Were patients stable in the interim period on the construct to be measured? | 126 | 75 | 138 | 0.24 |
| B8 | Was the time interval appropriate? | 125 | 84 | 137 | 0.45 |
| B9 | Were the test conditions similar for both measurements? e.g. type of administration, environment, instructions | 127 | 83 | 138 | 0.30 |
| B10c | Were there any important flaws in the design or methods of the study? | 117 | 77 | 129 | 0.08 |
| *Statistical methods* | | | | | |
| B11 | for continuous scores: Was an intraclass correlation coefficient (ICC) calculated? | 119 | 86 | 133 | 0.59e |
| B12 | for dichotomous/nominal/ordinal scores: Was kappa calculated? | 111 | 81 | 127 | 0.32e |
| B13 | for ordinal scores: Was a weighted kappa calculated? | 111 | 83 | 127 | 0.42e |
| B14 | for ordinal scores: Was the weighting scheme described? e.g. linear, quadratic | 108 | 81 | 124 | 0.35e |
| **Box D. Content validity (n = 83)b** | | | | | |
| *Design requirements* | | | | | |
| D1 | Was there an assessment of whether all items refer to relevant aspects of the construct to be measured? | 62 | 79 | 83 | 0.33 |
| D2 | Was there an assessment of whether all items are relevant for the study population? (e.g. age, gender, disease characteristics, country, setting) | 62 | 76 | 83 | 0.46 |
| D3 | Was there an assessment of whether all items are relevant for the purpose of the measurement instrument? (discriminative, evaluative, and/or predictive) | 62 | 66 | 83 | 0.21 |
| D4 | Was there an assessment of whether all items together comprehensively reflect the construct to be measured? | 62 | 66 | 83 | 0.15 |
| D5c | Were there any important flaws in the design or methods of the study? | 58 | 76 | 78 | 0.13 |
| **Box E. Structural validity (n = 118)b** | | | | | |
| E1 | Does the scale consist of effect indicators, i.e. is it based on a reflective model? | 99 | 78 | 116 | 0f |
| *Design requirements* | | | | | |
| E2c | Was the percentage of missing items given? | 95 | 87 | 110 | 0.41 |
| E3c | Was there a description of how missing items were handled? | 93 | 91 | 109 | 0.55 |
| E4 | Was the sample size included in the analysis adequate? | 94 | 87 | 109 | 0.56d |
| E5c | Were there any important flaws in the design or methods of the study? | 89 | 84 | 103 | 0.27 |
| *Statistical methods* | | | | | |
| E6 | for CTT: Was exploratory or confirmatory factor analysis performed? | 92 | 90 | 106 | 0.51d,e |
| E7 | for IRT: Were IRT tests for determining the (uni-)dimensionality of the items performed? | 62 | 87 | 80 | 0.39e,f |
| **Box F. Hypotheses testing (n = 170)b** | | | | | |
| *Design requirements* | | | | | |
| F1c | Was the percentage of missing items given? | 158 | 87 | 168 | 0.41 |
| F2c | Was there a description of how missing items were handled? | 159 | 92 | 169 | 0.60d |
| F3 | Was the sample size included in the analysis adequate? | 157 | 84 | 167 | 0.12d |
| F4 | Were hypotheses regarding correlations or mean differences formulated a priori (i.e. before data collection)? | 158 | 74 | 168 | 0.42 |
| F5 | Was the expected direction of correlations or mean differences included in the hypotheses? | 159 | 75 | 169 | 0.26e |
| F6 | Was the expected absolute or relative magnitude of correlations or mean differences included in the hypotheses? | 159 | 82 | 168 | 0.29e |
| F7c | for convergent validity: Was an adequate description provided of the comparator instrument(s)? | 125 | 83 | 136 | 0.30 |
| F8c | for convergent validity: Were the measurement properties of the comparator instrument(s) adequately described? | 124 | 81 | 135 | 0.35 |
| F9c | Were there any important flaws in the design or methods of the study? | 131 | 81 | 145 | 0.17 |
| *Statistical methods* | | | | | |
| F10 | Were design and statistical methods adequate for the hypotheses to be tested? | 150 | 78 | 161 | 0.00d,e,f |
| **Box G. Cross-cultural validity (n = 33)b** | | | | | |
| *Design requirements* | | | | | |
| G1c | Was the percentage of missing items given? | 25 | 88 | 32 | 0.52 |
| G2c | Was there a description of how missing items were handled? | 22 | 82 | 30 | 0.32 |
| G3 | Was the sample size included in the analysis adequate? | 26 | 81 | 33 | 0.23 |
| G4c | Were both the original language in which the HR-PRO instrument was developed, and the language in which the HR-PRO instrument was translated described? | 28 | 89 | 33 | 0.34d |
| G5c | Was the expertise of the people involved in the translation process adequately described? e.g. expertise in the disease(s) involved, expertise in the construct to be measured, expertise in both languages | 28 | 86 | 33 | 0.46 |
| G6 | Did the translators work independently from each other? | 28 | 89 | 33 | 0.61 |
| G7 | Were items translated forward and backward? | 28 | 100 | 33 | 1.00 |
| G8c | Was there an adequate description of how differences between the original and translated versions were resolved? | 28 | 86 | 33 | 0.50 |
| G9c | Was the translation reviewed by a committee (e.g. original developers)? | 25 | 88 | 31 | 0.56 |
| G10c | Was the HR-PRO instrument pre-tested (e.g. cognitive interviews) to check interpretation, cultural relevance of the translation, and ease of comprehension? | 21 | 90 | 29 | 0.61 |
| G11c | Was the sample used in the pre-test adequately described? | 28 | 79 | 32 | 0f |
| G12 | Were the samples similar for all characteristics except language and/or cultural background? | 26 | 81 | 31 | 0.41 |
| G13c | Were there any important flaws in the design or methods of the study? | 26 | 85 | 31 | 0.42 |
| *Statistical methods* | | | | | |
| G14 | for CTT: Was confirmatory factor analysis performed? | 27 | 74 | 32 | 0.03e,f |
| G15 | for IRT: Was differential item functioning (DIF) between language groups assessed? | 13 | 77 | 23 | 0.28e,f |
| **Box H. Criterion validity (n = 57)b** | | | | | |
| *Design requirements* | | | | | |
| H1c | Was the percentage of missing items given? | 35 | 91 | 56 | 0.59d |
| H2c | Was there a description of how missing items were handled? | 35 | 97 | 56 | 0.79d |
| H3 | Was the sample size included in the analysis adequate? | 35 | 69 | 54 | 0.06 |
| H4 | Can the criterion used or employed be considered as a reasonable 'gold standard'? | 37 | 62 | 57 | 0f |
| H5c | Were there any important flaws in the design or methods of the study? | 33 | 79 | 54 | 0.10 |
| *Statistical methods* | | | | | |
| H6 | for continuous scores: Were correlations, or the area under the receiver operating curve calculated? | 37 | 78 | 56 | 0.16e |
| H7 | for dichotomous scores: Were sensitivity and specificity determined? | 29 | 83 | 47 | 0.28e,f |
| **Box I. Responsiveness (n = 79)b** | | | | | |
| *Design requirements* | | | | | |
| I1c | Was the percentage of missing items given? | 71 | 82 | 76 | 0.14d |
| I2c | Was there a description of how missing items were handled? | 73 | 92 | 77 | 0.36d |
| I3 | Was the sample size included in the analysis adequate? | 72 | 72 | 76 | 0.40 |
| I4c | Was a longitudinal design with at least two measurements used? | 73 | 100 | 78 | 1.00d |
| I5c | Was the time interval stated? | 73 | 89 | 78 | 0.25d |
| I6c | If anything occurred in the interim period (e.g. intervention, other relevant events), was it adequately described? | 72 | 78 | 75 | 0.17 |
| I7c | Was a proportion of the patients changed (i.e. improvement or deterioration)? | 70 | 97 | 73 | 0.32d |
| *Design requirements for hypotheses testing (for constructs for which a gold standard was not available)* | | | | | |
| I8 | Were hypotheses about changes in scores formulated a priori (i.e. before data collection)? | 65 | 69 | 72 | 0.35 |
| I9 | Was the expected direction of correlations or mean differences of the change scores of HR-PRO instruments included in these hypotheses? | 60 | 78 | 65 | 0.19e |
| I10 | Were the expected absolute or relative magnitude of correlations or mean differences of the change scores of HR-PRO instruments included in these hypotheses? | 61 | 90 | 66 | 0.05d,e |
| I11c | Was an adequate description provided of the comparator instrument(s)? | 56 | 70 | 63 | 0f |
| I12c | Were the measurement properties of the comparator instrument(s) adequately described? | 56 | 80 | 63 | 0.06 |
| I13c | Were there any important flaws in the design or methods of the study? | 63 | 71 | 68 | 0.03 |
| *Statistical methods* | | | | | |
| I14 | Were design and statistical methods adequate for the hypotheses to be tested? | 63 | 73 | 67 | 0.21e,f |
| *Design requirements for comparison to a gold standard (for constructs for which a gold standard was available)* | | | | | |
| I15 | Can the criterion for change be considered as a reasonable 'gold standard'? | 21 | 67 | 28 | 0f |
| I16c | Were there any important flaws in the design or methods of the study? | 12 | 67 | 21 | 0f |
| *Statistical methods* | | | | | |
| I17 | for continuous scores: Were correlations between change scores, or the area under the receiver operating characteristic (ROC) curve calculated? | 28 | 79 | 39 | 0.47e,f |
| I18 | for dichotomous scales: Were sensitivity and specificity (changed versus not changed) determined? | 28 | 79 | 37 | 0.15e |
| **Box J. Interpretability (n = 42)b** | | | | | |
| J1c | Was the percentage of missing items given? | 22 | 95 | 41 | 0.80 |
| J2c | Was there a description of how missing items were handled? | 21 | 76 | 41 | 0.19 |
| J3 | Was the sample size included in the analysis adequate? | 23 | 74 | 41 | 0f |
| J4c | Was the distribution of the (total) scores in the study sample described? | 23 | 74 | 41 | 0.08 |
| J5c | Was the percentage of the respondents who had the lowest possible (total) score described? | 20 | 95 | 40 | 0.84 |
| J6c | Was the percentage of the respondents who had the highest possible (total) score described? | 21 | 90 | 41 | 0.70 |
| J7c | Were scores and change scores (i.e. means and SD) presented for relevant (sub)groups? e.g. for normative groups, subgroups of patients, or the general population | 21 | 76 | 41 | 0.05 |
| J8c | Was the minimal important change (MIC) or the minimal important difference (MID) determined? | 19 | 89 | 40 | 0.26d |
| J9c | Were there any important flaws in the design or methods of the study? | 21 | 71 | 41 | 0f |

a. When calculating percentage agreement, articles that were scored only once on the particular item were not taken into account.
b. Number of times a box was evaluated.
c. Dichotomous item.
d. Item with low dispersal, i.e. more than 75% of the raters who responded to the item rated the same response category.
e. Combined kappa coefficient calculated because of a nominal response scale in a one-way design.
f. Negative variance component in the calculation of kappa was set at 0.
g. Sample sizes of the Generalisability box are much higher than those of the other items, because scores of the items on the Generalisability box were combined across all measurement properties.

Values printed in bold indicate kappa > 0.70 or % agreement > 80%.
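For readers who want to reproduce the two statistics reported in this table, the following is a minimal illustrative sketch of percentage agreement and Cohen's kappa for a pair of raters scoring one item. It is not the exact procedure used in the study (footnote e describes a combined kappa for a one-way design, which this simple two-rater form does not implement); the function names and example data are hypothetical.

```python
from collections import Counter

def percent_agreement(r1, r2):
    """Fraction of articles on which the two raters gave the same rating."""
    assert len(r1) == len(r2) and len(r1) > 0
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

def cohens_kappa(r1, r2):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(r1)
    p_o = percent_agreement(r1, r2)                  # observed agreement
    c1, c2 = Counter(r1), Counter(r2)                # marginal counts per rater
    categories = set(r1) | set(r2)
    p_e = sum((c1[k] / n) * (c2[k] / n) for k in categories)  # chance agreement
    if p_e == 1.0:
        return 0.0  # degenerate case: no variation in either rater's scores
    return (p_o - p_e) / (1 - p_e)

# Hypothetical ratings of 10 articles on a dichotomous item ('y'/'n')
rater1 = ['y', 'y', 'y', 'y', 'y', 'y', 'n', 'n', 'n', 'n']
rater2 = ['y', 'y', 'y', 'y', 'y', 'n', 'n', 'n', 'n', 'y']
print(percent_agreement(rater1, rater2))  # 0.8
print(cohens_kappa(rater1, rater2))       # ~0.58
```

The gap between the two numbers illustrates why the table reports both: with skewed marginals (footnote d), high raw agreement can coexist with a much lower chance-corrected kappa.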