Item nr | Item | N (excluding articles with only 1 rating)<sup>a</sup> | % agreement | N | Kappa |
---|---|---|---|---|---|
**Box A. Internal consistency (n = 195)<sup>b</sup>** | | | | | |
A1 | Does the scale consist of effect indicators, i.e. is it based on a reflective model? | 185 | 82 | 193 | 0.06 |
*Design requirements* | | | | | |
A2<sup>c</sup> | Was the percentage of missing items given? | 183 | 87 | 190 | 0.48 |
A3<sup>c</sup> | Was there a description of how missing items were handled? | 180 | 90 | 187 | 0.54 |
A4 | Was the sample size included in the internal consistency analysis adequate? | 177 | 87 | 185 | 0.06<sup>d</sup> |
A5<sup>c</sup> | Was the unidimensionality of the scale checked, i.e. was a factor analysis or IRT model applied? | 180 | 92 | 187 | 0.69 |
A6 | Was the sample size included in the unidimensionality analysis adequate? | 166 | 79 | 178 | 0.27 |
A7 | Was an internal consistency statistic calculated for each (unidimensional) (sub)scale separately? | 179 | 85 | 187 | 0.31<sup>d</sup> |
A8<sup>c</sup> | Were there any important flaws in the design or methods of the study? | 174 | 86 | 179 | 0.22<sup>d</sup> |
*Statistical methods* | | | | | |
A9 | for Classical Test Theory (CTT): Was Cronbach's alpha calculated? | 179 | 93 | 187 | 0.27<sup>d,e</sup> |
A10 | for dichotomous scores: Was Cronbach's alpha or KR-20 calculated? | 151 | 91 | 165 | 0.17<sup>d,e</sup> |
A11 | for IRT: Was a goodness of fit statistic at a global level calculated? e.g. χ², reliability coefficient of estimated latent trait value (index of (subject or item) separation) | 154 | 93 | 167 | 0.46<sup>d,e</sup> |
**Box B. Reliability (n = 141)<sup>b</sup>** | | | | | |
*Design requirements* | | | | | |
B1<sup>c</sup> | Was the percentage of missing items given? | 129 | 87 | 140 | 0.39 |
B2<sup>c</sup> | Was there a description of how missing items were handled? | 125 | 91 | 137 | 0.43<sup>d</sup> |
B3 | Was the sample size included in the analysis adequate? | 127 | 77 | 139 | 0.35 |
B4<sup>c</sup> | Were at least two measurements available? | 129 | 98 | 140 | 0.72<sup>d</sup> |
B5 | Were the administrations independent? | 129 | 73 | 139 | 0.18 |
B6<sup>c</sup> | Was the time interval stated? | 125 | 94 | 136 | 0.50<sup>d</sup> |
B7 | Were patients stable in the interim period on the construct to be measured? | 126 | 75 | 138 | 0.24 |
B8 | Was the time interval appropriate? | 125 | 84 | 137 | 0.45 |
B9 | Were the test conditions similar for both measurements? e.g. type of administration, environment, instructions | 127 | 83 | 138 | 0.30 |
B10<sup>c</sup> | Were there any important flaws in the design or methods of the study? | 117 | 77 | 129 | 0.08 |
*Statistical methods* | | | | | |
B11 | for continuous scores: Was an intraclass correlation coefficient (ICC) calculated? | 119 | 86 | 133 | 0.59<sup>e</sup> |
B12 | for dichotomous/nominal/ordinal scores: Was kappa calculated? | 111 | 81 | 127 | 0.32<sup>e</sup> |
B13 | for ordinal scores: Was a weighted kappa calculated? | 111 | 83 | 127 | 0.42<sup>e</sup> |
B14 | for ordinal scores: Was the weighting scheme described? e.g. linear, quadratic | 108 | 81 | 124 | 0.35<sup>e</sup> |
**Box D. Content validity (n = 83)<sup>b</sup>** | | | | | |
*Design requirements* | | | | | |
D1 | Was there an assessment of whether all items refer to relevant aspects of the construct to be measured? | 62 | 79 | 83 | 0.33 |
D2 | Was there an assessment of whether all items are relevant for the study population? (e.g. age, gender, disease characteristics, country, setting) | 62 | 76 | 83 | 0.46 |
D3 | Was there an assessment of whether all items are relevant for the purpose of the measurement instrument? (discriminative, evaluative, and/or predictive) | 62 | 66 | 83 | 0.21 |
D4 | Was there an assessment of whether all items together comprehensively reflect the construct to be measured? | 62 | 66 | 83 | 0.15 |
D5<sup>c</sup> | Were there any important flaws in the design or methods of the study? | 58 | 76 | 78 | 0.13 |
**Box E. Structural validity (n = 118)<sup>b</sup>** | | | | | |
E1 | Does the scale consist of effect indicators, i.e. is it based on a reflective model? | 99 | 78 | 116 | 0<sup>f</sup> |
*Design requirements* | | | | | |
E2<sup>c</sup> | Was the percentage of missing items given? | 95 | 87 | 110 | 0.41 |
E3<sup>c</sup> | Was there a description of how missing items were handled? | 93 | 91 | 109 | 0.55 |
E4 | Was the sample size included in the analysis adequate? | 94 | 87 | 109 | 0.56<sup>d</sup> |
E5<sup>c</sup> | Were there any important flaws in the design or methods of the study? | 89 | 84 | 103 | 0.27 |
*Statistical methods* | | | | | |
E6 | for CTT: Was exploratory or confirmatory factor analysis performed? | 92 | 90 | 106 | 0.51<sup>d,e</sup> |
E7 | for IRT: Were IRT tests for determining the (uni)dimensionality of the items performed? | 62 | 87 | 80 | 0.39<sup>e,f</sup> |
**Box F. Hypotheses testing (n = 170)<sup>b</sup>** | | | | | |
*Design requirements* | | | | | |
F1<sup>c</sup> | Was the percentage of missing items given? | 158 | 87 | 168 | 0.41 |
F2<sup>c</sup> | Was there a description of how missing items were handled? | 159 | 92 | 169 | 0.60<sup>d</sup> |
F3 | Was the sample size included in the analysis adequate? | 157 | 84 | 167 | 0.12<sup>d</sup> |
F4 | Were hypotheses regarding correlations or mean differences formulated a priori (i.e. before data collection)? | 158 | 74 | 168 | 0.42 |
F5 | Was the expected direction of correlations or mean differences included in the hypotheses? | 159 | 75 | 169 | 0.26<sup>e</sup> |
F6 | Was the expected absolute or relative magnitude of correlations or mean differences included in the hypotheses? | 159 | 82 | 168 | 0.29<sup>e</sup> |
F7<sup>c</sup> | for convergent validity: Was an adequate description provided of the comparator instrument(s)? | 125 | 83 | 136 | 0.30 |
F8<sup>c</sup> | for convergent validity: Were the measurement properties of the comparator instrument(s) adequately described? | 124 | 81 | 135 | 0.35 |
F9<sup>c</sup> | Were there any important flaws in the design or methods of the study? | 131 | 81 | 145 | 0.17 |
*Statistical methods* | | | | | |
F10 | Were design and statistical methods adequate for the hypotheses to be tested? | 150 | 78 | 161 | 0.00<sup>d,e,f</sup> |
**Box G. Cross-cultural validity (n = 33)<sup>b</sup>** | | | | | |
*Design requirements* | | | | | |
G1<sup>c</sup> | Was the percentage of missing items given? | 25 | 88 | 32 | 0.52 |
G2<sup>c</sup> | Was there a description of how missing items were handled? | 22 | 82 | 30 | 0.32 |
G3 | Was the sample size included in the analysis adequate? | 26 | 81 | 33 | 0.23 |
G4<sup>c</sup> | Were both the original language in which the HR-PRO instrument was developed, and the language into which the HR-PRO instrument was translated described? | 28 | 89 | 33 | 0.34<sup>d</sup> |
G5<sup>c</sup> | Was the expertise of the people involved in the translation process adequately described? e.g. expertise in the disease(s) involved, expertise in the construct to be measured, expertise in both languages | 28 | 86 | 33 | 0.46 |
G6 | Did the translators work independently from each other? | 28 | 89 | 33 | 0.61 |
G7 | Were items translated forward and backward? | 28 | 100 | 33 | 1.00 |
G8<sup>c</sup> | Was there an adequate description of how differences between the original and translated versions were resolved? | 28 | 86 | 33 | 0.50 |
G9<sup>c</sup> | Was the translation reviewed by a committee (e.g. original developers)? | 25 | 88 | 31 | 0.56 |
G10<sup>c</sup> | Was the HR-PRO instrument pre-tested (e.g. cognitive interviews) to check interpretation, cultural relevance of the translation, and ease of comprehension? | 21 | 90 | 29 | 0.61 |
G11<sup>c</sup> | Was the sample used in the pre-test adequately described? | 28 | 79 | 32 | 0<sup>f</sup> |
G12 | Were the samples similar for all characteristics except language and/or cultural background? | 26 | 81 | 31 | 0.41 |
G13<sup>c</sup> | Were there any important flaws in the design or methods of the study? | 26 | 85 | 31 | 0.42 |
*Statistical methods* | | | | | |
G14 | for CTT: Was confirmatory factor analysis performed? | 27 | 74 | 32 | 0.03<sup>e,f</sup> |
G15 | for IRT: Was differential item functioning (DIF) between language groups assessed? | 13 | 77 | 23 | 0.28<sup>e,f</sup> |
**Box H. Criterion validity (n = 57)<sup>b</sup>** | | | | | |
*Design requirements* | | | | | |
H1<sup>c</sup> | Was the percentage of missing items given? | 35 | 91 | 56 | 0.59<sup>d</sup> |
H2<sup>c</sup> | Was there a description of how missing items were handled? | 35 | 97 | 56 | 0.79<sup>d</sup> |
H3 | Was the sample size included in the analysis adequate? | 35 | 69 | 54 | 0.06 |
H4 | Can the criterion used or employed be considered as a reasonable 'gold standard'? | 37 | 62 | 57 | 0<sup>f</sup> |
H5<sup>c</sup> | Were there any important flaws in the design or methods of the study? | 33 | 79 | 54 | 0.10 |
*Statistical methods* | | | | | |
H6 | for continuous scores: Were correlations, or the area under the receiver operating characteristic curve calculated? | 37 | 78 | 56 | 0.16<sup>e</sup> |
H7 | for dichotomous scores: Were sensitivity and specificity determined? | 29 | 83 | 47 | 0.28<sup>e,f</sup> |
**Box I. Responsiveness (n = 79)<sup>b</sup>** | | | | | |
*Design requirements* | | | | | |
I1<sup>c</sup> | Was the percentage of missing items given? | 71 | 82 | 76 | 0.14<sup>d</sup> |
I2<sup>c</sup> | Was there a description of how missing items were handled? | 73 | 92 | 77 | 0.36<sup>d</sup> |
I3 | Was the sample size included in the analysis adequate? | 72 | 72 | 76 | 0.40 |
I4<sup>c</sup> | Was a longitudinal design with at least two measurements used? | 73 | 100 | 78 | 1.00<sup>d</sup> |
I5<sup>c</sup> | Was the time interval stated? | 73 | 89 | 78 | 0.25<sup>d</sup> |
I6<sup>c</sup> | If anything occurred in the interim period (e.g. intervention, other relevant events), was it adequately described? | 72 | 78 | 75 | 0.17 |
I7<sup>c</sup> | Did a proportion of the patients change (i.e. improvement or deterioration)? | 70 | 97 | 73 | 0.32<sup>d</sup> |
*Design requirements for hypotheses testing* | | | | | |
*For constructs for which a gold standard was not available* | | | | | |
I8 | Were hypotheses about changes in scores formulated a priori (i.e. before data collection)? | 65 | 69 | 72 | 0.35 |
I9 | Was the expected direction of correlations or mean differences of the change scores of HR-PRO instruments included in these hypotheses? | 60 | 78 | 65 | 0.19<sup>e</sup> |
I10 | Was the expected absolute or relative magnitude of correlations or mean differences of the change scores of HR-PRO instruments included in these hypotheses? | 61 | 90 | 66 | 0.05<sup>d,e</sup> |
I11<sup>c</sup> | Was an adequate description provided of the comparator instrument(s)? | 56 | 70 | 63 | 0<sup>f</sup> |
I12<sup>c</sup> | Were the measurement properties of the comparator instrument(s) adequately described? | 56 | 80 | 63 | 0.06 |
I13<sup>c</sup> | Were there any important flaws in the design or methods of the study? | 63 | 71 | 68 | 0.03 |
*Statistical methods* | | | | | |
I14 | Were design and statistical methods adequate for the hypotheses to be tested? | 63 | 73 | 67 | 0.21<sup>e,f</sup> |
*Design requirements for comparison to a gold standard* | | | | | |
*For constructs for which a gold standard was available* | | | | | |
I15 | Can the criterion for change be considered as a reasonable 'gold standard'? | 21 | 67 | 28 | 0<sup>f</sup> |
I16<sup>c</sup> | Were there any important flaws in the design or methods of the study? | 12 | 67 | 21 | 0<sup>f</sup> |
*Statistical methods* | | | | | |
I17 | for continuous scores: Were correlations between change scores, or the area under the receiver operating characteristic (ROC) curve calculated? | 28 | 79 | 39 | 0.47<sup>e,f</sup> |
I18 | for dichotomous scales: Were sensitivity and specificity (changed versus not changed) determined? | 28 | 79 | 37 | 0.15<sup>e</sup> |
**Box J. Interpretability (n = 42)<sup>b</sup>** | | | | | |
J1<sup>c</sup> | Was the percentage of missing items given? | 22 | 95 | 41 | 0.80 |
J2<sup>c</sup> | Was there a description of how missing items were handled? | 21 | 76 | 41 | 0.19 |
J3 | Was the sample size included in the analysis adequate? | 23 | 74 | 41 | 0<sup>f</sup> |
J4<sup>c</sup> | Was the distribution of the (total) scores in the study sample described? | 23 | 74 | 41 | 0.08 |
J5<sup>c</sup> | Was the percentage of the respondents who had the lowest possible (total) score described? | 20 | 95 | 40 | 0.84 |
J6<sup>c</sup> | Was the percentage of the respondents who had the highest possible (total) score described? | 21 | 90 | 41 | 0.70 |
J7<sup>c</sup> | Were scores and change scores (i.e. means and SD) presented for relevant (sub)groups? e.g. for normative groups, subgroups of patients, or the general population | 21 | 76 | 41 | 0.05 |
J8<sup>c</sup> | Was the minimal important change (MIC) or the minimal important difference (MID) determined? | 19 | 89 | 40 | 0.26<sup>d</sup> |
J9<sup>c</sup> | Were there any important flaws in the design or methods of the study? | 21 | 71 | 41 | 0<sup>f</sup> |
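Several items pair high percentage agreement with a kappa near zero (for example, A1: 82% agreement, kappa 0.06). Kappa corrects observed agreement for the agreement expected by chance, and chance agreement is high when nearly all ratings fall into one category. The table does not state which kappa statistic was used, so the Python sketch below assumes Cohen's kappa for two raters. The function names `percent_agreement` and `cohens_kappa` are hypothetical, and the ratings are invented for illustration; they are not data from this study.

```python
from collections import Counter
from typing import Hashable, Sequence


def percent_agreement(r1: Sequence[Hashable], r2: Sequence[Hashable]) -> float:
    """Percentage of rated articles on which two raters gave the same rating."""
    if len(r1) != len(r2) or not r1:
        raise ValueError("ratings must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(r1, r2))
    return 100 * matches / len(r1)


def cohens_kappa(r1: Sequence[Hashable], r2: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters: (p_o - p_e) / (1 - p_e)."""
    if len(r1) != len(r2) or not r1:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(r1)
    # Observed proportion of agreement.
    p_o = sum(a == b for a, b in zip(r1, r2)) / n
    # Chance-expected agreement from each rater's marginal category counts.
    c1, c2 = Counter(r1), Counter(r2)
    p_e = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / (n * n)
    if p_e == 1.0:
        # Both raters used one and the same category throughout: kappa is undefined.
        return float("nan")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical ratings of 100 articles on one yes/no item (not study data):
# 80 yes/yes, 9 yes/no, 9 no/yes, 2 no/no.
rater1 = ["yes"] * 89 + ["no"] * 11
rater2 = ["yes"] * 80 + ["no"] * 9 + ["yes"] * 9 + ["no"] * 2

if __name__ == "__main__":
    print(f"agreement = {percent_agreement(rater1, rater2):.0f}%")
    print(f"kappa     = {cohens_kappa(rater1, rater2):.2f}")
```

Run as a script, this prints an agreement of 82% and a kappa of 0.08. Because both hypothetical raters answer "yes" for 89 of 100 articles, the chance-expected agreement is already 0.89² + 0.11² ≈ 80%, which leaves little room for agreement beyond chance.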