BMC Medical Research Methodology

Background: Little work has been done to investigate the suggestion that the use of selected scales from a multi-scale health-status questionnaire would compromise reliability and validity. The aim of this study was to compare the performance of three scales selected from the SF-36 generic health questionnaire when administered in isolation or within the entire SF-36 to patients with musculoskeletal disorders.


Background
Measures of health status and quality of life are being increasingly used in clinical research. In the evaluation of many conditions, it might be necessary to combine generic and disease-specific questionnaires. Many questionnaires are long and consist of several scales, which might substantially increase responder burden. Generic health-status questionnaires usually consist of separate scales related to physical and mental health. In musculoskeletal conditions, physical health scales are more likely to show differences after treatments and would thus be used in sample size estimations; mental health scales would then lack the power to show differences. It has been suggested that multi-scale health-status questionnaires should be used in their entirety and that the use of selected scales would, by taking them out of their context, compromise their reliability and validity and the possibility to compare scores across studies and with population norms [1]. However, there is little scientific work concerning the influence of excluding some scales in a health-status questionnaire on the performance of the remaining scales. Demonstrating whether the scores yielded when using selected scales are similar to those yielded when the entire questionnaire is administered would be important because similarity of scores would allow comparison with the corresponding scores in studies that used the entire questionnaire and with population norms. This would facilitate the interpretation of scores when selected scales are used.
The SF-36 is a widely used health-status measure that consists of eight scales related to physical and mental health [2][3][4]. Different SF-36 scales have been used selectively in previous studies without prior evidence of reliability and validity [5][6][7]. The purpose of this study was to investigate the performance of three SF-36 scales related to physical health (physical functioning, bodily pain and general health perceptions) when administered selectively or within the entire SF-36 to a patient population with musculoskeletal disorders.

Methods
This 2-part study was conducted on patients with musculoskeletal disorders referred from primary care physicians to the only orthopedic department available in the study region. All referred patients, aged 25 to 74 years, who had a scheduled visit to the orthopedic department during a 6-week period were asked to complete a mailed questionnaire within 4 weeks before their visit and to complete a second questionnaire administered during the visit.

Consecutive administration of selected scales and entire questionnaire
In the first part of the study, one questionnaire comprised three SF-36 scales related to physical health (physical functioning, bodily pain, and general health perceptions) without any modifications in the order or composition of the items. The second questionnaire comprised all eight SF-36 scales with no modifications. During the first half of the study period the first questionnaire comprised the three selected SF-36 scales and the second questionnaire the entire SF-36; in the second half of the study period the two questionnaires were administered in reverse order. On both occasions the questionnaires were self-completed by the patients.

Repeated administration of entire questionnaire
In the second part of the study a formal test-retest reliability assessment of the SF-36 was performed; the entire SF-36 was administered on two occasions in a similar fashion as in the first part of the study.

Item concerning change in health status
In both groups the questionnaire that was administered on the second occasion started with an inquiry about current health status compared to that when the first questionnaire was completed (Question: Compared to when you completed the questionnaire regarding your health about a week ago, how is your health now? Response options: much better, somewhat better, same, somewhat worse, much worse).

Statistical analysis
The reliability (internal consistency) of the SF-36 physical functioning, bodily pain and general health perceptions scales was assessed with the Cronbach alpha coefficient [8]. The item scores for each scale were transformed into scale scores ranging from 0 (worst) to 100 (best) [1]. The mean score and 95% confidence interval (CI) for each of the three scales were calculated. The agreement between the scores for each of the three scales administered as isolated scales and within the entire SF-36 was assessed using the intraclass correlation coefficient (ICC) and the differences were tested with the paired t-test [8]. This analysis included only the patients who reported unchanged health status at the time of completing the second questionnaire. Only questionnaires with complete responses for all items in all of the three scales were included in the analysis. Because the analysis involved assessment of agreement, missing data were not replaced. The same analyses were performed on the data obtained when the entire SF-36 was administered on two occasions. The mean differences between the scores shown when the three scales were selectively administered and those shown when they were administered within the entire SF-36 were compared to the mean differences in the scores shown after repeated administration of the entire SF-36 using the t-test.
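As a rough illustration of the computations described above, the following Python sketch rescales a raw scale score to the 0 to 100 range and computes the Cronbach alpha coefficient and an ICC for two administrations. It is not the authors' analysis code. The function names are hypothetical, the raw-score range passed to the rescaling function is supplied by the caller rather than taken from the official SF-36 scoring rules, and the ICC form shown (two-way, single measurement, absolute agreement) is an assumption, because the text does not state which form was used.

```python
from statistics import mean, variance


def to_0_100(raw, lowest, highest):
    """Linearly rescale a raw scale score so that lowest maps to 0 and highest to 100."""
    return (raw - lowest) / (highest - lowest) * 100.0


def cronbach_alpha(items):
    """Cronbach alpha; items[i][j] is the score of respondent i on item j."""
    k = len(items[0])
    # Sample variance of each item across respondents.
    item_vars = [variance([row[j] for row in items]) for j in range(k)]
    # Sample variance of the summed scale score.
    total_var = variance([sum(row) for row in items])
    return k / (k - 1) * (1.0 - sum(item_vars) / total_var)


def icc_agreement(first, second):
    """Two-way, single-measurement, absolute-agreement ICC (assumed form)
    for scores of the same patients on two occasions."""
    first, second = list(first), list(second)
    n, k = len(first), 2
    rows = list(zip(first, second))
    grand = mean(first + second)
    row_means = [mean(r) for r in rows]
    col_means = [mean(first), mean(second)]
    # Two-way ANOVA sums of squares: patients (rows), occasions (columns), residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for r in rows for x in r)
    ss_err = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )
```

With this form of the ICC, a systematic shift between the two occasions lowers the coefficient even when patients keep the same rank order, which is why it suits an assessment of agreement rather than of correlation alone.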

Results
Consecutive administration of selected scales and entire questionnaire
During the 6-week study period, 137 consecutive referred patients attended the orthopedic department for a scheduled visit. Of these, 11 completed only one of the questionnaires, and 23 reported changed health status since completing the first questionnaire. The remaining 103 patients completed both questionnaires and reported unchanged health status. For 23 (22%) of these patients scores could not be computed for at least one scale because of missing item responses. Thus, 80 patients had scores for all three scales for both occasions. The mean age of these 80 patients was 50 (SD, 11) years and 41 (51%) were women. The mean time interval between the responses to the two questionnaires was 14 (SD, 3) days.
The Cronbach alpha reliability coefficient exceeded 0.8 for all three scales (Table 1). The ICC was good for all three scales and the mean difference between the scores was 0.4 point for the physical functioning scale, 2.5 points for the bodily pain scale, and 0.5 point for the general health perceptions scale, indicating good agreement between the scores when the three scales were administered with and without the remaining SF-36 scales.

Repeated administration of entire questionnaire
In the second part of the study, 107 consecutive referred patients attended their scheduled visit during a 6-week period. Of these, 18 completed only one of the questionnaires, and 15 reported changed health status since completing the first questionnaire. The remaining 74 patients completed both questionnaires and reported unchanged health status. For 12 (16%) of these patients scores could not be computed for at least one of the three scales studied because of missing item responses. Thus, 62 patients had scores for all three scales for both occasions. The mean age of these 62 patients was 51 (SD, 11) years and 34 (55%) were women. The mean time interval between the responses to the two questionnaires was 13 (SD, 5) days.
The Cronbach alpha reliability coefficient exceeded 0.7 for all three scales (Table 2). The ICC was good for all three scales and the mean difference between the scores was 0.1 point for the physical functioning scale, 1.9 points for the bodily pain scale, and 1 point for the general health perceptions scale, indicating good test-retest reliability.
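The sample attrition reported for the two parts of the study can be retraced with a few lines of Python. This is only an arithmetic check of the counts given in the text; the function name is hypothetical.

```python
def analyzable_sample(attended, incomplete, changed_health, missing_items):
    """Return (eligible, analyzed, percent of eligible lost to missing items)."""
    # Eligible: completed both questionnaires and reported unchanged health.
    eligible = attended - incomplete - changed_health
    # Analyzed: eligible patients with computable scores for all three scales.
    analyzed = eligible - missing_items
    pct_missing = round(100 * missing_items / eligible)
    return eligible, analyzed, pct_missing


# Counts taken from the text for each part of the study.
part1 = analyzable_sample(137, 11, 23, 23)
part2 = analyzable_sample(107, 18, 15, 12)
print(part1)  # (103, 80, 22)
print(part2)  # (74, 62, 16)
```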

Comparison of score differences
For all three scales, the mean differences between the scores obtained when the three scales were selectively administered and those obtained when they were administered within the entire SF-36 did not differ significantly from the mean differences shown after repeated administration of the entire SF-36. The mean difference (95% CI) for the physical functioning scale was 0.3 (-2.8 to 3.4), for the bodily pain scale 0.6 (-3.5 to 4.7), and for the general health perceptions scale -0.5 (-4.4 to 3.4).
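The comparison above can be sketched in Python as a two-sample t statistic computed on the within-patient score differences from the two parts of the study. The text does not say whether a pooled-variance or an unequal-variance t-test was used, so the pooled form below is an assumption, and the function name is hypothetical. The final lines check that the reported point estimates equal the differences between the mean differences given for the two parts.

```python
from math import sqrt
from statistics import mean, variance


def compare_mean_differences(diffs_a, diffs_b):
    """Pooled-variance two-sample t statistic (assumed form) for two independent
    sets of within-patient score differences. Returns (delta, t, df)."""
    na, nb = len(diffs_a), len(diffs_b)
    delta = mean(diffs_a) - mean(diffs_b)
    pooled_var = (
        (na - 1) * variance(diffs_a) + (nb - 1) * variance(diffs_b)
    ) / (na + nb - 2)
    se = sqrt(pooled_var * (1 / na + 1 / nb))
    return delta, delta / se, na + nb - 2


# Mean differences reported in the text: (selected vs entire, entire vs entire).
reported = {
    "physical functioning": (0.4, 0.1),
    "bodily pain": (2.5, 1.9),
    "general health perceptions": (0.5, 1.0),
}
between_parts = {scale: round(a - b, 1) for scale, (a, b) in reported.items()}
print(between_parts)
# {'physical functioning': 0.3, 'bodily pain': 0.6, 'general health perceptions': -0.5}
```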

Discussion
This study showed that the physical functioning and general health perceptions scales gave similar scores when administered independently or within the entire SF-36. Although the bodily pain scale showed a difference of 2.5 points, this occurred in a patient population with musculoskeletal disorders causing pain, the severity of which was rated on two occasions. A difference of approximately 2 points also was found when the entire SF-36 was administered on two occasions. Although no test-retest reliability data have been presented for most of the published SF-36 population norms, one study performed on patients with rheumatoid arthritis showed an intraclass correlation coefficient for the physical functioning, bodily pain, and general health perceptions scales of 0.93, 0.76, and 0.91, and mean score difference of -1.8, 2.9 and 0.2, respectively [9].
The findings of the present study do not support the suggestion that the exclusion of some scales of a health-status measure would influence the response patterns to the remaining scales. We have not found any previous study on the influence of excluding some scales in a health-status questionnaire. Specific diseases might have substantial impact on certain health dimensions and little or no impact on others, which would be reflected in the scores for the scales measuring these dimensions. Also, health-status scales are often used as part of more extensive questionnaires and researchers might elect to include selected scales that are relevant to the study purpose; the physical functioning, bodily pain and general health perceptions scales have been selectively used previously [5][6][7]. Shorter versions of certain health-status questionnaires have been introduced with the purpose of reducing responder burden. However, these shorter versions attenuate the original scales and might not perform as well in diseases that have larger impact on specific scales. The SF-12 (a shorter version of the SF-36) generates a physical and a mental component summary score [10]. These summary scales have demonstrated inferior performance compared to the bodily pain or physical functioning scales in musculoskeletal conditions [11]. Shorter questionnaires might have a higher response rate (although not consistently shown) [12,13], in addition to saving time and resources. By reducing the workload required, shorter questionnaires might facilitate the participation of clinicians in national databases, improving the validity of the information obtained from these databases. Use of existing scales might be an alternative to the long process of constructing shorter questionnaires followed by extensive reliability and validity testing [14].
However, excluding certain scales has been discouraged by the questionnaire's developers on the basis that it might compromise reliability [1]. A previous study evaluated the use of selected scales but examined only reliability of these scales without showing whether they would generate similar scores when used within the entire questionnaire [15]. Demonstrating good psychometric properties of selectively used scales is important. However, maintaining good reliability of selected SF-36 scales does not necessarily ensure that the scales would yield similar scores as when administered within the entire SF-36 to allow comparison across studies. The findings of the present study imply that the scores of the selectively used scales can be compared with the corresponding scores in studies that used the entire SF-36 and with population norms, thus facilitating score interpretability.
In the first part of the study, the order in which the two questionnaires were administered was not random. However, it is unlikely that this could have influenced the results because consecutive patients were included and the questionnaires were self-completed.