Responsiveness of five condition-specific and generic outcome assessment instruments for chronic pain

Background Changes of health and quality-of-life in chronic conditions are mostly small and require specific and sensitive instruments. The aim of this study was to determine and compare the responsiveness, i.e. the sensitivity to change, of five outcome instruments for effect measurement in chronic pain. Methods In a prospective cohort study, 273 chronic pain patients were assessed on the Numeric Rating Scale (NRS) for pain, the Short Form 36 (SF-36), the Multidimensional Pain Inventory (MPI), the Hospital Anxiety and Depression Scale (HADS), and the Coping Strategies Questionnaire (CSQ). Responsiveness was quantified by effect size (ES) and standardized response mean (SRM) before and after a four-week inpatient interdisciplinary pain program and compared by the modified jackknife test. Results The MPI measured pain more responsively than the SF-36 (ES: 0.85 vs 0.72, p = 0.053; SRM: 0.72 vs 0.60, p = 0.027) and the pain NRS (ES: 0.85 vs 0.62, p < 0.001; SRM: 0.72 vs 0.57, p = 0.001). Similar results were found for the dimensions of role and social interference with pain. Comparison in function was limited due to divergent constructs. The responsiveness of the MPI and the SF-36 was equal for affective health, but both were better than the HADS (e.g. MPI vs HADS depression: ES: 0.61 vs 0.43, p = 0.001; SF-36 vs HADS depression: ES: 0.54 vs 0.43, p = 0.004). In the "ability to control pain" coping dimension, the MPI was more responsive than the CSQ (ES: 0.46 vs 0.30, p = 0.011). Conclusion The MPI was most responsive in all comparable domains, followed by the SF-36. The pain-specific MPI and the generic SF-36 can be recommended for comprehensive and specific bio-psycho-social effect measurement of health and quality-of-life in chronic pain.


Background
Chronic pain is a syndrome of multiple etiology and has consequences for somatic, psychological and psychosocial well-being, functionality and health-related quality-of-life [1]. Outcome assessment of chronic pain should comprehensively cover all relevant dimensions of these health characteristics and should, therefore, be performed with generic measurement tools [2,3]. However, more comprehensive measurement often comes at the cost of less sensitive assessment in specific domains, as shown in various studies: in the assessment of shoulder arthritis, a dose-response relationship between specificity and responsiveness was shown empirically [4]. Improvements following interventions for chronic pain disorders are often small, and their detection requires specific instruments which are sensitive to change, i.e. responsive [5][6][7]. Responsiveness is, therefore, besides reliability and other aspects of validity, one of the most important properties of an outcome measure [8]. It is the basis on which the 'discrimination' criteria were established by the quality classification process of the Outcome Measures in Rheumatology Clinical Trials (OMERACT) carried out by the World Health Organization (WHO), the American College of Rheumatology (ACR), and the European League Against Rheumatism (EULAR) [9].
There are several methods to measure responsiveness. Commonly used is the effect size (ES), which gives a continuous parametric measure of the change between baseline and follow-up and can be easily interpreted (for determination and interpretation, see Methods) [8,10-13]. However, many reports used the standardized response mean (SRM), which often results in values similar to the ES (see Results; for determination and interpretation, see Methods) [14]. In this study, we report both parameters to be comparable to the majority of findings in the literature. A similar parameter is provided by Guyatt's responsiveness statistic, but its determination requires a two-point measurement over a "stable" time period, i.e. without health change and interventions, and it is, therefore, often not available [15,16]. It often results in higher values than the ES and the SRM, as can be seen in the comparison of all three parameters in [7]; see also [11]. If an external criterion (e.g. improved versus unchanged) or a diagnostic threshold of a score (e.g. score ≥ 60 for severe depression) is known as a "gold" standard or "anchor", the receiver operating characteristics (ROC) curve is a sensitive method to characterize responsiveness. It provides sensitivity, specificity, and negative and positive predictive values, and the area under the ROC curve gives a goodness-of-fit measure of a test [13,14,16,17]. Further advantages and disadvantages of the different methods can be found in the indicated references.
Comparison of the responsiveness of two scales only makes sense if they measure more or less the same content and construct within the same domain, e.g. in pain, function or affective health [18,19]. This means that the two scales should have a high construct overlap, which is most often quantified by the correlation between the two scales [19,20]. To our knowledge, only one study exists examining and comparing the responsiveness of different self-assessment instruments in chronic pain [19].
The present study aimed to determine and compare the responsiveness of five self-assessment instruments widely used in the evaluation of chronic pain patients in an effort to identify the best instruments and scales for the measurement of specific health and quality-of-life dimensions. We hypothesized that a condition-specific instrument is more responsive than a generic one.

Patients
The subjects included in the study were all participants of the "Zurzach Interdisciplinary Pain Program" (ZISP) who were suffering either from chronic non-specific back pain (i.e. lumbar, thoracic, cervical, or panvertebral pain syndrome), or fibromyalgia according to the definition of the American College of Rheumatology (ACR), or chronic widespread pain, i.e. generalized musculoskeletal pain syndrome which did not meet the definition criteria of fibromyalgia [21]. The ZISP program is a comprehensive, standardized, four week inpatient pain program at the rehabilitation clinic "RehaClinic", Bad Zurzach, Switzerland and consists mainly of medical care including adapted drug therapy, graded activity exercise, and cognitive behavioral therapy. A detailed description of the program with the inclusion and exclusion criteria has already been published as part of our outcome paper [5].

Measures
The Short Form 36 (SF-36) is the most widely accepted and frequently used generic instrument that comprehensively measures physical, mental and psychosocial health by means of 36 items (questions) that determine 8 scales [22,23]. The West Haven-Yale Multidimensional Pain Inventory (WHYMPI, abbreviated to MPI) assesses pain and pain-specific consequences in terms of symptoms, activity, behavior, mood, and social relationships on the basis of 51 items that construct 12 scales [24,25]. The Hospital Anxiety and Depression Scale (HADS) measures anxiety and depression based on 7 items each and is well established in psychology and psychiatry with a long history of application [26,27]. The Coping Strategies Questionnaire (CSQ) is the tool most often used to assess cognitive and behavioural strategies to tolerate, manage and compensate for pain and its consequences, and is based on 48 items resulting in 8 scales plus 2 additional control items [28][29][30]. All four instruments are standardized, well tested and widely used: a quick search in MedLine showed 2000-6300 citations for each of the four tools (February 26, 2008). In addition, current pain was assessed by the 11-point Numeric Rating Scale (NRS) ranging from 0 = no pain to 10 = most pain imaginable [31].
At the scale level, two instruments can be compared if the items that make up the scales ask about the same domain, i.e. have the same construct [19,20]. Thus, MPI pain severity was compared with SF-36 bodily pain and the pain NRS for the assessment of pain. SF-36 role physical together with SF-36 social functioning were compared to MPI interference with pain in the assessment of somatic and psychosocial consequences of the pain disorder covering activities of daily living, work, leisure, and social participation. Function, including ambulation and specific activities (home and outdoor), was covered by SF-36 physical functioning and the MPI general activities score; the latter was determined by all 18 activity items as previously described [32]. Affective health/mood (explicitly: happiness, tension, irritability, nervousness, calmness/quietism) was assessed by the HADS depression and anxiety scales and compared to SF-36 mental health and MPI negative mood. Control over pain was measured by MPI control pain and CSQ control pain (each by one item). These domains have been previously described and the overlap of their constructs has been tested empirically [19].

Analysis
Assessments were performed at entry into the clinic (baseline) and in the last two days before discharge, i.e. four weeks after entry (follow-up). The scores were determined following the "missing rules" of the instruments, i.e. to determine a score, at least 50% of the items had to be filled out for the SF-36 and 6/7 (86%) for the HADS [22,27]. For the MPI and the CSQ, where the developers of the questionnaires do not describe missing rules, we used the previously described 2/3 (67%) criterion [5,6,30]. To ease comparison between the instruments, the score range of all instruments' scores was transformed to 0 = maximal pain/no function/worst coping/worst health to 100 = no pain/full function/best coping/best health, as originally described for the SF-36, with the exception of the pain NRS (0 = no pain, 10 = maximal pain) [5,6,22]. All analyses were performed using the statistical software package SPSS 16.0 for Windows® (SPSS Inc., Chicago, IL, USA).
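The scoring step described above can be illustrated by a minimal Python sketch. It rests on assumptions that are not stated in the text: the raw scale score is taken as the mean of the answered items, the transformation to 0-100 is linear, and the function name `scale_score`, its parameters, and the example answers are hypothetical. The instruments' own scoring manuals remain authoritative.

```python
from fractions import Fraction

def scale_score(items, item_min, item_max, min_fraction, higher_is_better=True):
    """Return a 0-100 score (100 = best health) or None if too many items are missing.

    items: item answers, with None marking an omitted item.
    min_fraction: required share of answered items (a Fraction, to avoid
                  floating-point rounding at the threshold).
    Assumption: the raw score is the mean of the answered items.
    """
    answered = [v for v in items if v is not None]
    if len(answered) < min_fraction * len(items):
        return None  # missing rule not fulfilled, no score is determined
    raw = sum(answered) / len(answered)
    score = 100 * (raw - item_min) / (item_max - item_min)
    # Invert scales on which a high raw value means worse health.
    return score if higher_is_better else 100 - score

# Hypothetical 7-item scale scored 0-3 per item, high raw value = worse health,
# with the 6/7 missing rule.
print(scale_score([1, 1, None, 0, 1, 0, 0], 0, 3, Fraction(6, 7), higher_is_better=False))
print(scale_score([1, None, None, 0, 1, 0, 0], 0, 3, Fraction(6, 7), higher_is_better=False))
```

Passing the threshold as a `Fraction` keeps the comparison exact when the number of answered items lies precisely on the limit (e.g. 6 of 7).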
The score difference between follow-up and baseline divided by the standard deviation of the group's baseline scores is defined as the effect size (ES), originally introduced as "Glass's delta" [10,12]. The score difference (follow-up minus baseline) divided by the standard deviation of the group's score differences determines the standardized response mean (SRM), originally published as "Hedges' g" for one sample, which is equal to "Cohen's delta" in this case [12,14]. The ES and the SRM are the most common measures of responsiveness. Positive values reflect (standardized) improvements expressed as the number of standard deviations of the baseline scores (ES) or of the score differences (SRM), i.e. they are unit-free [7,11]. An ES ≥ 0.80 is considered large, 0.50-0.79 moderate, 0.20-0.49 small, and 0.00-0.19 very small [10].
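As an illustration of the two definitions above, the following minimal Python sketch computes the ES and the SRM for paired baseline and follow-up scores. The function names and the example data are hypothetical and not taken from the study. The sample standard deviation (n - 1 denominator) is assumed, and applying the interpretation thresholds to the magnitude of a negative value is likewise an assumption.

```python
from statistics import mean, stdev

def effect_size(baseline, follow_up):
    """ES: mean score change divided by the SD of the baseline scores."""
    diffs = [f - b for b, f in zip(baseline, follow_up)]
    return mean(diffs) / stdev(baseline)

def standardized_response_mean(baseline, follow_up):
    """SRM: mean score change divided by the SD of the score changes."""
    diffs = [f - b for b, f in zip(baseline, follow_up)]
    return mean(diffs) / stdev(diffs)

def interpret_es(es):
    """Verbal label following the thresholds given in the text."""
    size = abs(es)  # assumption: deteriorations are labelled by magnitude
    if size >= 0.80:
        return "large"
    if size >= 0.50:
        return "moderate"
    if size >= 0.20:
        return "small"
    return "very small"

# Hypothetical scores on a 0-100 scale (100 = best health).
baseline = [20, 30, 40, 50, 60]
follow_up = [30, 35, 55, 55, 75]
es = effect_size(baseline, follow_up)
srm = standardized_response_mean(baseline, follow_up)
print(round(es, 2), interpret_es(es), round(srm, 2))
```

In this constructed example the changes vary little between patients, so the standard deviation of the differences is small and the SRM clearly exceeds the ES.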
To test whether the difference between two responsiveness measures within a certain domain was statistically significant, the "modified jackknife test" was applied [7,18]. This method is a linear regression with the difference of the ES or SRM of two comparable scores (e.g. between SF-36 bodily pain and MPI pain severity) as the dependent variable and the "centered" ES/SRM of one of the two scales (which scale is not relevant) as the independent variable. If the regression's intercept (the value of the ES/SRM difference where the centered ES/SRM is equal to zero) differs from zero with significance p < 0.050, the responsiveness of the two scales differs significantly. For this purpose, the difference of the two ES/SRM was computed in SPSS individually for each patient, as was the "centered" ES/SRM, which is equal to the individual ES/SRM minus the (mean) ES/SRM of the whole sample [18].
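The regression described above may be sketched as follows in Python. This is an illustrative reading of the procedure, not the SPSS analysis used in the study: the function name and the example values are hypothetical, the individual ES/SRM values are assumed to be each patient's score change divided by the group standard deviation, and the two-sided p-value uses a normal approximation instead of the t distribution with n - 2 degrees of freedom, which is only reasonable for large samples.

```python
from math import sqrt
from statistics import mean, NormalDist

def modified_jackknife(effects_a, effects_b):
    """Regress the individual effect differences (a - b) on the centered effects of a.

    effects_a, effects_b: individual standardized effects of the same patients
    on two comparable scales. Returns (intercept, standard error, t, p).
    """
    n = len(effects_a)
    y = [a - b for a, b in zip(effects_a, effects_b)]  # dependent variable
    mean_a = mean(effects_a)
    x = [a - mean_a for a in effects_a]                # centered predictor
    sxx = sum(v * v for v in x)
    slope = sum(xi * yi for xi, yi in zip(x, y)) / sxx
    intercept = mean(y)  # the predictor has mean zero, so the intercept is mean(y)
    residuals = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    residual_variance = sum(r * r for r in residuals) / (n - 2)
    se_intercept = sqrt(residual_variance / n)
    t = intercept / se_intercept
    p = 2 * (1 - NormalDist().cdf(abs(t)))  # assumption: normal approximation
    return intercept, se_intercept, t, p

# Hypothetical individual effects of five patients on two scales.
scale_a = [1.0, 0.5, 1.5, 0.8, 1.2]
scale_b = [0.6, 0.4, 0.9, 0.7, 0.5]
intercept, se, t, p = modified_jackknife(scale_a, scale_b)
print(round(intercept, 2), round(t, 2))
```

Because the predictor is centered, the intercept equals the difference between the two group-level responsiveness values, and the regression only serves to obtain its standard error.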
In multiple pairwise testing of (at least partly) non-independent scores (e.g. within the patient rating of pain), the significance level must be divided by the number of pairwise comparisons among the k tested scores, i.e. p = 0.05/(k!/((k-2)!*2!)), which is well known as the Bonferroni correction [33]. Thus, the significance level for type I error was set at p = 0.050/3 = 0.017 for the comparison of k = 3 instruments (MPI, SF-36, pain NRS in pain and SF-36, MPI, HADS in affective health) and at p = 0.050 for the comparison of two instruments.
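The arithmetic of this correction can be checked with a short Python sketch. The function name is hypothetical, and the expression k!/((k-2)!*2!) is evaluated as the binomial coefficient "k choose 2".

```python
from math import comb

def bonferroni_alpha(k, alpha=0.05):
    """Significance level per test for all pairwise comparisons among k scores."""
    n_pairs = comb(k, 2)  # k! / ((k - 2)! * 2!)
    return alpha / n_pairs

print(round(bonferroni_alpha(3), 3))  # three instruments, three pairwise tests
print(bonferroni_alpha(2))            # two instruments, one test
```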
To quantify the extent of the overlapping constructs within a domain, bivariate Spearman rank correlation coefficients of the baseline scores and the effects (raw score differences baseline → follow-up) were determined for each pair of scales being compared [20].
An additional way to assess the size of effects is to compare the ES with the minimal important difference (MID), for which the estimate is based on the standard error of measurement (SEM) [34]. The SEM in score units is equal to the baseline standard deviation of the scores multiplied by the square root of (1-r), where r is the reliability measure of the scale, usually the intraclass correlation coefficient [34]. The SEM in responsiveness units is therefore equal to the square root of (1-r) for the ES and equal to the baseline standard deviation divided by the standard deviation of the score differences times the square root of (1-r) for the SRM. Note that the SEM in ES or SRM units is independent of the frequency distribution of the scores, i.e. of the sample itself; it depends only on the reliability coefficient. We chose the "one-SEM" criterion, which means that 1*SEM is an estimate of the MID, the effect which patients (on average) perceive as subjective change [34]. The MID can be used as an estimate for the minimal clinically important difference (MCID), which in principle is an anchor-based method (i.e. based on an external criterion) to assess the smallest effect that patients perceive to be beneficial. As empirically shown, the MID and the MCID are often of similar size [34], or the MCID is even smaller than the MID (see example in the Discussion: [35]). Further information about the importance of the MCID can be found elsewhere [8,16]. However, the present study compared within-subject and not between-subject effects.
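The three forms of the SEM described above can be written out as a minimal Python sketch. The function names and the numerical values (reliability r = 0.84, standard deviations of 20 and 16 score points) are hypothetical and serve only to show the arithmetic.

```python
from math import sqrt

def sem_score_units(sd_baseline, r):
    """SEM in score units: baseline SD times sqrt(1 - r)."""
    return sd_baseline * sqrt(1 - r)

def sem_es_units(r):
    """SEM in ES units: the score-unit SEM divided by the baseline SD."""
    return sqrt(1 - r)

def sem_srm_units(sd_baseline, sd_differences, r):
    """SEM in SRM units: the score-unit SEM divided by the SD of the differences."""
    return sd_baseline / sd_differences * sqrt(1 - r)

# Hypothetical scale with reliability r = 0.84.
print(round(sem_score_units(20, 0.84), 2))    # score points
print(round(sem_es_units(0.84), 2))           # ES units
print(round(sem_srm_units(20, 16, 0.84), 2))  # SRM units
```

Under the "one-SEM" criterion, an observed ES would be compared directly with `sem_es_units(r)` to judge whether it exceeds the estimated MID.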

Patients
The cohort consisted of 273 chronic pain patients assessed between 1999 and 2006 whose characteristics have been reported in detail elsewhere [5,36]. The median pain duration was 60 months (5 years). The mean age was 46.3 years (standard deviation = 10.5, normally distributed) and 79.9% were female. Fibromyalgia was present in 43.2%, chronic back pain in 42.5%, and chronic widespread pain in 14.3% of the cases. There were very few omissions: complete data pairs (i.e. 273 score pairs baseline/follow-up) were available for most of the scores. The MPI negative mood and the pain NRS had 272, the HADS anxiety 271, the MPI activity and the MPI control each 270, and the CSQ control 267 complete data pairs.

Responsiveness analysis
The descriptive data of all scores at baseline and follow-up, together with the ES, the SRM, and the MCID, are shown in Table 1. We report all data of the instruments for completeness. Effects for SF-36 vitality and MPI pain severity showed great improvement (ES≥ 0. Comparing the ES and SRM data within the same scale, the differences were mostly small (i.e. |ES-SRM| ≤ 0.10) except for SF-36 role physical, SF-36 bodily pain, SF-36 mental health, SF-36 mental component summary (MCS), MPI pain severity, HADS depression, and CSQ praying or hoping. In SF-36 role physical, the reason for this may be the high floor effect: the scores at baseline were close to zero, which resulted in a small baseline standard deviation and thereby in a high ES compared with the lower SRM, which is determined by the relatively higher standard deviation of the differences. The effects of the sample were slightly or much higher than the estimated MCID (3rd and 2nd columns from the right of Table 1) for the pain NRS, in 9/10 SF-36 scales, 4/10 MPI scales, 1/2 HADS scales, and 1/10 CSQ scales. Table 2 shows the comparison between the responsiveness measures of those instruments measuring the same construct domain. The MPI was more responsive than the SF-36 in the effect measurement of pain (and also than the pain NRS), role interference with pain (only by the SRM), and social interference with pain. The MPI was also more responsive than the HADS in affective health (but not better than the SF-36), and more responsive than the CSQ in coping (ability to control pain). In mood assessment, the SF-36 was overall more responsive than the HADS.
The correlation coefficients of the baseline scores and raw score differences, i.e. absolute effects (

Discussion
We examined the ability of five self-assessed outcome instruments to sensitively measure changes in physical and mental health and quality-of-life in 273 chronic pain patients before and after a four-week inpatient interdisciplinary pain program. The pain-specific MPI, specially developed for chronic pain conditions, was the most responsive scale, or at least as responsive as the compared scales, in the domains of pain, role and social interference with pain, depression, anxiety, and control of pain. The SF-36 was equally or more responsive than the pain NRS, the mood-specific HADS, and the coping-specific CSQ in the comparable domains. Both the HADS and the CSQ are to some extent also generic measures, as they are applicable to various health and behavioural conditions. Apart from interference with pain, responsiveness in function cannot be compared due to divergent constructs: the MPI asks about very specific functions in its activity dimension, whereas the SF-36 covers physical tasks more generally, e.g. ambulation. Overall, the correlation data showed moderate (to partly high) overlap of the constructs of the compared scales in pain, coping, and affective health and were comparable to those reported previously between the SF-36 and the MPI [19].
The MPI confirmed our hypothesis that a condition-specific instrument measures more responsively than a generic one (the SF-36). The mood-specific HADS and the coping-specific CSQ failed to fulfil this hypothesis as far as could be determined on the basis of the compared scores. In other words, the generic SF-36 (also) is an excellent tool in the assessment of affective health, social function and role performance (physical and emotional) in chronic pain.
All examined domains of health and quality-of-life are important for the patient and are particularly addressed by treatment management in accordance with the International Classification of Functioning, Disability and Health (ICF) [37]: in most cases chronic pain cannot be eradicated, but the patient can learn to tolerate, manage, and compensate for its consequences [38]. Most of the effect differences were small, a typical characteristic or problem in the measurement of chronic conditions, whether changes in health are due to the natural course of the disorder or due to treatment interventions. As a consequence, in clinical and interventional studies, it is essential to use a control group design and to choose the most responsive instrument for the detection of changes in outcome in order to keep the sample size, and thereby the costs, low. Vice versa, observed effects may remain In addition, the importance of effects which are smaller than the MID or the MCID is questionable since they are (on average) below the level of what the patients can perceive as a meaningful change. A more responsive instrument is able to measure higher effects above the MCID than an instrument with low responsiveness. Estimating the MCID by the SEM is notably more conservative than determining it by the "transition method", which asks the patients directly to rate their global health change (the anchor) and relates this assessment to the measured size of the ES or SRM [35,39]. For example, for the Western Ontario and McMaster Universities osteoarthritis index (WOMAC) global score, the MCID estimated by the SEM was 0.54 in ES units, whereas the MCID by the transition method was 0.40 [35].
The choice of the responsiveness parameter depends on the focus of interest and the characteristics of the different methods as outlined in the background. For the decision between ES and SRM, there is a directive rule [40]. The following scales showed pre-treatment to post-treatment scores rank correlations lower than 0.50 (data not shown in detail): Pain NRS, SF-36 role physical, bodily pain, social functioning, role emotional, MPI pain severity, life control, control pain, CSQ control pain and decrease pain. As parameter for responsiveness for these scales, the ES would be more appropriate than the SRM; for all other scales with correlation ≥ 0.50, it would be the SRM [40].
To our knowledge and after an extensive search in MedLine, there is only one comparable study which assessed the responsiveness of the MPI and the SF-36 in chronic pain patients referred to an outpatient interdisciplinary pain program [19]. were not statistically tested since these domains showed a small overlap and correlation in the regression analysis whereas the other three (pain, interference, mood) largely overlapped; our data reproduced these overlap findings. In summary, these results were only partly consistent with our results, but the lack of significance was probably caused by the small sample size (n = 87), the less sensitive t-test compared to the modified jackknife test, and small effects that were often below the MCID. Another study used some scales of the SF-36, MPI, and CSQ in 142 fibromyalgia (FM) patients after a multidisciplinary outpatient pain program [41]. From these data, high ES ranging from 0.84 to 1.79 can be determined. However, the only possible comparison is that between MPI interference (ES = 1.65) and SF-36 role physical (ES = 1.79), which is not significantly different (t-test) and is consistent with our findings.
The study's strengths are the large, prospectively examined cohort with consistent characteristics and almost no missing data. The assessment instruments used are well known worldwide, thoroughly tested, and permit standardized measurement and comparison between cohorts of different conditions, countries, and cultures. To our knowledge, there is no previously published study which compared the five instruments for chronic pain. As a limitation, it must be stated that the transition question to determine the MCID more precisely was not asked, but this does not affect the responsiveness comparison itself [35,39].