Minimal important differences for fatigue patient reported outcome measures—a systematic review

Background Fatigue is the most frequent symptom reported by patients with chronic illnesses. As a subjective experience, fatigue is commonly assessed with patient-reported outcome measures (PROMs). More than 40 generic and disease-specific PROMs for assessing fatigue are currently in use. The interpretation of changes in PROM scores may be enhanced by estimates of the so-called minimal important difference (MID). MIDs are not fixed attributes of PROMs but rather vary in relation to estimation method, clinical and demographic characteristics of the study group, etc. The purpose of this paper is to compile published MIDs for fatigue PROMs, spanning diagnostic/patient groups and estimation methods, and to provide information relevant for appraising their appropriateness for use in specific clinical trials and in monitoring fatigue in defined patient groups in routine clinical practice. Methods Three databases (Scopus, CINAHL and Cochrane) were systematically searched for studies published between January 2000 and April 2015, using the term fatigue combined with variations of the term MID, e.g. MCID, MIC, etc. Two authors screened search hits and extracted data independently. Data regarding MIDs, anchors used and study designs were compiled in tables. Results Included studies (n = 41) reported 60 studies or substudies estimating MID for 28 fatigue scales, subscales or single item measures in a variety of diagnostic groups and study designs. All studies used anchor-based methods; 21/60 also included distribution-based methods and 17/60 used triangulation of methods. Both similarities and dissimilarities were seen within the MIDs. Conclusions Magnitudes of published MIDs for fatigue PROMs vary considerably. Information about the derivation of fatigue MIDs is needed to evaluate their applicability and suitability for use in clinical practice and research.


Background
Fatigue is among the most frequent complaints reported by patients with chronic illnesses [1][2][3][4] and has far-ranging, often debilitating consequences on their wellbeing and physical, emotional and social functioning [5]. Although there is no consensus definition of fatigue, it is often described as 'a persistent, overwhelming sense of tiredness, weakness or exhaustion resulting in a decreased capacity for physical and/or mental work' [6]. Fatigue is a subjective experience and is commonly assessed by means of patient-reported outcome measures (PROMs). PROMs are widely used today in evaluating the effects of illness and treatment on symptoms, functioning, and other outcomes from the patient's perspective [7].
Currently, some 40 generic and disease-specific PROMs for assessing fatigue are in use [8]. Most of these fatigue measures have been evaluated regarding various aspects of validity and reliability. Although these are important psychometric properties reflecting the quality of the measure, they are of little value in interpreting the meaning of scores derived from that measure [9]. Nonetheless, interpretation of scores, in particular changes in scores, is of critical concern in trials evaluating effects of treatments aimed at reducing fatigue, as well as in routine clinical practice in monitoring and managing fatigue in individual patients. In clinical trials, it has long been recognized that conventional statistical significance testing provides information regarding the probability that an effect exists, not about the meaningfulness of the size of the effect [10]. In clinical practice, difficulties in evaluating and interpreting changes in PROM scores often impinge on their usefulness in informing clinical decision-making [11].
The interpretation of changes in PROM scores may be enhanced by estimates of the so-called minimal important difference (MID). MID was originally defined over 25 years ago as "the smallest difference in score in the domain of interest which patients perceive as beneficial and which would mandate, in the absence of troublesome side effects and excessive cost, a change in the patient's management" [12]. During the past decades considerable research attention has been directed towards deriving MIDs for PROMs. In this pursuit a variety of methods have been developed and applied, but no clear consensus exists regarding which method or methods are most suitable.
To date, two main methods have been applied, namely anchor-based approaches and distribution-based approaches. Descriptions of these methods are beyond the scope of this paper and are summarized in detail elsewhere [13]. Briefly, anchor-based approaches use various external criteria (patient-reported, physician-reported, or clinical anchors) to interpret whether a particular magnitude of change is important. For example, a common anchor-based method involves the use of global rating scales (GRS) where MIDs are derived by comparing patients' self-ratings of change (e.g., "much worse"-"much better") to change in PROM scores. The MID is often defined as lying within the range of "slightly worse/better" on the GRS [9]. Distribution-based approaches rely on the statistical characteristics of the distribution of scores in the sample, in which the magnitude of change is generally expressed as a function of the standard deviation (SD) of scores alone or in combination with the reliability of the PROM (standard error of the measurement (SEM)) [14]. Various SD and SEM cut-off values have been proposed for estimating MIDs, including 1/2 or 1/3 SD and 1-2 SEM. Another commonly applied method is the use of effect sizes (ES) or standardized response means (SRM), where change scores are divided by the SD at baseline or the SD of change, respectively. The MID is often defined as change values lying within the range of 0.2-0.5. A disadvantage of distribution-based approaches is that they do not address the clinical importance of the change. Recent recommendations have proposed that, as a first-line method, multiple anchor-based approaches should be used, which, supported by distribution-based methods, may be triangulated to a single MID value or smaller range of values [14][15][16][17].
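The two families of estimates described above can be illustrated with a minimal Python sketch. It is not taken from any of the reviewed studies: the function names, the example scores and the reliability value are hypothetical, the SEM is computed with the commonly used formula SD × sqrt(1 − reliability), and the anchor-based estimate is simplified to the mean change among patients who rate themselves "slightly better".

```python
import math
import statistics


def distribution_based_mids(baseline, followup, reliability):
    """Distribution-based quantities from paired scores (illustrative only).

    `reliability` is an assumed reliability coefficient of the PROM
    (e.g. test-retest or Cronbach's alpha); it is not given in the text.
    """
    changes = [f - b for b, f in zip(baseline, followup)]
    sd_baseline = statistics.stdev(baseline)   # sample SD at baseline
    sd_change = statistics.stdev(changes)      # sample SD of change scores
    mean_change = statistics.mean(changes)
    sem = sd_baseline * math.sqrt(1 - reliability)
    return {
        "half_sd": sd_baseline / 2,
        "third_sd": sd_baseline / 3,
        "one_sem": sem,
        "two_sem": 2 * sem,
        "effect_size": mean_change / sd_baseline,  # ES: change / baseline SD
        "srm": mean_change / sd_change,            # SRM: change / SD of change
    }


def anchor_based_mid(changes, ratings, target="slightly better"):
    """Mean absolute change among patients whose global rating equals `target`."""
    selected = [c for c, r in zip(changes, ratings) if r == target]
    if not selected:
        raise ValueError("no patients with the target rating")
    return abs(statistics.mean(selected))


# Hypothetical data, invented for illustration.
baseline = [10, 20, 30, 40, 50]
followup = [12, 25, 31, 46, 51]
dist = distribution_based_mids(baseline, followup, reliability=0.84)

changes = [-4, -6, 0, -1, -5]
ratings = ["slightly better", "slightly better", "unchanged",
           "unchanged", "slightly better"]
anchor = anchor_based_mid(changes, ratings)
```

The sketch shows why the two approaches can diverge: the distribution-based values depend only on the spread and reliability of the scores, whereas the anchor-based value depends on which patients report a small but noticeable change.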
Although appealing for its simplicity, the idea of a single, universal MID value for any particular PROM remains elusive for a number of reasons. Firstly, different MID estimation approaches have been shown to yield highly disparate MIDs and hence triangulation (combining different methods to estimate a MID) may be problematic [18]. Secondly, MIDs have also been shown to differ by population and context [14]. For example, MIDs vary by diagnostic group; characteristics of the study sample, e.g., demographics and baseline levels; disease severity; treatment; and choice of anchors [18,19], as well as by whether they gauge improvement or deterioration [20]. This variability suggests the need to understand how a particular MID value was determined in order to judge its appropriateness for use in research for interpreting change and/or computing sample sizes, or in clinical practice for monitoring fatigue in specific patient groups [21].
The purpose of this paper is to compile published MIDs for fatigue PROMs, spanning diagnostic/patient groups and estimation methods, and to provide information relevant for appraising their appropriateness for use in specific clinical trials and in monitoring fatigue in defined patient groups in routine clinical practice.

Methods
A systematic literature review was conducted in which three databases (Scopus, CINAHL and Cochrane) were searched from January 2000 to April 2015 to identify studies with calculated MIDs for fatigue scales, subscales and single item measures. The searches were limited to English-language publications (search string: ("minimal clinical important difference*" OR "minimal important difference*" OR "minimal clinically important difference*" OR "minimally important difference*" OR "clinical important improvement*" OR "clinically important improvement*" OR "minimal important clinical difference*" OR "minimally important clinical difference*" OR "responder definition") AND Fatigue). The search was augmented by screening article reference lists. All expressions including "difference/change/improvement" or equivalent, "important" as well as "minimal" or "clinical", or "responder definition" were defined as MIDs. For ease of reading, all minimally important changes are called MIDs in this paper.
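The rule used here for deciding which expressions count as a MID can be written down as a small Python sketch. This is an illustration only, not the screening procedure actually used by the authors: the function name is hypothetical, matching is done on simple substrings, and the "or equivalent" wording in the rule is not covered.

```python
def is_mid_expression(phrase: str) -> bool:
    """Return True if `phrase` satisfies the terminology rule stated above.

    Rule (as read from the text): the phrase is "responder definition", or it
    contains a change word ("difference", "change" or "improvement"), the word
    "important", and at least one of "minimal" or "clinical".
    Substring matching means "minimally" and "clinically" also count.
    """
    p = phrase.lower()
    if "responder definition" in p:
        return True
    has_change_word = any(w in p for w in ("difference", "change", "improvement"))
    has_important = "important" in p
    has_qualifier = "minimal" in p or "clinical" in p
    return has_change_word and has_important and has_qualifier


examples = [
    "minimal clinically important difference",
    "clinically important improvement",
    "responder definition",
    "statistically significant difference",
]
flags = [is_mid_expression(e) for e in examples]
```

In this sketch the first three example phrases are accepted and the last is rejected, because it lacks both "important" and a "minimal"/"clinical" qualifier.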

Selection of articles
Inclusion criteria were reporting of MIDs in text and/or tables for a fatigue scale, subscale or single item measurement of fatigue. Exclusion criteria were: the reported MID was not derived directly in the study; insufficient information was supplied about the study sample, study design and/or method for determining the MID; the study sample was < 18 years of age; MIDs were not reported separately for a fatigue subscale; and the publication was a conference abstract. Exclusion at the title/abstract and full-text levels was done independently by two researchers (ÅN and AD), see Fig. 1.

Data extraction
Two authors (ÅN and AD) extracted data regarding MIDs and methods used, including anchors used. The last author (AD) checked all data extraction and prepared the tables. To facilitate interpretation, all MIDs are shown as absolute values and rounded to one decimal place, except for effect sizes. Some studies reported standard deviations (SD) and confidence intervals, but these are not shown in our tables or text. The fatigue measurements were identified as multidimensional scales, unidimensional scales or subscales, single item measurements or item bank scales.

Results
The literature search generated 177 articles (Fig. 1), of which 41 met the inclusion criteria. The main reasons for exclusion were that the reported MID was not derived in the study and that inadequate information was supplied about the study sample, study design and/or method for determining the MID. Many different expressions were used to name a small but important change in fatigue [13]. In this review we included studies using different phrases for MID (see Table 1), e.g. "MID", "MCID", "MCII" or an equivalent expression, all referred to as MID in this paper. Most of these expressions used some variation of "difference/change/improvement" or equivalent, "important" as well as "minimal". Some phrases also included "clinical". Two studies used "responder definition" [43,55], see Table 1. In two systematic reviews a phrase without "minimal" was used [59,60], but the authors defined values for a small or minimal change.
The included articles (n = 41) reported MIDs for 28 fatigue PROMs (characteristics shown in Table 2), resulting in 60 studies/substudies of MIDs. The studies varied in sample size, diagnostic group, MID estimation approach, study design, type of intervention and length of follow up. Sample sizes ranged from n = 40 to n = 2,583.
Sixteen different diagnoses were included in the reviewed studies. Twenty-seven of the studies in the 41 articles were longitudinal and follow-up periods ranged from two days after intervention to one year after baseline. An anchor-based approach alone was used in 39 of the 60 studies or substudies estimating MID, while the rest also used a distribution-based approach. Seventeen of these also included a method of triangulation to define MIDs. Two cross-sectional studies [33,46] reported MIDs for seven fatigue or vitality scales (MFI, FSS, MAF, CFS, VT/SF-36, FACIT-F and GRS). Other studies determined MIDs for two or more fatigue measures or subscales [28,47,48,51,59-61]. Several PROMs had MIDs determined in a number of different studies and several studies reported MIDs for up to seven PROMs. Nevertheless, most MIDs were derived in single studies, with one study per PROM [22-27, 29-32, 34-43, 45, 49, 50, 52-58, 62], see Table 3. Altogether, 60 studies or substudies estimating MIDs for global change (direction of change not specified), improvement and/or deterioration are described in Table 3. In Table 3 all score changes are presented as positive values, regardless of the direction of change. Confidence intervals and SDs (if derived in the study) are not shown. Numbers are rounded to one decimal place.

Multidimensional scales
Multidimensional fatigue inventory (MFI), score 20-100
Two cross-sectional studies [33,46] derived MIDs for systemic lupus erythematosus (SLE) and rheumatoid arthritis (RA) populations for the MFI total scale, using a patient global rating scale and interviews as anchors. MIDs ranged from 11.5 to 13.3 for global change, 6.8 to 9.6 for improvement and 9.5 to 12.8 for deterioration.
Fatigue severity scale (FSS), score 1-7
Three cross-sectional studies reporting MIDs for the FSS were identified [33,46,50]. Diagnostic groups included SLE, RA and multiple sclerosis (MS). Anchor-based approaches were applied in all three studies and a distribution-based approach (viz. effect size, ES, of at least 0.25) was also applied in one [50]. Two used a patient global rating scale as an anchor [33,46], whereas the third used clinical anchors and baseline data from a clinical trial to establish MIDs [50]. MIDs ranged from 0.5 to 1.2 for global change, 0.08 to 0.4 for improvement and 1.0 to 1.2 for deterioration.
Multidimensional assessment of fatigue (MAF), score 1-50
MID estimates for the MAF in two cross-sectional studies with SLE and RA patients [33,46] were 5.0 and 9.2 for global change, 1.4 to 5.4 for improvement and 8.3 to 8.9 for worsening, using a patient global rating scale as the anchor.
Chalder fatigue scale (CFS), score 0-33
The same two cross-sectional studies [33,46] reported MIDs for the CFS, where MIDs for global change were 2.3-3.3; for improvement 0.7-1.4; and for deterioration 3.2-3.5.
Fatigue impact scale (FIS), score 0-160
One cross-sectional study with MS patients [49] reported MIDs for the FIS ranging from 9 to 24 points for the different patient and clinician rating anchors, with a mean of 15.5 and SD 4.9. Distribution-based methods yielded MIDs ranging between 4.8 and 17.3 (1-2 SEM; ± 1/3-1/2 SD). Triangulation of anchor and distribution-based methods gave a MID range of 10-20 points.
Trial outcome index-fatigue (TOI-F), score 0-108
One study [28] reported TOI-F MIDs using data from three separate cancer trials. Triangulation was used to determine a MID, combining a patient-reported anchor, two physician-reported anchors (including response to treatment ratings), and one clinical anchor (haemoglobin level). MID estimates ranged from 4.8 to 26.6, and a single triangulated MID of 5.0 was recommended.
Perform questionnaire (PQ), score 12-60
One longitudinal study [22] estimated the PQ MID in cancer patients to be 3.7 for improvement. Triangulation was used to estimate a recommended MID of 3.5.

Schwartz cancer fatigue scale (SCFS), score 3-30
A longitudinal study of the SCFS using a patient-rated anchor [51] reported MIDs of 5.0 for global change, 2.1 for improvement and 5.7 for deterioration after a two-day follow-up.
Fatigue associated with depression questionnaire (FAsD), score 1-5
One longitudinal study [43] of patients with a clinical diagnosis of depression estimated MIDs for the FAsD ranging from 0.3 to 0.6 for improvement and from 0.2 to 0.3 for worsening after 6 weeks of follow-up.
Neurological fatigue index for multiple sclerosis (NFI-MS), summary score 0-30
One longitudinal study [44] using a patient global assessment of change reported MIDs for the NFI-MS: 2.5 for the ten-item Summary scale, 2.4 for the Physical scale (score range 0-24) and 0.8 for the Cognitive scale (score range 0-12).
Unidimensional scales or subscales
Multidimensional fatigue inventory (MFI) subscales, score 4-20
A longitudinal study [47] derived MIDs in a cancer population (pre and post radiotherapy) for the five MFI subscales. MIDs ranged between 1.4 and 2.4 depending on
Unidimensional fatigue impact scale (U-FIS), score 0-66
One longitudinal study using EQ5D as an anchor [55] derived MIDs in an MS sample. U-FIS MIDs corresponded to 6.5 for improvement and 4.7 for deterioration, and distribution-based MIDs ranged between 2.4 and 7.0.
Fatigue assessment scale (FAS), score 10-50
MIDs for the FAS were reported in one longitudinal study of sarcoidosis patients using the WHOQOL-BREF/Physical health domain and a ROC curve as anchors as well as distribution-based methods [31]. MIDs ranged between 3.0 and 4.2 and a triangulated MID value of 4 was suggested.
Vitality scale (VT) of the medical outcome study SF-36 health survey (SF-36), score 0-100
Eight studies [26,33,35,46,54,56,59,60] determined MIDs for the VT scale of the SF-36 using different designs and diagnostic groups: longitudinal studies with patient- and/or clinician-rated anchors, cross-sectional studies using patient-rated anchors, and systematic reviews using combined study data and expert panels. The MIDs ranged from 7.3 to 11.3 for improvement, 11.9 to 18.3 for worsening, 3.5 to 20 for global change and 4.2 to 18.8 for triangulated MIDs.
In these studies, MIDs varied from 3 to 8.3 irrespective of direction of change, 2.8 to 6.8 for improvement and 5.2 to 9.1 for deterioration. Two of the studies [29,38] combined various distribution-based approaches (SEM, SD and ES), resulting in MIDs ranging between 2.2 and 6.8, and presented triangulated MIDs ranging between 3 and 6.
FACT-An fatigue subscale (FACT-An Fatigue), score 0-80
One longitudinal study [45] estimated a MID for improvement of 4.2 in cancer patients using haemoglobin level as a clinical anchor and regression analysis to calculate the MID.
Profile of mood states short form fatigue subscale (POMS-F), score 0-28
One longitudinal study reported MIDs for the POMS-F using a sample of cancer patients undergoing chemotherapy [51]. A global MID of 5.6 points was determined as well as separate MIDs for improvement (2.1 points) and deterioration (5.7 points).
European organization for research and treatment of cancer quality of life questionnaire core 30 (EORTC QLQ-C30) fatigue scale, score 0-100
Six cross-sectional and longitudinal studies [24,25,36,40,41,62] reported MIDs derived in a variety of cancer diagnoses. MIDs were reported as 11.4 to 17.3 points for improvement and 5.7 to 24.5 points for deterioration. Distribution-based MIDs ranged from 3.0 to 19.7.
Sleep impact scale (SIS), energy/fatigue and mental fatigue subscales, score 0-100
One longitudinal study [39], using a clinician-rated anchor and a distribution-based method to assess change at 8-week follow-up, reported MIDs derived in patients with major depressive disorder (MDD). The anchor-based approach yielded a MID of 11.9 for the Energy/Fatigue subscale, whereas the distribution-based MID was 8.7. The corresponding MIDs for the Mental Fatigue subscale were 13.3 and 10.6, respectively.
Chronic respiratory questionnaire (CRQ), score 1-7
Two systematic reviews [52,59] used CRQ data from earlier studies to determine MIDs for the CRQ/Fatigue subscale and triangulated MIDs of 0.5 and 2 were proposed. One of the reviews estimated MIDs between 0.5 and 0.6 for global change and distribution-based MIDs of 0.47-0.54 [52].
Chronic heart failure questionnaire (CHQ), score 4-28
One systematic review using CHQ data and an expert panel proposed a MID for the CHQ/Fatigue subscale of 3-4 irrespective of direction and a triangulated MID of 3 [60].
Quality of life inventory in epilepsy (QOLIE-31), energy/fatigue subscale, score 0-100
One longitudinal study used 3 randomised controlled trials to examine the MID for the QOLIE-31/Energy/fatigue subscale [27]. A MID of 7.5 was defined using a patient rating of change and regression analysis. Distribution-based MIDs ranged between 5.4 and 9.4.
Visual analogue scale (VAS), score 0-100 or 0-10
Six longitudinal studies [30,32,37,53,57,58] derived MIDs for the VAS 0-100 and one [34] for the VAS 0-10 in a variety of diagnostic groups. MIDs for the VAS-100 ranged from 1.4 to 13.9 for improvement and 3.6 to 15.2 for deterioration, while the global change varied between 6.7 and 17. One study [57] determined a triangulated MID of 10 using the Delphi method. MIDs for the VAS-10 ranged from 0.8 to 1.1 for improvement and 1.1 to 1.3 for worsening, and were derived from three different anchors and at different follow-up times in three different diagnostic groups (RA, SLE and cancer) [34].
Global rating scale (GRS), score 0-10
MIDs for the single item GRS scale were determined in SLE, RA and cancer patients in two cross-sectional studies [33,46] and one longitudinal study [51], all using a patient global rating scale as an anchor. Global MIDs ranged from 1.1 to 2.0, while MIDs for improvement were 0.3 to 0.9 and for deterioration 1.5.
Edmonton symptom assessment system (ESAS) fatigue item, score 0-10
Two longitudinal cancer studies [23,48] identified MIDs for the fatigue item in the ESAS scale. MIDs ranged from 0.1 to 4 for improvement and from 1.0 to 1.8 for worsening of fatigue. Distribution-based MIDs ranged from 0.1 to 1.4.
Immune thrombocytopenic purpura-patient assessment questionnaire (ITP-PAC) fatigue subscale, score 0-100
One longitudinal study [42] assessed MIDs using patient impression of change for the ITP-PAC/Fatigue subscale. Global change was defined as 15.0 or as an effect size of 0.57.
PROMIS fatigue item bank scales
17-item PROMIS fatigue (Fatigue-17) and 7-item PROMIS fatigue (Fatigue-7), score 17-85 and 7-35
One study [61] derived
[14,18,19], suggesting the importance of considering such factors when appraising the appropriateness of published MIDs for use in clinical research and practice. In line with this, substantial variation was observed in MID values for individual fatigue PROMs in this review. For example, MIDs for the SF-36 Vitality scale ranged from as low as 4.2 to as high as 20.0 points (0-100 point scale) in studies varying in methodologies, anchors, diagnostic groups and direction of change assessed. Similarly, MIDs for the VAS-100 Fatigue scale ranged from 1.4 to 17. MIDs for the cancer-specific EORTC QLQ-C30 fatigue scale also varied between 1.8 and 24.5 points (0-100 scale) and those for the FACIT-Fatigue scale ranged between 6 and 16 (converted to percent), see Table 3. This wide variation in MIDs for individual fatigue scales suggests the importance of understanding how any particular MID was derived and of applying this knowledge when appraising its appropriateness for interpreting changes in fatigue scores.

Discussion
MID estimation methods varied considerably in the identified studies and substudies. However, in accordance with recent recommendations regarding methods for MID estimation [14], nearly all studies applied an anchor-based approach, where at least one anchor was used. Patient global change ratings were by far the most common anchor, but clinician-reported and clinical anchors were also used. Where more than one anchor was applied, either a range of values was generally reported or, as recommended [14,63,64], values were often triangulated to a single MID or a smaller range of MIDs. Distribution-based methods were used in about a third of the studies and only in conjunction with anchor-based approaches. A few studies used a Delphi method (Table 3).
In the studies using several anchors to determine MID values, global MID ranges varied within single studies from as little as two points (percent scores), in relation to the FACIT-Fatigue scale using patient-based anchors [29], to about 20 points for the TOI-F [28] using patient, clinician and clinical anchors. Interestingly, two studies reporting MIDs for the SF-36 Vitality scale, using the same diagnostic group (RA) but different anchors, yielded two distinct ranges of MIDs. In the study by Kosinski et al. [35], using patient and physician global assessments as anchors, MIDs ranged from 4.9-11.1, whereas a range of 11.0-20.0 was reported by Ward et al. [56] using the HAQ, CES-D and the SF-36 health transition item. Neither of these studies triangulated the range of values to a single MID or smaller range of values and hence these wide ranges of MIDs are arguably of questionable practical value for interpreting change in fatigue in RA patients as measured with the SF-36 Vitality.
Triangulation was used in 17 substudies, of which 10 used more than two anchors. This method has been recommended for consolidating MIDs derived from different methods to a single or small range of MID values [14]. However, it has been criticized [19] since it may in practice involve the need to converge widely disparate MIDs derived using different estimation methods and diverse anchors, which often represent very different stakeholder perspectives. An example of a MID triangulated from a wide range of MIDs is the TOI-F [28] where a MID range of 4.4-24.6 (percent scores) was triangulated to 5.0. Where MID ranges are smaller, the value and applicability of the triangulated MID may be more immediately apparent. For example, Schünemann et al. [52] reported a MID range for the CRQ of 6.7-8.5 (percent scores), derived from patient anchors, a systematic review and distribution-based methods, which was triangulated to a MID of 6.7.
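To compare MIDs across scales with different score ranges, values can be expressed as a percentage of the scale range. The paper does not state the exact conversion it used, so the following Python sketch is an assumption: it divides the MID by the width of the score range (maximum minus minimum). The function name is hypothetical. Under this assumption the TOI-F estimates of 4.8 and 26.6 on the 0-108 scale correspond to the percent scores of about 4.4 and 24.6 quoted above; for scales that do not start at zero, the convention used in the reviewed tables may differ.

```python
def mid_as_percent(mid: float, score_min: float, score_max: float) -> float:
    """Express a MID as a percentage of the scale's score range.

    Assumed conversion (not stated in the paper): 100 * MID / (max - min).
    """
    score_range = score_max - score_min
    if score_range <= 0:
        raise ValueError("score_max must be greater than score_min")
    return 100.0 * mid / score_range


# TOI-F, score 0-108: lowest and highest MID estimates reported in the review.
toi_f_low = round(mid_as_percent(4.8, 0, 108), 1)
toi_f_high = round(mid_as_percent(26.6, 0, 108), 1)
```

Putting MIDs on a common percent scale makes the width of a range such as the TOI-F's easier to compare with narrower ranges reported for other instruments.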
A second factor known to influence variation in MID values is the patient population in which the MID is determined. Variation by diagnostic group is exemplified by comparing MIDs from two studies, each using the same estimation method (7-step global rating scale) and study design (cross-sectional) but different diagnostic groups [33,46]. One of the studies [33] determined MIDs for seven different fatigue PROMs in patients with SLE and the other [46] did the same in patients with RA. Comparison of the global MIDs for the SLE and RA patients, shown in Table 3, reveals consistently smaller MIDs for SLE versus RA across all seven PROMs. It is noteworthy that most PROMs had MIDs that were determined in only one patient population and the relevance of these MIDs for use in other patient groups thus remains unclear.
A third factor influencing variation in MID values is the context within which the MID is determined. Context issues concern, for example, characteristics of the patient population, such as baseline state [65], disease severity [66], and direction of change [13,20], as well as study design and intervention. For example, patients with baseline scores indicating more severe fatigue may value magnitudes of change in fatigue differently than those with less severe fatigue. Corroborating previous research findings, MIDs for improvement differed from those for deterioration in all identified studies. MIDs tended to be larger for deterioration than improvement, except in the EORTC QLQ-C30 and VAS Fatigue item. MIDs for improvement were consistently smaller than global MIDs.
A strength of this study is that reported MIDs for fatigue scales or subscales were systematically compiled and described. Assessment for inclusion or exclusion and data extraction from included studies were done independently by two authors (ÅN and AD). A limitation is that the search period was restricted to studies from 2000 onwards and the search strings for the many variations on MID were also limited; some studies reporting MIDs for fatigue scales may therefore not have been captured in the literature searches. Another limitation is that the descriptions of the study designs and results had to be summarized and simplified in tables, so information may have been lost. Therefore, when evaluating MIDs the original study or studies should be consulted.

Conclusions
MIDs vary substantially by estimation method, patient population and context both across and within fatigue PROMs. In light of this variation, published MIDs should be applied judiciously, after carefully considering their applicability to characteristics of the study in question. The information provided in this paper may serve to aid researchers and clinicians in making informed decisions regarding the appropriateness of published MIDs for their particular study and patients.