BMC Medical Research Methodology

Background: The purpose of this study was to evaluate the role of study quality assessment of primary studies in cancer practice guidelines.


Background
Quality assessment of trials included in systematic reviews of evidence is a resource intensive and scientifically controversial endeavour. On the one hand, the routine use of quality assessment in the development of systematic reviews is encouraged by the Evidence-based Practice Center Program of the Agency for Healthcare Research & Quality (AHRQ) and the Cochrane Collaboration, two well respected groups that coordinate a substantial number of systematic reviews [1][2][3]. Indeed, West et al released a 2002 evidence report sponsored by the AHRQ comparing and contrasting various systems to rate the strength and quality of research evidence to assist in these activities [4]. Furthermore, many journal editors consider it important to include an assessment of study quality in reports featuring meta-analyses [3].
The concept of incorporating study quality assessment into systematic review methodology has also found empirical support. There is evidence that studies of lower methodological quality tend to report larger treatment effects than high quality studies [5][6][7]. For example, Moher and his colleagues found a 34% greater estimate of treatment effect for low quality versus high quality trials and a 37% greater estimate of treatment effect for inadequately concealed versus adequately concealed trials associated with reviews addressing a variety of clinical conditions [5]. Similar bias was found by Schulz and his colleagues [6] in their analysis of trials included in the Cochrane Collaboration's Childbirth and Pregnancy reviews. In addition, Colditz and colleagues found that nonrandomized and open studies were more likely to produce positive treatment effects than randomized and double-blinded studies [7].
Although this seminal work yields compelling results, these findings are not universal and the issue is not without detractors [8][9][10][11][12][13][14]. Some studies have found no reliable relationship between quality score and effect size [10][11][12] and another has found that low study quality was associated with diminished effect sizes [13]. Further, Juni et al [14] found the relationship between study quality and effect size depended on the scale used in the assessment.
Together these results suggest the study quality issue is controversial and that the merits of this methodological step in systematic review require thoughtful analysis. Indeed, West et al conclude with recommendations advocating for research dedicated to comparing quality rating systems and the role of quality assessment within individual clinical contexts and for studies targeted at determining specific quality factors that make a difference in final quality scores [4].
The Practice Guidelines Initiative of Cancer Care Ontario's Program in Evidence-based Care (PEBC) uses the Guidelines Development Cycle to create cancer practice guidelines comprising a systematic review of the research literature, an interpretation of and consensus on the evidence by members of the guideline development team, clinical recommendations informed by the evidence, and an external review process by Ontario clinicians [15][16][17][18][19]. We face the challenge of balancing scientific rigour and the timely production of guideline documents in an environment defined by limited financial and human resources. Hence, we try to approach our methodological decisions with a critical scientific and practical eye. We took note of the growing controversy in the study quality assessment literature and conducted the evaluation reported below to determine the benefits of assessing the quality of each study included in our systematic reviews. Our overall objective was to decide whether to augment our current practice of simply describing study characteristics by incorporating study quality assessment as a routine formal component of the guideline development methodology. The evaluation was conducted in three steps, each designed to address one of three specific issues:
1. What valid and reliable quality assessment instrument would be most appropriate for our context?
2. How is study quality currently being used in published systematic reviews of cancer trials and what is the relationship between effect size and study quality in this disease area?
3. What impact would study quality assessment have on the clinical recommendations made in evidence-based practice guidelines developed by the PEBC?

Search for a valid and reliable quality assessment tool for the PEBC context
For a comprehensive review of the strengths and weaknesses of quality assessment instruments, readers are referred to the 2002 West et al. evidence report commissioned by the AHRQ [4]. For our study, which began before the release of this report, we used components of Moher et al's definition of study quality that are related to internal validity (i.e., design, conduct and analysis) [20] and updated his 1992 published reports of check lists and scales used to measure the phenomenon [21]. We searched the Medline database using the following search strategy: (quality adj rat:).tw OR (quality adj assess:).tw. OR (quality adj scale:).tw. OR (quality adj checklist:).tw. AND randomized controlled trials.sh. OR clinical trial:.tw. OR random:.tw. Reference lists of reviews were scanned for additional citations.

Systematic review of the oncology literature on study quality
To locate systematic reviews on oncology topics, the strategy suggested by Moher et al for finding systematic reviews [3] was combined with the terms "neoplasms.sh. OR cancer.tw. OR carcinoma.tw" to search the Medline, CINAHL, and Cancerlit databases. To ascertain if the authors assessed the quality of the studies included in the systematic reviews, the search was narrowed to include the terms [(quality adj rat:).tw OR (quality adj assess:).tw. OR (quality adj scale:).tw. OR (quality adj checklist:).tw. OR (study adj quality).tw.]. Textwords were used to search the Cochrane Library for systematic reviews on oncology topics. Systematic reviews that included analyses exploring the relationship between study quality (using any assessment instrument, not just validated tools that met our criteria as described above) and effect size were examined.
Because survival following cancer treatment is commonly used as the primary outcome variable in our practice guidelines, this variable was selected as the primary outcome measure of interest.

Impact of study quality on PEBC practice guidelines
The validated scales were applied by two methodologists (MJ and MC) to studies reported in any of our practice guidelines that included a pooled analysis based on at least ten randomized trials related to the main guideline question. Intraclass correlation coefficients with 95% confidence intervals (CI) were calculated using a random sample of the RCTs to assess inter-rater reliability, with one coefficient calculated for each of the scales used. Because of budgetary limitations for staff time, the analysis was conducted on 18 randomly selected studies rather than on the whole group of articles. This is a methodological limitation, as fewer studies result in wider confidence intervals and less precise estimates. This may account for the difference in reliability ratings we found for the Sindhu scale compared to published norms (see Table 1).
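As an illustration of the inter-rater reliability step, the following is a minimal sketch of one common form of the intraclass correlation coefficient. The text does not state which ICC form was used, so the choice of ICC(2,1) (two-way random effects, absolute agreement, single rater) is an assumption, as are the function name `icc_2_1` and the example ratings. The 95% confidence interval mentioned above is not computed here.

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: list of rows, one row per study, one column per rater.
    Assumption: the ICC form is illustrative; the source does not name one.
    """
    n = len(ratings)       # number of studies rated
    k = len(ratings[0])    # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares from a two-way ANOVA without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)              # between-study mean square
    ms_cols = ss_cols / (k - 1)              # between-rater mean square
    ms_err = ss_err / ((n - 1) * (k - 1))    # residual mean square

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


# Hypothetical quality scores from two raters for five trials.
example = [[3, 4], [5, 5], [2, 3], [6, 6], [4, 5]]
print(round(icc_2_1(example), 3))
```

With only 18 studies rated, an estimate of this kind would be expected to carry a wide confidence interval, which is the limitation noted above.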
To assess the impact of study quality on effect sizes, sensitivity analyses were conducted for the meta-analysis from each guideline report. For each scale, studies were divided into two groups (low quality and high quality) based on total quality score. Where the scale developer suggested a cut-off point for low versus high quality, this was used. Where no cut point was specified, the observed median study quality score was used as the dividing point between low and high quality. Meta-analyses were repeated with the high quality studies. Because there would never be a situation in which guideline developers would consider low quality studies only, a meta-analysis using this sample of the studies was not conducted.
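The sensitivity analysis described above can be sketched as follows. This is an illustration under stated assumptions, not the method used for the guideline reports: the pooling model (fixed-effect, inverse-variance on the log odds ratio), the rule that a study is high quality when its score is strictly above the cut-off, and the names `pooled_odds_ratio` and `sensitivity_analysis` are all assumptions introduced here. The sketch also assumes no trial has a zero cell.

```python
import math
from statistics import median


def pooled_odds_ratio(trials):
    """Fixed-effect inverse-variance pooled odds ratio with a 95% CI.

    trials: list of (deaths_exp, n_exp, deaths_ctl, n_ctl) tuples.
    Assumption: no zero cells, so no continuity correction is applied.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for a, n_exp, c, n_ctl in trials:
        b, d = n_exp - a, n_ctl - c          # survivors in each group
        log_or = math.log((a * d) / (b * c))
        variance = 1 / a + 1 / b + 1 / c + 1 / d
        weighted_sum += log_or / variance
        total_weight += 1 / variance
    pooled = weighted_sum / total_weight
    se = math.sqrt(1 / total_weight)
    return (
        math.exp(pooled),
        math.exp(pooled - 1.96 * se),
        math.exp(pooled + 1.96 * se),
    )


def sensitivity_analysis(trials, scores, cutoff=None):
    """Pool all trials, then only those classed as high quality.

    If no cut-off is supplied, the observed median score is used.
    Assumption: 'high quality' means a score strictly above the cut-off.
    Returns (all_trials_estimate, high_quality_estimate or None).
    """
    threshold = median(scores) if cutoff is None else cutoff
    high = [t for t, s in zip(trials, scores) if s > threshold]
    high_estimate = pooled_odds_ratio(high) if high else None
    return pooled_odds_ratio(trials), high_estimate


# Hypothetical trial data and quality scores, for illustration only.
trials = [(10, 100, 20, 100), (15, 100, 15, 100), (30, 200, 40, 200)]
scores = [5, 2, 4]
all_est, high_est = sensitivity_analysis(trials, scores, cutoff=3)
print(all_est, high_est)
```

Returning `None` when no study clears the threshold covers the case in which a scale classifies no trial as high quality, so that no restricted meta-analysis can be run.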

Valid and reliable quality assessment tools for our context
Four scales meeting our criteria were found; two instruments, Jadad et al [22] and Cho & Bero [23], were originally uncovered in the Moher review [21] and two instruments, Sindhu et al [24] and Downs & Black [25], were uncovered in our update of this review. While none of these scales was developed in the oncology setting, they all purport to be generic assessment tools that measure the quality of specific study designs regardless of the clinical condition reflected in the design. The procedures undertaken to create the instruments followed appropriate methodological processes for questionnaire design. In addition, while a number of additional scales and checklists emerged from our search, validity and reliability data were not reported for them. Because our practice guidelines are based primarily on evidence from randomized trials, we decided to employ only the scales that focused specifically on RCTs. As such, the Cho & Bero scale, which is applicable to a range of study designs, was not employed here but will be considered at a later date when we have a portfolio of diverse study designs. The characteristics of the instruments included in our study are summarized in Tables 1 and 2, and detailed descriptions and comparisons can be found in West et al. [4].

The relationship between study quality and effect size in the oncology literature
The literature review located 32 published systematic reviews on oncology-related topics that included some measure of study quality. Five of the reviews examined changes in pooled estimates of effect size of mortality rates when meta-analysis was restricted to high-quality randomized trials [26][27][28][29][30]. As shown in Table 3, four of the five reviews found somewhat larger effects (i.e., larger differences between experimental and control groups) with high-quality trials compared to all trials [26][27][28][29].
With one exception, the statistical significance of the differences between the groups (i.e., significant differences or no significant differences) remained the same regardless of the number of trials included. Specifically, two of the reviews did not detect a statistically significant difference in survival between groups when all studies were included or when the meta-analysis was restricted to high-quality studies [26,29]. For one data set, the meta-analysis was repeated with study quality ratings used as weights [29]; there was still no significant difference between experimental and control groups. Two analyses detected significant differences between experimental and control treatments with analysis of all trials and when the analysis was restricted to high-quality trials [27,28]. In the fifth review, a significant difference between experimental and control interventions was detected when all trials were synthesized, but it became only marginally significant (weighted odds ratio not reported; p = 0.07) when the meta-analysis was weighted for study quality [30].

Impact of study quality on PEBC practice guidelines
Three of the PEBC practice guidelines included at least 10 RCTs in their systematic reviews of the evidence and were eligible for inclusion in this evaluation [31][32][33]: concomitant chemotherapy and radiotherapy in squamous cell head and neck cancer (18 trials) [31]; adjuvant therapy for stage II colon cancer following complete resection (11 trials) [32]; and neoadjuvant chemotherapy in locally advanced squamous cell carcinoma of the head and neck (23 trials) [33]. For the last of these guidelines [33], the data could not be reliably reconstructed, and it is not discussed further.
At the conclusion of our study, we identified a fourth practice guideline that originally did not meet our inclusion criterion of 10 RCTs but did so after it was updated. The guideline focused on the role of erythropoietin (EPO) in the management of cancer patients with non-hematologic malignancies [34]. Unlike the chemotherapy trials included in the practice guidelines described above, which were not placebo-controlled and where the primary outcome was death, one-third of the EPO trials were double blind and all used the need for blood transfusion as the primary outcome. Although we had already identified a preferred scale (see below) by the time this practice guideline emerged as eligible, we chose to include it here and to apply only the preferred scale, as a demonstration of its use on a report whose characteristics differed from those of the chemotherapy topics covered.

Application of quality scales to primary studies informing practice guidelines
While the total quality scores emerging from the different scales all correlated significantly with one another (range r = .35 to r = .73), there was considerable variation in the classification of studies as high quality or low quality as a function of the scale that was applied (Table 4). For example, of the 11 comparisons from 11 trials comprising the stage II colon cancer review, the application of the Jadad 3-item, the Jadad 6-item, the Sindhu, and Downs & Black scales yielded 0, 8, 9, and 6 of these as high quality, respectively. The 6 studies categorized as high quality using the Downs & Black tool were also categorized as high quality when the Jadad 6-item and Sindhu scales were applied. Similarly, the 8 studies categorized as high by the Jadad 6-item were also categorized as high quality by the Sindhu scale.
The 20 comparisons from the 18 trials included in the head and neck concomitant therapy systematic review yielded 2, 14, 12 and 14 high quality studies, respectively, when the Jadad 3-item, the Jadad 6-item, the Sindhu and the Downs & Black scales were used. Although the Jadad 6-item and Downs & Black scales both assessed 14 comparisons to be from high quality studies, only 11 of these 14 were the same. Of the 12 comparisons from studies categorized as high quality with the Sindhu scale, 10 were also rated high quality by both the Jadad 6-item and the Downs & Black scales; the other 2 were rated as high quality by the Jadad 6-item scale only. There was 1 comparison from a study rated as high quality by the Jadad 6-item scale only and 2 from studies rated as high quality by the Downs & Black scale only.
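The kind of cross-scale comparison reported above can be tabulated with simple set operations. The sketch below is illustrative only: the scale labels and study identifiers are hypothetical and do not reproduce the classifications in Table 4, and the helper names `pairwise_overlap` and `unique_to_scale` are introduced here for the example.

```python
from itertools import combinations


def pairwise_overlap(high_by_scale):
    """Map each pair of scales to the number of studies both class as high quality."""
    return {
        (a, b): len(high_by_scale[a] & high_by_scale[b])
        for a, b in combinations(sorted(high_by_scale), 2)
    }


def unique_to_scale(high_by_scale, scale):
    """Studies classed as high quality by `scale` and by no other scale."""
    others = set().union(
        *(ids for name, ids in high_by_scale.items() if name != scale)
    )
    return high_by_scale[scale] - others


# Hypothetical study identifiers rated high quality by each scale.
high_by_scale = {
    "scale_A": {1, 2, 3, 5},
    "scale_B": {2, 3, 4, 5},
    "scale_C": {3, 5},
}
print(pairwise_overlap(high_by_scale))
print(unique_to_scale(high_by_scale, "scale_A"))
```

Tabulating overlaps in this way shows whether two scales that return the same number of high quality studies are in fact selecting the same studies.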

Impact on pooled estimates of outcome measures
Mortality data (i.e., numbers of deaths and numbers of patients randomized for each allocation group, abstracted from published trial reports) used for the meta-analyses included in the guideline reports were available for two guidelines, and data on the need for blood transfusion were available for the third [31,32,34]. For each guideline, the pooled odds ratio based on only the high-quality trials was compared with the odds ratio from the meta-analysis of all trials that had originally been included in the review (Table 4). For the first guideline [31], there was a significant survival benefit for concomitant chemotherapy and radiotherapy compared with radiotherapy alone for squamous cell head and neck cancer in the meta-analysis that included all studies and in the meta-analyses restricted to high-quality studies, regardless of the quality appraisal tool used. Although the effect size was larger for the meta-analysis of high-quality RCTs than for all RCTs (irrespective of the quality scale used), the confidence intervals of the two calculations overlapped, and the overall conclusions and the recommendations informed by the meta-analysis would have been the same. For the second guideline [32], no survival benefit was detected for adjuvant chemotherapy compared to standard therapy for stage II colon cancer in the meta-analysis of all the studies or of the high-quality studies, again regardless of the quality appraisal tool used. Although the meta-analysis of the high-quality studies was associated with smaller effect sizes than the calculation including all of the studies, the confidence intervals overlapped and the conclusions and recommendations would have remained the same.
Only the 6-item Jadad scale was applied to the studies of the EPO guideline and the data were pooled to calculate an overall risk ratio for blood transfusion [34]. The risk ratio for all 15 trials was 0.57 (95% CI, 0.47 to 0.70); for nine trials that scored more than three out of eight on the 6-item Jadad scale, the risk ratio was also 0.57 (95% CI, 0.44 to 0.72) (see Table 4).

Conclusions
Several conclusions can be drawn from this study and review of the literature. First, there are established methods for assessing the quality of randomized controlled trials for which data on adequate reliability and validity are available. West et al uncovered 32 scales, checklists and component systems concerned with evaluating RCTs [4], considerably more than the four instruments we applied here. Although most (87%) of the instruments found by West included quality domains for which there is an empirical basis, most failed to report the use of rigorous methods in their development and most failed to report data regarding reliability and validity, criteria we set for our study. Interestingly, West et al did not include the Jadad 6-item in their analysis [4], although the Jadad 3-item, Downs & Black, and Sindhu tools were reported.
Although all of the scales we used have established reliability and validity estimates, we found that the number of trials categorized as high quality or low quality depended specifically on the scale that was applied. For the head and neck cancer systematic review, the number of comparisons from high quality studies ranged from 2 (when the Jadad 3-item scale was applied) to 14 (when the Jadad 6-item or Downs & Black scales were applied). The range for the colon cancer review was 0 to 9. There was also considerable variability regarding the specific quality category in which each trial was placed. These findings are consistent with those of Juni et al [14] and suggest caution should be applied if the intent of quality rating scales is to restrict the number of studies considered in the systematic review; clearly the choice of scale will have a significant impact on which studies are eligible. The problem of identifying in which quality category, high or low, studies should be placed is exacerbated by the lack of clear cut-off criteria identified by the instrument developers. This poses a significant methodological limitation to the utility of these instruments. In our study, we chose the median score as the cut-off criterion in situations where none was reported. However, it would be useful for researchers of these tools to continue the development work to create the evidence base from which valid criteria can be established.
The lack of consistency of study classification from one scale to the next and the lack of clear cut-off criteria for users to employ when measuring the quality of studies present a challenge to guideline developers when they need to make a choice about which instrument they ought to adopt, if they choose to adopt an instrument at all. Rather than clear evidence driving our decisions, we considered other features of the instrument in our decision making.
Of the rating scales we examined, our preferred choice would be the Jadad 6-item instrument. In contrast to the others considered in this report, this instrument is relatively easy to implement and interpret and good interrater reliability was established. Further, although the 3-item version of the Jadad scale is most commonly used, we found the original 6-item version to be more relevant in our clinical context as it provides greater variation in scores. In the cancer discipline, few trials are placebo-controlled and treatment allocation tends to be poorly reported. In contrast to the pain trials which were profiled in the development phase of the Jadad instrument, the majority of the items in the 3-item version (randomization and blinding items) yield no variation in scores in our context and are, therefore, not useful to discriminate among cancer trials. The 6-item version of the scale more aptly differentiates quality across studies and includes more quality domains for which an empirical basis has been established [4].
Another conclusion that can be drawn from this study is that effect size can be related to study quality but that the nature of the relationship in one clinical area may not generalize to another clinical area. Some of the original work examining the role of study quality reinforces the need to be mindful of the variation among studies included in systematic reviews [5][6][7]. However, when we examined five published reviews that had conducted sensitivity analyses on pooled mortality data from RCTs, four of these found that larger effect sizes were associated with high-quality studies, not with lower quality trials as has conventionally been reported, and the absence or presence of statistical differences between the two allocation groups remained constant. One of the challenges in examining this work is that the number of high quality studies is limited; there is a reduction in power that subjects the point estimates to bias. Nonetheless, the potential bias of study design and quality requires thoughtful consideration within a given clinical field.
We conducted sensitivity analyses on the systematic reviews comprising the guidelines developed by the PEBC. Only four systematic reviews among 36 eligible practice guidelines included at least 10 trials with data appropriate for pooling, and we could extract data from only three of these. Although there was some variation in the odds ratios observed, the confidence intervals of the pooled effects from each of the analyses of high quality trials overlapped with the confidence intervals of the pooled effects from all of the trials. In no case would the conclusions based on these results be affected by restricting the meta-analysis to only high quality studies; the recommendations remained the same. Had sensitivity analysis based on study quality been conducted prospectively, it is highly unlikely that different conclusions would have been drawn from the systematic review or that different clinical practice guidelines would have been formulated.
Together, these findings lead to our final conclusion that measuring study quality did not translate into altered conclusions from a systematic review in the oncology domain for the outcomes we used here. Thus, at this time we have decided that measuring study quality using a numerical assessment scale for the purposes of sensitivity analysis will not be a routine part of our guideline development program. We will, however, encourage guideline developers to describe the variation among studies and to point out methodologic flaws. In addition, it will be important for us to repeat this study looking at other outcome measures, such as quality of life and adverse effects, as they become more routinely reported in primary cancer research and incorporated into our practice guidelines. Outcomes other than those studied here may be more sensitive to the issues of study quality.
This study highlights a strategy that may be useful for guideline programs in making decisions regarding the methods employed in their guideline development process. It is important that scientific inquiry be maintained in studying the value and role of study quality assessment rather than accepting its role as convention. By exploring it within a specific clinical context, one can identify its most appropriate application.