Table 2 Characteristics of included tools

From: Identification of tools used to assess the external validity of randomized controlled trials in reviews: a systematic review of measurement properties

| Dimension and/or tool | Authors | Construct(s), as described by the authors | Target population | Domains, no. of items | Response options | Development and validation |
|---|---|---|---|---|---|---|
| “Applicability” dimension of LEGEND | Clark et al. [77] | Applicability of results to treating patients | P1: RCTs and CCTs; P2: reviewers and clinicians | 3 items | 3-point scale | Deductive and inductive item generation. Tool was pilot-tested among an interprofessional group of clinicians. |
| “Applicability” dimension of Carr's evidence-grading scheme | Carr et al. [63] | Generalizability of study population | P1: clinical trials; P2: authors of SRs | 1 item | 3-point classification scale | No specific information on tool development. |
| Bornhöft's checklist | Bornhöft et al. [78] | External validity (EV) and model validity (MV) of clinical trials | P1: clinical trials; P2: authors of SRs | 4 domains with 26 items each for EV and MV | 4-point scale | Comprehensive deductive item generation from the literature. Pilot tests were performed, but not for the whole scales. |
| Clegg's external validity assessment | Clegg et al. [64] | Generalizability of clinical trials to England and Wales | P1: clinical trials; P2: authors of SRs and HTAs | 5 items | 3-point scale | No specific information on tool development. |
| Clinical applicability | Haraldsson et al. [66] | Report quality and applicability of intervention, study population and outcomes | P1: RCTs; P2: reviewers | 6 items | 3-point and 4-point scales | No specific information on tool development. |
| Clinical Relevance Instrument | Cho & Bero [79] | Ethics and generalizability of outcomes, subjects, treatment and side effects | P1: clinical trials; P2: reviewers | 7 items | 3-point scale | Pilot-tested on 10 drug studies; content validity confirmed by 7 reviewers with research experience. Interrater reliability: ICC = 0.56 (n = 127) [80] |
| “Clinical Relevance” according to the CCBRG | Van Tulder et al. [81] | Applicability of patients, interventions and outcomes | P1: RCTs; P2: authors of SRs | 5 items | 3-point scale (Staal et al., 2008) | Deductive item generation for clinical relevance; results were discussed in a workshop. After two rounds, a final draft was circulated for comments among editors of the CCBRG. |
| Clinical Relevance Score | Karjalainen et al. [68] | Report quality and applicability of results | P1: RCTs; P2: reviewers | 3 items | 3-point scale | No specific information on tool development. |
| Estrada's applicability assessment criteria | Estrada et al. [82] | Applicability of population, intervention, implementation and environmental context to Latin America | P1: RCTs; P2: reviewers | 5 domains with 8 items | 3-point scale for each domain | Deductive item generation from the review by Munthe-Kaas et al. [17]. Factors and items were adapted and pilot-tested by the review team (n = 4) until consensus was reached. |
| EVAT (External Validity Assessment Tool) | Khorsan & Crawford [83] | External validity of participants, intervention, and setting | P1: RCTs and non-randomized studies; P2: reviewers | 3 items | 3-point scale | Deductive item generation, based on the GAP checklist [76] and the Downs & Black checklist [22]. Feasibility was tested and a rulebook was developed but not published. |
| “External validity” dimension of the Downs & Black checklist | Downs & Black [22] | Representativeness of study participants, treatments and settings to source population or setting | P1: RCTs and non-randomised studies; P2: reviewers | 3 items | 3-point scale | Deductive item generation, pilot test and content validation of pilot version. Final version tested for internal consistency: KR-20 = 0.54 (n = 20); test-retest reliability: k = −0.05 to 0.48 and 10–15% disagreement (measurement error) (n = 20) [22]; interrater reliability: k = −0.08 to 0.00 and 5–20% disagreement (measurement error) (n = 20) [22]; ICC = 0.76 (n = 20) [84] |
| “External validity” dimension of Foy's quality checklist | Foy et al. [65] | External validity of patients, settings, intervention and outcomes | P1: intervention studies; P2: reviewers | 6 items | Not clearly described | Deductive item generation. No further information on tool development. |
| “External validity” dimension of Liberati's quality assessment criteria | Liberati et al. [69] | Report quality and generalizability | P1: RCTs; P2: reviewers | 9 items | Dichotomous and 3-point scale | Modified version of a previously developed checklist [85] with additional inductive item generation. No further information on tool development. |
| “External validity” dimension of Sorg's checklist | Sorg et al. [71] | External validity of population, interventions, and endpoints | P1: RCTs; P2: reviewers | 4 domains with 11 items | Not clearly described | Developed based on Bornhöft et al. [78]. No further information on tool development. |
| “External validity” criteria of the USPSTF | USPSTF procedure manual [73] | Generalizability of study population, setting and providers for US primary care | P1: clinical studies; P2: USPSTF reviewers | 3 items | Sum-score rating: 3-point scale | Developed for USPSTF reviews; no specific information on tool development. Interrater reliability: ICC = 0.84 (n = 20) [84] |
| FAME (Feasibility, Appropriateness, Meaningfulness and Effectiveness) scale | Averis et al. [70] | Grading of recommendation for applicability and ethics of intervention | P1: intervention studies; P2: reviewers | 4 items | 5-point scale | FAME framework created by a national group of nursing research experts. Deductive and inductive item generation. No further information on tool development. |
| GAP (Generalizability, Applicability and Predictability) checklist | Fernandez-Hermida et al. [76] | External validity of population, setting, intervention and endpoints | P1: RCTs; P2: reviewers | 3 items | 3-point scale | No specific information on tool development. |
| Gartlehner's tool | Gartlehner et al. [86] | To distinguish between effectiveness and efficacy trials | P1: RCTs; P2: reviewers | 7 items | Dichotomous | Deductive and inductive item generation. Criterion validity tested against studies selected by 12 experts as gold standard: specificity = 0.83, sensitivity = 0.72 (n = 24); measurement error: 78.3% agreement (n = 24); interrater reliability: k = 0.42 (n = 24) [86]; k = 0.11–0.81 (n = 151) [87] |
| Green & Glasgow's external validity quality rating criteria | Green & Glasgow [88] | Report quality for generalizability | P1: trials (not explicitly described); P2: reviewers | 4 domains with 16 items | Dichotomous | Deductive item generation, mainly based on the RE-AIM framework [89]. Interrater reliability: ICC = 0.86 (n = 14) [90]; discriminative validity: TREND studies report on 77% and non-TREND studies on 54% of scale items (n = 14) [90]; ratings across included studies (n = 31) [91], no hypothesis was defined |
| “Indirectness” dimension of the GRADE handbook | Schünemann et al. [92] | Differences of population, interventions, and outcome measures to research question | P1: intervention studies; P2: authors of SRs, clinical guidelines and HTAs | 4 items | Overall: 3-point scale (downgrading options) | Deductive and inductive item generation; pilot-testing with 17 reviewers (n = 12) [48]. Interrater reliability: ICC = 0.00–0.13 (n > 100) [93] |
| Loyka's external validity framework | Loyka et al. [75] | Report quality for generalizability of research in psychological science | P1: intervention studies; P2: researchers | 4 domains with 15 items | Dichotomous | Deductive item generation (including Green & Glasgow [88]) and adaptation for psychological science. No further information on tool development. Measurement error: 60–100% agreement (n = 143) |
| Modified “Indirectness” of the checklist for GRADE | Meader et al. [94] | Differences of population, interventions, and outcome measures to research question | P1: meta-analyses of RCTs; P2: authors of SRs, clinical guidelines and HTAs | 5 items | Item level: 2- and 3-point scales; overall: 3-point scale (grading options) | Developed based on the GRADE method; two-phase pilot tests. Interrater reliability: kappa was poor to almost perfect at item level [94]; k = 0.69 for overall rating of indirectness (n = 29) [95] |
| External validity checklist of the NHMRC handbook | NHMRC handbook [74] | External validity of an economic study | P1: clinical studies; P2: clinical guideline developers, reviewers | 6 items | 3-point scale | No specific information on tool development. |
| Revised GATE in NICE manual (2012) | NICE manual [72] | Generalizability of population, interventions and outcomes | P1: intervention studies; P2: reviewers | 2 domains with 4 items | 3-point and 5-point scales | Based on Jackson et al. [96]. No specific information on tool development. |
| RITES (Rating of Included Trials on the Efficacy-Effectiveness Spectrum) | Wieland et al. [47] | To characterize RCTs on an efficacy-effectiveness continuum | P1: RCTs; P2: reviewers | 4 items | 5-point Likert scale | Deductive and inductive item generation; modified Delphi procedure with 69–72 experts; pilot-testing in 4 Cochrane reviews; content validation with Delphi procedure and core expert group (n = 14) [47]. Interrater reliability: ICC = 0.54–1.0 (n = 22) [97]; convergent validity with PRECIS-2 tool: r = 0.55 (n = 59) [97] |
| Section A (selection bias) of the EPHPP (Effective Public Health Practice Project) tool | Thomas et al. [98] | Representativeness of population and participation rate | P1: clinical trials; P2: reviewers | 2 items | Item level: 4-point and 5-point scales; overall: 3-point scale | Deductive item generation, pilot tests, content validation by 6 experts. Convergent validity with the Guide to Community Preventive Services (GCPS) instrument: 52.5–87.5% agreement (n = 70) [98]; test-retest reliability: k = 0.61–0.74 (n = 70) [98]; k = 0.60 (n = 20) [99] |
| Section D of the CASP checklist for RCTs | CASP programme [100] | Applicability to local population and outcomes | P1: RCTs; P2: participants of workshops, reviewers | 2 items | 3-point scale | Deductive item generation; development and pilot tests with a group of experts. |
| “Whole systems research considerations” checklist | Hawk et al. [67] | Applicability of results to usual practice | P1: RCTs; P2: reviewers (developed for review) | 7 domains with 13 items | Item level: dichotomous; overall: 3-point scale | Deductive item generation. No specific information on tool development. |

Abbreviations: CASP Critical Appraisal Skills Programme, CCBRG Cochrane Collaboration Back Review Group, CCT controlled clinical trial, GATE Graphical Appraisal Tool for Epidemiological Studies, GRADE Grading of Recommendations Assessment, Development and Evaluation, HTA Health Technology Assessment, ICC intraclass correlation, LEGEND Let Evidence Guide Every New Decision, NICE National Institute for Health and Care Excellence, PRECIS PRagmatic Explanatory Continuum Indicator Summary, RCT randomized controlled trial, TREND Transparent Reporting of Evaluations with Nonrandomized Designs, USPSTF U.S. Preventive Services Task Force