Dimension and/or tool | Authors | Construct(s), as described by the authors | Target population | Domains, nr. of items | Response options | Development and validation |
---|---|---|---|---|---|---|
“Applicability”-dimension of LEGEND | Clark et al. [77] | Applicability of results to treating patients | P1: RCTs and CCTs P2: reviewers and clinicians | 3 items | 3-point-scale | Deductive and inductive item-generation. Tool was pilot tested among an interprofessional group of clinicians. |
“Applicability”-dimension of Carr's evidence-grading scheme | Carr et al. [63] | Generalizability of study population | P1: clinical trials P2: authors of SRs | 1 item | 3-point-classification-scale | No specific information on tool development. |
Bornhöft's checklist | Bornhöft et al. [78] | External validity (EV) and Model validity (MV) of clinical trials | P1: clinical trials P2: authors of SRs | 4 domains with 26 items for EV and MV each | 4-point-scale | Development with a comprehensive, deductive item-generation from the literature. Pilot-tests were performed, but not for the whole scales. |
Clegg's external validity assessment | Clegg et al. [64] | Generalizability of clinical trials to England and Wales | P1: clinical trials P2: authors of SRs and HTAs | 5 items | 3-point-scale | No specific information on tool development. |
Clinical applicability | Haraldsson et al. [66] | Report quality and applicability of intervention, study population and outcomes | P1: RCTs P2: reviewers | 6 items | 3-point-scale and 4-point-scale | No specific information on tool development. |
Clinical Relevance Instrument | Cho & Bero [79] | Ethics and generalizability of outcomes, subjects, treatment and side effects | P1: clinical trials P2: reviewers | 7 items | 3-point-scale | Tool was pilot tested on 10 drug studies. Content validity was confirmed by 7 reviewers with research experience. - interrater reliability: ICC = 0.56 (n = 127) [80] |
“Clinical Relevance” according to the CCBRG | Van Tulder et al. [81] | Applicability of patients, interventions and outcomes | P1: RCTs P2: authors of SRs | 5 items | 3-point-scale (Staal et al., 2008) | Deductive item-generation for Clinical Relevance. Results were discussed in a workshop. After two rounds, a final draft was circulated for comments among editors of the CCBRG. |
Clinical Relevance Score | Karjalainen et al. [68] | Report quality and applicability of results | P1: RCTs P2: reviewers | 3 items | 3-point-scale | No specific information on tool development. |
Estrada's applicability assessment criteria | Estrada et al. [82] | Applicability of population, intervention, implementation and environmental context to Latin America | P1: RCTs P2: reviewers | 5 domains with 8 items | 3-point-scale for each domain | Deductive item generation from the review by Munthe-Kaas et al. [17]. Factors and items were adapted, and pilot tested by the review team (n = 4) until consensus was reached. |
EVAT (External Validity Assessment Tool) | Khorsan & Crawford [83] | External validity of participants, intervention, and setting | P1: RCTs and non-randomized studies P2: reviewers | 3 items | 3-point-scale | Deductive item-generation. Tool developed based on the GAP-checklist [76] and the Downs and Black-checklist [22]. Feasibility was tested and a rulebook was developed but not published. |
“External validity”-dimension of the Downs & Black-Checklist | Downs & Black [22] | Representativeness of study participants, treatments and settings to source population or setting | P1: RCTs and non-randomised studies P2: reviewers | 3 items | 3-point-scale | Deductive item-generation, pilot test and content validation of pilot version. Final version tested for: - internal consistency: KR-20 = 0.54 (n = 20), - reliability: test-retest: k = -0.05-0.48 and 10–15% disagreement (measurement error) (n = 20), [22] interrater reliability: k = -0.08-0.00 and 5–20% disagreement (measurement error) (n = 20) [22]; ICC = 0.76 (n = 20) [84] |
“External validity”-dimension of Foy's quality checklist | Foy et al. [65] | External validity of patients, settings, intervention and outcomes | P1: intervention studies P2: reviewers | 6 items | not clearly described | Deductive item-generation. No further information on tool development. |
“External validity”-dimension of Liberati's quality assessment criteria | Liberati et al. [69] | Report quality and generalizability | P1: RCTs P2: reviewers | 9 items | dichotomous and 3-point-scale | Tool is a modified version of a previously developed checklist [85] with additional inductive item-generation. No further information on tool development. |
“External validity”-dimension of Sorg's checklist | Sorg et al. [71] | External validity of population, interventions, and endpoints | P1: RCTs P2: reviewers | 4 domains with 11 items | not clearly described | Developed based on Bornhöft et al. [78]. No further information on tool development. |
“External validity”-criteria of the USPSTF | USPSTF Procedure manual [73] | Generalizability of study population, setting and providers for US primary care | P1: clinical studies P2: USPSTF reviewers | 3 items | Sum-score rating: 3-point-scale | Tool developed for USPSTF reviews. No specific information on tool development. - interrater reliability: ICC = 0.84 (n = 20) [84] |
FAME (Feasibility, Appropriateness, Meaningfulness and Effectiveness) scale | Averis et al. [70] | Grading of recommendation for applicability and ethics of intervention | P1: intervention studies P2: reviewers | 4 items | 5-point-scale | The FAME framework was created by a national group of nursing research experts. Deductive and inductive item-generation. No further information on tool development. |
GAP (Generalizability, Applicability and Predictability) checklist | Fernandez-Hermida et al. [76] | External validity of population, setting, intervention and endpoints | P1: RCTs P2: Reviewers | 3 items | 3-point-scale | No specific information on tool development. |
Gartlehner's tool | Gartlehner et al. [86] | To distinguish between effectiveness and efficacy trials | P1: RCTs P2: reviewers | 7 items | Dichotomous | Deductive and inductive item-generation. - criterion validity testing with studies selected by 12 experts as gold standard: specificity = 0.83, sensitivity = 0.72 (n = 24) - measurement error: 78.3% agreement (n = 24) - interrater reliability: k = 0.42 (n = 24) [86]; k = 0.11–0.81 (n = 151) [87] |
Green & Glasgow's external validity quality rating criteria | Green & Glasgow [88] | Report quality for generalizability | P1: trials (not explicitly described) P2: reviewers | 4 domains with 16 items | Dichotomous | Deductive item-generation. Mainly based on the RE-AIM framework [89]. - interrater reliability: ICC = 0.86 (n = 14) [90] - discriminative validity: TREND studies report on 77% and non-TREND studies report on 54% of scale items (n = 14) [90] - ratings across included studies (n = 31) [91], no hypothesis was defined |
“Indirectness”-dimension of the GRADE handbook | Schünemann et al. [92] | Differences of population, interventions, and outcome measures to research question | P1: intervention studies P2: authors of SRs, clinical guidelines and HTAs | 4 items | Overall: 3-point-scale (downgrading options) | Deductive and inductive item-generation, pilot-testing with 17 reviewers (n = 12) [48]. - interrater reliability: ICC = 0.00–0.13 (n > 100) [93] |
Loyka's external validity framework | Loyka et al. [75] | Report quality for generalizability of research in psychological science | P1: intervention studies P2: researchers | 4 domains with 15 items | Dichotomous | Deductive item-generation (including Green & Glasgow [88]) and adaptation for psychological science. No further information on tool development. - measurement error: 60–100% agreement (n = 143) |
Modified “Indirectness” of the Checklist for GRADE | Meader et al. [94] | Differences of population, interventions, and outcome measures to research question. | P1: meta-analysis of RCTs P2: authors of SRs, clinical guidelines and HTAs | 5 items | Item-level: 2-and 3-point-scale Overall: 3-point-scale (grading options) | Developed based on GRADE method, two phase pilot-tests, - interrater reliability: kappa was poor to almost perfect on item-level [94] and k = 0.69 for overall rating of indirectness (n = 29) [95] |
External validity checklist of the NHMRC handbook | NHMRC handbook [74] | External validity of an economic study | P1: clinical studies P2: clinical guideline developers, reviewers | 6 items | 3-point-scale | No specific information on tool development. |
Revised GATE in NICE manual (2012) | NICE manual [72] | Generalizability of population, interventions and outcomes | P1: intervention studies P2: reviewers | 2 domains with 4 items | 3-point-scale and 5-point-scale | Based on Jackson et al. [96]. No specific information on tool development. |
RITES (Rating of Included Trials on the Efficacy-Effectiveness Spectrum) | Wieland et al. [47] | To characterize RCTs on an efficacy-effectiveness continuum | P1: RCTs P2: reviewers | 4 items | 5-point Likert scale | Deductive and inductive item-generation, modified Delphi procedure with 69–72 experts, pilot testing in 4 Cochrane reviews, content validation with Delphi procedure and core expert group (n = 14) [47]. - interrater reliability: ICC = 0.54–1.0 (n = 22) [97] - convergent validity with PRECIS-2 tool: r = 0.55 correlation (n = 59) [97] |
Section A (Selection Bias) of EPHPP (Effective Public Health Practice Project) tool | Thomas et al. [98] | Representativeness of population and participation rate | P1: clinical trials P2: reviewers | 2 items | Item-level: 4-point-scale and 5-point-scale Overall: 3-point-scale | Deductive item-generation, pilot-tests, content validation by 6 experts. - convergent validity with Guide to Community Preventive Services (GCPS) instrument: 52.5–87.5% agreement (n = 70) [98] - test-retest reliability: k = 0.61–0.74 (n = 70) [98]; k = 0.60 (n = 20) [99] |
Section D of the CASP checklist for RCTs | CASP Programme [100] | Applicability to local population and outcomes | P1: RCTs P2: participants of workshops, reviewers | 2 items | 3-point-scale | Deductive item-generation, development and pilot-tests with group of experts. |
Whole Systems research considerations' checklist | Hawk et al. [67] | Applicability of results to usual practice | P1: RCTs P2: reviewers (developed for review) | 7 domains with 13 items | Item-level: dichotomous Overall: 3-point-scale | Deductive item-generation. No specific information on tool development. |