Pitfalls of single-study external validation illustrated with a model predicting functional outcome after aneurysmal subarachnoid hemorrhage
BMC Medical Research Methodology volume 24, Article number: 176 (2024)
Abstract
Background
Prediction models are often externally validated with data from a single study or cohort. However, the interpretation of performance estimates obtained with single-study external validation is not as straightforward as assumed. We aimed to illustrate this by conducting a large number of external validations of a prediction model for functional outcome in subarachnoid hemorrhage (SAH) patients.
Methods
We used data from the Subarachnoid Hemorrhage International Trialists (SAHIT) data repository (n = 11,931, 14 studies) to refit the SAHIT model for predicting a dichotomous functional outcome (favorable versus unfavorable), determined with the (extended) Glasgow Outcome Scale or modified Rankin Scale score at a minimum of three months after discharge. We performed leave-one-cluster-out cross-validation to mimic the process of multiple single-study external validations. Each study represented one cluster. In each of these validations, we assessed discrimination with Harrell’s c-statistic and calibration with calibration plots, intercepts, and slopes. We used random effects meta-analysis to obtain the (reference) mean performance estimates and the between-study heterogeneity (I²-statistic). The influence of case-mix variation on discriminative performance was assessed with the model-based c-statistic, and we fitted a “membership model” to obtain a gross estimate of transportability.
Results
Across 14 single-study external validations, model performance was highly variable. The mean c-statistic was 0.74 (95% CI 0.70 to 0.78; range 0.52 to 0.84; I² = 0.92), the mean intercept was -0.06 (95% CI -0.37 to 0.24; range -1.40 to 0.75; I² = 0.97), and the mean slope was 0.96 (95% CI 0.78 to 1.13; range 0.53 to 1.31; I² = 0.90). The decrease in discriminative performance was attributable to case-mix variation, between-study heterogeneity, or a combination of both. In isolated cases, we observed poor generalizability or transportability of the model.
Conclusions
We demonstrate two potential pitfalls in the interpretation of model performance with single-study external validation: (1) model performance is highly variable and depends on the choice of validation data, and (2) no insight is provided into the generalizability or transportability of the model that is needed to guide local implementation. As such, a single single-study external validation can easily be misinterpreted and lead to a false appreciation of the clinical prediction model. Cross-validation is better equipped to address these pitfalls.
Background
Clinical prediction models are used to predict the probability of a disease or an outcome conditional on a set of patient characteristics. As clinical prediction models are meant to facilitate clinical decision-making, it is paramount that the prognostic estimates are valid and precise. Unreliable risk estimates could give rise to faulty decision-making and thus patient harm [1]. For example, in vascular neurology the PHASES score (Population, Hypertension, Age, Size, Earlier Subarachnoid Hemorrhage, Site) is a clinical prediction model used to predict the 5-year rupture risk in patients with an unruptured intracranial aneurysm [2]. The risk of rupture predicted by the PHASES score is balanced against the risk of intervention to determine optimal management [3]. Misclassification could result in withholding preventive aneurysm treatment from high-risk patients or in unnecessary treatment-related harm in low-risk patients. To investigate the accuracy and precision of the prognostic estimates, validation is performed.
Two types of validation are distinguished: internal validation and external validation. Internal validation assesses the robustness of the model and the degree of overfitting (i.e., modeling random noise within the development data). External validation investigates model performance in independent data that were not involved in model development. Model performance is expressed in terms of discrimination – the ability to distinguish patients likely to experience the outcome of interest from those unlikely to – and calibration – the agreement between predicted and observed risk.
External validation is usually conducted in a study with data from a single center, from a certain period, from a certain geographical area, and consisting of patients selected based on specific criteria. This method is called “single-study external validation”. The model performance obtained through this type of validation is often thought to represent the model performance that can be expected in the population. However, this interpretation is not as straightforward as assumed. It is highly debatable whether (a single) single-study external validation provides true insight into the validity and accuracy of the risk predictions, and there are several pitfalls to consider when interpreting its results [4]. There is an alternative, hybrid, internal–external cross-validation approach available for clustered data that might better address these pitfalls. Clustered data consist of multiple sources differing in multiple dimensions (e.g., geographical areas, studies, periods). In this approach, each cluster is left out once to validate the model, yielding cluster-specific performance estimates.
Because the number of model development and validation studies is rising exponentially, accurate critical appraisal of such studies is becoming increasingly important. Here, we conducted a large number of single-study validations of a prediction model for functional outcome in aneurysmal subarachnoid hemorrhage (aSAH) patients to illustrate potential pitfalls leading to misinterpretation of model performance.
Methods
Study population and model development
We used data from the Subarachnoid Hemorrhage International Trialists (SAHIT) data repository, an individual participant meta-analysis registry (Supplementary Material 1). The SAHIT data repository consisted of eleven randomized controlled trials (RCTs) [5,6,7,8,9,10,11,12,13,14,15,16] and ten prospective observational hospital registries (n = 13,046) [17,18,19,20,21]. Patient enrolment took place on four continents over 30 years (1983–2012). From the SAHIT data repository, we selected studies that determined functional outcome with the (extended) Glasgow Outcome Scale ((e)GOS) or modified Rankin Scale (mRS) with a minimum follow-up of 3 months (n = 11,931). We excluded nine data sources (Supplementary Material 2).
The GOS ranges from one (death) to five (no symptoms); the eGOS is a more detailed nine-level version of the GOS. The GOS was dichotomized into favorable and unfavorable (defined as GOS 4–5 versus GOS 1–3, or eGOS 4–8 versus eGOS 1–3, respectively). If GOS scores were not available, we used the mRS or eGOS scores. The mRS is a 7-level scale ranging from zero (no symptoms) to six (death). We dichotomized the mRS into 0–3 (favorable) and 4–6 (unfavorable). All functional outcome scales are commonly used in stroke research [22, 23]. Conversion algorithms are described in Supplementary Material 3. Missing data were handled with multiple imputation by chained equations, under the assumption that data were missing at random [24].
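To make these steps concrete, below is a minimal R sketch of the outcome dichotomization and imputation described above. The data frame `sahit` and the column names `gos`, `egos`, and `mrs` are hypothetical placeholders; the actual SAHIT preprocessing may differ in detail.

```r
# Dichotomize the outcome scales into unfavorable (1) versus favorable (0),
# preferring GOS, then eGOS, then mRS, as described above (hypothetical names).
library(mice)

sahit$unfav <- with(sahit,
                    ifelse(!is.na(gos),  as.integer(gos  <= 3),   # GOS 1-3 unfavorable
                    ifelse(!is.na(egos), as.integer(egos <= 3),   # eGOS 1-3 unfavorable
                                         as.integer(mrs  >= 4)))) # mRS 4-6 unfavorable

# Multiple imputation by chained equations under a missing-at-random assumption;
# mice's default imputation methods are used here for brevity.
imp <- mice(sahit, m = 5, seed = 2024)
sahit_imp <- complete(imp, action = 1)  # first imputed dataset, for illustration
```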
We refitted the previously published SAHIT (neuroimaging) model using identical data and predictors and similar modeling strategies as in the original development study [25]. The purpose of this study was not to develop a new or better model. In the SAHIT model, age, aneurysm location, aneurysm size, World Federation of Neurological Surgeons (WFNS) grade on admission, premorbid hypertension, and CT Fisher grade were used as predictors of functional outcome [26, 27]. We dichotomized the WFNS grade into I-III and IV-V and categorized aneurysm location into anterior cerebral artery, anterior communicating artery, internal carotid artery, middle cerebral artery, posterior communicating artery, and posterior circulation.
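The refitted model is a standard binary logistic regression. The sketch below shows one way this could look with the `rms` package cited in the Methods, continuing the hypothetical variable names from the previous sketch; the exact codings (e.g., any non-linear terms) in the original study may differ.

```r
# Refit a logistic regression with the six SAHIT predictors (hypothetical
# column names) and obtain the bootstrap optimism-corrected c-statistic.
library(rms)

form <- unfav ~ age + aneurysm_size + wfns_iv_v + fisher_grade +
  hypertension + aneurysm_location

fit <- lrm(form, data = sahit_imp, x = TRUE, y = TRUE)

# Internal validation: optimism-corrected Somers' Dxy via bootstrapping,
# converted to a c-statistic as c = Dxy / 2 + 0.5.
v <- validate(fit, B = 200)
0.5 * (1 + v["Dxy", "index.corrected"])
```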
Model performance
We assessed discrimination with Harrell’s c-statistic and the model-based c-statistic [28]. We compared the Harrell’s c-statistics obtained with single-study external validation and with cross-validation against the internally validated, optimism-corrected Harrell’s c-statistic obtained via bootstrapping (benchmark). Next, to quantify to what extent case-mix variation influences discriminative performance at validation, we evaluated the difference between Harrell’s c-statistic and the model-based c-statistic; this difference is attributable to case-mix variation. We assessed calibration graphically with calibration plots and numerically with the intercept and the slope. To facilitate understanding of the study, we have included a detailed explanation of all performance measures discussed in this study (Supplementary Material 4).
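The model-based c-statistic is the c-statistic that would be expected if the predicted risks were perfectly calibrated in the validation sample, so it reflects the case-mix of that sample rather than the correctness of the coefficients. A minimal sketch of this idea (not the authors' exact implementation, which used the packages cited below) is:

```r
# Model-based c-statistic [28]: expected concordance if the predicted risks p
# were correct. For a pair (i, j), the outcomes are discordant with probability
# p_i(1 - p_j) + p_j(1 - p_i); the subject with the higher prediction is the
# one with the event with probability p_i(1 - p_j) when p_i > p_j.
# Plain O(n^2) implementation for clarity, not speed.
model_based_c <- function(p) {
  num <- 0; den <- 0
  n <- length(p)
  for (i in seq_len(n - 1)) {
    pi <- p[i]; pj <- p[(i + 1):n]
    w    <- pi * (1 - pj) + pj * (1 - pi)            # P(outcomes discordant)
    conc <- ifelse(pi > pj, pi * (1 - pj),
            ifelse(pi < pj, pj * (1 - pi), 0.5 * w)) # ties count half
    num <- num + sum(conc)
    den <- den + sum(w)
  }
  num / den
}

# Harrell's c-statistic in the same sample, for comparison:
# Hmisc::rcorr.cens(p, y)["C Index"]
```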
Validation
We performed two types of validation to assess model performance: leave-one-cluster-out internal–external cross-validation (henceforth cross-validation) and single-study external validation. With cross-validation, each cluster (representing one study or registry) is alternately left out of model development to assess model performance [29]. The split in the data is non-random because existing heterogeneity between clusters (by study, geographical area, period) is utilized. To mimic the process of multiple single-study external validations, we assessed model performance, obtained via cross-validation, individually in each cluster as if each were a predefined external validation cluster. We assessed the same performance metrics with leave-one-cluster-out cross-validation as with single-study external validation.
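A compact sketch of this procedure, continuing the hypothetical names from the sketches above (`sahit_imp` with a `study` column identifying the cluster):

```r
# Leave-one-cluster-out cross-validation: refit on all clusters but one and
# validate on the held-out cluster, collecting discrimination estimates.
loco <- do.call(rbind, lapply(unique(sahit_imp$study), function(s) {
  dev <- subset(sahit_imp, study != s)   # development clusters
  val <- subset(sahit_imp, study == s)   # held-out validation cluster
  fit <- glm(form, family = binomial, data = dev)
  p   <- predict(fit, newdata = val, type = "response")
  cc  <- Hmisc::rcorr.cens(p, val$unfav)
  data.frame(study     = s,
             c_harrell = as.numeric(cc["C Index"]),
             c_se      = as.numeric(cc["S.D."]) / 2,  # se(c) = se(Dxy) / 2
             c_model   = model_based_c(p))
}))
```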
Next, we pooled the performance estimates with random effects meta-analysis to obtain mean performance estimates that serve as a reference value for overall external model performance [30]. We used the I²-statistic to describe the degree of variability in model performance that is attributable to between-cluster heterogeneity. All statistical analyses were performed with R software (version 3.6.3, R Foundation for Statistical Computing) using the rms (version 6.2.0) [31], mice (version 3.13.0) [32], Hmisc (version 4.5.0) [33], metamisc (version 0.2.5) [34], CalibrationCurves (version 0.1.2) [35], metafor (version 3.4.0) [36], and PredictionTools (version 0.1.0) [37] packages. We adhered to the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement (Supplementary Material 5) [38].
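The pooling step could look as follows. C-statistics are commonly meta-analyzed on the logit scale, as sketched here with `metafor`; the authors' analysis used the packages cited above (e.g., metamisc), which wrap this kind of model.

```r
# Random effects meta-analysis of the cluster-specific c-statistics on the
# logit scale, back-transformed to the c-statistic scale.
library(metafor)

yi  <- qlogis(loco$c_harrell)
sei <- loco$c_se / (loco$c_harrell * (1 - loco$c_harrell))  # delta method

ma <- rma(yi = yi, sei = sei, method = "REML")
plogis(c(mean = ma$b, ci.lb = ma$ci.lb, ci.ub = ma$ci.ub))  # pooled mean c, 95% CI
ma$I2  # percentage of variability attributable to between-study heterogeneity
```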
Transportability
We assessed the relatedness of the derivation and validation clusters by fitting a logistic regression “membership model” [29]. This model predicts whether a patient is part of the development (0) or validation (1) cluster and includes all predictors and the outcome. A high c-statistic of the membership model (easy discrimination) means that the development and validation clusters are unrelated; a low c-statistic (difficult discrimination) means that they are closely related. By jointly assessing Harrell’s c-statistic and the c-statistic of the membership model, we obtain a gross estimate of the generalizability and transportability of the model. Generalizability is the reproducibility in a similar population and transportability is the reproducibility in a different but related population. Transportability is important because many models will be applied to populations that differ from the original study population. To illustrate, observing a high membership model c-statistic together with a high Harrell’s c-statistic means that even though there was substantial heterogeneity between clusters, the model still discriminated well.
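A sketch of such a membership model, again with the hypothetical names used above and ISAT as an arbitrary example cluster:

```r
# Membership model: can development and validation patients be told apart
# from the predictors and the outcome?
val_cluster <- "ISAT"
mm_data <- transform(sahit_imp, member = as.integer(study == val_cluster))

mm <- glm(member ~ age + aneurysm_size + wfns_iv_v + fisher_grade +
            hypertension + aneurysm_location + unfav,
          family = binomial, data = mm_data)

# c-statistic of the membership model: close to 0.5 = related samples,
# close to 1 = very different samples.
Hmisc::rcorr.cens(fitted(mm), mm_data$member)["C Index"]
```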
Results
Baseline characteristics
In the overall SAHIT data repository, the mean age was 53 years (SD 13, Table 1), 80% of patients presented in a favorable clinical condition (WFNS grade I-III, n = 8869), and 38% had premorbid hypertension (n = 2884). In 5% of patients, no hemorrhage was detected on the CT scan (Fisher grade 1, n = 493), 21% had a Fisher grade 2 (n = 2050), 39% had a Fisher grade 3 (n = 3762), and 35% had a Fisher grade 4 (n = 3440). Most ruptured aneurysms were smaller than 13 mm (84%, n = 7314) and 88% had a ruptured intracranial aneurysm located in the anterior circulation (n = 9081). Eighty percent of patients had a favorable functional outcome at a minimum of 3-month follow-up (range 60–92%, n = 8537).
Cluster-specific baseline characteristics
We observed large heterogeneity in baseline characteristics between individual clusters (between-study heterogeneity). The proportion of patients presenting in unfavorable clinical condition (WFNS IV-V) ranged from 0% (IHAST) to 78% (HHU), and the proportion of aneurysms larger than 13 mm from 6% (MAPS) to 26% (Tirilazad). The Tirilazad and ISAT clusters included no posterior communicating artery (PCOM) aneurysms; the Tirilazad cluster no anterior communicating artery (ACOM) aneurysms; and the EPO/Statin cluster no anterior cerebral artery aneurysms. Conversely, in the CONSCIOUS-1, EPO/Statin, IHAST, MAPS, MASH1/2, and Utrecht clusters the majority (> 50%) of aneurysms were located at the ACOM and PCOM sites. Both the mean age and the proportion of hypertension were similar across clusters. Favorable functional outcome ranged from 60% (Chicago) to 92% (MAPS).
Discrimination
The predictor effects are described in Supplementary Material 6. The internally validated, optimism-corrected Harrell’s c-statistic was 0.77 (95% CI 0.76–0.79, Table 2). We observed large variability in discriminative performance when evaluating the c-statistics at single-study external validation. Model discrimination ranged from (very) poor, comparable to a coin flip, in the IMASH cluster (0.52, 95% CI 0.45–0.59), to moderate in the Leeds cluster (0.66, 95% CI 0.52–0.79) and the ISAT cluster (0.68, 95% CI 0.65–0.71), to excellent in the CONSCIOUS-1 (0.78, 95% CI 0.72–0.83), D-SAT (0.80, 95% CI 0.75–0.85), and SHOP clusters (0.84, 95% CI 0.82–0.86). The pooled mean c-statistic with cross-validation was 0.75 (95% CI 0.70–0.78). The I²-statistic was 0.92, indicating that the total variability in the c-statistic was to a large extent explained by between-study heterogeneity.
We observed a substantial decrease in discriminative performance in 6 clusters compared with the optimism-corrected c-statistic (benchmark) and the pooled mean c-statistic (reference for overall external performance). Specifically, in the Leeds, IMASH, and Chicago clusters, this decline can be attributed to case-mix variation (Harrell’s 0.66, 0.52, and 0.76 versus model-based 0.76, 0.78, and 0.73, respectively). In the HHU and IHAST clusters, the drop is due to miscalibration (Harrell’s 0.73 and 0.71 versus model-based 0.71 and 0.68, respectively), and in the ISAT cluster it is explained by a combination of case-mix variation and miscalibration (Harrell’s 0.68 and model-based 0.70).
Calibration
Again, we observed large variability in calibration with single-study external validation across clusters. Calibration ranged from poor in the HHU cluster, with an intercept of -1.40 (95% CI -2.08 to -0.71, Fig. 1A-N) and a slope of 1.17 (95% CI 0.16 to 2.17), to excellent in the CONSCIOUS-1, IHAST, ISAT, MASH1/2, and D-SAT clusters. With cross-validation, the pooled mean intercept was -0.06 (95% CI -0.37 to 0.24) and the pooled mean slope was 0.96 (95% CI 0.78 to 1.17). For both the intercept (I² = 0.97) and the slope (I² = 0.90), the I²-statistic indicated that the total variability was largely explained by between-study heterogeneity.
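For a single held-out cluster, the calibration intercept (calibration-in-the-large) and slope can be estimated as sketched below, reusing the hypothetical `form` and `sahit_imp` from the Methods sketches and HHU as the example cluster; `rms::val.prob` additionally draws a calibration plot:

```r
# Calibration in one held-out cluster: intercept with the linear predictor as
# an offset, slope by regressing the observed outcome on the linear predictor.
library(rms)

dev <- subset(sahit_imp, study != "HHU")
val <- subset(sahit_imp, study == "HHU")
fit <- glm(form, family = binomial, data = dev)
lp  <- predict(fit, newdata = val)  # linear predictor in the held-out cluster

coef(glm(unfav ~ offset(lp), family = binomial, data = val))[1]  # intercept
coef(glm(unfav ~ lp,         family = binomial, data = val))[2]  # slope

val.prob(p = plogis(lp), y = val$unfav)  # calibration plot and summary metrics
```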
Transportability
In most clusters, the membership models’ c-statistics were moderately high (IHAST, IMASH, ISAT, MAPS, D-SAT, SHOP, and Utrecht, all between 0.70 and 0.80, Table 3) or high (CONSCIOUS-1, Chicago, EPO/Statin, HHU, and Tirilazad, all above 0.80). Despite this, the discriminative performance remained satisfactory, indicating good transportability of the model. In other words, even in distinctly different study populations the model discriminated well between high-risk and low-risk patients. There was one exception to this generalization: in the Leeds cluster, the derivation and validation samples were similar (membership model c-statistic 0.69), but the discriminative performance was unsatisfactory, indicating poor generalizability of the model (Supplementary Material 7).
Discussion
We used the SAHIT data repository, a large individual participant meta-analysis dataset, to validate a prediction model in a large number of single-study cohorts and thereby illustrate the potential pitfalls in interpreting model performance obtained with single-study external validation. Although single-study external validation is preferable to no validation at all, our analysis clearly illustrates two pitfalls and how these may lead to a false appreciation of the model.
(1) We observed that model performance was highly variable between cohorts. This means that the appreciation of the model is highly dependent on the choice of the validation data. For example, validating the model in the IMASH (0.52), Leeds (0.66), or ISAT cluster (0.68) would indicate poor to moderate discrimination, whereas the reference mean c-statistic (0.74) and most cluster-specific c-statistics suggest otherwise. There is no formal threshold for the c-statistic for implementation in clinical practice. Conversely, the model performance in the SHOP cluster (0.84) was probably more optimistic than can be expected in the population. A similar problem is observed when examining the calibration of the model at external validation: in some clusters, such as HHU, IMASH, Leeds, or SHOP, calibration was (very) poor, but in others it was excellent. Several factors may explain the large variation in performance.
First, case-mix variation is known to influence discriminative performance [39, 40]. The more alike (homogeneous) the patients within a cluster are, the more difficult it is to discriminate between high-risk and low-risk individuals. It is known that ISAT excluded 90% of initially screened patients, leaving patients with similar characteristics, such as a favorable prognosis and aneurysms predominantly located at specific sites [41]. Accordingly, in ISAT we observed a slightly higher model-based c-statistic than Harrell’s c-statistic. Validating the model in the ISAT cohort alone may lead to a false conclusion about the model’s discriminative performance.
Second, between-study heterogeneity can also affect model performance. Case-mix variation refers to the differences between subjects within a population, while between-study heterogeneity refers to differences between study populations. Slight changes in the definition or the measurement of predictors and the outcome can change the size and direction of predictor effects [42]. Additionally, predictors may affect treatment decisions downstream (confounding by indication) and subsequently affect patient outcomes. Because of this, such slight changes can lead to severe miscalibration and poor discrimination of a model.
With cross-validation, between-study heterogeneity can be used to benefit interpretation. The SAHIT registry consists of data from randomized trials with stringently selected (thus homogeneous) study populations that, together, form a very heterogeneous overall study population. Differences in predictor and outcome definition or measurement, and the context of the study period and geographical origin of the SAHIT registry, can be used to explain variability in model performance. The I²-statistics obtained with random effects meta-analysis confirmed that the total variability in model performance was largely explained by between-study heterogeneity.
(2) Without understanding the heterogeneity between the derivation and validation samples and the population, we cannot assess the generalizability (i.e., reproducibility in a similar population) and the transportability (i.e., reproducibility in a different but related population) of the model [43, 44]. For example, the intended subpopulation – in which the model is to be applied – may differ distinctly from the validation sample (e.g., in healthcare setting, measurement of biomarkers or imaging findings, and treatment algorithms). Also, the validation may have been conducted in an independent, but equally selected, sample not representative of the population. In both cases, we do not obtain valid insight into the expected model performance.
Due to overall improvements in diagnostic capabilities, treatment algorithms, and patients’ outcomes, the validity of all models will eventually expire. Thus, even if a model performs well, it will most likely not be globally and eternally applicable [45]. Because of this, most models will need local and continuous updating. Some clinical prediction models will require updating of the intercept and/or the slope, while others may need a full re-estimation of the parameters. Geographical, temporal, or methodological clustering can be utilized to assess model performance across multiple dimensions and inform updating efforts to fit the intended context. In addition, the membership model can be used to obtain a gross estimate of the relatedness of two samples.
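The levels of updating mentioned here can be sketched as follows, with `local_data` as a hypothetical local cohort and `fit` and `form` as in the earlier sketches:

```r
# Local updating in increasing order of complexity.
lp <- predict(fit, newdata = local_data)  # original model's linear predictor

upd_intercept <- glm(unfav ~ offset(lp), family = binomial, data = local_data)
upd_slope     <- glm(unfav ~ lp,         family = binomial, data = local_data)
upd_refit     <- glm(form,               family = binomial, data = local_data)
```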
Strengths and limitations
This study is strengthened by the use of a large individual participant meta-analysis dataset to assess model performance across multiple dimensions. Furthermore, because of the clustered nature of the SAHIT data repository, we were able to investigate between-cluster heterogeneity, generalizability, and transportability.
Our study is limited by the fact that in some clusters predictors were missing completely. We performed multiple imputation on the entire SAHIT data repository instead of on the individual clusters. This means that between-cluster heterogeneity could be diluted and that model performance may be overestimated for individual clusters with completely missing predictor variables. The largest proportion of missingness for the full cohort was for premorbid hypertension (36%); we expect this will not have a large effect on the overall conclusions of this study.
Another limitation of our study was the use of multiple outcome scales and varying time points of assessment. We chose to use the GOS at 3 months as much as possible, but not all studies assessed outcome identically. Patients can improve or worsen between 3 and 12 months, but we hypothesize that, even if these changes are substantial on an ordinal scale, their effect on a dichotomized scale is limited.
Recommendations
Insight into case-mix variation, between-cluster heterogeneity, generalizability, and transportability is required to decide whether a clinical prediction model performs well enough to be considered for implementation. Implementation without these insights could lead to patient harm due to inaccurate medical decision-making and possibly incentivize the development of new clinical prediction models instead of the validation of existing ones, thereby contributing to research waste. Even for a relatively rare disease, an abundance of models predicting functional outcome in aSAH patients already exists [46,47,48,49,50]. Most of these models contain more or less the same set of predictors. The rising availability of clustered data from large international collaborations will open up possibilities for leave-one-cluster-out cross-validation and should be utilized [51].
We advocate using cross-validation instead of single-study external validation, but there are also disadvantages to this approach. First, cross-validation requires a large clustered dataset with sufficient patients per cluster, which is usually obtained through international collaborations and may not always be available. Second, an important feature of external validation is an independent evaluation of a clinical prediction model. Usually, leave-one-cluster-out internal–external cross-validation will be conducted directly after model development and thus not performed independently. To aid transparency, regression formulas, code, and data should be made publicly available.
If such data are not available, a reasonable alternative strategy is to conduct multiple (smaller) single-study external validations, each exploring another dimension. We stress that a single single-study external validation cannot be interpreted as decisive proof of good or poor model performance and that local and continuous validation is usually required. As a minimum, we advise evaluating the selection criteria, recruitment dates, geographical location, and study design of the development and the validation data to obtain a gross estimate of between-cluster heterogeneity.
Conclusions
Two potential pitfalls of single-study external validation have to be considered when interpreting such a validation study. (1) Model performance with single-study external validation can depend heavily on the choice of validation data and can thus lead to a false appreciation of a clinical prediction model. (2) To accurately appreciate generalizability and transportability, it is necessary to investigate the heterogeneity between the derivation and validation data and their representativeness of the intended population. Thus, a single validation is not equipped to draw definitive conclusions about model performance. As an alternative, leave-one-cluster-out internal–external cross-validation enables inspecting model performance across multiple settings with varying temporal, geographical, and methodological dimensions, and can inform more reliably about expected performance and whether local revision is required.
Availability of data and materials
We are not at liberty to make the data supporting this publication available. Patient-level data is available upon reasonable request to the SAHIT data repository. More detailed information about the SAHIT data repository including an ethical statement is available elsewhere [52, 53].
Upon publication, R code will be made publicly available via: https://github.com/WinkelJordi/crossvalidation.
Abbreviations
- (a)SAH: Aneurysmal subarachnoid hemorrhage
- CT: Computed tomography
- GOS: Glasgow Outcome Scale
- mRS: Modified Rankin Scale
- RCT: Randomized controlled trial
- SAHIT: Subarachnoid Hemorrhage International Trialists
- WFNS: World Federation of Neurological Surgeons grade
References
Ramspek CL, Jager KJ, Dekker FW, Zoccali C, van Diepen M. External validation of prognostic models: what, why, how, when and where? Clin Kidney J. 2021;14(1):49–58. https://doi.org/10.1093/ckj/sfaa188.
Greving JP, Wermer MJ, Brown RD Jr, et al. Development of the PHASES score for prediction of risk of rupture of intracranial aneurysms: a pooled analysis of six prospective cohort studies. Lancet Neurol. 2014;13(1):59–66. https://doi.org/10.1016/S1474-4422(13)70263-1.
Algra AM, Greving JP, de Winkel J, et al. Development of the SAFETEA scores for predicting risks of complications of preventive endovascular or microneurosurgical intracranial aneurysm occlusion. Neurology. 2022. https://doi.org/10.1212/WNL.0000000000200978.
Van Calster B, Steyerberg EW, Wynants L, van Smeden M. There is no such thing as a validated prediction model. BMC Med. 2023;21(1):70. https://doi.org/10.1186/s12916-023-02779-w.
Etminan N, Beseoglu K, Eicker SO, Turowski B, Steiger HJ, Hanggi D. Prospective, randomized, open-label phase II trial on concomitant intraventricular fibrinolysis and low-frequency rotation after severe subarachnoid hemorrhage. Stroke. 2013;44(8):2162–8. https://doi.org/10.1161/STROKEAHA.113.001790.
Jang YG, Ilodigwe D, Macdonald RL. Metaanalysis of tirilazad mesylate in patients with aneurysmal subarachnoid hemorrhage. Neurocrit Care. 2009;10(1):141–7. https://doi.org/10.1007/s12028-008-9147-y.
Macdonald RL, Kassell NF, Mayer S, et al. Clazosentan to overcome neurological ischemia and infarction occurring after subarachnoid hemorrhage (CONSCIOUS-1): randomized, double-blind, placebo-controlled phase 2 dose-finding trial. Stroke. 2008;39(11):3015–21. https://doi.org/10.1161/STROKEAHA.108.519942.
McDougall CG, Johnston SC, Gholkar A, et al. Bioactive versus bare platinum coils in the treatment of intracranial aneurysms: the MAPS (Matrix and Platinum Science) trial. AJNR Am J Neuroradiol. 2014;35(5):935–42. https://doi.org/10.3174/ajnr.A3857.
Molyneux A, Kerr R; International Subarachnoid Aneurysm Trial Collaborative Group, et al. International Subarachnoid Aneurysm Trial (ISAT) of neurosurgical clipping versus endovascular coiling in 2143 patients with ruptured intracranial aneurysms: a randomized trial. J Stroke Cerebrovasc Dis. 2002;11(6):304–14. https://doi.org/10.1053/jscd.2002.130390.
Pickard JD, Murray GD, Illingworth R, et al. Effect of oral nimodipine on cerebral infarction and outcome after subarachnoid haemorrhage: British aneurysm nimodipine trial. BMJ. 1989;298(6674):636–42. https://doi.org/10.1136/bmj.298.6674.636.
Suarez JI, Martin RH, Calvillo E, et al. The Albumin in Subarachnoid Hemorrhage (ALISAH) multicenter pilot clinical trial: safety and neurologic outcomes. Stroke. 2012;43(3):683–90. https://doi.org/10.1161/STROKEAHA.111.633958.
Todd MM, Hindman BJ, Clarke WR, Torner JC; Intraoperative Hypothermia for Aneurysm Surgery Trial Investigators. Mild intraoperative hypothermia during surgery for intracranial aneurysm. N Engl J Med. 2005;352(2):135–45. https://doi.org/10.1056/NEJMoa040975.
Tseng MY, Hutchinson PJ, Richards HK, et al. Acute systemic erythropoietin therapy to reduce delayed ischemic deficits following aneurysmal subarachnoid hemorrhage: a phase II randomized, double-blind, placebo-controlled trial. Clinical article. J Neurosurg. 2009;111(1):171–80. https://doi.org/10.3171/2009.3.JNS081332.
van den Bergh WM, Algra A, van Kooten F, et al. Magnesium sulfate in aneurysmal subarachnoid hemorrhage: a randomized controlled trial. Stroke. 2005;36(5):1011–5. https://doi.org/10.1161/01.STR.0000160801.96998.57.
Wong GK, Poon WS, Chan MT, et al. Intravenous magnesium sulphate for aneurysmal subarachnoid hemorrhage (IMASH): a randomized, double-blinded, placebo-controlled, multicenter phase III trial. Stroke. 2010;41(5):921–6. https://doi.org/10.1161/STROKEAHA.109.571125.
Tseng MY, Hutchinson PJ, Turner CL, et al. Biological effects of acute pravastatin treatment in patients after aneurysmal subarachnoid hemorrhage: a double-blind, placebo-controlled trial. J Neurosurg. 2007;107(6):1092–100. https://doi.org/10.3171/JNS-07/12/1092.
Johnston SC, Dowd CF, Higashida RT, et al. Predictors of rehemorrhage after treatment of ruptured intracranial aneurysms: the Cerebral Aneurysm Rerupture After Treatment (CARAT) study. Stroke. 2008;39(1):120–5. https://doi.org/10.1161/STROKEAHA.107.495747.
Schatlo B, Fung C, Fathi AR, et al. Introducing a nationwide registry: the Swiss study on aneurysmal subarachnoid haemorrhage (Swiss SOS). Acta Neurochir (Wien). 2012;154(12):2173–8; discussion 2178. https://doi.org/10.1007/s00701-012-1500-4.
Helbok R, Kurtz P, Vibbert M, et al. Early neurological deterioration after subarachnoid haemorrhage: risk factors and impact on outcome. J Neurol Neurosurg Psychiatry. 2013;84(3):266–70. https://doi.org/10.1136/jnnp-2012-302804.
Smith ML, Abrahams JM, Chandela S, Smith MJ, Hurst RW, Le Roux PD. Subarachnoid hemorrhage on computed tomography scanning and the development of cerebral vasospasm: the Fisher grade revisited. Surg Neurol. 2005;63(3):229–34. discussion 234–5. https://doi.org/10.1016/j.surneu.2004.06.017.
Reilly C, Amidei C, Tolentino J, Jahromi BS, Macdonald RL. Clot volume and clearance rate as independent predictors of vasospasm after aneurysmal subarachnoid hemorrhage. J Neurosurg. 2004;101(2):255–61. https://doi.org/10.3171/jns.2004.101.2.0255.
Jennett B, Bond M. Assessment of outcome after severe brain damage. Lancet. 1975;1(7905):480–4. https://doi.org/10.1016/s0140-6736(75)92830-5.
van Swieten JC, Koudstaal PJ, Visser MC, Schouten HJ, van Gijn J. Interobserver agreement for the assessment of handicap in stroke patients. Stroke. 1988;19(5):604–7. https://doi.org/10.1161/01.str.19.5.604.
Rubin DB. Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons; 1987.
Jaja BNR, Saposnik G, Lingsma HF, et al. Development and validation of outcome prediction models for aneurysmal subarachnoid haemorrhage: the SAHIT multinational cohort study. BMJ. 2018;360:j5745. https://doi.org/10.1136/bmj.j5745.
Report of World Federation of Neurological Surgeons Committee on a Universal Subarachnoid Hemorrhage Grading Scale. J Neurosurg. 1988;68(6):985–6. https://doi.org/10.3171/jns.1988.68.6.0985.
Fisher CM, Kistler JP, Davis JM. Relation of cerebral vasospasm to subarachnoid hemorrhage visualized by computerized tomographic scanning. Neurosurgery. 1980;6(1):1–9. https://doi.org/10.1227/00006123-198001000-00001.
van Klaveren D, Gonen M, Steyerberg EW, Vergouwe Y. A new concordance measure for risk prediction models in external validation settings. Stat Med. 2016;35(23):4136–52. https://doi.org/10.1002/sim.6997.
Debray TP, Vergouwe Y, Koffijberg H, Nieboer D, Steyerberg EW, Moons KG. A new framework to enhance the interpretation of external validation studies of clinical prediction models. J Clin Epidemiol. 2015;68(3):279–89. https://doi.org/10.1016/j.jclinepi.2014.06.018.
van Klaveren D, Steyerberg EW, Perel P, Vergouwe Y. Assessing discriminative ability of risk models in clustered data. BMC Med Res Methodol. 2014;14:5. https://doi.org/10.1186/1471-2288-14-5.
Harrell Jr. FE. rms: Regression Modeling Strategies. R package version 6.3-0. 2022. Available from: https://cran.r-project.org/web/packages/rms/index.html.
van Buuren S, Groothuis-Oudshoorn K. mice: Multivariate Imputation by Chained Equations in R. J Stat Softw. 2011;45(3):1–67.
Harrell Jr. FE. Hmisc: Harrell Miscellaneous. R package version 4.7-0. 2022. Available from: https://cran.r-project.org/web/packages/Hmisc/index.html.
Debray T, de Jong V. metamisc: Meta-Analysis of Diagnosis and Prognosis Research Studies. R package version 0.2.4. 2021. Available from: https://cran.r-project.org/web/packages/metamisc/index.html.
De Cock B, Nieboer D, Van Calster B, Steyerberg EW, Vergouwe Y. CalibrationCurves: Calibration performance. R package version 0.1.2. 2016.
Viechtbauer W. Conducting meta-analyses in R with the metafor package. J Stat Softw. 2010;36(3):1–48.
Maas C, Kent DM, Hughes MC, Dekker R, Lingsma HF, van Klaveren D. Performance metrics for models designed to predict treatment effect. BMC Med Res Methodol. 2023;23(1):165. https://doi.org/10.1186/s12874-023-01974-w.
Collins GS, Reitsma JB, Altman DG, Moons KG. Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD): the TRIPOD Statement. Br J Surg. 2015;102(3):148–58. https://doi.org/10.1002/bjs.9736.
Austin PC, Steyerberg EW. Interpreting the concordance statistic of a logistic regression model: relation to the variance and odds ratio of a continuous explanatory variable. BMC Med Res Methodol. 2012;12:82. https://doi.org/10.1186/1471-2288-12-82.
Vergouwe Y, Moons KG, Steyerberg EW. External validity of risk models: Use of benchmark values to disentangle a case-mix effect from incorrect coefficients. Am J Epidemiol. 2010;172(8):971–80. https://doi.org/10.1093/aje/kwq223.
Molyneux AJ, Kerr RS, Yu LM, et al. International subarachnoid aneurysm trial (ISAT) of neurosurgical clipping versus endovascular coiling in 2143 patients with ruptured intracranial aneurysms: a randomised comparison of effects on survival, dependency, seizures, rebleeding, subgroups, and aneurysm occlusion. Lancet. 2005;366(9488):809–17. https://doi.org/10.1016/S0140-6736(05)67214-5.
Luijken K, Groenwold RHH, Van Calster B, Steyerberg EW, van Smeden M. Impact of predictor measurement heterogeneity across settings on the performance of prediction models: A measurement error perspective. Stat Med. 2019;38(18):3444–59. https://doi.org/10.1002/sim.8183.
Steyerberg EW, Vergouwe Y. Towards better clinical prediction models: seven steps for development and an ABCD for validation. Eur Heart J. 2014;35(29):1925–31. https://doi.org/10.1093/eurheartj/ehu207.
Justice AC, Covinsky KE, Berlin JA. Assessing the generalizability of prognostic information. Ann Intern Med. 1999;130(6):515–24. https://doi.org/10.7326/0003-4819-130-6-199903160-00016.
Steyerberg EW, Nieboer D, Debray TPA, van Houwelingen HC. Assessment of heterogeneity in an individual participant data meta-analysis of prediction models: an overview and illustration. Stat Med. 2019;38(22):4290–309. https://doi.org/10.1002/sim.8296.
Dijkland SA, Roozenbeek B, Brouwer PA, et al. Prediction of 60-day case fatality after aneurysmal subarachnoid hemorrhage: external validation of a prediction model. Crit Care Med. 2016;44(8):1523–9. https://doi.org/10.1097/CCM.0000000000001709.
Risselada R, Lingsma HF, Bauer-Mehren A, et al. Prediction of 60 day case-fatality after aneurysmal subarachnoid haemorrhage: results from the International Subarachnoid Aneurysm Trial (ISAT). Eur J Epidemiol. 2010;25(4):261–6. https://doi.org/10.1007/s10654-010-9432-x.
van Donkelaar CE, Bakker NA, Birks J, et al. Prediction of outcome after aneurysmal subarachnoid hemorrhage. Stroke. 2019;50(4):837–44. https://doi.org/10.1161/STROKEAHA.118.023902.
Witsch J, Frey HP, Patel S, et al. Prognostication of long-term outcomes after subarachnoid hemorrhage: the FRESH score. Ann Neurol. 2016;80(1):46–58. https://doi.org/10.1002/ana.24675.
Witsch J, Kuohn L, Hebert R, et al. Early prognostication of 1-year outcome after subarachnoid hemorrhage: the FRESH score validation. J Stroke Cerebrovasc Dis. 2019;28(10):104280. https://doi.org/10.1016/j.jstrokecerebrovasdis.2019.06.038.
Debray TP, Moons KG, Ahmed I, Koffijberg H, Riley RD. A framework for developing, implementing, and evaluating clinical prediction models in an individual participant data meta-analysis. Stat Med. 2013;32(18):3158–80. https://doi.org/10.1002/sim.5732.
Jaja BN, Attalla D, Macdonald RL, et al. The Subarachnoid Hemorrhage International Trialists (SAHIT) Repository: advancing clinical research in subarachnoid hemorrhage. Neurocrit Care. 2014;21(3):551–9. https://doi.org/10.1007/s12028-014-9990-y.
Macdonald RL, Cusimano MD, Etminan N, et al. Subarachnoid Hemorrhage International Trialists data repository (SAHIT). World Neurosurg. 2013;79(3–4):418–22. https://doi.org/10.1016/j.wneu.2013.01.006.
Acknowledgements
The authors wish to thank the SAHIT Collaboration for supplying the data necessary to conduct this study.
Funding
This research received no specific grant from any funding agency in the public, commercial or not-for-profit sectors.
Author information
Authors and Affiliations
Contributions
Study Conceptualization (J.d.W., C.C.H.M., D.v.K., B.R., H.F.L.); Literature review (J.d.W.); Formal analysis (J.d.W., C.C.H.M.); Visualization (J.d.W.); Writing original draft (J.d.W.); Critical review of the manuscript (J.d.W., C.C.H.M., D.v.K., B.R., H.F.L.); Supervision (H.F.L.).
Corresponding author
Ethics declarations
Ethics approval and consent to participate
The institutional medical ethical research committee (MERC) of the Erasmus MC University Medical Center Rotterdam approved the study protocol under the exemption category and waived the need for written informed consent (MEC-2020–0810). Participant consent was not obtained, as the study involved de-identified data.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
de Winkel, J., Maas, C.C.H.M., Roozenbeek, B. et al. Pitfalls of single-study external validation illustrated with a model predicting functional outcome after aneurysmal subarachnoid hemorrhage. BMC Med Res Methodol 24, 176 (2024). https://doi.org/10.1186/s12874-024-02280-9