 Research
 Open access
Weighted metrics are required when evaluating the performance of prediction models in nested case–control studies
BMC Medical Research Methodology volume 24, Article number: 115 (2024)
Abstract
Background
Nested case–control (NCC) designs are efficient for developing and validating prediction models that use expensive or difficult-to-obtain predictors, especially when the outcome is rare. Previous research has focused on how to develop prediction models in this sampling design, but little attention has been given to model validation in this context. We therefore aimed to systematically characterize the key elements for the correct evaluation of the performance of prediction models in NCC data.
Methods
We proposed how to correctly evaluate prediction models in NCC data by adjusting performance metrics with sampling weights that account for the NCC sampling. We included in this study the C-index, threshold-based metrics, the observed-to-expected events ratio (O/E ratio), the calibration slope, and decision curve analysis. We illustrated the proposed metrics with a validation of the Breast and Ovarian Analysis of Disease Incidence and Carrier Estimation Algorithm (BOADICEA version 5) in data from the population-based Rotterdam study. We compared the metrics obtained in the full cohort with those obtained in NCC datasets sampled from the Rotterdam study, with and without a matched design.
Results
Performance metrics without weight adjustment were biased: the unweighted C-index in NCC datasets was 0.61 (0.58–0.63) for the unmatched design, while the C-index in the full cohort and the weighted C-index in the NCC datasets were similar: 0.65 (0.62–0.69) and 0.65 (0.61–0.69), respectively. The unweighted O/E ratio was 18.38 (17.67–19.06) in the NCC datasets, while it was 1.69 (1.42–1.93) in the full cohort and its weighted version in the NCC datasets was 1.68 (1.53–1.84). Similarly, weighted adjustments of threshold-based metrics and of the net benefit for decision curves were unbiased estimates of the corresponding metrics in the full cohort, while the corresponding unweighted metrics were biased. In the matched design, the bias of the unweighted metrics was larger, but it could also be compensated by the weight adjustment.
Conclusions
Nested case–control studies are an efficient solution for evaluating the performance of prediction models that use expensive or difficult-to-obtain biomarkers, especially when the outcome is rare, but the performance metrics need to be adjusted to the sampling procedure.
Background
Risk prediction models are becoming increasingly popular in the medical community to predict clinical outcomes and can be used to provide more personalized decisions to patients. Population-based longitudinal cohorts, with information on thousands of individuals, are now widespread. These cohorts enable the identification of individuals with a disease and their biological samples at the population level, as well as the development of (risk) prediction models [1, 2]. While population-based cohorts are the preferred study design for building and validating such models [3, 4], they are expensive, particularly when the collected data go beyond routinely available variables (e.g. demographic or basic clinical characteristics), and include additional information (e.g. measurements of expensive or difficult-to-obtain biomarkers). In addition, collecting this information for all subjects might not always be feasible. Using the full cohort is, therefore, functional for unbiased sample identification, but not efficient for model development and validation, particularly for rare outcomes. These scenarios require more efficient study designs, such as nested case–control (NCC) and case-cohort designs [3, 4].
In NCC studies, a subsample of a fully enumerated source population is identified, containing all patients who experience the outcome of interest during the study follow-up period (cases), together with a sample of patients who do not experience the outcome during the time at risk (controls) [4, 5]. Often, controls are matched to cases to reduce potential confounding [5]. Follow-up and matching variables must be available in the full cohort to identify the controls, whereas other variables of interest are only collected for the case–control set. This greatly reduces the amount of data collection needed (and the associated costs), while still providing accurate estimates of the effects of risk factors and performance estimates [4, 6]. In contrast to the traditional case–control design, where cases and controls are sampled from a population of unknown size, in NCC studies, cases and controls are sampled from a well-defined population of known size (hence the designation “nested”) [3]. This makes them suitable to estimate hazard and odds ratios of the full cohort and to build absolute risk models [3, 5], without the need for additional data sources. However, appropriate methodology must be employed to accommodate the undersampling of controls and potentially any matching [5, 7, 8].
Clear recommendations on how to develop prediction models in NCC designs can be found in the literature [9, 10], as well as some examples of risk prediction models for rare outcomes developed in NCC data [8, 11,12,13]. However, we have noticed that there is a lack of clear guidance on how to properly validate prediction models in this study design. This has led to incorrect performance estimates of these models reported in several works in the literature [14, 15], and to the misconception that NCC studies are not suited to validate prediction models [16]. Clear guidance on this matter would therefore be useful for clinical and methodological researchers evaluating the performance of prediction models in NCC data.
We aim to present the key elements for the correct evaluation of the performance of prediction models in NCC datasets. We propose how relevant model performance metrics should be adjusted to compensate for 1) the overrepresentation of cases in the NCC dataset and 2) the fact that the controls in the NCC dataset are no longer representative of all controls in the source population. We then illustrate the importance of using adjusted performance metrics with a real-world example, in which we validate the well-known BOADICEA [17] breast cancer risk prediction model in NCC datasets and compare the obtained performance metrics to those reported in the original full-cohort study [18].
Methods
Validating risk prediction models in nested case–control data
In this study, we focus on the scenario where a prediction model has already been developed, in a population-based cohort or in another design, and we aim to evaluate its performance in an NCC design. Table 1 describes how NCC datasets are obtained using incidence density sampling, and Fig. 1 illustrates two fully worked-out examples for an unmatched (A) and a matched (B) design, respectively. Of note, under this sampling method, the same subject can be selected multiple times: cases can be selected as controls of other cases with shorter event times; controls can be selected in risk sets of different cases. However, for model validation in NCC datasets, including the same subject multiple times can lead to biased estimates of performance metrics [4]. Therefore, only one record should be kept for each subject: the case record for subjects selected as cases, and a randomly selected control record for controls [4].
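The sampling scheme described in Table 1 can be sketched in a few lines of code. The following Python snippet is illustrative only (the analysis code accompanying this article is in R, and the function and variable names here are our own); it draws controls with incidence density sampling in an unmatched design, where the risk set of a case contains every other subject still under follow-up at the case's event time:

```python
import random

def sample_ncc(times, events, m=1, seed=1):
    """Illustrative incidence density sampling for an unmatched NCC design.

    times[k], events[k]: follow-up time and event indicator of subject k.
    For each case, m controls are drawn at random from the subjects still
    at risk at the case's event time (excluding the case itself).
    Returns case indices and sampled control indices (duplicates possible;
    as described in the text, only one record per subject should be kept).
    """
    rng = random.Random(seed)
    cases = [k for k, e in enumerate(events) if e == 1]
    controls = []
    for j in cases:
        risk_set = [k for k in range(len(times))
                    if k != j and times[k] >= times[j]]
        controls.extend(rng.sample(risk_set, min(m, len(risk_set))))
    return cases, controls
```

A matched design would additionally restrict `risk_set` to subjects sharing the case's matching-variable values.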
Challenges in validating prediction models in NCC data
Validating prediction models in NCC datasets cannot be performed in the same way as in full cohorts for two reasons. First, the sampling procedure artificially increases the proportion of cases (Fig. 2A). The difference between the original proportion and the one in the NCC dataset will depend on the chosen sampling ratio (number of controls sampled per case). Second, the controls in the NCC dataset are no longer representative of all controls: even if no matching is performed, subjects with longer survival time are more likely to be included in the NCC dataset as controls. This can distort the distributions of the risk factors in the NCC dataset (Fig. 2B). Failing to account for this distortion will result in overestimation of the absolute risk, since controls are underrepresented in the NCC dataset compared to the full cohort, and in the inaccurate estimation of both risk factor effects (e.g., odds ratios) and performance metrics.
However, if subjects are weighted based on their sampling probability, NCC datasets can be used to develop and validate prediction models. The weighting process enables the recovery of the original ratio of cases to controls and the original risk factor distributions, which consequently enables the development of models that predict absolute risk [5, 10]. Additionally, in the validation of a model in an NCC dataset, the same weighting process can be applied to adjust the performance metrics, making it possible to evaluate model performance accurately, without applying the model to all subjects of the corresponding full cohort.
Assigning weights to subjects in NCC data
The weight \({w}_{k}\) of subject \(k\) is computed as the inverse of the probability of subject \(k\) being included in the NCC dataset, and accounts for the sampling design, namely any matching performed, as well as the case–control sampling ratio. For an NCC dataset with \(K\) subjects (\(J\) cases and \(I\) controls), the sampling probability of case \(j\) (\({p}_{j}\)) is straightforward to compute. However, the sampling probabilities of controls, each denoted as \({p}_{i}\), take a more complex form, as several factors can influence their selection.
Sampling probabilities of cases (\({p}_{j}\))
In a typical NCC study, all cases present in the full cohort are included in the NCC dataset; therefore, all cases have a sampling probability \({p}_{j}\) of 1. However, extracting model variables for all cases might not be possible due to time or cost constraints, leading to atypical NCC studies in which only a portion of the cases is included. In this scenario, the sampling probability corresponds to the proportion of the cases identified in the full cohort that are included in the NCC dataset [19].
Sampling probabilities of controls (\({p}_{i}\))
There are different ways of estimating the probability of sampling subject \(k\) as a control [20]. We focus on the Kaplan–Meier and the logistic/generalized additive model estimators, for their ease of implementation and ability to deal with matched NCC designs.
The sampling probability of control \(i\) (\({p}_{i}\)) estimated with the Kaplan–Meier estimate [20] uses the property that, under the incidence density sampling scheme, the probabilities of a control being eligible for sampling for different cases are independent across the (matched) risk sets [21]. It is defined as follows:

$$p_{i} = 1 - \prod_{j=1}^{J}\left(1 - \frac{m \cdot I_{ij}}{n_{j}\left(t_{j}\right) - 1}\right)$$

where \(m\) is the number of controls sampled per case, \(I_{ij}\) is an indicator variable (with value 1 if control \(i\) is eligible for case \(j\), and 0 otherwise), \({t}_{j}\) is the event time of case \(j\), and \({n}_{j}\left({t}_{j}\right)\) is the number of subjects at risk at time \({t}_{j}\) that could be matched to case \(j\).
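As an illustration, the KM-type sampling probabilities for an unmatched design (every at-risk subject is eligible for every case) can be computed as follows. This is a Python sketch with hypothetical names; in practice, these probabilities are provided by the multipleNCC R package:

```python
def km_sampling_prob(times, events, m=1):
    """Kaplan-Meier-type control sampling probabilities, unmatched design.

    Returns p[k]: the probability that subject k is ever sampled as a
    control, i.e. 1 minus the product over all cases j in whose risk set
    subject k appears of (1 - m / (n_j - 1)), where n_j is the size of the
    risk set at the case's event time (the -1 excludes the case itself).
    """
    n = len(times)
    cases = [k for k in range(n) if events[k] == 1]
    p = [0.0] * n
    for k in range(n):
        prob_never = 1.0
        for j in cases:
            if j == k:
                continue
            if times[k] >= times[j]:  # k is at risk when case j occurs
                n_at_risk = sum(1 for i in range(n) if times[i] >= times[j])
                prob_never *= 1.0 - m / (n_at_risk - 1)
        p[k] = 1.0 - prob_never
    return p
```

Subjects who leave follow-up before the first event time get probability 0, and a matched design would multiply only over cases to which the subject could be matched.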
Sampling probabilities can also be estimated using a model-based approach. For example, a logistic regression model can be fitted to predict the probability of a control being sampled, using its censoring time and additional matching variables as predictors. This logistic regression is fitted to the full cohort, excluding all cases. The sampling probability \({p}_{i}\) then corresponds to the output probability of this logistic regression applied to control \(i\), and the inverses of such sampling probabilities are referred to as generalized linear model (GLM) weights. Sampling probabilities can also be estimated with a generalized additive model (GAM), leading to GAM weights.
These weighting schemes are described in more detail in Støer and Samuelsen [20]. Of note, other weighting methods exist, such as the Chen weights [20]. Despite their differences, there is little variation in model estimates and standard errors across weighting schemes during model development [10, 22]. All of these methods for weight computation can be implemented with the multipleNCC R package [20].
Inverse probability weights
Once the sampling probabilities are computed, the weight \({w}_{k}\) assigned to subject \(k\) is:

$$w_{k} = \frac{1}{p_{k}}$$

where \({p}_{k}\) is \({p}_{j}\) for cases and \({p}_{i}\) for controls.
In both typical and atypical NCC data, the weights of all cases have the same value, because the sampling probability is the same for all cases. The weights of controls vary from control to control and are always larger than (or equal to) 1, as each sampled control represents multiple controls in the original cohort: the lower a control's probability of being sampled, the higher its weight [20].
Of note, the sum of the weights will generally not correspond to the number of subjects in the cohort, particularly if KM-type weights are used. To compensate for this, the weights of controls should be rescaled [4]:

$$w_{k}{\prime} = w_{k} \cdot \frac{n_{c}}{\sum_{i \in \text{controls}} w_{i}}$$

where \({w}_{k}{\prime}\) is the rescaled version of weight \({w}_{k}\) and \({n}_{c}\) is the total number of controls in the cohort.
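A minimal sketch of the weight computation and rescaling (a hypothetical Python helper; the multipleNCC R package provides these weights in practice):

```python
def ncc_weights(p_case, p_controls, n_controls_cohort):
    """Inverse-probability weights for an NCC dataset, with control weights
    rescaled so that they sum to the number of controls in the full cohort.

    p_case: sampling probability shared by all cases (1 in a typical design).
    p_controls: sampling probabilities of the selected controls.
    n_controls_cohort: total number of controls in the full cohort (n_c).
    """
    w_case = 1.0 / p_case                      # same weight for every case
    w = [1.0 / p for p in p_controls]          # raw inverse-probability weights
    scale = n_controls_cohort / sum(w)         # rescaling factor
    w_rescaled = [wi * scale for wi in w]
    return w_case, w_rescaled
```

After rescaling, the weighted number of controls in the NCC dataset matches the number of controls in the cohort, so weighted proportions and averages are directly comparable to full-cohort quantities.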
Model performance metrics adjusted to the NCC data
As recommended in the TRIPOD guidelines for risk prediction models [23], models should be validated with respect to their discrimination ability, the agreement between observed and predicted outcomes (calibration), and the clinical utility they provide. Therefore, we focused on the following performance metrics to quantify these characteristics: the C-index, threshold-based metrics (such as sensitivity and specificity), the observed-to-expected events ratio, the calibration slope, and decision curve analysis. The application of these metrics to full cohorts is straightforward, and we refer the reader to the literature for an in-depth characterization of these metrics [24, 25]. While these metrics can be obtained in NCC studies, they must be adjusted for the undersampling of controls, by using the weights of the subjects included in the NCC dataset. Of note, adjusted metrics should be used in both internal and external validation of models in NCC datasets (see the section “Bootstrapping and cross-validation in NCC data”).
C-index
The C-index estimates the probability that the predicted order of the events of a randomly selected pair of subjects is correct. It is calculated by identifying all possible pairs where at least one subject had an event (usable pairs). Among these, concordant and discordant pairs are those where the subject with the longer survival time has the smaller or the larger predicted risk, respectively. The C-index is then calculated as follows:

$$C = \frac{\sum_{k} C_{k}}{\sum_{k}\left(C_{k} + D_{k}\right)}$$

where \({C}_{k}\) and \({D}_{k}\) denote respectively the number of concordant and discordant pairs for subject \(k\).
When a model is validated in an NCC dataset, the C-index should be adjusted by weighting the concordant and discordant pairs by the estimated weight of subject \(k\) (\({w}_{k}\)), as follows [4]:

$$C_{w} = \frac{\sum_{k} w_{k} C_{k}}{\sum_{k} w_{k}\left(C_{k} + D_{k}\right)}$$
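The weighted C-index can be sketched as follows. This is an illustrative Python implementation (our own naming) that attributes each usable pair to the case with the shorter event time and ignores ties in risk and in time:

```python
def weighted_c_index(risk, time, event, w):
    """Weighted C-index for an NCC dataset.

    A usable pair has at least one event; it is concordant when the subject
    with the shorter survival time has the higher predicted risk. Each
    subject's concordant/discordant pair counts are weighted by w[k].
    """
    n = len(risk)
    num = den = 0.0
    for k in range(n):
        c_k = d_k = 0
        for l in range(n):
            if l == k:
                continue
            # k must be the subject with the event and the shorter time
            if not event[k] or time[k] >= time[l]:
                continue
            if risk[k] > risk[l]:
                c_k += 1
            elif risk[k] < risk[l]:
                d_k += 1
        num += w[k] * c_k
        den += w[k] * (c_k + d_k)
    return num / den
```

With all weights equal to 1, this reduces to the usual unweighted C-index.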
Threshold-based performance metrics
Often, prediction models are used with decision thresholds: subjects above and below the risk threshold are considered high-risk and low-risk, respectively. Performance metrics such as sensitivity (SE), specificity (SP), positive predictive value (PPV) and negative predictive value (NPV) are informative to validate a model in this setting. The need to adjust PPV and NPV is clear, as these metrics are directly dependent on the prevalence of the outcome of interest in a cohort [26], but SE and SP must also be adjusted in NCC datasets, due to the biased sampling. Unbiased estimates of these metrics are obtained by using the weights of cases and controls as described in Table 2.
The adjustment of these metrics has been described for model validation in NCC datasets when the outcome of interest is binary and no survival data are available [6]. In that situation, the weights of all controls are the same and correspond to the inverse of the control sampling fraction (i.e., \(\frac{\text{total number of controls}}{\text{number of sampled controls}}\)). In Table 2, we generalized this adjustment to survival data and to the possibility of controls having different weights.
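In essence, the weighted threshold-based metrics replace the counts of the usual 2×2 table with weighted sums. An illustrative Python sketch (the function name and binary-outcome simplification are ours; Table 2 gives the survival-data versions):

```python
def weighted_threshold_metrics(high_risk, case, w):
    """Weighted sensitivity, specificity, PPV and NPV for an NCC dataset.

    high_risk[k]: model classification (True above the risk threshold).
    case[k]: True for cases. w[k]: sampling weight (typically 1 for cases,
    rescaled inverse-probability weights for controls).
    """
    tp = sum(wk for hk, ck, wk in zip(high_risk, case, w) if hk and ck)
    fn = sum(wk for hk, ck, wk in zip(high_risk, case, w) if not hk and ck)
    fp = sum(wk for hk, ck, wk in zip(high_risk, case, w) if hk and not ck)
    tn = sum(wk for hk, ck, wk in zip(high_risk, case, w) if not hk and not ck)
    return {"SE": tp / (tp + fn), "SP": tn / (tn + fp),
            "PPV": tp / (tp + fp), "NPV": tn / (tn + fn)}
```

Because control weights inflate FP and TN towards their full-cohort counts, PPV and NPV recover the cohort prevalence, and SE and SP are corrected for the biased control sample.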
Calibration
Model calibration aims to assess whether the absolute risks predicted by the model correctly estimate the observed risks. For example, for patients with, say, a 20% predicted risk of an event of interest, 20 patients out of 100 should indeed experience the event.
Calibration can be evaluated in multiple ways. We focused on the most important calibration metrics for risk prediction models [27]: mean calibration, calibration slope, and calibration plot.
Mean calibration corresponds to the difference between the observed event proportion (for binary outcomes), or 1 minus the observed survival fraction at the chosen time point (estimated with, e.g., the Kaplan–Meier method, for survival data [28]), and the average predicted risk. It can also be expressed as a ratio of observed to expected (i.e., predicted) events. In an NCC dataset, the observed event proportion or survival fraction should be derived from the full cohort and, apart from rounding differences, should correspond to the weighted event proportion or weighted KM survival probability, respectively. Likewise, the average predicted risk should be computed as the weighted average of the predicted risks, with the previously described weights.
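For binary outcomes, the weighted O/E ratio thus reduces to a ratio of two weighted averages. A minimal sketch (hypothetical Python helper; for survival data the observed part would instead come from a weighted Kaplan–Meier estimate at the prediction horizon):

```python
def weighted_oe_ratio(case, predicted_risk, w):
    """Weighted observed-to-expected ratio for binary outcomes in NCC data.

    Observed: weighted proportion of events.
    Expected: weighted average predicted risk.
    """
    total_w = sum(w)
    observed = sum(wk for ck, wk in zip(case, w) if ck) / total_w
    expected = sum(rk * wk for rk, wk in zip(predicted_risk, w)) / total_w
    return observed / expected
```

Without the weights, the observed proportion would be inflated by the oversampling of cases, which is why unweighted O/E ratios in NCC data are implausibly large.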
The calibration slope summarizes the strength of the association between predicted and observed outcomes [27]. It corresponds to the slope of the regression of the observed outcomes (binary or survival) on the linear predictor of the model. A perfectly calibrated model has a slope of 1, while lower and greater values indicate over- and underfitting, respectively. Calibration can be visualized in a calibration plot showing the observed proportions (or 1 minus the observed survival fractions) against the average predicted risks for a given number of subject groups. For NCC datasets, a weighted regression of the observed outcomes (binary or survival) on the linear predictors of the model should be used to obtain the calibration slope (Supplementary Table 1), and the calibration plot should show weighted observed proportions/survival fractions against weighted averages of the predicted risks for a given number of subject groups.
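For a binary outcome, the weighted calibration slope can be obtained with a weighted logistic regression of the outcome on the linear predictor. The sketch below fits this univariate model by Newton–Raphson (illustrative Python with our own names; in practice the weighted regression routines of standard statistical software should be used):

```python
import math

def weighted_logistic_slope(lp, y, w, iters=25):
    """Weighted logistic regression of binary outcome y on linear predictor
    lp. Returns (intercept, slope); the slope is the calibration slope."""
    a, b = 0.0, 1.0  # start at intercept 0, slope 1 (perfect calibration)
    for _ in range(iters):
        p = [1 / (1 + math.exp(-(a + b * x))) for x in lp]
        # weighted score vector
        g0 = sum(wi * (yi - pi) for wi, yi, pi in zip(w, y, p))
        g1 = sum(wi * (yi - pi) * xi for wi, yi, pi, xi in zip(w, y, p, lp))
        # weighted information matrix (2x2)
        s = [wi * pi * (1 - pi) for wi, pi in zip(w, p)]
        h00 = sum(s)
        h01 = sum(si * xi for si, xi in zip(s, lp))
        h11 = sum(si * xi * xi for si, xi in zip(s, lp))
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (-h01 * g0 + h00 * g1) / det
    return a, b
```

A useful property of the weighted likelihood is that giving a subject weight 2 is equivalent to duplicating that subject's record, which makes the weighting easy to sanity-check.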
Decision curve analysis
Discrimination and calibration measures are insufficient to evaluate the utility of a model in the context of a clinical decision. Decision curve analysis aims to assess the net benefit of making clinical decisions based on the prediction model, compared to other models or default strategies, by accounting for the tradeoff between the relative harms of false positives and false negatives [29]. The relative harm can be interpreted as how much worse it is to miss a poor clinical outcome and wrongly withhold treatment (i.e., a false negative) than to provide an unnecessary intervention to a patient who will not develop the outcome (i.e., a false positive). For example, if the harm of missing a cancer is 4 times greater than that of performing an unnecessary invasive procedure, we can translate this relative harm into a probability threshold such that the patient should only undergo the procedure if the risk is greater than the threshold (1/(4 + 1) = 0.2). The decision curve is constructed by computing the net benefit over a clinically meaningful range of probability thresholds.
The net benefit (NB) at a given threshold is computed as follows [29]:

$$NB = \frac{TP}{n} - \frac{FP}{n} \cdot \frac{p_{t}}{1 - p_{t}}$$

where \(TP\) denotes true positives, \(FP\) denotes false positives, \(n\) is the total number of subjects and \({p}_{t}\) is the probability threshold.
In NCC data, the net benefit should be calculated using the definitions of \(TP\) and \(FP\) for binary or survival data provided in Table 2. The net benefit for binary outcomes in NCC data has already been derived in [30], but not for survival outcomes.
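Using weighted TP and FP counts, the weighted net benefit can be sketched as follows (illustrative Python, binary outcome, our own naming):

```python
def weighted_net_benefit(risk, case, w, p_t):
    """Weighted net benefit at threshold p_t for an NCC dataset.

    TP and FP are weighted counts of high-risk cases and high-risk
    controls, and n is the weighted total, so the NCC dataset stands in
    for the full cohort.
    """
    n = sum(w)
    tp = sum(wk for rk, ck, wk in zip(risk, case, w) if rk >= p_t and ck)
    fp = sum(wk for rk, ck, wk in zip(risk, case, w) if rk >= p_t and not ck)
    return tp / n - (fp / n) * (p_t / (1 - p_t))
```

Evaluating this function over a grid of thresholds yields the weighted decision curve.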
Bootstrapping and cross-validation in NCC data
Bootstrapping and cross-validation are techniques that are frequently employed to internally validate prediction models and to obtain estimates of the variability of model performance metrics. In general, bootstrap samples and cross-validation folds should mimic the original sampling design as closely as possible [31]. Therefore, when these procedures are applied to NCC datasets, they should account for the fact that the NCC design is a stratified design in which controls are matched to cases (at least on follow-up time). This translates into drawing bootstrap samples by resampling case–control pairs with replacement, rather than individual subjects; in cross-validation schemes, case–control pairs should likewise be assigned to the same fold. Weighted metrics should also be used when evaluating the model within the NCC cross-validation folds and bootstrap samples.
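Resampling case–control sets rather than individual subjects can be sketched as follows (illustrative Python; `pair_ids` is a hypothetical list of matched-set labels, one per set):

```python
import random

def bootstrap_pairs(pair_ids, n_boot=100, seed=1):
    """Bootstrap an NCC dataset by resampling case-control sets.

    pair_ids: labels of the matched sets (e.g. one case plus its m controls
    share a label). Yields, for each bootstrap replicate, the set labels
    drawn with replacement; the subjects belonging to those sets form the
    bootstrap sample on which weighted metrics are then computed.
    """
    rng = random.Random(seed)
    for _ in range(n_boot):
        yield [rng.choice(pair_ids) for _ in pair_ids]
```

Sampling whole sets preserves the case-to-control ratio and the matching structure in every replicate, which individual-level resampling would not.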
Code availability
We implemented a subset of the weighted performance metrics (threshold-based metrics, the calibration plot, and decision curve analysis for survival data). We used the R packages described in Supplementary Table 1 for computing sampling weights and the remaining weighted performance metrics. The code to reproduce our analysis is available at https://github.com/emcdermatology/nccevaluation.
Real-world illustration: validation of the BOADICEA model
Breast cancer is the most commonly diagnosed cancer in women worldwide [32]. Identifying high-risk individuals is critical to reduce mortality and improve quality of life. However, excessive screening also results in false positives and overdiagnosis [33]. Risk prediction models can help healthcare systems by targeting women at high risk of developing breast cancer for screening, while reducing the side effects of screening for those at lower risk.
In this clinical illustration, we focused on the validation of the Breast and Ovarian Analysis of Disease Incidence and Carrier Estimation Algorithm (BOADICEA version 5) model in the Rotterdam study, a population-based cohort. The performance of the model in predicting the development of breast cancer (invasive or in situ) within 10 years has been previously evaluated in this cohort [18]. We use this cohort to show that it is also possible to validate a prediction model in a subcohort much smaller than the full cohort (more than 10 times smaller in this case), if the subcohort is carefully designed (i.e., the nested case–control dataset).
Model and study cohort
The BOADICEA model is an absolute risk prediction model that estimates the probability of developing breast cancer within 5 years, within 10 years, or within a lifetime, for women up to the age of 80. The model uses genetic input variables, such as a polygenic risk score (PRS) based on 313 breast cancer-associated variants, together with non-genetic risk factors and family history. The non-genetic risk factors are: mammographic density, age at menarche, age at menopause, parity, age at first live birth, oral contraceptive use, hormonal treatment, height, body mass index, and alcohol intake. The formula for computing the absolute risk is available in Lee et al. [17]. The model was validated in the Rotterdam Study cohort, a population-based cohort of elderly Dutch individuals living in a district of Rotterdam in the Netherlands. In 2008, the cohort consisted of 14,926 subjects aged 45 years or older, of whom 8,823 were women. As in Lakeman et al. [18], we excluded all women for whom 10-year BOADICEA risk estimates could not be obtained: women for whom genotype data were not available or were of insufficient quality (n = 2153), and women who, at the time of recruitment into the cohort, had breast cancer (n = 148) or were older than 70 years (n = 2145). This left a total of 4,377 women. All BOADICEA risk factors were available in this cohort, except for mammographic density; however, the BOADICEA model allows for missing information [17].
Ethics statement
The Rotterdam Study has been approved by the institutional review board (Medical Ethics Committee) of the Erasmus Medical Center and by the review board of The Netherlands Ministry of Health, Welfare and Sports. The analyses in this study were approved by the management team of the Rotterdam study.
Illustration setup
NCC datasets with one control per case were derived from the full cohort of 4,377 subjects in 3 different scenarios: an NCC design without any matching variables (NCC-NM); an NCC design with matching on an administrative variable (Rotterdam study subcohort), which is not strongly associated with the model predictions (NCC-MNR); and an NCC design with matching based on the non-genetic risk estimates (NCC-MR). In the NCC-MNR design, the matching variable has 3 categories; in the NCC-MR design, it has 6 categories, representing the risk percentages (from 1 to 6%). Control sampling in all designs was performed with incidence density sampling without duplicated subjects. Cases left without a control after duplicate removal were matched to a replacement control, so that the total KM weight sum before rescaling was similar to the size of the full cohort. Such control replacement is typically possible in NCC designs with rare outcomes and sufficiently loose matching.
The BOADICEA model was applied to both the full cohort and all the derived NCC datasets (Supplementary Fig. 1). Performance metrics, including the C-index, calibration metrics, and decision curve analysis, were computed in all cohorts. For the NCC datasets, the performance of the BOADICEA model was evaluated with and without adjustment for the NCC design, and these unweighted (“naïve”) and weighted metrics were compared with the performance metrics in the full cohort. In practice, only one NCC dataset would be sampled and performance metrics would be evaluated on that single cohort; however, to account for sampling variation, we repeated the derivation of NCC datasets 100 times in each scenario. As a sensitivity analysis, we used three ways to compute the sampling weights: Kaplan–Meier-type weights (KM), weights computed with logistic regression (GLM), and weights computed with generalized additive models (GAM). We also repeated the analyses with NCC datasets with 2 controls per case, to investigate the impact of the case–control ratio on the precision of the performance metric estimates.
Results
A total of 4,377 subjects were included in our analysis, of whom 163 developed breast cancer within 10 years of recruitment into the cohort (Table 3). The median follow-up was 12.9 (6.9–20.9) years for subjects without an event. Women who developed breast cancer (cases) differed significantly from those who did not (controls) in body mass index and in their 313-variant PRS. The original Rotterdam study (RS-I) was extended twice (RS-II and RS-III) [34]. The proportion of cases was significantly different among these subcohorts.
Model validation in the full cohort
Similarly to the validation publication in the Rotterdam Study [18], the performance of the BOADICEA model was evaluated in all subjects of this cohort and compared with the performance of risk estimates based on subsets of the model components (age only, age and non-genetic risk factors, and age and the genetic component). All metrics are reported in Supplementary Table 2. The BOADICEA model (using age, non-genetic risk factors, and the PRS) showed reasonable discriminative ability (C-index 0.65 (0.61–0.69)) and calibration (calibration slope 1.19 (0.87–1.54)). However, breast cancer occurrences were substantially underestimated, with an observed-to-expected ratio (O/E ratio) of 1.69 (1.42–1.93). This underestimation was more evident in the higher risk range (Supplementary Fig. 2). Despite the observed miscalibration, decision curve analysis showed that the BOADICEA model is still clinically useful for targeted screening in the vicinity of the outcome cumulative incidence (Fig. 3): for threshold probabilities between 3% and 7.5%, the model outperformed the treat-all strategy. Interestingly, comparison of the different subcomponents of the model shows that risk estimates based on the genetic component were more clinically useful than risk estimates relying on age or risk factors alone. Since risk predictions are often dichotomized to guide decisions, we used the risk threshold described in the BOADICEA development publication [17] (3%) and classified subjects with a risk prediction lower than 3% as low-risk and the remaining subjects as high-risk. Performance metrics obtained with this risk threshold were suboptimal: namely, the sensitivity to detect breast cancer occurrences was only 0.52 (0.43–0.61).
Model validation in NCC data
We evaluated the performance of the model in 3 scenarios with variations in the NCC sampling design; all sampling designs led to NCC datasets with 163 cases and 163 controls. To evaluate model predictions, we first analyzed performance metrics without setting any risk threshold. Figure 4 clearly shows that the unweighted performance metrics obtained in the NCC datasets do not correspond to those obtained in the full cohort. The bias was larger when controls were matched to cases on a variable associated with the model predictions, but precision was similar for most metrics. In contrast, weighted performance metrics were unbiased in all scenarios. The results were very similar for other types of sampling weights (GLM and GAM weights, Supplementary Fig. 3). The unweighted calibration plot suggests that the model substantially underestimated the risk of the outcome (Fig. 5), while all weighted calibration plots resemble the calibration plot of the full cohort. Similarly, the unweighted decision curve indicates that the model is no more useful than the screen-all strategy, while the weighted decision curves are very similar to the decision curve of the full cohort (Fig. 6), thereby correctly showing that the model outperforms the screen-all strategy. Precision in the lower probability ranges was slightly higher for the matched NCC datasets. However, for all designs, the width of the confidence intervals and the bias of the net benefit estimates increased substantially with increasing risk thresholds, because there were fewer subjects with higher risk predictions, and therefore more variability across different realizations of the NCC sampling.
As with the previous performance metrics, the threshold-dependent metrics (SE, SP, PPV and NPV) were biased without weighting and close to unbiased when weighted (Fig. 7). Unbiased estimates were also obtained for all weighted metrics when 2 controls were sampled per case, and the precision of the estimates was higher for most performance metrics (Supplementary Table 3, Supplementary Figs. 4 and 5). Lastly, model comparisons were also biased when unweighted metrics were used (Supplementary Fig. 6). Namely, using unweighted metrics to compare the BOADICEA model with and without the genetic component leads to a substantial underestimation of the benefit of the genetic component: the difference between the unweighted C-indexes of the two models was respectively 7% and 4% in the NCC-NM and NCC-MR scenarios, compared with 10% when the difference in C-index was computed using weighted metrics in either scenario, or based on the full cohort.
Discussion
We systematically proposed how to validate prediction models in nested case–control data, by adjusting discriminative and calibration metrics, as well as calibration plots and decision curves. We showed that, despite their much smaller sample size, NCC datasets can be used to obtain performance estimates that correspond to those of the original population-based cohorts, as long as the performance metrics are appropriately weighted. The weighting procedure consists of estimating the sampling weight [20] of each subject in the NCC dataset and using those weights when computing each performance metric.
We illustrated the importance and validity of the described weighted metrics in a real-world case study in which we validated the BOADICEA model in NCC datasets derived from the Rotterdam study. Although the BOADICEA model had previously been externally validated in the Rotterdam study [18], we showed that, if the model input variables had not been available for the entire cohort, unbiased performance estimates could also have been obtained in an NCC subset of the original cohort: the median/mean of the estimated metrics in the NCC datasets corresponded to their values in the Rotterdam study (Figs. 4–7).
The NCC design is therefore a more cost-effective option for model validation, as it allows the estimation of unbiased performance metrics with a much smaller sample size: in this case, less than 10% of the original cohort. This is of particular interest for evaluating the performance of models that require measuring biomarkers that are difficult to obtain in large cohorts, such as the polygenic risk scores included in the BOADICEA model. Of note, the smaller sample size of NCC datasets leads to higher uncertainty in the estimated performance metrics. If this is of concern, the number of controls sampled per case can be increased to improve the precision of the estimates (Supplementary Table 3, Supplementary Figs. 4 and 5).
We observed two distinct patterns, depending on whether or not a metric directly depends on the outcome prevalence. Metrics such as the O/E ratio, net benefit, PPV and NPV are easily flagged as implausible if unweighted. For example, the unweighted calibration plot suggests that the model substantially underestimated the risk of the outcome (Fig. 5). On the other hand, metrics such as the C-index, calibration slope, sensitivity and specificity have plausible ranges even when unweighted.
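The prevalence dependence of the O/E ratio can be made concrete with a short sketch. This is a deliberate simplification that treats the outcome as binary at a fixed horizon and ignores censoring (the article's version handles censoring with a weighted Kaplan–Meier estimate); the function name is ours:

```python
import numpy as np

def weighted_oe_ratio(y, pred_risk, w):
    """Sampling-weight-adjusted observed-to-expected events ratio.
    Simplified sketch: binary outcome y at a fixed horizon, no censoring."""
    w = np.asarray(w, dtype=float)
    observed = np.average(y, weights=w)          # weighted event proportion
    expected = np.average(pred_risk, weights=w)  # weighted mean predicted risk
    return observed / expected
```

Because cases are heavily over-represented in an NCC sample, setting all weights to 1 inflates the observed event proportion and hence the O/E ratio, which is why the unweighted value is easily flagged as implausible.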
However, we illustrated that using such metrics without weighting would lead to incorrect conclusions regarding model performance. In particular, we showed that when no weighting was applied, the C-index of the BOADICEA model was consistently underestimated in the NCC datasets. This underestimation has been observed in other studies [34, 38] and is, in fact, expected: the NCC matching procedure (on time and on potential additional matching variables) frequently attenuates the association between the outcome and any model variables that might be associated with the time/matching variables, which in turn decreases discriminative performance metrics. Conclusions regarding model calibration based on unweighted metrics would also be misleading: the unweighted calibration slope was lower than 1 in all scenarios, suggesting that the model was overfitted (Fig. 4), while the model was, in reality, underfitted. Furthermore, comparisons of the predictive ability of the different components of the model (non-genetic risk estimates, genetic risk estimates and the combined model) would also lead to different conclusions if metrics were not weighted (Supplementary Fig. 6). Namely, in the NCC datasets matched on risk factors, the improvement in predictive ability associated with adding the genetic component to the model was 4% based on unweighted C-indexes, compared to 10% based on the weighted or full-cohort C-indexes. This observation is important because the NCC study design is frequently used to study the added value of biomarkers that are too expensive to collect in the full cohort. Our real-world example shows that it is essential to use weighted metrics to correctly estimate the improvement in performance metrics associated with novel biomarkers.
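The attenuation of discrimination described above can be checked with a weighted Harrell-type C-index. The following O(n²) sketch is our own illustration (not the authors' released code) and weights each comparable pair by the product of the two subjects' sampling weights:

```python
import numpy as np

def weighted_cindex(time, event, risk, w):
    """Inverse-probability-weighted Harrell C-index (sketch).
    A pair (i, j) is comparable when i has an event and time[i] < time[j];
    each comparable pair contributes w[i] * w[j] to the totals."""
    time, event = np.asarray(time), np.asarray(event)
    risk, w = np.asarray(risk, dtype=float), np.asarray(w, dtype=float)
    num = den = 0.0
    for i in range(len(time)):
        if event[i] != 1:
            continue
        for j in range(len(time)):
            if time[j] > time[i]:          # j survived past i's event
                pw = w[i] * w[j]
                den += pw
                if risk[i] > risk[j]:      # concordant pair
                    num += pw
                elif risk[i] == risk[j]:   # tied predictions count half
                    num += 0.5 * pw
    return num / den
```

Setting all weights to 1 recovers the unweighted C-index, which in NCC data is computed over an unrepresentative mix of cases and controls and is therefore biased.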
We have also shown that unbiased performance metrics can be obtained across different NCC sampling designs (with and without matching). This indicates that the effect of different types of bias introduced during cohort sampling can be mitigated during model validation by adjusting the performance metrics with sampling weights that account for the study design. Of note, while ignoring matching on administrative variables in the weight computation has a smaller impact than ignoring matching on variables associated with the model risk predictions, matching on administrative variables should still be accounted for, if possible. Other studies that investigated weighted metrics such as the C-index and calibration slope in matched NCC designs have reported biased estimation under fine matching [4, 35]. We did not observe this in our study, but the matching we employed was not as fine: there were only 3 categories when we matched on entry into the Rotterdam study (NCC-MNR), and 6 categories when we matched on the non-genetic risk estimates (NCC-MR), with subjects being well distributed among these categories. Moreover, Ganna et al. [4] evaluated a conditional logistic regression model, in which the matching variables are accounted for by the conditioning and therefore could not be included in the linear predictor used for risk estimation; this contributed additional bias to the performance metrics obtained in the matched NCC design in their study. This was not the case with the BOADICEA model.
Finally, we have shown that unbiased performance metrics can be obtained using different methods for calculating the sampling weights (Supplementary Fig. 3), which means that weighted metrics are robust with respect to the computation method used for the weights. Similar conclusions regarding weight computation methods had already been demonstrated for the development of prediction models in NCC datasets [10, 20], but not for their validation.
This study has some limitations. We illustrated the importance of using weighted performance metrics in a large real-world cohort, where using a nested case–control design can substantially reduce the cost of validating a prediction model that requires expensive data collection. However, the use of a real-world dataset prevented us from studying the robustness of our conclusions to variations in parameters such as outcome incidence and the magnitude of the performance metrics of the model. A simulation study [35], which investigated this for a subset of the performance metrics, showed that the C-index was consistent across different outcome incidence rates, but that calibration metrics deteriorated at lower incidence rates (5%). This contrasts slightly with our study, where the outcome cumulative incidence was 4.4% (95% CI 3.7–5.1%) and all metrics could be estimated without bias. However, we focused on external validation of the model, whereas Lee et al. [35] reported fivefold cross-validation results for the calibration metrics, which might partially explain the decline in performance they observed at lower outcome incidence.
We have extensively covered the most commonly recommended metrics for assessing the discrimination, calibration and clinical utility of prediction models [23]; however, other metrics and extensions could be used. For example, goodness-of-fit tests such as the Grønnesby–Borgan or Hosmer–Lemeshow test are employed to assess model calibration, although, due to their limitations, they are falling into disuse. The use of such tests in NCC datasets has been described in the literature [4, 8]. The incremental value of additional predictors can be quantified with the net reclassification index, for which a weighted version is provided in [4]. However, concerns have also been raised regarding this metric [36].
Furthermore, for simplicity, we did not account for competing risks in our real-world illustration, as this would require using performance metrics that are adjusted to the competing-risk setting [37] and modifying the sampling procedure of the NCC study to account for competing risks [38]. In fact, with strong competing risks or multiple outcomes, the case-cohort design is preferred to the NCC design. Weighting of performance metrics is also needed in that sampling design, although the recommended methods for computing sampling weights [39] differ from the ones described in our study.
Despite the above limitations, the described weighted metrics should be relevant for a wide audience (e.g., clinicians, machine learning practitioners, epidemiologists and biostatisticians), as they must be used for a correct performance evaluation of any type of model that estimates risk probabilities in an NCC study, from the widely used Cox model to machine learning or deep learning methods. Furthermore, difficulties in model performance evaluation that resemble the ones we discussed are addressed in the literature under different names (such as covariate shift in the machine learning field [40]). The NCC sampling design is a special case within these concepts, as it performs sampling in a cost-effective way.
In summary, this study provides clear guidance on how prediction models should be validated in NCC studies using relevant performance metrics. Previously, most of this information was scattered across the literature and not available for all metrics. Namely, weighted versions of some performance metrics are available in different R packages [8], but we had to implement the weight adjustments for threshold-based metrics and for decision curve analyses with survival data ourselves. The code implementing the adjustment of these performance metrics with sampling weights is now available in a GitHub repository (see section “Code availability”) for easy implementation and should facilitate the adoption of weighted metrics by non-specialists.
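To illustrate the kind of adjustment the decision curve code performs, the following is a binary-outcome simplification of the weighted net benefit at a single decision threshold (the released survival version additionally accounts for censoring; the interface is ours):

```python
import numpy as np

def weighted_net_benefit(y, risk, w, threshold):
    """Sampling-weight-adjusted net benefit at one decision threshold.
    Sketch: binary outcome y at a fixed horizon, no censoring."""
    y, risk = np.asarray(y), np.asarray(risk)
    w = np.asarray(w, dtype=float)
    W = w.sum()
    treat = risk >= threshold                 # subjects flagged high risk
    tp = w[treat & (y == 1)].sum() / W        # weighted true-positive rate
    fp = w[treat & (y == 0)].sum() / W        # weighted false-positive rate
    return tp - fp * threshold / (1 - threshold)
```

Evaluating this over a grid of thresholds traces a decision curve; with unit weights it reduces to the standard net benefit, which in NCC data is distorted by the artificially high case proportion.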
Conclusions
Clinical prediction models can be validated in NCC studies if the performance metrics are appropriately adjusted using the sampling weights of the subjects in the NCC dataset. If performance metrics are not weighted, their estimates will be biased, with the magnitude of the bias being higher when matching variables are correlated with the model predictions. The choice of the method to compute sampling weights does not lead to large changes in the estimated weighted metrics, as long as the sampling design and all matching variables are considered in the computation. These results are particularly relevant for the validation of models that predict rare outcomes, and whose input variables cannot be measured in all subjects in the validation cohort.
Availability of data and materials
The dataset analyzed in this study is not publicly available due to privacy regulations; however, data can be obtained upon request from the management team of the Rotterdam Study (datamanagement.ergo@erasmusmc.nl).
Abbreviations
BOADICEA: Breast and Ovarian Analysis of Disease Incidence and Carrier Estimation Algorithm
CI: Confidence interval
GAM: Generalized additive model
GLM: Generalized linear model
KM: Kaplan–Meier
NB: Net benefit
NCC: Nested case–control
NCC-MNR: An NCC design with incidence density sampling and matching on an administrative variable
NCC-MR: An NCC design with incidence density sampling and matching based on the non-genetic risk estimates
NCC-NM: Regular NCC design with incidence density sampling and without any matching variables
NPV: Negative predictive value
O/E ratio: Observed-to-expected events ratio
PPV: Positive predictive value
PRS: Polygenic risk score
RF: Risk factors
SE: Sensitivity
SP: Specificity
References
McCarthy CE, Bonnet LJ, Marcus MW, Field JK. Development and validation of a multivariable risk prediction model for head and neck cancer using the UK Biobank. Int J Oncol. 2020;57(5):1192–202.
Cederholm J, Eeg-Olofsson K, Eliasson B, Zethelius B, Nilsson PM, Gudbjörnsdottir S. Risk prediction of cardiovascular disease in type 2 diabetes: a risk equation from the Swedish National Diabetes Register. Diabetes Care. 2008;31(10):2038–43.
Moons KGM, Kengne AP, Woodward M, Royston P, Vergouwe Y, Altman DG, et al. Risk prediction models: I. Development, internal validation, and assessing the incremental value of a new (bio)marker. Heart. 2012;98(9):683–90.
Ganna A, Reilly M, De Faire U, Pedersen N, Magnusson P, Ingelsson E. Risk prediction measures for case-cohort and nested case-control designs: An application to cardiovascular disease. Am J Epidemiol. 2012;175(7):715–24.
Salim A, Delcoigne B, Villaflores K, Koh WP, Yuan JM, van Dam RM, et al. Comparisons of risk prediction methods using nested case-control data. Stat Med. 2017;36(3):455–65.
Biesheuvel CJ, Vergouwe Y, Oudega R, Hoes AW, Grobbee DE, Moons KGM. Advantages of the nested case-control design in diagnostic research. BMC Med Res Methodol. 2008;8(1):1–7.
Moons KGM, Van Klei W, Kalkman CJ. Preoperative risk factors of intraoperative hypothermia in major surgery under general anesthesia. Anesth Analg. 2003;96(6):1843–4.
Choudhury PP, Maas P, Wilcox A, Wheeler W, Brook M, Check D, et al. iCARE: An R package to build, validate and apply absolute risk models. PLoS ONE. 2020;15(2):e0228198.
Kim RS. Analysis of Nested Case-Control Study Designs: Revisiting the Inverse Probability Weighting Method. Communications for Statistical Applications and Methods. 2013;20(6):455–66.
Borgan Ø, Keogh R. Nested case–control studies: should one break the matching? Lifetime Data Anal. 2015;21(4):517–41.
Zelic R, Zugna D, Bottai M, Andrén O, Fridfeldt J, Carlsson J, et al. Estimation of Relative and Absolute Risks in a Competing-Risks Setting Using a Nested Case-Control Study Design: Example from the ProMort Study. Am J Epidemiol. 2019;188(6):1165–73.
Delcoigne B, Colzani E, Prochazka M, Gagliardi G, Hall P, Abrahamowicz M, et al. Breaking the matching in nested case–control data offered several advantages for risk estimation. J Clin Epidemiol. 2017;82:79–86.
Rentroia-Pacheco B, Tokez S, Bramer EM, Venables ZC, Van De Werken HJG, Bellomo D, et al. Personalised decision making to predict absolute metastatic risk in cutaneous squamous cell carcinoma: development and validation of a clinicopathological model. eClinicalMedicine. 2023;63:102150.
Murphy JD, Olshan AF, Lin FC, Troester MA, Nichols HB, Butt J, et al. A Predictive Model of Noncardia Gastric Adenocarcinoma Risk Using Antibody Response to Helicobacter pylori Proteins and Pepsinogen. Cancer Epidemiol Biomark Prev. 2022;31(4):811–20.
Hoogeveen RM, Pereira JPB, Nurmohamed NS, Zampoleri V, Bom MJ, Baragetti A, et al. Improved cardiovascular risk prediction using targeted plasma proteomics in primary prevention. Eur Heart J. 2020;41(41):3998–4007.
Reps JM, Ryan PB, Rijnbeek PR, Schuemie MJ. Design matters in patient-level prediction: evaluation of a cohort vs. case-control design when developing predictive models in observational healthcare datasets. J Big Data. 2021;8(1):1–18.
Lee A, Mavaddat N, Wilcox AN, Cunningham AP, Carver T, Hartley S, et al. BOADICEA: a comprehensive breast cancer risk prediction model incorporating genetic and non-genetic risk factors. Genet Med. 2019;21(8):1708–18.
Lakeman IMM, Rodríguez-Girondo M, Lee A, Ruiter R, Stricker BH, Wijnant SRA, et al. Validation of the BOADICEA model and a 313-variant polygenic risk score for breast cancer risk prediction in a Dutch prospective cohort. Genet Med. 2020;22(11):1803–11.
Zhou QM, Wang X, Zheng Y, Cai T. New weighting methods when cases are only a subset of events in a nested case-control study. Biom J. 2022;64(7):1240–59.
Støer NC, Samuelsen SO. multipleNCC: Inverse probability weighting of nested case-control data. R Journal. 2016;8(2):5–18.
Langholz B, Richardson D. Are Nested Case-Control Studies Biased? Epidemiology. 2009;20(3):321–9.
Støer NC, Samuelsen SO. Comparison of estimators in nested case-control studies with multiple outcomes. Lifetime Data Anal. 2012;18(3):261–83.
Moons KGM, Altman DG, Reitsma JB, Ioannidis JPA, Macaskill P, Steyerberg EW, et al. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): Explanation and elaboration. Ann Intern Med. 2015;162(1):W1–73.
Steyerberg EW. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. 2nd ed. Cham: Springer Nature Switzerland; 2019. (Statistics for Biology and Health).
Vickers AJ, Elkin EB. Decision curve analysis: A novel method for evaluating prediction models. Med Decis Making. 2006;26(6):565–74.
Parikh R, Mathai A, Parikh S, Sekhar GC, Thomas R. Understanding and using sensitivity, specificity and predictive values. Indian J Ophthalmol. 2008;56(1):45–50.
Van Calster B, Nieboer D, Vergouwe Y, De Cock B, Pencina MJ, Steyerberg EW. A calibration hierarchy for risk models was defined: From utopia to empirical data. J Clin Epidemiol. 2016;74:167–76.
McLernon DJ, Giardiello D, Van Calster B, Wynants L, van Geloven N, van Smeden M, et al. Assessing Performance and Clinical Usefulness in Prediction Models With Survival Outcomes: Practical Guidance for Cox Proportional Hazards Models. Ann Intern Med. 2023;176(1):105–14.
Vickers AJ, Cronin AM, Elkin EB, Gonen M. Extensions to decision curve analysis, a novel method for evaluating diagnostic tests, prediction models and molecular markers. BMC Med Inform Decis Mak. 2008;8:53.
Pfeiffer RM, Gail MH. Estimating the decision curve and its precision from three study designs. Biom J. 2020;62(3):764–76.
Wieczorek J, Guerin C, McMahon T. K-fold cross-validation for complex sample surveys. Stat. 2022:e454.
Arnold M, Morgan E, Rumgay H, Mafra A, Singh D, Laversanne M, et al. Current and future burden of breast cancer: Global statistics for 2020 and 2040. Breast. 2022;66:15–23.
Marmot M, Altman DG, Cameron DA, Dewar JA, Thompson SG, Wilcox M. The benefits and harms of breast cancer screening: an independent review. The Lancet. 2012;380(9855):1778–86.
Ikram MA, Brusselle G, Ghanbari M, Goedegebure A, Ikram MK, Kavousi M, et al. Objectives, design and main findings until 2020 from the Rotterdam Study. European J Epidemiol. 2020;35(5):483–517.
Lee M, Zeleniuch-Jacquotte A, Liu M. Empirical evaluation of subcohort sampling designs for risk prediction modeling. J Appl Stat. 2021;48(8):1374–401.
Pepe MS, Fan J, Feng Z, Gerds T, Hilden J. The Net Reclassification Index (NRI): A Misleading Measure of Prediction Improvement Even with Independent Test Data Sets. Stat Biosci. 2015;7(2):282–95.
Ramspek CL, Teece L, Snell KIE, Evans M, Riley RD, Van Smeden M, et al. Lessons learnt when accounting for competing events in the external validation of time-to-event prognostic models. Int J Epidemiol. 2022;51(2):615–25.
Wolkewitz M, Cooper BS, Palomar-Martinez M, Olaechea-Astigarraga P, Alvarez-Lerma F, Schumacher M. Nested case-control studies in cohorts with competing events. Epidemiology. 2014;25(1):122–5.
Sanderson J, Thompson SG, White IR, Aspelund T, Pennells L. Derivation and assessment of risk prediction models using case-cohort data. BMC Med Res Methodol. 2013;13:113.
Sugiyama M, Krauledat M, Müller KR. Covariate shift adaptation by importance weighted cross validation. J Mach Learn Res. 2007;8:985–1005.
Acknowledgements
The authors are grateful to the study participants, the staff from the Rotterdam Study and the participating general practitioners and pharmacists. We would like to thank Lara Pozza for her valuable comments on the manuscript.
Funding
PPP Allowance made available by Health ~ Holland, Top Sector Life Sciences & Health, to stimulate public–private partnerships. The funder had no role in the study design, data collection, analysis, interpretation or writing of the manuscript.
Author information
Authors and Affiliations
Contributions
BRP, LH and DvK conceived and designed the study. BRP performed the analyses. IMML contributed to data collection and interpretation. BRP wrote the first draft of the manuscript. DB, MW, LH and DvK supervised the study. All authors interpreted the data and critically reviewed the manuscript. All authors accept responsibility for the decision to submit this publication.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
The Rotterdam Study complies with the Declaration of Helsinki and has been approved by the institutional review board (Medical Ethics Committee) of the Erasmus Medical Center and by the review board of The Netherlands Ministry of Health, Welfare and Sports. All participants provided written informed consent to participate in the study.
Consent for publication
Not applicable.
Competing interests
DB is employed by SkylineDx. Other authors have no conflict of interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
RentroiaPacheco, B., Bellomo, D., Lakeman, I.M.M. et al. Weighted metrics are required when evaluating the performance of prediction models in nested case–control studies. BMC Med Res Methodol 24, 115 (2024). https://doi.org/10.1186/s12874024022136