 Research article
 Open Access
Accounting for treatment use when validating a prognostic model: a simulation study
BMC Medical Research Methodology volume 17, Article number: 103 (2017)
Abstract
Background
Prognostic models often show poor performance when applied to independent validation data sets. We illustrate how treatment use in a validation set can affect measures of model performance and present the uses and limitations of available analytical methods to account for this using simulated data.
Methods
We outline how the use of risk-lowering treatments in a validation set can lead to an apparent overestimation of risk by a prognostic model that was developed in a treatment-naïve cohort to make predictions of risk without treatment. Potential methods to correct for the effects of treatment use when testing or validating a prognostic model are discussed from a theoretical perspective. Subsequently, we assess, in simulated data sets, the impact of excluding treated individuals and the use of inverse probability weighting (IPW) on the estimated model discrimination (c-index) and calibration (observed:expected ratio and calibration plots) in scenarios with different patterns and effects of treatment use.
Results
Ignoring the use of effective treatments in a validation data set leads to poorer model discrimination and calibration than would be observed in the untreated target population for the model. Excluding treated individuals provided correct estimates of model performance only when treatment was randomly allocated, although this reduced the precision of the estimates. IPW followed by exclusion of the treated individuals provided correct estimates of model performance in data sets where treatment use was either random or moderately associated with an individual's risk when the assumptions of IPW were met, but yielded incorrect estimates in the presence of non-positivity or an unobserved confounder.
Conclusions
When validating a prognostic model developed to make predictions of risk without treatment, treatment use in the validation set can bias estimates of the performance of the model in future targeted individuals, and should not be ignored. When treatment use is random, treated individuals can be excluded from the analysis. When treatment use is non-random, IPW followed by the exclusion of treated individuals is recommended; however, this method is sensitive to violations of its assumptions.
Background
Prognostic models have a range of applications, from risk stratification to making individualized predictions that help counsel patients or guide healthcare providers when deciding whether or not to recommend a certain treatment or intervention [1,2,3]. Before prognostic models can be used in practice, their predictive performance (e.g. discrimination and calibration), in short “performance”, should be evaluated in a set of individuals who are representative of future targeted individuals. In studies that use independent data to validate a previously developed prognostic model, performance is often considerably worse than in the development set [4]. This may be due to, for example, overfitting of the model in the development data set [5, 6] or differences in case-mix between the development and validation sets [7,8,9,10].
One aspect that can vary considerably between data sets used for model development and validation is the use of treatments or preventative interventions that affect (reduce) the occurrence of the outcomes under prediction. Although a difference in the use of treatments between a development and validation set is generally viewed as a difference in case-mix characteristics, treatment use in a validation set can actually lead to further problems. When additional treatment use in a validation set (compared to the development set) results in a markedly lower incidence of the outcome under prediction, the predictive performance of the model will likely be affected. A challenge arises when a prognostic model has originally been developed in order to make predictions of “untreated risks”, i.e. predictions of an individual’s prognosis without certain treatments, to guide the decision to initiate those treatments in future targeted individuals. Ideally these models should be validated in data sets in which individuals remain untreated with those specific treatments throughout follow-up, so-called treatment-naïve populations. However, the use of such treatment-naïve populations is uncommon and poor performance of a prognostic model seen in a validation study could be directly attributed to treatment use in the validation data set [11, 12].
Ignoring the effects of treatment use in the development phase of a prognostic model for the prediction of untreated risks has already been shown to lead to a model that underestimates this risk in future targeted individuals [13]. However, it is not clear to what extent treatment use in a validation set might influence the observed performance of a prognostic model that was developed in a treatment-naïve population, or how one can account for additional treatment use in a validation set in order to correctly estimate how a prognostic model would perform in its target (untreated) population using a treated validation set.
In this paper, we provide a detailed explanation of when and how treatment use in a validation set can bias the estimation of the performance of a prognostic model in future targeted (untreated) individuals and compare different analytical approaches to correctly estimate the performance of a model using a partly treated validation data set in a simulation study.
Methods
Problems with ignoring treatment use in a validation study
If individuals in a validation set receive an effective treatment during follow-up, their risk of developing the outcome will decrease. Figs. 1a and b show the effect of treatment use on the distribution of risks in data sets that represent data from a randomized trial (RCT) and a non-randomized study (e.g. routine care data or data from an observational cohort study) in which treatment use was more likely in high-risk individuals. In the event of the use of an effective treatment, fewer individuals will develop the outcome than would have, had they remained untreated, and thus the observed outcome frequencies will be lower than the predicted “untreated” outcome frequencies. As a result, a prognostic model developed for making predictions of risk without that treatment (i.e. models used to guide the initiation of a certain treatment) will erroneously appear to overestimate risk in a partially treated validation set, regardless of how treatments have been allocated. As the aim, in this case, is to estimate the performance of the model when used for future, untreated individuals, measures of model discrimination and calibration will give a biased representation of the performance of the model when used in practice for making untreated outcome predictions, if treatment use in the validation set is ignored.
The effect that treatment use will have on measures of model performance in a validation study will depend on a number of factors, including the strength of the effect of treatment on the outcome risk, the proportion of individuals receiving treatment, and the underlying pattern of treatment use. If a treatment has a weak effect on the outcome risk or only a small proportion of individuals are treated in a validation set, the impact on model discrimination and calibration will be relatively small. Furthermore, the way in which treatments are allocated to individuals, that is, whether treatment is allocated randomly, as in data from an RCT, or non-randomly, based on an individual’s risk profile or according to strict treatment guidelines, will influence the impact that treatment use will have in a validation study. If, for example, high-risk individuals are selectively treated, we can anticipate an even greater impact of treatment use on measures of model performance. In this case, the distribution of observed risks will become narrower, due to the risk-lowering effects of treatment in the high-risk individuals (see Fig. 1b), making it more difficult for the model to discriminate between individuals who will or will not develop the outcome, and the calibration in high-risk individuals will be most greatly affected.
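The mechanism described above can be illustrated with a small worked example. The sketch below is not the simulation code of this study: the function names, the eight risk values and the odds ratio of 0.5 are illustrative assumptions. It converts an untreated risk into a treated risk on the odds scale and computes the expected observed:expected ratio for a perfectly calibrated "untreated risk" model when half of the cohort is treated, either evenly across risk levels or concentrated in high-risk individuals.

```python
def treated_risk(untreated_risk, odds_ratio):
    """Risk after applying a treatment odds ratio to an untreated risk."""
    odds = untreated_risk / (1.0 - untreated_risk)
    treated_odds = odds * odds_ratio
    return treated_odds / (1.0 + treated_odds)


def expected_oe_ratio(untreated_risks, treated_flags, odds_ratio):
    """Expected observed:expected ratio when the model predicts untreated risk.

    'Expected' is the sum of the (perfectly calibrated) untreated risks;
    'observed' is the expected number of events given who was treated.
    """
    expected = sum(untreated_risks)
    observed = sum(
        treated_risk(p, odds_ratio) if treated else p
        for p, treated in zip(untreated_risks, treated_flags)
    )
    return observed / expected


# Hypothetical cohort: four risk levels, two people at each level.
risks = [0.05, 0.05, 0.10, 0.10, 0.20, 0.20, 0.40, 0.40]
# Even allocation: one person treated at every risk level.
even_allocation = [True, False] * 4
# Risk-dependent allocation: same number treated, all in the two highest levels.
high_risk_allocation = [False, False, False, False, True, True, True, True]

oe_even = expected_oe_ratio(risks, even_allocation, odds_ratio=0.5)
oe_high_risk = expected_oe_ratio(risks, high_risk_allocation, odds_ratio=0.5)
print(round(oe_even, 3), round(oe_high_risk, 3))
```

In both allocations the ratio falls below 1, so the model appears to overestimate risk, and the shortfall is larger when the same number of treatments is concentrated in high-risk individuals.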
Methods to account for treatment use
In this section we describe possible approaches to account for treatment use in a validation study. For each method, the rationale, expected result of its use, and potential issues are outlined. A summary of the methods, including additional technical details, can be found in Table 1.
Exclusion of treated individuals from the analysis
A common and straightforward approach to remove the effects of treatment is to exclude from the analysis individuals in the validation data set who received treatment. In doing this, one assumes that the untreated subset will resemble the untreated target population for the model.
As Fig. 2a shows, in settings where treatment is randomly allocated (Table 2, scenario 2), the exclusion of treated individuals will result in a validation set that is indeed still representative of the target population. As a result, measures of discrimination and calibration are the same as they would be had all individuals remained untreated, and thus are correct estimates of the performance of the model in its target population. However, the effective sample size is reduced (e.g. a 50% reduction in the case of an RCT with 1:1 randomization).
Figure 2b represents a study where treatment allocation was non-random and high-risk individuals had a higher probability of being treated (Table 2, scenario 1). If treatments were initiated between the moment of making a prediction and the assessment of the outcome, the exclusion of treated individuals results in a subset of individuals with a lower risk on average than in the untreated target population. As a result, the case-mix (in terms of risk profile) in the data set will become more homogeneous, and one can expect measures of discrimination to decrease [9, 14], underestimating the true discriminative ability of the model in future targeted individuals. While this approach may appear to provide correct estimates of calibration, the interpretation of these measures is limited due to the inherent selection bias. The non-randomly untreated individuals only represent a portion of the total target population. Hence, estimates of model performance may provide little information about how well calibrated the model is for high-risk individuals, as these have been actively excluded.
Inverse probability weighting
An alternative approach for model validation in data sets with non-random treatment use would be to balance the data in such a way that it resembles that of an RCT. Inverse probability weighting (IPW) is a method applied in studies where the aim is to obtain an estimate of the causal association between an exposure and outcome, accounting for the influence of confounding variables on the effect estimate [15]. A “treatment propensity model” is first fitted to the validation data, regressing an indicator (yes/no) of treatment use (dependent variable) on any measured variables that may be predictive of treatment use (independent variables), including the predictors of the prognostic model that is being evaluated [16]. This treatment propensity model is then used to estimate, for each individual in the validation set, the probability of receiving the treatment, based on his/her observed variables (risk profile). Following this, each individual is weighted by the inverse of their own probability of the actual treatment received [17], resulting in a distribution of risks in the validation set that resembles what would have been seen had treatments been randomly allocated, as shown by the similarity of the solid black line in Fig. 2a and the dashed black line in Fig. 2c. By excluding treated individuals after deriving weights, the resulting validation set should resemble the untreated target population, as seen in Fig. 2d. However, this will again result in a smaller effective sample size for the validation.
IPW is subject to a number of theoretical assumptions [15, 18, 19]. One example of a violation of these assumptions is practical non-positivity (i.e. it may be that in some risk strata no subjects received the treatment) [20], which may arise if a subset of individuals has a contraindication for treatment or when guidelines already recommend that individuals above a certain probability threshold should receive treatment. This can lead to individuals receiving extreme weights, resulting in biased and imprecise estimates of model performance [15]. In addition, problems can occur due to incorrect specification of the treatment propensity model, for example due to the presence of unmeasured confounders (predictors associated with both the outcome and the use of treatment in the validation set). Variants of the basic IPW procedure can be applied, such as weight truncation, which may improve the performance of this method in settings where the assumptions are violated [21].
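The weighting step can be made concrete with a minimal sketch. It assumes that the treatment propensities are already known, whereas in practice they would be estimated from a fitted treatment propensity model; the two risk strata, their sizes and their propensities are hypothetical values chosen for illustration. The example works with expected counts rather than simulated individuals and compares the mean untreated risk in the target population, in the untreated subset, and in the untreated subset after weighting.

```python
def ipw_weight(treated, propensity):
    """Inverse of the probability of the treatment actually received."""
    return 1.0 / propensity if treated else 1.0 / (1.0 - propensity)


def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical strata: (untreated risk, number of people, probability of treatment).
strata = [(0.05, 600, 0.2), (0.30, 400, 0.8)]

risks = [risk for risk, _, _ in strata]
all_counts = [n for _, n, _ in strata]
# Expected number of people left in each stratum after excluding the treated.
untreated_counts = [n * (1.0 - ps) for _, n, ps in strata]
# Each remaining (untreated) person is weighted by 1 / (1 - propensity).
reweighted_counts = [
    count * ipw_weight(False, ps)
    for count, (_, _, ps) in zip(untreated_counts, strata)
]

target_mean = weighted_mean(risks, all_counts)          # whole target population
excluded_mean = weighted_mean(risks, untreated_counts)  # exclusion only
ipw_mean = weighted_mean(risks, reweighted_counts)      # IPW, then exclusion
print(target_mean, excluded_mean, ipw_mean)
```

With exclusion alone the untreated subset has a lower mean risk than the target population, because high-risk individuals were treated more often; after weighting, the stratum sizes and hence the mean risk of the target population are recovered.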
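Weight truncation, mentioned above, can be sketched as follows. This is an illustrative implementation under stated assumptions, not the procedure of reference [21]: the function name is hypothetical, the percentile is taken by the simple nearest-rank rule, and the example weights are invented to mimic near non-positivity (two untreated individuals with treatment propensities of 0.9 and 0.99).

```python
def truncate_weights(weights, percentile=98):
    """Cap weights at the given (integer) percentile, nearest-rank definition."""
    ordered = sorted(weights)
    # Nearest rank = ceil(percentile * n / 100), computed with integer arithmetic.
    rank = (percentile * len(ordered) + 99) // 100
    cap = ordered[max(rank, 1) - 1]
    return [min(w, cap) for w in weights]


# Hypothetical weights for 100 untreated individuals: 98 with propensity 0.5,
# one with propensity 0.9 and one with propensity 0.99.
weights = [1.0 / (1.0 - 0.5)] * 98 + [1.0 / (1.0 - 0.9), 1.0 / (1.0 - 0.99)]
truncated = truncate_weights(weights, percentile=98)
print(max(weights), max(truncated))
```

Truncation limits the influence of the few individuals with extreme weights, which is expected to improve precision, but the truncated weights no longer fully rebalance the data.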
Model recalibration
The incidence of the predicted outcome may vary between development and validation data sets. If this is the case, the predictions made by the model will not, on average, match the outcome incidence in the validation data set [22]. As discussed in the section “Problems with ignoring treatment use in a validation study”, use of an effective treatment in a validation data set will lead to fewer outcome events and thus a lower incidence than there would have been had the validation set remained untreated. One approach to account for this would be to recalibrate the original model using the partially treated validation data set. In a logistic regression model, a derivative of the incidence of the outcome is captured by the intercept term in the model, and thus a simple solution would seem to be to re-estimate the model intercept using the validation data set [23, 24]. In doing this, the average predicted risk provided by the recalibrated model should then be equal to the (observed) overall outcome frequency in the validation set. Further details of this procedure are given in Table 1. Where treatment has been randomly allocated, intercept recalibration should indeed account for the risk-lowering effects, provided that the magnitude of the treatment effect does not vary depending on an individual’s risk and thus is constant over the entire predicted probability range. In non-randomized settings, where treatment use by definition is associated with participant characteristics, a simple intercept recalibration is unlikely to be sufficient due to interactions between treatment use and patient characteristics that are predictors in the model.
However, although recalibration may seem a suitable solution for modelling the effects of treatment, when applying recalibration, concerns should also be raised over the interpretation of the estimated performance of the model. Differences in outcome incidence between the development data set and validation data set may not be entirely attributable to the effects of treatment use. By recalibrating the model to adjust for differences in treatment use and effects, we simultaneously adjust for differences in case-mix between the development and validation set. As the aim of validation is to evaluate the performance of the original prognostic model, in this case in a treatment-naïve sample, recalibration may actually lead to an optimistic impression of the accuracy of predictions made by the original model in the validation set. For example, if the validation set included individuals with a notably greater prevalence of comorbidities and thus were more likely to develop the outcome, recalibration prior to validation could mask any inadequacies of the model when making predictions in this subset of high-risk individuals. Thus recalibration is not an appropriate solution to the problem.
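For illustration only, re-estimating the intercept can be sketched as a search for the shift on the logit scale that makes the mean predicted risk equal to the observed outcome frequency, which is the condition satisfied by the maximum likelihood intercept when the original linear predictor is kept as a fixed offset. The function names, the bisection search and the toy data are assumptions of this sketch and are not taken from Table 1.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def intercept_shift(predicted_risks, outcomes, tol=1e-10):
    """Shift added to every logit so that mean predicted risk = observed frequency.

    Assumes the observed outcome frequency lies strictly between 0 and 1.
    """
    target = sum(outcomes) / len(outcomes)
    linear_predictors = [logit(p) for p in predicted_risks]
    low, high = -20.0, 20.0
    while high - low > tol:  # bisection: mean prediction increases with the shift
        mid = (low + high) / 2.0
        mean_pred = sum(expit(lp + mid) for lp in linear_predictors) / len(linear_predictors)
        if mean_pred < target:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0


# Toy data: the original model predicts more events than were observed.
predicted = [0.2, 0.4, 0.6, 0.8]
observed = [0, 0, 1, 0]
shift = intercept_shift(predicted, observed)
recalibrated = [expit(logit(p) + shift) for p in predicted]
print(round(shift, 3), round(sum(recalibrated) / len(recalibrated), 3))
```

The single shift moves all predictions by the same amount on the logit scale, which is why it can only be adequate when the treatment effect is constant across the predicted probability range.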
Incorporation of treatment in the model
A more explicit way to deal with treatment use would be to update the prognostic model with treatment use added as a new predictor. If effective, treatment can actually be considered to be a missing predictor in the original developed model. However, unlike other predictors, when validating a model in a non-randomised data set, we cannot know whether a person in practice will indeed receive the treatment at the point of making a prediction. By adding a binary predictor for treatment use to the original prognostic model, one may aim to alleviate the misfit that results from the omission of this predictor, and get closer to the actual performance of the original model in the validation set, had individuals remained untreated.
There are a number of approaches to updating a model with a new predictor [22, 23, 25]. One option would be to incorporate an indicator for treatment on top of the prognostic model, keeping the original model coefficients fixed. However, in doing this we assume that there is no correlation between treatment use and the predictors in the model. Instead the model could be entirely refitted with the addition of an indicator term for treatment using the validation data set (for further details, see Table 1). It may be necessary to include statistical interaction terms in the updated model, where anticipated [26].
A challenge when considering this approach is the correct specification of the updated prediction model. Failure to correctly specify any interactions between treatment and other predictors in the validation set could mean that the effects of treatment are not completely taken into account. Furthermore, the addition of a term for treatment to the model that is to be validated may improve the performance beyond that of the original model due to the inclusion of additional predictive information. Thus, as with recalibration, we do not recommend this approach.
Outline of a simulation study
We assess the performance of different methods to account for the effects of treatment in fifteen scenarios using simulated data. Two of the methods described in the section “Methods to account for treatment use”, model recalibration and the incorporation of a term for treatment use in the model, are not evaluated, as their inferiority has already been discussed.
Details of the simulation study are provided in Table 2, which describes 15 scenarios that were studied. For each scenario, a development data set of 1000 individuals of whom all remained untreated throughout the study was simulated. A prognostic model was developed with two predictors using logistic regression analysis, specifying the model so it matched the data generating model. Fifteen validation sets of 1000 individuals were drawn using the same data generating mechanism as their corresponding development data sets, representing an ideal untreated validation set to estimate the model’s ability to predict untreated risks. Subsequently, 50% of the individuals in each validation set were simulated to receive a risk-lowering point treatment with a constant effect of a reduction in the outcome odds by 50%.
In scenarios 1, 3 and 4, an individual’s probability of receiving treatment was a function of their untreated risk of the outcome, representing observational data. In scenario 2, treatment was randomly allocated to individuals, simulating data from an RCT. In scenarios 1 and 3, there was a moderate positive association between risk and treatment allocation, and thus individuals with a more “risky” profile were more likely to receive treatment. In scenario 4 this association was large: treatment was allocated to most (95%) of the individuals with a predicted risk higher than 18%. In scenario 3, the relative treatment effect was allowed to increase with increasing risk. Using scenario 1 as a starting point, in scenarios 5–12, the effect of treatment on risk varied from strong to weak, and the proportion of individuals treated varied. In scenarios 13–15, an unobserved predictor with varying association (moderate negative, weak positive or strong positive) with the outcome was included in the data generating model.
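As a rough sketch of this kind of data generating mechanism, the code below simulates one partly treated validation set with two predictors and a point treatment that halves the outcome odds. It is not the R code of Additional file 1: the intercept, the coefficients, the standard normal predictors and the form of the treatment allocation function are all illustrative assumptions and do not correspond to any scenario in Table 2.

```python
import math
import random


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def simulate_validation_set(n=1000, odds_ratio=0.5, risk_dependent=True, seed=1):
    """Simulate one partly treated validation set (assumed coefficients)."""
    rng = random.Random(seed)
    intercept, b1, b2 = -1.5, 1.0, 1.0  # illustrative values, not from Table 2
    rows = []
    for _ in range(n):
        x1, x2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        lp = intercept + b1 * x1 + b2 * x2  # untreated linear predictor
        # Treatment probability: rises with risk, or a constant 0.5 (RCT-like).
        p_treat = expit(lp - intercept) if risk_dependent else 0.5
        treated = rng.random() < p_treat
        # A point treatment multiplies the outcome odds by `odds_ratio`.
        actual_lp = lp + math.log(odds_ratio) if treated else lp
        rows.append({
            "untreated_risk": expit(lp),
            "actual_risk": expit(actual_lp),
            "treated": treated,
            "outcome": rng.random() < expit(actual_lp),
        })
    return rows


cohort = simulate_validation_set()
share_treated = sum(row["treated"] for row in cohort) / len(cohort)
print(round(share_treated, 2))
```

Because the allocation function is centred on the average linear predictor, roughly half of the simulated individuals are treated in both the risk-dependent and the RCT-like setting.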
The performance of the prognostic model was estimated in each of these data sets, first ignoring the effects of treatment, and again either by first excluding treated individuals from the analysis, or by applying IPW methods (as specified in Table 1). We applied standard IPW and IPW with weight truncation (at the 98th percentile). For scenarios 1–12, the treatment propensity model was correctly specified; for scenarios 13–15, the unobserved predictor was (by definition) omitted from the treatment propensity model.
In all simulated validation sets and for all methods being applied, performance was estimated in terms of the c-index (area under the ROC curve) and observed:expected (O:E) ratio. For scenarios 1–4 and 13–15 calibration plots were constructed. For IPW methods, calculated IPW weights were used to estimate weighted statistics (see Additional file 1 for further details). In order to obtain stable estimates of the c-index and O:E ratio, we repeated the process of data generation, model development and validation 10,000 times, calculating the mean and standard deviation (SD) of the distribution of the 10,000 estimates. Calibration plots were based on sets of 1 million individuals (equivalent to combining results from 1000 repeats in data sets with 1000 individuals) for each scenario. R code to reproduce the analyses can be found in Additional file 1.
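One common way to compute weighted versions of these two measures is sketched below. The exact formulation used in this study is given in Additional file 1 and may differ; the function names and the four-person toy data set are assumptions made for illustration. The weighted O:E ratio divides the weighted number of observed events by the weighted sum of predicted risks, and the weighted c-index weights each event/non-event pair by the product of the two individual weights.

```python
def weighted_oe_ratio(outcomes, predicted_risks, weights):
    """Weighted observed events divided by weighted expected events."""
    observed = sum(w * y for y, w in zip(outcomes, weights))
    expected = sum(w * p for p, w in zip(predicted_risks, weights))
    return observed / expected


def weighted_c_index(outcomes, predicted_risks, weights):
    """Weighted share of event/non-event pairs ranked correctly (ties count 0.5)."""
    cases = [(p, w) for y, p, w in zip(outcomes, predicted_risks, weights) if y == 1]
    controls = [(p, w) for y, p, w in zip(outcomes, predicted_risks, weights) if y == 0]
    concordant = total = 0.0
    for p_case, w_case in cases:
        for p_ctrl, w_ctrl in controls:
            pair_weight = w_case * w_ctrl
            total += pair_weight
            if p_case > p_ctrl:
                concordant += pair_weight
            elif p_case == p_ctrl:
                concordant += 0.5 * pair_weight
    return concordant / total


# Toy data: four untreated individuals with hypothetical IPW weights.
outcomes = [0, 0, 1, 1]
predicted = [0.10, 0.40, 0.35, 0.80]
ipw_weights = [1.0, 1.0, 2.0, 1.0]

print(weighted_c_index(outcomes, predicted, [1.0] * 4))
print(weighted_c_index(outcomes, predicted, ipw_weights))
print(weighted_oe_ratio(outcomes, predicted, ipw_weights))
```

With unit weights both functions reduce to the usual unweighted c-index and O:E ratio. The pairwise loop is quadratic in the number of individuals, which is acceptable for a sketch but slow for very large data sets.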
Results
Results of the simulation study are presented below. A summary of the estimated performance measures in each scenario can be found in Tables 3 and 4, and calibration plots for scenarios 1–4 and 13–15 are depicted in Figs. 3 and 4, respectively.
Results were derived from development and validation sets of 1000 individuals. Performance estimates are the means (and standard deviations) of the distribution of O:E ratios from 10,000 simulation replicates. See Table 2 for details of the scenarios.
Results were derived from development and validation sets of 1000 individuals. Performance estimates are the means (and standard deviations) of the distribution of c-indexes from 10,000 simulation replicates. See Table 2 for details of the scenarios.
Ignore treatment
Ignoring the effects of treatment resulted, as expected, in predicted risks that were always greater than the observed outcome frequencies, suggesting poor model calibration in all scenarios. This was exacerbated in non-randomised settings, in which there appeared to be greater miscalibration in high-risk individuals. When treatment allocation was non-random, ignoring treatment led to an underestimation of the c-index by up to 0.08 (scenario 3), whereas the c-index did not noticeably change in the RCT scenario. As expected, when either the effectiveness of treatment or the proportion of individuals treated increased, both the O:E ratio and c-index were more severely underestimated.
Method 1: Exclude treated individuals
Excluding treated individuals resulted in calibration measures that appeared to reflect those of the untreated target population in most scenarios. However, as Fig. 3 shows, use of this approach when treatment allocation is dependent on an individual’s risk results in a loss of information about calibration in high-risk individuals. When treatment allocation was random (scenario 2), this approach yielded a correct estimate of the c-index. As treatment allocation became increasingly associated with an individual’s risk across scenarios, this method yielded lower estimates for discrimination than observed in the untreated set, due to the selective exclusion of high-risk individuals, and consequently a narrower case-mix. The estimates of the c-index and O:E ratio were constant as the treatment effect and proportion treated changed across scenarios 5–12. In the presence of a strong unmeasured predictor of the outcome associated with treatment use (scenarios 14–15), exclusion of treated individuals resulted in an underestimation of the performance of the model. In addition, in all scenarios the precision of estimates of both the O:E ratio and c-index decreased due to the reduction in effective sample size.
Method 2: Inverse probability weighting
Across all scenarios, IPW alone did not improve calibration, compared to when treatment was ignored, whereas IPW followed by the exclusion of treated individuals provided correct estimates for calibration. IPW alone or followed by the exclusion of treated individuals improved estimates of the c-index in all scenarios where the assumptions of positivity and no unobserved confounding were met. In scenario 4, where treatment allocation was determined by a strict risk threshold and thus the assumption of positivity was violated, IPW was ineffective, and resulted in the worst estimates of discrimination across all methods. In addition, the extreme weights calculated in scenario 4 led to very large standard errors. In scenarios 13–15, the presence of an unobserved confounder led to the failure of IPW to provide correct estimates of the c-index. Weight truncation at the 98th percentile increased precision, but was less effective in correcting the c-index for the effects of treatment.
Discussion
We showed that when externally validating a prognostic model that was developed for predicting “untreated” outcome risks, treatment use in the validation set may substantially affect the performance of the model in that validation set. Treatment use is problematic, if ignored, regardless of how treatment has been allocated, though more challenging to circumvent when non-randomized. While the risk-lowering effect of treatment seems to have little effect on model discrimination in randomised trial data, the model will appear to systematically overestimate risks (miscalibration). This effect worsens with greater dependency of treatment use on patient characteristics (e.g. baseline risk).
We present simple methods that could be considered when attempting to take the effects of treatment use into account. While the use of IPW in prediction model research is uncommon, the rationale behind using IPW in settings with non-randomized treatments is motivated by its use to remove the influence of treatment on causal (risk) factor-outcome associations [27, 28]. Although the use of IPW prior to the exclusion of treated individuals is a promising solution in data where treatments are non-randomly allocated, it should not be used when there are severe violations of the underlying assumptions, e.g. in the presence of non-positivity (where some individuals had no chance of receiving treatment), or when there is an unobserved confounder, strongly associated with both the outcome and treatment use. There is thus a need to explore alternative methods to IPW to account for the effects of treatment use when validating a prognostic model in settings with non-random treatment use.
Although the results of our simulations support the expected behaviour of the methods described in the section “Methods to account for treatment use”, some findings warrant further discussion. First, although excluding treated individuals when treatment use is non-random theoretically results in incorrect estimates of model performance, in our simulations, the impact on model discrimination was small in most scenarios. However, when the association between an individual’s risk profile and the chance of being treated increased (scenario 4), the selection bias due to excluding treated individuals resulted in a large decrease in the c-index, as expected. Second, in simulated scenarios in which an unobserved confounder of the treatment-outcome relation was present, the performance of the model greatly decreased after excluding treated individuals, with or without IPW. This is likely due to the selective exclusion of individuals with a high value for the strongly predictive unobserved variable. This results in a narrower case-mix distribution, and consequently lower model discrimination, as well as miscalibration due to the exclusion of a strong predictor of the outcome.
While it is unclear to what extent treatment use has affected existing prognostic model validation studies, findings from a systematic review of cardiovascular prognostic model studies indicate that changes in treatment use after baseline measurements in a validation study are rarely considered in the analysis [29]. While a number of studies excluded prevalent treatment users from their analyses, the initiation of risk-lowering interventions, such as statins, revascularization procedures and lifestyle modifications during follow-up was not taken into account. An equally alarming finding was that very few validation studies even reported information about treatment use during follow-up, raising concerns over the interpretation of the findings of these studies. Based on the findings of the present study, we suggest that information about the use of effective treatments both at the study baseline and during follow-up should be reported in future studies.
It must be noted that not all prediction model validation studies require the same considerations for treatment use. Although we have discussed prognostic models used for predicting the risk of an outcome without treatment, sometimes prognostic models are developed for making predictions in both treated and untreated individuals. If, for example, the treatments used in the validation set are a part of usual care, and are present in the target population for the model, then differences in the use of these treatments between the development and validation sets should be viewed as a difference in case-mix and not as an issue that we need to remove. Furthermore, if the model adequately incorporates relevant treatments (e.g. through the explicit modelling of treatment use), differences in treatment use between the development and validation sets can again be viewed as a difference in case-mix. In the event that treatments have not been modelled (e.g. because a new treatment has become readily available since the development of the model), the model could be updated through recalibration, or better yet by including a term for treatment in the updated model, leading to a completely new model, which in turn would require validation. Researchers must therefore first identify which treatments used in a validation data set could bias estimates of model performance, if ignored.
There are limitations to the guidance that we provide. First, we do not present a complete evaluation of all possible methods across a range of different settings, which would require at least an extensive simulation study. We argue, however, that the logical argumentation provided for each method forms a good starting point for further investigation. Furthermore, the list of methods that we present is by no means exhaustive and we encourage the consideration and development of new approaches for more complex settings, such as time-to-event settings, and where limited sample sizes pose a challenge. Second, we assumed for simplicity that a model has been developed in an untreated data set. In reality, it is likely that a model has been developed also in a partially treated set. The considerations for validation then remain the same, but it should be noted that failure to properly account for the effects of treatment in the development of a model can lead to a model that underestimates untreated risks [13]. Third, for simplicity we considered single point treatments in our simulated examples. Patterns of treatment use in reality are often complex, with individuals receiving multiple non-randomized treatments, even in RCTs. Finally, we also recognize that while this paper discusses the validation of prognostic models, the same considerations for treatment use can, in some circumstances, be relevant to diagnostic studies (i.e. where treatment between index testing and outcome verification could lead to similar and even more serious problems).
Conclusion
When a previously developed model for predicting risks without treatment is validated in another data set, failure to properly account for (effective) treatment use in that validation sample will likely lead to apparently poor performance of the prediction model. Measures should therefore be taken to remove the effects of treatment use. When validating a model with data in which treatments have been randomly allocated, simply excluding treated individuals is sufficient, at the cost of a loss of precision. In observational studies, where treatment allocation depends on patient characteristics or risk, inverse probability weighting followed by the exclusion of treated individuals can provide correct estimates of the actual performance of the model in its target population, provided that the assumptions of inverse probability weighting are met.
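The recommended approach for non-random treatment use can be sketched in a few lines of code. The example below is a minimal illustration in Python, not the R code from Additional file 1. All function names, the simulated data, and the chosen coefficients (an odds ratio of 0.5 for treatment, and a single covariate that drives both risk and treatment use) are assumptions made for this sketch. It fits a propensity model, weights each untreated individual by the inverse of their probability of remaining untreated, excludes treated individuals, and then computes a weighted O:E ratio and c-index.

```python
import math
import random


def expit(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(x, t, iters=25):
    """Fit logit P(t = 1 | x) = b0 + b1 * x by Newton-Raphson (one covariate)."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, ti in zip(x, t):
            p = expit(b0 + b1 * xi)
            v = p * (1.0 - p)
            g0 += ti - p            # score for the intercept
            g1 += (ti - p) * xi     # score for the slope
            h00 += v
            h01 += v * xi
            h11 += v * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def weighted_oe(y, pred, w):
    """Weighted observed:expected ratio."""
    observed = sum(wi * yi for wi, yi in zip(w, y))
    expected = sum(wi * pi for wi, pi in zip(w, pred))
    return observed / expected


def weighted_c_index(y, pred, w):
    """Weighted concordance over all (event, non-event) pairs; ties count 0.5."""
    events = [(p, wi) for yi, p, wi in zip(y, pred, w) if yi == 1]
    nonevents = [(p, wi) for yi, p, wi in zip(y, pred, w) if yi == 0]
    num = den = 0.0
    for pe, we in events:
        for pn, wn in nonevents:
            pair_w = we * wn
            den += pair_w
            if pe > pn:
                num += pair_w
            elif pe == pn:
                num += 0.5 * pair_w
    return num / den


def simulate(n=3000, seed=1):
    """Hypothetical validation set: treatment use depends on the risk factor x."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        pred = expit(-1.0 + x)  # model's predicted risk without treatment
        treated = 1 if rng.random() < expit(-0.5 + 0.8 * x) else 0
        # Assumed treatment effect: odds ratio of 0.5 for the outcome
        risk = expit(-1.0 + x + treated * math.log(0.5))
        y = 1 if rng.random() < risk else 0
        rows.append((x, pred, treated, y))
    return rows


rows = simulate()
x, pred, treated, y = (list(col) for col in zip(*rows))

# Naive validation: treatment use is ignored
oe_naive = weighted_oe(y, pred, [1.0] * len(y))

# IPW followed by exclusion of treated individuals:
# untreated individuals get weight 1 / P(untreated | x)
b0, b1 = fit_logistic(x, treated)
untreated = [
    (yi, pi, 1.0 / (1.0 - expit(b0 + b1 * xi)))
    for xi, pi, ti, yi in rows
    if ti == 0
]
y_u, pred_u, w_u = (list(col) for col in zip(*untreated))
oe_ipw = weighted_oe(y_u, pred_u, w_u)
c_ipw = weighted_c_index(y_u, pred_u, w_u)

print("O:E ignoring treatment:", round(oe_naive, 2))
print("O:E after IPW and exclusion:", round(oe_ipw, 2))
print("c-index after IPW and exclusion:", round(c_ipw, 2))
```

In this sketch the propensity model is correctly specified and every individual has a non-zero probability of remaining untreated, so the assumptions of IPW hold by construction. With real data, the weights should be inspected for extreme values, which may indicate non-positivity.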
Abbreviations
CVD: Cardiovascular disease
IPW: Inverse probability weighting
LP: Linear predictor
O:E: Observed:expected
OR: Odds ratio
PS: Propensity score
RCT: Randomized trial
ROC: Receiver operator characteristic
SD: Standard deviation
References
1. Moons KG, et al. Risk prediction models: II. External validation, model updating, and impact assessment. Heart. 2012;98(9):691–8.
2. Moons KG, et al. Prognosis and prognostic research: what, why, and how? BMJ. 2009;338:b375.
3. Steyerberg EW, et al. Prognosis research strategy (PROGRESS) 3: prognostic model research. PLoS Med. 2013;10(2):e1001381.
4. Collins GS, et al. External validation of multivariable prediction models: a systematic review of methodological conduct and reporting. BMC Med Res Methodol. 2014;14:40.
5. Harrell FE Jr, et al. Regression modelling strategies for improved prognostic prediction. Stat Med. 1984;3(2):143–52.
6. Harrell FE Jr, Lee KL, Mark DB. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med. 1996;15(4):361–87.
7. Justice AC, Covinsky KE, Berlin JA. Assessing the generalizability of prognostic information. Ann Intern Med. 1999;130(6):515–24.
8. Debray TP, et al. A new framework to enhance the interpretation of external validation studies of clinical prediction models. J Clin Epidemiol. 2015;68(3):279–89.
9. Vergouwe Y, Moons KG, Steyerberg EW. External validity of risk models: use of benchmark values to disentangle a case-mix effect from incorrect coefficients. Am J Epidemiol. 2010;172(8):971–80.
10. Riley RD, et al. External validation of clinical prediction models using big datasets from e-health records or IPD meta-analysis: opportunities and challenges. BMJ. 2016;353:i3140.
11. Liew SM, Doust J, Glasziou P. Cardiovascular risk scores do not account for the effect of treatment: a review. Heart. 2011;97(9):689–97.
12. Muntner P, et al. Comment on the reports of overestimation of ASCVD risk using the 2013 AHA/ACC risk equation. Circulation. 2014;129(2):266–7.
13. Groenwold RH, et al. Explicit inclusion of treatment in prognostic modelling was recommended in observational and randomised settings. J Clin Epidemiol. 2016;78:90–100.
14. Austin PC, Steyerberg EW. Interpreting the concordance statistic of a logistic regression model: relation to the variance and odds ratio of a continuous explanatory variable. BMC Med Res Methodol. 2012;12:82.
15. Cole SR, Hernán MA. Constructing inverse probability weights for marginal structural models. Am J Epidemiol. 2008;168(6):656–64.
16. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70(1):41–55.
17. Robins JM, Hernán MA, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology. 2000;11(5):550–60.
18. Austin PC, Stuart EA. Moving towards best practice when using inverse probability of treatment weighting (IPTW) using the propensity score to estimate causal treatment effects in observational studies. Stat Med. 2015;34(28):3661–79.
19. Pfeffermann D. The role of sampling weights when modeling survey data. In: International Statistical Review/Revue Internationale de Statistique; 1993. p. 317–37.
20. Petersen ML, et al. Diagnosing and responding to violations in the positivity assumption. Stat Methods Med Res. 2012;21(1):31–54.
21. Lee BK, Lessler J, Stuart EA. Weight trimming and propensity score weighting. PLoS One. 2011;6(3):e18174.
22. Steyerberg E. Clinical prediction models: a practical approach to development, validation, and updating. New York: Springer Science & Business Media; 2008.
23. Janssen KJ, et al. Updating methods improved the performance of a clinical prediction model in new patients. J Clin Epidemiol. 2008;61(1):76–86.
24. van Houwelingen HC. Validation, calibration, revision and combination of prognostic survival models. Stat Med. 2000;19(24):3401–15.
25. Su TL, et al. A review of statistical updating methods for clinical prediction models. In: Statistical methods in medical research; 2015.
26. van Klaveren D, et al. Estimates of absolute treatment benefit for individual patients required careful modeling of statistical interactions. J Clin Epidemiol. 2015;68(11):1366–74.
27. Wang Y, Fang Y. Adjusting for treatment effect when estimating or testing genetic effect is of main interest. J Data Sci. 2011;9(1):127–38.
28. Spieker AJ, Delaney JA, McClelland RL. Evaluating the treatment effects model for estimation of cross-sectional associations between risk factors and cardiovascular biomarkers influenced by medication use. Pharmacoepidemiol Drug Saf. 2015;24(12):1286–96.
29. Pajouheshnia R, et al. Treatment use is not adequately addressed in prognostic model research: a systematic review (submitted).
Funding
Rolf Groenwold receives funding from the Netherlands Organisation for Scientific Research (project 917.16.430). Karel G.M. Moons receives funding from the Netherlands Organisation for Scientific Research (project 9120.8004 and 918.10.615). Johannes B. Reitsma is supported by a TOP grant from the Netherlands Organisation for Health Research and Development (ZonMw) entitled “Promoting tailored healthcare: improving methods to investigate subgroup effects in treatment response when having multiple individual participant datasets” (grant number: 91215058). The funding bodies had no role in the design, conduct or decision to publish this study and there are no conflicts of interest to declare.
Availability of data and materials
All data and analyses can be reproduced using the R code provided in Additional file 1.
Author information
Contributions
All authors contributed to the design of the study. RP conducted the analyses and drafted the first version of the manuscript. All authors were involved in the drafting of the manuscript and approved the final version.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional file
Additional file 1:
R code for data generation, methods and analyses (“codefile.R”). (R 6 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Pajouheshnia, R., Peelen, L.M., Moons, K.G.M. et al. Accounting for treatment use when validating a prognostic model: a simulation study. BMC Med Res Methodol 17, 103 (2017). https://doi.org/10.1186/s12874-017-0375-8
Keywords
 Prognostic Model
 Inverse Probability Weighting (IPW)
 Validation Data Set
 Model Discrimination
 Untreated Risk