This study is reported according to the TRIPOD guidelines for model development and the checklist is provided as Additional file 1 [14].
Study design
To exemplify the assessment of predictorrisk relations in practice, we reanalyzed a large registry study previously used to validate and update existing cardiovascular risk prediction models [15, 16] which represented our maximum (full) data availability. We simulated poorer data availabilities by gradually reducing the sample size by random subsampling.
Three authors were given information on the available predictors, outcomes and the expected event rate and received development datasets of various sample sizes. These three data analysts represented different modeling cultures according to their personal experience and training and developed models following their favorite ‘modeling paradigm’. In particular, CW (representing statistical modeling with generalized additive models), GD (ML with neural networks), and AA (ML with boosting) had to accomplish the following two tasks independently:

1)
To develop an analysis strategy following their favorite paradigm. Depending on sample size amendments to the analysis strategy were allowed.

2)
To develop prediction models and to provide a prediction tool for each of the models to facilitate individual calculation of predictions as usually required for bedside use.
Using the provided prediction tools, GH and DD assessed the resulting predictorrisk relation and evaluated predictive accuracy in an independent test set.
Study population
Our pseudonymized database comprised electronic health records from the Austrian preventive health screening program (1/2009–3/2014) including measurements on the predictors included in the Framingham 2008 CVD risk model and other known or assumed CVD risk predictors. These data were linked to data on hospitalizations (1/2008–3/2015) and causes of death (1/2009–3/2015) from the same individuals to determine if a CVD event had occurred after the first health screening. Data preparation steps have been reported previously [15, 16], and relevant additions to the current work are detailed in Additional file 2: Appendix 1. We applied the inclusion criteria of the Framingham 2008 CVD risk model, where individuals between 30 and 74 years who had no indication of CVD in the year prior to the health screening were included [1, 15, 16]. Moreover, we required participants to have at least 1 year of followup. The study protocol and the exempt from the need to obtain informed consent was approved by the Ethics Committee of the Medical University of Vienna (ECS 1232/2014).
Outcome
We used the occurrence of any cardiovascular event within 1 year after the health screening as our outcome variable. Cardiovascular events were defined in line with the Framingham 2008 CVD risk model as the diagnosis of any of: coronary heart disease (coronary death, myocardial infarction, coronary insufficiency, angina), cerebrovascular events (ischemic stroke, hemorrhagic stroke, transient ischemic attack), peripheral artery disease (intermittent claudication), or heart failure [1]. Appropriate ICD10 codes (10th revision of the International Classification of Diseases) for CVD were used to identify cases [15, 16]. Further information on the identification of CVD in our study cohort is detailed in Additional file 2: Appendix 1.
Predictors
Similar to the Framingham 2008 CVD risk model, we considered the following predictors: sex, age, total cholesterol (mg/dl), HDL cholesterol (mg/dl), systolic blood pressure (BP, mmHg), hypertensive drug intake (yes; no), diabetes (yes; no), and smoking status (yes; no). Moreover, the electronic health records also contained several other variables which we considered potentially relevant for cardiovascular risk prediction according to domain experts. These were: blood glucose (mg/dl), triglycerides (mg/dl), diastolic BP (mmHg), BMI score (kg/m^{2}), glucose in urine (positive; negative), protein in urine (positive; negative), waist circumference (categorical; too large: waist circumference ≥ 102 cm for men or ≥ 88 cm for women; normal: < 102 cm for men or < 88 cm for women), selfassessed physical activity (none; occasionally; regularly), ratio of total cholesterol and HDL cholesterol, BP classes (categorical; ideal: < 120/80; normal: 120–129/80–85; still normal: 130–139/85–89; hypertension stage 1: 140–179/90–109; hypertension stage 2: ≥180/110; isolated systolic hypertension: ≥140/< 90), and BMI classes (categorical; < 18.5; 18.5–24.9; 25.0–29.9; 30.0–34.9; 35.0–39.9; ≥40.0).
Starting with 2,159,616 individuals in the data base, we applied the inclusion criteria of Framingham 2008 CVD risk model and excluded individuals with missing values in any of the Framingham predictors or other potential predictors or outcome (Fig. 1). The resulting data set comprised 1,543,400 individuals with 17 predictors, and was randomly split into a training set (1,028,739 individuals) and a test set (514,661 individuals).
Data availability: varying the sample size
To mimic analysis scenarios with various sample sizes available for model development, we defined five levels of data availability as follows: the full dataset (N; n = 1,028,739), half of the dataset (N/2; n ≈ 513,370), onetenth (N/10; n ≈ 102,874), onetwentyfifth (N/25; n ≈ 41,150), and onehundredth of the dataset (N/100; n ≈ 10,287). For data availabilities of N/2, N/10, N/25 and N/100, the training set was split randomly into the corresponding number of approximately equally sized disjoint subsets. Hence, this resulted in two subsets at a data availability of N/2, and accordingly in ten, 25, and 100 subsets at N/10, N/25 and N/100. The full dataset and each of the 137 subsets were treated just like they were separate studies and served as training sets for model development.
Data analysis
Statistical modeling with generalized additive models (GAM)
Logistic regression models are used to relate a binary outcome like the occurrence of a CV event to a set of predictors [17]. Similarly to linear regression, predictor values are combined through their weighted sum resulting in a socalled ‘linear predictor’. The linear predictor is then transformed by the logistic function to produce a probability that is naturally bounded between 0 and 1. Generalized additive models (GAM) [18] transform continuous predictors via splines [2, 19] allowing for a smooth, flexible and nonlinear representation of these predictors in the linear predictor [20]. To introduce even more flexibility, interaction (product) terms can be added to the model. Model building may also include stepwise variable selection to reduce the number of estimated coefficients in a model [3, 21].
ML with neural networks
Neural networks connect predictors to the outcome through a network [5]. The edges of this network are the socalled weights similar to coefficients in a logistic regression model. A socalled singlelayer neural network consists of one layer, the ‘input layer’ of predictors, and is equivalent to a logistic regression model (SLNNLR). The idea can be extended by introducing a second, ‘hidden’ layer with ‘hidden units’ between the input layer and the outcome (multilayer neural network, MLNN). This introduces more flexibility and the possibility of complex interactions between predictors [5]. The number of hidden units can either be specified in advance or optimized using crossvalidation.
ML with boosting
Boosting is a general concept in ML in which a sequence of weak models is estimated such that each additional element in the sequence improves on the inaccuracies of its predecessors [22]. With extreme gradient boosted trees (XGBoost), simple decision trees are used as the elements in that sequence [23]. The final prediction is the average of the predictions made by all trees. The number of trees, the number of branches in each tree and other characteristics of the trees are socalled hyperparameters of the model.
Predictive performance as requirement for interpretability
Predictive performance
The predictive performance of the final prediction models was assessed in the test set by the Brier score measuring the accuracy of prediction [24], and by measures of discrimination and calibration. As age is the most important predictor of cardiovascular events [16], the Brier score was also computed for different ages. Discrimination was quantified by the discrimination slope [25], the area under the receiver operating characteristic curve (AUROC) and the area under the precisionrecall curve (AUPRC). Calibration was assessed visually by grouping the predicted probabilities by their permilles (1000quantiles) and plotting a loess smoother through the 1000 observed risks corresponding to these 1000 groups defined by the permilles.
Agreement of predictions
Spearman’s rank correlation coefficients between predictions in the test set obtained from the different analytical strategies were computed pairwisely. For data availabilities of N/2, N/10, N/25 and N/100, the correlation coefficients were averaged at each level of data availability.
Evaluation criterion: assessing the predictorrisk relation
We chose partial dependence plots (PDPs) and individual conditional expectation (ICE) plots as modelagnostic techniques to illustrate the direct effect of each predictor on the predictions of a model [9, 10, 23]. ICE plots show how predictions of a model result from varying the values of one predictor, keeping all other predictors fixed. For example, to construct an ICE plot for cholesterol, we used the combination of values of all other predictors observed for a particular patient, varied cholesterol across its observed range, and connected the resulting risk predictions. This was repeated for all patients of a reference population. PDP plots were then computed by averaging the ICE predictions at each cholesterol value and connecting the averages. In some PDP plots we fixed some important predictors at predefined values, e.g., we set age to 40, 50, 60 and 70 years, and sex to ‘female’ or ‘male’.
To reduce the computational burden, we generated a reference population of 10,000 individuals by randomly selecting 1000 men and 1000 women from each of five age groups (defined as 30–38; 39–47; 48–56; 57–66; 65–74 years) of the full training set. We then assigned weights to each sexage group according to the corresponding prevalence in the training set. We used these weights when averaging predictions for PDPs.