### Models I and II: Pooled Cohort Equations and Cox PH with all two-way interactions

The PCEs are four Cox PH-based models, one for each race and sex group (white male, white female, black male, black female). Cox PH models the probability that an individual experiences the event during a small time interval, given that the individual is event-free at the beginning of that interval [1]; this quantity is known as the hazard rate. Specifically, the hazard function can be expressed as follows:

$$\lambda_i\left(t\right)=\lambda_0\left(t\right)\exp\left(\beta_1 X_{i1}+\dots+\beta_p X_{ip}\right)=\lambda_0\left(t\right)\exp\left(\mathbf{X}_i^T\boldsymbol{\beta}\right),$$

(1)

where \(t\) is the survival time, \(\lambda_0\left(t\right)\) is the baseline hazard at time \(t\), \(\mathbf{X}_i=\left[X_{i1},\dots,X_{ip}\right]^T\) contains the covariates for individual \(i\), and \(\boldsymbol{\beta}=\left[\beta_1,\dots,\beta_p\right]^T\) is the regression coefficient vector. The hazard function consists of two parts: the baseline hazard \(\lambda_0\left(t\right)\) and a relative risk function \(\exp\left(\mathbf{X}_i^T\boldsymbol{\beta}\right)\). Cox PH assumes that the effect of each covariate (\(\boldsymbol{\beta}\) in the equation) is constant over time. The estimate of \(\boldsymbol{\beta}\) is obtained by maximizing the Cox log-partial likelihood defined below:

$$l\left(\boldsymbol{\beta}\right)=\sum_{i:\Delta_i=1}\left(\mathbf{X}_i^T\boldsymbol{\beta}-\log\sum_{j:Y_j\ge Y_i}\exp\left(\mathbf{X}_j^T\boldsymbol{\beta}\right)\right),$$

(2)

where \(\Delta_i\) is the event indicator for individual \(i\) and \(Y_j\) is the follow-up time for individual \(j\).
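As an illustration, the log-partial likelihood of Eq. (2) can be computed directly from its definition. The sketch below is ours (function name and toy inputs are hypothetical), written in plain NumPy without ties handling:

```python
import numpy as np

def cox_partial_log_likelihood(X, beta, time, event):
    """Log-partial likelihood of Eq. (2).

    X: (n, p) covariate matrix; beta: (p,) coefficients;
    time: (n,) follow-up times Y; event: (n,) indicators Delta.
    """
    eta = X @ beta                      # linear predictors X_i^T beta
    ll = 0.0
    for i in range(len(time)):
        if event[i] == 1:               # sum over individuals with Delta_i = 1
            at_risk = time >= time[i]   # risk set {j : Y_j >= Y_i}
            ll += eta[i] - np.log(np.exp(eta[at_risk]).sum())
    return ll
```

With all coefficients at zero, each event simply contributes minus the log of its risk-set size, which provides a quick sanity check of an implementation.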

In the PCEs, seven predictors were selected based on demonstrated statistical utility using prespecified criteria [14]: age at baseline, systolic blood pressure (SBP), history of diabetes, treatment for hypertension, current smoking, high-density lipoprotein cholesterol (HDL-C), and total cholesterol. Interactions between baseline age and each of the other predictors were tested, and only those with significant *p*-values (< 0.05) were kept in the model. The PCEs demonstrated acceptable performance in the derivation samples, with C-statistics for 10-year risk prediction of 0.80 in white females, 0.76 in white males, 0.81 in black females, and 0.70 in black males under 10 × 10 cross-validation [14].

To capture more complex relationships between the predictors and the ASCVD outcome, the Cox PH-TWI model includes all two-way interactions of the seven predictors for each race and sex group. We then retained only the interaction terms with significant *p*-values for each group.
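Constructing the full set of two-way interactions amounts to appending the product of every pair of predictor columns to the design matrix. A minimal sketch (our own helper; the significance filtering via the fitted model's *p*-values is not shown):

```python
from itertools import combinations
import numpy as np

def add_two_way_interactions(X, names):
    """Append the product of every pair of columns to X.

    X: (n, p) design matrix; names: list of p predictor names.
    Returns the expanded matrix and the expanded name list.
    """
    cols = [X]
    new_names = list(names)
    for a, b in combinations(range(X.shape[1]), 2):
        cols.append((X[:, a] * X[:, b])[:, None])     # interaction column
        new_names.append(f"{names[a]}:{names[b]}")
    return np.hstack(cols), new_names
```

With seven predictors this yields C(7, 2) = 21 interaction terms, for 28 columns in total before any terms are dropped.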

### Models III and IV: Deepsurv and Cox-nnet

Deepsurv and Cox-nnet are both adaptations of the standard Cox PH model [10]. Instead of assuming a linear relationship between covariates and the log-hazard, Deepsurv and Cox-nnet automatically learn a non-linear relationship between risk factors and an individual's risk of failure through linear (multi-layer perceptron) and non-linear (activation function) transformations. Specifically, the log-risk function \(\mathbf{X}_i^T\boldsymbol{\beta}\) in Eq. (1) is replaced by the output of a neural network, \(h_{w,\boldsymbol{\beta}^*}\left(\mathbf{X}_i\right)\), where \(\boldsymbol{\beta}^*\) contains the weights of the last hidden layer and \(w\) the weights of the other hidden layers (see Fig. 1A):

$$h_{w,\boldsymbol{\beta}^*}\left(\mathbf{X}_i\right)=\mathbf{G}\left(\mathbf{W}\mathbf{X}_i+\mathbf{b}\right)^T\boldsymbol{\beta}^*.$$

The neural network maximizes a log-partial likelihood analogous to that of the standard Cox model by tuning the parameters \(\mathbf{W}\) and \(\boldsymbol{\beta}^*\):

$$l\left(\mathbf{W},\boldsymbol{\beta}^*\right)=\sum_{i:\Delta_i=1}\left(h_{w,\boldsymbol{\beta}^*}\left(\mathbf{X}_i\right)-\log\sum_{j:Y_j\ge Y_i}\exp\left(h_{w,\boldsymbol{\beta}^*}\left(\mathbf{X}_j\right)\right)\right).$$

Cox-nnet was proposed to handle high-dimensional features, especially in genomic studies. To avoid overfitting, Cox-nnet introduces a ridge regularization term, extending the log-partial likelihood in Eq. (2) as follows:

$$l\left(\mathbf{W},\boldsymbol{\beta}^*\right)=\sum_{i:\Delta_i=1}\left(\mathbf{G}\left(\mathbf{W}\mathbf{X}_i+\mathbf{b}\right)^T\boldsymbol{\beta}^*-\log\sum_{j:Y_j\ge Y_i}\exp\left(\mathbf{G}\left(\mathbf{W}\mathbf{X}_j+\mathbf{b}\right)^T\boldsymbol{\beta}^*\right)\right)+\lambda\left(\parallel\boldsymbol{\beta}^*\parallel_2+\parallel\mathbf{W}\parallel_2\right).$$
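The penalized objective above can be sketched in NumPy for a single hidden layer. This is our own illustrative implementation, not the Cox-nnet code: the activation \(\mathbf{G}\) is taken as tanh for concreteness, and the ridge penalty is subtracted because the log-partial likelihood is being maximized:

```python
import numpy as np

def coxnnet_objective(X, time, event, W, b, beta_star, lam=0.01):
    """Penalized log-partial likelihood with a one-hidden-layer risk score.

    The log-risk for individual i is h_i = G(W X_i + b)^T beta*,
    with G = tanh used here as a stand-in activation.
    """
    H = np.tanh(X @ W.T + b)            # hidden activations, shape (n, m)
    h = H @ beta_star                   # log-risk scores h_i
    ll = 0.0
    for i in range(len(time)):
        if event[i] == 1:
            at_risk = time >= time[i]   # risk set {j : Y_j >= Y_i}
            ll += h[i] - np.log(np.exp(h[at_risk]).sum())
    penalty = lam * (np.sum(beta_star ** 2) + np.sum(W ** 2))  # ridge terms
    return ll - penalty                 # maximize; penalty shrinks weights
```

In practice the weights would be optimized by gradient ascent on this objective rather than evaluated at fixed values as here.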

In addition to the L2 regularizer, Cox-nnet also allows dropout regularization to avoid overfitting. The model is built on the Theano framework and can therefore run on a graphics processing unit (GPU) or across multiple CPU threads.

The Deepsurv model also supports the above-mentioned regularization techniques. In addition, Deepsurv adopts modern techniques that improve network training, such as the scaled exponential linear unit (SELU) activation function [8].

Although both Cox-nnet and Deepsurv can learn non-linear relationships between risk factors and event risk, it is important to note that the proportional hazards assumption still holds, in the sense that the hazard ratio between any two individuals \(i\) and \(j\) is constant over time.

### Model V: Nnet-survival

Nnet-survival is a fully parametric survival model that discretizes survival time. It was proposed to address two main weaknesses of neural network models adapted from the Cox model: computational speed and violation of the proportional hazards assumption. Models adapted from the Cox model (e.g., Deepsurv, Cox-nnet) optimize the partial likelihood as their loss function, and the partial likelihood for each individual depends not only on that individual but also on every individual still at risk at the same time point. This makes it difficult to use stochastic or mini-batch gradient descent, both of which operate on small subsets of the whole dataset. Consequently, Deepsurv and Cox-nnet may converge slowly and may be unable to fit large datasets in memory [9]. Nnet-survival instead discretizes time, which turns the model into a fully parametric one and avoids using the partial likelihood as the loss function. In Nnet-survival, follow-up time is discretized into \(n\) intervals. The hazard \(h_j\) is defined as the conditional probability of experiencing the event during interval \(j\), given that the individual is alive at the beginning of interval \(j\). The survival probability at the end of interval \(j\) can then be calculated as follows:

$${S}_{j}={\prod }_{i=1}^{j}\left(1-{h}_{i}\right).$$
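The discretization step and the survival product above can be sketched in a few lines. Both helpers below are our own illustrations (names and interval edges are hypothetical):

```python
import numpy as np

def discretize_followup(time, breaks):
    """Assign each continuous follow-up time to an interval 1..n.

    breaks are the right edges of the n intervals; a time equal to an
    edge is placed in the interval that ends at that edge.
    """
    return np.searchsorted(breaks, time, side="left") + 1

def survival_prob(h):
    """S_j = prod_{i=1}^{j} (1 - h_i) for every interval j."""
    return np.cumprod(1.0 - h)
```

For example, with yearly cut points a follow-up of 1.5 years falls into interval 2, and interval-wise hazards of (0.1, 0.2, 0.3) yield survival probabilities (0.9, 0.72, 0.504).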

The likelihood contribution of each individual, whose negative log is used as the loss, is defined as follows:

$$L=h_j\prod_{i=1}^{j-1}\left(1-h_i\right),$$

for individuals who fail at interval \(j\), and

$$L=\prod_{i=1}^{j-1}\left(1-h_i\right),$$

for individuals who are censored during the second half of interval \(j-1\) or the first half of interval \(j\).
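The two cases above can be written as one small function (a sketch of our own, with 1-based interval index \(j\) as in the text):

```python
import numpy as np

def likelihood_contribution(h, j, failed):
    """Likelihood of one individual given interval-wise hazards h.

    failed=True:  h_j * prod_{i<j} (1 - h_i)  (event in interval j)
    failed=False: prod_{i<j} (1 - h_i)        (censored around interval j)
    """
    surv = np.prod(1.0 - h[: j - 1])          # survive intervals 1..j-1
    return h[j - 1] * surv if failed else surv
```

Because each contribution involves only that individual's own hazards, the loss decomposes over individuals and mini-batch gradient descent applies directly, unlike with the partial likelihood.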

There are two main architectures of Nnet-survival: a flexible version and a proportional hazards version. In the flexible version, the output layer has \(n\) neurons, where \(n\) is the number of intervals; each output neuron represents the conditional probability of surviving the corresponding time interval given that the individual is alive at its beginning. In the proportional hazards version, the final layer has only a single neuron, representing \(\mathbf{X}_i^T\boldsymbol{\beta}\):

$$h_{\beta}\left(\mathbf{X}_i\right)=\mathbf{X}_i^T\boldsymbol{\beta}.$$

In our study, we used the flexible version; its architecture is shown in Fig. 1B.
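A forward pass through the flexible architecture can be sketched as follows. This is a deliberately minimal stand-in (one ReLU hidden layer, illustrative weight shapes), not our trained network: a sigmoid output layer produces one conditional hazard per interval, from which the survival curve follows:

```python
import numpy as np

def nnet_survival_forward(x, W1, b1, W2, b2):
    """Flexible-architecture forward pass for one individual.

    Returns one conditional hazard per time interval and the implied
    survival curve S_j = prod_{i<=j} (1 - h_i).
    """
    hidden = np.maximum(0.0, W1 @ x + b1)           # ReLU hidden layer
    h = 1.0 / (1.0 + np.exp(-(W2 @ hidden + b2)))   # sigmoid -> hazards in (0, 1)
    return h, np.cumprod(1.0 - h)
```

The sigmoid keeps every hazard in (0, 1), so the cumulative product is a valid, monotonically decreasing survival curve by construction.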

### Statistical analysis

In this study, we used the harmonized, individual-level data from 6 cohorts in the Lifetime Risk Pooling Project: the Atherosclerosis Risk in Communities (ARIC) study, the Cardiovascular Health Study (CHS), the Framingham Offspring study, the Coronary Artery Risk Development in Young Adults (CARDIA) study, the Framingham Original study, and the Multi-Ethnic Study of Atherosclerosis (MESA). The first 5 cohorts were used for model development and internal validation, and the MESA data were used for external validation. We included individuals who met the following criteria: (i) aged 40 to 79 years; and (ii) free of a previous history of myocardial infarction, stroke, congestive heart failure, or atrial fibrillation. ASCVD was defined as nonfatal myocardial infarction, coronary heart disease death, or fatal or nonfatal stroke (see [14] for details of the selection criteria). All study individuals were free of ASCVD at the beginning of the study and were followed up until the first ASCVD event, loss to follow-up, or death, whichever came first. We fit the PCE, Cox PH with all two-way interactions (Cox PH-TWI), Nnet-survival, Deepsurv, and Cox-nnet models in white male, white female, black male, and black female participants. For comparison purposes, all models used the same predictors as the PCEs: age at baseline, systolic blood pressure (SBP), history of diabetes, treatment for hypertension, current smoking, high-density lipoprotein cholesterol (HDL-C), and total cholesterol. Details of the study start and end dates, study settings, and covariate collection can be found in [14]. Individuals with missing data at baseline were excluded from the study.

To obtain high-performance neural network survival models, we manually tuned various hyperparameters, including the learning rate, number of layers, number of neurons, number of epochs, batch size, momentum, optimizer, learning rate decay, batch normalization, L2 regularization, and dropout. More specifically, to tune the hyperparameters in the development dataset (the five training/internal-validation cohorts), we split the data into 10 folds, using nine folds for training and the remaining fold for evaluation. We used grid search over a range of hyperparameters and selected the combination that produced the highest C-index on the held-out fold. After selecting the optimal hyperparameters, we evaluated model performance through internal validation with 10 × 10 cross-validation and external validation with the MESA data. To perform 10 × 10 cross-validation, we randomly partitioned the pooled cohort data into 10 equal-sized subsamples; 9 subsamples were used as training data and the remaining subsample was retained as validation data, with each subsample used in turn for validation. We repeated this process 10 times, building 100 models in total. The average C-statistic and the calibration plot of the 100 models were used as the final 10 × 10 cross-validation result. For the PCE and Cox PH-TWI models, we refit the models in each cross-validated training sample to avoid overfitting; in each refit, we kept the structure of the original PCE models and only updated the coefficients. In the calibration plots, observed and predicted events are shown by decile [11]. For external validation, we trained the models on the whole harmonized dataset (excluding the MESA cohort) and evaluated model discrimination and calibration in the external MESA cohort.
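The 10 × 10 cross-validation splits described above can be generated as follows (a sketch of our own; equivalent functionality exists in standard libraries such as scikit-learn's `RepeatedKFold`):

```python
import numpy as np

def repeated_kfold_indices(n, k=10, repeats=10, seed=0):
    """Yield (train_idx, test_idx) for repeats x k cross-validation splits."""
    rng = np.random.default_rng(seed)
    for _ in range(repeats):
        perm = rng.permutation(n)            # fresh random partition per repeat
        folds = np.array_split(perm, k)      # k near-equal folds
        for i in range(k):
            test = folds[i]
            train = np.concatenate([folds[j] for j in range(k) if j != i])
            yield train, test
```

For n participants this yields 100 train/test pairs, with each participant appearing in exactly one test fold per repeat.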
To test whether the differences in C-statistics between the neural network models and the PCE models were significant, we performed significance tests using the method proposed by Uno et al. [16]. MESA is a more contemporary cohort with a lower CVD event rate than the earlier cohorts [14]; this difference could cause the models to be poorly calibrated in MESA. To address this, we recalibrated all models using the method proposed by Pennells et al. [17]. Briefly, we first calculated the rescaling factors needed to bring predicted risks in line with observed risks using a regression model in the MESA dataset, and then applied these factors to the original predicted risks to obtain recalibrated risk estimates for all participants.
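The rescaling idea can be illustrated with a deliberately simplified stand-in for the regression-based recalibration of Pennells et al. [17]: a single ratio that aligns the mean predicted risk with the observed event rate (real use would model the relationship rather than apply one factor):

```python
import numpy as np

def recalibrate(pred_risk, observed_rate):
    """Rescale predicted risks so their mean matches the observed event rate.

    Simplified illustration only; the published method fits a regression
    model to derive the rescaling, rather than a single ratio.
    """
    factor = observed_rate / pred_risk.mean()    # rescaling factor
    return np.clip(factor * pred_risk, 0.0, 1.0)
```

After rescaling, the mean recalibrated risk equals the observed rate, which is the property the calibration plots assess by decile.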

Nnet-survival, Deepsurv, and Cox-nnet were implemented in Python, version 3.7.3. The Cox PH models were fit using the "survival" package in R, version 3.6.0. C-statistics and the significance test comparing the C-statistics of two competing risk prediction models were computed using the "survC1" package in R, version 3.6.0 [16]. We chose 0.05 as the statistical significance level. The regression model for recalibration was fit using the "scikit-learn" module in Python, version 3.7.3 [18].

All data were de-identified, and all study protocols and procedures were approved by the Institutional Review Board at Northwestern University with a waiver for informed consent. All methods were performed in accordance with the relevant guidelines and regulations.