Joint modelling of longitudinal lipids and time to coronary heart disease in the Jackson Heart Study

Background Multiple longitudinal responses together with time-to-event outcome are common in biomedical studies. There are several instances where the longitudinal responses are correlated with each other and at the same time each longitudinal response is associated with the survival outcome. The main purpose of this study is to present and explore a joint modeling approach for multiple correlated longitudinal responses and a survival outcome. The method will be illustrated using the Jackson Heart Study (JHS), which is one of the largest cardiovascular studies among African Americans. Methods Four longitudinal responses, i.e., total cholesterol (TC), high density lipoprotein (HDL) cholesterol, triglyceride (TG) and inflammation measured by high-sensitivity C-reactive protein (hsCRP); and time-to-coronary heart disease (CHD) were considered from the JHS. The repeated lipid and hsCRP measurements from a given subject overtime are likely correlated with each other and could influence the subject’s risk for CHD. A joint modeling framework is considered. To deal with the high dimensionality due to the multiple longitudinal profiles, we use a pairwise bivariate model fitting approach that was developed in the context of multivariate Gaussian random effects models. The method is further explored through simulations. Results The proposed model performed well in terms of bias and relative efficiency. The JHS data analysis showed that lipid and hsCRP trajectories could exhibit interdependence in their joint evolution and have impact on CHD risk. Conclusions We applied a unified and flexible joint modeling approach to analyze multiple correlated longitudinal responses and survival outcome. The method accounts for the correlation among the longitudinal responses as well as the association between each longitudinal response and the survival outcome at once. This helps to explore how the combination of multiple longitudinal trajectories could be related to the survival process.


Background
Longitudinal data together with time-to-event measurements are common in biomedical studies. In cardiovascular researches, for example, as we focus in this paper, lipid levels, i.e., total cholesterol (TC), low density lipoprotein (LDL) cholesterol, high density lipoprotein (HDL) cholesterol and triglyceride (TG) could be measured repeatedly over time along with time-to-event data such as time to coronary heart disease (CHD). Abnormalities in LDL cholesterol, HDL cholesterol, TC and TG levels have all been linked to increased risk for CHD. Furthermore, inflammation measured by high-sensitivity C-reactive protein (hsCRP) correlates with CHD status. As people age, body mass and composition as well as physical activity levels and diet tend to change which are often associated with changes in lipid levels.
There is an extensive literature available on separate analyses of longitudinal measurement data and time-toevent data including the popular linear mixed effects model for the former [1,2], and parametric (Weibull model or exponential model) or semiparametric (Cox) proportional hazards model for the latter [3]. However, separate analyses could be inappropriate in some applications as health status (for example, risk to CHD) is likely correlated with longitudinal lipid level trajectory. Furthermore, separate analysis may not adequately address the underlying research questions. Wang and Taylor [4] include the longitudinal marker as a time-dependent covariate in the (proportional hazards) survival model. It has been shown, however, that including longitudinal measurements as time-varying covariate may be inappropriate as the measurements could be subject to measurement error [5]. Several approaches of joint modeling of a longitudinal continuous response and a time-to-event outcome have been available. Tsiatis and Davidian [6] proposed a two-stage joint model in application to longitudinal CD4 count data as surrogate marker to HIV/AIDS survival whereby a repeated measures random components model is considered in the first stage for the longitudinal CD4 count data while a Cox model is employed in the second stage for estimating the parameters. Henderson et al. [7] proposed a flexible joint modelling framework such that the longitudinal and survival processes are connected via a latent bivariate Gaussian process. A fully Bayesian version of this approach, implemented via Markov chain Monte Carlo (MCMC) methods is presented in Guo and Carlin [8]. Ibrahim et al. [9] applied such joint modeling and explored why it is needed in cancer research. A joint model for longitudinal and survival data with nonparametirc multiplicative random effects approach has been studied by Ding and Wang [10].
The majority of the joint models in the literature focus on a single longitudinal response and a survival outcome.
Little attention has been given to modeling multiple longitudinal responses and a survival process simultaneously. Flexible semiparametric multivariate joint model that relate multiple longitudinal outcomes to a time-to-event have been proposed [11,12] with emphasis on the association between the longitudinal data and survival outcome. Such multivariate joint analysis is also needed when one aims to explore not only the association between an individual longitudinal outcome and a survival process but also the correlation between the multiple longitudinal responses. In the Jackson Heart Study (JHS), described in "Methods" section, longitudinal HDL, TC, TG and hsCRP measurements taken from each subject repeatedly over time are subject to measurement error, potentially correlated with each other, and at the same time each of these lipid and hsCRP profiles is likely associated with the subject's risk for CHD. It is then useful to assess these features by studying their association and simultaneous covariate effects, which can be addressed in joint modelling context. However, joint modeling of such multivariate longitudinal responses and a survival outcome poses computational challenges due to the high dimensionality in the random effects model. This paper aims to present a joint modeling approach for multiple longitudinal responses and a survival data. To deal with the high dimensionality due to multiple longitudinal components, a pairwise model fitting approach that was developed in the context of multivariate Gaussian random effects is adopted [13].
The rest of the paper is organized as follows. In "Study data" section, a motivating dataset is described with analysis reported in "Results" section. "Review of basic models" section focuses on brief review of models for separate and joint analysis of a continuous longitudinal response and a survival outcome. "Multivariate joint model" section focuses on a joint model for multivariate continuous longitudinal responses and a time-to-event outcome, followed by a likelihood based estimation approach along the lines of Fieuws and Verbeke [13] in "Estimation" section. "Simulation" section presents results of a simulation study. Discussion and concluding statements are given in "Discussion" and "Conclusion" sections, respectively.

Study data
The Jackson Heart Study (JHS) is a longitudinal community-based study that investigates the causes of cardiovascular disease among African Americans. JHS participants were recruited from urban and rural areas from the three counties that make up the Jackson metropolitan area (Hinds, Madison, and Rankin). The 5,306 enrolled participants received three clinical examinations (Exam 1, 2000-04; Exam 2, 2005-08; Exam 3, 2009-13) that generated a plethora of longitudinal data includ-ing, but not limited to, cardiovascular disease risk factors, socio-economic information and biochemical analytes. In addition to the clinic exams, JHS has kept track of the occurrence of events of interest (i.e. coronary heart disease, abbreviated as CHD) through adjudicated events. Details on the design methods and data collection of the JHS can be obtained from Taylor et al. [14] and Carpenter et al. [15]. This paper uses data from two sources of JHS: (1) the longitudinal clinical examinations where sex, diabetes status, hypertension status, statin use, body mass index, inflammation, and lipid levels are extracted from; and (2) the annual follow-up surveys where Coronary Heart Disease (CHD) adjudicated events through 2016 are obtained. We allow for right censoring, i.e., those who experienced CHD since the baseline year (2000) were considered as incident cases and those who have died, withdrew from the study or have not seen CHD until 2016 are treated as censored. The time variable (event or censoring) begins from the baseline year and we haven't assumed truncation as present. These variables are summarized in Table 1 by CHD status across the three measurement occasions where count and percentages are used for categorical variables and means and SDs for continuous variables. There are some notable highlights. Of the 4232 participants who have lipids and hsCRP measurements available at baseline, 236 (5.6%) experienced CHD event with a median survival time of 13.8 years. The mean age at baseline for non-CHD was 53.2 ±12 years while 62.0 ±10 years for those who had CHD. The proportion of diabetes as well as hypertension is higher in the CHD incident cases as compared to those without CHD across all visits. The mean HDL seems to increase over time, though the magnitude of mean level in non-CHD participants appears slightly higher when compared across the respective visits. The mean TG, however, tends to decline in both groups and non-CHD participants appear to have lower values. Incident cases are more likely to be men. Individual profiles plots of TG, hsCRP, HDL and TC for a random sample of one hundred participants are shown in Fig. 1. Clearly, there is evidence of between-subjects variability both at baseline and over time. On the other hand, Fig. 2 shows the average profiles plots of TG, hsCRP, HDL and TC for the full cohort. The average TG trajectory tends to decline while hsCRP and HDL seem to suggest an increasing average trend. TC appears to be more or less stable. A formal statistical test is needed to see the significance of these comparisons which will be dealt with later in "Results" section.
The Jackson Heart Study provides a unique opportunity to study the longitudinal trajectory of lipids and hsCRP and incident CHD. Therefore, we will assess the association between the longitudinal lipids and hsCRP with incident CHD as well as the correlation between the longitudinal lipids and hsCRP measurements in African American participants.

Review of basic models
In "Linear mixed model for longitudinal data", we will briefly review the linear mixed model for continuous longitudinal data, followed by a basic parametric model for survival data in "Model for time-to-event data" sections. "Joint model" section focuses on a joint model for a longitudinal response and time-to-event outcome.

Linear mixed model for longitudinal data
Let y i1 , y i2 , . . . y in i be longitudinal measurements from the i th subject at times t i1 , t i2 , . . . t in i . The model can be written as where x 1i t ij are regressors corresponding to unknown regression coefficients β, b 1i are random effects with  The subject-specific random effects b 1i and measurement error ε i are independent and assumed to follow normal distribution, i.e., b 1i ∼ N(0, D), and ε i ∼ N(0, i ). Details on linear mixed models can be found in Laird and Ware [1].

Model for time-to-event data
Suppose s i denotes the survival time for subject i, i = 1, 2, . . . , n. The event of interest may not be observed for some of the subjects until the end of the follow-up period and hence their event times are right censored. Let c i be the censoring time for subject i. For an indicator function Weibull and exponential models are commonly used parameteric models in survival data analysis. In a Weibull model the density for the survival time for the i th subject is given by where x 2i is the vector of covariates corresponding to the i th observation and ξ is a vector of regression coefficients. The Weibull model in (2) reduces to the exponential model when α = 1.

Joint model
The longitudinal response in "Linear mixed model for longitudinal data" and survival process in "Model for timeto-event data" sections can be correlated with each other. One possible route to study this association is through joint modeling, where the linear mixed model in (1) is linked with the survival model in (2) through shared random effects. Assuming a linear mixed effects model for the longitudinal part and a Weibull model for the survival part, the joint distribution of the longitudinal response and the time-to-event outcome conditional on the random effects is: 1i and b 2i are latent zero-mean bivariate Gaussian processes corresponding to the longitudinal measurements and events, θ is vector of association parameters, and the rest of the quantities are as defined in "Linear mixed model for longitudinal data" and "Model for time-to-event data" sections. For random intercepts b 0i and random slopes b 1i , for example, one possible model could be where θ 1 and θ 2 measure the association between the two submodels induced by the random intercepts and random slopes, respectively. Other ways of parametrization for the association of the two submodels can also be used as well [7]. Furthermore, b 1i = (b 0i , b 1i ) is assumed to follow bivariate normal distribution, possibly correlated, with mean 0 and variance-covariance matrix D given by where ρ is the correlation between the random intercepts and random slopes.

Multivariate joint model
The joint mode in "Joint model" section can be extended to a multivariate setting where k longitudinal continuous responses can be simultaneously modeled with time-toevent data. A multivariate joint model formulation of the k longitudinal responses and the time-to-event outcome follows from combining each single joint model of the longitudinal continuous response and the time-to-event outcome via random effects that are assumed to follow multivariate normal distribution. The general expression for the resulting multivariate joint model can be given by where f (y , t) is the single joint model corresponding to the th continuous response and the time-to-event data as described in "Joint model" section and φ is a k-dimensional Gaussian density for the random effects (b 1 , . . . , b k ), such that, for subject i, each b i is shared between the linear mixed submodel corresponding to the th longitudinal response and the time-to-event submodel through b 1i and b 2i , respectively, as shown in "Joint model" section. The likelihood contribution of subject where is the vector of fixed effect and covariance parameters in the joint model. The full multivariate joint model in (4) is complex due to the high dimensionality in the random effects and inference for is not feasible in the standard statistical softwares. An estimation approach to overcome this computational challenge and reduce the dimensionality along the lines of Fieuws and Verbeke [13] will be the topic of "Estimation" section.

Estimation
We employ the pairwise bivariate fitting to reduce the high dimensionality of (4) and estimate the model parameters in [13]. By pairwise fitting we mean that bivariate joint models of all possible pairwise combinations of the longitudinal continuous responses and a time-toevent data are fitted and duplicate parameter estimates are averaged across all pairs. Consider k random-intercept models that are specific to k single joint models where each joint model represents a longitudinal continuous responses and time-to-event outcome as described in "Joint model" section. The possible number of pairs out of the k continuous responses is P = k(k−1) 2 . For subject i, consider a vector of the r th and s th bivariate longitudinal responses and the time-to-event outcome Y ri , t i and Y si , t i , respectively, where Y ri = y ri1 , . . . , y rin ri and is vector of random effects corresponding to the r th and s th bivariate joint models with mean 0 and covariance matrix D.
The likelihood to be maximized corresponding to the r th and s th bivariate joint model of two longitudinal continuous responses the time-to-event data takes the form where f and φ are as defined in (4) and N is the total number of subjects. Fitting these bivariate models is computationally feasible as there are only two correlated random effects. The loglikelihood across all possible pairs can be given as P p=1 l Y p | p , where p = 1, . . . , P, and p is the vector of parameters corresponding to the p th bivariate joint model. All pair-specific parameter vectors p can be combined and the so resulting vector of parameters is denoted by . Assuming that p maximizes l Y p | p , would maximize l . An estimate for each parameter in of the full multivariate joint model defined in "Multivariate joint model" section is obtained by averaging all pair-specific estimates in . The resulting averages are asymptotically normal. However, the standard errors can't be directly computed by simple averaging of their corresponding pair-specific estimates. Details on the averaging of pair specific estimates and expression of the asymptotic normality of follows in a similar way as the multivariate linear mixed model and multivariate semi-continuous data and can be obtained in Fieuws and Verbeke [13]. This approach can be well generalized to random intercepts and random slopes model of course with additional computational complexity. Each bivariate pair consists of a longitudinal response and a time-toevent outcome as shown in (5) does not have a closed form solution and is fitted numerically with the SAS procedure NLMIXED using adaptive Gaussian quadrature based on quasi-Newton optimization technique. Other flexible non-linear optimizers can be employed as well. A SAS macro has been used to fit the P bivariate models and combine the results.

Results
We analyze the JHS data introduced in "Study data" section. A multivariate joint model of "Multivariate joint model" section will be fitted for the longitudinal measurements of high density lipoprotein (HDL) cholesterol, triglyceride (TG), total cholesterol (TC), high-sensitivity C-reactive protein (hsCRP), and time to coronary heart disease (CHD) to capture the association between each of these longitudinal response and the time-to-CHD as well as the correlation between HDL, TG, TC and hsCRP. Log-transformation was applied to the four longitudinal responses based on earlier exploratory analysis. Let y ij , = 1, . . . , 4 be the th log-transformed continuous response for subject i at time j and we model the mean μ ij as where A i baseline age for subject i, M i a gender indicator for subject i coded as (1: male; 0: female), D ij is a diabetic status indicator for subject i at time j coded as (1: diabetic; 0: non-diabetic), H ij is a hypertension status indicator for subject i at time j coded as (1: hypertensive; 0: non-hypertensive); T ij is the j measurement time for subject i in years from baseline; BMI ij is body mass index for subject i at time j, ST ij is a statin medication indictor for subject i at time j coded as (1: medication; 0: no medication), SM i is current smoking indicator for subject i at baseline coded as (1: smoker; 0: non-smoker) and b i are subject specific random intercepts for the th continuous response.
Turning to the time-to-CHD, we consider an exponential submodel in each joint model with the th continuous response, and further the hazard at time t is modeled as, where θ captures the association between the linear mixed submodel and the exponential submodel induced by the random effects. The correlation between any two pairs of longitudinal responses is captured by the correlation of the respective random intercepts.
Six pairs of bivariate models were fitted for the four responses using the pairwise fitting approach outlined in "Estimation" section. Results are shown in Table 2. Clearly, the estimated association parameter of TG and CHD is significant, suggesting evidence of strong negative association between the two submodels (θ = −0.4473, p = 0.009), and this implies that high level of TG is associated with poor survival or shorter time-to-CHD. The same is true for the association between hsCRP and CHD (θ = −0.2504, p = 0.005). On the contrary, the estimate corresponding to HDL (θ = 1.0472, p = 0.002) is positive and significant suggesting high HDL level is associated with better survival. On the other hand, the association parameter estimate corresponding to TC provides no evidence of association between total cholestol and CHD (p = 0.151). Consider testing hypotheses about the four association parameters jointly. One can employ a multivariate Wald test for the hypothesis H 0 = L(θ) = 0, where L is a 4×4 identity matrix corresponding to θ , where COV θ is the 4×4 covariance matrix of θ obtained from fitting the multivariate joint model. This test suggests joint significance of the association between the four longitudinal responses and CHD (χ 2 = 18.72, p = 0.0009). Another important output of this analysis is estimated correlation of the random intercepts. In this regard, we observe evidence of negative correlation between TG and HDL (−0.5026) as well as HDL and   But no significant association was observed between BMI and CHD survival. Likely due to the smaller number of repeated measurements per subject, incorporating an additional random slope term and association parameter θ 2 as described in "'Joint model" section did not improve model fit. This implies that the subject specific random intercepts are sufficient to describe the longitudinal aspect in these data. For comparisons purpose, separate joint model of "Joint model" section were fitted. The results are summarized in Table 3. When compared with their respective multivariate joint model estimates of Table 2, we notice slight differences in the association parameters estimates corresponding to TG and hsCRP as well as the intercept terms of the survival component. The rest of the parameter estimates appear to be similar. Furthermore, we fitted an exponential model for time-to-CHD by including only baseline values of lipids separately while adjusting for the same covariates of the survival submodel considered so far. The estimated regression coefficients (standard error) corresponding to baseline TG, HDL, hsCRP and TC are −0.2354 (0.0876), 0.6206 (0.1834), −0.1674 (0.0421), and −0.5710 (0.2344), respectively. Comparing these results with the association parameter estimates of joint multivariate model in Table 2, we clearly observe a remarkable difference in modeling the longitudinal trajectories versus the baseline measurements of lipids. We will explore this further in "Simulation" section with simulation.

Simulation
In this section, we report on a simulation study set up to examine the performance of the pairwise bivariate fitting in joint modeling of multiple longitudinal measurements and a time-to-event data. The estimates from the pairwise bivariate fitting will be compared with the full likelihood multivariate joint model in terms of bias and relative efficiency.
The simulated data were analyzed by the multivariate joint model using pairwise bivariate fitting and full likelihood trivariate joint model. All the 1000 simulations converged in the pairwise fitting while 985 simulations do so in the full trivariate model, i.e., nearly 1.5% failed to converge in the latter. For comparison purposes, results of the 985 successful simulations of the full multivariate joint model and the same number of simulations from the pairwise fitting, based on simulation ID number, were used. The average estimated values from the pairwise bivariate fitting (Mean P ) and full trivariate joint model (Mean F ) are summarized in Table 4. All of the mean estimates from the pairwise fitting are close to the true values and nearly unbiased for the proposed approach. Furthermore, as shown in Table 5, the Monte Carlo standard errors from the pairwise fitting (MC.SE P ) are close to those from the full trivariate joint model (MC.SE F ) with their ratios range from 0.997 to 1.056, and this suggests that no considerable    efficiency loss of the pairwise bivariate approach relative to the full likelihood multivariate model. Furthermore, the simulated data were analyzed by Weibull model of "Model for time-to-event data" section to investigate the impact of omitting the longitudinal aspect as well as the correlation between the longitudinal responses. This was achieved by including only basline measurements of the longitudinal responses as covariates in the regression model for the time-o-event outcome. Three separate models were fitted as Weibull(α , μ i ), where μ i = ξ 0 + ξ 1 male i + θ y i , y i is baseline values for the longitudinal response y ij and the rest of the quantizes remain defined earlier. The results are summarized in Table 6. Clearly, the impact of taking just the baseline values of y i and ignoring the longitudinal trajectories is remarkable. This can be clearly observed from the considerable bias of the regression coefficients including the association parameter θ.

Discussion
The method has been used to jointly analyze four multivariate longitudinal responses and a time-to-event outcome from a cardiovascular study on African Americans. The data analysis showed inverse relationship between HDL cholesterol and incidence of coronary heart disease, while elevated triglyceride level and high-sensitivity Creactive protein are associated with increased risk for CHD. Duncan et al. [16] who studied association between longitudinal lipid trajectories and atherosclerotic cardiovascular disease (ASCVD) including CHD reported a similar finding on the association of HDL cholesterol and ASCVD. This is inline with Wilikins et al. [17] and Skrettebegr et al. [18]. While Duncan's investigation did not find a significant association between triglyceride and ASCVD, we observe an association of triglyceride with CHD which is consistent with Sarwar et al. [19]. This is likely because in Ducan's study ASCVDE includes CHD, stroke, or transient ischemic attack, and intermittent claudiction while the present study and Sarwar's investigation focus on CHD. Jeong et al. [20] reported a similar finding on the association of triglyceride and CHD but reported a significant association between total cholesterol and CHD which is not the case for total cholesterol in the current study. This could be attributed to the fact that Jeong et al. [20] examined young adults (aged 20-39 years) while JHS participants are relatively older.
Our simulation study suggest that bivariate fitting of such multivariate joint model yields unbiased estimates which are as nearly efficient as the full likelihood model. These results are inline with the performance of the  [13] and of semi-continuous longitudinal data given in Kassahun-Yimer et al. [21]. Pairwise bivariate fitting is approximately unbiased in missing completely at random (MCAR) and missing at random (MAR) when outcomespecific parameters are considered, but these results may not necessarily be generalized to all missing data problems [21]. Details on missing data mechanisms can be obtained in Little and Rubin [22]. Alison [23] presented a survey of methods to handle missing data and indicated that maximum likelihood based inference for handling missing data have nearly optimal statistical properties under MAR. Ding and Wang [10], however, showed that in joint modeling of univariate longitudinal response and time-toevent modeling, informative dropout on the longitudinal component could result in serious bias. Multiple imputation is also an attractive approach in handling missing data in some applications despite the several questions posed in relation to the distributional choices, number of imputations and iterations and prior knowledge about the missingness pattern [22,23]. Fieuws and Verbeke [13] pointed out efficiency loss can be present when some of the parameters are shared by a set of outcomes. An additional simulation conducted to investigate the impact of taking just the baseline values of longitudinal response and including as covariates in survival regression indicated that parameter estimates could be severely biased. This is inline with earlier simulation studies. For example, Henderson et al. [7] and Ibrahim et al. [9] pointed out that severe bias could occur for some parameter estimates when there is ignored latent association between the time-to-event and longitudinal data. Apart from several advantages, our proposed approach has also limitations. First, right censoring is assumed for the time-to-event outcome, though other mechanisms of censoring could also be operating in practice. Second, we assumed the link between the longitudinal and the survival submodels is constant over time. Third, earlier simulations suggest that pairwise bivariate fitting is approximately unbiased in MCAR and MAR but this may not be the case for all longitudinal data with missingness. Hence, exploring the proposed approach in various missing data mechanisms and patters together with appropriate model diagnostic tools is a useful area of future work.

Conclusion
In this paper, we have presented and studied a multivariate joint model for multiple longitudinal Gaussian responses and a time-to-event outcome which follows from combining each single joint model of a longitudinal continuous response and the time-to-event outcome via random effects whereby a linear mixed submodel is used for the longitudinal continuous part while a time-to-event submodel is considered for the survival part. The model was illustrated using the Jackson Heart Study. The association between lipid levels and CHD risk obtained from the joint model that takes into account the longitudinal lipid trajectories is substantially higher than that estimated using the baseline lipid values only. The Weibull model (exponential model as its special case) is most common for time-to-event data due to its flexibility, but alternative distributions, such as, the log-normal, log-logistic and gamma could provide better fit to a particular dataset. The choice of one distribution from the other could be based on exploratory graphical tools or formal statistical tests. The pairwise bivariate model fitting approach employed for the Weibull model can be well extended to any other probability density function in the survival submodel.