A joint model of longitudinal and survival data consists of two component submodels: the longitudinal submodel and the survival submodel. We define a set of baseline covariates, *U*
_{
i
}, which can potentially differ between submodels. The longitudinal submodel allows us to model the trajectory of a repeatedly measured biomarker over time, adjusting for baseline covariates. The standard approach assumes a linear mixed effects model [14]. We observe

{Y}_{i}({t}_{\mathit{\text{ij}}})={W}_{i}({t}_{\mathit{\text{ij}}})+{\epsilon}_{\mathit{\text{ij}}},\phantom{\rule{2em}{0ex}}{\epsilon}_{\mathit{\text{ij}}}\sim \mathrm{N}(0,{\sigma}_{e}^{2})

(1)

with

{W}_{i}({t}_{\mathit{\text{ij}}})={X}_{i}^{\prime}({t}_{\mathit{\text{ij}}})\mathit{\beta}+{Z}_{i}^{\prime}({t}_{\mathit{\text{ij}}}){\mathit{b}}_{i}+{\mathit{u}}_{i}\mathit{\delta}

(2)

where *Y*
_{
i
}(*t*
_{
ij
}) is the observed longitudinal response for the *i*
^{th} patient measured at the *j*
^{th} time point. *W*
_{
i
}(*t*
_{
ij
}) is our true unobserved trajectory function consisting of design matrices {X}_{i}^{\prime}({t}_{\mathit{\text{ij}}}) and {Z}_{i}^{\prime}({t}_{\mathit{\text{ij}}}) for the fixed and random effects, *β* and *b*
_{
i
}, respectively, where *b*
_{
i
} ∼ MVN(0,**Σ**). We can incorporate flexibility here by allowing both {X}_{i}^{\prime}({t}_{\mathit{\text{ij}}}) and {Z}_{i}^{\prime}({t}_{\mathit{\text{ij}}}) to contain restricted cubic spline functions of measurement time [15]. We also have a vector of baseline covariates *u*
_{
i
} ∈ *U*
_{
i
}, and corresponding regression coefficients, *δ*. Finally, *ε*
_{
ij
} is our normally distributed measurement error with constant variance {\sigma}_{e}^{2}. We further assume that the random effects and error term are independent, and that cov (*ε*
_{
ij
},*ε*
_{
ik
}) = 0 (where *j* ≠ *k*).

The time-to-event submodel usually takes the form of a proportional hazards model

{h}_{i}(t)={h}_{0}(t)exp({\alpha}_{1}{W}_{i}(t)+\mathit{\varphi}{\mathit{v}}_{i})

(3)

with *h*
_{0}(*t*) the baseline hazard function and *v*
_{
i
} ∈ *U*
_{
i
} is a vector of baseline covariates with corresponding log hazard ratios, *ϕ*. The parameter *α*
_{1} is commonly named the association parameter, indicating the strength of association between the longitudinal biomarker and the time to event. If *α*
_{1} = 0, then the joint model reduces to the two separate models and fitting a joint model will not prove advantageous. This parameterisation assumes the hazard is dependent on the biomarker through its current value. This form of association is one of many ways to link the two component sub-models. The baseline hazard function, *h*
_{0}(*t*), can be modelled using a parametric distribution, most frequently the Weibull, or less restrictively using flexible parametric survival models [16], or of course can be left unspecified [17]. However, an unspecified baseline hazard function leads to underestimation of the standard errors of parameter estimates [18], and consequently bootstrapping is required to obtain appropriate standard errors.

For illustration, we let *W*
_{
i
}(*t*
_{
ij
}), the longitudinal submodel, be a linear function of time where the intercept and slope varies between subjects

{W}_{i}({t}_{\mathit{\text{ij}}})=({\beta}_{0}+{b}_{0i})+({\beta}_{1}+{b}_{1i}){t}_{\mathit{\text{ij}}}

(4)

giving a model with a random intercept and random linear slope. As an alternative way of linking the component models to that of Equation (3), we may link elements of the trajectory function, *W*
_{
i
}(*t*
_{
ij
}), to the hazard directly. For example, we can link the subject specific baseline biomarker values through the intercept association structure, where

{h}_{i}(t)={h}_{0}(t)exp\left[{\alpha}_{2}({\beta}_{0}+{b}_{0i})+\mathit{\varphi}{\mathit{v}}_{i}\right]

(5)

in this expression *α*
_{2} now estimates the strength of the association between the patient specific baseline biomarker values, as estimated by the longitudinal submodel, and the time-to-event. This way we can let the risk of event depend directly on the subject specific value of the biomarker at time *t* = 0.

If interest lies in prediction when a new patient is observed at baseline, the issue of measurement error can be accounted for through this approach. A benefit of this association structure also lies in the evaluation of the joint likelihood. Under most parametric survival submodels (e.g. Weibull distribution) and time-dependent association structures (eg. current value), numerical quadrature is required to integrate out not only the random effects, but under Equation (3), nested quadrature is also required to evaluate the cumulative hazard function. Under the time-independent association structure of Equation (5), we avoid this nested quadrature as the cumulative hazard function has an analytically tractable form, which provides computational benefits.

As discussed in the introduction, this model formulation can be an alternative to the standard approach of using the observed baseline biomarker value

{h}_{i}(t)={h}_{0i}(t)exp({\alpha}_{3}{Y}_{0i}+\mathit{\varphi}{\mathit{v}}_{i})

(6)

where *Y*
_{0i
} is the observed baseline biomarker value and *α*
_{3} is the log hazard ratio for a one unit increase in the observed baseline biomarker value. Although simple to fit, Equation (6) does not account for potential measurement error in *Y*
_{0i
}.

### Simulation study

In order to assess the performance of the standard approach of including observed biomarker values, compared to the full joint model described above, we evaluated both through simulation [19]. For ease of exposition we assume a longitudinal model with random intercept and slope, assuming a continuous biomarker of interest with

{W}_{i}({t}_{\mathit{\text{ij}}})=({\beta}_{0}+{b}_{0i})+({\beta}_{1}+{b}_{1i}){t}_{\mathit{\text{ij}}}

where *β*
_{0} = *β*
_{1} = 0, and *b*
_{0i
} ∼ N(0,1), *b*
_{1i
} ∼ N(0,0.25^{2}) with correlation between (*b*
_{0i
},*b*
_{1i
}) of 0.25. Observed measurements are then generated from {Y}_{\mathit{\text{ij}}}\sim \mathrm{N}({W}_{i}({t}_{\mathit{\text{ij}}}),{\sigma}_{e}^{2}), where *t*
_{
ij
} is the time of the *j*
^{th} measurement for the *i*
^{th} patient. We vary *σ*
_{
e
} from {0.1,0.5,1}.

We assume a Weibull baseline hazard function with *λ* = 0.1 and *γ* = 1.5. A binary variable, *X*
_{1} to represent treatment group was generated from Bin (1,0.5), with an associated log hazard ratio of *ϕ*
_{1} = -0.5. A continuous covariate, *X*
_{2}, to represent age at baseline was generated from N(65,12) with an associated log hazard ratio of *ϕ*
_{2} = 0.01. We then generate survival times from a Weibull distribution where the hazard is defined as *h*(*t*) = *h*
_{0}(*t*) exp(*α*
_{2}
*β*
_{0i
} + *ϕ*
_{1}
*X*
_{1} + *ϕ*
_{2}
*X*
_{2}), with *α*
_{2} the association parameter, indicating the effect of a one unit increase in the value of the subject specific intercept on the risk of event. We vary *α*
_{2} = {-0.5,0.25,0.25,0.5}. Each simulation contained 300 patients with up to 5 annual measurements (including baseline), and administrative censoring at 5 years. This corresponds to an approximate 18.9% survival proportion at 5 years (calculated at the mean of covariate values, {X}_{1}=\frac{1}{2},{X}_{2}=65 and *β*
_{0i
} = 0). To each dataset we fit a Weibull proportional hazards model including the observed baseline measurement, and a Weibull-based joint model with the random intercept association structure. We adjust for age and treatment in the survival submodel. Each scenario is simulated 1000 times.

To illustrate the varying measurement error standard deviations used in the simulation scenarios, we show in Figure 1 observed longitudinal measurements from the same 100 patients with *σ*
_{
e
} = {0.1,0.5,1}, and when *α* = 0.25. Figure 1 illustrates that as the measurement error standard deviation increases, the variability in the observed biomarker values increases.

### The GPRD cohort

The General Practice Research Database (GPRD) Group has obtained ethical approval from a Multi-centre Research Ethics Committee (MREC) for all purely observational research using GPRD data; namely, studies which do not include patient involvement The core work of the GPRD is covered by MREC approval granted by the Trent Multi- Research Ethics Committee (REC reference number 05/MRE04/87) and this study was approved by the GPRD Independent Scientific Advisory Committee (ISAC) (Protocol number 09_094). This study is based in part on data from GPRD obtained under licence from the UK Medicines and Healthcare Product Regulatory Agency (MHRA). However, the interpretation and conclusions contained in this study are those of the authors alone.

The example cohort used to illustrate the methods in this paper consists of 4,850 obese patients diagnosed with type 2 diabetes mellitus. We have 107,347 measurements of SBP, with maximum follow-up of 22 years. There were 278 stroke events observed.

In Figure 2 we show the observed SBP measurements for 9 randomly selected patients, who had at least 10 measurements, illustrating some nonlinear trajectories. To accommodate such nonlinearities we can use restricted cubic splines in the linear mixed effects submodel. In particular, we specify the following longitudinal submodel

\begin{array}{ll}\phantom{\rule{6.5pt}{0ex}}{W}_{i}({t}_{\mathit{\text{ij}}})& =({\beta}_{0}+{b}_{0i})+{\beta}_{1}{\text{age}}_{i}+{\beta}_{2}{\text{sex}}_{i}+{\beta}_{3}{\text{BMI}}_{i}\\ \phantom{\rule{1em}{0ex}}+({\mathit{\beta}}_{F}{s}_{F}({t}_{\mathit{\text{ij}}};{\mathbf{k}}_{F})+{\mathbf{b}}_{R}{s}_{R}({t}_{\mathit{\text{ij}}};{\mathbf{k}}_{R}))\end{array}

(7)

Where *s*
_{
F
}(*t*
_{
ij
};**k**
_{
F
}) is the restricted cubic spline basis of measurement time with corresponding fixed effects, *β*
_{
F
}, with knot locations **k**
_{
F
}, and *s*
_{
R
}(*t*
_{
ij
};**k**
_{
R
}) is the restricted cubic spline basis of measurement time with corresponding random effects, **b**
_{
R
}, and knot locations **k**
_{
R
}.

Prelimenary modelling of the longitudinal data can be conducted to guide model selection, in particular, the degrees of freedom for the spline terms capturing the underlying longitudinal trajectory over time.

To allow flexibility in the survival submodel we use the flexible parametric survival model [16, 20], which models the baseline log cumulative hazard function using resticted cubic splines. We can once again undertake seperate analysis of just the survival data to inform model selection. In particular, we can use the AIC and BIC to guide the selection of the number of degrees of freedom to capture the baseline hazard function, following Rutherford et al. (2013) [21]. Our final joint model is then

\begin{array}{ll}\phantom{\rule{6.5pt}{0ex}}log\left[{H}_{i}(t)\right]& =log\left[{H}_{0}(t)\right]+{\varphi}_{1}\phantom{\rule{0.3em}{0ex}}{\text{age}}_{i}+{\varphi}_{2}\phantom{\rule{0.3em}{0ex}}{\text{sex}}_{i}+{\varphi}_{3}\phantom{\rule{0.3em}{0ex}}{\text{BMI}}_{i}\\ \phantom{\rule{1em}{0ex}}+{\alpha}_{2}({\beta}_{0}+{b}_{0i})\end{array}

(8)

where

log\left[{H}_{0}(t)\right]=s(log(t);\gamma ,{\mathbf{\text{k}}}_{S})

(9)

where the baseline log cumulative hazard function, log[*H*
_{0}(*t*)], is expanded into a restricted cubic spline function of log(*t*), *s*(log(*t*);*γ*,**k**
_{
S
}), with knot locations **k**
_{
S
} and coefficient vector, *γ*. This framework has recently been incorporated into a joint model [22]. In each submodel we adjust for the baseline effects of age, sex and BMI. We fit the joint model with the random intercept association structure shown in Equation (5). For comparison, we also apply the standard flexible parametric survival model, adjusting for observed baseline SBP, age, sex and BMI.