 Research article
 Open Access
 Published:
Fitting a shared frailty illnessdeath model to lefttruncated semicompeting risks data to examine the impact of education level on incident dementia
BMC Medical Research Methodology volume 21, Article number: 18 (2021)
Abstract
Background
Semicompeting risks arise when interest lies in the timetoevent for some nonterminal event, the observation of which is subject to some terminal event. One approach to assessing the impact of covariates on semicompeting risks data is through the illnessdeath model with shared frailty, where hazard regression models are used to model the effect of covariates on the endpoints. The shared frailty term, which can be viewed as an individualspecific random effect, acknowledges dependence between the events that is not accounted for by covariates. Although methods exist for fitting such a model to rightcensored semicompeting risks data, there is currently a gap in the literature for fitting such models when a flexible baseline hazard specification is desired and the data are lefttruncated, for example when time is on the age scale. We provide a modeling framework and openly available code for implementation.
Methods
We specified the model and the likelihood function that accounts for lefttruncated data, and provided an approach to estimation and inference via maximum likelihood. Our model was fully parametric, specifying baseline hazards via Weibull or Bsplines. Using simulated data we examined the operating characteristics of the implementation in terms of bias and coverage. We applied our methods to a dataset of 33,117 Kaiser Permanente Northern California members aged 65 or older examining the relationship between educational level (categorized as: high school or less; trade school, some college or college graduate; postgraduate) and incident dementia and death.
Results
A simulation study showed that our implementation provided regression parameter estimates with negligible bias and good coverage. In our data application, we found higher levels of education are associated with a lower risk of incident dementia, after adjusting for sex and race/ethnicity.
Conclusions
As illustrated by our analysis of Kaiser data, our proposed modeling framework allows the analyst to assess the impact of covariates on semicompeting risks data, such as incident dementia and death, while accounting for dependence between the outcomes when data are lefttruncated, as is common in studies of aging and dementia.
Background
Semicompeting risks refers to the setting where interest lies in the timetoevent for some socalled nonterminal event, the observation of which is subject to some terminal event [1]. In contrast to standard competing risks, where each of the outcomes under consideration is typically terminal (e.g. death due to some cause or another), in the semicompeting risks setting it is possible to observe both events on the same study unit, so that there is at least partial information on their joint distribution [1, 2]. Take as an example the study of dementia among the elderly [3], a complex neurocognitive condition that is estimated to affect nearly 6 million individuals aged 65 and older in the US [4], a number that has been projected increase to 13.9 million by 2060 [5]. It is known that the risk of death is higher among those who are diagnosed with dementia [6]. As such, studies seeking to investigate risk factors for dementia must contend with death as a competing risk, which precludes the subsequent observation of dementia. However, it is possible to observe both outcomes among individuals who die following a diagnosis of dementia. This information can potentially increase efficiency of results and be used to assess the dependence between the nonterminal and terminal events.
Towards the analysis of semicompeting risks data, the statistical literature has focused on three broad frameworks that seek to exploit the joint information on the time to the nonterminal event and the time to the terminal event [7]: those based on copulas [1, 8–10]; those framed from the perspective of causal inference [11, 12]; and, those based on the illnessdeath multistate model [2, 13–16]. In this paper, we focus on the last of these approaches, for which the philosophical underpinning is that patients begin in some initial state at time zero and may transition into the nonterminal and/or terminal state [14, 16–18]. Analyses typically proceed through the development of models for transitionspecific hazard functions (which dictate the rate at which patients experience the events), often with the use of subjectspecific frailties, which can be viewed as individualspecific random effects that acknowledge the heterogeneity across individuals that is not accounted for by covariates [19, 20]. Moreover, the shared frailty accounts for dependence between the nonterminal and terminal events, which can be quantified from an estimable frailty variance parameter.
In the analysis of timetoevent outcomes, data are subject to lefttruncation or delayed entry when subjects are enrolled into a study after the time origin of interest. Lefttruncation is common in the study of aging and dementia, where the age scale is commonly taken to be the time scale [21–23]. In this setting, sampling is biased toward longer followup times since patients are typically only included in the study if they are dementiafree at study entry. The analysis of lefttruncated timetoevent data should apply statistical methods that account for this bias. Although current methods exist for analyzing lefttruncated semicompeting risks data via a standard illnessdeath model (without a shared frailty) [24, 25] and have been applied to Alzheimer’s disease [26–29], to our knowledge, there are no published methods in the literature for fitting an illnessdeath model with shared frailty to lefttruncated semicompeting risks data. The purpose of this paper is to fill this gap by providing a modeling framework and openly available code for implementation.
In this paper, we provide methods for fitting an illnessdeath model with shared frailty to lefttruncated semicompeting risks data. In “Methods” section, we present the model specification (“Model specification: Illnessdeath model” section), methods for estimation and inference (“Estimation and inference” section), a brief simulation study (“Simulation study” section), and an analysis of data from Kaiser Permanente Northern California examining the relationship between educational level and incident dementia and death (“Assessing the impact of education level on incident dementia in a large US cohort” section). We present results in “Results” section. We conclude with a discussion in “Discussion” section and conclusions in “Conclusions” section.
Methods
Model specification: Illnessdeath model
Semicompeting risks refers to the setting where interest lies in a nonterminal event that is subject to dependent censoring by a terminal event [1]. This paper focuses on modeling semicompeting risks data using the illnessdeath multistate model, where in the study of dementia we take the nonterminal and terminal events to be dementia diagnosis and death. Figure 1 illustrates the possible individual trajectories following time of study accrual. Let T_{1} and T_{2} denote the nonterminal and terminal event times, respectively, following a welldefined time origin. Following the multistate modeling literature, the illnessdeath model is characterized by the transition hazards:
Although a variety of hazard regression models can be adopted, including Coxtype multiplicative hazard regression models [30], additive models [31] and accelerated failure time models [32], this paper assumes the first of these which is prevalent in the multistate modeling literature:
where λ_{k,0}(·) are baseline hazard functions and β_{k} are loghazard ratio regression parameters, for k=1,2,3. We allow for two common specifications of λ_{3,0}(t_{2}T_{1}=t_{1}), a Markov model defined by λ_{3,0}(t_{2}T_{1}=t_{1})=λ_{3,0}(t_{2}) or a semiMarkov (‘clockreset’) model defined by λ_{3,0}(t_{2}T_{1}=t_{1})=λ_{3,0}(t_{2}−t_{1}). In each of expressions (4)(6), γ is a common subjectspecific frailty with mean 1.0 and variance θ, which serves two related purposes. First, similar to a random intercept in a generalized linear mixed model [33], the frailties serve to accommodate betweensubject heterogeneity that is not accounted for by the covariates included in the linear predictors. Second, the frailty induces dependence between the nonterminal and terminal events since a patient with frailty larger than one will be at higher risk of both the nonterminal and terminal events than the population average (conditional on covariates).
To complete the specification of the model, the baseline hazard functions and frailty distribution must be specified. For the latter, we adopt a Gamma distribution, which is commonly used because closedform expressions for the marginal likelihood contributions can be obtained. It is important to note that although the shared frailty term can only account for or represent positive dependence between the hazard functions [20], it is possible to have negative correlation between the hazard functions corresponding to T_{1} and T_{2}, e.g. if λ_{1,0}(t) is monotonically increasing and λ_{2,0}(t) is monotonically decreasing.
The baseline hazard functions can be specified parametrically [34], semiparametrically [15, 26], or nonparametrically [2]. In this paper, we consider Weibull baseline hazards of the form \(\phantom {\dot {i}\!}\lambda _{k,0}(t)=\alpha _{k} \kappa _{k} t^{\alpha _{k}  1}\), for k=1,2,3, which are commonly used in survival analysis, and flexible Bspline baseline hazard functions that satisfy logλ(t)=B(t), where B(t) is a polynomial Bspline function of degree d with unique knots at \(t_{0}< t_{1}< \dots < t_{K}< t_{K+1}\) and defined for t∈[t_{0},t_{K+1}]. For continuous timetoevent outcomes with support on the positive real line, we let t_{0}=0 and t_{K+1} to be the largest followup time. Note that the Bspline function B(t) is parametrically defined as a linear combination of Bspline basis functions B_{b,d}(t) of degree d,
where η_{b} are parameters, known as control points, and the Bspline basis functions are defined for t∈[t_{0},t_{K+1}] [35].
Estimation and inference
Observed data
Let T_{i1} and T_{i2} denote the nonterminal and terminal event times and X_{i} a vector of patientspecific covariates that includes X_{i1},X_{i2} and X_{i3} for individual i used in (4)(6). We assume that the observed data are subject to rightcensoring and lefttruncation. Let C_{i} and L_{i} denote the rightcensoring and lefttruncation times which we assume are independent of T_{i1} and T_{i2} conditional on X_{i}. Note that we assume the convention of Xu and colleagues [2], by setting T_{1}=∞ for individuals who experience the terminal event in the absence of the nonterminal event. In the analysis of dementia on the age scale, where the nonterminal and terminal events are dementia diagnosis and death and the lefttruncation time is study entry, prevalent dementia cases are typically excluded from the analysis since the primary outcome of interest is dementia [27], i.e. we require that T_{i1}>L_{i}. The observed data for the i^{th} individual is D_{i}={L_{i},Y_{i1},δ_{i1},Y_{i2},δ_{i2},X_{i}}, where Y_{i1}= min(T_{i1},T_{i2},C_{i}) with nonterminal event indicator δ_{i1}=I{T_{i1}≤ min(T_{i2},C_{i})},Y_{i2}= min(T_{i2},C_{i}) with terminal event indicator δ_{i2}=I{T_{i2}≤C_{i}}, and L_{i}<Y_{i1}.
Likelihood
Towards developing a form of the likelihood, we first present the joint density, (T_{1},T_{2}), in the absence of lefttruncation below [15, 25]:
where \(S_{k}(tX)=\exp \left \{\int _{0}^{t} \lambda _{k}(uX)~du\right \}\), for k=1,2, and either \(S_{3}(t_{2}t_{1},X)=\exp \left \{\int _{t_{1}}^{t_{2}} \lambda _{3}(ut_{1},X)~du\right \}\) if a Markov model is assumed for λ_{3}(·) or \(S_{3}(t_{2}t_{1},X)=\exp \left \{\int _{0}^{t_{2}t_{1}} \lambda _{3}(ut_{1},X)~du\right \}\) if a semiMarkov model is assumed.
Under the assumption of independent truncation, the joint density of lefttruncated semicompeting risks data, (L,T_{1},T_{2}), is:
which is a product of the conditional density of (T_{1},T_{2}) given L and the marginal density of L [24, 25, 36] in (8), where \(g_{T_{1},T_{2}}\) is given in (7), g_{L}(·) is the density function corresponding to L and \(S_{T_{1}}(tX)=P(T_{1}>tX)=S_{1}(tX)S_{2}(tX)\). When the distribution of lefttruncation times is unknown, the maximum likelihood estimate can be obtained from the conditional likelihood, ignoring the marginal likelihood of the lefttruncation times [24, 25, 36, 37].
Based on the observed data, D_{i}, there are four possible data scenarios and thus likelihood contributions corresponding to the distinct combinations of the nonterminal and terminal event indicators, δ_{1} and δ_{2}. Using (7) and the conditional likelihood expression (on the left) in (8), the observed data likelihood for model parameters ϕ=(ξ_{1},ξ_{2},ξ_{3},β_{1},β_{2},β_{3},θ), where ξ_{k} are the baseline hazard parameters and β_{k} are the vectors of regression coefficients for transition k=1,2,3 and θ is the Gamma frailty variance, is
where
and
Since the individualspecific frailty terms, γ_{i}, are not observed, we marginalize the likelihood with respect to γ_{i}. Detailed derivations of the marginal likelihood components in (9) are included in Section A of the Additional file 1.
Estimation and inference
We will use maximum likelihood for estimation and inference of model parameters using the marginal loglikelihood in Section A of the Additional file 1 [38]. Let \(\mathcal {U}(\phi) = \partial /\partial \phi \left (\log \mathcal {L}(\phi)\right)\) denote the score function. Using standard arguments, under certain regularity conditions and a correctly specified model, the maximum likelihood estimator of ϕ, denoted \(\widehat {\phi }\), is the solution to \(\mathcal {U}(\phi) = 0\) and is consistent for the true ϕ_{0} as n →∞. In addition, \(\sqrt {n}(\widehat {\phi } \phi _{0}) \longrightarrow _{d}\) MVN (0, Σ) as n → ∞, where Σ=I(ϕ_{0})^{−1} is the inverse of the expected information matrix:
Var[\(\widehat {\phi }\)] can be estimated via the inverse of the observed information matrix.
From our experience fitting the proposed model, optimization of the loglikelihood requires careful consideration of the numerical optimization algorithm used and the choice of starting values. For modeling fitting, we used a quasiNewton nonlinear numerical optimization algorithm [39] as implemented in the optim function in R [40] to maximize the loglikelihood. For Weibull baseline hazard models, starting values for the nonlinear optimization were generated by fitting the univariate Weibull regression models for each transition (4)(6). For Bspline parameterizations of the baseline hazard functions, we used the bSpline function in the splines2 package [41] in R [40] to generate Bspline basis functions and obtained starting values for hazard parameters as follows: fit univariate Cox models for each transition hazard model; smoothed the estimated cumulative baseline hazard functions using linear interpolation; obtained smoothed baseline hazards function via numerical differentiation followed by loess; and found the control points that minimized the distance between the smoothed loghazard functions and Bspline functions by least squares. The frailty variance θ was initialized to value 0.5.
Simulation study
We performed a set of simulations to investigate the operating characteristics (e.g., bias and coverage) of the model implementation for both Weibull and Bspline parameterizations of the baseline hazard functions.
Data were generated from the model (4)(6) assuming Weibull baseline hazard functions using simID from the SemiCompRisks package [42, 43] in R [40]. We set model parameters to those obtained from fitting the model to dementia data, where the nonterminal and terminal events were dementia diagnosis and death, the origin was age 65, the lefttruncation time was study entry, and a dichotomous variable was included as a covariate in all three regression models. The baseline hazard and frailty parameter values used in the simulations were: logα_{1}=1.05, logκ_{1}=−9.98, logα_{2}=1.15, logκ_{2}=−10.01, logα_{3}=0.92, logκ_{3}=−5.92, and logθ=−1.39. The dichotomous covariate was drawn from a Bernoulli distribution with probability 0.57 and the regression coefficients for the three transitions were β_{1}=−0.03,β_{2}=−0.33 and β_{3}=−0.11. Age at study entry was drawn from a uniform distribution with range between 65 and 78 years. We administratively censored at age 95 years.
Based on these set parameters, we simulated R=1,000 datasets of sample size n=5,000. For each simulated dataset, we fit both the model with Weibull and Bspline baseline hazard functions. For the fitted Weibull models, we reported the: mean of the parameter estimates; mean of the estimated analytical standard errors; standard deviation of the distribution of parameter estimates (empirical standard error); and the coverage probability (the proportion of Waldbased 95% confidence intervals that contained the true parameter). For the fitted Bsplinebased models, we reported the operating characteristics listed for the Weibullbased fitted models for the regression and frailty parameters only. To assess the Bspline fit of the transition baseline hazard functions, for each of the three transitions, we plotted the estimated baseline hazards from the fitted models, which we compared to the true baseline hazard. We also compared the estimated baseline hazard functions to kernel smoothed NelsonAalen baseline hazard estimates obtained from fitting separate Cox models to each transition using the coxphHaz function from the biostat3 package [44] in R [40]. Note that the univariate Cox models for the 1 and 2transitions accounted for lefttruncation and the Cox model on the 3transition was performed only among those who experienced the nonterminal event. To quantify the difference between the true baseline hazard function and those estimated from the Bspline model and the smoothed NelsonAalen, we calculated the median of the integrated squared error [26].
Assessing the impact of education level on incident dementia in a large US cohort
We applied the methodological approach presented in “Model specification: Illnessdeath model” and “Estimation and inference” sections to data from Kaiser Permanente Northern California (KPNC) where the goal of the analysis was to examine the association of education level on incident dementia. The initial analytic cohort consisted of 36,134 individuals who were at least 65 years old, dementiafree by 1997 and KPNC members as of January 1, 1996 (the earliest date when electronic medical records were reliably available defined as study entry) and participated in the Multiphasic Health Checkups which began in the 1960s and continued through 1996 during which information on participant educational attainment was captured. We categorized the exposure of educational attainment level based on the highest level of completion in the following groups: elementary or high school; trade degree, any college or college graduate; and postgraduate study. We focused the analyses on the four most prevalent race/ethnic subgroups, which were White, Black, Asian and Latinx, resulting in a final analytic dataset comprised of 33,117 KPNC members.
We incorporated the strong force of mortality (of all causes) of individuals aged 65 or greater in the analysis via the illnessdeath model in (1)(3), where the nonterminal event of interest was incident dementia diagnosis and the terminal event was death. We assumed the semiMarkov or ‘clockreset’ approach for the 3transition model (dementia → death). Followup was administratively censored at the earliest of age 90 or end of study defined by September 30, 2017. In our primary analysis, we used a shared frailty to account for the dependence between incident dementia and death that is not accounted for by covariates. We assumed that time was on the age scale, starting at age 65, and thus lefttruncated since members were not followed before age 65. For this time origin, lefttruncation was accounted for using the approach in “Estimation and inference” section, assuming Bspline parameterizations of the transition baseline hazard functions. When Bspline baseline hazards are assumed, model fitting requires the specification of the internal knots (number and placement) and polynomial degree of the Bspline function. We fit a suite of models that varied the number of internal knots (we considered 1, 2 or 3 knots), which were placed at equally spaced percentiles of the observed event times, and varied the Bspline polynomial degree (linear, quadratic, cubic). The final model was chosen based on the largest loglikelihood. Time was scaled by a factor of five so that the interpretation of hazard ratios are in terms of 5year increments. We controlled for sex (reference group: male) and race/ethnicity (reference group: White) in all hazard transition models via covariate adjustment.
In sensitivity analyses, we fit models: 1) without a shared frailty; 2) that defined study entry as the time origin (which we refer to as timeonstudy models); and 3) assumed Weibull baseline hazard functions. When the origin was taken to be study entry, for the 1 and 2transition hazard models we adjusted for age at study entry (centered at age 65); for the 3−transition hazard model, we adjusted for age at dementia (centered at the cohort mean which was 83 years). These models were fit using the SemiCompRisks package in R [42].
Note that this was an electronic medical record (EMR)based study. KPNC is an integrated health care delivery system of approximately 4.3 million members that adopted an EMR system as early as 1993. All healthcare utilization, including care visits (inpatient, outpatient (ambulatory), emergency department, telephone and video visits), pharmacy fills, etc., are captured in the EMR. Dementia diagnoses and death dates were pulled from the EMR. Dementia diagnoses were based on International Classification of Diseases (ICD) codes that were made during a member visit. Deaths were obtained from the KPNC mortality linkage file which includes data from multiple sources (internal reporting, California state death records, Social Security Administration). Followup was not prescribed within the cohort and could vary by individual. However, health system utilization is very high in this population with a median of 97 visits with a physician during the study period (minimum: 0.003 years, median: 15.4 years, maximum: 21.2 years). Details regarding the dataset, including outcome definition, exposure variable and covariates, can be found in the Additional file 1: Section C.1.
Results
Simulation study
The operating characteristics of the regression estimates and the frailty parameter are displayed in Table 1. Both modeling approaches resulted in regression estimates with little bias and good coverage. For the Weibullbased model, the estimates of the frailty variance parameter were slightly underestimated with conservative coverage. For the Bspline based model, the mean of the frailty variance parameter was upward downward. The simulation results for the baseline hazard parameters corresponding to the fitted Weibullbased models are displayed in Additional file 1: Table B.1 and exhibit negligible bias and good coverage. In Fig. 2, we display the true baseline hazard functions in red and the estimated baseline hazard functions from the fitted Bsplinebased models in the left column and the corresponding kernel smooth NelsonAalen estimates from separate univariate Cox models in the right column. Deviations in the estimated baseline hazard functions for both approaches are likely due to data sparsity in the latter half of the followup time. The median integrated squared error comparing each of the true baseline hazard functions to the estimated baseline hazard functions over the range of observed event times are displayed in Fig. 2 and indicate that the Bsplinebased model provides estimates of the baseline hazard functions that are close to the truth and are comparable, if not slightly better, than the kernel smooth NelsonAalen estimates corresponding to separate Cox models.
Assessing the impact of education level on incident dementia in a large US cohort
A description of the semicompeting risks data and demographic variables used in the applied analysis are presented in Table 2. Of the 33,117 members in our analytic dataset, elementary or high school was the highest level of education in 43% of members, followed by trade school, some college or a college degree in 40%, and postgraduate study in 16.7%. More than half (56.6%) were female and 72.0% were of White race. By the end of the study, nearly half (47.1%) of members died without a diagnosis of dementia, with 19.1% censored before dementia or death, 5.0% alive with dementia, and 28.9% died carrying a diagnosis of dementia. In examining the outcome prevalences among the three education level groups, there were more dementia diagnoses and allcause deaths among members in the elementary and high school group.
The regression estimates from the fitted models are presented in Table 3 and Table C.1 of the Additional file 1. Across all models and time origins, increasing education level was associated with a decreased risk of dementia. In the primary analysis (on the age scale with the inclusion of a shared frailty presented in the first two columns of Table 3), the estimated hazard ratio (HR) and 95% confidence interval (CI) was 0.87 (0.83, 0.92) comparing those with a trade degree, some college or a college degree to those who had an elementary or high school education. This protective effect was amplified when comparing those with postgraduate study to those who had an elementary or high school education (HR: 0.76, 95% CI: 0.71, 0.81). There was some variation in the estimated HR across different time origins and models. The hazard ratio comparing those with postgraduate study to those who had an elementary or high school education for the timeonstudy model with shared frailty (third column of Table 3) was closer to the null with value 0.80 and to a greater degree in the timeonstudy model without shared frailty (last two columns of Table 3) with an estimated HR of 0.83.
Similarly, across all models and time origins, an education level of some college or more compared with a trade degree or less was associated with a decreased risk of death, with estimated HR and 95% CI in the primary model of 0.84 (0.81, 0.88) comparing those with a trade degree, some college or a college degree to those who had an elementary or high school education and 0.68 (0.65, 0.72) comparing those with postgraduate education to those who had an elementary or high school education. In our primary model, increased education level was associated with a reduced risk of death following dementia diagnosis. However, the timeonstudy models (columns 3, 4, 7 and 8 of Table 3) provide no evidence of this association with point estimates that suggest an opposite effect.
In the models with a shared frailty and Bspline baseline hazard functions (first four columns of Table 3), the estimated frailty variance parameter, θ was 0.41 (0.37, 0.45) when age was the time scale and 0.25 (0.20, 0.31) for the timeonstudy analysis, suggesting a greater degree of dependence between timetodementia and allcause death when age is the time scale. When Weibull baseline hazards were assumed (presented in Table C.1 of the Additional file 1), the estimated frailty variance parameter, θ was comparable with value 0.36 (0.32, 0.40) when age was the time scale. However, the estimated frailty variance was null for the timeonstudy analysis. This finding is difficult to explain, but an examination of the estimated baseline hazard functions from the frailty models indicate differences in the underlying baseline risk assumed in the various models. Figure 3 presents the estimated baseline hazard functions under Bspline or Weibull parameterizations and a shared frailty term for two time origins: 1) age 65 (agescale, adjusted for lefttruncation), and 2) study entry (timeonstudy). After fitting a suite of models that varied the number of internal knots and the Bspline polynomial degree, the specification with the largest loglikelihood was defined by one internal knot with linear Bsplines for the 1 and 2transitions and two internals knots with cubic Bsplines for the 3transition. We observe that the baseline hazard for the 1 and 2transitions were similar for the Bspline and Weibull specifications and increasing over time, more so when time was on the age scale (left panel) than timeonstudy (right panel). For the 3transition (death following dementia), the estimated baseline hazard functions were quite different for the Bspline and Weibull parameterizations, as the Weibull can only accommodate baseline hazards of the form of a power function, while Bsplines can approximate a range of functional forms. For both Bspline and Weibull parameterizations, the estimated baseline hazard for the 3transition indicate that the risk of death is highest shortly after dementia diagnosis, which has been observed in a study of survival after dementia diagnosis in five racial/ethnic groups [45]. This is followed by either decreases then increases (based on the Bspline parameterization) or decreases over time (based on the Weibull parameterization).
Discussion
For the analysis of lefttruncated semicompeting risks data, we have provided methods and software for fitting an illnessdeath model with shared frailty assuming Weibull or Bspline baseline hazards. These methods were used to estimate the association of education level and dementia accounting for the competing risk of death in a cohort of 33,117 Kaiser Permanente Northern California members. We found a doseresponse relationship between educational attainment and incident dementia, with a decreased risk of dementia associated with increasing levels of education, after adjusting for sex and race/ethnicity. The impact of education level and incident dementia is still not well understood, with published studies reporting both protective and null effects [46, 47]. Our study supports that higher education is associated with a lower risk of dementia in a large US cohort.
Note that the conclusions drawn for the outcome of interest (dementia) from the illnessdeath model with shared frailty aligned with those from alternative models that we considered, which omitted the shared frailty or used study entry as the time origin (as shown in Table 3). We expect that there will be variation in the estimates from these models depending on the data at hand, as was observed in the estimates of education level on the risk of death following dementia diagnosis in our study.
In our examination of model fitting operating characteristics via simulated data, we found that the regression parameters were estimated with negligible bias and good coverage. However, on average the frailty variance parameter was slightly underestimated for the Weibull baseline hazard parameterization and overestimated for the Bspline baseline hazard parameterization. For the case of the Weibull baseline hazard parameterization, the coverage was conservative for a sample size 5,000, but was closer to 95% when the sample size was increased 10,000 (see Table B.2 in the Additional file 1). For the case of the Bspline baseline hazard parameterization, the coverage was lower than 95% due to the bias in the frailty variance parameter. It is important to note that primary interest in the methods presented in this paper are the regression parameters. The frailty, and corresponding frailty variance parameter, allow us to further account for the dependence between the nonterminal and terminal events beyond covariate adjustment, analogous to a random effect in a random effects model. Similar to a random effects model, the primary interest lies in the mean outcome model and regression estimates; the variance parameter of the normally distributed random effects are typically of secondary interest.
In the analysis of dementia diagnoses such as those presented in this paper, prevalent cases at study entry are typically excluded. However, the likelihood in “Likelihood” section can be easily updated (see Additional file 1: Section D) to include prevalent nonterminal cases. Note that in the literature, there are two approaches for handling prevalent nonterminal cases in the analysis of lefttruncated semicompeting risks data. The approach we take conditions on the history up to the left truncation time as in [25, 37] so that prevalent nonterminal cases only contribute to the estimation of λ_{3}. Estimation is straightforward assuming an illnessdeath model with shared frailty since the frailty term, γ, can be easily integrated out. Alternatively, Saarela and colleagues [24] provided methods for estimation that conditions on the lefttruncation time only, so that prevalent cases contribute to the estimation of all transition hazards. This approach is more efficient as it uses more of the data, but is computationally more intensive as it involves numerical integration. This approach does not accommodate an illnessdeath model with shared frailty well since the integration of the shared frailty term is not straightforward.
In our approach to fitting an illnessdeath model with shared frailty to lefttruncated semicompeting risks data, we have considered fullyparametric specifications of the baseline hazard functions, Weibull and Bspline. Both functional forms are flexible and can approximate a wide range of baseline hazard functions. At the time of submission, a preprint by Gorfine et al. [48] proposed a semiparametric approach to the illnessdeath model with shared frailty using a pseudolikelihood approach to estimating the regression parameters and baseline hazard functions that accommodates lefttruncated semicompeting risks data. While the illnessdeath model with shared frailty in this paper was formulated using hazard models that are conditional on the frailty in (4)(6), Gorfine et al. [48] focused on marginal Cox hazard models. This is analogous to the conditional and marginal approaches to modeling mean outcomes in the presence of clustering via mixed models [49] and generalized estimating equations [50], respectively. Thus our approaches are complementary, filling a gap in the literature and allowing the analyst options for fitting an illnessdeath model with shared frailty to lefttruncated semicompeting risks data.
One of the reviewers pointed out that dementia diagnoses may be subject to intervalcensoring. To explore the possibility and/or extent of intervalcensoring in the cohort, we looked at the patterns of inpatient and outpatient visits (during which dementia might be assessed) among two groups of members: those who were diagnosed with dementia during the study, and those who died without a dementia diagnosis. The concern is that long gaps between visits would lead to imprecise dementia diagnosis dates in the former and missed opportunities for dementia diagnoses in the latter. We found that among those who were diagnosed with dementia during the study, 81% had a visit with a physician within 60 days prior to the diagnosis date in the EMR. For those who died without a dementia diagnosis, 90% had a visit within 60 days prior to death. Plots of individuallevel visit patterns over the study period (see Supplementary File Figures C1 and C2) among members of these two groups illustrate that utilization is high in this cohort.
Based on these data, we believe that intervalcensoring is not of major concern in our data, as it is in prospective studies of Alzheimer’s disease or dementia, such as PAQUID [51, 52] and the Adult Changes in Thought Study [53, 54], where dementia screening can be years apart. While analyses of data from those prospective studies are indeed complicated by intervalcensoring, they were designed for the purpose of understanding incident dementia with identification of dementia cases based on a battery of neuropsychological testing and confirmation by a neurologist. An EMRbased analysis of a cohort of high care utilizers may avoid intervalcensoring, however, may capture dementia cases with less rigor. At KPNC, a similar set of EMR code used to identify dementia diagnoses was shown to have a sensitivity of 77% and a specificity of 95% compared with a consensus dementia diagnosis utilizing a neuropsychiatric battery, structured interviews, physical examination, and medical records review. If intervalcensoring were evident in our data, modeling should be updated to account for intervalcensoring. This can be done by updating the likelihood function, as in Touraine, et al. 2017 [52]. In the setting of a shared frailty illnessdeath model presented in this paper, this is an avenue for future research.
It is important to mention that we provide methods for fitting a shared frailty illnessdeath model subject to lefttruncated data when the covariate of interest is fixed with time. As one reviewer aptly pointed out, timevarying covariates may also be of interest to an analyst. We believe that the modeling framework specified in this paper can incorporate timedependent covariates. However, deriving the likelihood function for a shared frailty model requires marginalization (integration) of the frailty term, which is complicated when timedependent covariates are used and, such, beyond the scope of this paper. We intend to explore the implementation of the proposed extension with timevarying covariates in future work.
Conclusions
As illustrated by our analysis of Kaiser data, our proposed modeling framework allows the analyst to assess the impact of covariates on semicompeting risks data, such as incident dementia and death, while accounting for dependence between the outcomes when data are lefttruncated, as is common in studies of aging and dementia. This approach has the potential to be applied to a wide range of settings beyond the field of aging.
Availability of data and materials
The data used in this study contain Protected Health Information of KPNC members and cannot be shared. Code for generating simulated data used in the paper and that were similar to those analyzed in the data application are provided as additional files.
Abbreviations
 KPNC:

Kaiser Permanente Northern California
 HR:

Hazard ratio
 CI:

Confidence interval
References
 1
Fine J, Jiang H, Chappell R. On semicompeting risks data. Biometrika. 2001; 88(4):907–19.
 2
Xu J, Kalbfleisch J, Tai B. Statistical analysis of illness–death processes and semicompeting risks data. Biometrics. 2010; 66(3):716–25.
 3
JacqminGadda H, Blanche P, Chary E, Loubère L, Amieva H, Dartigues JF. Prognostic score for predicting risk of dementia over 10 years while accounting for competing risk of death. Am J Epidemiol. 2014; 180(8):790–8.
 4
Alzheimer’s Association, et al.2018 alzheimer’s disease facts and figures. Alzheimers Dement. 2018; 14(3):367–429.
 5
Matthews KA, Xu W, Gaglioti AH, Holt JB, Croft JB, Mack D, McGuire LC. Racial and ethnic estimates of Alzheimer’s disease and related dementias in the United States (2015–2060) in adults aged ≥65 years. Alzheimers Dement. 2019; 15(1):17–24.
 6
Brodaty H, Seeher K, Gibson L. Dementia time to death: a systematic literature review on survival time and years of life lost in people with dementia. Int Psychogeriatr. 2012; 24(7):1034–45.
 7
Varadhan R, Xue QL, BandeenRoche K. Semicompeting risks in aging research: methods, issues and needs. Lifetime Data Anal. 2014; 20(4):538–62.
 8
Peng L, Fine J. Regression modeling of semicompeting risks data. Biometrics. 2007; 63(1):96–108.
 9
Hsieh JJ, Wang W, Ding A. Regression analysis based on semicompeting risks data. J R Stat Soc B. 2008; 70(1):3–20.
 10
Lakhal L, Rivest LP, Abdous B. Estimating survival and association in a semicompeting risks model. Biometrics. 2008; 64(1):180–8.
 11
Egleston B, Scharfstein D, Freeman E, West S. Causal inference for nonmortality outcomes in the presence of death. Biostatistics. 2007; 8(3):526–45.
 12
Tchetgen Tchetgen E. Identification and estimation of survivor average causal effects. Stat Med. 2014; 33(21):3601–28.
 13
Liu L, Wolfe R, Huang X. Shared frailty models for recurrent events and a terminal event. Biometrics. 2004; 60(3):747–56.
 14
Putter H, Fiocco M, Geskus R. Tutorial in biostatistics: competing risks and multistate models. Stat Med. 2007; 26(11):2389–430.
 15
Lee KH, Haneuse S, Schrag D, Dominici F. Bayesian semiparametric analysis of semicompeting risks data: investigating hospital readmission after a pancreatic cancer diagnosis. J R Stat Soc C. 2015; 64(2):253–73.
 16
Lee C, Lee SJ, Haneuse S. Timetoevent analysis when the event is defined on a finite time interval. Stat Methods Med Res. 2019; 29(6):1573–91.
 17
Klein J, Keiding N, Copelan E. Plotting summary predictions in multistate survival models: probabilities of relapse and death in remission for bone marrow transplantation patients. Stat Med. 1993; 12(24):2315–32.
 18
de Wreede L, Fiocco M, Putter H. mstate: An R Package for the Analysis of Competing Risks and MultiState Models. J Stat Softw. 2011. 38(7).
 19
Wienke A. Frailty Models in Survival Analysis. Boca Raton: CRC Press, Taylor & Francis Group; 2010.
 20
Putter H, van Houwelingen H. Frailties in multistate models: Are they identifiable? do we need them?. Stat Methods Med Res. 2015; 24(6):675–92.
 21
Korn EL, Graubard BI, Midthune D. Timetoevent analysis of longitudinal followup of a survey: choice of the timescale. Am J Epidemiol. 1997; 145(1):72–80.
 22
Liestol K, Andersen PK. Updating of covariates and choice of time origin in survival analysis: problems with vaguely defined disease states. Stat Med. 2002; 21(23):3701–14.
 23
Lamarca R, Alonso J, Gomez G, Munoz A. J Gerontol A Biol Sci Med Sci. 1998; 53(5):337–43.
 24
Saarela O, Kulathinal S, Karvanen J. Joint analysis of prevalence and incidence data using conditional likelihood. Biostatistics. 2009; 10(3):575–87.
 25
VakulenkoLagun B, Mandel M. Comparing estimation approaches for the illness–death model under left truncation and right censoring. Stat Med. 2016; 35(9):1533–48.
 26
Joly P, Commenges D, Letenneur L. A penalized likelihood approach for arbitrarily censored and truncated data: application to agespecific incidence of dementia. Biometrics. 1998; 54(1):185–94.
 27
Commenges D, Letenneur L, Joly P, Alioum A, Dartigues JF. Modelling agespecific risk: application to dementia. Stat Med. 1998; 17(17):1973–88.
 28
Joly P, Commenges D, Helmer C, Letenneur L. A penalized likelihood approach for an illness–death model with intervalcensored data: application to agespecific incidence of dementia. Biostatistics. 2002; 3(3):433–43.
 29
Commenges D, Joly P, Letenneur L, Dartigues JF. Incidence and mortality of alzheimer’s disease or dementia using an illnessdeath model. Stat Med. 2004; 23(2):199–210.
 30
Cox D. Regression models and life tables (with discussion). J R Stat Soc Ser B (Methodol). 1972; 34:187–220.
 31
Aalen O. A linear regression model for the analysis of life times. Stat Med. 1989; 8(8):907–25.
 32
Wei LJ. The accelerated failure time model: a useful alternative to the Cox regression model in survival analysis. Stat Med. 1992; 11(1415):1871–9.
 33
McCulloch CE, Neuhaus JM. Generalized linear mixed models. Encycl Biostat. 2005; 4:220–46.
 34
Conlon A, Taylor J, Sargent D. Multistate models for colon cancer recurrence and death with a cured fraction. Stat Med. 2014; 33(10):1750–66.
 35
De Boor C. A Practical Guide to Splines. Applied Statistics, vol. 27. New York: Springer; 1978.
 36
Wang MC, Jewell NP, Tsai WY. Asymptotic properties of the product limit estimate under random truncation. Ann Stat. 1986; 14(4):1597–605.
 37
Guo G. Eventhistory analysis for lefttruncated data. Sociol Methodol. 1993; 23:217–243.
 38
Ferguson T. A Course in Large Sample Theory. London: Chapman and Hall; 1996.
 39
Nocedal J, Wright S. Numerical Optimization. 2nd ed. New York: Springer; 2006.
 40
R Core Team. R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing; 2017. https://www.Rproject.org/.
 41
Wenjie W, Jun Y. splines2: Regression Spline Functions and Classes Too. 2017. R package version 0.2.5. https://CRAN.Rproject.org/package=splines2. Accessed 15 June 2018.
 42
Lee KH, Lee C, Alvares D, Haneuse S. SemiCompRisks: Hierarchical Models for Parametric and SemiParametric Analyses of SemiCompeting Risks Data. 2019. R package version 3.3. https://CRAN.Rproject.org/package=SemiCompRisks. Accessed 30 Jan 2019.
 43
Alvares D, Haneuse S, Lee C, Lee KH. SemiCompRisks: An R Package for the Analysis of Independent and Clustercorrelated Semicompeting Risks Data. R Journal. 2019; 11(1):376–400. https://doi.org/10.32614/RJ2019038.
 44
Karlsson A, Clements M. Biostat3: Utility Functions, Datasets and Extended Examples for Survival Analysis. 2019. R package version 0.1.4. https://CRAN.Rproject.org/package=biostat3. Accessed 8 Nov 2019.
 45
Mayeda ER, Glymour MM, Quesenberry CP, Johnson JK, PérezStable EJ, Whitmer RA. Survival after dementia diagnosis in five racial/ethnic groups. Alzheimers Dement. 2017; 13(7):761–9.
 46
Sharp ES, Gatz M. The relationship between education and dementia an updated systematic review. Alzheimer Dis Assoc Disord. 2011; 25(4):289.
 47
Xu W, Tan L, Wang HF, Tan MS, Tan L, Li JQ, Zhao QF, Yu JT. Education and risk of dementia: doseresponse metaanalysis of prospective cohort studies. Mol Neurobiol. 2016; 53(5):3113–23.
 48
Gorfine M, Keret N, Arie AB, Zucker D, Hsu L. Marginalized FrailtyBased IllnessDeath Model: Application to the UKBiobank Survival Data. J Am Stat Assoc. 2020. https://doi.org/10.1080/01621459.2020.1831922.
 49
Laird NM, Ware JH. Randomeffects models for longitudinal data. Biometrics. 1982; 38(4):963–74.
 50
Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986; 73(1):13–22.
 51
Dartigues J, Gagnon M, Michel P, Letenneur L, Commenges D, BarbergerGateau P, Auriacombe S, Rigal B, Bedry R, Alperovitch A. The paquid research program on the epidemiology of dementia. methods and initial results. Rev Neurol. 1991; 147(3):225–30.
 52
Touraine C, Helmer C, Joly P. Predictions in an illnessdeath model. Stat Methods Med Res. 2016; 25(4):1452–70.
 53
Kukull WA, Higdon R, Bowen JD, McCormick WC, Teri L, Schellenberg GD, van Belle G, Jolley L, Larson EB. Dementia and alzheimer disease incidence: a prospective cohort study. Arch Neurol. 2002; 59(11):1737–46.
 54
Lee KH, Rondeau V, Haneuse S. Accelerated failure time models for semicompeting risks data in the presence of complex censoring. Biometrics. 2017; 73(4):1401–12.
Acknowledgements
The authors would like to thank Dr. Tracy Lieu, the director of the Division of Research, for her support.
Funding
This work was funded in part by the following National Institutes of Health grants: R01 CA18136001 and R01AG066132. The funder had no role in the study design, methods development, analyses and writing of the manuscript.
Author information
Affiliations
Contributions
CL and SH developed the statistical theory and contributed to successful model implementation. PG curated the analytic dataset, directed the analysis, and interpreted the findings. All authors drafted, reviewed and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
This study was approved by the Kaiser Permanente Northern California Institutional Review Board. This approval allowed for access to electronic health records and administrative and legacy databases for research purposes.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1
Supplementary findings. File name: Lee_Gilsanz_HaneuseAdditional file.pdf. Includes: A: Marginal components of the likelihood referenced in “Likelihood” section; B: Additional simulation results; C: Additional information from the analysis of Kaiser data including: detailed description of the data from Kaiser Permanente Northern California and additional results from the analysis of Kaiser data; D: Likelihood expression when observed data include prevalent nonterminal cases referenced in “Discussion” section.
Additional file
r code for implementation. File names: fitmodels.txt and functions.txt provide code for generating simulated semicompeting risks data and fitting the shared frailty illnessdeath model to lefttruncated semicompeting risks data described in the manuscript.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Lee, C., Gilsanz, P. & Haneuse, S. Fitting a shared frailty illnessdeath model to lefttruncated semicompeting risks data to examine the impact of education level on incident dementia. BMC Med Res Methodol 21, 18 (2021). https://doi.org/10.1186/s12874020012038
Received:
Accepted:
Published:
Keywords
 Illnessdeath
 Lefttruncation
 Semicompeting risks
 Multistate models
 Bsplines
 Dementia