 Research article
 Open access
 Published:
Quasilinear Cox proportional hazards model with cross L_{1} penalty
BMC Medical Research Methodology volume 20, Article number: 182 (2020)
Abstract
Background
To accurately predict the response to treatment, we need a stable and effective risk score that can be calculated from patient characteristics. When we evaluate such risks from timetoevent data with rightcensoring, Cox’s proportional hazards model is the most popular for estimating the linear risk score. However, the intrinsic heterogeneity of patients may prevent us from obtaining a valid score. It is therefore insufficient to consider the regression problem with a single linear predictor.
Methods
we propose the model with a quasilinear predictor that combines several linear predictors. This provides a natural extension of Cox model that leads to a mixture hazards model. We investigate the property of the maximum likelihood estimator for the proposed model. Moreover, we propose two strategies for getting the interpretable estimates. The first is to restrict the model structure in advance, based on unsupervised learning or prior information, and the second is to obtain as parsimonious an expression as possible in the parameter estimation strategy with cross L_{1} penalty. The performance of the proposed method are evaluated by simulation and application studies.
Results
We showed that the maximum likelihood estimator has consistency and asymptotic normality, and the cross L_{1}regularized estimator has rootn consistency. Simulation studies show these properties empirically, and application studies show that the proposed model improves predictive ability relative to Cox model.
Conclusions
It is essential to capture the intrinsic heterogeneity of patients for getting more stable and effective risk score. The proposed hazard model can capture such heterogeneity and achieve better performance than the ordinary linear Cox proportional hazards model.
Background
Medical science has made dramatic progress in recent years and reached the stage of trying to develop treatment tailored to patients’ individual characteristics. In particular, it is becoming standard for doctors to prescribe selective therapeutic agents to cancer patients with specific oncogenes. Such personalized medicines are not only effective, but also economical and practical: if personalization fulfills its promise, patients no longer have to try expensive but ineffective treatments, or suffer from unnecessary side effects.
The idea of treatment individualization arose from the fact that patients often show different responses, in terms of both therapeutic and side effects, to the same specific treatments. Therefore, in order to realize individualized treatment, it is necessary to predict treatment risk accurately and carefully based on patients’ characteristics. Because such a prediction should be performed in an objective manner, we need some quantified measurement of risk. This is usually achieved by a risk score, estimated by a regression model derived from various types of datasets. Survival time datasets are among the most popular source of data in medical science because they focus on extension of timetoevent (in this case, an undesirable or bad event). Several types of events can be considered, including cancer prognosis or metastasis, myocardial infarction, and death. For timetoevent data with rightcensoring, it is standard to apply the relative risk model. The main feature of the model is the assumption that hazards are proportional; i.e., the hazard ratio of two different subjects depends only on their covariates. Despite requiring this somewhat strong assumption, this type of model is used in a broad range of applications. Although the relative risk model covers a wider range, the exponential relative risk model, known as Cox’s proportional hazards model, is most common for these applications. For simplicity, in the rest of this paper we assume that all covariates are timeindependent and that there are no event ties, but these assumptions can be relaxed. In the Cox model, it is assumed that the log hazard is decomposed into and timeindependent linear predictor of covariates vector x as log(h(tx)/h_{0}(t))=β^{⊤}x, where h(tx) is a hazard rate at time t, h_{0}(t)=h(t0) is called the baseline hazard function, and β is a coefficient vector.
However, the relationship between the hazard and covariates may not be common among subjects. For example, recent clinical studies focused on the fact that heterogeneity among such populations can result in different responses to the same treatment. Such complex relationships can no longer be described in a single hazard model. Therefore, we should consider a mixture hazard model to capture the different hazard patterns in the heterogeneous population. In fact, [1] and [2] introduced a general family of mixture hazard models to describe multimodal hazards, although these models were described in a limited situation under a parametric approach. Also, in the context of a cure model, a binary hazard mixture model was proposed in a semiparametric manner [3]. In this paper, we propose a natural extension of Cox’s proportional hazards model using a quasilinear predictor that leads to a proportional model with mixture hazards.
The rest of the article is organized as follows. In “Methods” section, we derive the mixture hazard model via the quasilinear predictor and two strategies are developed to obtain parsimonious expression: the restricted quasilinear model and the cross L_{1}penalty estimation. In “Results” section, we investigate the estimators’ asymptotic properties. Moreover, we present numerical simulations and applications to real data sets, respectively. The proofs for all propositions and theorems given as Appendix are available in Additional file 1.
Methods
Quasilinear Cox Model
Formulations
Let t be the survival time of a subject with baseline covariate vector x. Then we define the quasilinear Cox model as
Here, f_{Q} is a quasilinear predictor function defined by the logsum exp averages of K linear predictors [4] as
where π^{⊤}=(π_{1},⋯,π_{K}) is a vector of mixing proportion with \(\sum _{k=1}^{K} \pi _{k} = 1\), and \(\boldsymbol {\beta }^{\top }=(\boldsymbol {\beta }_{1}^{\top }, \cdots, \boldsymbol {\beta }_{K}^{\top })\) is the coefficient vector. The relationship between hazard and covariates differs among subpopulations; the parameter K relies on the total number of subpopulations that satisfy this condition. We find that the quasilinear Cox model can be understood as a mixture hazard model because from (1) and (2)
The underlying hazard model in (3) is described as \(h(t\boldsymbol {x},\boldsymbol {\pi }, \boldsymbol {\beta }) = \sum _{k=1}^{K} \pi _{k} h_{k}(t\boldsymbol {x}, \boldsymbol {\beta }_{k}),\) where \(h_{k} (t\boldsymbol {x},\boldsymbol {\beta }_{k}) = h_{0k}(t) \exp (\boldsymbol {\beta }_{k}^{\top } \boldsymbol {x})\) with the assumption that h_{0k}(t)=h_{0}(t) for any k. Thus the proposed model can be understood as the special case of the mixture of Cox’s proportional hazards models. The assumption about equality of the baseline hazard function may seem somewhat stronger, but this simplifies the model formulation and interpretability. Simulations and application studies in “Simulations” section and “Application” section show that model (3) has sufficient predictive ability. A more general model that removes the assumption of equality of the baseline hazard function is discussed in the “Discussion” section.
Partial likelihood and maximum likelihood estimator
Consider the data (x_{i},t_{i},δ_{i}) (i=1,2,⋯,n) from n subjects, where x_{i} is a pdimensional covariates vector, t_{i} is observed survival or censored time, and δ_{i} is an event indicator which takes a value of 1 if the sample experiences the event by t=t_{i} and 0 otherwise. We assume that t_{i} and δ_{i} are independent for all subjects. Let θ^{⊤}=(π^{⊤},β^{⊤}) and \(\boldsymbol {\theta }_{k}^{\top }=({\pi }_{k},\boldsymbol {\beta }_{k}^{\top })\) for any k. Then, the partial loglikelihood function of the parameter θ is written as
where R(t_{i})={l∈{1,⋯,n}t_{i}≤t_{ℓ}} and \(\eta _{i}(\boldsymbol {\theta }_{k})=\pi _{k} \exp {(\boldsymbol {\beta }_{k}^{\top } \boldsymbol {x}_{i})}\). The R(t_{i}) denotes the risk set at time t_{i}. The maximum partial likelihood estimator \(\hat {\boldsymbol {\theta }}\) of (4) has consistency and asymptotic normality, as shown in “Asymptotic properties” section. Because we cannot get the estimates analytically, as with the Cox’s proportional hazards model, we need some numerical optimization method. As an example of such a method, the outline of the MinorizationMaximization (MM) algorithm [5] is shown here. The convergence property of the algorithm is demonstrated in Appendix A.
First, the score function of the partial likelihood (4) consists of the following elements:
and
where
We remark that the function l(θ) is rather complicated relative to the one of Cox’s model, which typically contains summands in the logarithmic function. In fact, when K=1, (4) is reduced to the standard form of the partial loglikelihood function l(β_{1}) in which the gradient vector becomes
Compared with (7), repeated calculation of (5) and (6) is computationally hard. We therefore consider a simpler function as
which has only feasible terms of log hazard and log cumulative hazard functions. Thus, we observe that
and
which leads to \(\frac {\partial }{\partial \boldsymbol {\theta }}l(\boldsymbol {\theta })= \frac {\partial }{\partial \boldsymbol {\theta }} G(\boldsymbol {\theta },\boldsymbol {\theta }_{0})_{\boldsymbol {\theta }_{0}=\boldsymbol {\theta }}.\) Exploring these properties, we propose a learning algorithm \(\left \{\boldsymbol {\theta }^{(s)}=(\boldsymbol {\pi }^{(s)},\boldsymbol {\beta }^{(s)}) s \in {\mathcal {S}}\right \}\) for the maximum partial likelihood estimator of θ by sequential maximization of G(θ,θ_{0}) as θ^{(s+1)}=argmax_{θ}G(θ,θ^{(s)}) for all s in \({\mathcal {S}}= \{1, 2, \cdots, S\}\), where θ^{(1)} is an initial value and S denotes a stopping time. By definition, we obtain G(θ^{(s+1)},θ^{(s)})≥G(θ^{(s)},θ^{(s)}). It follows from (8) and (9) that the iteration step is given by \(\boldsymbol {\theta }^{(s+1)}=\left (\tilde {\pi }_{k}\left (\boldsymbol {\theta }^{(s)}\right),\tilde {\boldsymbol {\beta }}_{k}\left (\boldsymbol {\theta }^{(s)}\right)\right)\), where
where
We observe that estimating equation in (11) is a weighted variant of standard partial likelihood equation. Furthermore, we observe that the minus Hessian matrix of G(θ,θ_{0}) with respect to θ is positivedefinite, which guarantees that \(\boldsymbol {\theta }^{(s+1)}=\left (\tilde {\pi }_{k}\left (\boldsymbol {\theta }^{(s)}\right),\tilde {\boldsymbol {\beta }}_{k}\left (\boldsymbol {\theta }^{(s)}\right)\right)\), as discussed above, is the unique minimizer of G(θ,θ^{(s)}) in θ. We observe the basic property of the learning algorithm, \(\left \{\boldsymbol {\theta }^{(s)}=\left (\pi ^{(s)},\boldsymbol {\beta }^{(s)}\right)\big  s \in {\mathcal {S}}\right \}\), as follows.
Proposition 1
Let \(\left \{\boldsymbol {\theta }^{(s)}=\left (\pi ^{(s)},\boldsymbol {\beta }^{(s)}\right)\big  s \in {\mathcal {S}}\right \}\) be the fixedpoint algorithm defined by the iteration rules (10) and (11). Then, the partial loglikelihood function l(θ) increases on the sequence \(\left \{\boldsymbol {\theta }^{(s)}  s \in {\mathcal {S}}\right \}\) as l(θ^{(s+1)})≥l(θ^{(s)}) for any s=1,⋯,S−1.
The proof of Proposition 1 is given in Appendix B. The convergence of the algorithm {θ^{(s)}:s≥1} to the maximum partial likelihood estimator \(\hat {\boldsymbol {\theta }}\) is not directly connected to Proposition 1. We need to make some assumption about the model in order to guarantee convergence, similar to that of expectation–maximization (EM) algorithm [6] for the analytic conditions. For example, we assume that l(θ) is unimodal, with θ^{∗} being the only stationary point. We note that ∂G(θ,θ_{0})/∂θ is continuous for θ and θ_{0}. Thus, the sequence {θ^{(s)}} converges to the unique maximizer θ^{∗} [7]. In fact, the partial likelihood function l(θ) is expressed as a difference of two concave functions ψ_{1}(θ) and ψ_{2}(θ), where \(\psi _{1}(\boldsymbol {\theta }) = \sum _{i=1}^{n} \delta _{i} \log \sum _{k=1}^{K} \pi _{k} \exp \left (\boldsymbol {\beta }_{k}^{\top } \boldsymbol {x}_{i}\right)\) and \(\psi _{2}(\boldsymbol {\theta }) = \sum _{i=1}^{n} \delta _{i} \log \sum _{j\in R(t_{i})} \sum _{k=1}^{K} \pi _{k} \exp \left (\boldsymbol {\beta }_{k}^{\top } \boldsymbol {x}_{j}\right)\). Hence, the assumption for the unimodality is necessary for convergence.
Parsimonious Model
The quasilinear Cox model consists of a relatively large number of parameters. Moreover, each covariate has multiple roles in every linear predictor. These complexities compromise the stability of parameter estimation and the interpretability of the model overall. Accordingly, we need a more parsimonious expression as shown in Fig. 1. To obtain a more parsimonious model, we propose a variant of the proposed model and parameter estimation procedure: restricted quasilinear Cox model and cross L_{1}penalty method. The former idea relies on restricting the model structure in advance based on prior information. If there is prior knowledge that some factors strongly depends on the hazard of subpopulation and weakly on the hazards of other subpopulations, then we use the restricted quasilinear form to insert the knowledge into consideration. If this is not the case, then a penalty is needed to bring full model (3) closer to the parsimonious model (13). We achieve this by using cross L_{1} penalty introduced in “Quasilinear Cox model with cross L_{1} penalty” section.
Restricted quasilinear Cox model
The first strategy is to use the idea of disjoint sets of covariates, as proposed by [4]. In the strategy, we assume that we know the disjoint decomposition of x_{i} as x_{i(1)},⋯,x_{i(K)} with a fixed group size K, and that this is identical among individuals. We denote the size of x_{i(k)} as p_{k}, where \(\sum _{k=1}^{K} p_{k}=p\). We note that such decomposition is given by prior knowledge about the disjoint structure of x. The disjoint sets of covariates yield the restricted quasilinear predictor defined by
The mixture hazard model (1) is modified by replacing f_{Q} with \(f_{Q}^{Res}\) as
The maximum partial likelihood estimator is calculated by the fixedpoint algorithm proposed for the nonrestricted version, with easy modifications.
Quasilinear Cox model with cross L_{1} penalty
In the second strategy, we regularize the loglikelihood function by cross L_{1} penalty defined by
where λ_{c} is a regularization parameter and β_{kj} and \(\hat {\beta }_{kj}\) are the jth component of kth coefficient vector β_{k} and corresponding maximum partial likelihood estimator, respectively. When λ_{c} goes to infinity, the estimated parameter of the β_{k}’s would be crosssparse; if β_{ℓj}≠0, then β_{kj}=0 for any k≠ℓ, and the estimated model belongs to the class of restricted quasilinear models (13). We define the regularized loglikelihood function with cross L_{1} penalty as
where P_{2}(β)=nλβ^{⊤}β, known as the L_{2} penalty. We note that an additional regularization factor such as L_{1} penalty yields the elastic nettype regularization l^{pen∗}(θ)=l(θ)−νP_{1}(β)−(1−ν)P_{2}(β)−P_{c}(β). We refer to the maximizer \((\boldsymbol {\tilde {\pi }},\boldsymbol {\tilde {\beta }})\) of l^{pen}(θ) as CLASSO (Cross least absolute shrinkage and selection operator) estimator. The penalty (14) is a variant of adaptive L_{1} penalty originally introduced by [8]. The adaptive weights \(\hat {\beta }_{\ell j} \hat {\beta }_{m j}\) are needed to equip CLASSO estimator with rootn consistency (see Theorem 2 in “Asymptotic properties” section). We make the following proposition regarding the CLASSO estimator.
Proposition 2
Let \({\mathcal {R}}_{c}\) be a region in \(\mathbb {R}^{pK}\) defined as \({\mathcal {R}}_{c} = \left \{\boldsymbol {\beta } \in \mathbb {R}^{pK}  P_{c}(\boldsymbol {\beta })\leq c, P_{2}(\boldsymbol {\beta }) \leq c\right \}\). Then the region \({\mathcal {R}}_{c}\) is a convex set.
Due to the convexity of the sum of cross L_{1} and  L_{2} penalties, the CLASSO estimator also has consistency and asymptotic normality as shown in Theorem 2. Empirically, however, we do not need to regularize the partial loglikelihood by P_{2}(β) for stable estimation. Therefore, we consider only the CLASSO penalty in the empirical studies in the Simulations and Applications sections.
To get the CLASSO estimator, we use the full gradient algorithm [9] in the updating step for each β_{k} (9). For each sth iteration, we need to update β^{(s−1)} to get β^{(s)} by gradient algorithm. Let S_{s} be the stopping time in the sth iteration step. The initial value \(\boldsymbol {\beta }_{k}^{(s+1,0)}=\boldsymbol {\beta }_{k}^{(s,S_{s})}\) is repeatedly updated as
where
and
Here
for j=1,⋯,p, where sign(·) is a sign function defined by setting sign(z) equal to 1 for z>0, 0 for z=0 and −1 for z<0. In each step, t_{opt} provides the optimal solution of the gradient descent algorithm, and t_{edge} controls the direction of the gradient so as not to change the signs of parameters.
For all analysis in Simulations and Applications sections, the initial values of parameters were set to the equal probability weighting parameters π_{k}=1/K for k=1,2,⋯,K and the coefficient vectors of Cox’s proportional hazard models estimated from random Ksamples sets on the parameter estimation of the quasilinear Cox model. The tuning parameter λ_{c} is determined by any model selection criteria such as AIC or estimated test AUC from bootstrap estimates other than BIC. In this paper, we use Bayes Information Criteria (BIC) [10] because (i) it is one of the most popular information criteria, (ii) it is computationally easy to calculate compared with the estimator which relies on the bootstrap sampling and (iii) it has the consistency in model selection.
Results
Asymptotic properties
In this section, we provide an asymptotic property of the maximum likelihood estimator \(\boldsymbol {\hat {\theta }}\) and the coefficient part of CLASSO estimator \(\boldsymbol {\tilde {\beta }}\). In the following, our discussion is based on the stochastic process. For the ith individual, let \(\phantom {\dot {i}\!}N_{i}(t)=1_{\{t_{i}\leq t, \delta _{i}=1 \}}(t)\) be the rightcontinuous counting process, where each N_{i}(t) counts the number of observed events on (0,t], and let \(\phantom {\dot {i}\!}Y_{i}(t)=1_{\{t_{i} \geq t, c_{i} \geq t\}}(t)\) be the leftcontinuous atrisk process that shows the observation status at time t, where c_{i} and t_{i} are censoring and true survival times. Here 1_{E} is an indicator function defined by setting 1_{E}(t) equal to 1 for t∈E and equal to 0 for t∉E. Let us denote \({\mathcal F}_{t}=\sigma \left \{N_{i}(u),Y_{i}(u^{+});i=1,\cdots,n;0\leq u \leq t\right \}\) as the σalgebra generated by all N_{i}(u) and Y_{i}(u),0≤u≤t. Then, the corresponding intensity process of N_{i}(t) is defined by \(\Lambda _{i}(t)dt=P(dN_{i}(t)=1{\mathcal F}_{t_{}})\) and the proposed model is described as
where \(\boldsymbol {\theta }_{0}=(\boldsymbol {\pi }_{0}^{\top }, \boldsymbol {\beta }_{0}^{\top })^{\top }\). Then we get two theorems about the maximum partial likelihood estimator and CLASSO estimator.
Theorem 1
Let \(\hat {\boldsymbol {\theta }}=(\hat {\boldsymbol {\pi }}^{\top },\hat {\boldsymbol {\beta }}^{\top })^{\top }\) be the maximum partial likelihood estimator in the quasilinear Cox model. Assume that the regularity conditions AD (in Appendix C) hold. Then it follows that

1.
(Consistency) \(\hat {\boldsymbol {\theta }} \stackrel {p}{\longrightarrow } \boldsymbol {\theta _{0}}\)

2.
(Asymptotic Normality) \(\sqrt {n}(\boldsymbol {\hat {\theta }}\boldsymbol {\theta _{0}})\stackrel {d}{\longrightarrow } \mathrm {N}(\boldsymbol {0},I^{1}(\boldsymbol {\theta }_{0}))\)
Theorem 2
Let \(\tilde {\boldsymbol {\theta }}=(\tilde {\boldsymbol {\pi }}^{\top },\tilde {\boldsymbol {\beta }}^{\top })^{\top }\) be the CLASSO estimator in the quasilinear Cox model with cross L_{1} penalty. Assume that condition AD (in Appendix C) hold. If \(\sqrt {n}\lambda _{n} \rightarrow \infty \), then \(\tilde {\boldsymbol {\beta }} \boldsymbol {\beta }_{0} = O_{p} (n^{1/2})\).
Theorem 1 shows that the partial maximum likelihood estimator has the consistency and asymptotic normality. Theorem 2 shows that by choosing a proper sequence λ_{n}, there exists a \(\sqrt {n}\)consistent CLASSO estimator. The proofs are given in Appendix C and D.
Simulations
Settings
We conducted the simulations described in this section with two objectives in mind. The first was to ascertain whether the consistency of the maximum partiallikelihood estimator can be observed empirically. The second was to ascertain whether tuning parameter λ^{(c)} selection can be performed efficiently using the BIC.
In all simulation studies introduced here, the inverse function method was used for data generation. First, the covariates x were generated from the multivariate normal distribution, the apparent censored time T_{1} was generated from the exponential distribution with mean 1000, and the random variable U was generated from the uniform distribution on [0,1]. Let the baseline survival time be followed the exponential distribution with mean 100. Then, the true survival time corresponding to the log relative risk function \(f_{Q}^{Res}(\boldsymbol {x}; \boldsymbol {\theta })\) was given as \(T_{2} = (\log (U)/100)\exp \left (f_{Q}^{Res}(\boldsymbol {x}; \boldsymbol {\theta })\right)\). Based on T_{1} and T_{2}, let T= min(T_{1},T_{2}) be the observational survival time and δ=I(T_{1}<T_{2}) be the censored indicator before the event time.
Sample size was set to N=400 in all scenarios, and it was assumed that the true number of the groups were known for all settings. The tuning parameter λ of cross L_{1} penalty was determined by BIC. We note that the maximum candidate value of the tuning parameter λ was controlled sufficiently to achieve the restricted quasilinear form for every setting. We had the following options:

Independent Setting (IS) or Dependent Setting (DS)
Each covariate vector x_{i} was sampled from the standard normal distribution N(0_{p},Σ), where \(\boldsymbol {a}_{p} = (a,a,\cdots,a) \in \mathbb {R}^{p}\). In IS, Σ=2^{2}I_{p}, where I_{p} is an identity matrix of size p. In DS, \(\Sigma = (s_{ij}) \in \mathbb {R}^{p \times p}\), where s_{ij}=2^{2}×0.7^{i−j}.

Group Size and Coefficients
A number of combined linear predictors, namely group size, was set to K=2 or K=3. A number of covariates, namely dimension size, was set to p=2,p=3, or p=5. A coefficients vector was set to crosssparse (Scenario 1,3,5) or overlapped (Scenario 2,4,6) according to the following settings.

1.
K=2,p=2,π^{⊤}=(0.3,0.7) and \(\boldsymbol {\beta }^{\top }=(\boldsymbol {\beta }_{1}^{\top },\boldsymbol {\beta }_{2}^{\top })=((1,0),(0,1.5))\)

2.
K=2,p=2,π^{⊤}=(0.3,0.7) and \(\boldsymbol {\beta }^{\top }=(\boldsymbol {\beta }_{1}^{\top },\boldsymbol {\beta }_{2}^{\top })=((1,0.5),(0,1.5))\)

3.
K=2,p=5,π^{⊤}=(0.3,0.7) and \(\boldsymbol {\beta }^{\top }=(\boldsymbol {\beta }_{1}^{\top },\boldsymbol {\beta }_{2}^{\top })=((1,1,1,0,0),(0,0,0,1.5,1.5))\)

4.
K=2,p=5,π^{⊤}=(0.3,0.7) and \(\boldsymbol {\beta }^{\top }=(\boldsymbol {\beta }_{1}^{\top },\boldsymbol {\beta }_{2}^{\top })=((1,1,1,0,0.5),(0,0.25,0.5,1.5,1.5))\)

5.
K=3,p=3,π^{⊤}=(0.2,0.3,0.5) and \(\boldsymbol {\beta }^{\top }=(\boldsymbol {\beta }_{1}^{\top },\boldsymbol {\beta }_{2}^{\top },\boldsymbol {\beta }_{3}^{\top })=((1,0,0),(0,1.5,0),(0,0,1))\)

6.
K=3,p=3,π^{⊤}=(0.2,0.3,0.5) and \(\boldsymbol {\beta }^{\top }=(\boldsymbol {\beta }_{1}^{\top },\boldsymbol {\beta }_{2}^{\top },\boldsymbol {\beta }_{3}^{\top })=((1,0.5,0),(0,1.5,0.5),(0.5,0,1))\)

1.
Simulation results
For each scenario, the mean values of the estimated coefficients and the meansquared errors (MSEs) are shown in Table 1. Throughout the whole scenario, all coefficients were estimated with only a little bias. In particular, in the disjoint setting (Scenario 1,3,5 for IS and DS) we could almost certainly distinguish the zero coefficients from nonzero coefficients, indicating that BIC and cross L_{1} penalty work well in these scenarios. Dependent situations did not have a strong effect on parameter estimation, although a slightly larger MSE was observed in comparison with the independent situations. Especially in Scenario 6 for DS, we had moderate biases in estimation of the coefficients of the first linear predictor (β_{11} and β_{12}). This is because the other two groups had enough information for fitting the model to the data. In fact, the estimated risk scores between the estimated and true parameters were almost equal. This shows that the loss of model identifiability sometimes yields bias in parameter estimation for the overlapped situation; however, the predictive performance of the estimated score is sufficient.
Application
In this section, we show the results of application studies for the breast cancer dataset in order to evaluate the performance of the quasilinear Cox model. To evaluate the predictive ability of the learned model, we calculated the Area under the curve (AUC) of timedependent ROC [11] using test dataset. The predictive performance was compared between Cox’s proportional hazard model and the quasilinear Cox model. A dataset from [12] was used as the training data, and a dataset from [13] as the test data. These datasets include expression levels of 70 genes and survival time with some censors. Except for samples with missing values, there were 75 samples in the training dataset and 220 samples in the test dataset. In this application, we extracted the top 10 relevant genes to evaluate the model performance. Such marker preselection has been performed in many studies [14].
Since we had no prior knowledge for these genes, we applied the proposed model with cross L_{1} penalty introduced in “Quasilinear Cox model with cross L_{1} penalty” section. The number of groups K and the regularization parameter of the cross L_{1} penalty λ_{c} were determined using BIC from K∈{2,3,4,5} and λ_{c}∈{0,0.1,0.2,⋯,5.0}. All gene expressions are standardized to have mean zero and variance one among the training sample to compare the estimated coefficients. The same transformation was applied for the test sample.
As a result, a group size K=2 was selected. The time series of test AUCs and the bootstrap 95% intervals for each year are shown in Figs. 2 and 3. For every time point, the test AUC of the quasilinear relative risk model was larger than that of Cox’s proportional hazard model. The estimated coefficients for linear and quasilinear Cox’s proportional hazard models are shown in Fig. 4. While the overall trend was not much different between linear and quasilinear models, the cross L_{1} penalty gave contrast to the fitted quasilinear model. Five out of ten genes have zero coefficient in the hazard function in the first or second group. As a representative of them, we focus on the set of genes with the first (“gene6") and second largest (“gene4") coefficients in the first group. Those are called NUSAP1 (Nucleolar And Spindle Associated Protein 1) and TSPYL5 (TestisSpecific YEncodedLike Protein 5), respectively. TSPYL5 is a wellknown prognostic factor of poor outcome in breast cancer patients. Higher expression of TSPYL5 suppresses p53 protein levels causing damage to mammary cells. It may be important that those two genes have zero coefficient in the second group. Interestingly, it was reported that the TSPYL5 and NUSAP1 are biologically correlated with the same hallmarks of cancer: limitless replication potential [15]. While further investigation is needed, the proposed model thus suggests there are roughly two subpopulations which have different hazards.
Discussion
We developed a mixture hazard model, an extension of Cox’s proportional hazards model, via a quasilinear predictor. Theoretical discussion revealed that the maximum partiallikelihood estimator has properties of consistency and asymptotic normality. Furthermore, we showed combining the cross L_{1} penalty makes the estimated model stable and interpretable. Empirical simulations and applications confirm these superior properties, and BIC have been shown to work well as a measure for selecting the number of groups and the tuning parameter of cross L_{1} penalty.
We will discuss the relationship between our study and previous work. First, the quasilinear predictor was proposed in [16] to extend the logistic regression model for capturing heterogeneous structure in biomarkers. The quasilinear logistic model was motivated by a Bayes riskconsistent predictor in binary classification between the mixture normal distribution and single normal distributions. The quasilinear predictor enables us to model intrinsic heterogeneity using some linear predictors, and can be used as an extension from the standard to the heterogeneous setting of several models that rely on the linear predictor. Second, as introduced in the Introduction, several studies have proposed a mixture hazard model [1, 2], but these were limited to parametric ones. Our proposed method is one extension that is possible without assuming a specific distribution. Also, several mixture distribution model proposals have been developed in past studies [17–19]. We note that the concepts of the mixture hazards model and mixture density model are completely distinct. In fact, while the mixture density model gives the simple weighted average of each survival function as the whole survival function \(\tilde {S}(t)=\sum _{k=1}^{K} \tilde {\pi }_{k} \tilde {S}_{k}(t)\), the mixture hazard model gives the weighted average of logsurvival function of each survival function as the whole logsurvival function: \(\log S(t) = \sum _{k=1}^{K} \pi _{k} \log S_{k}(t)\). In this context, the survival function is understood as the geometric mean of each survival function, \(S(t)=\prod _{k=1}^{K} S_{k}^{\pi _{k}}\). We note that the density function is analogue to the probability while the hazards function is also, i.e. instantaneous rate of mortality. In this sense, both models can be regarded as the special case of latent variable models. It is not yet well understood what such a formal difference yields for the modeling in survival analysis, and it will be very important for our future work.
In this paper, we restricted the baseline hazard functions to be identical. There are three reasons for the restriction. First, it stabilizes estimation of model parameters. In addition to the high computational cost of the mixture model, it will be difficult to estimate the separate baseline hazard functions. Second, it improves the interpretability of the model. The assumption of different baseline hazards functions may seem a somewhat strange idea. This is because baseline hazard function refers to a hazard function when all observed covariate values are considered to be a reference for all patients. When a different baseline hazard function is required for each group, it means that there is heterogeneity that cannot be observed with the dataset in question. Instead, the quasilinear Cox model enables us to model the intrinsic but observable heterogeneity. Third, regardless of such restrictions, the proposed model empirically had better predictive ability than the standard Cox’s proportional hazards model. We thus achieved simultaneous modeling of groupwise proportional hazards models. On the other hand, although a stratified Cox model focuses on the heterogeneity for hazards in the population with different baseline hazards, it assumes the same relative risk function among groups. These two models thus have dualistic roles to capture hazard heterogeneity in the population. Finally, we note that in theory we should be able to loose these restriction for baseline hazard function in the proposed model, based on similar ideas for density mixture models proposed by [19].
Conclusions
In this paper, we focused on hazards mixture model. The quasilinear Cox proportional hazards model was naturally derived by the quasilinear predictor. It is essential to capture the intrinsic heterogeneity of patients for getting more stable and effective risk score. The proposed hazard model can capture such heterogeneity and achieve better performance than the ordinary linear Cox proportional hazards model.
Availability of data and materials
All data used to perform the application described in this paper are freely available. The data of van’t Veer et al. is available on the Gene Expression Omnibus data base [https://www.ncbi.nlm.nih.gov/geo/], series GSE2990. The data of Buyse et al. is available on the European Bioinformatics Institute ArrayExpress database [http://www.ebi.ac.uk/arrayexpress/], accession number ETABM77.
Abbreviations
 MM:

Minorizationmaximization
 AUC:

Area under curve
 BID:

Bayes information criteria
 CLASSO:

Cross least absolute shrinkage and selection operator
 DS:

Dependent setting
 EM:

Expectation–maximization
 IS:

Independent setting
 MSE:

Meansquared error
References
LouzadaNeto F, Mazucheli J, Achcar JA. Mixture hazard models for lifetime data. Biom J. 2002; 44:3–14.
Hilton RP, Zheng Y, Serban N. Modeling heterogeneity in healthcare utilization using massive medical claims data. J Am Stat Assoc. 2018; 113(521):111–21.
Fang HB, Li G, Sun J. Maximum likelihood estimation in a semiparametric logistic/proportionalhazards mixture model. Scand J Stat. 2005; 32(1):59–75.
Omae K, Komori O, Eguchi S. Quasilinear score for capturing heterogeneous structure in biomarkers. BMC Bioinformatics. 2017; 18(1). https://doi.org/10.1186/s128590171721x.
Hunter DR, Lange K. A tutorial on mm algorithms. Am Stat. 2004; 58(1):30–7.
Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the em algorithm. J R Stat Soc Ser B. 1977; 39:1–38.
McLachlan GJ, Krishnan T. The EM Algorithm and Extensions, 2nd edn. In: Wiley series in probability and statistics. New Jersey: Wiley: 2008.
Zou H. The adaptive lasso and its oracle properties. J Am Stat Assoc. 2006; 101(476):1418–29.
Goeman JJ. L_{1} penalized estimation in the Cox proportional hazards model. Biom J. 2010; 52:70–84.
Schwarz G. Estimating the dimension of a model. Ann Stat. 1978; 6:461–4.
Heagerty PJ, Lumley T, Pepe MS. Timedependent roc curves for censored survival data and a diagnostic marker. Biometrics. 2000; 56(2):337–44.
van’t Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AAM, Mao M, et al.Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002; 415:530–6.
Buyse M, Loi S, van’t Veer L, Viale G, Delorenzi M, Glas A, et al.Validation and clinical utility of a 70gene prognostic signature for women with nodenegative breast cancer. J Natl Cancer Inst. 2006; 98:1183–92.
Dettling M, Bühlman P. Boosting for tumor classification with gene expression data. Bioinformatics. 2003; 19(9):1061–9. https://doi.org/10.1093/bioinformatics/btf867.
Tian S, P R, van’t Veer LJ, Bernards R, De Snoo F, Glas AM. Biological functions of the genes in the mammaprint breast cancer profile reflect the hallmarks of cancer. Biomark Insights. 2010; 5:6184.
Omae K, Komori O, Eguchi S. Reproducible detection of diseaseassociated markers from gene expression data. BMC Med Genomics. 2016; 9(1). https://doi.org/10.1186/s1292001602145.
Elmahdy EE, Aboutahoun AW. A new approach for parameter estimation of finite Weibull mixture distributions for reliability modeling. Appl Math Model. 2013; 37:1800–10.
Zhang Q, Hua C, Xu G. A mixture Weibull proportional hazard model for mechanical system failure prediction utilising lifetime and monitoring data. Mech Syst Signal Process. 2014; 43:103–12.
You N, He S, Wang X, Zhu J, Zhang H. Subtype classification and heterogeneous prognosis model construction in precision medicine. Biometrics. 2018; 74(3):814–22. https://doi.org/10.1111/biom.12843.
Acknowledgements
The authors are thankful to the reviewers for their valuable suggestions to improve the article further.
Funding
This work is partly supported by the Project Promoting Clinical Trials for Development of New Drugs (18lk020106Xt0003) from the Japan Agency for Medical Research and Development, AMED, and is supported by JSPS KAKENHI Grant Number 18H03211. The funding bodies had no role in the design and the conclusions of the presented study.
Author information
Authors and Affiliations
Contributions
KO and SE designed the methods of this article. KO carried out the simulation study and data analysis, and wrote the paper. Both authors have read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Additional file 1
Technical Derivations. In this file, we give proofs of Proposition 1, Proposition 2, Theorem 1 and Theorem 2.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Omae, K., Eguchi, S. Quasilinear Cox proportional hazards model with cross L_{1} penalty. BMC Med Res Methodol 20, 182 (2020). https://doi.org/10.1186/s12874020010632
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12874020010632