AFT model
Let us consider a censored random sample (yi,δi,xi), i=1,…,n, where yi=log(ti) is the log-lifetime if the censoring indicator δi=1 and the log-censoring-time if δi=0, and xi=(1,xi1,…,xir,…,xip)T is a (p+1)-dimensional vector of covariates. The location-scale family model describes the relationship between the survivor function S(yi|xi) and the covariates xi as
$$\begin{array}{*{20}l} S(y_{i}|\boldsymbol{x}_{i})=S_{0}\Big(\frac{y_{i}-u(\boldsymbol{x}_{i};\boldsymbol{\beta})}{b}\Big), \;\; -\infty< y_{i}<+\infty \end{array} $$
where S0(z) is the survivor function of a standardized random variable, u(xi;β)=βTxi denotes the location parameter, b the scale parameter, and β=(β0,β1,…,βr,…,βp)T is the vector of regression coefficients [16]. The model can also be written in the following form, describing the relationship between the covariates and the log-survival time:
$$\begin{array}{*{20}l} y_{i}=u(\boldsymbol{x}_{i};\boldsymbol{\beta})+bZ_{i}, \end{array} $$
(1)
where Zi is a random variable with a standard distribution on (−∞,∞). The covariates effectively alter the time scale, additively on the log scale (or multiplicatively on the original time scale), and hence the model is referred to as the accelerated failure time (AFT) model. The above equation represents a family of models in which Z belongs to a standard location-scale family of distributions (e.g., extreme value, logistic, normal), while the survival time T belongs to the corresponding log-location-scale family (e.g., Weibull, log-logistic, log-normal), respectively.
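As an illustration, the data-generating mechanism of Eq. (1) can be sketched in Python for the Weibull case, where Z follows the standard extreme-value distribution so that T = exp(y) is Weibull. The sample size, coefficients, and censoring point below are illustrative choices, not values from the study:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative choices (not values from the study).
n = 200
beta = np.array([0.5, -1.0])   # intercept and one slope
b = 0.8                        # scale parameter

# Design matrix with an intercept column and one covariate.
X = np.column_stack([np.ones(n), rng.normal(size=n)])

# If U ~ Exp(1), then Z = log(U) has S0(z) = exp(-e^z), the standard
# extreme-value distribution, so T = exp(y) is Weibull.
Z = np.log(rng.exponential(size=n))
log_t = X @ beta + b * Z       # Eq. (1): y_i = u(x_i; beta) + b Z_i

# Administrative censoring at a fixed log-time c.
c = 1.5
y = np.minimum(log_t, c)            # observed log-time y_i
delta = (log_t <= c).astype(int)    # delta_i = 1 event, 0 censored
```

The observed data (y, delta, X) then have exactly the structure assumed at the start of this section.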
Maximum likelihood estimation for AFT model
The maximum likelihood method is commonly used to estimate the parameters of the model given in Eq. (1). Setting zi=(yi−ui)/b with p.d.f. \(f_{0}(z)=-S^{'}_{0}(z)\), ui=u(xi;β) and \(m=\sum _{i=1}^{n}\delta _{i}\), the log-likelihood function for the location-scale family model (1) can be written as:
$$ \begin{aligned} \ell(\boldsymbol{\beta},b)=-m\log b+\sum_{i=1}^{n}\left[\delta_{i}\log f_{0}(z_{i})+(1-\delta_{i})\log S_{0}(z_{i})\right]. \end{aligned} $$
(2)
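For concreteness, Eq. (2) can be coded directly for the log-normal AFT model, where f0 and S0 are the standard normal density and survivor functions; the function name `aft_loglik` and the data layout are illustrative:

```python
import numpy as np
from scipy.stats import norm

def aft_loglik(theta, y, delta, X):
    """Log-likelihood of Eq. (2) for log-normal AFT errors, where
    f0 is the standard normal p.d.f. and S0 its survivor function."""
    beta, b = theta[:-1], theta[-1]
    z = (y - X @ beta) / b                  # standardized residuals z_i
    m = delta.sum()                         # number of observed events
    return (-m * np.log(b)
            + np.sum(delta * norm.logpdf(z)           # event terms
                     + (1 - delta) * norm.logsf(z)))  # censored terms
```

A censored observation contributes log S0(zi), computed here with the numerically stable `norm.logsf` rather than `np.log(1 - norm.cdf(z))`.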
Let θ=(βT,b)T be the (p+2)-dimensional vector of parameters. The maximum likelihood estimate of θ is the solution of the estimating equations U(β,b)=0 simultaneously, where U(β,b) is the score function defined as
$$\begin{aligned} U(\boldsymbol{\beta},b)=&\partial \ell(\boldsymbol{\theta})/\partial \boldsymbol{\theta}\\=&[U_{0}(\boldsymbol{\beta},b), U_{1}(\boldsymbol{\beta},b), \dots, U_{r}(\boldsymbol{\beta},b),\ldots, U_{p}(\boldsymbol{\beta},b), U_{b}(\boldsymbol{\beta},b)]^{T}. \end{aligned} $$
If X is an n×(p+1) matrix with rows \(\boldsymbol {x}_{i}^{T}=(1, x_{i1}, \ldots, x_{ir}, \ldots, x_{ip})\), then \(\partial z_{i}/\partial \beta _{r}=-x_{ir}b^{-1}\) and \(\partial z_{i}/\partial b=-z_{i}b^{-1}\). The r-th and last components of the score function are given by
$$ \begin{aligned} U_{r}(\boldsymbol{\beta, b})&=-\frac{1}{b}\sum_{i=1}^{n}\Big[\delta_{i} \frac{\partial \log f_{0}(z_{i})}{\partial z_{i}}+(1-\delta_{i})\frac{\partial \log S_{0}(z_{i})}{\partial z_{i}}\Big]x_{ir}, \end{aligned} $$
(3)
$$ \begin{aligned} U_{b}(\boldsymbol{\beta, b})&=-\frac{m}{b}-\frac{1}{b}\sum_{i=1}^{n}\left[\delta_{i} \frac{\partial \log f_{0}(z_{i})}{\partial z_{i}}+(1-\delta_{i})\frac{\partial \log S_{0}(z_{i})}{\partial z_{i}}\right]z_{i}. \end{aligned} $$
(4)
The (p+2)×(p+2) observed information matrix is given by:
$$\begin{aligned} \boldsymbol{I}(\boldsymbol{\beta},b) =& \left(\begin{array}{cc} -\partial^{2}\ell/\partial\boldsymbol{\beta}\partial\boldsymbol{\beta}^{T} & -\partial^{2}\ell/\partial\boldsymbol{\beta}\partial b \\ -\partial^{2}\ell/\partial b\partial\boldsymbol{\beta} & -\partial^{2}\ell/\partial b^{2} \end{array}\right) \\ =& \frac{1}{b^{2}}\left(\begin{array}{cc} -\sum_{i=1}^{n}A_{i}\boldsymbol{x}_{i}\boldsymbol{x}_{i}^{T} & -\sum_{i=1}^{n}(A_{i}z_{i} + B_{i})\boldsymbol{x}_{i} \\ -\sum_{i=1}^{n}(A_{i}z_{i} + B_{i})\boldsymbol{x}_{i}^{T} & -\left[m + \sum_{i=1}^{n}(A_{i}z_{i}^{2} +2B_{i} z_{i})\right] \end{array}\right), \end{aligned} $$
where
$$ \begin{aligned} A_{i} &= \delta_{i} \frac{\partial^{2} \log f_{0}(z_{i})}{\partial z_{i}^{2}}+(1-\delta_{i})\frac{\partial^{2} \log S_{0}(z_{i})}{\partial z_{i}^{2}},\\ B_{i} &= \delta_{i} \frac{\partial \log f_{0}(z_{i})}{\partial z_{i}}+(1-\delta_{i})\frac{\partial \log S_{0}(z_{i})}{\partial z_{i}}. \end{aligned} $$
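As a worked instance, for extreme-value (Weibull) errors one has log f0(z) = z − e^z and log S0(z) = −e^z, so that Ai = −e^{zi} and Bi = δi − e^{zi}. A sketch of the observed information matrix under this choice (function name illustrative) is:

```python
import numpy as np

def observed_information(beta, b, y, delta, X):
    """Observed information I(beta, b) of the AFT model for
    extreme-value (Weibull) errors, where log f0(z) = z - e^z and
    log S0(z) = -e^z, giving A_i = -e^{z_i} and B_i = delta_i - e^{z_i}."""
    z = (y - X @ beta) / b
    w = np.exp(z)
    A = -w                        # A_i: second-derivative terms
    B = delta - w                 # B_i: first-derivative terms
    m = delta.sum()
    Ibb = -(X.T * A) @ X                        # -sum A_i x_i x_i^T
    Ibs = -X.T @ (A * z + B)                    # -sum (A_i z_i + B_i) x_i
    Iss = -(m + np.sum(A * z**2 + 2 * B * z))   # -[m + sum(A z^2 + 2 B z)]
    top = np.hstack([Ibb, Ibs[:, None]])
    bottom = np.r_[Ibs, Iss][None, :]
    return np.vstack([top, bottom]) / b**2
```

Since Ai < 0 for all i, the β-block of this matrix is positive definite whenever X has full column rank.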
Firth’s penalized likelihood method for AFT model
In order to remove the first-order bias of order O(n−1) in the MLE of the parameter vector, say θ, of generalized linear models, [10] introduced a penalized log-likelihood function obtained by adding the penalty term 1/2 log|I(θ)| to the original log-likelihood function ℓ(θ). Without loss of generality, Firth’s procedure can be applied directly to the likelihood function of the AFT models given in Eq. (2). For the AFT model (Eq. 1) with a (p+2)-dimensional parameter vector θ=(βT,b)T, the penalized log-likelihood function with Firth’s penalty term is given by
$$\begin{array}{*{20}l} \ell^{*}(\boldsymbol{\beta},b)=\ell(\boldsymbol{\beta},b)+\frac{1}{2}\log|I (\boldsymbol{\beta},b)|, \end{array} $$
(5)
where |I(β,b)|1/2 is the Jeffreys invariant prior, whose influence is asymptotically negligible. Adding the penalty term to the log-likelihood in Eq. (2) yields an explicit form of the penalized likelihood function. Following Firth’s principle, the corresponding modified score functions for the r-th regression coefficient βr and the scale parameter b are obtained by adding the derivative of the penalty term to Eqs. (3) and (4), respectively:
$$\begin{array}{*{20}l} {U}_{r}^{*}(\boldsymbol{\beta}, b)&={U}_{r}(\boldsymbol{\beta}, b)+\frac{1}{2}\text{tr}\Big[\boldsymbol{I}(\boldsymbol{\beta},b)^{-1}\Big(\frac{\partial \boldsymbol{I}(\boldsymbol{\beta},b)}{\partial \beta_{r}}\Big)\Big]\\ &={U}_{r}(\boldsymbol{\beta}, b)+\frac{\partial} {\partial \beta_{r}} \Big[\frac{1}{2}\log |\boldsymbol{I}(\boldsymbol{\beta},b)| \Big], \end{array} $$
(6)
and
$$\begin{array}{*{20}l} {U}^{*}_{b}(\boldsymbol{\beta}, b)&={U}_{b} (\boldsymbol{\beta},b)+\frac{1}{2}\text{tr}\Big[\boldsymbol{I}(\boldsymbol{\beta},b)^{-1}\Big(\frac{\partial \boldsymbol{I}(\boldsymbol{\beta},b)}{\partial b}\Big)\Big]\\ &={U}_{b}(\boldsymbol{\beta}, b)+\frac{\partial} {\partial b} \Big[\frac{1}{2}\log |\boldsymbol{I}(\boldsymbol{\beta},b)| \Big]. \end{array} $$
(7)
Expanding Eqs. (6–7) yields an explicit analytical form of the modified score equations, which ensure finite estimates of both β and b when solved by the Newton-Raphson method. In numerical optimization, however, Newton-Raphson can be computationally tedious and inefficient for large, complex non-linear problems if the Hessian (the second-order derivative of the objective function) is recomputed directly at every iteration. Moreover, Newton-Raphson may fail if the Hessian is singular at any iteration. Therefore, rather than solving the modified score equations by the Newton-Raphson algorithm, we directly optimized the penalized likelihood function given in Eq. (5) using a quasi-Newton method, the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm, which also ensures finite estimates of both β and b. It is computationally cheaper and more efficient than Newton’s method because it approximates the Hessian matrix from the gradient (the first-order derivative of the objective function) at each step rather than computing it directly [2]. Thus the penalized likelihood estimates θ∗=(β∗,b∗) can be obtained from the optimization as follows:
$$\begin{array}{*{20}l} \boldsymbol{\theta}^{*} = \operatorname*{arg\,max}_{\boldsymbol{\beta},b} \ell^{*}(\boldsymbol{\beta},b). \end{array} $$
The corresponding standard error of the estimator can be obtained from the approximated Hessian matrix.
Intercept and scale parameter correction
Firth’s penalized likelihood method is known to reduce the first-order bias of the parameter estimates by shrinking them towards their true values. However, the penalty term may cause excessive shrinkage of the intercept and scale parameter, which in turn may bias the estimated survival probabilities given by
$$\begin{array}{*{20}l} S(t)=S_{0}\left(\frac{\log t - u(\boldsymbol{x} ; \boldsymbol{\beta}^{*})}{b^{*}}\right)=S_{0}\left(\frac{\log t - \boldsymbol{\beta}^{*T} \boldsymbol{x}}{b^{*}}\right). \end{array} $$
Therefore, a post-hoc adjustment of the intercept and scale parameter estimates in the AFT model is performed by keeping these parameters out of the penalization, following the procedure described in a recent study [22] for correcting the intercept term in Firth’s logistic regression. The intercept and scale parameter correction in AFT models proceeds as follows:
(i) Estimate the coefficients \(\boldsymbol {\hat {\theta }}_{F}=(\hat {\beta }_{F,0},\hat {\beta }_{F,1},\dots,\hat {\beta }_{F,p}, \hat {b}_{F})\) by Firth’s penalization.

(ii) Calculate the linear predictors \(\hat {\eta }_{i}=\hat {\beta }_{F,1}x_{i1} + \hat {\beta }_{F,2}x_{i2} + \dots + \hat {\beta }_{F,p}x_{ip}\), omitting the intercept.

(iii) Determine the ML estimate \(\hat {\beta }_{0}\) of the intercept and \(\hat {b}\) of the scale parameter for the AFT model \(Y=\beta _{0}+ \hat {\eta }_{i} + bZ,\) containing only the single predictor \(\hat {\eta }_{i}\) with regression coefficient fixed at one. This can be achieved by including an offset in a standard procedure or by direct maximum likelihood estimation.

(iv) The resulting estimate \(\boldsymbol {\hat {\theta }}_{C} = (\hat {\beta }_{0},\hat {\beta }_{F,1},\dots, \hat {\beta }_{F,p}, \hat {b})\) is taken as the corrected Firth estimate, with the intercept and scale parameter in \(\boldsymbol {\hat {\theta }}_{F}\) replaced by \(\hat {\beta }_{0}\) and \(\hat {b}\), respectively.
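The four steps above can be sketched as follows for extreme-value (Weibull) errors; the function name is illustrative, and the offset fit of step (iii) is done here by direct maximum likelihood over the intercept and log-scale only:

```python
import numpy as np
from scipy.optimize import minimize

def correct_intercept_scale(theta_F, y, delta, X):
    """Steps (i)-(iv): keep the Firth slopes, then re-estimate the
    intercept and scale by plain ML with the linear predictor entered
    as a fixed offset.  Extreme-value (Weibull) errors are assumed."""
    beta_F, b_F = theta_F[:-1], theta_F[-1]
    eta = X[:, 1:] @ beta_F[1:]          # step (ii): omit the intercept

    def neg_loglik(par):                 # step (iii): ML in (beta0, log b)
        beta0, b = par[0], np.exp(par[1])
        z = (y - beta0 - eta) / b
        w = np.exp(z)
        return -(-delta.sum() * np.log(b)
                 + np.sum(delta * (z - w) - (1 - delta) * w))

    res = minimize(neg_loglik, np.array([beta_F[0], np.log(b_F)]),
                   method="BFGS")
    beta0_hat, b_hat = res.x[0], np.exp(res.x[1])
    # Step (iv): replace intercept and scale in the Firth estimate.
    return np.r_[beta0_hat, beta_F[1:], b_hat]
```

Only two parameters are re-estimated, so this correction adds a negligible computational cost while leaving the penalized slope estimates untouched.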
This post-hoc adjustment is required only if the model is to be used for survival prediction, which is often a primary objective in clinical prediction research. In the following sub-section, the performance of the Firth estimates of the model parameters is investigated in an extensive simulation study, and the results are compared with those obtained by the standard maximum likelihood technique.