AFT model
Let us consider a censored random sample containing data (y_{i},δ_{i},x_{i}), i=1,…,n, where y_{i}= log(t_{i}) is a log-lifetime or log-censoring time according as the censoring indicator δ_{i}=1 or δ_{i}=0, respectively, and x_{i}=(1,x_{i1},…,x_{ir},…,x_{ip})^{T} is a (p+1)-dimensional vector of covariates. The location-scale family model describes the relationship between the survivor function S(y_{i}|x_{i}) and the covariates x_{i} as
$$\begin{array}{*{20}l} S(y_{i}\mid\boldsymbol{x}_{i})=S_{0}\Big(\frac{y_{i}-u(\boldsymbol{x}_{i};\boldsymbol{\beta})}{b}\Big), \;\; -\infty< y_{i}<+\infty, \end{array} $$
where S_{0}(z) is the survivor function of a standardized random variable, u(x_{i};β)=β^{T}x_{i} denotes the location parameter, b the scale parameter, and β=(β_{0},β_{1},…,β_{r},…,β_{p})^{T} is the vector of regression coefficients [16]. The model can also be written in the following form to describe the relationship between the covariates and the log-survival time:
$$\begin{array}{*{20}l} y_{i}=u(\boldsymbol{x}_{i};\boldsymbol{\beta})+bZ_{i}, \end{array} $$
(1)
where Z_{i} is a random variable with a standard distribution on (−∞,∞). The covariates act additively on the log-time scale (equivalently, multiplicatively on the original time scale), and hence the model is referred to as the accelerated failure time (AFT) model. The above equation represents a family of models in which Z belongs to a standard location-scale family of distributions (e.g., extreme value, logistic, normal), while the survival time T belongs to the corresponding log-location-scale family of distributions (i.e., Weibull, log-logistic, log-normal), respectively.
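The representation in Eq. (1) can be made concrete with a short simulation sketch (ours, not from the paper): taking Z standard normal makes T=exp(y) log-normal, and the coefficient values below are purely illustrative.

```python
import numpy as np

# Minimal sketch of Eq. (1): y_i = u(x_i; beta) + b * Z_i with Z_i standard
# normal, so that T_i = exp(y_i) follows a log-normal distribution.
# beta and b below are illustrative values, not estimates from the paper.
rng = np.random.default_rng(0)
n = 500
beta = np.array([1.0, 0.5])                              # (beta_0, beta_1)
b = 0.8                                                  # scale parameter
X = np.column_stack([np.ones(n), rng.normal(size=n)])    # covariates with intercept

Z = rng.standard_normal(n)          # standard distribution on (-inf, inf)
y = X @ beta + b * Z                # log-survival times: additive covariate effect
t = np.exp(y)                       # original time scale: multiplicative effect
```

A one-unit increase in the covariate shifts y by β_{1}, i.e., multiplies the time scale by exp(β_{1}), which is the "acceleration" the model is named for.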
Maximum likelihood estimation for AFT model
The maximum likelihood method is commonly used to estimate the parameters of the model given in Eq. (1). Setting z_{i}=(y_{i}−u_{i})/b with p.d.f. \(f_{0}(z)=-S^{'}_{0}(z)\), u_{i}=u(x_{i};β), and \(m=\sum _{i=1}^{n}\delta _{i}\), the log-likelihood function for the location-scale family model (1) can be written as:
$$ \begin{aligned} \ell(\boldsymbol{\beta},b)=-m\log b+\sum_{i=1}^{n}\left[\delta_{i}\log f_{0}(z_{i})+(1-\delta_{i})\log S_{0}(z_{i})\right]. \end{aligned} $$
(2)
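As an illustration of Eq. (2), the following sketch evaluates the censored log-likelihood for the normal case (log-normal AFT), where f_{0} and S_{0} are the standard normal p.d.f. and survivor function; the function name and signature are ours, not from the paper.

```python
import numpy as np
from scipy.stats import norm

# Eq. (2) for the normal case: f0 = standard normal p.d.f., S0 = survivor
# function. theta packs (beta, b); delta marks uncensored observations.
def aft_loglik(theta, y, X, delta):
    beta, b = theta[:-1], theta[-1]
    z = (y - X @ beta) / b                  # z_i = (y_i - u_i) / b
    m = delta.sum()                         # m = number of uncensored observations
    return (-m * np.log(b)
            + np.sum(delta * norm.logpdf(z)            # events:   log f0(z_i)
                     + (1 - delta) * norm.logsf(z)))   # censored: log S0(z_i)
```

For a fully uncensored sample this reduces to the usual normal log-likelihood of the standardized residuals minus m log b.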
Let θ=(β^{T},b)^{T} be the (p+2)dimensional vector of parameters. The maximum likelihood estimate of θ is the solution of the estimating equations U(β,b)=0 simultaneously, where U(β,b) is the score function defined as
$$\begin{aligned} U(\boldsymbol{\beta},b)=&\partial \ell(\boldsymbol{\theta})/\partial \boldsymbol{\theta}\\=&[U_{0}(\boldsymbol{\beta},b), U_{1}(\boldsymbol{\beta},b), \ldots, U_{r}(\boldsymbol{\beta},b),\ldots, U_{p}(\boldsymbol{\beta},b), U_{b}(\boldsymbol{\beta},b)]^{T}. \end{aligned} $$
If X is an n×(p+1) matrix having rows \(\boldsymbol {x}_{i}^{T}=(1, x_{i1}, \ldots, x_{ir}, \ldots, x_{ip})\), then ∂z_{i}/∂β_{r}=−x_{ir}b^{−1} and ∂z_{i}/∂b=−z_{i}b^{−1}. The rth and last components of the score function are given by
$$ \begin{aligned} U_{r}(\boldsymbol{\beta}, b)&=-\frac{1}{b}\sum_{i=1}^{n}\Big[\delta_{i} \frac{\partial \log f_{0}(z_{i})}{\partial z_{i}}+(1-\delta_{i})\frac{\partial \log S_{0}(z_{i})}{\partial z_{i}}\Big]x_{ir}, \end{aligned} $$
(3)
$$ \begin{aligned} U_{b}(\boldsymbol{\beta}, b)&=-\frac{m}{b}-\frac{1}{b}\sum_{i=1}^{n}\left[\delta_{i} \frac{\partial \log f_{0}(z_{i})}{\partial z_{i}}+(1-\delta_{i})\frac{\partial \log S_{0}(z_{i})}{\partial z_{i}}\right]z_{i}. \end{aligned} $$
(4)
The (p+2)×(p+2) observed information matrix is given by:
$$\begin{aligned} \boldsymbol{I}(\boldsymbol{\beta},b) =& -\left(\begin{array}{cc} \partial^{2}\ell/\partial\boldsymbol{\beta}\partial\boldsymbol{\beta}^{T} & \partial^{2}\ell/\partial\boldsymbol{\beta}\partial b \\ \partial^{2}\ell/\partial b\partial\boldsymbol{\beta}^{T} & \partial^{2}\ell/\partial b^{2} \end{array}\right) \\ =& -\frac{1}{b^{2}}\left(\begin{array}{cc} \sum_{i=1}^{n}A_{i}\boldsymbol{x}_{i}\boldsymbol{x}_{i}^{T} & \sum_{i=1}^{n}(A_{i}z_{i} + B_{i})\boldsymbol{x}_{i} \\ \sum_{i=1}^{n}(A_{i}z_{i} + B_{i})\boldsymbol{x}_{i}^{T} & m + \sum_{i=1}^{n}(A_{i}z_{i}^{2} +2B_{i} z_{i}) \end{array}\right), \end{aligned} $$
where
$$ \begin{aligned} A_{i} &= \delta_{i} \frac{\partial^{2} \log f_{0}(z_{i})}{\partial z_{i}^{2}}+(1-\delta_{i})\frac{\partial^{2} \log S_{0}(z_{i})}{\partial z_{i}^{2}},\\ B_{i} &= \delta_{i} \frac{\partial \log f_{0}(z_{i})}{\partial z_{i}}+(1-\delta_{i})\frac{\partial \log S_{0}(z_{i})}{\partial z_{i}}. \end{aligned} $$
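As a sanity check (ours, not from the paper), in the normal case ∂ log f_{0}(z)/∂z = −z and ∂ log S_{0}(z)/∂z = −φ(z)/S_{0}(z), so the analytic score built from the B_{i} terms can be compared against a numerical gradient of the log-likelihood:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import approx_fprime

def loglik(theta, y, X, delta):
    beta, b = theta[:-1], theta[-1]
    z = (y - X @ beta) / b
    return (-delta.sum() * np.log(b)
            + np.sum(delta * norm.logpdf(z) + (1 - delta) * norm.logsf(z)))

def score(theta, y, X, delta):
    beta, b = theta[:-1], theta[-1]
    z = (y - X @ beta) / b
    # B_i with d log f0/dz = -z and d log S0/dz = -phi(z)/S0(z) (normal case)
    B = -delta * z - (1 - delta) * np.exp(norm.logpdf(z) - norm.logsf(z))
    U_beta = -(X.T @ B) / b               # since dz_i/dbeta_r = -x_ir / b
    U_b = -delta.sum() / b - (B @ z) / b  # since dz_i/db = -z_i / b
    return np.append(U_beta, U_b)

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(20), rng.normal(size=20)])
y = X @ np.array([0.5, 1.0]) + 0.7 * rng.standard_normal(20)
delta = (rng.uniform(size=20) < 0.7).astype(float)
theta = np.array([0.4, 0.9, 0.8])

numeric = approx_fprime(theta, loglik, 1e-6, y, X, delta)
analytic = score(theta, y, X, delta)
# numeric and analytic should agree up to finite-difference error
```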
Firth’s penalized likelihood method for AFT model
In order to remove the first order bias O(n^{−1}) in the MLE of the parameter vector, say θ, of generalized linear models, [10] introduced a penalized log-likelihood function by adding a penalty term (1/2) log|I(θ)| to the original log-likelihood function ℓ(θ). Without loss of generality, Firth’s procedure can be directly applied to the likelihood function of the AFT models given in Eq. (2). For the AFT model (Eq. 1) with a (p+2)-dimensional parameter vector θ=(β^{T},b)^{T}, the penalized log-likelihood function with Firth’s penalty term is given by
$$\begin{array}{*{20}l} \ell^{*}(\boldsymbol{\beta},b)=\ell(\boldsymbol{\beta},b)+\frac{1}{2}\log|\boldsymbol{I}(\boldsymbol{\beta},b)|, \end{array} $$
(5)
where |I(β,b)|^{1/2} is the Jeffreys invariant prior, whose influence is asymptotically negligible. Adding the penalty term to the original likelihood function given in Eq. (2) yields an explicit form of the penalized likelihood function. In addition, according to Firth’s principle, the corresponding modified score functions for the rth regression coefficient β_{r} and the scale parameter b can be written by adding the derivative of the penalty term to Eqs. (3) and (4), respectively, as follows:
$$\begin{array}{*{20}l} {U}_{r}^{*}(\boldsymbol{\beta}, b)&={U}_{r}(\boldsymbol{\beta}, b)+\frac{1}{2}\text{tr}\Big[\boldsymbol{I}(\boldsymbol{\beta},b)^{-1}\Big(\frac{\partial \boldsymbol{I}(\boldsymbol{\beta},b)}{\partial \beta_{r}}\Big)\Big]\\ &={U}_{r}(\boldsymbol{\beta}, b)+\frac{\partial} {\partial \beta_{r}} \Big[\frac{1}{2}\log |\boldsymbol{I}(\boldsymbol{\beta},b)| \Big], \end{array} $$
(6)
and
$$\begin{array}{*{20}l} {U}^{*}_{b}(\boldsymbol{\beta}, b)&={U}_{b} (\boldsymbol{\beta},b)+\frac{1}{2}\text{tr}\Big[\boldsymbol{I}(\boldsymbol{\beta},b)^{-1}\Big(\frac{\partial \boldsymbol{I}(\boldsymbol{\beta},b)}{\partial b}\Big)\Big]\\ &={U}_{b}(\boldsymbol{\beta}, b)+\frac{\partial} {\partial b} \Big[\frac{1}{2}\log |\boldsymbol{I}(\boldsymbol{\beta},b)| \Big]. \end{array} $$
(7)
By expanding the above two Eqs. (6–7) one can derive an explicit analytical form of the score equations, which ensure finite estimates of both β and b when solved using the Newton-Raphson method. However, in numerical optimization, the Newton-Raphson method can be computationally tedious and inefficient for large, complex nonlinear problems if the Hessian (the second order derivative of the objective function) is calculated directly at each iteration. Moreover, the Newton-Raphson method may fail if the Hessian is singular at any iteration. Therefore, rather than solving the modified score equations by the Newton-Raphson algorithm, we directly optimized the penalized likelihood function given in Eq. (5) using a quasi-Newton method referred to as the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm, which also ensures finite estimates of both β and b. It is computationally cheaper and more efficient than Newton’s method, as it approximates the Hessian matrix using the gradient (the first order derivative of the objective function) at each step rather than computing it iteratively [2]. Thus the penalized likelihood estimates θ^{∗}=(β^{∗},b^{∗}) can be obtained from the optimization as follows:
$$\begin{array}{*{20}l} \boldsymbol{\theta}^{*} = \operatorname*{arg\,max}_{\boldsymbol{\beta},b} \ell^{*}(\boldsymbol{\beta},b). \end{array} $$
The corresponding standard errors of the estimators can be obtained from the inverse of the approximated Hessian matrix.
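The optimization described above can be sketched as follows for the normal case, with the observed information assembled from the A_{i} and B_{i} terms; this is a minimal illustration under our own parameterization (optimizing log b to keep the scale positive), not the authors' implementation.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

# Negative penalized log-likelihood (Eq. (5)) for the normal case.
def neg_penalized_loglik(theta, y, X, delta):
    beta, b = theta[:-1], np.exp(theta[-1])       # optimize log(b) so that b > 0
    z = (y - X @ beta) / b
    m = delta.sum()
    loglik = -m * np.log(b) + np.sum(delta * norm.logpdf(z)
                                     + (1 - delta) * norm.logsf(z))
    h = np.exp(norm.logpdf(z) - norm.logsf(z))    # hazard of the standard normal
    A = -delta - (1 - delta) * h * (h - z)        # A_i for the normal case
    B = -delta * z - (1 - delta) * h              # B_i for the normal case
    p1 = X.shape[1]
    info = np.empty((p1 + 1, p1 + 1))             # observed information I(beta, b)
    info[:p1, :p1] = -(X.T * A) @ X / b**2
    info[:p1, p1] = info[p1, :p1] = -X.T @ (A * z + B) / b**2
    info[p1, p1] = -(m + np.sum(A * z**2 + 2 * B * z)) / b**2
    _, logdet = np.linalg.slogdet(info)           # Firth penalty: 0.5 * log|I|
    return -(loglik + 0.5 * logdet)

# Simulated censored data with illustrative true values (0.5, 1.0, b = 0.7).
rng = np.random.default_rng(2)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
log_t = X @ np.array([0.5, 1.0]) + 0.7 * rng.standard_normal(n)
log_c = X @ np.array([0.5, 1.0]) + 0.7 * rng.standard_normal(n) + 0.5
y = np.minimum(log_t, log_c)
delta = (log_t <= log_c).astype(float)

res = minimize(neg_penalized_loglik, x0=np.zeros(3), args=(y, X, delta),
               method="BFGS")
beta_star, b_star = res.x[:-1], np.exp(res.x[-1])
# res.hess_inv approximates the inverse Hessian, from which standard errors follow.
```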
Intercept and scale parameter correction
Firth’s penalized likelihood method is known to reduce the first order bias in the estimates of the model parameters by shrinking them towards the true values. However, incorporating the penalty term in the likelihood may cause greater shrinkage of the intercept and scale parameter, which in turn may introduce bias in the estimated survival probabilities given by
$$\begin{array}{*{20}l} S(t)=S_{0}\left(\frac{\log t - u(\boldsymbol{x} ; \boldsymbol{\beta}^{*})}{b^{*}}\right)=S_{0}\left(\frac{\log t - \boldsymbol{\beta}^{*T} \boldsymbol{x}}{b^{*}}\right). \end{array} $$
Therefore, a post-hoc adjustment of the intercept and scale parameter estimates in the AFT model has been performed by keeping these parameters out of the penalization. The adjustment follows the procedure described in a recent study [22] for correcting the intercept term in Firth’s logistic regression. The intercept and scale parameter corrections in AFT models can be carried out as follows:

(i)
Estimate the coefficients as \(\boldsymbol {\hat {\theta }}_{F}=(\hat {\beta }_{F,0},\hat {\beta }_{F,1},\dots,\hat {\beta }_{F,p}, \hat {b}_{F})\) by Firth’s penalization.

(ii)
Calculate the linear predictors \(\hat {\eta }_{i}=\hat {\beta }_{F,1}x_{i1} + \hat {\beta }_{F,2}x_{i2} + \dots + \hat {\beta }_{F,p}x_{ip}\), omitting the intercept.

(iii)
Determine the ML estimates \(\hat {\beta }_{0}\) of the intercept and \(\hat {b}\) of the scale parameter for the AFT model \(y_{i}=\beta _{0}+ \hat {\eta }_{i} + bZ_{i},\) containing only the single predictor \(\hat {\eta }_{i}\) with regression coefficient fixed at one. This can be achieved by including an offset in a standard procedure or by direct maximum likelihood estimation.

(iv)
The resulting estimate \(\boldsymbol {\hat {\theta }}_{C} = (\hat {\beta _{0}},\hat {\beta }_{F,1},\dots, \hat {\beta }_{F,p}, \hat {b})\) is then considered as the corrected Firth’s estimate with the intercept and scale parameter replaced by \(\hat {\beta _{0}}\) and \(\hat {b}\) respectively in the original estimates \(\boldsymbol {\hat {\theta }}_{F}\).
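Steps (ii)–(iv) can be sketched as follows for a log-normal AFT model, assuming Firth estimates are already available; the Firth coefficient values and the simulated data below are placeholders, not results from the paper.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

# Hypothetical Firth estimates (intercept, slope) -- placeholders only.
beta_F = np.array([0.3, 0.9])

rng = np.random.default_rng(3)
x1 = rng.normal(size=40)
y = 0.5 + 1.0 * x1 + 0.7 * rng.standard_normal(40)   # simulated log-times
delta = np.ones(40)                                  # no censoring, for brevity

# Step (ii): linear predictor from the Firth slopes, omitting the intercept.
eta = beta_F[1] * x1

# Step (iii): ML for the intercept and scale only, with eta a fixed offset.
def neg_loglik(params):
    b0, log_b = params
    b = np.exp(log_b)
    z = (y - b0 - eta) / b
    return delta.sum() * np.log(b) - np.sum(delta * norm.logpdf(z)
                                            + (1 - delta) * norm.logsf(z))

res = minimize(neg_loglik, x0=np.zeros(2), method="BFGS")
b0_hat, b_hat = res.x[0], np.exp(res.x[1])

# Step (iv): the corrected estimate keeps the Firth slopes.
theta_C = np.concatenate([[b0_hat], beta_F[1:], [b_hat]])
```

In the uncensored normal case this refit has a closed form (the mean and MLE standard deviation of the offset-adjusted residuals), which makes the sketch easy to verify.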
This post-hoc adjustment is required only if the model is to be used for survival prediction, which is often a primary objective in clinical prediction research. In the following subsection, the performance of the Firth estimates of the model parameters is investigated using an extensive simulation study, and the results are compared with those obtained from the standard maximum likelihood techniques.