Skip to main content

On estimation for accelerated failure time models with small or rare event survival data



Separation or monotone likelihood may exist in fitting process of the accelerated failure time (AFT) model using maximum likelihood approach when sample size is small and/or rate of censoring is high (rare event) or there is at least one strong covariate in the model, resulting in infinite estimates of at least one regression coefficient.


This paper investigated the properties of the maximum likelihood estimator (MLE) of the regression parameters of the AFT models for small sample and/or rare-event situation and addressed the problems by introducing a penalized likelihood approach. The penalized likelihood function and the corresponding score equation is derived by adding a penalty term to the existing likelihood function, which was originally proposed by Firth (Biometrika, 1993) for the exponential family models. Further, a post-hoc adjustment of intercept and scale parameters is discussed keeping them out of penalization to ensure accurate prediction of survival probability. The penalized method was illustrated for the widely used log-location-scale family models such as Weibull, Log-normal and Log-logistic distributions and compared the models and methods uisng an extensive simulation study.


The simulation study, performed separately for each of the log-location-scale models, showed that Firth’s penalized likelihood succeeded to solve the problem of separation and achieve convergence, providing finite estimates of the regression coefficients, which are not often possible by the MLE. Furthermore, the proposed penalized method showed substantial improvement over MLE by providing smaller amount of bias, mean squared error (MSE), narrower confidence interval and reasonably accurate prediction of survival probabilities. The methods are illustrated using prostate cancer data with existence of separation, and results supported the simulation findings.


When sample size is small (≤ 50) or event is rare (i.e., censoring proportion is high) and/or there is any evidence of separation in the data, we recommend to use Firth’s penalized likelihood method for fitting AFT model.

Peer Review reports


It is now well established in generalized linear model literature that maximum likelihood estimation provides consistent estimates of the regression parameters when sample size is large. However, it may fail to provide a finite or unbiased estimate for at least one regression parameter of the model if sample size is small [10]. The small-sample consequences arise frequently and become worse if there exists separation in the data [12]. The problem of separation or monotone likelihood, first introduced by [1] in binary regression model is a special condition in a dataset which breakdown the standard maximum likelihood method during fitting process of the model resulting in non-existence (or infinite values) of the maximum likelihood estimates. The situation of separation or monotone likelihood may arise when dataset is small in size as well as if event or non-event of interest can be separated mostly by a binary covariate or a linear combination of several covariates [1, 3, 12, 13]. In the presence of separation, the traditional maximum likelihood estimation approach may provide highly biased, even infinite estimates for the regression coefficients of one or more covariates and hence provide Wald confidence intervals of infinite width [11, 12, 17, 25]. A number of studies discussed finite sample bias correction in the maximum likelihood estimate of the regression coefficient and provided solution to separation [6, 7, 18, 24]. To address the problem of separation and infinite estimates, Firth’s preventive method [10] is widely used in statistical inference as it eliminates the first order term O(n−1) in the asymptotic bias of the estimated parameters by solving the modified estimating equation resulted from the addition of a Jeffrey’s invariant prior based penalty term to the original likelihood function. The performance of Firth’s approach in proving the bias reduced estimates and resolving the problem of separation has been demonstrated for the logistic regression model [13] and other models under exponential family of the distributions [4, 15, 19].

The separation or monotone likelihood issue in the context of survival data were first introduced by [12], where it was argued that the pattern of such problem and its consequences in survival data is very similar as in the binary data. However, one may visualize it differently for survival data context. For example, for a single covariate, this occurs when, at each failure time, the covariate value for the failed subjects is the largest of all covariate values in the risk set at that time, or when it is always the smallest. It also occurs when the same is true for a linear combination of covariates. As discussed by Heinze and Schemper, the problem of separation is likely to occur if the sample size is small, percentage of censoring is high, and/or there are one or more strong covariates, particularly binary covariates. The chance of separation increases with the degree of imbalance in the distribution of binary covariate. However, the separation is rarely occur with the continuous covariate and uncensored data. Whatever the reasons for occurring separation in survival data, the situation is by no means negligible as it creates several consequences. To overcome the problems due to small sample, or rare events, or separation in analyzing survival data, [12] suggested a penalized likelihood function for semi-parametric Cox proportional hazard model [8] by incorporating the Firth’s bias preventive principle into the partial likelihood function. However, an appropriately fitted parametric survival model with correctly specified distribution always yields consistent and more efficient estimates of the parameters of interest than the estimates obtained from a semi-parametric model [9, 20] and have intuitive interpretation through a direct connection with the failure time. Moreover, the estimation technique under a parametric model is computationally more flexible and provides precise estimates since both the survival and censoring times are used directly to construct the likelihood function. To provide bias corrected estimate in the small sample situation, [21] applied Firth’s penalized approach to exponential survival model, which, however, has limited use in practice because of inapplicability of constant failure rate assumption in most real world applications. Therefore, under the parametric framework, it is obvious to focus on widely used accelerated failure time (AFT) model, which is a general framework of a range of parametric survival models under log-location scale family of distributions. However, the performance of maximum likelihood estimation technique to estimate the parameters of the AFT models has not been investigated yet when sample size is small or event of interest is rare or if there exists separation in the survival data. In this study, an attempt has been made to examine performance of maximum likelihood estimators in such situations through conducting extensive simulation studies. To address the problems in the MLEs, this paper proposed penalized likelihood estimation for AFT survival models by incorporating the Firth’s penalty term to the original likelihood function, as motivated by [12]. The empirical performance of the newly proposed approach was studied through simulations, where penalized likelihood function was optimized by using quasi-Newton-algorithm.

Though the main purpose of the AFT survival models is to examine how the covariates influence the survival times, estimate of scale parameter is of interest as estimates of both regression and scale parameters are required to predict the survival quantities such as survival probabilities and hazard functions. However, imposing Jeffrey’s prior for penalization may result in further shrinkage of the estimate of scale and intercept parameters, which may lead to biased estimates of survival quantities. This is motivated by the recent study of [22] who identified that Firth’s penalization for logistic regression resulted in inaccurate prediction of overall probability and proposed a correction in the intercept term to ensure accurate prediction. Following the first modification of Firth’s procedure suggested by [22] in logistic regression, this paper also proposed a post-hoc adjustment of the intercept and scale parameters estimated by the penalized likelihood method in the AFT model by keeping them out of penalization.

The paper is organized as follows. “Methodology” section describes the methodology starting with a brief discussion on AFT model and maximum likelihood estimation procedure, which is followed by the application of Firth’s principle to derive the penalized likelihood function for AFT model and optimization procedure and ended by the post-hoc adjustment of the scale parameters. Comprehensive simulation studies conducted under different scenarios to compare the performance of estimates obtained through the maximum likelihood and penalized likelihood functions are given in “Simulation study” section. An illustration of the proposed methods using data on prostate cancer patients [5] is discussed in “Illustration using prostate cancer data” section. This paper concludes with a brief discussion on findings, limitations, and further scope of this study in “Discussion” section.


AFT model

Let us consider a censored random sample containing data (yi,δi,xi),i=1,…,n, where yi= log(ti) is a log-lifetime or log-censoring-time if the censoring indicator δi=1 or δi=0, respectively and xi=(1,xi1,…,xir,…,xip)T is a (p+1)-dimensional vector of covariates. The location-scale family model describes the relationship between the survivor function S(yi|xi) and a set of covariates x as

$$\begin{array}{*{20}l} S(y_{i}|\boldsymbol{x}_{i})=S_{0}\Big(\frac{y_{i}-u(\boldsymbol{x}_{i}\boldsymbol{\beta})}{b}\Big), \;\; -\infty< y<+\infty \end{array} $$

where S0(z) is the survivor function of a standardized random variable, u(xi,β)=βTxi denotes location parameter and b the scale parameter, and β=(β0,β1,…,βr,…,βp)T is a vector of regression coefficients [16]. The model can also be written in the following form to describe the relationship between covariate and log-survival time:

$$\begin{array}{*{20}l} y_{i}=u(\boldsymbol{x}_{i}\boldsymbol{\beta})+bZ_{i}, \end{array} $$

where Zi is a random variable with a standard distribution in (−,). The covariate effectively alters the log-time scale in additive form (or original time-scale in multiplicative form) and hence the model is referred to as accelerated failure time (AFT) model. The above equation represents as a family of models for which Z belong to a standard location-scale family of distributions (i.e., extreme value, logistic, normal distribution) while the survival time T belonging to the log-location scale family of distributions (i.e., Weibull, log-logistic, log-normal), respectively.

Maximum likelihood estimation for AFT model

The maximum likelihood method is commonly used to estimate the parameter of the model given in Eq. (1). Setting zi=(yiui)/b with p.d.f \(f_{0}(z)=-S^{'}_{0}(z)\); ui=u(xi;β) and \(m=\sum _{i=1}^{n}\delta _{i}\), the log-likelihood function for the location-scale family model (1) can be written as:

$$ \begin{aligned} \ell(\boldsymbol{\beta},b)=-m\log b+\sum_{i=1}^{n}\left[\delta_{i}\log f_{0}(z_{i})+(1-\delta_{i})\log S_{0}(z_{i})\right]. \end{aligned} $$

Let θ=(βT,b)T be the (p+2)-dimensional vector of parameters. The maximum likelihood estimate of θ is the solution of the estimating equations U(β,b)=0 simultaneously, where U(β,b) is the score function defined as

$$\begin{aligned} U(\boldsymbol{\beta},b)=&\partial \ell(\boldsymbol{\theta})/\partial \boldsymbol{\theta}\\=&[\!U_{0}(\boldsymbol{\beta},b),\! U_{1}\!(\boldsymbol{\beta},\!b), \dots\!, U_{r}(\boldsymbol{\beta},b),\ldots\!, U_{p}(\boldsymbol{\beta},b), U_{b}(\boldsymbol{\beta},b)]^{T}\!. \end{aligned} $$

If X is an n×(p+1) matrix having rows \(\boldsymbol {x}_{i}^{T}=(1, x_{i1}, \ldots, x_{ir}, \ldots, x_{ip})\), then zi/βr=−xirb−1; zi/b=−zib−1. The r-th and last components of the score functions are given by,

$$ \begin{aligned} U_{r}(\boldsymbol{\beta, b})&=-\frac{1}{b}\sum_{i=1}^{n}\Big[\delta_{i} \frac{\partial \log f_{0}(z_{i})}{\partial z_{i}}+(1-\delta_{i})\frac{\partial \log S_{0}(z_{i})}{\partial z_{i}}\Big]x_{ir}, \end{aligned} $$
$$ \begin{aligned} U_{b}(\boldsymbol{\beta, b})&=-\frac{r}{b}-\frac{1}{b}\sum_{i=1}^{n}\left[\delta_{i} \frac{\partial \log f_{0}(z_{i})}{\partial z_{i}}+(1-\delta_{i})\frac{\partial \log S_{0}(z_{i})}{\partial z_{i}}\right]z_{i}. \end{aligned} $$

The (p+2)×(p+2) observed information matrix is given by:

$$\begin{aligned} \boldsymbol{I}(\boldsymbol{\beta},b) =& \left(\begin{array}{cc} -\partial^{2}l/\partial\boldsymbol{\beta}\partial\boldsymbol{\beta}^{T} & -\partial^{2}l/\partial\boldsymbol{\beta}\partial b \\ -\partial^{2}l/\partial b\partial\boldsymbol{\beta} & -\partial^{2}l/\partial b^{2} \end{array}\right) \\ =& \frac{1}{b^{2}}\!\left(\!\begin{array}{cc} -\sum_{i=1}^{n}A_{i}\boldsymbol{x}_{i}\boldsymbol{x}_{i}^{T} & -\sum_{i=1}^{n}(A_{i}z_{i} + B_{i})\boldsymbol{x}_{i} \\ \!-\!\sum_{i=1}^{n}(A_{i}z_{i} \!+\! B_{i})\boldsymbol{x}_{i}^{T} &\! -[m \!+\! \sum_{i=1}^{n}(A_{i}z_{i}^{2} +2B_{i} z_{i})] \end{array}\!\right), \end{aligned} $$


$$ \begin{aligned} A_{i} &= \delta_{i} \frac{\partial^{2} \log f_{0}(z_{i})}{\partial z_{i}^{2}}+(1-\delta_{i})\frac{\partial^{2} \log S_{0}(z_{i})}{\partial z_{i}^{2}},\\ B_{i} &= \delta_{i} \frac{\partial \log f_{0}(z_{i})}{\partial z_{i}}+(1-\delta_{i})\frac{\partial \log S_{0}(z_{i})}{\partial z_{i}}. \end{aligned} $$

Firth’s penalized likelihood method for AFT model

In order to remove the first order bias O(n−1) in the MLE of the regression parameter, say θ, of the generalized linear models, [10] introduced a penalized log-likelihood function by adding a penalty term 1/2 log|I(θ)| to the original log-likelihood function (θ). Without loss of generality, Firth’s procedure can be directly applied to the likelihood function of the AFT models given in Eq. (2). For the AFT model (Eq. 1) with a (p+2)-dimensional parameter vector θ=(βT,b)T, the penalized log-likelihood function with Firth’s penalty term is given by

$$\begin{array}{*{20}l} \ell^{*}(\boldsymbol{\beta},b)=\ell(\boldsymbol{\beta},b)+\frac{1}{2}\log|I (\boldsymbol{\beta},b)|, \end{array} $$

where |I(β,b)|1/2 is the Jeffreys invariant prior, whose influence is asymptotically negligible. By adding the penalty term with the original likelihood function given in Eq. (2) one can derive an explicit form of the penalized likelihood function. In addition, according to Firth’s principle, corresponding modified score functions for the r-th regression βr and scale parameter b can be written by adding the penalty term in Eqs. (3) and (4) respectively as follows :

$$\begin{array}{*{20}l} {U}_{r}^{*}(\boldsymbol{\beta}, b)&={U}_{r}(\boldsymbol{\beta}, b)+\frac{1}{2}\text{tr}\Big[\boldsymbol{I}(\boldsymbol{\beta},b)^{-1}\Big(\frac{\partial \boldsymbol{I}(\boldsymbol{\beta},b)}{\partial \beta_{r}}\Big)\Big]\\ &={U}_{r}(\boldsymbol{\beta}, b)+\frac{\partial} {\partial \beta_{r}} \Big[\frac{1}{2}\log |\boldsymbol{I}(\boldsymbol{\beta},b)| \Big], \end{array} $$


$$\begin{array}{*{20}l} {U}^{*}_{b}(\boldsymbol{\beta}, b)&={U}_{b} (\boldsymbol{\beta},b)+\frac{1}{2}\text{tr}\Big[\boldsymbol{I}(\boldsymbol{\beta},b)^{-1}\Big(\frac{\partial \boldsymbol{I}(\boldsymbol{\beta},b)}{\partial b}\Big)\Big]\\ &={U}_{b}(\boldsymbol{\beta}, b)+\frac{\partial} {\partial b} \Big[\frac{1}{2}\log |\boldsymbol{I}(\boldsymbol{\beta},b)| \Big]. \end{array} $$

By expanding the above two Eqs. (67) one can derive an explicit analytical form of the score equations which ensure finite estimates of both the β and b while solving them using Newton-Raphson method. However, in numerical optimization, the Newton-Raphson’s method can be computationally tedious and inefficient for expansive and complex non-linear problems if the Hessian (second order derivative of the objective function) is directly calculated iteratively. Moreover, Newton-Raphson’s method might not work properly if the Hessian is singular at any iteration. Therefore, rather than solving the modified score equation by the Newton-Raphson algorithm, we directly optimized the penalized likelihood function given in Eq. (7) using a quasi-Newton method referred to the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm, which also ensures finite estimates of both the β and b. It is computationally cheaper and more efficient than the Newton’s method and approximates the Hessian matrix using the gradient (first order derivative of the objective function) at each step rather than iteratively computing it [2]. Thus the penalized likelihood estimates θ=(β,b) can be obtained from the optimization as follows:

$$\begin{array}{*{20}l} \boldsymbol{\theta}^{*} = \operatorname*{arg\,max}_{\boldsymbol{\beta}^{*},b^{*}} \ell^{*}(\boldsymbol{\beta},b). \end{array} $$

The corresponding standard error of the estimator can be obtained from the approximated Hessian matrix.

Intercept and scale parameter correction

Firth’s penalized likelihood method is known to reduce first order bias in the estimate of the model parameter by shrinking the estimate towards the true value. However, incorporating the penalty term in the likelihood may cause greater shrinkage of the intercept and scale parameter, which in turn may provides bias in the estimated survival probabilities given by

$$\begin{array}{*{20}l} S(t)=S_{0}\left(\frac{\log t - u(\boldsymbol{x} ; \boldsymbol{\beta}^{*})}{b^{*}}\right)=S_{0}\left(\frac{\log t - \boldsymbol{\beta}^{*T} \boldsymbol{x}}{b^{*}}\right). \end{array} $$

Therefore, a post-hoc adjustment of the intercept and scale parameter estimates in the AFT model has been performed by keeping these parameters out of penalization. The adjustment was performed by following the procedure described in a recent study by [22] for correcting the intercept term in the Firth’s logistic regression. The intercept and scale parameter corrections in AFT models can be administered as follows:

  1. (i)

    Estimate the coefficients as \(\boldsymbol {\hat {\theta }}_{F}=(\hat {\beta }_{F,0},\hat {\beta }_{F,1},\dots,\hat {\beta }_{F,p}, \hat {b}_{F})\) by Firth’s penalization.

  2. (ii)

    Calculate the linear predictors \(\hat {\eta _{i}}=\hat {\beta }_{F,1}x_{i1} + \hat {\beta }_{F,2}x_{i2} + \dots + \hat {\beta }_{F,p}x_{ip}\), omitting the intercept.

  3. (iii)

    Determine the ML estimate \(\hat {\beta }_{0}\) of the intercept and \(\hat {b}\) of the scale parameter for the AFT model \(Y=\beta _{0}+ \hat {\eta _{i}} + bZ,\) containing only a single predictor \(\hat {\eta _{i}}\) with regression coefficient equal to one. This can be achieved by including an offset in a standard procedure or by direct maximum likelihood estimation.

  4. (iv)

    The resulting estimate \(\boldsymbol {\hat {\theta }}_{C} = (\hat {\beta _{0}},\hat {\beta }_{F,1},\dots, \hat {\beta }_{F,p}, \hat {b})\) is then considered as the corrected Firth’s estimate with the intercept and scale parameter replaced by \(\hat {\beta _{0}}\) and \(\hat {b}\) respectively in the original estimates \(\boldsymbol {\hat {\theta }}_{F}\).

This post-hoc adjustment is only required if the interest is to use the model for survival prediction, which is often a primary objective of many studies in clinical prediction research. In the following sub-section, the performance of the Firth’s estimates of the model parameters is investigated using an extensive simulation study and compared the results to those obtained for the standard maximum likelihood techniques.

Simulation study

Simulation design

Let us consider survival time Ti for the ith observation (i=1,…,n) which follows a probability distribution belonging to log-location scale family of distributions (e.g., Weibull, log-normal and log-logistic) and right censored time Ci which is independent of covariates. We considered two covariates of which one is continuous (Xc) and the other is binary (Xb). The continuous covariate was generated from a standard normal distribution and the binary covariate was generated from a Bernoulli distribution with probability of event π. Then the survival time from the log-location scale family of distributions has been generated as follows:

$$T_{i} = \exp (\beta_{0} + \beta_{c} X_{ic} + \beta_{b} X_{ib} + bZ_{i})$$

where, β0 denotes the intercept, βc and βb represent the regression coefficients associated with the continuous and binary covariates, respectively, b is the scale parameter and Zi is the error term generated from location-scale family of distributions (e.g., Gumbel, normal and logistic distributions). To generate survival times Ti from Weibull, or log-normal or log-logistic distribution (which one we needed), we considered generating Zi from Gumbel, or Normal or Logistic distribution, respectively. Further the censoring times (Ci) were generated independently from the same distribution from where the survival time were generated but by replacing the βxi by a constant term, say λ, referred to the parameter of the censoring distribution. The value of λ control the desired percentage of censoring in the observed data. The observed time-to-event was then defined as ti= min(Ti,Ci) and the indicator as δi=I(TiCi).

Three simulation series were performed separately by considering three different (but widely used) survival distributions such as Weibul, log-normal and log-logistic. Under each distribution (model), the data were generated as described above. The true values of regression and scale parameters were fixed at β=(β0,βb,βc)=(3,1.2,0.7) and b=0.67 respectively for Weibull and log-logistic distribution. A small change was made in the regression parameters β=(β0,βb,βc)=(1,1.2,0.7) in case of log-normal, considering the same scale parameter. As the data were generated randomly, the percentage of censoring was not exactly the same for all simulated datasets. In order to generate data with a specified censoring proportion,the parameter of the censoring times distribution were determined by iterative algorithm so that the specified censoring proportion would be achieved for the selected parameter values [23]. In the simulation, we reported average of the censoring percentages over 1000 simulations.

For each of the three models, several simulation scenarios were considered by varying the sample size n as 30, 50, and 100 and the percentage of censoring as 10, 20, 30, 40, 50, 60, 70, and 80 for each sample size scenario, except for n=30 for which it was administered up to 60 because extremely high percentage censoring for very small sample raised serious convergence problem in fitting procedure of the model due to lack of event. Again, fixing the censoring percentage at 20%, we varied the sample size as 15,30,45,60,75,90,105 and 120 to examine the finite sample properties of the estimates under a fixed proportion of censoring.

Further simulation was performed considering a scenario with separation, caused by an influential covariate in the model. In order to create separation, a comparatively larger value of βb (1.9) was attributed to the binary covariate Xb than the continuous covariate Xc (0.5) to increase its influence so that it can create separation. As discussed by [12], separation have been considered in a survival dataset if, at each failure time, the covariate value for the subjects who were failed is the greatest (or always smallest) among all the all covariate values in the risk set at that time point. It may happen at most failure time points if the influence of the covariate is strong. If separation happens for a binary covariate, according to the definition by Heinze and Schemper, the covariate value separate (fully or partially) the event from non-event (censored) in the data, resulting in a large difference in both survival curves and median survival times between two groups of subjects with respect to covariate values (1/0). Therefore, to explore the existence of separation in a simulated dataset, we tried to mimic the situation by producing a 2×2 contingency table of the censoring status and the binary covariate Xb and we considered as separated data if there is at least one cell with 0 frequency and/or the median survival times between the subjects with Xb=0 and those with Xb=1 is significantly different with p-value < 0.01. However, not all datasets over the number of simulations have such condition, but a number of datasets have at least one of the cells contains frequency less than or equal to 5 and/or the median survival times between these two groups of subjects is significant with p-value lies between 0.01 and 0.05, which we termed as “near-to-separation" (i.e., partially separated data). The condition of the near-to-separation was discussed in other studies for binary data [19]. We examined the effect of separation or near-to-separation for sample size 50 and censoring percentage 20, 50 and 80.

Model fitting and evaluating the performance

Under each scenario, we fitted model using both maximum likelihood and Firth’s penalized approaches and evaluated the properties such bias, mean-squared error (MSE) and length of confidence interval (CI). We reported the estimates of the parameters of the model as the average of 1000 simulations. The bias was calculated as the difference between the estimate and true value and the mean squared error as the mean of the squared differences between the estimates in each simulated data and the true value. We also reported both the analytical standard error as the mean of the standard error obtained during model fitting in each simulated data over the number of simulation sets and simulation standard error as the standard deviation of the estimates obtained in each simulation over the number of simulations. All computations were conducted with R statistical software of the version 3.5.3. The standard AFT models with MLE were fitted using the survreg function of the survival package and a self-written function aft.firth was applied to optimize the penalized likelihood function. The R-code of the aft.firth function is available as supplementary document of the article.

Simulation results

For the Weibull AFT model, the results suggests that both the coefficients associated with continuous (βc) and binary (βb) covariates are overestimated, in general, by MLE (Table 1). The degree of overestimation increases with the increasing percentage of censored observation. On the contrary, Firth’s penalized method showed some improvements by reducing bias and MSE for the estimates of both βc and βb (Figs. 1 and 2). The mean width of Wald based confidence interval is also narrower for Firth’s estimates in all cases, providing more precision in high censored situations. Table 2 shows that the intercept and scale parameter are generally underestimated by the MLE in most cases, except for intercept in high censored cases where it is highly overestimated. After making the post-hoc adjustment to the intercept and scale parameter, the Firth’s penalized method showed improvement by providing estimates relatively closer to the true value.

Fig. 1
figure 1

Bias associated with the estimates of regression coefficients (βc for continuous covariates and βb for binary covariates) obtained from both MLE and Firth procedure for Weibull AFT model

Fig. 2
figure 2

MSE associated with the estimates of regression coefficients (βc for continuous covariates and βb for binary covariates) obtained from both MLE and Firth procedure for Weibull AFT model

Table 1 Results of both MLE and Firth’s Penalized Likelihood Estimation for both βb and βc under Weibull Distribution. Each cell represents mean and standard deviation of estimates over number of valid cases (removing the simulations that were failed to achieve convergence) out of 1000 simulations. The maximum number of convergence failure for MLE is 60 when sample sizze is 50 and censoring rate 80%
Table 2 Results of both β0 and b from Maximum Likelihood Estimation and Firth’s Penalized Likelihood Estimation under Weibull Distribution. Each cell represents mean and standard deviation of estimates from valid cases out of 1000 simulations. The maximum number of convergence failure for MLE is 60 when sample sizze is 50 and censoring rate 80%

When the performance of the methods were examined for the scenario with existence of separation, the results depicts that the MLE provides infinitely large estimates in the presence of separation particularly for the regression coefficient (βb) associated with the binary covariate that created separation (Table 3). Conversely, Firth’s penalized method showed significant improvement by providing finite estimates of both the coefficient and its SE. It is notable that the values for MLEs are extremely large for high censoring, whereas the Firth’s procedure succeeds to provide finite estimates in such an extreme case. The amount of improvement is greater for the regression coefficient (βb) associated with the binary covariates than those associated with the continuous covariate. The simulation results for the near-to-separation scenario is similar to that of the separation, but with lower amount of bias. It is also reported that, in the presence of separation, the MLE failed to achieve convergence proving infinitely large value. The rate convergence-failure was 26% when sample size was 50 and censoring was 80%, and the rate decreased to 10% when sample size was 100 with the same level of censoring percentage (results not shown). The convergence-failure rate also decreased with the decreasing censoring percentage and it is very low (often negligible) while there was near-to-separation. In contrast, Firth’s penalized method achieved convergence in all simulation scenarios. Further, the penalized method with ad-hoc adjustment of the intercept and scale parameter outperformed the MLE when it was used for prediction for the survival probabilities (Table 4, Fig. 3). The penalized method provided very close prediction of the true survival probability at the 1st, 2nd, and 3rd quartiles of the survival time (Table 4) and over the whole follow-up time (Fig. 3) in comparison with the MLE.

Fig. 3
figure 3

Estimated mean survival probabilities over 500 simulations by both MLE and Firth procedures under Weibull distribution for sample size 30 with censoring percentage C=50%

Table 3 Estimates, standard error (SE) and simulation standard error (Sim.SE) of β0 and b from Maximum Likelihood Estimation and Firth’s Penalized Likelihood Estimation under Weibull Distribution in case of separation and near-to separation. Maximum convergence failure by MLE is 26.6% when separation occur over 1000 simulations in case of 80% censoring
Table 4 Estimates of survival probabilities (mean over 500 simulations) at the 1st, 2nd and 3rd quantile of survival times of Weibull distribution with different values of binary covariates and the mean value of continous covariate for sample size 50 and censoring 50%

Similar findings were found for the log-logistic AFT model, where Firth’s penalized method showed improvement over the MLE by reducing bias and MSE and providing narrower confidence interval, particularly when censoring percentage is high, for both the regression coefficients βc and βb in the model (Supplementary Table S1, Fig. 4). Similarly, for the intercept (β0) and scale parameter (b), Firth’s method with ad-hoc correction showed better performance than the MLE, particularly for high censoring situation (Supplementary Table S2). The correction procedure renders better performance in rare event situations under log-logistic distribution. In the presence of separation, Firth’s penalized method outperforms the MLE by reducing bias to some extent and providing narrower confidence interval (Supplementary Table S3). Similar findings can also be observed for survival prediction for the log-logistic AFT model (results not shown).

Fig. 4
figure 4

MSE associated with the estimates of regression coefficients (βc for continuous covariates and βb for binary covariates) obtained from both MLE and Firth procedure for Log-logistic AFT model

For the log-normal AFT model, Firth’s penalized method also showed similar performance by providing lower MSE and narrower confidence intervals than MLE (Supplementary Table S4, Fig. 5). The amount of improvement by Firth’s penalized method is also greater for the regression coefficient (βb) associated with binary covariates. For the intercept and scale parameters, the correction of Firth’s procedure provides an improvement over the MLE (Supplementary Table S5). Similarly, greater performance were also achieved by the Firth’s penalized method when it was used for prediction of the survival probabilities (results not shown).

Fig. 5
figure 5

MSE associated with the estimates of regression coefficients (βc for continuous covariates and βb for binary covariates) obtained from both MLE and Firth procedure for Log-normal AFT model

Illustration using prostate cancer data

The methods are illustrated using prostate cancer data, which were previously used in a study by [5] and are publicly accessible ( They conducted an exploratory analysis on the data from a clinical trial of estrogen treatment for a total of 502 prostatic cancer patients several survival status (1 - alive, 2 - dead from prostatic cancer, 3 - dead from heart or vascular disease, 4 - dead from cerebrovascular accident, 5 - dead from pulmonary embolus, 6 - dead from other cancer, 7 - dead from respiratory disease, 8 - dead from other specific non-cancer cause, 9 - dead from other unspecified non-cancer cause and 10 - dead from unknown cause). In this study, we firstly focus on the effectiveness of the treatment on the survival of patient with prostatic cancer and hence observe the time-to-death due to prostatic cancer or not (censoring percentage 53.82). An exploratory analysis with a 2×2 contingency table showed that the original data does not suffer from the problem of separation. Therefore, for an illustrative purpose of the method discussed, a random sub-sample of patients who are either alive or died from prostatic cancer was taken from the original data in order to create separation and near-to-separation. Furthermore, since the number of patients who died from respiratory disease and pulmonery embolus were very small in the original data (16 and 14 respectively) making these events rare, we also considered two more scenarios with the data: one with patients who are alive or died from respiratory disease and another with patients who are alive or died from pulmonery embolus. Each of these two scenarios created near-to-separation (non-zero cells with few observations).

The covariates of interest are: treatment (0, low; 1, high-dose), age (0, < 75 years; 1, 75 to 80 years; 2, ≥ 80 years), weight index caluctaed as weight (kg) - height(cm) + 200 (0, ≥ 100; 1, 80-99; 2, < 80), performance rating (0, normal; 1, limitation of activity), history of cardiovascular disease (0, no; 1, yes), serum haemoglobin (0, ≥ 12g/100 ml; 1, 9-12g/100 ml; 2, < 9g/100 ml), size of primary lesion (0, < 30 cm2; 1, ≥ 30 cm2), and Gleason stage/grade category (0, ≥ 10; 1, > 10). The variables are denoted as AG (patient age), WT(weight index), PF(performance rating), HX (history of cardiovascular disease), HG (serum haemoglobin), SZ (size of primary lesion) and SG (Geason stage/grade category). For each scenario, we considered a Weibull AFT model with the covariates selected based on the analysis discussed in literature [5, 14].

$$\begin{aligned} \log (T_{i}) &= \beta_{0} + \beta_{1} Treatment_{i} + \beta_{2} AG_{i} + \beta_{3} WT_{i} + \beta_{4} PF_{i} +\\&\beta_{5} HX_{i}+ \beta_{6} HG_{i} +\beta_{7} SZ_{i} +\beta_{8} SG_{i} + bz_{i},\! \quad i=1\dots, n. \end{aligned} $$

The above Weibull AFT model was fitted for both the cases of separation and near-to separation with patients died from prostatic cancer and the near-to separation case with patients died from respiratory disease and pulmonary embolus.

Analysis and results of prostate cancer data

As mentioned, for illustrative purpose, a random sample of size 30 was taken from the original sample to create separation and near-to-separation in the data, respectively. A 2×2 contingency table (Table 5) between the estrogen treatment and patient status shows two different forms of separation. Moreover, patient’s age and haemoglobin level have also created separation in the outcome variable. Table 6 reveals that the MLE of regression coefficient of the treatment status responsible to create separation in the sub-sample is very large comparative to that of the Firth’s estimate of the coefficient making MLE uninterpretable. Under the separation scenario, the MLE fails to deliver a standard error for the regression coefficient of age resulting in a disrupted inference. Conversely, Firth’s procedure produces a finite estimate and standard error for the corresponding covariate. Although, the MLE becomes smaller with increased degrees of overlapping in the near-to-separation for the treatment status, the Firth’s procedure provides both smaller estimate and standard error for this covariate. In terms of standard error, the penalized estimates are more efficient than MLE in each separation scenarios.

Table 5 Contingency tables between dichotomous covariates (treatment) and response (prostate cancer status) showing separation and near-to-separation
Table 6 Estimates of regression parameters and their standard error obtained from MLE and Firth’s procedure by fitting Weibull AFT model for prostate cancer data under separation and near-to-separation

Analysis and results of respiratory disease and pulmonary embolus

The contingency Table 7 between the estrogen treatment and patient’s survival status (death from respiratory disease and pulmonary embolus) shows the existence of near-to-separation in the data. Furthermore, performance rating, serum haemoglobin level and size of primary lesion have also created separation in outcome variable in both cases of respiratory disease and pulmonary embolus. Here, total sample size consists of 148 alive with only 16 (Respiratory disease) and 14 (Pulmonary embolus) events or failure and near-to-separation can be observed. The censoring percentage is 91.92% (Respiratory disease) and 93.08% (Pulmonary embolus) respectively. The Table 8 reveals that similar to the scenario with prostatic cancer, the Firth’s estimates of coefficient of treatment are smaller in magnitude and have smaller standard error than MLE in each case. Moreover, zero or extremely large standard errors estimated by MLE for some covariates (performance rating, serum haemoglobin and size of primary lesion) indicate convergence failure during estimation. In contrary, Firth’s procedure succeeds to deliver finite estimates and smaller standard error for all covariates in each case.

Table 7 Contingency tables between dichotomous covariates (treatment) and response (respiratory disease and pulmonary embolus status)
Table 8 Estimates of regression parameters and their standard error obtained from MLE and Firth’s procedure by fitting Weibull AFT model for time-to-event data with outcome both respiratory disease and pulmonary embolus


The AFT model is being widely used to analyze survival data from health and reliability engineering because of its intuitive interpretation connecting directly with failure time. For the rare event survival data or data with small in size, separation or monotone likelihood often exists in the fitting process of the AFT model using maximum likelihood estimation technique. The paper investigated the performance of the MLE of the AFT model in such data condition and addressed these issues by introducing a penalized likelihood approach by adding a Firth-type penalty term to the original likelihood. Further a post-hoc correction was made by keeping the intercept and scale parameter out of penalization to improve the estimates of predicted survival probabilities. The performance of the proposed method was evaluated using an extensive simulation study considering AFT model under three different (but widely used) distributions of the log-location scale family separately. For each of the models, the proposed penalized method has been shown to provide superior performance over MLE by solving the problem of monotone likelihood reflected by achieving convergence and providing estimates with lower bias and MSE and narrower confidence interval, in most simulation scenarios.

In particular, when sample size is small and/or percentage of censoring is high, the regression coefficient estimates (both binary and continuous) from penalized likelihood are generally shown to have lower bias and MSE with narrower confidence interval than that for MLE. Again, in the presence of any form of separation, the simulation results revealed that the MLE provided large amount of bias and MSE (often infinitely large value indicating frequent convergence failure) for the estimates of the regression coefficient, particularly those associated with the binary covariates that created separation. On the contrary, the penalized method showed improvement over MLE by achieving convergence and reducing bias and MSE to some extent and providing narrower confidence interval. However, comparable results are observed for both methods for the regression coefficient associated with continuous covariates that didn’t make separation. Simulation study also showed that the performance of the penalized likelihood estimation tends to be better than the MLE in separation than that for near-to-separation, indicating the effectiveness of the proposed method in extreme situations of separation. Furthermore, the post-hoc adjustment of the intercept and scale parameters under the penalized method has been shown to generate improved intercept and scale parameter estimates over MLEs by lowering the bias and consequently to provide relatively accurate estimates of the survival probabilities at different quartiles of the survival times. Simulation results of this study are quite similar to those with the other studies in the recent years which discussed Firth-type penalized estimates of regression models such as logistic regression [13, 22] and Cox regression [12].

An illustration of the methods using prostate cancer data supported the simulation findings by providing estimates with intuitive interpretation. However, demonstration of a rigorous application of this approach to a data with existence of high rate of censoring and/or existence of separation was not possible here to due to lack of access to such data, which may be useful for the practical users of this method. The proposed penalized method for AFT model underestimated the true SE for some scenarios (small smaple with high rate of censoring) and hence provided biased estimate of confidence interval, hence further study may be required with profile likelhood based confidence interval to address this problem. Further study may also be required to compare the performance of the Firth’s penalization for the Cox proportional hazard and AFT medals to address such problems related to small sample, high censoring and separation, because of a physical difference between of theses two models. In addition, one may compare the performance of Firth’s penalized AFT model with other penalized methods such as Ridge regression, LASSO etc.


The findings of the paper suggest that if the sample size is small and/or the percentage of censoring is high, the performance of MLE becomes unreliable as it provides biased estimates and creates separation leading to monotone likelihood with frequent convergence failure. The proposed penalized approach showed superior performance over MLE by reducing bias and MSE and solving the problem of separation. Therefore, if sample size is relatively small (e.g., n≤50) or there is evidence of high censoring and/or separation in the data, it is recommended to apply Firth’s penalized method for fitting AFT models.

Availability of data and materials

The dataset used in this study can be downloaded freely from the public domain, under the authority of the department of Biostatistics, Vanderbilt University, USA, at


  1. Albert A, Anderson JA. On the existence of maximum likelihood estimates in logistic regression models. Biometrika. 1984; 71(1):1–10.

    Article  Google Scholar 

  2. Arora JS. Introduction to optimum design. 4th ed: Elsevier; 2004.

  3. Bryson MC, Johnson ME. The incidence of monotone likelihood in the cox model. Technometrics. 1981; 23(4):381–83.

    Article  Google Scholar 

  4. Bull SB, Mak C, Greenwood CM. A modified score function estimator for multinomial logistic regression in small samples. Comput Stat Data Anal. 2002; 39(1):57–74.

    Article  Google Scholar 

  5. Byar D, Green S. The choice of treatment for cancer patients based on covariate information. Bull Cancer. 1980; 67(4):477–90.

    CAS  PubMed  Google Scholar 

  6. Cordeiro GM, Cribari-Neto F. On bias reduction in exponential and non-exponential family regression models. Commun Stat-Simul Comput. 1998; 27(2):485–500.

    Article  Google Scholar 

  7. Cordeiro GM, McCullagh P. Bias correction in generalized linear models. J R Stat Soc Ser B Methodol. 1991; 53(3):629–43.

    Google Scholar 

  8. Cox DR. Regression models and life-tables. J R Stat Soc Ser B Methodol. 1972; 34(2):187–202.

    Google Scholar 

  9. Cox DR, Oakes D. Analysis of survival data. 1st ed: Chapman and Hall/CRC; 1984.

  10. Firth D. Bias reduction of maximum likelihood estimate. Biometrika. 1993; 80(1):27–38.

    Article  Google Scholar 

  11. Hauck Jr WW, Donner A. Wald’s test as applied to hypotheses in logit analysis. J Am Stat Assoc. 1977; 72(360a):851–53.

    Article  Google Scholar 

  12. Heinze G, Schemper M. A solution to the problem of monotone likelihood in cox regression. Biometrics. 2001; 57(1):114–19.

    Article  CAS  Google Scholar 

  13. Heinze G, Schemper M. A solution to the problem of separation in logistic regression. Stat Med. 2002; 21(16):2409–19.

    Article  Google Scholar 

  14. Kay R. Treatment effects in competing-risks analysis of prostate cancer data. Biometrics. 1986; 42(1):203–11.

    Article  CAS  Google Scholar 

  15. Kosmidis I, Firth D. Multinomial logit bias reduction via the Poisson log-linear model. Biometrika. 2011; 98(3):755–59.

    Article  Google Scholar 

  16. Lawless JF. Statistical models and methods for lifetime data. 2nd ed: John Wiley & Sons, Inc; 2011.

  17. Lesaffre E, Albert A. Partial separation in logistic discrimination. J R Stat Soc Ser B Methodol. 1989; 51(1):109–16.

    Google Scholar 

  18. Leung DH-Y, Wang Y-G. Bias reduction using stochastic approximation. Aust N Z J Stat. 1998; 40(1):43–52.

    Article  Google Scholar 

  19. Mondal M, Rahman MS. Bias-reduced and separation-proof GEE with small or sparse longitudinal binary data. Stat Med. 2019; 38(14):2544–60.

    Article  Google Scholar 

  20. Oakes D. The asymptotic information in censored survival data. Biometrika. 1977; 64(3):441–48.

    Article  Google Scholar 

  21. Pettitt A, Kelly J, Gao J. Bias correction for censored data with exponential lifetimes. Stat Sin. 1998; 8(3):941–63.

    Google Scholar 

  22. Puhr R, Heinze G, Nold M, Lusa L, Geroldinger A. Firth’s logistic regression with rare events: accurate effect estimates and predictionsStat Med. 2017; 36(14):2302–17.

    PubMed  Google Scholar 

  23. Qian J, Li B, Chen P-y. Generating survival data in the simulation studies of cox model. In: 2010 Third International Conference on Information and Computing. Wuxi: IEEE: 2010. p. 93–96.

    Google Scholar 

  24. Schaefer RL. Bias correction in maximum likelihood logistic regression. Stat Med. 1983; 2(1):71–78.

    Article  CAS  Google Scholar 

  25. Vaeth M. On the use of wald’s test in exponential families. Int Stat Rev/Rev Int Stat. 1985; 53(2):199–214.

    Article  Google Scholar 

Download references


The authors acknowledge authority of the department of biostatistics, Vanderbilt University, USA for making available the dataset (Byar & Greene prostate cancer data) used in this study in a public domain.


The authors received no specific fund for this study. They did this research based on their on interest and used data from secondary source.

Author information

Authors and Affiliations



MSR and WB contributed to the design of the study; TFA analyzed the data; TFA and MSR wrote the draft of the manuscript and WB provided inputs. All authors have read and approved the final manuscript.

Corresponding author

Correspondence to M. Shafiqur Rahman.

Ethics declarations

Ethics approval and consent to participate

As the dataset is freely available in a public domain and is permitted to use in research publication, the ethics approval and consent statement has been approved by the authority who made the data available for public use. In addition, this is to confirm that all methods used in data collection were carried out in accordance with relevant guidelines and regulations.

Consent for publication

Not applicable.

Competing interests

The authors declared that they have no competing interest.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1

Supplementary Tables.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Alam, T.F., Rahman, M.S. & Bari, W. On estimation for accelerated failure time models with small or rare event survival data. BMC Med Res Methodol 22, 169 (2022).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: