 Research
 Open Access
 Published:
On estimation for accelerated failure time models with small or rare event survival data
BMC Medical Research Methodology volume 22, Article number: 169 (2022)
Abstract
Background
Separation or monotone likelihood may exist in fitting process of the accelerated failure time (AFT) model using maximum likelihood approach when sample size is small and/or rate of censoring is high (rare event) or there is at least one strong covariate in the model, resulting in infinite estimates of at least one regression coefficient.
Methods
This paper investigated the properties of the maximum likelihood estimator (MLE) of the regression parameters of the AFT models for small sample and/or rareevent situation and addressed the problems by introducing a penalized likelihood approach. The penalized likelihood function and the corresponding score equation is derived by adding a penalty term to the existing likelihood function, which was originally proposed by Firth (Biometrika, 1993) for the exponential family models. Further, a posthoc adjustment of intercept and scale parameters is discussed keeping them out of penalization to ensure accurate prediction of survival probability. The penalized method was illustrated for the widely used loglocationscale family models such as Weibull, Lognormal and Loglogistic distributions and compared the models and methods uisng an extensive simulation study.
Results
The simulation study, performed separately for each of the loglocationscale models, showed that Firth’s penalized likelihood succeeded to solve the problem of separation and achieve convergence, providing finite estimates of the regression coefficients, which are not often possible by the MLE. Furthermore, the proposed penalized method showed substantial improvement over MLE by providing smaller amount of bias, mean squared error (MSE), narrower confidence interval and reasonably accurate prediction of survival probabilities. The methods are illustrated using prostate cancer data with existence of separation, and results supported the simulation findings.
Conclusion
When sample size is small (≤ 50) or event is rare (i.e., censoring proportion is high) and/or there is any evidence of separation in the data, we recommend to use Firth’s penalized likelihood method for fitting AFT model.
Background
It is now well established in generalized linear model literature that maximum likelihood estimation provides consistent estimates of the regression parameters when sample size is large. However, it may fail to provide a finite or unbiased estimate for at least one regression parameter of the model if sample size is small [10]. The smallsample consequences arise frequently and become worse if there exists separation in the data [12]. The problem of separation or monotone likelihood, first introduced by [1] in binary regression model is a special condition in a dataset which breakdown the standard maximum likelihood method during fitting process of the model resulting in nonexistence (or infinite values) of the maximum likelihood estimates. The situation of separation or monotone likelihood may arise when dataset is small in size as well as if event or nonevent of interest can be separated mostly by a binary covariate or a linear combination of several covariates [1, 3, 12, 13]. In the presence of separation, the traditional maximum likelihood estimation approach may provide highly biased, even infinite estimates for the regression coefficients of one or more covariates and hence provide Wald confidence intervals of infinite width [11, 12, 17, 25]. A number of studies discussed finite sample bias correction in the maximum likelihood estimate of the regression coefficient and provided solution to separation [6, 7, 18, 24]. To address the problem of separation and infinite estimates, Firth’s preventive method [10] is widely used in statistical inference as it eliminates the first order term O(n^{−1}) in the asymptotic bias of the estimated parameters by solving the modified estimating equation resulted from the addition of a Jeffrey’s invariant prior based penalty term to the original likelihood function. The performance of Firth’s approach in proving the bias reduced estimates and resolving the problem of separation has been demonstrated for the logistic regression model [13] and other models under exponential family of the distributions [4, 15, 19].
The separation or monotone likelihood issue in the context of survival data were first introduced by [12], where it was argued that the pattern of such problem and its consequences in survival data is very similar as in the binary data. However, one may visualize it differently for survival data context. For example, for a single covariate, this occurs when, at each failure time, the covariate value for the failed subjects is the largest of all covariate values in the risk set at that time, or when it is always the smallest. It also occurs when the same is true for a linear combination of covariates. As discussed by Heinze and Schemper, the problem of separation is likely to occur if the sample size is small, percentage of censoring is high, and/or there are one or more strong covariates, particularly binary covariates. The chance of separation increases with the degree of imbalance in the distribution of binary covariate. However, the separation is rarely occur with the continuous covariate and uncensored data. Whatever the reasons for occurring separation in survival data, the situation is by no means negligible as it creates several consequences. To overcome the problems due to small sample, or rare events, or separation in analyzing survival data, [12] suggested a penalized likelihood function for semiparametric Cox proportional hazard model [8] by incorporating the Firth’s bias preventive principle into the partial likelihood function. However, an appropriately fitted parametric survival model with correctly specified distribution always yields consistent and more efficient estimates of the parameters of interest than the estimates obtained from a semiparametric model [9, 20] and have intuitive interpretation through a direct connection with the failure time. Moreover, the estimation technique under a parametric model is computationally more flexible and provides precise estimates since both the survival and censoring times are used directly to construct the likelihood function. To provide bias corrected estimate in the small sample situation, [21] applied Firth’s penalized approach to exponential survival model, which, however, has limited use in practice because of inapplicability of constant failure rate assumption in most real world applications. Therefore, under the parametric framework, it is obvious to focus on widely used accelerated failure time (AFT) model, which is a general framework of a range of parametric survival models under loglocation scale family of distributions. However, the performance of maximum likelihood estimation technique to estimate the parameters of the AFT models has not been investigated yet when sample size is small or event of interest is rare or if there exists separation in the survival data. In this study, an attempt has been made to examine performance of maximum likelihood estimators in such situations through conducting extensive simulation studies. To address the problems in the MLEs, this paper proposed penalized likelihood estimation for AFT survival models by incorporating the Firth’s penalty term to the original likelihood function, as motivated by [12]. The empirical performance of the newly proposed approach was studied through simulations, where penalized likelihood function was optimized by using quasiNewtonalgorithm.
Though the main purpose of the AFT survival models is to examine how the covariates influence the survival times, estimate of scale parameter is of interest as estimates of both regression and scale parameters are required to predict the survival quantities such as survival probabilities and hazard functions. However, imposing Jeffrey’s prior for penalization may result in further shrinkage of the estimate of scale and intercept parameters, which may lead to biased estimates of survival quantities. This is motivated by the recent study of [22] who identified that Firth’s penalization for logistic regression resulted in inaccurate prediction of overall probability and proposed a correction in the intercept term to ensure accurate prediction. Following the first modification of Firth’s procedure suggested by [22] in logistic regression, this paper also proposed a posthoc adjustment of the intercept and scale parameters estimated by the penalized likelihood method in the AFT model by keeping them out of penalization.
The paper is organized as follows. “Methodology” section describes the methodology starting with a brief discussion on AFT model and maximum likelihood estimation procedure, which is followed by the application of Firth’s principle to derive the penalized likelihood function for AFT model and optimization procedure and ended by the posthoc adjustment of the scale parameters. Comprehensive simulation studies conducted under different scenarios to compare the performance of estimates obtained through the maximum likelihood and penalized likelihood functions are given in “Simulation study” section. An illustration of the proposed methods using data on prostate cancer patients [5] is discussed in “Illustration using prostate cancer data” section. This paper concludes with a brief discussion on findings, limitations, and further scope of this study in “Discussion” section.
Methodology
AFT model
Let us consider a censored random sample containing data (y_{i},δ_{i},x_{i}),i=1,…,n, where y_{i}= log(t_{i}) is a loglifetime or logcensoringtime if the censoring indicator δ_{i}=1 or δ_{i}=0, respectively and x_{i}=(1,x_{i1},…,x_{ir},…,x_{ip})^{T} is a (p+1)dimensional vector of covariates. The locationscale family model describes the relationship between the survivor function S(y_{i}x_{i}) and a set of covariates x as
where S_{0}(z) is the survivor function of a standardized random variable, u(x_{i},β)=β^{T}x_{i} denotes location parameter and b the scale parameter, and β=(β_{0},β_{1},…,β_{r},…,β_{p})^{T} is a vector of regression coefficients [16]. The model can also be written in the following form to describe the relationship between covariate and logsurvival time:
where Z_{i} is a random variable with a standard distribution in (−∞,∞). The covariate effectively alters the logtime scale in additive form (or original timescale in multiplicative form) and hence the model is referred to as accelerated failure time (AFT) model. The above equation represents as a family of models for which Z belong to a standard locationscale family of distributions (i.e., extreme value, logistic, normal distribution) while the survival time T belonging to the loglocation scale family of distributions (i.e., Weibull, loglogistic, lognormal), respectively.
Maximum likelihood estimation for AFT model
The maximum likelihood method is commonly used to estimate the parameter of the model given in Eq. (1). Setting z_{i}=(y_{i}−u_{i})/b with p.d.f \(f_{0}(z)=S^{'}_{0}(z)\); u_{i}=u(x_{i};β) and \(m=\sum _{i=1}^{n}\delta _{i}\), the loglikelihood function for the locationscale family model (1) can be written as:
Let θ=(β^{T},b)^{T} be the (p+2)dimensional vector of parameters. The maximum likelihood estimate of θ is the solution of the estimating equations U(β,b)=0 simultaneously, where U(β,b) is the score function defined as
If X is an n×(p+1) matrix having rows \(\boldsymbol {x}_{i}^{T}=(1, x_{i1}, \ldots, x_{ir}, \ldots, x_{ip})\), then ∂z_{i}/∂β_{r}=−x_{ir}b^{−1}; ∂z_{i}/∂b=−z_{i}b^{−1}. The rth and last components of the score functions are given by,
The (p+2)×(p+2) observed information matrix is given by:
where
Firth’s penalized likelihood method for AFT model
In order to remove the first order bias O(n^{−1}) in the MLE of the regression parameter, say θ, of the generalized linear models, [10] introduced a penalized loglikelihood function by adding a penalty term 1/2 logI(θ) to the original loglikelihood function ℓ(θ). Without loss of generality, Firth’s procedure can be directly applied to the likelihood function of the AFT models given in Eq. (2). For the AFT model (Eq. 1) with a (p+2)dimensional parameter vector θ=(β^{T},b)^{T}, the penalized loglikelihood function with Firth’s penalty term is given by
where I(β,b)^{1/2} is the Jeffreys invariant prior, whose influence is asymptotically negligible. By adding the penalty term with the original likelihood function given in Eq. (2) one can derive an explicit form of the penalized likelihood function. In addition, according to Firth’s principle, corresponding modified score functions for the rth regression β_{r} and scale parameter b can be written by adding the penalty term in Eqs. (3) and (4) respectively as follows :
and
By expanding the above two Eqs. (6–7) one can derive an explicit analytical form of the score equations which ensure finite estimates of both the β and b while solving them using NewtonRaphson method. However, in numerical optimization, the NewtonRaphson’s method can be computationally tedious and inefficient for expansive and complex nonlinear problems if the Hessian (second order derivative of the objective function) is directly calculated iteratively. Moreover, NewtonRaphson’s method might not work properly if the Hessian is singular at any iteration. Therefore, rather than solving the modified score equation by the NewtonRaphson algorithm, we directly optimized the penalized likelihood function given in Eq. (7) using a quasiNewton method referred to the BroydenFletcherGoldfarbShanno (BFGS) algorithm, which also ensures finite estimates of both the β and b. It is computationally cheaper and more efficient than the Newton’s method and approximates the Hessian matrix using the gradient (first order derivative of the objective function) at each step rather than iteratively computing it [2]. Thus the penalized likelihood estimates θ^{∗}=(β^{∗},b^{∗}) can be obtained from the optimization as follows:
The corresponding standard error of the estimator can be obtained from the approximated Hessian matrix.
Intercept and scale parameter correction
Firth’s penalized likelihood method is known to reduce first order bias in the estimate of the model parameter by shrinking the estimate towards the true value. However, incorporating the penalty term in the likelihood may cause greater shrinkage of the intercept and scale parameter, which in turn may provides bias in the estimated survival probabilities given by
Therefore, a posthoc adjustment of the intercept and scale parameter estimates in the AFT model has been performed by keeping these parameters out of penalization. The adjustment was performed by following the procedure described in a recent study by [22] for correcting the intercept term in the Firth’s logistic regression. The intercept and scale parameter corrections in AFT models can be administered as follows:

(i)
Estimate the coefficients as \(\boldsymbol {\hat {\theta }}_{F}=(\hat {\beta }_{F,0},\hat {\beta }_{F,1},\dots,\hat {\beta }_{F,p}, \hat {b}_{F})\) by Firth’s penalization.

(ii)
Calculate the linear predictors \(\hat {\eta _{i}}=\hat {\beta }_{F,1}x_{i1} + \hat {\beta }_{F,2}x_{i2} + \dots + \hat {\beta }_{F,p}x_{ip}\), omitting the intercept.

(iii)
Determine the ML estimate \(\hat {\beta }_{0}\) of the intercept and \(\hat {b}\) of the scale parameter for the AFT model \(Y=\beta _{0}+ \hat {\eta _{i}} + bZ,\) containing only a single predictor \(\hat {\eta _{i}}\) with regression coefficient equal to one. This can be achieved by including an offset in a standard procedure or by direct maximum likelihood estimation.

(iv)
The resulting estimate \(\boldsymbol {\hat {\theta }}_{C} = (\hat {\beta _{0}},\hat {\beta }_{F,1},\dots, \hat {\beta }_{F,p}, \hat {b})\) is then considered as the corrected Firth’s estimate with the intercept and scale parameter replaced by \(\hat {\beta _{0}}\) and \(\hat {b}\) respectively in the original estimates \(\boldsymbol {\hat {\theta }}_{F}\).
This posthoc adjustment is only required if the interest is to use the model for survival prediction, which is often a primary objective of many studies in clinical prediction research. In the following subsection, the performance of the Firth’s estimates of the model parameters is investigated using an extensive simulation study and compared the results to those obtained for the standard maximum likelihood techniques.
Simulation study
Simulation design
Let us consider survival time T_{i} for the i^{th} observation (i=1,…,n) which follows a probability distribution belonging to loglocation scale family of distributions (e.g., Weibull, lognormal and loglogistic) and right censored time C_{i} which is independent of covariates. We considered two covariates of which one is continuous (X_{c}) and the other is binary (X_{b}). The continuous covariate was generated from a standard normal distribution and the binary covariate was generated from a Bernoulli distribution with probability of event π. Then the survival time from the loglocation scale family of distributions has been generated as follows:
where, β_{0} denotes the intercept, β_{c} and β_{b} represent the regression coefficients associated with the continuous and binary covariates, respectively, b is the scale parameter and Z_{i} is the error term generated from locationscale family of distributions (e.g., Gumbel, normal and logistic distributions). To generate survival times T_{i} from Weibull, or lognormal or loglogistic distribution (which one we needed), we considered generating Z_{i} from Gumbel, or Normal or Logistic distribution, respectively. Further the censoring times (C_{i}) were generated independently from the same distribution from where the survival time were generated but by replacing the βx_{i} by a constant term, say λ, referred to the parameter of the censoring distribution. The value of λ control the desired percentage of censoring in the observed data. The observed timetoevent was then defined as t_{i}= min(T_{i},C_{i}) and the indicator as δ_{i}=I(T_{i}≤C_{i}).
Three simulation series were performed separately by considering three different (but widely used) survival distributions such as Weibul, lognormal and loglogistic. Under each distribution (model), the data were generated as described above. The true values of regression and scale parameters were fixed at β=(β_{0},β_{b},β_{c})=(3,1.2,0.7) and b=0.67 respectively for Weibull and loglogistic distribution. A small change was made in the regression parameters β=(β_{0},β_{b},β_{c})=(1,1.2,0.7) in case of lognormal, considering the same scale parameter. As the data were generated randomly, the percentage of censoring was not exactly the same for all simulated datasets. In order to generate data with a specified censoring proportion,the parameter of the censoring times distribution were determined by iterative algorithm so that the specified censoring proportion would be achieved for the selected parameter values [23]. In the simulation, we reported average of the censoring percentages over 1000 simulations.
For each of the three models, several simulation scenarios were considered by varying the sample size n as 30, 50, and 100 and the percentage of censoring as 10, 20, 30, 40, 50, 60, 70, and 80 for each sample size scenario, except for n=30 for which it was administered up to 60 because extremely high percentage censoring for very small sample raised serious convergence problem in fitting procedure of the model due to lack of event. Again, fixing the censoring percentage at 20%, we varied the sample size as 15,30,45,60,75,90,105 and 120 to examine the finite sample properties of the estimates under a fixed proportion of censoring.
Further simulation was performed considering a scenario with separation, caused by an influential covariate in the model. In order to create separation, a comparatively larger value of β_{b} (1.9) was attributed to the binary covariate X_{b} than the continuous covariate X_{c} (0.5) to increase its influence so that it can create separation. As discussed by [12], separation have been considered in a survival dataset if, at each failure time, the covariate value for the subjects who were failed is the greatest (or always smallest) among all the all covariate values in the risk set at that time point. It may happen at most failure time points if the influence of the covariate is strong. If separation happens for a binary covariate, according to the definition by Heinze and Schemper, the covariate value separate (fully or partially) the event from nonevent (censored) in the data, resulting in a large difference in both survival curves and median survival times between two groups of subjects with respect to covariate values (1/0). Therefore, to explore the existence of separation in a simulated dataset, we tried to mimic the situation by producing a 2×2 contingency table of the censoring status and the binary covariate X_{b} and we considered as separated data if there is at least one cell with 0 frequency and/or the median survival times between the subjects with X_{b}=0 and those with X_{b}=1 is significantly different with pvalue < 0.01. However, not all datasets over the number of simulations have such condition, but a number of datasets have at least one of the cells contains frequency less than or equal to 5 and/or the median survival times between these two groups of subjects is significant with pvalue lies between 0.01 and 0.05, which we termed as “neartoseparation" (i.e., partially separated data). The condition of the neartoseparation was discussed in other studies for binary data [19]. We examined the effect of separation or neartoseparation for sample size 50 and censoring percentage 20, 50 and 80.
Model fitting and evaluating the performance
Under each scenario, we fitted model using both maximum likelihood and Firth’s penalized approaches and evaluated the properties such bias, meansquared error (MSE) and length of confidence interval (CI). We reported the estimates of the parameters of the model as the average of 1000 simulations. The bias was calculated as the difference between the estimate and true value and the mean squared error as the mean of the squared differences between the estimates in each simulated data and the true value. We also reported both the analytical standard error as the mean of the standard error obtained during model fitting in each simulated data over the number of simulation sets and simulation standard error as the standard deviation of the estimates obtained in each simulation over the number of simulations. All computations were conducted with R statistical software of the version 3.5.3. The standard AFT models with MLE were fitted using the survreg function of the survival package and a selfwritten function aft.firth was applied to optimize the penalized likelihood function. The Rcode of the aft.firth function is available as supplementary document of the article.
Simulation results
For the Weibull AFT model, the results suggests that both the coefficients associated with continuous (β_{c}) and binary (β_{b}) covariates are overestimated, in general, by MLE (Table 1). The degree of overestimation increases with the increasing percentage of censored observation. On the contrary, Firth’s penalized method showed some improvements by reducing bias and MSE for the estimates of both β_{c} and β_{b} (Figs. 1 and 2). The mean width of Wald based confidence interval is also narrower for Firth’s estimates in all cases, providing more precision in high censored situations. Table 2 shows that the intercept and scale parameter are generally underestimated by the MLE in most cases, except for intercept in high censored cases where it is highly overestimated. After making the posthoc adjustment to the intercept and scale parameter, the Firth’s penalized method showed improvement by providing estimates relatively closer to the true value.
When the performance of the methods were examined for the scenario with existence of separation, the results depicts that the MLE provides infinitely large estimates in the presence of separation particularly for the regression coefficient (β_{b}) associated with the binary covariate that created separation (Table 3). Conversely, Firth’s penalized method showed significant improvement by providing finite estimates of both the coefficient and its SE. It is notable that the values for MLEs are extremely large for high censoring, whereas the Firth’s procedure succeeds to provide finite estimates in such an extreme case. The amount of improvement is greater for the regression coefficient (β_{b}) associated with the binary covariates than those associated with the continuous covariate. The simulation results for the neartoseparation scenario is similar to that of the separation, but with lower amount of bias. It is also reported that, in the presence of separation, the MLE failed to achieve convergence proving infinitely large value. The rate convergencefailure was 26% when sample size was 50 and censoring was 80%, and the rate decreased to 10% when sample size was 100 with the same level of censoring percentage (results not shown). The convergencefailure rate also decreased with the decreasing censoring percentage and it is very low (often negligible) while there was neartoseparation. In contrast, Firth’s penalized method achieved convergence in all simulation scenarios. Further, the penalized method with adhoc adjustment of the intercept and scale parameter outperformed the MLE when it was used for prediction for the survival probabilities (Table 4, Fig. 3). The penalized method provided very close prediction of the true survival probability at the 1st, 2nd, and 3rd quartiles of the survival time (Table 4) and over the whole followup time (Fig. 3) in comparison with the MLE.
Similar findings were found for the loglogistic AFT model, where Firth’s penalized method showed improvement over the MLE by reducing bias and MSE and providing narrower confidence interval, particularly when censoring percentage is high, for both the regression coefficients β_{c} and β_{b} in the model (Supplementary Table S1, Fig. 4). Similarly, for the intercept (β_{0}) and scale parameter (b), Firth’s method with adhoc correction showed better performance than the MLE, particularly for high censoring situation (Supplementary Table S2). The correction procedure renders better performance in rare event situations under loglogistic distribution. In the presence of separation, Firth’s penalized method outperforms the MLE by reducing bias to some extent and providing narrower confidence interval (Supplementary Table S3). Similar findings can also be observed for survival prediction for the loglogistic AFT model (results not shown).
For the lognormal AFT model, Firth’s penalized method also showed similar performance by providing lower MSE and narrower confidence intervals than MLE (Supplementary Table S4, Fig. 5). The amount of improvement by Firth’s penalized method is also greater for the regression coefficient (β_{b}) associated with binary covariates. For the intercept and scale parameters, the correction of Firth’s procedure provides an improvement over the MLE (Supplementary Table S5). Similarly, greater performance were also achieved by the Firth’s penalized method when it was used for prediction of the survival probabilities (results not shown).
Illustration using prostate cancer data
The methods are illustrated using prostate cancer data, which were previously used in a study by [5] and are publicly accessible (https://hbiostat.org/data/). They conducted an exploratory analysis on the data from a clinical trial of estrogen treatment for a total of 502 prostatic cancer patients several survival status (1  alive, 2  dead from prostatic cancer, 3  dead from heart or vascular disease, 4  dead from cerebrovascular accident, 5  dead from pulmonary embolus, 6  dead from other cancer, 7  dead from respiratory disease, 8  dead from other specific noncancer cause, 9  dead from other unspecified noncancer cause and 10  dead from unknown cause). In this study, we firstly focus on the effectiveness of the treatment on the survival of patient with prostatic cancer and hence observe the timetodeath due to prostatic cancer or not (censoring percentage 53.82). An exploratory analysis with a 2×2 contingency table showed that the original data does not suffer from the problem of separation. Therefore, for an illustrative purpose of the method discussed, a random subsample of patients who are either alive or died from prostatic cancer was taken from the original data in order to create separation and neartoseparation. Furthermore, since the number of patients who died from respiratory disease and pulmonery embolus were very small in the original data (16 and 14 respectively) making these events rare, we also considered two more scenarios with the data: one with patients who are alive or died from respiratory disease and another with patients who are alive or died from pulmonery embolus. Each of these two scenarios created neartoseparation (nonzero cells with few observations).
The covariates of interest are: treatment (0, low; 1, highdose), age (0, < 75 years; 1, 75 to 80 years; 2, ≥ 80 years), weight index caluctaed as weight (kg)  height(cm) + 200 (0, ≥ 100; 1, 8099; 2, < 80), performance rating (0, normal; 1, limitation of activity), history of cardiovascular disease (0, no; 1, yes), serum haemoglobin (0, ≥ 12g/100 ml; 1, 912g/100 ml; 2, < 9g/100 ml), size of primary lesion (0, < 30 cm^{2}; 1, ≥ 30 cm^{2}), and Gleason stage/grade category (0, ≥ 10; 1, > 10). The variables are denoted as AG (patient age), WT(weight index), PF(performance rating), HX (history of cardiovascular disease), HG (serum haemoglobin), SZ (size of primary lesion) and SG (Geason stage/grade category). For each scenario, we considered a Weibull AFT model with the covariates selected based on the analysis discussed in literature [5, 14].
The above Weibull AFT model was fitted for both the cases of separation and nearto separation with patients died from prostatic cancer and the nearto separation case with patients died from respiratory disease and pulmonary embolus.
Analysis and results of prostate cancer data
As mentioned, for illustrative purpose, a random sample of size 30 was taken from the original sample to create separation and neartoseparation in the data, respectively. A 2×2 contingency table (Table 5) between the estrogen treatment and patient status shows two different forms of separation. Moreover, patient’s age and haemoglobin level have also created separation in the outcome variable. Table 6 reveals that the MLE of regression coefficient of the treatment status responsible to create separation in the subsample is very large comparative to that of the Firth’s estimate of the coefficient making MLE uninterpretable. Under the separation scenario, the MLE fails to deliver a standard error for the regression coefficient of age resulting in a disrupted inference. Conversely, Firth’s procedure produces a finite estimate and standard error for the corresponding covariate. Although, the MLE becomes smaller with increased degrees of overlapping in the neartoseparation for the treatment status, the Firth’s procedure provides both smaller estimate and standard error for this covariate. In terms of standard error, the penalized estimates are more efficient than MLE in each separation scenarios.
Analysis and results of respiratory disease and pulmonary embolus
The contingency Table 7 between the estrogen treatment and patient’s survival status (death from respiratory disease and pulmonary embolus) shows the existence of neartoseparation in the data. Furthermore, performance rating, serum haemoglobin level and size of primary lesion have also created separation in outcome variable in both cases of respiratory disease and pulmonary embolus. Here, total sample size consists of 148 alive with only 16 (Respiratory disease) and 14 (Pulmonary embolus) events or failure and neartoseparation can be observed. The censoring percentage is 91.92% (Respiratory disease) and 93.08% (Pulmonary embolus) respectively. The Table 8 reveals that similar to the scenario with prostatic cancer, the Firth’s estimates of coefficient of treatment are smaller in magnitude and have smaller standard error than MLE in each case. Moreover, zero or extremely large standard errors estimated by MLE for some covariates (performance rating, serum haemoglobin and size of primary lesion) indicate convergence failure during estimation. In contrary, Firth’s procedure succeeds to deliver finite estimates and smaller standard error for all covariates in each case.
Discussion
The AFT model is being widely used to analyze survival data from health and reliability engineering because of its intuitive interpretation connecting directly with failure time. For the rare event survival data or data with small in size, separation or monotone likelihood often exists in the fitting process of the AFT model using maximum likelihood estimation technique. The paper investigated the performance of the MLE of the AFT model in such data condition and addressed these issues by introducing a penalized likelihood approach by adding a Firthtype penalty term to the original likelihood. Further a posthoc correction was made by keeping the intercept and scale parameter out of penalization to improve the estimates of predicted survival probabilities. The performance of the proposed method was evaluated using an extensive simulation study considering AFT model under three different (but widely used) distributions of the loglocation scale family separately. For each of the models, the proposed penalized method has been shown to provide superior performance over MLE by solving the problem of monotone likelihood reflected by achieving convergence and providing estimates with lower bias and MSE and narrower confidence interval, in most simulation scenarios.
In particular, when sample size is small and/or percentage of censoring is high, the regression coefficient estimates (both binary and continuous) from penalized likelihood are generally shown to have lower bias and MSE with narrower confidence interval than that for MLE. Again, in the presence of any form of separation, the simulation results revealed that the MLE provided large amount of bias and MSE (often infinitely large value indicating frequent convergence failure) for the estimates of the regression coefficient, particularly those associated with the binary covariates that created separation. On the contrary, the penalized method showed improvement over MLE by achieving convergence and reducing bias and MSE to some extent and providing narrower confidence interval. However, comparable results are observed for both methods for the regression coefficient associated with continuous covariates that didn’t make separation. Simulation study also showed that the performance of the penalized likelihood estimation tends to be better than the MLE in separation than that for neartoseparation, indicating the effectiveness of the proposed method in extreme situations of separation. Furthermore, the posthoc adjustment of the intercept and scale parameters under the penalized method has been shown to generate improved intercept and scale parameter estimates over MLEs by lowering the bias and consequently to provide relatively accurate estimates of the survival probabilities at different quartiles of the survival times. Simulation results of this study are quite similar to those with the other studies in the recent years which discussed Firthtype penalized estimates of regression models such as logistic regression [13, 22] and Cox regression [12].
An illustration of the methods using prostate cancer data supported the simulation findings by providing estimates with intuitive interpretation. However, demonstration of a rigorous application of this approach to a data with existence of high rate of censoring and/or existence of separation was not possible here to due to lack of access to such data, which may be useful for the practical users of this method. The proposed penalized method for AFT model underestimated the true SE for some scenarios (small smaple with high rate of censoring) and hence provided biased estimate of confidence interval, hence further study may be required with profile likelhood based confidence interval to address this problem. Further study may also be required to compare the performance of the Firth’s penalization for the Cox proportional hazard and AFT medals to address such problems related to small sample, high censoring and separation, because of a physical difference between of theses two models. In addition, one may compare the performance of Firth’s penalized AFT model with other penalized methods such as Ridge regression, LASSO etc.
Conclusion
The findings of the paper suggest that if the sample size is small and/or the percentage of censoring is high, the performance of MLE becomes unreliable as it provides biased estimates and creates separation leading to monotone likelihood with frequent convergence failure. The proposed penalized approach showed superior performance over MLE by reducing bias and MSE and solving the problem of separation. Therefore, if sample size is relatively small (e.g., n≤50) or there is evidence of high censoring and/or separation in the data, it is recommended to apply Firth’s penalized method for fitting AFT models.
Availability of data and materials
The dataset used in this study can be downloaded freely from the public domain, under the authority of the department of Biostatistics, Vanderbilt University, USA, at https://hbiostat.org/data/.
References
Albert A, Anderson JA. On the existence of maximum likelihood estimates in logistic regression models. Biometrika. 1984; 71(1):1–10.
Arora JS. Introduction to optimum design. 4th ed: Elsevier; 2004.
Bryson MC, Johnson ME. The incidence of monotone likelihood in the cox model. Technometrics. 1981; 23(4):381–83.
Bull SB, Mak C, Greenwood CM. A modified score function estimator for multinomial logistic regression in small samples. Comput Stat Data Anal. 2002; 39(1):57–74.
Byar D, Green S. The choice of treatment for cancer patients based on covariate information. Bull Cancer. 1980; 67(4):477–90.
Cordeiro GM, CribariNeto F. On bias reduction in exponential and nonexponential family regression models. Commun StatSimul Comput. 1998; 27(2):485–500.
Cordeiro GM, McCullagh P. Bias correction in generalized linear models. J R Stat Soc Ser B Methodol. 1991; 53(3):629–43.
Cox DR. Regression models and lifetables. J R Stat Soc Ser B Methodol. 1972; 34(2):187–202.
Cox DR, Oakes D. Analysis of survival data. 1st ed: Chapman and Hall/CRC; 1984.
Firth D. Bias reduction of maximum likelihood estimate. Biometrika. 1993; 80(1):27–38.
Hauck Jr WW, Donner A. Wald’s test as applied to hypotheses in logit analysis. J Am Stat Assoc. 1977; 72(360a):851–53.
Heinze G, Schemper M. A solution to the problem of monotone likelihood in cox regression. Biometrics. 2001; 57(1):114–19.
Heinze G, Schemper M. A solution to the problem of separation in logistic regression. Stat Med. 2002; 21(16):2409–19.
Kay R. Treatment effects in competingrisks analysis of prostate cancer data. Biometrics. 1986; 42(1):203–11.
Kosmidis I, Firth D. Multinomial logit bias reduction via the Poisson loglinear model. Biometrika. 2011; 98(3):755–59.
Lawless JF. Statistical models and methods for lifetime data. 2nd ed: John Wiley & Sons, Inc; 2011.
Lesaffre E, Albert A. Partial separation in logistic discrimination. J R Stat Soc Ser B Methodol. 1989; 51(1):109–16.
Leung DHY, Wang YG. Bias reduction using stochastic approximation. Aust N Z J Stat. 1998; 40(1):43–52.
Mondal M, Rahman MS. Biasreduced and separationproof GEE with small or sparse longitudinal binary data. Stat Med. 2019; 38(14):2544–60.
Oakes D. The asymptotic information in censored survival data. Biometrika. 1977; 64(3):441–48.
Pettitt A, Kelly J, Gao J. Bias correction for censored data with exponential lifetimes. Stat Sin. 1998; 8(3):941–63.
Puhr R, Heinze G, Nold M, Lusa L, Geroldinger A. Firth’s logistic regression with rare events: accurate effect estimates and predictionsStat Med. 2017; 36(14):2302–17.
Qian J, Li B, Chen Py. Generating survival data in the simulation studies of cox model. In: 2010 Third International Conference on Information and Computing. Wuxi: IEEE: 2010. p. 93–96.
Schaefer RL. Bias correction in maximum likelihood logistic regression. Stat Med. 1983; 2(1):71–78.
Vaeth M. On the use of wald’s test in exponential families. Int Stat Rev/Rev Int Stat. 1985; 53(2):199–214.
Acknowledgments
The authors acknowledge authority of the department of biostatistics, Vanderbilt University, USA for making available the dataset (Byar & Greene prostate cancer data) used in this study in a public domain.
Funding
The authors received no specific fund for this study. They did this research based on their on interest and used data from secondary source.
Author information
Authors and Affiliations
Contributions
MSR and WB contributed to the design of the study; TFA analyzed the data; TFA and MSR wrote the draft of the manuscript and WB provided inputs. All authors have read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
As the dataset is freely available in a public domain and is permitted to use in research publication, the ethics approval and consent statement has been approved by the authority who made the data available for public use. In addition, this is to confirm that all methods used in data collection were carried out in accordance with relevant guidelines and regulations.
Consent for publication
Not applicable.
Competing interests
The authors declared that they have no competing interest.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1
Supplementary Tables.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Alam, T.F., Rahman, M.S. & Bari, W. On estimation for accelerated failure time models with small or rare event survival data. BMC Med Res Methodol 22, 169 (2022). https://doi.org/10.1186/s12874022016381
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12874022016381
Keywords
 Bias reduction
 Monotone likelihood
 Jeffreys prior
 Loglocationscale family