This article has Open Peer Review reports available.
The extension of total gain (TG) statistic in survival models: properties and applications
- Babak Choodari-Oskooei^{1},
- Patrick Royston^{1} and
- Mahesh K.B. Parmar^{1}
https://doi.org/10.1186/s12874-015-0042-x
© Choodari-Oskooei et al 2015
Received: 5 August 2014
Accepted: 12 June 2015
Published: 1 July 2015
Abstract
Background
The results of multivariable regression models are usually summarized in the form of parameter estimates for the covariates, goodness-of-fit statistics, and the relevant p-values. These statistics do not inform us about whether covariate information will lead to any substantial improvement in prediction. Predictive ability measures can be used for this purpose since they provide important information about the practical significance of prognostic factors. R^{2}-type indices are the most familiar forms of such measures in survival models, but they all have limitations and none is widely used.
Methods
In this paper, we extend the total gain (TG) measure, proposed for a logistic regression model, to survival models and explore its properties using simulations and real data. TG is based on the binary regression quantile plot, otherwise known as the predictiveness curve. Standardised TG ranges from 0 (no explanatory power) to 1 (‘perfect’ explanatory power).
Results
The results of our simulations show that unlike many of the other R^{2}-type predictive ability measures, TG is independent of random censoring. It increases as the effect of a covariate increases and can be applied to different types of survival models, including models with time-dependent covariate effects. We also apply TG to quantify the predictive ability of multivariable prognostic models developed in several disease areas.
Conclusions
Overall, TG performs well in our simulation studies and can be recommended as a measure to quantify the predictive ability in survival models.
Keywords
Total gain, Predictive ability, Cox proportional hazards model, Non-proportional hazards, Time-dependent covariate
Background
To estimate R^{2} in simple linear regression, Var(Y) and E(Var(Y|Z)) can be replaced with the (scaled) estimates of SST (total sum of squares) and SSE (residual sum of squares), respectively. R^{2} has several appealing properties, of which the most important are: i) R^{2}∈ [0,1]: it lies between 0 (representing no predictive ability) and 1 (perfect predictive ability); ii) monotonicity: it increases with the size of the covariate effect, ∥β∥, in the model; and iii) interpretability as the percentage of variability in the outcome that is explained by the covariates [1].
Due to its popularity, analogous R^{2}-type statistics have been developed for other regression models [2], including logistic and survival models [1, 3]. Logistic regression has wide applications in medical research. The response variable in this model is a binary variable Y, which takes the value 1 for those experiencing the event of interest, e.g. cases, and 0 for others, e.g. controls. In this model, the mean of Y, the probability of experiencing the event, is π. The model is represented by logit(π|Z)=β^{′}Z. Many R^{2} counterparts have been proposed for use in logistic regression [2]. Since the predictions for the outcome variable are expressed as event probabilities in this model, different functions have been proposed to replace Var(Y) and E(Var(Y|Z)) in Equation 1. One such example is the (expected) Brier scores [4] under the null and the model with covariate Z.
For a logistic model, discrimination measures [5, 6] can be regarded as an alternative class of predictive ability measures (see Table one in [6]). The c-statistic [7] belongs to this class and has been extended to survival models. The c-index is identical to the area under the receiver operating characteristic (ROC) curve [6]. It can be interpreted as the chance that a case will have a higher predicted probability of event occurrence than a control. The c-statistic is a rank-order statistic for predictions against true outcomes, and ranges from 0.5 (no discrimination) to 1 (perfect discrimination).
In 1999, Copas [8] proposed a new approach to summarise the predictive ability of a logistic regression model. The logit rank plot is based on the cumulative distribution function of the prognostic index (PI) β^{′}Z. Later, Bura and Gastwirth [9] proposed the binary regression quantile plot, also known as predictiveness curves [10]. A predictiveness curve displays the distribution of estimated (or predicted) event probabilities versus their quantiles. Bura and Gastwirth [9]’s approach differs from the receiver-operating characteristic (ROC) curve and the logit rank plot of Copas [8] as it does not classify subjects into high risk or low risk classes. Bura and Gastwirth [9] extended the plot and proposed a new measure of predictive ability, named total gain (TG), for a logistic regression model. TG is defined as the integrated absolute difference between the predicted event probabilities and the ‘average’ event probability over the cumulative distribution function of the PI. Bura and Gastwirth [9] also proposed a standardised counterpart TG_{ STD } which, similar to R^{2} in linear regression, lies between 0 and 1. Although, in principle, Bura and Gastwirth’s measures can be immediately applied to survival data, their properties have not been investigated in survival data where censoring is present.
Many analogous R^{2}-type statistics have been proposed for survival models [11]. Some of the measures are only defined for the Cox proportional hazards (PH) model [12, 13], and some have been generalised for use with more general types of survival models [14, 15]. However, as has been shown by Choodari-Oskooei et al. [1, 3] and others [16], they all have shortcomings. The adverse effect of censoring on most of the measures is one of the main reasons for this. Nonetheless, based on their comprehensive empirical investigations, Choodari-Oskooei et al. [1, 3] recommended a set of measures for practical application. They are \(R_{\textit {PM}}^{2}\), \({R_{D}^{2}}\), and \({\rho _{W}^{2}}\) - see Additional file 1 for their definition. These statistics quantify the amount of prognostic information resulting from the model and provide an overall measure of predictive ability for the whole follow-up period. Also, Graf et al. [14] proposed \( R_{\textit {BS}}^{2}(t)\) which uses the (time-dependent) marginal and conditional Brier scores to replace Var(Y) and E(Var(Y|Z)) in Equation 1 - see Additional file 1. \(R_{\textit {BS}}^{2}(t)\) quantifies the accuracy of (survival) probability predictions at the individual level at a particular time-point. Among the above four measures, \(R_{\textit {BS}}^{2}(t)\) is the only statistic that can explicitly assess the model’s (predictive) performance at any time point over the follow-up period. In their current form, \( R_{\textit {PM}}^{2}\), \({R_{D}^{2}}\), and \({\rho _{W}^{2}}\) are unsuitable for this purpose, hence their application is limited. For example, they cannot be applied to models that include time-dependent covariate effects.
The purpose of the present article is fourfold. First, we extend the predictiveness curve, the TG statistic, and its counterpart TG_{ STD } to survival models. Second, we explore their properties in survival models using extensive simulation studies. Third, we show how a version of the total gain measure based on the squared error loss function relates to Schemper’s V-measure [17] for binary outcomes and to \(R_{\textit {BS}}^{2}\) for survival models. Fourth, we discuss the application of TG in prognostic modelling and compare its estimates to those of other recommended measures using real data. We also show that both TG and TG_{ STD } explicitly assess the performance of the model at a specific time point over the follow-up period.
The structure of the paper is as follows. In “Methods”, we describe the predictiveness curve and the TG statistic for a logistic regression model. In “Extension to survival models”, we propose our extension to survival models. We use a real data set from breast cancer to illustrate the steps that should be taken to draw the predictiveness curve, and also to estimate both TG(t) and TG_{ STD }(t) for a survival model. In “Results”, we present the results of our simulation studies to explore the performance of the proposed measure(s) for survival models under numerous scenarios. We study the impact of censoring, covariate distribution, influential (extreme and outlier) observations, and non-proportional hazards (non-PH) on the measure. We also investigate the monotonicity property of the measure as well as the effect of categorising continuous prognostic factors. In “Applications”, we apply our proposed measures to real data from several studies, and compare the results to those from other recommended R^{2}-type measures. Finally, we discuss the findings and make recommendations in “Discussion”.
Methods
Total gain (TG) measure
The total gain (TG) measure [9] is based on the predictiveness curve [10]. We first describe this curve in a logistic regression model. We then extend the plot and present an analogous TG measure for survival models.
Predictiveness curve in logistic regression
In practice, β is estimated with \(\widehat {\beta }\) in which case \( \widehat {\pi }|Z=\text {Pr}\,[Y=1|\widehat {\beta }]\), \(\widehat {\upsilon }=F(\widehat {\beta }z)\), and \(\widehat {R}(\upsilon)=\text {Pr}\, [Y=1|\widehat {\upsilon }]\). In effect, the predictiveness curve is a plot of the risk R(υ) versus the (scaled) ranks of PI. Plotting the risks against the ranks of PI enables us to compare different risk scores from different models as all score values are being transformed to a common scale, i.e. between 0 and 1. Another property of the plot is that it remains invariant to monotonic transformation of the PI - all that matters is that Pr [Y=1|Z=z] is an increasing function of the PI.
In logistic regression, the estimated risks are a monotonic function of the PI. Therefore, the curve is in effect a P-P plot of the cumulative distribution function of the estimated risks themselves. This gives the curve a useful interpretation as it shows the proportion υ of individuals in the study with estimated risks less than R(υ).
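The construction of the curve can be sketched in a few lines. This is a minimal illustration, assuming the estimated risks are already available as a plain array; the function name `predictiveness_curve` is hypothetical and is not taken from the paper or from any package.

```python
import numpy as np

def predictiveness_curve(risk):
    """Return (v, r): proportional ranks v in (0, 1] and the sorted risks r.

    Plotting r against v gives an empirical predictiveness curve: a proportion
    v[i] of individuals has an estimated risk no larger than r[i].
    """
    r = np.sort(np.asarray(risk, dtype=float))
    v = np.arange(1, r.size + 1) / r.size  # scaled ranks of the sorted risks
    return v, r

# Toy input: four estimated event probabilities (illustrative values only)
v, r = predictiveness_curve([0.30, 0.10, 0.70, 0.20])
```

Because only the ordering of the risks enters the ranks, any monotonically increasing transformation of the PI yields the same curve.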
TG in logistic regression
so that, similar to the other analogous R^{2}-type measures, TG_{ STD }∈ [0,1].
Based on the normal approximation, Bura and Gastwirth [9] developed a (complex) asymptotic formula for the variance of TG in logistic regression - see Additional file 1. The formula is based on the normal approximation to \(\widehat {\pi }\). For this reason, the proposed (asymptotic) variance formula might not provide a good approximation if the (effective) sample size is small and \(\widehat {\pi }\) is near 0 or 1. However, bootstrap resampling can be used for this purpose in both small and large sample sizes.
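The verbal definitions used in this paper (TG as the integrated absolute difference between the predicted and the average event probability, and the standardising constant 2π_{0}(1−π_{0})) can be written compactly as follows. This is a restatement under the assumption that π_{0} denotes the overall event probability and R(υ) the risk at quantile υ of the PI.

```latex
\[
TG = \int_{0}^{1} \left| R(\upsilon) - \pi_{0} \right| \, d\upsilon ,
\qquad
TG_{STD} = \frac{TG}{2\,\pi_{0}\,(1-\pi_{0})} .
\]
% Upper bound: a model that assigns risk 0 to a proportion (1 - \pi_0) of
% individuals and risk 1 to the remaining \pi_0 gives
% TG = (1-\pi_0)\,\pi_0 + \pi_0\,(1-\pi_0) = 2\,\pi_0\,(1-\pi_0),
% so that TG_{STD} = 1.
```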
Relationship to Brier score and Schemper’s V
where n is the sample size. Mittlböck and Schemper [19] studied V_{ B } for a logistic model and Graf et al. [14] proposed a modified version of it for survival models, i.e. \(R_{\textit {BS}}^{2}\) - see Additional file 1.
Extension to survival models
In this section, we extend the predictiveness curve and the TG statistic to a survival model with a focus on the Cox PH regression model.
Model and notation
Predictiveness curve and TG for a survival model
In practice, β is estimated with \(\widehat {\beta }\) in which case \( \widehat {\pi }(t)|Z=\Pr [T>t|\widehat {\beta }]\), \(\widehat {\upsilon }=F(\widehat {\beta }z)\), and \(\widehat {R}(\upsilon ;t)=\text {Pr}\,[T>t|\widehat {\upsilon }]\).
1. Choose a clinically relevant time point t^{∗}, e.g. 2 years in the breast cancer study [20].
2. Fit the model with covariate vector Z and obtain the predicted survival probabilities given covariate vector Z at time t^{∗}, i.e. \( S(t^{\ast }|z;\widehat {\beta })\). The Kaplan-Meier estimate of survival for all individuals at time t^{∗} should also be obtained, i.e. \(\widehat { \pi }_{0}(t^{\ast })\) - Fig. 1(a).
3. Plot the estimates of survival probabilities from the model \(S(t^{\ast }|z;\widehat {\beta })\) against the PI, i.e. Fig. 1(b).
4. Replace the actual values of the PI with its proportional ranks υ - see Fig. 1(c). This is the predictiveness curve for the survival probability predictions at t^{∗}=2 years. In this graph the dashed line represents \(\widehat {\pi }_{0}(t^{\ast })\), i.e. the Kaplan-Meier survival estimate at t^{∗}=2 years, and the solid curve is the predictiveness curve for the model with covariate vector Z.
5. The shaded area between the solid curve and the dashed line in Fig. 1(d) is TG(t^{∗}) and can be considered as the gain in terms of predictive ability when using prognostic factors Z compared with not using them.
6. \(\widehat {TG}_{\textit {STD}}(t^{\ast })\) is the ratio of the area between the solid curve and the dashed line, i.e. \(\widehat {TG}(t^{\ast })\), to \(2 \widehat {\pi }_{0}(t)(1-\widehat {\pi }_{0}(t))\).
In this example, \(\widehat {TG}(2)\) and \(\widehat {TG}_{\textit {STD}}(2)\) are 0.13 (95 % bootstrap CI: 0.11-0.15) and 0.33 (95 % bootstrap CI: 0.29-0.38), respectively.
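The last two steps of the procedure can be sketched numerically. The sketch assumes that the model-based survival probabilities at t^{∗} and the Kaplan-Meier estimate are already computed and passed in; the function name `total_gain` and its arguments are hypothetical. Because the proportional ranks are equally spaced, the area between the curve and the horizontal line is approximated by the mean absolute difference.

```python
import numpy as np

def total_gain(surv_prob, pi0):
    """Approximate TG(t*) and TG_STD(t*) from predicted survival probabilities.

    surv_prob : model-based survival probabilities at t*, one per individual
    pi0       : marginal (e.g. Kaplan-Meier) survival estimate at t*, 0 < pi0 < 1
    """
    s = np.asarray(surv_prob, dtype=float)
    tg = np.mean(np.abs(s - pi0))            # area between curve and pi0 line
    tg_std = tg / (2.0 * pi0 * (1.0 - pi0))  # divide by the maximum attainable area
    return tg, tg_std

# Toy check: a model that separates individuals perfectly when pi0 = 0.5
tg, tg_std = total_gain([0.0, 0.0, 1.0, 1.0], pi0=0.5)
```

A bootstrap confidence interval would repeat the whole procedure, including refitting the model, on each resampled data set; that part is omitted here.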
Results
Simulation study
We conducted extensive simulation studies to explore the properties of TG(t) and TG_{ STD }(t). Choodari-Oskooei et al. [1] described the properties that a ‘good’ measure of predictive ability for a survival model should possess. They are: i) independence from censoring; ii) monotonicity; iii) robustness against influential (extreme and outlier) observations; and iv) interpretability. Our simulations, therefore, were carried out to explore the performance of the measures with respect to these criteria.
In this section, we first describe the simulation model. Then, we present the results of simulations and assess the performance of the measures with respect to the above-mentioned criteria. We investigate the upper bound of both TG(t) and TG_{ STD }(t), as well as the impact of non-proportional hazards on TG_{ STD }(t). The simulation model (exponential), censoring mechanisms (random censoring), censoring proportions, covariate distributions (normal, positively skewed, and negatively skewed), and covariate effects assumed in our studies are explained below.
Simulation of censored time-to-event data
where U is sampled from the standard uniform distribution, U(0,1). To generate randomly censored survival times, we followed guidelines provided by Burton et al. [22].
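A minimal sketch of this kind of data generation is given below. It uses the standard inverse-transform result for an exponential proportional hazards model, T = −log(U)/(λ exp(βz)). The baseline hazard λ (`baseline_hazard`), the exponential censoring distribution, and its rate (`censor_rate`) are illustrative assumptions, not the settings used in the simulation study.

```python
import numpy as np

def simulate_exponential(n, beta, baseline_hazard, censor_rate, rng):
    """Simulate (z, time, event) from an exponential PH model with one N(0,1) covariate.

    censor_rate > 0 adds independent exponential censoring (an assumed mechanism);
    censor_rate == 0 returns uncensored data.
    """
    z = rng.standard_normal(n)
    u = 1.0 - rng.uniform(size=n)  # in (0, 1], avoids log(0)
    t = -np.log(u) / (baseline_hazard * np.exp(beta * z))  # inverse transform
    if censor_rate > 0:
        c = rng.exponential(scale=1.0 / censor_rate, size=n)
    else:
        c = np.full(n, np.inf)
    time = np.minimum(t, c)         # observed follow-up time
    event = (t <= c).astype(int)    # 1 = event observed, 0 = censored
    return z, time, event

rng = np.random.default_rng(2015)
z, time, event = simulate_exponential(500, np.log(2.0), 0.1, 0.1, rng)
```

In practice the censoring rate would be tuned until the observed censoring proportion matches the target (e.g. 20 %, 50 %, or 80 %).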
Design parameters
Covariate distribution and effects: We study the measures in the context of multiple regression where the PI, i.e. the linear predictor, in the model is generally a function of several variables. As a result of the central limit theorem [1], the prognostic index should tend to Normality as the dimension of the parameter vector β increases. However, skewed prognostic factors are not uncommon in medical research - for example see the distribution of the number of positive lymph nodes (skewness: 2.8 and Kurtosis: 16.2) and progesterone receptor (skewness: 4.8 and Kurtosis: 37.8) in the breast cancer data set studied in [1]. Thus, we conducted our simulation study for three covariate distributions: normal N(0,1); negatively skewed with skewness of −2.8; and positively skewed with skewness of 2.8. We applied the method proposed by Fleishman [23] to transform the standard normal distribution to skewed distributions with mean 0 and variance 1. For all covariate distributions, we carried out our simulations under four covariate effects of exp(β)={1.25,1.5,2,4}.
Censoring mechanisms: we carried out our simulations under both random and type I (or administrative) censoring with 20 %, 50 %, and 80 % censoring proportions. Since the results were very similar, we only present the results under the random censoring condition.
Sample size and the number of replicates: sample size was set at 500 individuals, and the number of replicates was 5,000 in all experimental conditions.
Uncensored data
Table 1. Mean and standard deviation (in brackets) of TG_{ STD }(t) at 6 different time points by the covariate distribution (Cov.) and covariate effect (exp(β)) - sample size is 500, and 0 % censoring
Cov. | exp(β) | TG_{ STD }(T_{1}) | TG_{ STD }(T_{2}) | TG_{ STD }(T_{3}) | TG_{ STD }(T_{4}) | TG_{ STD }(T_{5}) | TG_{ STD }(T_{6}) |
---|---|---|---|---|---|---|---|
Normal | 1.25 | 0.090 (0.019) | 0.093 (0.020) | 0.095 (0.020) | 0.098 (0.021) | 0.101 (0.021) | 0.121 (0.025) |
Normal | 1.5 | 0.163 (0.020) | 0.168 (0.021) | 0.172 (0.021) | 0.177 (0.022) | 0.182 (0.022) | 0.216 (0.025) |
Normal | 2 | 0.274 (0.022) | 0.281 (0.022) | 0.288 (0.023) | 0.295 (0.023) | 0.303 (0.023) | 0.346 (0.024) |
Normal | 4 | 0.499 (0.025) | 0.505 (0.023) | 0.511 (0.022) | 0.517 (0.021) | 0.523 (0.021) | 0.558 (0.020) |
Pos. skewed | 1.25 | 0.093 (0.023) | 0.094 (0.23) | 0.095 (0.023) | 0.096 (0.023) | 0.097 (0.022) | 0.107 (0.023) |
Pos. skewed | 1.5 | 0.187 (0.032) | 0.184 (0.029) | 0.182 (0.027) | 0.181 (0.026) | 0.181 (0.025) | 0.185 (0.022) |
Pos. skewed | 2 | 0.347 (0.041) | 0.325 (0.034) | 0.313 (0.031) | 0.304 (0.028) | 0.298 (0.026) | 0.284 (0.022) |
Pos. skewed | 4 | 0.603 (0.035) | 0.557 (0.031) | 0.529 (0.029) | 0.509 (0.027) | 0.494 (0.026) | 0.449 (0.023) |
Neg. skewed | 1.25 | 0.070 (0.014) | 0.072 (0.015) | 0.075 (0.016) | 0.078 (0.016) | 0.081 (0.017) | 0.101 (0.022) |
Neg. skewed | 1.5 | 0.118 (0.014) | 0.122 (0.015) | 0.127 (0.016) | 0.132 (0.016) | 0.138 (0.017) | 0.176 (0.022) |
Neg. skewed | 2 | 0.180 (0.015) | 0.188 (0.015) | 0.196 (0.016) | 0.205 (0.017) | 0.215 (0.018) | 0.280 (0.024) |
Neg. skewed | 4 | 0.289 (0.016) | 0.307 (0.017) | 0.326 (0.018) | 0.346 (0.019) | 0.367 (0.020) | 0.491 (0.025) |
Generally, the estimates of the measure are higher in positively skewed covariates and lower in negatively skewed covariates. In most scenarios, there is a mild increase with increasing time. We also conducted further simulations beyond the time point T_{6} and to the maximal time points (data not shown). The results showed that TG(t) is 0 at time 0. It increases until a certain point (i.e. median of the underlying time-to-event distribution), and then decreases again towards 0 at the maximal time points. However, TG_{ STD }(t) does not follow this pattern and its trend over time depends on the size of the effect (and distribution) of the covariate. In all scenarios, the estimates of TG(t) and TG_{ STD }(t) increase with increasing covariate effects. We also carried out similar simulations with sample sizes of 200 and 1000 which resulted in similar conclusions. The results showed that the dispersion of the measures decreases as sample size increases, as expected.
Furthermore, we carried out similar simulation studies on the time-dependent version of \(R_{\textit {Pepe}}^{2}\) where R(υ), and π_{0} in Equation 4 are replaced with the corresponding R(υ;t), and π_{0}(t) - data not shown. Our results confirmed the underlying theory that \( R_{\textit {Pepe}}^{2}(t)\) and \(R_{\textit {BS}}^{2}(t)\) are asymptotically the same. However, since \(R_{\textit {BS}}^{2}(t)\) is a non-parametric measure, its sampling distribution has larger variance. For example, for HR= 4 with one normally distributed covariate the means (standard deviation) of \(R_{\textit {Pepe}}^{2}(T_{6})\) and \(R_{\textit {BS}}^{2}(T_{6})\) are 0.40 (SD: 0.02) and 0.40 (SD: 0.04), respectively.
Censoring effect
Table 2. The percentage difference in the means of TG_{ STD }(t) in censored data from those of TG_{ STD }(t) in the corresponding uncensored data by covariate distribution (Cov.), and censoring proportion
Cov. | %cen. | TG_{ STD }(T_{1}) | TG_{ STD }(T_{2}) | TG_{ STD }(T_{3}) | TG_{ STD }(T_{4}) | TG_{ STD }(T_{5}) | TG_{ STD }(T_{6}) |
---|---|---|---|---|---|---|---|
Normal | 20 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.0 |
Normal | 50 | 0.2 | 0.2 | 0.1 | 0.1 | 0.1 | 0.1 |
Normal | 80 | 0.6 | 0.5 | 0.5 | 0.5 | 0.5 | 1.0 |
Pos. skewed | 20 | 0.0 | 0.0 | -0.1 | -0.1 | -0.1 | -0.1 |
Pos. skewed | 50 | 0.1 | 0.1 | 0.0 | 0.0 | 0.0 | -0.1 |
Pos. skewed | 80 | 0.6 | 0.5 | 0.4 | 0.4 | 0.4 | 0.7 |
Neg. skewed | 20 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.1 |
Neg. skewed | 50 | 0.1 | 0.1 | 0.1 | 0.1 | 0.0 | 0.2 |
Neg. skewed | 80 | 0.6 | 0.6 | 0.7 | 0.7 | 0.8 | 1.9 |
The results show that censoring has almost no effect on the estimates. The percentage differences in the means of the measure between the censored and uncensored scenarios do not exceed 1 %, except in one case where the censoring proportion is 80 %. Even in this scenario, i.e. negatively skewed covariate, the means (standard deviations) of the sampling distribution of TG_{ STD }(T_{6}) for the 0 % and 80 % censoring conditions are 0.280 (SD: 0.022) and 0.285 (SD: 0.062), respectively, a practically negligible difference in means.
Monotonicity and upper bound
The monotonicity property requires that TG_{ STD }(t) should increase with the size of covariate effect, i.e. |β|. In this section, we applied simulations to explore the means of both TG and TG_{ STD } for a range of covariate effects where the distribution of survival time is exponential and the covariate is normally distributed.
Influential observations
In this section, we study the impact of extreme and outlier observations on TG_{ STD }(t) using simulations. We follow the definition of extreme and outlier observations as outlined in [1]: an extreme observation fits the underlying relationship between survival time and the covariate but it lies in the extremes of the (covariate) distribution, whereas the outlier observation does not fit the underlying relationship. In simulations, we generated survival times from an exponential distribution with one normally distributed covariate N(0,1) and covariate effect of β=0.69, with each replicated data set of size 200. Each data set was contaminated with a single extreme or outlier observation according to the procedure described in [1].
Non-proportional hazard and time-dependent covariates
We carried out simulations to investigate the performance of TG_{ STD }(t) under non-proportional hazards (non-PH) and a time-dependent covariate effect in a two-arm trial setting. We used the Weibull distribution to generate time-to-event data and obtained the corresponding shape and scale parameters for the distribution of time-to-event data in each arm from the IPASS (Iressa Pan-ASia Study) trial [24] - we used the same method as in [25] to estimate these parameters.
IPASS is a phase 3, two-arm trial of previously untreated patients in East Asia who had advanced pulmonary adenocarcinoma (lung cancer) [24]. The main results from IPASS are summarized in Mok et al. [24]’s Figure two, which shows the distribution of time-to-event in each arm as Kaplan–Meier curves. Their Figure two(A) (i.e. Figure two, Panel A) shows that the progression-free survival curves cross at approximately 5.7 months, thus showing extreme non-PH. As shown in [25], Weibull distributions with the following scale and shape parameters provide a good fit to the (censored) survival times in the two treatment arms: control arm parameters, scale = 0.35 and shape = 1.72; and experimental arm parameters, scale = 0.10 and shape = 1.08. We used these parameters to generate the survival times in the two groups. We truncated the time to event at 20 months to resemble the follow-up pattern in Mok et al. [24]’s Figure two(A).
Impact of categorisation of covariates
In this section, we study the impact of categorisation of covariates on TG_{ STD }(t). We carried out simulations to explore its performance when the continuous prognostic factors such as age and weight are categorised. Royston et al. [26] explained the dangers of dichotomisation of continuous covariates in the context of regression modelling, with the conclusion that it is an unnecessary practice for statistical analysis. They also showed that it will reduce both the amount of prognostic information and power, resulting in a reduction in the predictive ability of the fitted model.
In our simulations, the (conditional) distribution of survival times was exponential. The covariate was normally distributed as N(0,1) with an effect of exp(β)=4. We progressively categorised the covariate into j=2,...,20 categories by its quantiles. In each scenario, we computed percentiles corresponding to percentages 100∗k/j for k=1,2,...,j−1. For example, the categorisation of the covariate into 10 different groups requires that the 10th, 20th,..., 90th percentiles be computed.
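The quantile-based categorisation described above can be sketched as follows. The function name `categorise_by_quantiles` is hypothetical, and NumPy's default (linear interpolation) percentile definition is an assumption; the paper does not state which definition was used.

```python
import numpy as np

def categorise_by_quantiles(x, j):
    """Split x into j groups using the 100*k/j percentiles, k = 1, ..., j-1.

    Returns integer group labels 0, ..., j-1 (0 = lowest group).
    """
    x = np.asarray(x, dtype=float)
    cuts = np.percentile(x, [100.0 * k / j for k in range(1, j)])
    # A value equal to a cut point is placed in the higher group.
    return np.searchsorted(cuts, x, side="right")

# Toy input: 100 distinct values split into 10 groups of equal size
groups = categorise_by_quantiles(np.arange(100), 10)
```

With j=2 this reduces to dichotomisation at the median, the case discussed by Royston et al. [26].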
To compare the performance of the measures with those of other proposed measures of predictive ability, we also carried out similar simulations for \( R_{\textit {PM}}^{2}\), \({R_{D}^{2}}\), \({\rho _{W}^{2}}\), and \(R_{\textit {BS}}^{2}(t)\), which have been recommended by Choodari-Oskooei et al. [1, 3] for general use. \(R_{\textit {PM}}^{2}\), \({R_{D}^{2}}\), and \({\rho _{W}^{2}}\) summarise the predictive ability for the entire follow-up period, whereas \( R_{\textit {BS}}^{2}(t)\) is time-dependent and changes over the follow-up period. \( R_{\textit {BS}}^{2}(t)\) is based on the (modified) Brier score [14]. Both \( R_{\textit {PM}}^{2}\) and \({R_{D}^{2}}\) are (monotonic) functions of the variance of the prognostic index of the model, whereas \({\rho _{W}^{2}}\) is based on the expected likelihood (entropy) under the full and null models - see [1, 3] for their formula and further details.
Applications
Table 3. The estimates of TG_{ STD }(t), \(R_{\textit {PM}}^{2}\), \({R_{D}^{2}}\), \({\rho _{W}^{2}}\), and \(R_{\textit {BS}}^{2}(t)\), including 95 % bootstrap confidence intervals (in brackets) from 1000 replicates, in real data sets at 3 time points. The time points T_{1}, T_{2}, and T_{3} at which TG_{ STD }(t) and \(R_{\textit {BS}}^{2}(t)\) are evaluated in all data sets are the 25th, 50th, and 75th quantile of the follow-up period, i.e. the time to the last event, in each study. Therefore, T_{1}, T_{2}, and T_{3} are different in each study.
Study | \(\widehat {{TG}}_{\textit {STD}}{(T}_{1}{)}\) | \( \widehat {{TG}}_{\textit {STD}}{(T}_{2}{)}\) | \(\widehat {{TG} }_{\textit {STD}}{(T}_{3}{)}\) | \(\widehat {{R}}_{\textit {PM}}^{2}\) | \( \widehat {{R}}_{D}^{2}\) | \(\widehat {{\rho }}_{W}^{2}\) | \( \widehat {{R}}_{\textit {BS}}^{2}{(T}_{1}{)}\) | \(\widehat {{R }}_{\textit {BS}}^{2}{(T}_{2}{)}\) | \(\widehat {{R}}_{\textit {BS}}^{2} {(T}_{3}{)}\) |
---|---|---|---|---|---|---|---|---|---|
Breast cancer | 0.32 (0.27-0.37) | 0.33 (0.28-0.38) | 0.35 (0.30-0.40) | 0.27 (0.21-0.35) | 0.28 (0.21-0.35) | 0.36 (0.29-0.47) | 0.12 (0.07-0.18) | 0.16 (0.10-0.21) | 0.20 (0.14-0.25) |
Lymphoma | 0.28 (0.16-0.40) | 0.31 (0.18-0.44) | 0.36 (0.21-0.50) | 0.23 (0.11-0.42) | 0.23 (0.11-0.40) | 0.32 (0.15-0.53) | 0.16 (0.02-0.24) | 0.22 (0.05-0.34) | 0.24 (0.07-0.38) |
PBC | 0.58 (0.52-0.65) | 0.62 (0.54-0.70) | 0.56 (0.50-0.62) | 0.56 (0.48-0.65) | 0.65 (0.55-0.74) | 0.60 (0.53-0.68) | 0.38 (0.19-0.52) | 0.47 (0.38-0.58) | 0.47 (0.34-0.57) |
Renal cancer | 0.34 (0.28-0.40) | 0.37 (0.31-0.42) | 0.41 (0.36-0.46) | 0.27 (0.21-0.36) | 0.26 (0.20-0.33) | 0.33 (0.27-0.42) | 0.24 (0.16-0.31) | 0.27 (0.21-0.34) | 0.19 (0.11-0.26) |
Prostate cancer | 0.22 (0.17-0.27) | 0.24 (0.19-0.29) | 0.26 (0.21-0.32) | 0.13 (0.09-0.20) | 0.13 (0.09-0.21) | 0.18 (0.13-0.27) | 0.06 (0.02-0.10) | 0.11 (0.06-0.15) | 0.10 (0.05-0.14) |
Multivariable prognostic models based on the Cox PH model have already been developed for the above data sets. We applied the measures to these models to compare their performance. The first two data sets have been analysed extensively by Choodari-Oskooei et al. [1, 3]- see Additional file 2 for further details on the data sets, prognostic factors included in each study, and the summary of fitted models.
Table 3 shows the estimates of TG_{ STD }(t) and \(R_{\textit {BS}}^{2}(t)\) at 3 time points, together with the estimates of \(R_{\textit {PM}}^{2}\), \({R_{D}^{2}}\), and \({\rho _{W}^{2}}\). The 3 time points at which TG_{ STD }(t) and \( R_{\textit {BS}}^{2}(t)\) are evaluated in all data sets are the 25th, 50th, and 75th quantile of the follow-up period (the time to the last event) in each study. We emphasise that in practice a clinically motivated time point should be chosen. In most studies, the point estimates of TG_{ STD }(t) mildly increase with time, and in all studies they are markedly higher than those for \( R_{\textit {BS}}^{2}(t)\). In some data sets, the estimates are within close range of those for \(R_{\textit {PM}}^{2}\) and \({R_{D}^{2}}\).
Discussion
In this paper, we described the predictiveness curve, and extended the (standardised) total gain statistic to a survival model. We carried out comprehensive simulation studies assessing its performance with respect to the criteria that a good measure of predictive ability should possess.
Summary of our findings
Both TG(t) and TG_{ STD }(t) are based on the predictiveness curve. In simple terms, the predictiveness curve is a plot of the rank-ordered predicted survival probabilities versus the cumulative percentile for each predicted survival probability. The plot, therefore, illustrates the distribution of estimated risk (or survival probability predictions) in the population under study. The results of our empirical studies showed that both TG(t) and TG_{ STD }(t) are an increasing function of the covariate effect, and are independent of random censoring. Our results also showed that both measures are affected by the distribution of the prognostic factor in a survival model.
Our findings indicate that TG_{ STD }(t) is an increasing function of time in a multivariable regression model where the distribution of the PI is roughly Gaussian, whereas TG(t) increases with time until a certain time point and decreases afterwards towards zero at maximal survival times. This accords with the behaviour of the (modified) Brier score suggested by Graf et al. [14] for the survival models [3], and that of \(R_{\textit {BS}}^{2}(t)\). The trend of TG(t) over time indicates that (for the models we studied) discrimination in survival probability predictions is minimal at the time origin as well as at maximal time points, but it reaches its maximum around the median of the underlying distribution of survival time. We emphasise that the time points where both TG(t) and TG_{ STD }(t) are evaluated should be clinically relevant. We favour the use of TG_{ STD }(t) over TG(t) since it is on a similar scale to those of other R^{2}-type statistics. TG(t) can easily be obtained from the estimate of TG_{ STD }(t) if the Kaplan-Meier estimate of survival probability at time t, \(\widehat {\pi }_{0}(t)\), is also reported.
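The conversion mentioned at the end of the preceding paragraph follows directly from the definition of the standardised measure as the ratio of TG to 2π_{0}(1−π_{0}). As a rough consistency check, and only up to rounding of the reported estimates, the breast cancer example at 2 years can be used:

```latex
\[
\widehat{TG}(t) = \widehat{TG}_{STD}(t) \times 2\,\widehat{\pi}_{0}(t)\,\bigl(1-\widehat{\pi}_{0}(t)\bigr).
\]
% Breast cancer example at t = 2 years, using the rounded estimates
% \widehat{TG}(2) = 0.13 and \widehat{TG}_{STD}(2) = 0.33:
\[
2\,\widehat{\pi}_{0}(2)\,\bigl(1-\widehat{\pi}_{0}(2)\bigr)
  = \frac{\widehat{TG}(2)}{\widehat{TG}_{STD}(2)}
  \approx \frac{0.13}{0.33} \approx 0.39 .
\]
```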
Properties of TG_{ STD }(t)
An important property of TG_{ STD }(t) is that, unlike some of the proposed R^{2}-type measures [1], it always lies between 0 and 1. Another advantage of TG_{ STD }(t) is its extendability to other types of survival model, including parametric survival models [25, 31]. The only underlying assumption in both measures is that the predictiveness curve R(υ;t) should be a monotonic function of the prognostic index in the model. Unlike \(R_{\textit {BS}}^{2}(t)\) and other proposed predictive accuracy measures [3] (which assess the ability of the model to predict the outcome of interest at the individual level), TG_{ STD }(t) is a measure that quantifies the amount of prognostic information for a group of patients as it does not directly compare the individuals’ predicted risk probabilities with their actual outcomes. For a survival model, \(R_{\textit {PM}}^{2}\), \({R_{D}^{2}}\), and \({\rho _{W}^{2}}\) also quantify the amount of prognostic information for a group of patients. They, however, provide an overall measure of predictive ability for the entire follow-up period. In this respect, TG_{ STD }(t) has an advantage over these measures because, inherently, it is a function of time and can be used to compare studies with different follow-up periods. But, its (perceived) downside is that it does not provide a unique value for a given model. One possible solution is to define an integrated version of TG_{ STD }(t) over the entire follow-up period - similar to the integrated \(R_{\textit {BS}}^{2}(t)\) proposed by Graf et al. [14].
Bura and Gastwirth [9] showed that TG is asymptotically normally distributed for a logistic regression model and, based on this (large sample) distribution, developed an asymptotic formula for its variance (see Additional file 1). The results of our simulations showed that the sampling distribution of TG(t) is also (asymptotically) normal (e.g. see Fig. 2). In principle, the formula can be adapted (with some amendments) for use in a survival model. However, since it is based on the normal approximation to the probability of having an event by time t, i.e. π_{0}(t), it might not provide a good approximation if the (effective) sample size is small and π_{0}(t) is near 0 or 1. We therefore propose bootstrap resampling to construct confidence intervals.
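The bootstrap idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure used for the results in this article: the predicted risks at time t are treated as fixed inputs, TG is estimated as the mean absolute deviation of those risks from their own mean (standing in for π_{0}(t)), and a percentile interval is used. In a full analysis the patients would be resampled and the survival model refitted within each bootstrap sample. All function names and the example data are hypothetical.

```python
import numpy as np


def total_gain(pred_risk):
    """Empirical TG: mean absolute deviation of predicted risks from their mean.

    Using the mean predicted risk in place of pi0(t) is an assumption.
    """
    pred_risk = np.asarray(pred_risk, dtype=float)
    return float(np.mean(np.abs(pred_risk - pred_risk.mean())))


def bootstrap_percentile_ci(values, statistic, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for statistic(values)."""
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    n = values.shape[0]
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        # Resample individuals with replacement and recompute the statistic.
        idx = rng.integers(0, n, size=n)
        estimates[b] = statistic(values[idx])
    alpha = 1.0 - level
    lower, upper = np.quantile(estimates, [alpha / 2.0, 1.0 - alpha / 2.0])
    return float(lower), float(upper)


# Hypothetical predicted risks at time t for 200 patients.
example_risks = np.linspace(0.05, 0.95, 200)
example_tg = total_gain(example_risks)
example_ci = bootstrap_percentile_ci(example_risks, total_gain)
```

The percentile interval is chosen here because it does not rely on the normal approximation, which is the concern raised above when π_{0}(t) is near 0 or 1.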
Finally, most R^{2}-type measures proposed for survival models lack the intuitive interpretation of R^{2} in linear regression as explained variation, and TG_{ STD }(t) is no exception. Further research is therefore needed to explain these measures (and their properties) in a way that is easily accessible to applied researchers.
Relationship to other measures
We have shown that \(R_{\textit {Pepe}}^{2}(t)\) (i.e. TG_{ STD }(t) with squared error loss) is the model-based, i.e. parametric, version of \(R_{\textit {BS}}^{2}(t)\). The two are therefore asymptotically equivalent if the model is correctly specified. It can be argued that the assumption of a correctly specified model may not be entirely realistic in practice. Nonetheless, the smaller variance of (the estimates of) \(R_{\textit {Pepe}}^{2}(t)\) makes it an appealing choice; this is the classic bias versus variance trade-off. Further research is required to study the trade-off between the bias and variance of these estimators in a series of simulations based on real datasets. For a logistic regression model, the relationships between the predictiveness curve R(υ;t), the c-statistic, and reclassification measures have been established [10]. For a survival model, however, this is a topic for further research.
Conclusions
Our studies showed that the total gain measure performed well with respect to our criteria. It can also be applied to a broad class of survival models. Overall, we believe that it can be recommended as a measure to quantify the predictive ability in survival models.
Declarations
Acknowledgements
We are grateful to Dr Tim Morris and Dr Daniel Bratton for their comments on the earlier version of this article. We also thank two reviewers and the associate editor for their comments on the earlier version of this manuscript. This research was supported by London Hub for Trials Methodology Research grant number 510636 (MQEL).
References
- Choodari-Oskooei B, Royston P, Parmar MKB. A simulation study of predictive ability measures in a survival model I: Explained variation measures. Stat Med. 2012; 31(23):2627–43.
- Demaris A. Explained variance in logistic regression: a Monte Carlo study of proposed measures. Sociol Methods Res. 2002; 77:329–42.
- Choodari-Oskooei B, Royston P, Parmar MKB. A simulation study of predictive ability measures in a survival model II: Explained randomness and predictive accuracy measures. Stat Med. 2012; 31(23):2644–59.
- Spiegelhalter DJ. Probabilistic prediction in patient management and clinical trials. Stat Med. 1986; 5:421–33.
- Nam BH, D’Agostino RB. Discrimination index, the area under the ROC curve. Goodness-of-fit tests and model validity. 2002; 1:267–280.
- Steyerberg EW, Vickers AJ, Cook RN, Gerds T, Gonen M, Obuchowski N, Pencina MJ, Kattan MW. Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology. 2010; 21(1):128–138.
- Harrell FE, Califf RM, Pryor DB, Lee KL, Rosati RA. Evaluating the yield of medical tests. JAMA. 1982; 247(18):2543–6.
- Copas J. The effectiveness of risk scores: the logit rank plot. Appl Stat. 1999; 48(2):165–83.
- Bura E, Gastwirth JL. The binary regression quantile plot: Assessing the importance of predictors in binary regression visually. Biometrical J. 2001; 43(1):5–21.
- Huang Y, Pepe MS, Feng Z. Evaluating the predictiveness of a continuous marker. Biometrics. 2007; 63(4):1181–8.
- Choodari-Oskooei B. Summarising Predictive Ability of a Survival Model and Applications in Medical Research. PhD thesis, University College London, 2008.
- Kent J, O’Quigley J. Measures of dependence for censored survival data. Biometrika. 1988; 75(3):525–34.
- O’Quigley J, Flandre P. Predictive capability of proportional hazards regression. Proc Nat Acad Sci USA. 1994; 91:2310–4.
- Graf E, Schmoor C, Sauerbrei W, Schumacher M. Assessment and comparison of prognostic classification schemes for survival data. Stat Med. 1999; 18:2529–45.
- Royston P, Sauerbrei W. A new measure of prognostic separation in survival data. Stat Med. 2004; 23:723–48.
- Schmid M, Potapov S. A comparison of estimators to evaluate the discriminatory power of time-to-event models. Stat Med. 2012; 31(23):2588–609.
- Schemper M. Predictive accuracy and explained variation. Stat Med. 2003; 22(14):2299–308.
- Pepe MS, Feng Z, Huang Y, Longton G, Prentice R, Thompson IM, Zheng Y. Integrating the predictiveness of a marker with its performance as a classifier. Am J Epidemiol. 2008; 167:362–368.
- Mittlbock M, Schemper M. Explained variation for logistic regression - small sample adjustments confidence intervals and predictive precision. Biometrical J. 2002; 44(3):263–72.
- Sauerbrei W, Royston P. Building multivariable prognostic and diagnostic models: transformation of the predictors by using fractional polynomials. J R Stat Soc (Series A). 1999; 162:71–94. Corrigendum: J R Stat Soc. (Series A) 2002; 165:399–400.
- Schumacher M, Bastert G, Bojar H, Hubner K, Olshewski M, Sauerbrei W, Schmoor C, Beyerle C, Newmann RLA, Rauschecker HF. Randomised 2*2 trial evaluating hormonal treatment and the duration of chemotherapy in node-positive breast cancer patients. J Clinical Oncol. 1994; 12:2086–2093.
- Burton A, Altman DG, Royston P, Holder RL. The design of simulation studies in medical statistics. Stat Med. 2006; 25:4279–92.
- Fleishman AI. A method for simulating non-normal distributions. Psychometrika. 1978; 43:521–31.
- Mok TS, Wu YL, Thongprasert S, Yang CH, Chu DT, Saijo N, Sunpaweravong P, Han B, Margono B, Ichinose Y, Nishiwak Y, Ohe Y, Yang J-J, Chewaskulyong B, Jiang H, Duffield EL, Watkins CL, Armour AA, Fukuoka M. Gefitinib or carboplatin–paclitaxel in pulmonary adenocarcinoma. N Engl J Med. 2009; 361:947–57.
- Royston P, Lambert PC. Flexible Parametric Survival Analysis Using Stata: beyond the Cox model. Texas: Stata Press; 2011.
- Royston P, Altman DG, Sauerbrei W. Dichotomizing continuous predictors in multiple regression: a bad idea. Stat Med. 2006; 25(1):127–41.
- Rosenwald A, Wright G, Chan WC, et al. The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. N Engl J Med. 2002; 346:1937–47.
- Fleming TR, Harrington DP. Counting processes and survival analysis. New York: Wiley and Sons; 1991.
- Ritchie R, Griffiths G, Parmar MKB. Interferon-alfa and survival in metastatic renal carcinoma: early results of a randomised controlled trial. Lancet. 1999; 353(9146):14–17.
- Byar DP, Green SB. The choice of treatment for cancer patients based on covariate information - application to prostate cancer. Bulletin Du Cancer. 1980; 67(4):477–90.
- Royston P, Parmar MKB. Flexible parametric models for censored survival data with application to prognostic modelling and estimation of treatment effects. Stat Med. 2002; 21:2175–97.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.