A pseudo-R^{2}measure for selecting genomic markers with crossing hazards functions
- Sigrid Rouam^{1, 2}Email author,
- Thierry Moreau^{3} and
- Philippe Broët^{1, 2, 4}
DOI: 10.1186/1471-2288-11-28
© Rouam et al; licensee BioMed Central Ltd. 2011
Received: 26 July 2010
Accepted: 15 March 2011
Published: 15 March 2011
Abstract
Background
In genomic medical studies, one of the major objectives is to identify genomic factors with a prognostic impact on time-to-event outcomes so as to provide new insights into the disease process. Selection usually relies on statistical univariate indices based on the Cox model. Such model assumes proportional hazards (PH) which is unlikely to hold for each genomic marker.
Methods
In this paper, we introduce a novel pseudo-R^{2} measure derived from a crossing hazards model and designed for the selection of markers with crossing effects. The proposed index is related to the score statistic and quantifies the extent of a genomic factor to separate patients according to their survival times and marker measurements. We also show the importance of considering genomic markers with crossing effects as they potentially reflect the complex interplay between markers belonging to the same pathway.
Results
Simulations show that our index is not affected by the censoring and the sample size of the study. It also performs better than classical indices under the crossing hazards assumption. The practical use of our index is illustrated in a lung cancer study. The use of the proposed pseudo-R^{2} allows the identification of cell-cycle dependent genes not identified when relying on the PH assumption.
Conclusions
The proposed index is a novel and promising tool for selecting markers with crossing hazards effects.
Background
In genomic medical research, one of the major objectives is to identify genomic markers having a prognostic impact on clinical outcomes (e.g. relapse, death) so as to provide new insights into the disease process. Most of the studies which investigate the relationship between genomic markers and time-to-event outcomes usually rely on marginal survival analysis that consider univariate prognostic indices derived from the semi-parametric Cox proportional hazards model. This proportional hazards (PH) assumption states that the ratio of the hazard functions of different individuals remains constant over time. Although this assumption is arbitrary, it is widely used since it offers a convenient way to summarize the effect of a covariate on the baseline hazard function and the resulting inference on the parameters of the model is robust enough to encompass some instances of non-proportionality (monotone, converging or diverging hazard functions). However, this PH modelisation is clearly not coping with crossing hazard functions. Crossing-hazards models explicitly specify that there is a time at which the hazard curves for different levels of a covariate cross. To our best knowledge, the crossing hazards phenomenon is barely investigated in genomic studies and it is usually described as a time-dependent effect of the genomic marker without any meaningful bioclinical interpretation.
In this paper, we introduce a novel pseudo-R^{2} index derived from a semi-parametric non-proportional hazards model that is suited for the selection of genomic markers with crossing hazard functions. We also discuss one of the plausible interpretations for such crossing phenomenon that relates to a gene effect modification. For censored survival data, two main sapproaches have been considered for quantifying the predictive ability of a variable to separate patients: concordance and proportion of explained variation. This latter quantifies the relative gain in prediction ability between a covariate-based model and a null model, by analogy with the well-known linear model (and the R^{2} criterion). In this framework, we propose a novel statistical quantity which is related to the score statistic. The proposed pseudo-R^{2} index relies on the partial likelihood function in such a way that it has an interpretation in terms of percentage of separability between patients according to their survival times and marker measurements. It extends a previous work [1] for taking into account crossing hazards situations. From a real example, we show that the proposed index can be used to select genes with crossing hazards behavior that potentially indicate a modification of their prognostic effect. Moreover, it proves useful for identifying genomic markers with a common effect across multiple genomic studies due to its weak dependence on sample size variations. The paper is organized as follows. In section Methods, we first introduce an example of a simple interplay between two markers (effect modification) that leads to marginal crossing hazard functions and has prompted us to derive a novel pseudo-R^{2} measure for such non-proportional situations. Then, we introduce a semi-parametric non-proportional hazards model which gives rise to some crossing effect of the hazard function. Finally, we derive from this model a pseudo-R^{2} measure well-suited for crossing hazard function and show its link to the robust score statistic for testing no effect of the considered marker [2]. In the Results section, we report and discuss the properties of the index from simulations experiments and compare them to those of classical indices [3–7] which are also linked to the likelihood function. In the Example section, we illustrate the use of the index for selecting genomic factors with crossing hazard effect in a lung cancer study. In the last section, we summarize our work.
Methods
In this section, we first present a simple situation which motivates the use of the semi-parametric non-proportional hazards model introduced in the next subsection.
Notations
Let the random variables X and C be the failure and censoring times, and T = min(X, C) be the observed follow-up time. The random variables T and C are assumed to satisfy the condition of independent censoring [8]. We denote {N _{ i } (t), t ≥ 0} the counting process that indicates the number of events that have occurred in the interval (0, t] for subject i, i = 1,⋯, n, so that N _{ i } (t) takes values 0 or 1. Let Y _{ i } be the at-risk process, so that Y _{ i } (t) = 1 indicates that subject i is at risk just before time t, and Y _{ i } (t) = 0 otherwise.
Let dN _{ i } (t) = N _{ i } (t ^{-} + dt) - N _{ i } (t ^{-}) be the number of events occurring in the interval [t, t + dt) for subject i, the total number of events that have occurred in the interval (0, t] and the number of subjects at risk at time t. Finally, let represent the value of the g^{th} covariate for individual i.
Motivational situation: the modulating effect
In the following, we show how a simple interplay between two binary markers Z ^{(1)} and Z ^{(2)} can lead to marginal crossing hazard functions.
where λ _{0}(t) is an arbitrary unspecified baseline hazard function, and α and γ are unknown regression coefficients.
Hazard function
Z ^{(1)} | ||
---|---|---|
Z ^{(2)} | 0 | 1 |
0 | λ _{0}(t) | λ _{0}(t) |
1 | λ _{0}(t)e ^{γ} | λ _{0}(t)e ^{α+γ} |
It appears from this expression that hazards may cross over time. More precisely, it is shown in Additional File 1 that when α and γ are positive and assuming a balanced joint distribution for (Z ^{(1)}, Z ^{(2)}), the hazards ratio inverts at a given time in (0; +∞). Obviously, such a time-dependence cannot be properly handled by using the proportional hazards model to analyze the data.
Semi-parametric model
where is the cumulative baseline cumulative hazard function.
In the simple case of a covariate Z taking values 0 or 1, the hazards ratio (HR) of the two groups corresponding to Z = 1 and Z = 0 is equal to . If β > 0, this function is increasing from 0 to +∞ and takes value 1 for . In this case, the risk of occurrence of the event is smaller in group 1 than in group 0 for 0 ≤ t < τ, and becomes greater when t > τ. If β < 0, the hazards ratio is decreasing from +∞ to 0 and takes value 1 for t = τ as calculated above. The risk of occurrence of the event is thus greater in group 1 than in group 0 when 0 < t ≤ τ and becomes smaller for t > τ.
Thus, as expected, model (4) allows hazards to cross over time. Note that the survival functions cross at a time larger than the crossing time of the hazards, and may not cross at a finite time.
with ω(s) = 1 + log{Λ_{0}(s)}.
Pseudo-R^{2}measure
With .
From this expression, we show that, for a given covariate Z at time t, the U _{ i } can be expressed as the weighted difference between the value of the covariate of the patient observed to experience the event of interest and the mean of the covariates of the group of patients observed to not experience the event.
where , Λ_{0}(t _{ i } ) is estimated by the left-continuous version of the Nelson's estimator [10, 11], and δ _{ i } , i = 1, ⋯, n is the indicator of failure at time t _{ i } .
The sum over i of the robust W _{ i } , i = 1, ⋯, n is identical to the sum of the U _{ i } . However, as for the Cox model, the W _{ i } are independent, while the U _{ i } are not.
The index D _{0} is interpreted in terms of percentage of separability over time between the event/non-event groups. Its calculation is easy as it does not require the estimation of the parameter β of the crossing hazards model. We can easily demonstrate that 0 ≤ D _{0} ≤ 1.
It is worth noting that the index D _{0} can be interpreted as a pseudo-R^{2} measure. In the linear regression model, the R^{2} (coefficient of determination) can be directly linked to likelihood-related quantities such as the Wald test, the likelihood ratio and the score statistics (see [12]). These formal relationships provide different ways to interpret the R^{2}. In the framework of non-linear models, statisticians have searched for a corresponding index and different pseudo-R^{2} statistics have been proposed for censored data. Our proposed index is an extension of the definition of the R^{2} for survival model with crossing hazards which relies on the score statistic.
Results
Simulation Scheme
A simulation study was performed to describe the behavior of the proposed index, , in finite samples generated under the crossing hazards model (3), and to compare it to the behavior of other existing indices. Different situations were considered, corresponding to different covariate distributions, regression parameter values and sample sizes. The influence of various censoring distributions was also investigated. The indices that were compared to include our previous index derived from a PH model (i.e. calculated according to the same approach than with ω(t _{ i } ) = 1, [1]) and other most usual indices: Allison's index [3], its modified version [5], Nagelkerke [4] and Xu and O'Quigley's [6] indices. All of them are designed for a PH model and are denoted , and , respectively. The different elements defining a configuration were the following. For a given subject, the distribution of the covariate Z included in model (3) was either discrete (Bernoulli ℬ(0.5)) or continuous (log-normal with mean 0 and variance 1/4, or uniform . These three distributions of Z were standardized to have the same variance. Two distributions for the survival time X were considered. The first one was defined by model (3) with Λ_{0}(t) = t. It is equivalent to a Weibull parametric model with scale parameter η = 1 and shape parameter α = exp(βZ). The second one correspond to a log-normal distribution with mean equal to 0 and standard deviation equal to e ^{ -βZ } . These two distributions allowed to simulate the crossing hazards phenomenon.
The coefficient β was given a value such that e ^{ β } = 1, or 2, or 3, or 4. It can be noticed that, in the case of a Bernoulli variable Z, the hazard functions corresponding to e ^{ β } = 2, or 3, or 4 cross when the survival function of group Z = 1 equal 0.78, 0.82, 0.85 respectively. The censoring variable C _{ i } was assumed to be independent from X _{ i } given Z _{ i } and distributed according to either a uniform or exponential C _{ i } ~ℰ(γ) distribution. The parameters r and γ were calculated in order to yield an expected overall percentage of censoring p _{ c } equal to 0%, 25% and 50%. The sample size n was taken equal to 50, 100 or 500. Data were generated as follows. For each subject i, i = 1, ⋯, n, a value of the covariate Z _{ i } was generated. Given that value, a survival time X _{ i } was generated according to a either Weibull distribution , or a log-normal distribution . The censoring variable C _{ i } was independently generated and the observed follow-up time T _{ i } was calculated as min(X _{ i } , C _{ i } ). For each configuration 1,000 independent replications were generated.
Simulation Results
As seen from Figures 1, 2, 3, 4, 5 and 6 and Additional Files 2, 3 and 4, when β = 0, i.e. in the absence of covariates, the different indices are close to zero for n = 50, 100 and 500. Among the six indices, only shows a mean value increasing regularly with β. The means of the five other indices do not appear to increase with β and remain below 0.05 even for the highest value of e ^{ β } = 4. When β ≠ 0, the mean values of for the different sample sizes are fairly stable.
The standard errors of the six indices are small when β = 0. The standard errors of are larger than those of the other indices when β ≠ 0 and slightly decrease as β increases. As expected, the standard errors of the different indices decrease when n increases.
The mean value of does not appear to be sensitive to the censoring rate. However, the standard error of this index moderately increases as the percentage of censoring increases from 0% to 50%. In addition, the results obtained with a log-normal distribution of the survival time X on one hand (see Additional Files 5, 6 and 7 for the case n = 100), and with an exponential distribution of the censoring variable, on the other hand (not shown) are very similar, concerning and the other indices as well.
Application of the index on real data
In this section, we illustrate the use of the proposed index by selecting transcriptomic prognostic factors having a crossing effect in a lung cancer study. We compare the selection to the one obtained when relying on the index calculated under a proportional hazards model.
Dataset
This series is composed of 74 patients who underwent surgery at the Hôtel-Dieu Hospital (AP-HP, France) between August 2000 and February 2004 for stage IB (pT2N0) primary adenocarcinoma or large cell lung carcinoma of peripheral location [13]. Relapse-free survival was defined as the time from surgery until disease-related death, disease recurrence (either local or distant), or last follow-up examination. The median relapse-free survival time was 63.8 months. The two years relapse-free survival was 80.3%[71.2%, 90.5%], and the five years relapse-free survival was 59.3%[47.2%, 74.5%]. For each patient, we considered the gene expression measurments of 51,852 transcripts (obtained using Aymetrix HU133 Plus 2.0; Aymetrix, Santa Clara, CA, USA) located on the autosomal chromosomes.
Selection of the variables
The genes were ranked according to the value of either or . We decided to focus our attention on the first 200 top-ranked transcripts in both cases. The lowest separability for both indices were close to each other (29.8% for and 29.1% for ). Only a small proportion of transcripts (5%) was common to both lists.
We then examined the biological processes that were significantly over-represented in the two sublists using the PANTHER (Protein ANalysis THrough Evolutionary Relationships) classification system [14]. Results showed that the two indices allow selecting genes involved in different biological processes (See Additional File 8). For the cell-cycle process, selected 25 transcripts (significantly higher number than the 5% expected by chance), whereas selected only 16 transcripts (not significantly higher number than the 5% expected by chance). The two lists of genes involved in cell cycle are given in Additional File 9.
The gene FGFR2 (fibroblast growth factor receptor 2) is known to be involved in various cancer types [15] and low gene expression measurements have been reported as linked to a shorter survival in lung cancer [16]. The analysis of FGFR2 gene expression taking into account FGF4 gene expression, which is one of its ligand, suggested a potential modulating effect between the two genes. In the following, we reported the hazards ratio (HR) computed under the Cox PH model for the four groups resulting from dichotomizing the two variables at the median. We also displayed the Kaplan-Meier curves on Figure 7.c. As seen on this latter, patients with low expression (below the median) of both FGFR2 and FGF4 have the worst prognosis (reference group). When FGFR2 is highly expressed (above the median) and FGF4 lowly expressed, the survival is not significantly improved (HR = 0.532 [0.202, 1.399]). However, the over-expression of FGF4 significantly improve the survival (HR = 0.329 [0.112, 0.967]). Finally, patients with a high expression of FGFR2 and FGF4 have the best prognosis (HR = 0.103 [0.021, 0.516]).
In the same way, we discussed the interaction between MCL1 and BCL2, two anti-apoptotic genes belonging to the BCL-2 gene family. Considered initially as oncogenes, the prognostic impact of BCL-2/MCL-1 for various type of cancer is debated due to their dual function on cell death and cell proliferation (for a review, [17]). The anti-apoptotic effect is associated with resistance to chemotherapy, leading an adverse prognostic role in some cancers such as leukemia or advanced ovarian tumors. In contrast, the anti-proliferative activity of MCL-1 and BCL-2 is associated with a favorable prognosis effect in some early carcinomas, such as lung adenocarcinoma [18]. Moreover, the combined analysis of MCL1 and BCL2 gene expressions indicated a potential modulating effect between them. As seen on Figure 7.d., in our lung cancer study of early lung adenocarcinomas treated by surgery alone, patients with low expression (below the median) of both genes MCL1 and BCL2 have the worst prognosis (reference group). When MCL1 is highly expressed (above the median) and BCL2 lowly expressed, the prognosis is not significantly improved (HR = 0.533[0.202, 1.4103]). On the contrary, patients with low expression of MCL1 and high expression of BCL2 have a significantly improved survival (HR = 0.296[0.091, 0.962]). Finally, the over-expression of both MCL1 and BCL2 gives the best prognosis (HR = 0.189[0.051, 0.700]).
In these two examples, we could hypothesize that the crossing effect observed in the marginal analysis of FGFR2 or MCL1 is related to some potential effect modification linked to FGF4 or BLC2, respectively. This hypothesis is consistent with the known biological activity of those genes. FGFR2 encodes for a receptor, which needs one of its ligand (i.e. FGF4 ) for activation and biological activity. Also, BCL2 and MCL-1 encode two proteins of the same family, which may act together on the cell through heterodimerization on apoptosis or cell proliferation. The resulting subgroup of patients defined by high expression of these genes couples might be clinically relevant and the object of further investigations.
Discussion
For survival data analysis, univariate feature selection strategy is mainly based on ranking markers according to the value of a test statistic or a predictive index obtained under the classical Cox PH model. In such setting, we demonstrated in a previous work the interest of using a pseudo-R^{2} measure for genomic studies. However, various departures from the PH assumption can be observed and crossing hazards phenomenon can be encountered in real situations.
In this context, we propose a novel pseudo-R^{2} measure that is suitable for identifying genomic markers with crossing effects. It is linked to a semi-parametric survival model that provides sufficient flexibility to handle data with crossing hazards. Selecting such markers is potentially important since it could reflect the complex interplay between genes belonging to the same pathway.
The proposed index is ranging from zero to one and can be interpreted in terms of percentage of separability over time between the subgroup of subject(s) experiencing the event and the subgroup of those experiencing the event at a later time. It quantifies the prognostic separability of markers under a crossing hazard function assumption, whereas for the proportional hazards setting other specialized indices have previously been proposed [1]. This pseudo-R^{2} is derived from the partial log-likelihood function and directly linked to the robust score statistic, while similar derivations from Wald or likelihood ratio statistics are not trivial and not easily tractable. As seen from our simulation results, the proposed index increases with the value of the regression parameter and is affected neither by the percentage of censoring nor the sample size of the study. The results show that our pseudo-R^{2} is the most suitable for taking into account the crossing hazards phenomenon, as compared to classical indices.
From a real dataset on lung cancer, we show that our index allows to identify genes involved in biological processes linked to the tumor evolution and that are not selected under the PH assumption.
Among the cell-cycle related genes of our selection, we investigate two genes, FGFR2 and MCL1, which crossing effects could potentially be linked to some modulating effect due to other genes from the same biological pathway. Knowing the complexity of gene interactions, this is an over-simplification of the biological reality and other mechanisms can obviously lead to such non-proportional phenomenon. In this analysis, the gene expression measurements are dichotomized based on the median but other cutoffs could be investigated (by searching for optimal cutoff point) as proposed by Motakis et al. [19]. To the best of our knowledge, the present work is the first to propose a pseudo-R^{2} measure that is specifically designed for crossing hazards situations.
Conclusions
We propose a novel pseudo-R^{2} measure that quantifies the prognostic separability of markers under a crossing hazard function assumption. This phenomenon can be encountered in real situations promoting the use of this novel index.
Declarations
Acknowledgements
We acknowledge the following institutions for general funding: the Genome Institute of Singapore (Singapore) and the French Ministry of Higher Education and Research (France). We thank all our colleagues from the Computational and Mathematical Biology group for fruitful discussions.
Authors’ Affiliations
References
- Rouam S, Moreau T, Broet P: Identifying common prognostic factors in genomic cancer studies: A novel index for censored outcomes. BMC Bioinformatics. 2010, 11: 150-10.1186/1471-2105-11-150.View ArticlePubMedPubMed CentralGoogle Scholar
- Lin DY, Wei LJ: The robust inference for the Cox proportional hazards model. Journal of the American Statistical Association. 1989, 84: 1074-1078. 10.2307/2290085.View ArticleGoogle Scholar
- Allison PD: Survival Analysis Using SAS: A Practical Guide. 1995, SAS PublishingGoogle Scholar
- Nagelkerke N: A note on a general definition of the coefficient of determination. Biometrika. 1991, 78 (3): 691-692. 10.1093/biomet/78.3.691.View ArticleGoogle Scholar
- O'Quigley J, Xu R, Stare J: Explained randomness in proportional hazards models. Statistics in Medicine. 2005, 24 (3): 479-489.View ArticlePubMedGoogle Scholar
- Xu R, O'Quigley J: A R2 type measure of dependence for proportional hazards models. Journal of Nonparametric Statistics. 1999, 12: 83-107. 10.1080/10485259908832799.View ArticleGoogle Scholar
- O'Quigley J, Flandre P: Predictive capability of proportional hazards regression. Proceedings of the National Academy of Sciences of the United States of America. 1994, 91: 2310-2314.View ArticlePubMedPubMed CentralGoogle Scholar
- Fleming TR, Harrington DP: Counting Processes and Survival Analysis. 1991, WileyGoogle Scholar
- Quantin C, Moreau T, Asselain B, Maccario J, Lellouch J: A Regression Survival Model for Testing the Proportional Hazards Hypothesis. Biometrics. 1996, 52 (3): 874-885. 10.2307/2533049.View ArticlePubMedGoogle Scholar
- Nelson W: Theory and applications of hazard plotting for censored failure data. Technometrics. 1972, 14: 945-965. 10.2307/1267144.View ArticleGoogle Scholar
- Nelson W: Hazard plotting for incomplete failure data. Journal of Quality Technology. 1969, 1: 27-52.Google Scholar
- Magee L: R2 measures based on Wald and likelihood ratio joint significance tests. The American Statistician. 1990, 44 (3): 250-253. 10.2307/2685352.Google Scholar
- Broët P, Camilleri-Broët S, Zhang S, Alifano M, Bangarusamy D, Battistella M, Wu Y, Tuefferd M, Régnard JF, Lim E, Tan P, Miller LD: Prediction of clinical outcome in multiple lung cancer cohorts by integrative genomics: implications for chemotherapy selection. Cancer Research. 2009, 69 (3): 1055-1062.View ArticlePubMedGoogle Scholar
- Thomas PD, Campbell MJ, Kejariwal A, Mi H, Karlak B, Daverman R, Diemer K, Muruganujan A, Narechania A: PANTHER: a library of protein families and subfamilies indexed by function. Genome Research. 2003, 13: 2191-2141.Google Scholar
- Katoh M: Cancer genomics and genetics of FGFR2 (Review). International Journal of Oncology. 2008, 33 (2): 233-237.PubMedGoogle Scholar
- Raponi M, Zhang Y, Yu J, Chen G, Lee G, Taylor JMG, Macdonald J, Thomas D, Moskaluk C, Wang Y, Beer DG: Gene expression signatures for predicting prognosis of squamous cell and adenocarcinomas of the lung. Cancer Research. 2006, 66 (15): 7466-7472. 10.1158/0008-5472.CAN-06-1191.View ArticlePubMedGoogle Scholar
- Zinkel S, Gross A, Yang E: BCL2 family in DNA damage and cell cycle control. Cell Death Dier. 2006, 13 (8): 1351-1359. 10.1038/sj.cdd.4401987.View ArticleGoogle Scholar
- Martin B, Paesmans M, Berghmans T, Branle F, Ghisdal L, Mascaux C, Meert AP, Steels E, Vallot F, Verdebout JM, Lafitte JJ, Sculier JP: Role of Bcl-2 as a prognostic factor for survival in lung cancer: a systematic review of the literature with meta-analysis. British Journal of Cancer. 2003, 89: 55-64. 10.1038/sj.bjc.6601095.View ArticlePubMedPubMed CentralGoogle Scholar
- Motakis E, Ivshina AV, Kuznetsov VA: Data-Driven Approach to Predict Survival of Cancer Patients. IEEE Engineering in Medicine and Biology Magazine. 2009, 28 (4): 58-66. 10.1109/MEMB.2009.932937.View ArticlePubMedGoogle Scholar
- The pre-publication history for this paper can be accessed here:http://www.biomedcentral.com/1471-2288/11/28/prepub
Pre-publication history
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.