A pseudo-R2 measure for selecting genomic markers with crossing hazards functions

Background In genomic medical studies, one of the major objectives is to identify genomic factors with a prognostic impact on time-to-event outcomes so as to provide new insights into the disease process. Selection usually relies on statistical univariate indices based on the Cox model. Such model assumes proportional hazards (PH) which is unlikely to hold for each genomic marker. Methods In this paper, we introduce a novel pseudo-R2 measure derived from a crossing hazards model and designed for the selection of markers with crossing effects. The proposed index is related to the score statistic and quantifies the extent of a genomic factor to separate patients according to their survival times and marker measurements. We also show the importance of considering genomic markers with crossing effects as they potentially reflect the complex interplay between markers belonging to the same pathway. Results Simulations show that our index is not affected by the censoring and the sample size of the study. It also performs better than classical indices under the crossing hazards assumption. The practical use of our index is illustrated in a lung cancer study. The use of the proposed pseudo-R2 allows the identification of cell-cycle dependent genes not identified when relying on the PH assumption. Conclusions The proposed index is a novel and promising tool for selecting markers with crossing hazards effects.


Background
In genomic medical research, one of the major objectives is to identify genomic markers having a prognostic impact on clinical outcomes (e.g. relapse, death) so as to provide new insights into the disease process. Most of the studies which investigate the relationship between genomic markers and time-to-event outcomes usually rely on marginal survival analysis that consider univariate prognostic indices derived from the semi-parametric Cox proportional hazards model. This proportional hazards (PH) assumption states that the ratio of the hazard functions of different individuals remains constant over time. Although this assumption is arbitrary, it is widely used since it offers a convenient way to summarize the effect of a covariate on the baseline hazard function and the resulting inference on the parameters of the model is robust enough to encompass some instances of non-proportionality (monotone, converging or diverging hazard functions). However, this PH modelisation is clearly not coping with crossing hazard functions. Crossing-hazards models explicitly specify that there is a time at which the hazard curves for different levels of a covariate cross. To our best knowledge, the crossing hazards phenomenon is barely investigated in genomic studies and it is usually described as a timedependent effect of the genomic marker without any meaningful bioclinical interpretation.
In this paper, we introduce a novel pseudo-R 2 index derived from a semi-parametric non-proportional hazards model that is suited for the selection of genomic markers with crossing hazard functions. We also discuss one of the plausible interpretations for such crossing phenomenon that relates to a gene effect modification. For censored survival data, two main sapproaches have been considered for quantifying the predictive ability of a variable to separate patients: concordance and proportion of explained variation. This latter quantifies the relative gain in prediction ability between a covariate-based model and a null model, by analogy with the well-known linear model (and the R 2 criterion). In this framework, we propose a novel statistical quantity which is related to the score statistic. The proposed pseudo-R 2 index relies on the partial likelihood function in such a way that it has an interpretation in terms of percentage of separability between patients according to their survival times and marker measurements. It extends a previous work [1] for taking into account crossing hazards situations. From a real example, we show that the proposed index can be used to select genes with crossing hazards behavior that potentially indicate a modification of their prognostic effect. Moreover, it proves useful for identifying genomic markers with a common effect across multiple genomic studies due to its weak dependence on sample size variations. The paper is organized as follows. In section Methods, we first introduce an example of a simple interplay between two markers (effect modification) that leads to marginal crossing hazard functions and has prompted us to derive a novel pseudo-R 2 measure for such nonproportional situations. Then, we introduce a semiparametric non-proportional hazards model which gives rise to some crossing effect of the hazard function. Finally, we derive from this model a pseudo-R 2 measure well-suited for crossing hazard function and show its link to the robust score statistic for testing no effect of the considered marker [2]. In the Results section, we report and discuss the properties of the index from simulations experiments and compare them to those of classical indices [3][4][5][6][7] which are also linked to the likelihood function. In the Example section, we illustrate the use of the index for selecting genomic factors with crossing hazard effect in a lung cancer study. In the last section, we summarize our work.

Methods
In this section, we first present a simple situation which motivates the use of the semi-parametric non-proportional hazards model introduced in the next subsection.

Notations
Let the random variables X and C be the failure and censoring times, and T = min(X, C) be the observed follow-up time. The random variables T and C are assumed to satisfy the condition of independent censoring [8]. We denote {N i (t), t ≥ 0} the counting process that indicates the number of events that have occurred in the interval (0, t] for subject i, i = 1,..., n, so that N i (t) takes values 0 or 1. Let Y i be the at-risk process, so that Y i (t) = 1 indicates that subject i is at risk just before time t, and Y i (t) = 0 otherwise.
Let dN i (t) = N i (t -+ dt) -N i (t -) be the number of events occurring in the interval [t, t + dt) for subject i,

Motivational situation: the modulating effect
In the following, we show how a simple interplay between two binary markers Z (1) and Z (2) can lead to marginal crossing hazard functions.
The joint distribution of Z (1) and Z (2) is defined by: It is assumed that the hazard function of subject i with Z (1) i = j and Z (2) i = j is given by where l 0 (t) is an arbitrary unspecified baseline hazard function, and a and g are unknown regression coefficients.
Model (1) describes a modulating effect of the two markers Z (1) and Z (2) , whereby Z (2) has a multiplicative effect on the hazard and Z (1) has a multiplicative effect only if Z (2) equals one (so called effect modification). The corresponding hazard functions according to the values of Z (1) and Z (2) are shown in Table 1.
Assuming that model (1) is the true one, the consequences of omitting Z (2) on the formulation of the observed hazards ratio relative to Z (1) is described below. Expressing model (1) in terms of the conditional survival function given (Z (1) i , Z (2) i ) leads to: where S 0 (t) is the survival function corresponding to the baseline hazard function l 0 (t). The survival function given ( Z (1) i = j ) follows directly from Bayes' theorem, and the hazard function given ( Z (1) i = j ) can be easily deduced as: It is worth noting that this latter expression can be obtained as the expectation of (1) taken over Z (2) given Table 1 Hazard function l 0 (t)e γ l 0 (t)e α+γ the at risk process. Finally, the hazards ratio relative to the values (Z It appears from this expression that hazards may cross over time. More precisely, it is shown in Additional File 1 that when a and g are positive and assuming a balanced joint distribution for (Z (1) , Z (2) ), the hazards ratio inverts at a given time in (0; +∞). Obviously, such a time-dependence cannot be properly handled by using the proportional hazards model to analyze the data.

Semi-parametric model
The proposed model defines the survival function of subject i with covariate Z i as follows where l 0 (t) is an unspecified baseline hazard function and b an unknown regression parameter. It is a particular case of a model that was proposed for handling hazards ratio that invert over time [9], and it corresponds to a semi-parametric generalization of the Weibull distribution. For subject i; i = 1, ..., n, the model (3) can be written in terms of the hazard function where 0 (t) = t 0 λ 0 (u)du is the cumulative baseline cumulative hazard function.
In the simple case of a covariate Z taking values 0 or 1, the hazards ratio (HR) of the two groups corresponding to Z = 1 and Z = 0 is equal to HR = e β 0 (t) (e β −1) . If b > 0, this function is increasing from 0 to +∞ and ) . In this case, the risk of occurrence of the event is smaller in group 1 than in group 0 for 0 ≤ t < τ, and becomes greater when t > τ. If b < 0, the hazards ratio is decreasing from +∞ to 0 and takes value 1 for t = τ as calculated above. The risk of occurrence of the event is thus greater in group 1 than in group 0 when 0 <t ≤ τ and becomes smaller for t > τ.
Thus, as expected, model (4) allows hazards to cross over time. Note that the survival functions cross at a time larger than the crossing time of the hazards, and may not cross at a finite time.
At time t and for an individual i, i = 1, ..., n, the first derivative of the partial log-likelihood with respect to b is the score function: The score function evaluated for b = 0 can be written at time t as with ω(s) = 1 + log{Λ 0 (s)}.

Pseudo-R 2 measure
The goal of this section is to propose a pseudo-R 2 index that can be interpreted in terms of percentage of separability between patients according to their survival times and marker measurements under the crossing hazards model (4). The approach used below is based on the score function (5). It extends the particular case that we considered in a former work [1] where we assumed the classical PH model. The main idea is to note that the score can be rewritten as a separability quantity between patients experiencing or not the event of interest. More precisely, the quantities U i in (5) can be rewritten as From this expression, we show that, for a given covariate Z at time t, the U i can be expressed as the weighted difference between the value of the covariate of the patient observed to experience the event of interest and the mean of the covariates of the group of patients observed to not experience the event.
An estimation of the U i is given bŷ whereω (t i ) = (1 + log{ˆ 0 (t i )}), Λ 0 (t i ) is estimated by the left-continuous version of the Nelson's estimator [10,11], and δ i , i = 1, ..., n is the indicator of failure at time t i .
For distributional reason, instead of the U i , we introduce the so called robust scores W i [2] which expressions are Where The W i can be estimated bŷ The sum over i of the robust W i , i = 1, ..., n is identical to the sum of the U i . However, as for the Cox model, the W i are independent, while the U i are not.
Finally, the index is equal to the robust score statistic divided by the number of distinct uncensored failure times k: The index D 0 is interpreted in terms of percentage of separability over time between the event/non-event groups. Its calculation is easy as it does not require the estimation of the parameter b of the crossing hazards model. We can easily demonstrate that 0 ≤ D 0 ≤ 1.
It is worth noting that the index D 0 can be interpreted as a pseudo-R 2 measure. In the linear regression model, the R 2 (coefficient of determination) can be directly linked to likelihood-related quantities such as the Wald test, the likelihood ratio and the score statistics (see [12]). These formal relationships provide different ways to interpret the R 2 . In the framework of non-linear models, statisticians have searched for a corresponding index and different pseudo-R 2 statistics have been proposed for censored data. Our proposed index is an extension of the definition of the R 2 for survival model with crossing hazards which relies on the score statistic.

Simulation Scheme
A simulation study was performed to describe the beha- with ω(t i ) = 1, [1]) and other most usual indices: Allison's index [3], its modified version [5], Nagelkerke [4] and Xu and O'Quigley's [6] indices. All of them are designed for a PH model and are denoted D (PH) 0 , ρ 2 N , ρ 2 k , R 2 N and ρ 2 XOQ , respectively. The different elements defining a configuration were the following. For a given subject, the distribution of the covariate Z included in model (3) was either discrete (Bernoulli ℬ(0.5)) or continuous (log-normal with mean 0 and variance 1/4, or uniform U[0, √ 3]) . These three distributions of Z were standardized to have the same variance. Two distributions for the survival time X were considered. The first one was defined by model (3) with Λ 0 (t) = t. It is equivalent to a Weibull parametric model W(η, α) with scale parameter h = 1 and shape parameter a = exp(bZ). The second one correspond to a log-normal distribution with mean equal to 0 and standard deviation equal to e -bZ . These two distributions allowed to simulate the crossing hazards phenomenon.
The coefficient b was given a value such that e b = 1, or 2, or 3, or 4. It can be noticed that, in the case of a Bernoulli variable Z, the hazard functions corresponding to e b = 2, or 3, or 4 cross when the survival function of group Z = 1 equal 0.78, 0.82, 0.85 respectively. The censoring variable C i was assumed to be independent from X i given Z i and distributed according to either a uniform C i ∼ U{0, r} or exponential C i~ℰ (g) distribution. The parameters r and g were calculated in order to yield an expected overall percentage of censoring p c equal to 0%, 25% and 50%. The sample size n was taken equal to 50, 100 or 500. Data were generated as follows. For each subject i, i = 1, ..., n, a value of the covariate Z i was generated. Given that value, a survival time X i was generated according to a either Weibull distribution W(η = 1, α = e βZi ) , or a log-normal distribution LogN (μ = 0, σ 2 = e −2βZi ) . The censoring variable C i was independently generated and the observed follow-up time T i was calculated as min(X i , C i ). For each configuration 1,000 independent replications were generated. shows a mean value increasing regularly with b. The means of the five other indices do not appear to increase with b and remain below 0.05 even for the

Application of the index on real data
In this section, we illustrate the use of the proposed index by selecting transcriptomic prognostic factors having a crossing effect in a lung cancer study. We compare the selection to the one obtained when relying on the index calculated under a proportional hazards model.

Dataset
This series is composed of 74 patients who underwent surgery at the Hôtel-Dieu Hospital (AP-HP, France) between August 2000 and February 2004 for stage IB (pT2N0) primary adenocarcinoma or large cell lung carcinoma of peripheral location [13]. Relapse-free survival was defined as the time from surgery until diseaserelated death, disease recurrence (either local or distant), or last follow-up examination. The median relapse-free survival time was 63.8 months. The two years relapsefree survival was 80.3%[71.2%, 90.5%], and the five years relapse-free survival was 59.3%[47.2%, 74.5%]. For each patient, we considered the gene expression measurments of 51,852 transcripts (obtained using Aymetrix HU133 Plus 2.0; Aymetrix, Santa Clara, CA, USA) located on the autosomal chromosomes.

Selection of the variables
The genes were ranked according to the value of either ). Only a small proportion of transcripts (5%) was common to both lists.
We then examined the biological processes that were significantly over-represented in the two sublists using the PANTHER (Protein ANalysis THrough Evolutionary Relationships) classification system [14]. Results showed that the two indices allow selecting genes involved in different biological processes (See Additional File 8). For the cell-cycle process, D (NPH) 0 selected 25 transcripts (significantly higher number than the 5% expected by chance), whereas D (PH) 0 selected only 16 transcripts (not significantly higher number than the 5% expected by chance). The two lists of genes involved in cell cycle are given in Additional File 9.
Among the 25 cell-cycle related transcripts selected according to D (NPH) 0 , we discussed the behavior of two genes, namely FGFR2, MCL1, known to be involved in complex biological pathways. In particular, we examined whether the crossing phenomenon (observed om Figures  7.a and 7.b) could potentially be related to some effect modification of other genes.
The gene FGFR2 (fibroblast growth factor receptor 2) is known to be involved in various cancer types [15] and low gene expression measurements have been reported  as linked to a shorter survival in lung cancer [16]. The analysis of FGFR2 gene expression taking into account FGF4 gene expression, which is one of its ligand, suggested a potential modulating effect between the two genes. In the following, we reported the hazards ratio (HR) computed under the Cox PH model for the four groups resulting from dichotomizing the two variables at the median. We also displayed the Kaplan-Meier curves on Figure 7.c. As seen on this latter, patients with low expression (below the median) of both FGFR2 In the same way, we discussed the interaction between MCL1 and BCL2, two anti-apoptotic genes belonging to the BCL-2 gene family. Considered initially as oncogenes, the prognostic impact of BCL-2/MCL-1 for various type of cancer is debated due to their dual function on cell death and cell proliferation (for a review, [17]). The anti-apoptotic effect is associated with resistance to chemotherapy, leading an adverse prognostic role in some cancers such as leukemia or advanced ovarian tumors. In contrast, the anti-proliferative activity of MCL-1 and BCL-2 is associated with a favorable prognosis effect in some early carcinomas, such as lung adenocarcinoma [18]. Moreover, the combined analysis of MCL1 and BCL2 gene expressions indicated a potential modulating effect between them. As seen on Figure  7.d., in our lung cancer study of early lung adenocarcinomas treated by surgery alone, patients with low expression (below the median) of both genes MCL1 and BCL2 have the worst prognosis (reference group). When MCL1 is highly expressed (above the median) and BCL2 lowly expressed, the prognosis is not significantly improved (HR = 0.533[0.202, 1.4103]). On the contrary, In these two examples, we could hypothesize that the crossing effect observed in the marginal analysis of FGFR2 or MCL1 is related to some potential effect modification linked to FGF4 or BLC2, respectively. This hypothesis is consistent with the known biological activity of those genes. FGFR2 encodes for a receptor, which needs one of its ligand (i.e. FGF4 ) for activation and biological activity. Also, BCL2 and MCL-1 encode two proteins of the same family, which may act together on the cell through heterodimerization on apoptosis or cell proliferation. The resulting subgroup of patients defined by high expression of these genes couples might be clinically relevant and the object of further investigations. The groups of patients were determined according to the low-high expression status of the genes considered. Patients whose expression measurement was higher (resp. lower) than the median were assigned to the "highly expressed" (resp. "lowly expressed") group.

Discussion
For survival data analysis, univariate feature selection strategy is mainly based on ranking markers according to the value of a test statistic or a predictive index obtained under the classical Cox PH model. In such setting, we demonstrated in a previous work the interest of using a pseudo-R 2 measure for genomic studies. However, various departures from the PH assumption can be observed and crossing hazards phenomenon can be encountered in real situations.
In this context, we propose a novel pseudo-R 2 measure that is suitable for identifying genomic markers with crossing effects. It is linked to a semi-parametric survival model that provides sufficient flexibility to handle data with crossing hazards. Selecting such markers is potentially important since it could reflect the complex interplay between genes belonging to the same pathway.
The proposed index is ranging from zero to one and can be interpreted in terms of percentage of separability over time between the subgroup of subject(s) experiencing the event and the subgroup of those experiencing the event at a later time. It quantifies the prognostic separability of markers under a crossing hazard function assumption, whereas for the proportional hazards setting other specialized indices have previously been proposed [1]. This pseudo-R 2 is derived from the partial log-likelihood function and directly linked to the robust score statistic, while similar derivations from Wald or likelihood ratio statistics are not trivial and not easily tractable. As seen from our simulation results, the proposed index increases with the value of the regression parameter and is affected neither by the percentage of censoring nor the sample size of the study. The results show that our pseudo-R 2 is the most suitable for taking into account the crossing hazards phenomenon, as compared to classical indices.
From a real dataset on lung cancer, we show that our index allows to identify genes involved in biological processes linked to the tumor evolution and that are not selected under the PH assumption.
Among the cell-cycle related genes of our selection, we investigate two genes, FGFR2 and MCL1, which crossing effects could potentially be linked to some modulating effect due to other genes from the same biological pathway. Knowing the complexity of gene interactions, this is an over-simplification of the biological reality and other mechanisms can obviously lead to such non-proportional phenomenon. In this analysis, the gene expression measurements are dichotomized based on the median but other cutoffs could be investigated (by searching for optimal cutoff point) as proposed by Motakis et al. [19]. To the best of our knowledge, the present work is the first to propose a pseudo-R 2 measure that is specifically designed for crossing hazards situations.