 Research
 Open access
 Published:
A novel nonnegative Bayesian stacking modeling method for Cancer survival prediction using highdimensional omics data
BMC Medical Research Methodology volume 24, Article number: 105 (2024)
Abstract
Background
Survival prediction using highdimensional molecular data is a hot topic in the field of genomics and precision medicine, especially for cancer studies. Considering that carcinogenesis has a pathwaybased pathogenesis, developing models using such group structures is a closer mimic of disease progression and prognosis. Many approaches can be used to integrate group information; however, most of them are singlemodel methods, which may account for unstable prediction.
Methods
We introduced a novel survival stacking method that modeled using group structure information to improve the robustness of cancer survival prediction in the context of highdimensional omics data. With a super learner, survival stacking combines the prediction from multiple submodels that are independently trained using the features in pregrouped biological pathways. In addition to a nonnegative linear combination of submodels, we extended the super learner to nonnegative Bayesian hierarchical generalized linear model and artificial neural network. We compared the proposed modeling strategy with the widely used survival penalized method Lasso Cox and several group penalized methods, e.g., group Lasso Cox, via simulation study and realworld data application.
Results
The proposed survival stacking method showed superior and robust performance in terms of discrimination compared with singlemodel methods in case of highnoise simulated data and realworld data. The nonnegative Bayesian stacking method can identify important biological signal pathways and genes that are associated with the prognosis of cancer.
Conclusions
This study proposed a novel survival stacking strategy incorporating biological group information into the cancer prognosis models. Additionally, this study extended the super learner to nonnegative Bayesian model and ANN, enriching the combination of submodels. The proposed Bayesian stacking strategy exhibited favorable properties in the prediction and interpretation of complex survival data, which may aid in discovering cancer targets.
Introduction
Survival prediction using highdimensional omics data has been a widely discussed topic in the field of precision medicine, particularly when it comes to cancer research [1,2,3]. Genomic data that contains abundant hereditary information largely determines the phenotype heterogeneity of cancer patients [4, 5]. In recent years, highthroughput sequence technologies facilitate the extensive application of genomic information to predict the patient’s prognosis [6]. The challenge lies in how to construct efficient and robust models for survival prediction in the context of highdimensional data.
Regularization methods, such as Lasso, relaxed Lasso, and elasticnet, are recognized as powerful modeling tools yielding predictive and interpretable models [7]. These methods were extended to the Cox model for better handling the survival data [8]. When using genomic data, these methods construct models based on individual genes, treating them as independent predictors. However, the progression and prognosis of cancer are regulated by multiple biological signaling pathways, and thus incorporating pathwaylevel information into model building can be a more accurate representation of the underlying biological processes [9,10,11]. In this light, several extensions, such as the group Lasso (grlasso) and composite minimax concave penalty (cMCP), are able to integrate the biological pathway information as group structure into the modeling procedure [12, 13]. Besides, several attempts have been made to build pathwaybased modeling strategies. Chen and Wang proposed to integrate prior defined biological pathway information and gene expression profiles for cancer prognosis [14]. Zhang et al. proposed a twostage strategy integrating risk scores derived from pathwaybased models to make cancer survival prediction [15]. Kim et al. utilized a directed random walk algorithm that navigates through the pathway network, generating an effective genomic feature extraction [16]. However, the majority of these are singlemodel based methods, usually leading to unstable prediction. Others employ similar concepts with the naive stacking learning.
Stacking strategy is a wise ensemble learning method that combines crossvalidated (CV) predictions from multiple varied algorithms or models [17]. By leveraging the strengths of different models, stacking methods often yield more robust and accurate predictions than using a single model [18]. However, the implementation of stacking methods to survival data is more complex since the predicted survival probability is varied across time. Andrew Wey, et al. proposed using the inverse probability of censoring weighted Brier Score (IPCWBS) as the objective function for survival stacking models based on multiple time points [19]. Golmakani and Polley assumed that candidate models were all on the condition of proportional hazards and used crossvalidated negative log partial likelihood as an optimization function [20]. Robert Tibshirani, et al. demonstrated that the logistic regression estimation fitting the events of different time points is the approximate estimation of the Cox model and thus one can cast survival analysis as a stacking classification problem [21]. Ginestet, et al. proposed an ensemble procedure based on the pseudoobservationbasedAUC loss to optimally stack predictions from survival algorithms [22].
In the present study, we introduced a novel survival stacking method that integrated group structure information to improve the robustness of cancer survival prediction using highdimensional omics data. Our approach involved grouping genomic data into multiple subdata based on biological pathway knowledge. Submodels were then independently trained using the features in each subdata. In addition to a nonnegative linear combination of submodels using a traditional optimization method based on the integrated Brier Score (IBS) loss function, we also proposed a Bayesian hierarchical generalized linear model (BhGLM) using the nonnegative mixture doubleexponential (DE) prior, as well as an artificial neural network (ANN), to ensemble the predictions of submodels. We compared the proposed methods to several competitors, including the widely used survival penalized method and the extensions that consider the group structures, through simulation study and realworld data application. The results showed that the proposed survival stacking strategy exhibited favorable properties in prediction and interpretability.
The paper is organized as follows: In Section 2, we presented a detailed illustration of the proposed strategy. Section 3 compared the prediction performance of the proposed method and existing methods through a simulation study. In Section 4, the proposed methods were applied to several realworld data. Lastly, Section 5 concluded the paper and discussed several critical issues related to our methods.
Materials and methods
Pathwaybased survival stacking strategy
Supposing a rightcensored survival data of n subjects consists of triplets {(y_{i}, δ_{i}, x_{i})}, for i = 1, 2, …, n. Denote the observed survival time y_{i} = min(t_{i}, c_{i}), where t_{i} and c_{i} are event time and censored time, respectively. δ_{i} = I(t_{i} < c_{i}) indicates the occurrence of events. The goal is to estimate the survival function of the eventtime random variable Y that depends on p covariates x, i.e. S(y x) = P(Y > y x). In this study, we aim to predict the survival of cancer patients using genomics data.
The proposed survival stacking method is a twolayer learning structure consisting of multiple base learners (submodels) and a super learner (metamodel). See Fig. 1 for the framework flow.
We first transform the genomics data into J subdata containing genes in each pathway. Then, in the first layer, submodels are independently trained for each subdata. The resulting submodels represent the predictive capacity of pathways. To mitigate overfitting, we calculate the crossvalidated survival predictions based on submodels. Specifically, in each pathway, samples of original data are randomly partitions into K subsets (folds) of (rough) equal size. The k^{th} fold is used as the validation data, V(k), while the remaining folds are the training data, T(−k), k = 1, 2, …, K. In the training data, penalty Cox model can be used to fit submodel and the baseline hazard h_{0}^{−k}(y^{−k}) can be estimated by the breslow method. Then the linear predictor (lp^{k}) in the validation data is estimated by the fitted submodel. The estimated survival probabilities \({\hat{S}}^k\left({y}^k\boldsymbol{x}\right)\) in V(k) can be calculated using lp^{k} and h_{0}^{−k}(y^{−k}), that is
where \({H}^{k}\left({y}^{k}\right)={H}_0^{k}\left({y}^{k}\right)\times {e}^{lp^k}\), \({H}_0^{k}\left({y}^{k}\right)\) is cumulative baseline hazard, i.e. the integral of h_{0}^{−k}(y^{−k}). The process is repeated for all K folds, yielding the CV predictive survival probabilities of all cases. For J submodels, we can obtain J predictions \({{\hat{S}}_j}^{CV}\left(y\boldsymbol{x}\right)=\sum_{k=1}^K{{\hat{S}}_j}^k\left({y}^k\boldsymbol{x}\right),\kern0.5em j=1,2,\dots, J\). The second layer uses a super learner to fit the CV survival predictions of J submodels over a set of time points. The resulting coefficients are the estimated weights \({\hat{w}}_j\) for J submodels. The predictive survival function \(\hat{S}\left(y\boldsymbol{x}\right)\) can be estimated by combining the predictions of J submodels \({\hat{S}}_j\left(y\boldsymbol{x}\right)\) (refit in the original data) using the weights \({\hat{w}}_j\).
Method to estimate weights \({\hat{w}}_j\)
Linear combination approach
Typically, the predictive survival function \(\hat{S}\left(y\boldsymbol{x}\right)\) is a linear combination of the predictions of J candidate submodels given as,
We optimize the weights \(\hat{w}\) by minimizing the IBS loss. The other loss function, such as, AUCbased loss, should be a favorable alternative [22]. IBS measures the squared distance between the probabilities and observed events over a set of time points y_{1}, …, y_{s} [23], which can be written as,
where R(y_{r}) represents patients who are still at risk at the time y_{r}, Z_{i}(y_{r}) = I(y_{i} > y_{r}). We can estimate \(\hat{w}\) by minimizing IBS. Generally, the estimated weights \({\hat{w}}_j\) are constrained to nonnegative for lower variance and better prediction. This constraint can be achieved by employing a nonlinear optimization algorithm based on the augmented Lagrange method which can be implemented in R function solnp [24]. Concerning the selection of time sets y_{1}, …, y_{s}, we use nine evenly spaced quantiles of the observed event distribution as Andrew Wey advocated [19].
Bayesian combination approach
In addition to the IBS solutions, if we treat the survival predictions of the submodels as covariates, and treat the timedependent status Z_{i}(y_{r}) (0 for dead and 1 for alive at each time point y_{r}) as a binary outcome, the predicted survival can be expressed as,
which is a generalized linear model (GLM). h is a link function such as a sigmoid function, to ensure the expected predicted survival probability to be 0–1.
Nonnegative lasso (nLasso)
The advance of formula (4) is that we can add the l1 penalty term into the above GLM and thereby expanding the usage of the survival stacking, such as handling numerous submodels (in a highdimensional scenario), which is impractical for solnp.
It is well known that the Lasso is equivalent to a Bayesian hierarchical model with DE prior on coefficients [25], with coefficients qualified as nonnegative in this study,
where the scale, s, controls the degree of shrinkage; a smaller scale induces stronger shrinkage, driving the estimates of w_{j} toward zero. The weights fitted with nLasso are given by,
The weights above can be estimated by the cyclic coordinate descent algorithm using the glmnet package in R. The restriction of w to be nonnegative can be conveniently performed using the glmnet package.
Nonnegative spikeandslab lasso (nsslasso)
We further extended the nonnegative DE prior to the nonnegative spikeandslab mixture DE prior (Supplementary Fig. 1),
where s_{j} = (1 − γ_{j})s_{0} + γ_{j}s_{1} is called the total scale parameter; γ_{j} is an indicator (γ_{j} ∈ {0, 1}) following a binomial distribution; s_{0} and s_{1} (s_{1} > s_{0} > 0) are the scale parameters for spike and slab distribution, respectively. s_{1} applies weaker compression to the pathways of strong effects and is usually fixed at a larger value, say s_{1} = 1; while s_{0} gives stronger compression to the pathways of weak effects (or even compress to zero) and is a flexible smaller value selected from a set of predefined candidate values via crossvalidation. Usually, spikeandslab Lasso is more adaptive than Lasso [26]. The weights can be estimated by the EM coordinate descent algorithm [26] using the glmnet package and the BhGLM package in R. The restriction of weights to be nonnegative can also be performed with the glmnet package.
Artificial neural network
Considering that the ANN can act as a classifier and give restricted (nonnegative) weights to the input data, we can use it as a super learner. ANN uses backpropagation algorithm and gradient descent algorithm to iteratively estimate the weights.
Evaluation of model performance
In principle, the survival stacking model is a binary classification problem for a given time [21]. Here, we employed the timedependent AUC and timedependent Brier Score (BS), which calculate the AUC and BS of the objects in the risk set of any time point, as recomended by Robert Tibshirani [21]. The timedependent AUC is used to examine a model’s ability to discriminate between different outcomes at a given time point. The timedependent BS is used to measure the calibration performance at a given time point: \(\textrm{BS}(y)=\frac{1}{n}\sum_{i=1}^n{\left({Z}_i(y)\hat{S}\left(y\boldsymbol{x}\right)\right)}^2\). We selected three evaluated time points, namely 25, 50, and 75% quantiles of the total observation time of the test data.
Competitive statistical methods
In our proposed survival stacking model, Lasso Cox was used to build pathwaybased submodels. To combine submodels, we used the solnp (implemented by R function solnp), nLasso/nsslasso (implemented in the package glmnet and BhGLM), and ANN (implemented using TensorFlow library (2.3.0) of Python (3.7), the weights can be limited to nonnegative by using kernel_constraint = non_neg()) as super learners. The fitting process of ANN see Supplementary Fig. 2 & 3. For time points, we used nine evenly spaced quantiles of the observed event distribution, that is {0, 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 1}. We compared the performance of our proposed method with several existing singlemodel approaches, including the widely used Lasso Cox regression (glmnet) [27] and extensions that incorporate the group structures: the group spikeandslab Lasso (gsslasso) (BhGLM) [28], overlap group Lasso (grlasso), overlap group cMCP, and overlap group smoothly clipped absolute deviation (grSCAD) (grpregOverlap) [29]. The performance of these methods was evaluated using simulated and realworld data. All singlemodel methods are executed using default parameters. All analyses were performed using the R (4.1.3) software on Dale T7920 INTEL Windows 10 Gold 5117 CPU @ 2.00GHz.
Simulation study
Simulation design
The present study designed six scenarios with varied theoretical generalized R^{2} and covariate coefficients (β) (Table 1). In each scenario, we generated two homogeneous datasets with equal sample sizes, one for training data and the other for test data. To assess the performance of the methods, we conducted 100 duplicated runs in each scenario and calculated the average results for comparison. This process is conducted using the R package BhGLM.
Specifically, in each dataset, we generated 500 samples, each with a survival outcome of d_{i} = {(y_{i}, δ_{i})} and 1000 continuous covariates x_{i} = (x_{i, 1}, x_{i, 2}, .., x_{i, 1000}), for i = 1, 2, …, 500. The vector x_{i} was randomly sampled from the multivariate normal distribution i.e. x_{i}~N(0, Σ), where Σ ∈ ℝ^{1000×1000} is the variancecovariance matrix. These covariates were assigned to 20 distinct groups, allowing for overlap between the groups, which is a mimic of pathway overlapping (Supplementary Table 1). The correlation coefficient r of covariates within groups was 0.6, and covariates between groups were independent. The observed survival time y_{i} was generated from the Weibull distribution [30]: \({y}_i={\left(\frac{\mathit{\log}(U)}{\lambda exp\left({z}_i\right)}\right)}^{1/\nu }\) and the censored ratio was set to 50%. δ_{i} = 1 indicates the occurrence of events and δ_{i} = 0 indicates censored. The variable U was uniformly distributed over an interval between 0 to 1; We set the scale parameter λ = 3; shape parameter ν = 3; and intermediate variable z_{i} followed a univariate normal distribution z_{i}~N(μ_{i}, σ^{2}), where \({\mu}_i={\sum}_{l=1}^{1000}{x}_{il}{\beta}_l\). σ^{2} denotes the residual variance, which was determined by fixing three theoretical generalized R^{2}: 0.50, 0.25, and 0.10. We set eight nonzero covariate coefficients of two types: the absolute values of one range between 0.7 to 1, and the other range from 0.2 to 1.5.
Results of the simulation
Prediction performance
Table 2 summarizes the average timeAUC and timeBS of each method at 50% quantiles of the total observation time in the test data under six simulation scenarios. The results of the other two time points are shown in Supplementary Table 2. According to the simulation, the methods considering group structures, e.g., grlasso, and grSCAD, did not exhibit apparent advantages over Lasso Cox. However, gsslasso Cox and cMCP were competitive across all scenarios.
All the four survival stacking methods outperformed the singlemodel methods except for Scenario 1 and 4. However, there was no significant difference between the four stacking methods based on AUC. The calibration of solnp(Lasso) performed well across all scenarios at 25 and 50% time points while nLasso(Lasso) / nsslasso(Lasso) performed superior at 75% time point. ANN(Lasso) performed moderately at the three time points. Of note, although the AUCs of the three time points are very close, the BS increases with the time.
Distribution of estimated weights
We further compared the estimated weights between super learners. Theoretically, the weights for group1, group5, and group20 should be nonzero due to the presence of relevant nonzero variables. In general, all of the four super learners consistently identified the nonzero weights across most scenarios (Fig. 2 and Supplementary Fig. 4). solnp(Lasso) did a good job of giving very small weights to zero weights (Fig. 2C/D) while ANN(Lasso) had the narrowest interval range of nonzero weights. nLasso(Lasso) and nsslasso(Lasso) presented moderate results.
Applications to real data
We applied the proposed method to three realworld cancer datasets with survival records and largescale gene expression profiles. For these datasets, gene expression data were standardized using covariates function in BhGLM package. We randomly partitioned the original data into two subsets of equal sample size: one for training models and the other for evaluating model performance. The process was repeated 100 times in case of casual results due to data split. To ensure a balanced response, we performed a logrank test on the survival curves between training and test data and considered those with P_{log − rank} > 0.5 being balanced splits that would be retained for further analysis. Genes were mapped to pathways using genome annotation tools. More precisely, we first mapped gene symbols to Entrez Ids using annotateI package and then mapped genes to KEGG pathways (default parameter) using clusterProfiler package [31].
TCGA breast cancer dataset
We obtained the transcriptome profiles (in TPM format) and the corresponding latest survival information for TCGA Breast Cancer (BRCA) from “GDC Data Portal” (https://portal.gdc.cancer.gov/). We selected the female samples that had both survival outcomes and gene expression profiles. Genes with > 50% of zero expression were filtered out and those with > 20% quantile variance were retained. Eventually, we ended up with a dataset consisting of 1060 samples and 13,745 genes. These genes were mapped to 140 pathways involving 3855 genes (see Supplementary Table 3).
Prior to the stacking process, we performed an initial pathways screening to identify those with potential predictive value. We fitted a Lasso Cox for all 140 pathways in the original data separately and obtained the Cindex for each pathway. A total of 116 pathways had a Cindex > 0.5. However, many of them were not predictive but introduced variance, which was detrimental to the ensembled prediction. We further constrained the enrolled candidate pathways to these with Cindex > 0.55, resulting in 48 pathways for the subsequent analysis.
Table 3 summarizes the average timeAUC and timeBS at the three time points of various methods applied to BRCA dataset. In general, gsslasso and grlasso showed superior predictive performance over other singlemodel methods. Pathwaystacking methods outperformed singlemodel methods in terms of discrimination. The stacking methods also demonstrated a high calibration in the early and middle survival time. Among the survival stacking methods, solnp(Lasso) exhibited a preferable calibration consistently across time but inferior discrimination. Nsslasso(Lasso) had a favorable performance in the early and middle periods while ANN(Lasso) performed better discrimination at middlelate survival time.
An advantage of nLasso and nsslasso is that they can identify important pathways owing to their sparsity nature. When applied to the whole dataset of TCGA BRCA, nsslasso(Lasso) and nLasso(Lasso) could select similar pathways. Nsslasso(Lasso) found three pathways including Huntington’s disease (w = 0.962), HIF1 signaling pathway (relative weight, w = 0.076), and Leishmaniasis (w = 0.062) (see Supplementary Table 4). nLasso(Lasso) found four pathways including Huntington’s disease (w = 0.749), HIF1 signaling pathway (w = 0.114), Leishmaniasis (w = 0.086), and Oxidative phosphorylation (w = 0.051), with the former three being selected by both methods.
Metabric dataset
The Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) data consists of comprehensive information on more than 2000 breast cancer patients, including clinical data, gene expression data, and mutation data. We obtained gene expression data and survival data from cBipPortal (https://www.cbioportal.org/). After data preprocessing (as described in 4.1), we obtained a dataset with 1420 samples and 19,494 genes. These genes were mapped to 146 pathways involving 3709 genes (see Supplementary Table 5).
After the pathways prescreening, we included 138 of 146 pathways with a Cindex > 0.60 for the following analysis. Among the singlemodel methods, grlasso still had the most superior predictive performance. Pathwaystacking methods showed favorable discrimination compared to grlasso (Table 4). nLasso(Lasso) and nsslasso(Lasso) performed well both in discrimination and calibration.
The survival stacking model (nsslasso(Lasso)) fitted using the METABRIC dataset identified seven pathways (Supplementary Table 6). nLasso(Lasso) also found the same seven pathways: MAPK signaling pathway (W = 0.018), Focal adhesion (W = 0.041), Cellular senescence (W = 0.170), Choline metabolism in cancer (W = 0.125), Endocytosis (W = 0.014), Carbon metabolism (W = 0.311), Apoptosis (W = 0.215); and another two pathways: PPAR signaling pathway (W = 0.099) and p53 signaling pathway (W = 0.007).
TCGA ovarian cancer dataset
Alike BRCA data, we acquired TCGA ovarian cancer (OV) dataset from the “GDC Data Portal”. After data preprocessing, we obtained a dataset with 415 samples and 13,764 genes. These genes were mapped to 124 pathways involving 3596 genes (see Supplementary Table 7).
After prescreening, a total of 90 pathways had a Cindex > 0.5 and the highest Cindex was 0.58. We selected all 90 pathways for the following analysis. Table 5 showed that the pathwaystacking methods outperformed the singlemodel methods in prediction accuracy and variance (lower standard deviation especially for BS). The four stacking methods had similar and stable prediction performance.
In application, nsslasso(Lasso) identified four pathways (Supplementary Table 8). nLasso(Lasso) found another two pathways, namely, Cell cycle (w = 0.038) and Proteasome (w = 0.079), in addition to the four pathways that were selected by nsslasso(Lasso) but with different weights: Influenza A (w = 0.360), Peroxisome (w = 0.268), B cell receptor signaling pathway (w = 0.128), and T cell receptor signaling pathway (w = 0.129).
Discussion
The present study proposed a novel survival stacking strategy that can incorporate genome pathway information into the development of cancer prognosis models. This strategy demonstrated an advantage over existing methods that rely on a single group model (such as grlasso, grSCAD, gsslasso) by using a stacking method to improve prediction robustness. Additionally, we extended the super learner to hierarchical GLM and ANN, thereby enriching the combination of submodels. Generally, solnp uses IBS as an optimization function to obtain a lower timeBS. Hierarchical Lasso and sslasso inherit the sparse property that makes them effective at handling multiple submodels. The sslasso super learner could outperform Lasso in certain cases, while in others, the two methods performed similarly. The ANN method can capture more nonlinear relationships, leading to better prediction performance. However, it may also capture more noise information and overfit the data.
In the simulation study, stacking methods consistently exhibited superior performance in terms of discrimination over the methods using a single model, except for Scenarios 1 and 4. Scenarios 1 and 4 represented the situation of a higher theoretical generalized R^{2} or a small residual variance, in which the predictive information was easy to capture. The advantage of the stacking methods was not evident since these methods based on a single model had achieved a fairly well prediction. However, stacking methods demonstrated superior discrimination performance than any single model in the situation with more noise because they could borrow advantages from various models. Realworld data is typically characterized by a higher level of noise, which may account for the favorable performance of the proposed methods in the realworld data applications [32]. However, this may come at the expense of some calibration accuracy.
A noted point of the stacking using nsslasso is the interpretability of the resulting models. Firstly, the proposed stacking method demonstrates increased sensitivity in identifying diseaserelated pathways, which may be too subtle for genelevel models to detect [33]. Second, we implemented the methods considering group structure (e.g, gsslasso) to the realworld data (see Supplementary Table 9). The results indicated that while gsslasso exhibited good predictive performance, it did not effectively indicate pathway importance. Third, unlike Lasso which imposes an equal penalty on all coefficients, sslasso adaptively employed weak compression to strong effects and strong compression to weak effects [33]. We observed that sslasso tended to retain fewer pathways, while Lasso prefers to include more pathways with small effects. For instance, nsslasso(Lasso) identified several important pathways in METABRIC dataset, such as cellular senescence, choline metabolism in cancer, carbon metabolism, apoptosis, and PPAR signaling pathway. These pathways are deeply involved in the cell cycle and carcinogenesis process [34, 35]. nLasso(Lasso) could find two additional weak signal pathways, namely MAPK and p53 signaling pathways. These two popular pathways are associated with the prognosis of breast cancer [36, 37]. However, many MAPK family genes and TP53 are also contained in the other four pathways, indicating limited information that the two pathways can provide (Supplementary Table 6). Similarly, Huntington’s disease pathway identified in TCGA BRCA contains TP53. Huntington’s disease seems to be unrelated to the prognosis of breast cancer. However, several epidemiology studies have shown a lower risk of cancer among patients with Huntington’s [38,39,40]. Additional research has delved into their relationship at the molecular level, including the impact of Huntington and ErbB2/HER2 signaling on the development and metastasis of breast cancer [41, 42].
In total, the proposed methods possess advantageous features in identifying pathways that offer prognostic information. Also, the weights assigned to these submodels (based on pathways) signify their predictive significance. We anticipate that focused research on these prioritized pathways will aid in discovering cancer targets. Another obvious property of the pathwaybased stacking strategy is that submodels are constructed independently, circumventing the geneoverlapping issue. In addition, one commonality of the stacking methods is having an improved discrimination than the singlebased models, which may help identify highrisk patients. A limitation of our approach is that it takes more time due to the CV procedure in the submodel construction. But the cost pays off in the more robust and accurate prediction. Last but not least, although the proposed survival stacking strategy is based on a twolevel process of genepathway structure, our ideas can be naturally generalized to other biological processes with similarly hierarchical levels.
Availability of data and materials
We acquired the dataset for Breast Invasive Carcinoma (ldentifier/Accession Number: TCGABRCA) from the TCGA (The Cancer Genome Atlas) database, accessible at https://portal.gdc.cancer.gov/projects/TCGABRCA. We obtained another breast cancer dataset with the identifier “Breast Invasive Ductal Carcinoma” from METABRIC (Molecular Taxonomy of Breast Cancer International Consortium, https://www.cbioportal.org/study/summary?id=brca_metabric). We acquired the dataset for Ovarian Cancer (ldentifier: TCGAOA) from the TCGA database, accessible at https://portal.gdc.cancer.gov/projects/TCGAOA. The main code for the proposed method is freely available on the GitHub website at: https://github.com/JasonLnzi/ABayesianStackingModelingMethodforSurvivalPredictionUsingHighdimensionalData/tree/main.
Abbreviations
 Lasso:

Least absolute shrinkage and selection operator
 grlasso:

Group Lasso
 cMCP:

Composite minimax concave penalty
 CV:

Crossvalidated
 IPCWBS:

Inverse probabilityofcensoringweighted Brier Score
 IBS:

Integrated Brier Score
 BhGLM:

Bayesian hierarchical generalized linear model
 DE:

Doubleexponential
 ANN:

Artificial neural network
 lp:

Linear predictor
 GLM:

Generalized linear model
 sslasso:

Spikeandslab Lasso
 nsslasso:

Nonnegative spikeandslab Lasso
 AUC:

Area under the ROC curve
 BS:

Brier Score
 grSCAD:

Overlap group SCAD
 KEGG:

Kyoto Encyclopedia of Genes and Genomes
 TCGA:

The Cancer Genome Atlas
 BRCA:

Breast cancer
 METABRIC:

Molecular Taxonomy of Breast Cancer International Consortiu
 OV:

Ovarian cancer
References
Tang ZX, Lei SF, Zhang XY, et al. Gsslasso cox: a Bayesian hierarchical model for predicting survival and detecting associated genes by incorporating pathway information. BMC Bioinformat. 2019;20(1):94.
Ashley EA. Towards precision medicine. Nat Rev Genet. 2016;17(9):507–22.
Gupta GK, Collier AL, Lee D, et al. Perspectives on triplenegative breast cancer: current treatment strategies, unmet needs, and potential targets for future therapies. Cancers. 2020;12(9):2392.
Fisher R, Pusztai L, Swanton C. Cancer heterogeneity: implications for targeted therapeutics. Br J Cancer. 2013;108(3):479–85.
Jiang T, Shi W, Natowicz R, et al. Statistical measures of transcriptional diversity capture genomic heterogeneity of cancer. BMC Genomics. 2014;15(1):876.
Shao W, Wang T, Sun L, et al. Multitask multimodal learning for joint diagnosis and prognosis of human cancers. Med Image Anal. 2020;65:101795.
Tibshirani R. Regression shrinkage and selection via the lasso: a retrospective. J R Stat Soc B. 2011;73:273–82.
Simon N, Friedman J, Hastie T, et al. Regularization paths for Cox's proportional hazards model via coordinate descent. J Stat Softw. 2011;39(5):1–13.
Vogelstein B, Kinzler KW. Cancer genes and the pathways they control. Nat Med. 2004;10(8):789–99.
Wei Z, Li HZ. Nonparametric pathwaybased regression models for analysis of genomic data. Biostatistics. 2007;8(2):265–84.
Huang SJ, Yee C, Ching T, et al. A novel model to combine clinical and pathwaybased transcriptomic information for the prognosis prediction of breast cancer. Plos Comput Biol. 2014;10(9):e1003851.
Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. J R Stat Soc B. 2006;68:49–67.
Breheny P, Huang J. Penalized methods for bilevel variable selection. Stat Interface. 2009;2(3):369–80.
Chen X, Wang LL. Integrating biological knowledge with gene expression profiles for survival prediction of Cancer. J Comput Biol. 2009;16(2):265–78.
Zhang XY, Li Y, Akinyemiju T, et al. PathwayStructured Predictive Model for Cancer Survival Prediction: A TwoStage Approach. Genetics. 2017;205(1):89.
Kim SY, Jeong HH, Kim J, et al. Robust pathwaybased multiomics data integration using directed random walks for survival prediction in multiple cancer studies. Biol Direct. 2019;14(1):8.
Wolpert DH. Stacked generalization. Neural Netw. 1992;5(2):241–59.
Kim M, Rai N, Zorraquino V, et al. Multiomics integration accurately predicts cellular state in unexplored conditions for Escherichia coli. Nat Commun. 2016;7:13090.
Wey A, Connett J, Rudser K. Combining parametric, semiparametric, and nonparametric survival models with stacked survival models. Biostatist. 2015;16(3):537–49.
Golmakani MK, Polley EC. Super learner for survival data prediction. Int J Biostat. 2020;16(2):20190065.
Craig E, Zhong CY, Tibshirani R. Survival stacking: casting survival analysis as a classification problem. 2021;arXiv:2107.13480.
Ginestet PG, Gabriel EE, Sachs MC. Survival stacking with multiple data types using pseudoobservationbasedAUC loss. J Biopharm Stat. 2022;32(6):858–70.
Gerds TA, Schumacher M. Consistent estimation of the expected brier score in general survival models with rightcensored event times. Biom J. 2006;48(6):1029–40.
McVittie JH, Wolfson DB, Addona V, et al. Stacked survival models for residual lifetime data. BMC Med Res Methodol. 2022;22(1):10.
Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc B. 1996;58(1):267–88.
Tang ZX, Shen YP, Zhang XY, et al. The SpikeandSlab Lasso Generalized Linear Models for Prediction and Associated Genes Detection. Genetics. 2017;205(1):77.
Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010;33(1):1–22.
Yi NJ, Tang ZX, Zhang XY, et al. BhGLM: Bayesian hierarchical GLMs and survival models, with applications to genomics and epidemiology. Bioinformat. 2019;35(8):1419–21.
Zeng Y, Breheny P. Overlapping group logistic regression with applications to genetic pathway selection. Cancer Inform. 2016;15:179–87.
Bender R, Augustin T, Blettner M. Generating survival times to simulate cox proportional hazards models. Stat Med. 2005;24(11):1713–23.
Yu GC, Wang LG, Han YY, et al. clusterProfiler: an R package for comparing biological themes among gene clusters. Omics. 2012;16(5):284–7.
Breiman L. Stacked regressions. Mach Learn. 1996;24(1):49–64.
Chen X, Wang L, Ishwaran H. An integrative pathwaybased clinicalgenomic model for cancer survival prediction. Stat Probabil Lett. 2010;80(17–18):1313–9.
Mariotto E, Viola G, Ronca R, et al. Choline kinase alpha inhibition by EB3D triggers cellular senescence, reduces tumor growth and metastatic dissemination in breast cancer. Cancers (Basel). 2018;10(10):391.
Bocca C, Bozzo F, Francica S, et al. Involvement of PPAR gamma and Ecadherin/betacatenin pathway in the antiproliferative effect of conjugated linoleic acid in MCF7 cells. Int J Cancer. 2007;121(2):248–56.
Adams CM, Mitra R, Xiao Y, et al. Targeted MDM2 degradation reveals a new vulnerability for p53inactivated triplenegative breast Cancer. Cancer Discov. 2023;13(5):1210–29.
Marin A, Mamun AA, Patel H, et al. Acquired secondary HER2 mutations enhance HER2/MAPK signaling and promote resistance to HER2 kinase inhibition in breast cancer. Cancer Res. 2023;83(18):3145–58.
Sorensen SA, Fenger K, Olsen JH. Significantly lower incidence of cancer among patients with Huntington disease: an apoptotic effect of an expanded polyglutamine tract? Cancer. 1999;86(7):1342–6.
Ji J, Sundquist K, Sundquist J. Cancer incidence in patients with polyglutamine diseases: a populationbased study in Sweden. Lancet Oncol. 2012;13(6):642–8.
McNulty P, Pilcher R, Ramesh R, et al. Reduced Cancer incidence in Huntington's disease: analysis in the registry study. J Huntingtons Dis. 2018;7(3):209–22.
Moreira Sousa C, McGuire JR, Thion MS, et al. The Huntington disease protein accelerates breast tumour development and metastasis through ErbB2/HER2 signalling. EMBO Mol Med. 2013;5(2):309–25.
Thion MS, McGuire JR, Sousa CM, et al. Unraveling the role of Huntingtin in breast cancer metastasis. J Natl Cancer Inst. 2015;107(10):djv208.
Acknowledgments
We thank the reviewers and the associate editor for their constructive suggestions and comments that have improved the manuscript. We acknowledge the contributions of the TCGA Research Network and the METABRIC team.
Funding
This study was supported by the National Natural Science Foundation of China [81773541 & 82373688], and funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions at Soochow University, the State Key Laboratory of Radiation Medicine and Protection [GZK1201919], Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases (KJS2222), and Suzhou Key Laboratory of Neurooncology and Nanobionics (SZZD003). The funding body had no role in the design or conduct of this research.
Author information
Authors and Affiliations
Contributions
Study conception and design: ZXT, JJS, and SW. Data collection and cleaning: SW, LB, XCW, and YFD. Real data analysis and interpretation: JJS, HS, and JH. Drafting and revising of the manuscript: JJS, SW, and ZXT All authors reviewed the manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Shen, J., Wang, S., Sun, H. et al. A novel nonnegative Bayesian stacking modeling method for Cancer survival prediction using highdimensional omics data. BMC Med Res Methodol 24, 105 (2024). https://doi.org/10.1186/s12874024022323
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12874024022323