 Technical advance
 Open Access
 Published:
A novel approach to determine two optimal cutpoints of a continuous predictor with a Ushaped relationship to hazard ratio in survival data: simulation and application
BMC Medical Research Methodology volume 19, Article number: 96 (2019)
Abstract
Background
In clinical and epidemiological researches, continuous predictors are often discretized into categorical variables for classification of patients. When the relationship between a continuous predictor and log relative hazards is Ushaped in survival data, there is a lack of a satisfying solution to find optimal cutpoints to discretize the continuous predictor. In this study, we propose a novel approach named optimal equalHR method to discretize a continuous variable that has a Ushaped relationship with log relative hazards in survival data.
Methods
The main idea of the optimal equalHR method is to find two optimal cutpoints that have equal log relative hazard values and result in Cox models with minimum AIC value. An R package ‘CutpointsOEHR’ has been developed for easy implementation of the optimal equalHR method. A Monte Carlo simulation study was carried out to investigate the performance of the optimal equalHR method. In the simulation process, different censoring proportions, baseline hazard functions and asymmetry levels of Ushaped relationships were chosen. To compare the optimal equalHR method with other common approaches, the predictive performance of Cox models with variables discretized by different cutpoints was assessed.
Results
Simulation results showed that in asymmetric Ushape scenarios the optimal equalHR method had better performance than the median split method, the upper and lower quantiles method, and the minimum pvalue method regarding discrimination ability and overall performance of Cox models. The optimal equalHR method was applied to a real dataset of small cell lung cancer. The real data example demonstrated that the optimal equalHR method could provide clinical meaningful cutpoints and had good predictive performance in Cox models.
Conclusions
In general, the optimal equalHR method is recommended to discretize a continuous predictor with rightcensored outcomes if the predictor has an asymmetric Ushaped relationship with log relative hazards based on Cox regression models.
Background
In survival analysis, Cox regression models [1], which are the most popular model in this field, are frequently used to investigate the effects of explanatory variables on rightcensored survival outcomes. The explanatory variables may be continuous, such as age or weight, or they may be discrete variables, such as gender or treatment factors. When continuous explanatory variables have nonlinear effects on outcomes, it is of interest to investigate Ushaped relationships [2,3,4,5] between continuous explanatory variables and healthrelated outcomes in many researches. Although the Ushaped effects of continuous variables can be modeled in Cox models with flexible smoothing techniques [6,7,8], such as penalized splines and restricted cubic splines, many clinical and epidemiological researchers would rather discretize continuous explanatory variables [9, 10] to reflect highrisk and lowrisk values of the independent variables and compare the risks of developing survival outcomes (i.e. deaths or relapses) between different groups of patients. Moreover, optimal cutpoints could help identify thresholds of important predictors, which could be used to provide classification schemes of the patients and assist in making clinical treatment decisions. In practice, it is sensible to use standard clinical reference values as cutpoints to discretize continuous predictors. But when it comes to lack of standard reference ranges for newly discovered risk factors or the reference ranges can’t be applied to the population with different characteristics, how to find the scientific and reasonable cutpoints to categorize continuous independent variables has been an important issue to be addressed [11,12,13].
There are two widely adopted approaches to discretize continuous independent variables in survival analysis. One is the dataoriented cutpoints approach [14, 15], which uses the median value, quartiles or other percentile values based on the distribution of continuous variables as cutpoints. Owing to its simplicity and easiness of implementation, median value and upper and lower quantiles (noted as Q1Q3) have been widely used in many studies as cutpoints. However, this approach provides arbitrary cutpoints regardless of the relationships with survival outcomes and might lead to wrong estimates of the true effects. Another approach named maximum statistic approach or minimum pvalue approach was first developed by Miller and Siegmund [16] to dichotomize continuous predictors with binary outcomes. The minimum pvalue approach selects a cutpoint with maximum χ^{2} statistic as the optimal cutpoint when the outcomes are binary. When it is extended to survival outcomes, the optimal cutpoint is the one that results in a minimum pvalue of logrank tests [17]. In the simulation studies of the minimum pvalue approach, it is usually assumed that there is a single theoretical threshold of continuous variables, which means relationships between independent variables and survival outcomes are stepwise functional relations. In practice, independent variables and survival outcomes generally have smooth relationships instead of biologically implausible stepwise functional relationships. In addition, Ushaped relationships between continuous variables and outcomes are commonly seen in the clinical and epidemiological studies [2,3,4,5] but little considered in the study of the discretization methods. In the case of body mass index (BMI), a too low and a high BMI value both cause harmful effects on overall health [3, 18]. When a prognostic variable has a Ushaped relationship with outcomes, the effect of the prognostic variable may be underestimated using high and lowrisk groups divided by a single cutpoint.
To overcome the shortcomings of the common discretization methods in survival data and meet the needs of finding optimal cutpoints for a continuous predictor that has a Ushaped relationship with survival outcomes, we propose a new approach named optimal equalHR method to estimate two optimal cutpoints that have approximately equal log hazard values based on Cox regression models. The main idea of the optimal equalHR method is derived from the clinical need of classifying patients into highrisk and lowrisk groups according to a categorized prognostic variable. In clinical practice, the classification of highrisk or lowrisk patients is based on their risks of developing unfavored survival outcomes, such as deaths or relapses. And the larger log relative hazards correspond to the higher risks of developing the observed survival outcomes. Despite the reference ranges, common methods for classification in practical epidemiology studies, such as the Q1Q3 method, are not accurate to define such lowrisk or highrisk groups when there exists a Ushaped relationship. It is because the individuals with the same risks (the same log relative hazards) could be divided into opposite risk groups by the Q1Q3 method. During the procedures of finding optimal cutpoints, it is important to consider the effects of log relative hazards. Hence, we use log relative hazards to find two optimal cutpoints (P_{1}, P_{2}) of the prognostic variable X (as shown in Fig. 1). The patients with X smaller than P_{1} or larger than P_{2} are classified into two highrisk groups (or two lowrisk groups in an inverse Ushaped relationship) respectively. By doing this, the two highrisk groups are defined according to the same threshold of log relative hazards, which makes the classification more reasonable in clinical explanations. Our method can provide more accurate optimal cutpoints and avoid the individuals with the same risks (the same log relative hazards) being divided into opposite risk groups. An R package named ‘CutpointsOEHR’ has been originally developed to help investigators easily implement the optimal equalHR method.
The rest of this paper is organized as follows. The details of the optimal equalHR method are presented in next Section ‘Methods’. The performance of the optimal equalHR method is compared with other commonly used discretization methods regarding discrimination power and overall performance via a simulation study. We present the simulation settings in Section ‘The Simulation Study’. The results of the simulation study and the application of the optimal equalHR method on a real dataset of small cell lung cancer are presented in Section ‘Results’. Finally, there are discussion and conclusions.
Methods
The optimal equalHR method is based on Cox proportional hazards regression models [1] which have the following structure:
where h(t) denotes the hazard function, h_{0}(t) denotes the baseline hazard function, t is the observed survival times, X is a vector of covariates, and β is a vector of estimated regression coefficients. The relative hazard λ can be calculated as the above equation divided by h_{0}(t) in both side:
The optimal equalHR method uses log(λ) values to search for optimal cutpoints. Hazard ratios (HR) could be easily calculated from the relative hazards λ when investigators choose a reference value of an independent variable and control other variables at average levels. And the relationship between HR and a continuous variable is identical to that between λ and the continuous variable. Therefore, the method of finding the optimal cutpoints with approximate equal log(λ) values is named as optimal equalHR method. The procedure of the optimal equalHR method contains two main steps described as follows.
Graphical diagnostic plot
The optimal equalHR method proposed in this study aims to solve the problem of discretizing a continuous variable that has a Ushaped relationship with log(λ) in the Cox model. Therefore, the first step of adopting the optimal equalHR method is to determine the relationship between a continuous covariate and log(λ) and plot the curve. Previous researches have already proposed several methods for estimating nonlinear relationships, including the multiple β method [19], martingale residual based method [20], spline methods [21], etc. The performance of the multiple β method for Cox models is unstable and largely depends on the number of selected groups because a Cox model is a semiparametric model and its likelihood is based on the order of events rather than their distributions. The martingale residual based method, which uses martingale residuals from Cox models to test the loglinearity, cannot plot the relationship between a continuous covariate and log(λ). Due to the limitations of the above two methods, this study used Cox regression models with penalized Bsplines (Psplines) [7, 22], which balances goodness of fit and variance, to curve the relationship and determine whether the nonlinear term is statistically significant. The smoothing parameter of the Psplines of degree 3 with 22 evenly spaced knots is automatically chosen by minimizing AIC, which is achieved through the Rfunction ‘pspline’ in the ‘suevival’ package under the univariate situation. If there are two or above covariates with nonlinear effects, the Rfunction ‘dfmacox’ in the ‘smoothHR’ package [23] could be used to obtain the optimal smoothing parameter. Then the estimated log(λ) values are plotted against the continuous variable to give an insight into the biological nature of the continuous variable.
Find two optimal cutpoints
If the plotted curve suggests of a Ushaped relationship (such as Fig. 1), two optimal cutpoints of the continuous variable are searched based on the relationship curve of the continuous variable and log(λ). Specific steps of the optimal equalHR method are as follows:

(1)
Calculate the percentiles of the estimated log(λ) values, denoted as Q_{k}, k = 1, 2, ⋯, 100. For each percentile value between the 5th and the 95th percentile of the estimated log(λ), draw a straight line parallel to the xaxis, y = Q_{k}, k = 5, 6, … , 95. The line crosses the fitted Ushaped curve with two intersections (illustrated in Fig. 1). The two observations \( {P}_{1k}\left({X}_{1k},\mathit{\log}\left(\widehat{\lambda}X={X}_{1k}\right)\right),{P}_{2k}\left({X}_{2k},\mathit{\log}\left(\widehat{\lambda}X={X}_{2k}\right)\right),{X}_{1k}<{X}_{2k} \), which are closest to the two intersections respectively, are found as a pair of candidate cutpoints with a constraint that \( \left\log \left(\widehat{\lambda}X={X}_{1k}\right)\log \left(\widehat{\lambda}X={X}_{2k}\right)\right\le 0.01 \). If the constraint is violated, the linear interpolate method is used to construct new data points as candidate cutpoints, which ensures candidate cutpoints have equal log(λ) values.

(2)
The continuous predictor X is discretized into a categorical covariate X^{′} with low range (X < X_{1k}), median range (X_{1k} < X < X_{2k}), and high range (X > X_{2k}) according to each pair of candidate cutpoints.

(3)
Then the categorical covariate X^{′} (reference level is the median range) is fitted in a Cox model and the concomitant Akaike Information Criterion (AIC) value is calculated. The pair of cutpoints that minimizes AIC values is defined as optimal cutpoints. Moreover, choosing cutpoints by the Bayesian information criterion (BIC) has the same results as AIC (Additional file 1: Tables S1, S2 and S3).
Implementation in R
The optimal equalHR method was implemented in the language R (version 3.3.3). The freely available R package ‘survival’ was used to fit Cox models with Psplines. The R package ‘pec’ was employed for computing the Integrated Brier Score (IBS). The R package ‘maxstat’ was used to implement the minimum pvalue method with logrank statistics. And an R package named ‘CutpointsOEHR’ was developed for the optimal equalHR method. This package could be installed in R by coding devtools::install_github(“yimichen/CutpointsOEHR”). All tests were twosided and considered statistically significant at P < 0.05.
The simulation study
A Monte Carlo simulation study was used to evaluate the performance of the optimal equalHR method and other discretization methods including the median split (Median), the upper and lower quartiles values (Q1Q3), and the minimum logrank test pvalue method (minP). We generated rightcensored survival data with known Ushaped exposureresponse relationships. To investigate the performance of these methods, the predictive performance of Cox models fitted with different discretized variables was assessed.
Design of the simulation study
The survival times T_{0} were generated from the Weibull distribution using the method from Bender’s research [24] as
where U followed a uniform distribution on the interval from 0 to 1, abbreviated as U~U(0, 1), λ was the scale parameter of Weibull distribution, v was the shape parameter of Weibull distribution, x was a continuous covariate from a standard normal distribution, and s(x) was the given function of interest. To simulate Ushaped relationships between x and log(λ), the form of s(x) was set to be
where parameters k_{1}, k_{2} and a were used to control the symmetric and asymmetric Ushaped relationships. When k_{1} was equal to k_{2}, the relationship was almost symmetric. For each subject, censoring time C was simulated by the uniform distribution with [0, r]. The final observed survival time was T = min(T_{0}, C), and d was a censoring indicator of whether the event happened or not in the observed time T (d = 1 if T_{0} ≤ C, else d = 0). The parameter r was used to control the censoring proportion P_{c}.
One hundred independent datasets were simulated with n = 500 subjects per dataset for various combinations of parameters k_{1}, k_{2}, a, v and P_{c}. Moreover, the simulation results of different sample sizes were shown in the supplementary file, Additional file 1: Figures S1 and S2. The values of (k_{1}, k_{2}, a) were set to be (− 2, 2, 0), (− 8/3, 8/5, − 1/2), (− 8/5, 8/3, 1/2), (− 4, 4/3, − 1), and (− 4/3, 4, 1), which were intuitively presented in Fig. 2. Large absolute values of a meant that the Ushaped relationship was more asymmetric than that with small absolute values of a. Peak asymmetry factor [25] of the above (k_{1}, k_{2}, a) values were 1, 5/3, 3/5, 3, 1/3, respectively. The survival times were Weibull distributed with the decreasing (v = 1/2), constant (v = 1) and increasing (v = 5) hazard rates. The scale parameter of Weibull distribution was set to be 1. The censoring proportion P_{c} was set to be 0, 20 and 50%. For each scenario, the median method, the Q1Q3 method, the minP method and the optimal equalHR method were performed to find the optimal cutpoints.
Measures of predictive performance
The predictive performance of Cox regression models fitted with covariates discretized by different methods was assessed in two aspects, which were discrimination power and overall performance (explained variance). Harrel’s concordance index (cindex) [26], Gönen and Heller’s concordance probability estimate (CPE) [27], and Graf’s integrated Brier Score (IBS) [28], Kent’s R_{PM}^{2} [29] and Royston’s R_{D}^{2} [30] were used to access the performance of Cox models. Both cindex and CPE are estimators of concordance probability. In general, the concordance probability is the probability that the one of two randomly selected patients with the shorter survival time has a higher predicted risk. Integrated Brier Score (IBS), which ranges from 0 to 1, measures the mean squared difference between forecast probability and true disease status. A predictive model with the lower IBS value has more accurate forecasts. R_{PM}^{2} and R_{D}^{2} both measure the proportion of the explained variance of outcomes by the model. The twofold crossvalidation approach [31] was applied to confirm the significance of the cutpoints and obtain almost unbiased estimates of HR, cindex, CPE, IBS, R_{PM}^{2} and R_{D}^{2}.
Results
Results of the simulation study
In this simulation study, we found that different v values didn’t influence the results. Therefore, we only illustrated the results when v was equal to 1 (constant hazards). There were similar results when (k_{1}, k_{2}, a) was set to be (− 8/3, 8/5, − 1/2) and (− 8/5, 8/3, 1/2), as well as when (k_{1}, k_{2}, a) was set to be (− 4, 4/3, − 1) and (− 4/3, 4, 1). Therefore, we only showed the results when (k_{1}, k_{2}, a) equaled (− 2, 2, 0), (− 8/5, 8/3, 1/2), and (− 4/3, 4, 1), which corresponded to symmetric, moderate asymmetric and severe asymmetric Ushaped relationships respectively.
Tables 1, 2, 3 presented the estimated cutpoints of the median method, the Q1Q3 method, the minP method and the optimal equalHR method under different parameter scenarios. Figures 3, 4, 5 examined the performance of Cox models with a continuous covariate discretized by the four different discretization methods.
When (k_{1}, k_{2}, a) equaled (− 2, 2, 0), the relationship between the continuous covariate and log(λ) was almost symmetric (censoring might cause mild asymmetry). The median method had the worst performance out of the four discretization methods (Fig. 3). The cutpoints found by the minP method had much larger simulation standard errors than the other methods (Table 1). The median values of cutpoints selected by the minP method were 0.60 (P_{c} = 0%), 0.00 (P_{c} = 20%), and − 0.76 (P_{c} = 50%), which also reflected the considerable variation. The minP method performed better than the median method, but worse than the optimal equalHR method and the Q1Q3 method. In terms of cindex, CPE, and IBS measures, using Q1Q3 values as cutpoints was slightly better than the other methods including the optimal equalHR method (Fig. 3). As for R_{PM}^{2} and R_{D}^{2} measures, the optimal equalHR method had the better performance than the other three methods.
When (k_{1}, k_{2}, a) equaled (− 8/5, 8/3, 1/2), the relationship between the continuous covariate and log(λ) was moderate asymmetric. The median value method had the worst performance among the four discretization methods (Fig. 4). In this case, the variation of cutpoints found by the minP method was smaller than that in the symmetric situation (Table 2), and the minP method ranked the third regarding overall performance. The Q1Q3 method ranked the second. Overall, the optimal equalHR method had the best performance out of the four methods. However, the advantage of the optimal equalHR method over the Q1Q3 method was quite slight in terms of cindex, CPE, and IBS.
When (k_{1}, k_{2}, a) equaled (− 4/3, 4, 1), the relationship between the continuous covariate and log(λ) was severe asymmetric. In this case, the optimal equalHR method had an obvious advantage over the other three methods in terms of cindex, CPE, IBS, R_{PM}^{2} and R_{D}^{2} measures (Fig. 5). Quantitatively, the differences between the performance of the optimal equalHR method and the Q1Q3 method were larger than those in the moderate situation under different censoring proportions. For example, comparing the optimal equalHR method with the Q1Q3 method, the difference of their cindexes changed from 0.008 (moderate asymmetric situation) to 0.043 (severe asymmetric situation) when the P_{c} was 0%. Therefore, when the relationship between the continuous covariate and log(λ) was severe asymmetric, the optimal equalHR method had much better performance than the Q1Q3 method, the median split method and the minP method.
According to the above results, we could get the following messages

The Q1Q3 method and the optimal equalHR method have comparable good performance when the Ushaped relationship between the continuous covariate and log(λ) is almost symmetric. As the Ushaped relationship becomes asymmetric, the optimal equalHR method has better performance than the Q1Q3 method.

The median method and the minP method are not recommended when there exist Ushaped relationships between the continuous covariates and log(λ) since these two methods don’t have good performance under all simulation scenarios.
Results of the application on a real dataset
Small cell lung cancer (SCLC) is a subtype of lung cancer with terrible survival outcomes. It is characterized by high invasiveness, high growth fraction, and poor prognosis. Fibrinogen (FIB), a protein that is synthesized by the liver and has a blood coagulation function, plays an important role in the pathogenesis of cardiovascular diseases. The reference range of FIB is 2–4 g/L. Several studies have found that the higher FIB level was associated with shorter overall survival (OS) [32,33,34]. Some authors used the median value of FIB as a cutpoint of low and high FIB level [32, 33], when some others used the upper limit of the reference range as a cutpoint [34].
We used the data of 275 patients with SCLC in the First Affiliated Hospital of Guangzhou Medical University from January 2009 to December 2013 [35]. FIB ranged from 0.98 to 9.23 g/L (mean ± SD: 4.78 ± 1.56 g/L). There were 235 events (deaths) during the study period. The OS ranged from 0 to 86 months (median: 12 months), and the 1, 2year OS rates were 50 and 21%, respectively.
First, the effect of FIB on log relative hazards was curved by a Cox model with Psplines. The Ushaped graph (Fig. 6) indicated that the patients with low and high FIB values might have a higher risk of deaths when compared to those with FIB values in the median range. It meant that the data was suitable to apply the optimal equalHR method. Then, the median method, the Q1Q3 method, the minP method and the optimal equalHR method were applied in the data.
Table 4 showed the cutpoints found by the four methods. The median method selected 4.52 as a cutpoint, the Q1Q3 method selected two cutpoints at 3.59 and 5.84, the minP method selected 4.02, and the optimal equalHR method chose 2.62 and 4.06 as cutpoints. The cutpoints of the optimal equalHR method were closest to the reference range of FIB (2–4 g/L).
Four Cox models were fitted with the discretized FIB variable. Reference levels of the discretized FIB were < 4.52 for the median method, 3.59–5.84 for the Q1Q3 method, < 4.02 for the minP method, and 2.62–4.06 for the optimal equalHR method. In the Cox model with the FIB discretized by the optimal equalHR method, both low FIB level (< 2.62) and high FIB level (> 4.06 g/L) had adverse effects (Pvalue < 0.05) on survival outcomes, while it was not statistically significant using the Q1Q3 method.
Table 5 illustrated the performance of different estimated cutpoints in Cox models. 100times fivefold crossvalidation was used to obtain accurate estimations of the performance. The cutpoints estimated by the optimal equalHR method had the best performance in discrimination power and overall performance (explained variance) in terms of cindex, CPE, IBS, R_{PM}^{2} and R_{D}^{2} measures. In conclusion, the optimal equalHR method provided clinical meaningful and wellperformed cutpoints in the case study.
Discussion
In this study, we propose the optimal equalHR method to discretize a continuous predictor when the relationship between the predictor and log(λ) is Ushaped. We demonstrate the results of the Monte Carlo simulation study with different censoring proportions, baseline functions, and relationship curves. When the relationship between the predictor and log(λ) is symmetric (peak asymmetry factor equals 1), it’s hard to describe whether the Q1Q3 method or the optimal equalHR method has better performance. At a first look, it surprises us a little that the Q1Q3 method has such good performance. The two possible reasons for the good performance of the Q1Q3 method are as follows. One is that under the symmetric scenario, the Q1Q3 method finds two cutpoints with close log(λ) values, which conforms to the principle of the optimal equalHR method. The other is that the Q1Q3 method divides samples into three groups with 1:2:1 sample sizes while the optimal equalHR method might result in more unbalanced ratios of sample sizes. When the relationship becomes asymmetric, which is common in practice, the optimal equalHR method has better performance than the Q1Q3 method, the median method, and the minP method. The SCLC case study has also proved that the cutpoints selects by the optimal equalHR method have more plausible biological sense and better model performance in the asymmetric situation.
We have only used the optimal equalHR method under univariate Cox regression models in this study. The effect of continuous predictors on survival outcome might be influenced by other confounding variables. Therefore, it is important to include those variables in Cox models. Mazumber, et al [36] suggested that when we add a new categorical predictor to an existing multivariate model, searching the cutpoints for the new variable should be done in the existing multivariate setting. The optimal equalHR method could easily extend to multivariate Cox models after the selection of confounding variables is made depend on the ground of clinical and epidemiological understandings. As mentioned before, the optimal equalHR method includes two main steps: (1). the graphical diagnostic plots, (2). find two optimal cutpoints. The optimal equalHR method could be applied in a multivariate context as follows. First, the diagnostic plots and estimated log relative hazards could be obtained from a multivariate Cox regression model with penalized Bsplines, which means the functional form for the interested covariate is determined under the condition that all other covariates remain constant. Then, the optimal cutpoints of the interested covariate will be the cutpoints in the multivariate Cox regression model with minimum AIC value. Additionally, the simulation results in Additional file 1: Tables S4 and S5 showed that applying our method in a multivariate context will provide more reasonable estimates of both optimal cutpoints (smaller variations) and Cox regression coefficients (closer to the true regression coefficients) than that in a univariate context if the survival outcomes are truly affected by other covariates. Nevertheless, the R package ‘CutpointsOEHR’ already supports seeking optimal cutpoints in a multivariate context. However, when there are two or more covariates that need to be categorized, the implementation of the optimal equalHR method remains to be addressed in further studies.
Other researchers have also proposed different ways to find two optimal cutpoints. Camp, et al [37] developed XTile, a bioinformatics tool to find optimal cutpoints, which is widely used in genetic data. The XTile accesses all combinations of two cutpoints and selects the one with the highest logrank χ^{2} value as the optimal pair of cutpoints. Compared to the XTile, the optimal equalHR method finds two cutpoints with approximately equal log(λ) values, which is based on the underlying idea of classifying highrisk or lowrisk population according to their log relative hazards. In this way, the optimal equalHR method might provide more clinical valuable cutpoints than the Xtile, offer clues for finding reasonable reference ranges, and reduce the number of computer operations. We might expect a better prediction performance if we allow different hazard values for two cutpoints. Therefore, it is possible to improve our strategy with a tradeoff between prediction performance and clinical meaning in further studies.
There are more works to be done in the future to generalize the utilization of the optimal equalHR method: (1) This method uses Cox models with penalized Bsplines to curve Ushaped relationships, which requires covariates to satisfy the proportional hazard (PH) assumption. Therefore, further studies are needed to release the PH assumption. (2) We focus on the Ushaped relationships between covariates and log relative hazards in this study. The modification and application of the optimal equalHR method to other types of nonlinear relationships remain to be explored.
It is important to remember that lots of literature have proved that discretization will inevitably result in information loss [38,39,40]. This study did not encourage researchers to blindly discretize continuous independent variables when fitting Cox models. The decision to discretize a continuous covariate should be cautiously made by the investigators based on clinical needs. What’s more, the precondition of our method is the Ushaped relationship between an interested continuous covariate and survival outcomes. We recommend using the graphical diagnostic plots, which are based on Cox models with penalized Bsplines, to visualize the data and determine whether there exists Ushaped relationships or not. Besides the graphical diagnostic plots, formal tests, such as the twolines test [41], could also facilitate the judgement of Ushaped relationships.
Conclusions
In general, the optimal equalHR method proposed in our study offers researchers a solution to find optimal cutpoints of continuous predictors that have Ushaped relationships with log(λ) in survival analysis. If researchers have decided to discretize a continuous predictor in Cox models, we highly advise them to explore the relationships between continuous predictors and survival outcomes firstly. When the relationships are U shapes, of which the majority are asymmetric in realworld data, the optimal equalHR method is recommended to find two optimal cutpoints. In addition, an R package called ‘CutpointsOEHR’ has been developed for easy use of this methodology in practice.
Abbreviations
 AIC :

Akaike Information Criterion
 CPE:

Concordance probability estimate
 HR:

Hazard ratio
 IBS:

Integrated brier score
 MinP :

Minimum pvalue approach to find optimal cutpoints
 P _{c} :

Censoring proportion
 Q1Q3:

Upper quantile and lower quantile values
 SE:

Standard error
References
Cox DR. Regression models and lifetables. J R Stat Soc. 1972;34(2):187–220.
Calabrese EJ, Baldwin LA. Ushaped doseresponses in biology, toxicology, and public health. Annu Rev Public Health. 2001;22:15–33.
Hamer M, Stamatakis E. Ushaped association between body mass index and psychological distress in a population sample of 114,218 British adults. Mayo Clin Proc. 2017;92(12):1865–6.
Ervasti J, Kivimaki M, Head J, Goldberg M, Airagnes G, Pentti J, Oksanen T, Salo P, Suominen S, Jokela M, et al. Sickness absence diagnoses among abstainers, lowrisk drinkers and atrisk drinkers: consideration of the Ushaped association between alcohol use and sickness absence in four cohort studies. Addiction. 2018;113(9):1633–42.
Du X, Zhu B, Hu G, Mao W, Wang S, Zhang H, Wang F, Shi Z. Ushape association between white blood cell count and the risk of diabetes in young Chinese adults. Diabet Med. 2009;26(10):955–60.
Govindarajulu US, Spiegelman D, Thurston SW, Ganguli B, Eisen EA. Comparing smoothing techniques in Cox models for exposureresponse relationships. Stat Med. 2007;26(20):3735–52.
Roshani D, Ghaderi E. Comparing smoothing techniques for fitting the nonlinear effect of covariate in Cox models. Acta Inform Med. 2016;24(1):38–41.
Therneau TM, Grambsch PM. Modeling survival data: extending the Cox model. New York: Springer; 2000.
Bouwmeester W, Zuithoff NP, Mallett S, Geerlings MI, Vergouwe Y, Steyerberg EW, Altman DG, Moons KG. Reporting and methods in clinical prediction research: a systematic review. PLoS Med. 2012;9(5):1–12.
Mabikwa OV, Greenwood DC, Baxter PD, Fleming SJ. Assessing the reporting of categorised quantitative variables in observational epidemiological studies. BMC Health Serv Res. 2017;17(1):201.
Budczies J, Klauschen F, Sinn BV, Gyorffy B, Schmitt WD, DarbEsfahani S, Denkert C. Cutoff finder: a comprehensive and straightforward web application enabling rapid biomarker cutoff optimization. PLoS One. 2012;7(12):e51862.
Raghavan R, Ashour FS, Bailey R. A review of cutoffs for nutritional biomarkers. Adv Nutr. 2016;7(1):112–20.
Prince Nelson SL, Ramakrishnan V, Nietert PJ, Kamen DL, Ramos PS, Wolf BJ. An evaluation of common methods for dichotomization of continuous variables to discriminate disease status. Commun Stat Theory Methods. 2017;46(21):10823–34.
Knuppel L, Hermsen O. Median split, kgroup split, and optimality in continuous populations. AstaAdv Stat Anal. 2010;94(1):53–74.
Iacobucci D, Posavac SS, Kardes FR, Schneider MJ, Popovich DL. Toward a more nuanced understanding of the statistical properties of a median split. J Consum Psychol. 2015;25(4):652–65.
Miller R, Siegmund D. Maximally selected chi square statistics. Biometrics. 1982;38(4):1011–6.
Mazumdar M, Glassman JR. Categorizing a prognostic variable: review of methods, code for easy implementation and applications to decisionmaking about cancer treatments. Stat Med. 2000;19(1):113–32.
Thinggaard M, Jacobsen R, Jeune B, Martinussen T, Christensen K. Is the relationship between BMI and mortality increasingly Ushaped with advancing age? A 10year followup of persons aged 7095 years. J Gerontol Ser ABiol Sci Med Sci. 2010;65(5):526–31.
Gamel JW, McLean IW. A method for determining the optimum transform for covariates of the Cox model with application to 3680 cases of intraocular melanoma. Comput Biomed Res. 1988;21(5):471–7.
Klein JP, Wu JT. Discretizing a continuous covariate in survival studies. Handbook Stat. 2003;23(03):27–42.
Molinari N, Daures JP, Durand JF. Regression splines for threshold selection in survival data analysis. Stat Med. 2001;20(2):237–47.
Eilers PHC, Marx BD. Flexible smoothing with Bsplines and penalties. In: Statistical Science, vol. 1996; 1996. p. 89–121.
MeiraMachado L, CadarsoSuarez C, Gude F, Araujo A. smoothHR: an R package for pointwise nonparametric estimation of hazard ratio curves of continuous predictors. Comput Math Methods Med. 2013;2013:745742.
Bender R, Augustin T, Blettner M. Generating survival times to simulate Cox proportional hazards models. Stat Med. 2005;24(11):1713–23.
Foley JP, Dorsey JG. Equations for calculation of chromatographic figures of merit for ideal and skewed peaks. Anal Chem. 1983;55(4):730–7.
Harrell FE Jr, Lee KL, Califf RM, Pryor DB, Rosati RA. Regression modelling strategies for improved prognostic prediction. Stat Med. 1984;3(2):143–52.
Gonen M, Heller G. Concordance probability and discriminatory power in proportional hazards regression. Biometrika. 2005;92(4):965–70.
Graf E, Schmoor C, Sauerbrei W, Schumacher M. Assessment and comparison of prognostic classification schemes for survival data. Stat Med. 1999;18(17–18):2529–45.
Tawn JA. Measures of dependence for censored survival data. Biometrika. 1988;75(3):525–34.
Royston P, Sauerbrei W. A new measure of prognostic separation in survival data. Stat Med. 2004;23(5):723–48.
Faraggi D, Simon R. A simulation study of crossvalidation for selecting an optimal cutpoint in univariate survival analysis. Stat Med. 1996;15(20):2203–13.
Ferrigno D, Buccheri G, Ricca I. Prognostic significance of blood coagulation tests in lung cancer. Eur Respir J. 2001;17(4):667–73.
Tas F, Kilic L, Serilmez M, Keskin S, Sen F, Duranyildiz D. Clinical and prognostic significance of coagulation assays in lung cancer. Respir Med. 2013;107(3):451–7.
Zhu LR, Li J, Chen P, Jiang Q, Tang XP. Clinical significance of plasma fibrinogen and Ddimer in predicting the chemotherapy efficacy and prognosis for small cell lung cancer patients. Clin Transl Oncol. 2016;18(2):178–88.
Pan H, Shi X, Xiao D, He J, Zhang Y, Liang W, Zhao Z, Guo Z, Zou X, Zhang J, et al. Nomogram prediction for the survival of the patients with small cell lung cancer. J Thorac Dis. 2017;9(3):507–18.
Mazumdar M, Smith A, Bacik J. Methods for categorizing a prognostic variable in a multivariable setting. Stat Med. 2003;22(4):559–71.
Camp RL, DolledFilhart M, Rimm DL. Xtile: a new bioinformatics tool for biomarker assessment and outcomebased cutpoint optimization. Clin Cancer Res. 2004;10(21):7252–9.
Royston P, Altman DG, Sauerbrei W. Dichotomizing continuous predictors in multiple regression: a bad idea. Stat Med. 2006;25(1):127–41.
Altman DG, Royston P. The cost of dichotomising continuous variables. BMJ. 2006;332(7549):1080.
BarnwellMenard JL, Li Q, Cohen AA. Effects of categorization method, regression type, and variable distribution on the inflation of typeI error rate when categorizing a confounding variable. Stat Med. 2015;34(6):936–49.
Simonsohn U. Two lines: a valid alternative to the invalid testing of Ushaped relationships with quadratic regressions. Adv Methods Pract Psychol Sci. 2018.
Acknowledgments
We would like to thanks Hui Pan for giving access to the small cell lung cancer data for this study.
Funding
This study was supported by the Natural Science Foundation of Guangdong Province, China (2016A030313365), the Science and Technology Program of Guangdong Province, China (2014A020212713) and the National Natural Science Foundation of China (NSFC, 81773545) in the design of the study, simulation and analysis of data and in writing the manuscript.
Availability of data and materials
The main codes of the simulation study are provided on the Supplementary Material. The small cell lung cancer data applied in this study are available from Hui Pan but restrictions apply to the availability of these data, which were used under license for the current study but not publicly available. The small cell lung cancer data are however available from the authors upon reasonable request and with permission of Hui Pan.
Author information
Authors and Affiliations
Contributions
JZ, YC, JH and XH conceived and designed the study. All authors have made contributions to the development of the optimal equalHR method. YC performed the simulations and made the data analyses. YC and GM drafted the manuscript. All authors interpreted the study. JZ supervised YC at all stages. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
The small cell lung cancer data from Hui Pan’s published article [35] was used in the manuscript as an example. The ethical statement in the published article is as follows: This retrospective study was approved by the Institutional Review Board of GMUFAH and Cancer Center of Guangzhou Medical University (GMUCC) (GMUFAH approval No: 2017–03; GMUCC approval No: P2017–001).
Consent for publication
The manuscript does not contain any individual person’s data.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional file
Additional file 1:
Table S1. Comparison of the cutpoints selected by the optimal equalHR method using AIC and BIC in simulated datasets when (k_{1}, k_{2}, a) equals (− 2, 2, 0). Table S2. Comparison of the cutpoints selected by the optimal equalHR method using AIC and BIC in simulated datasets when (k_{1}, k_{2}, a) equals (− 8/5, 8/3, 1/2). Table S3. Comparison of the cutpoints selected by the optimal equalHR method using AIC and BIC in simulated datasets when (k_{1}, k_{2}, a) equals (− 4/3, 4, 1). Table S4. The estimated cutpoints ^{a} by the multivariate and univariate approaches of the optimal equalHR method. Table S5. The estimated Cox regression coefficients ^{a} of covariates discretized by the multivariate and univariate approaches of the optimal equalHR method. Figure S1. Predictive performance of estimated cutpoints when sample sizes are 250. Figure S2. Predictive performance of estimated cutpoints when sample sizes range from 100 to 500 (DOCX 441 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Chen, Y., Huang, J., He, X. et al. A novel approach to determine two optimal cutpoints of a continuous predictor with a Ushaped relationship to hazard ratio in survival data: simulation and application. BMC Med Res Methodol 19, 96 (2019). https://doi.org/10.1186/s1287401907384
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1287401907384
Keywords
 Optimal cutpoints
 Discretization
 Categorize
 U shape
 Cox regression model
 Survival analysis