Flexible combination of multiple diagnostic biomarkers to improve diagnostic accuracy

Background In medical research, it is common to collect information of multiple continuous biomarkers to improve the accuracy of diagnostic tests. Combining the measurements of these biomarkers into one single score is a popular practice to integrate the collected information, where the accuracy of the resultant diagnostic test is usually improved. To measure the accuracy of a diagnostic test, the Youden index has been widely used in literature. Various parametric and nonparametric methods have been proposed to linearly combine biomarkers so that the corresponding Youden index can be optimized. Yet there seems to be little justification of enforcing such a linear combination. Methods This paper proposes a flexible approach that allows both linear and nonlinear combinations of biomarkers. The proposed approach formulates the problem in a large margin classification framework, where the combination function is embedded in a flexible reproducing kernel Hilbert space. Results Advantages of the proposed approach are demonstrated in a variety of simulated experiments as well as a real application to a liver disorder study. Conclusion Linear combination of multiple diagnostic biomarkers are widely used without proper justification. Additional research on flexible framework allowing both linear and nonlinear combinations is in need. Electronic supplementary material The online version of this article (doi:10.1186/s12874-015-0085-z) contains supplementary material, which is available to authorized users.


Background
In medical research, continuous biomarkers have been commonly explored as diagnostic tools to distinguish subjects, such as diseased and non-diseased groups [1]. The accuracy of a diagnostic test is usually evaluated through sensitivity and specificity, or the probabilities of true positive and true negative for any given cut-point. Particularly, the receiver operating characteristic (ROC) curve is defined as sensitivity versus 1−specificity over all possible cut-points for a given biomarker [2,3], which is a comprehensive plot that displays the influence of a biomarker as the cut-point varies. To summarize the overall information of an ROC curve, different summarizing indices have been proposed, including the Youden index [4] and the area under the ROC curve (AUC; [5]).
The Youden index, defined as the maximum vertical distance between the ROC curve and the 45 • line, is an *Correspondence: Tu.Xu@gilead.com 1 Gilead Sciences Inc., 333 Lakeside Dr, Foster City, CA 94404, USA Full list of author information is available at the end of the article indicator of how far the ROC curve is from the uninformative test [3]. Normally, it ranges from 0 to 1 with 0 for an uninformative test and 1 for an ideal test. The Youden index has been successfully applied in many clinical studies and served as an appropriate summary for the diagnostic accuracy of a single quantitative measurement (e.g., [2,6,7]).
It has been widely accepted by medical researchers that diagnosis based on one single biomarker may not provide sufficient accuracy [8,9]. Consequently, it is becoming more and more common that multiple biomarker tests are performed on each individual, and the corresponding measurements are combined into one single score to help clinicians make better diagnostic judgment. In literature, various statistical modeling strategies have been proposed to combine biomarkers in a linear fashion. For instance, Su and Liu [10] derived the analytical results of optimal linear combination based on AUC under multivariate normal assumption. Pepe and Thompson [11] proposed to relax the distributional assumption and perform a grid search for the optimal linear combination, while its computation becomes expensive when the number of biomarkers gets large. Recently, a number of alternatives were proposed to alleviate the computational burden. For instances, the min-max approach [12] combines only the minimum and maximum values of biomarker measurements linearly; the stepwise approach [13] combines all biomarker measurements in a stepwise manner. By targeting directly on the optimal diagnostic accuracy, Yin and Tian [14] extended these two methods to optimize the Youden index and demonstrated their improved performance in a number of numerical examples.
In recent years, nonlinear methods have been popularly employed to combine multiple biomarkers in various fields, including genotype classification [15], medical diagnosis [16], and treatment selection [17]. In this paper, a new model-free approach is proposed and formulated in a large margin classification framework, where the biomarkers are flexibly combined into one single diagnostic score so that the corresponding Youdex index [4] is maximized. Specifically, the combination function is modeled non-parametrically in a flexible reproducing kernel Hilbert space (RKHS; [18]), where both linear and nonlinear combinations could be accommodated via a pre-specified kernel function.
The rest of the paper is organized as follows. In Section 'Methods' , we provide some preliminary background of combining multiple biomarkers based on the Youden index. In Section 'Results and discussion' , we discuss the motivation for flexible combinations and formulate the proposed flexible approach in a framework of large margin classification for combining multiple biomarkers. In Section 'Results and discussion' , we conduct numerical experiments to demonstrate the advantages of the proposed approach, apply the proposed approach to a liver disorder study, and extend the proposed framework to incorporate the effect of covariates. Section 'Conclusions' contains some discussion.

Preliminaries
Suppose that every subject has m biomarker measurements X = (X (1) , X (2) , . . . , X (m) ) T with a probability density function f (X), where X (j) is a continuous measurement of the j-th biomarker. It also has a binary response variable Y ∈ {1, −1} indicating the subject is diseased or not. In literature, researchers from different fields [8,9,14] have discussed and explored the validity of combining m biomarker measurements into one single score function g(X) as a more powerful diagnostic tool. A subject is diagnosed as diseased if the combined score g(X) is higher than a given cut-point c, and non-diseased otherwise. To summarize its diagnostic accuracy, the Youden index is commonly used in practice. With sensitivity and specificity defined as sen(g, c) = Pr(g(X) ≥ c|Y = 1) and spe(g, c) = Pr(g(X) < c|Y = −1) respectively, the Youden index is formulated as The Youden index normally ranges from 0 to 1, where J = 1 corresponds to a perfect separation, and J = 0 corresponds to a random guess.
To estimate the Youden index, various modeling strategies have been proposed. Schisterman et al. [19] provided a closed form for the Youden index assuming the conditional distribution of X|Y = ±1 follows a multivariate Gaussian distribution. Further relaxing the distributional assumption, kernel smoothing techniques were adopted by Yin and Tian [14] and Fluss et al. [20], where the sensitivity and specificity were estimated in a nonparametric fashion.
Note that the formulation of J can be rewritten as where w(1) = 1/π, w(−1) = 1/(1 − π), π = Pr(Y = 1), and sign(u) = 1 if u ≥ 0 and −1 otherwise. Denote the ideal combination function g * (x) and cut-point c * as the ones that maximize J over all possible functionals and cutpoints. Following the proof of Proposition 1 in [21], the ideal g * (x) and c * must satisfy where p(x) = Pr(Y = 1|x) is the conditional probability of disease given the biomarker measurements.
The optimization in (3) is generally intractable without a specified candidate space of g. In literature, linear functional space g(x) = β T x is often used [10][11][12][13][14], mainly due to its convenient implementation and natural interpretation. Yet there seems to be lack of scientific support for the use of linear combination of biomarkers.
Consider a toy example, where π = 1/2, X|Y = 1 ∼ N 2 (1, 1) T , I 2 and X|Y = −1 ∼ N 2 (0, 0) T , I 2 , where I 2 is a 2-dimensional identity matrix. Then for any given x, However, if the biomarkers are heterocedastic in the positive and negative groups, the ideal combination would be no longer linear. For instance, when Clearly, the ideal combination of biomarkers is a (2) ) with c = log(2) − 1. Furthermore, if the conditional distribution X|Y is unknown, then the ideal combination of biomarkers may take various forms, and thus a prespecified assumption on linear combination can be too restrictive and lead to suboptimal combinations.

Model-free estimation formulation
To allow more flexible g(x) than linear functions, it is natural to optimize (3) over a bigger functional space consisting of nonlinear functions. The objective function in (3) can be simplified as However, it involves a sign operator, which makes it discontinuous in g and thus difficult to optimize in general [22]. To circumvent the difficulty, surrogate loss functions are often used to facilitate the computation, so that the estimation formulation becomes Popularly used surrogate loss functions include the hinge loss L(u) = (1 − u) + [23], the logistic loss L(u) = log(1 + exp(−u)) [24], the ψ-loss L(u) = min((1 − u) + , 1) [22,25], and the extended [26]. It is showed that all these surrogate loss functions are Fisher consistent in estimating the 0-1 loss 1 − sign(u). The general proofs are given in [22,27,28], and thus omitted here. Figure 1 displays the 0-1 loss, the hinge loss, the logistic loss, the ψ-loss, and the ψ 0.5 -loss as functions of u. For illustration, we focus on the ψ δ -loss in the sequel considering its extendability to a more flexible framework with covariate effects as discussed in Section 6.
With the ψ δ -loss, the proposed model-free estimation framework for (g(x), c) is formulated as where λ is a tuning parameter, H K is set as a RKHS associated with a pre-specified kernel function K(·, ·), and J (g) = 1 2 g 2 H K is the RKHS norm penalizing the complexity of g(x). The popular kernel functions include the linear kernel K(u,v) = u T v, the m-th order polynomial kernel K(u,v) = 1 + u T v m , and the Gaussian kernel When the linear kernel is used, the resultant H K contains all linear functions; when the Gaussian kernel is used, H K becomes much richer and admits more flexible nonlinear functions. More interestingly, the representer theorem [18] implies that the solution to (4) must be of the form g(x) = n i=1 a i K(x i , x), and thus g 2 H K = a T Ka with a = (a 1 , · · · , a n ) T and K = (K(x i , x j )) n i,j=1 . The representor theorem greatly simplifies the optimization task by turning the minimization over a functional space into whereã = (a T , c) T is an (n + 1)-dim vector. The minimization task in (5) involves a non-convex function L δ (·), and thus we employ the difference convex algorithm (DCA; [29]) to tackle the non-convex optimization task. The DCA decomposes the non-convex objective function in to the difference of two convex functions, and iteratively approximates it through a refined convex objective function. It has been widely used for non-convex optimization and delivers superior numerical performance [17,21,30]. Its computational complexity is of order o(log(1/ )n 3 ) [30], where denotes the computational precision. The details of solving (5) are similar to that in [21], and thus omitted here.

Simulation examples
This section examines the proposed estimation method for combining biomarkers in a number of simulated examples. The numerical performance of the proposed kernel machine estimation (KME) method is compared against some existing popular alternatives, including the min-max method (MMM) [12], the parametric method under multivariate normality assumption (MVN) [31], the non-parametric kernel smoothing method (KSM) with Gaussian kernel [14], the stepwise method (SWM) [13], and the other two classification methods in [15], the logistic regression (LR) and the classification tree (TREE).
For illustration, the kernel function used in all methods is set as the linear kernel K(z 1 , z 2 ) = z T 1 z 2 and the Gaussian kernel K(z 1 , z 2 ) = e − z 1 −z 2 2 /2τ 2 , where the scale parameter τ 2 is set as the median of pairwise Euclidean distances between the positive and negative instances within the training set [32]. The tuning parameter λ for our proposed method is selected by 5-fold cross validation that maximizes the empirical Youden index where I(·) is an indicator function and V k is the validation set of k-th folder. The maximization is conducted via a grid search, where the grid for selecting λ is set as 10 (s−41)/10 ; s = 1, · · · , 81 . The optimal solutions of MVN and KSM are searched by routine optim() in R as suggested in Ying and Tian [14]. SWM and MMM are based on the grid search with the same grid. TREE is tuned by default in R. Furthermore, for the proposed KME method, δ is set as 0.1 for all simulated examples as suggested in Hedayat et al. [26].
Four simulated examples are examined. Example 1 is similar to Example 5.1.1 in [14]. Example 2 modifies Example 1 by using multivariate Gamma distribution, which appears to be a popular model assumption in literature [19]. Examples 3 and 4 are similar to Setting 2 in [17] and Example II(b) in [33], which simulate data from logistic models with nonlinear effect terms n} is generated as follows. First, Y i is generated from Bernoulli(0.5). Second, if Y i = 1, then X i is generated from a multivariate gamma distribution with mean μ 1 = (0.55, 0.7, 0.85, 1) T and covariance matrix 1 = 0.25J 4 + diag(0.025, 0.1, 0.175, 0.25); if Y i = −1, then X i is generated from multivariate gamma distribution with mean μ 2 = (0.55, 0.55, 0.55,0.55) T and covariance matrix 2 = 0.025I 4 + 0.25J 4 . The multivariate gamma distributed samples are generated with normal copula.
n} is generated as follows. First, X i is generated from t 4 μ, , where μ = (0, 0, 0, 0) T and = I 4 . Second, Y i is generated from a logistic model with logit(p(x)) = 8 sin(0.5πx (1) ) + cos(πx (1) x (2) In all examples, the sample sizes for training n tr and testing n te are set as n tr = 100, 250, 500 and n te = 2000, respectively. Each scenario is replicated 100 times. The averaged empirical Youden indexĴ estimated on the testing sets, as well as the corresponding standard deviations, are summarized in Table 1.
It is evident that our proposed methods, linear kernel machine estimation method (LKME) and Gaussian kernel machine estimation method (GKME), yield competitive performance in all examples. The performance of MVN,  SWM, and LR is competitive in Example 1 as the data within each class indeed follows a Gaussian distribution sharing a common covariance structure, and thus the linear combination is optimal. Their performance becomes less competitive in other examples when linear combination is no longer optimal. It is evident that in Examples 3 and 4, with nonlinear patterns specified, the GKME outperforms all other methods. Especially, in Example 4, the performance of GKME is outstanding due to a strong nonlinear pattern specified. In general, the performance of KSM is less competitive. It could be due to the overfitting issue when applying the Gaussian kernel to estimate sensitivity and specificity. With similar exhaustive grid search, the performance of SWM is better than MMM in Examples 1 and 4 but worse in Examples 2 and 3. As for the two classification methods, LR yields competitive performance in Examples 1 and 2 and becomes less competitive when logistic models with nonlinear patterns are applied in Examples 3 and 4. The performance of TREE is modest considering the nature of recursive partition. Furthermore, it is of interest to conduct a numerical comparison on the performance of various surrogate loss functions in estimating the Youden index J. Figure 2 displays their estimated empirical Youden indexĴ in Example 3 with training size 500 over 100 replications. It is evident that the performances of all loss functions are similar.

Real application
In this section, our proposed method is applied to a study of liver disorder. The dataset consists of 345 male Fig. 2 The boxplot of the empirical Youden index J for the hinge loss, the logistic loss, ψ-loss, and ψ 0.1 -loss in Example 3 with n tr = 500 over 100 replications subjects with 200 subjects in the control group and 145 subjects in the case group. For each subject, there are five blood tests (mean corpuscular volume, alkaline phosphotase, alamine aminotransferase, aspartate aminotransferase, and gamma-glutamyl transpeptidase) which are thought to be sensitive to liver disorders that may be related to excessive alcohol consumption, and another covariate with the average daily alcoholic beverages consumption information. The corresponding empirical estimates of the Youden index of all six markers are 0.141, 0.178, 0.174, 0.144, 0.240, and 0.121, respectively. The dataset was created by BUPA Medical Research Ltd., and is publicly available at University of California at Irvine Machine Learning Repository (https://archive.ics. uci.edu/ml/datasets/Liver+Disorders).
The total 345 samples are randomly split into a training set of 200 samples and a testing set of 145 samples. We also set δ = 0.1 and select the tuning parameter λ by 5-fold cross validation targeting on maximizing (6). The experiment is replicated 100 times, and Fig. 3 summarizes the averaged performance measures of our proposed method, MMM, MVN, KSM, SWM, LR, and TREE.
It is evident that our proposed method delivers competitive performance in comparison with other methods. It is also interesting to notice the significant improvement on diagnostic accuracy by combining biomakers nonlinearly. It is encouraging to note that our proposed methods with Gaussian kernel outperforms all other methods.

Combining biomarkers with covariate-adjusted formulation
In many situations, the accuracy of diagnostic tests could be largely influenced by various factors, which To incorporate the effect of covariates, a natural idea is to consider personalized cut-point function c(z) as proposed in [21]. The covariate-adjusted formulation of J is then expressed as (7) where w(1, z) = 1/π z , w(−1, z) = 1/(1 − π z ), and π z = Pr(Y = 1|Z = z). Under this extended framework, the hinge loss, the logistic loss, and the ψ-loss are not longer Fisher consistent in estimating sign(g(x) − c(z)), as the candidate function is restricted to the form of g(x) − c(z) [26]. Proposition 1 shows that the surrogate ψ δ -loss can still achieve the Fisher consistency when δ approaches 0. Proposition 1. Denote D g,c, = {x : g(x) − c(z) ≥ 0 and |p z (x) − π z | ≥ }, wherex = (x, z), and p z (x) = Pr(Y = 1|x). Given any > 0, let (g * δ , c * δ ) = argmin g,c E w(Y , Z)L δ (Y (g(X) − c(Z))) , then as δ → 0, where denotes the symmetric difference of two sets.
With the surrogate ψ δ -loss, the covariate-adjusted estimation formulation becomes min g∈H K 1 ,c∈H K 2 1 n n i=1ŵ i (y i , z i )L δ y i g(x i ) − c(z i ) where K 1 (·, ·) and K 2 (·, ·) are two per-specified kernel functions, H K 1 and H K 2 are their corresponding RKHS's, and g 2 K 1 and c 2 K 2 are the corresponding RKHS norms. The optimization in (8) can be solved by DCA as for the population-based framework, and the details are omitted here.

Conclusions
This paper proposes a flexible model-free framework for combining multiple biomarkers. As opposed to most existing methods focusing on the optimal linear combinations, the framework admits both linear and nonlinear combinations. The superior numerical performance of the proposed approach is demonstrated in a number of simulated examples and a real application to the liver disorder study, especially when the sample size is relatively large. Furthermore, the proposed method is especially efficient with a relatively large number of biomarkers present, where most existing methods relying on grid search are often inefficient. An extension of the proposed framework to the covariate-adjusted formulation is also included.
Further development on estimating confidence interval using perturbation resampling procedure [34] and variable selection for biomarkers are still under investigation.

Additional file
Additional file 1: The Appendix includes the proof of Proposition 1. (PDF 28.6 kb)