Open Access
Open Peer Review

This article has Open Peer Review reports available.

How does Open Peer Review work?

Flexible combination of multiple diagnostic biomarkers to improve diagnostic accuracy

BMC Medical Research Methodology201515:94

Received: 11 July 2015

Accepted: 17 October 2015

Published: 31 October 2015



In medical research, it is common to collect information of multiple continuous biomarkers to improve the accuracy of diagnostic tests. Combining the measurements of these biomarkers into one single score is a popular practice to integrate the collected information, where the accuracy of the resultant diagnostic test is usually improved. To measure the accuracy of a diagnostic test, the Youden index has been widely used in literature. Various parametric and nonparametric methods have been proposed to linearly combine biomarkers so that the corresponding Youden index can be optimized. Yet there seems to be little justification of enforcing such a linear combination.


This paper proposes a flexible approach that allows both linear and nonlinear combinations of biomarkers. The proposed approach formulates the problem in a large margin classification framework, where the combination function is embedded in a flexible reproducing kernel Hilbert space.


Advantages of the proposed approach are demonstrated in a variety of simulated experiments as well as a real application to a liver disorder study.


Linear combination of multiple diagnostic biomarkers are widely used without proper justification. Additional research on flexible framework allowing both linear and nonlinear combinations is in need.


Biomarker Diagnostic accuracy Margin Receiver operating characteristic curve Reproducing kernel Hilbert space Youden index


In medical research, continuous biomarkers have been commonly explored as diagnostic tools to distinguish subjects, such as diseased and non-diseased groups [1]. The accuracy of a diagnostic test is usually evaluated through sensitivity and specificity, or the probabilities of true positive and true negative for any given cut-point. Particularly, the receiver operating characteristic (ROC) curve is defined as sensitivity versus 1−specificity over all possible cut-points for a given biomarker [2, 3], which is a comprehensive plot that displays the influence of a biomarker as the cut-point varies. To summarize the overall information of an ROC curve, different summarizing indices have been proposed, including the Youden index [4] and the area under the ROC curve (AUC; [5]).

The Youden index, defined as the maximum vertical distance between the ROC curve and the 45° line, is an indicator of how far the ROC curve is from the uninformative test [3]. Normally, it ranges from 0 to 1 with 0 for an uninformative test and 1 for an ideal test. The Youden index has been successfully applied in many clinical studies and served as an appropriate summary for the diagnostic accuracy of a single quantitative measurement (e.g., [2, 6, 7]).

It has been widely accepted by medical researchers that diagnosis based on one single biomarker may not provide sufficient accuracy [8, 9]. Consequently, it is becoming more and more common that multiple biomarker tests are performed on each individual, and the corresponding measurements are combined into one single score to help clinicians make better diagnostic judgment. In literature, various statistical modeling strategies have been proposed to combine biomarkers in a linear fashion. For instance, Su and Liu [10] derived the analytical results of optimal linear combination based on AUC under multivariate normal assumption. Pepe and Thompson [11] proposed to relax the distributional assumption and perform a grid search for the optimal linear combination, while its computation becomes expensive when the number of biomarkers gets large. Recently, a number of alternatives were proposed to alleviate the computational burden. For instances, the min-max approach [12] combines only the minimum and maximum values of biomarker measurements linearly; the stepwise approach [13] combines all biomarker measurements in a stepwise manner. By targeting directly on the optimal diagnostic accuracy, Yin and Tian [14] extended these two methods to optimize the Youden index and demonstrated their improved performance in a number of numerical examples.

In recent years, nonlinear methods have been popularly employed to combine multiple biomarkers in various fields, including genotype classification [15], medical diagnosis [16], and treatment selection [17]. In this paper, a new model-free approach is proposed and formulated in a large margin classification framework, where the biomarkers are flexibly combined into one single diagnostic score so that the corresponding Youdex index [4] is maximized. Specifically, the combination function is modeled non-parametrically in a flexible reproducing kernel Hilbert space (RKHS; [18]), where both linear and nonlinear combinations could be accommodated via a pre-specified kernel function.

The rest of the paper is organized as follows. In Section ‘Methods’, we provide some preliminary background of combining multiple biomarkers based on the Youden index. In Section ‘Results and discussion’, we discuss the motivation for flexible combinations and formulate the proposed flexible approach in a framework of large margin classification for combining multiple biomarkers. In Section ‘Results and discussion’, we conduct numerical experiments to demonstrate the advantages of the proposed approach, apply the proposed approach to a liver disorder study, and extend the proposed framework to incorporate the effect of covariates. Section ‘Conclusions’ contains some discussion.



Suppose that every subject has m biomarker measurements X=(X (1),X (2),…,X (m)) T with a probability density function f(X), where X (j) is a continuous measurement of the j-th biomarker. It also has a binary response variable Y{1,−1} indicating the subject is diseased or not. In literature, researchers from different fields [8, 9, 14] have discussed and explored the validity of combining m biomarker measurements into one single score function g(X) as a more powerful diagnostic tool. A subject is diagnosed as diseased if the combined score g(X) is higher than a given cut-point c, and non-diseased otherwise. To summarize its diagnostic accuracy, the Youden index is commonly used in practice. With sensitivity and specificity defined as sen(g,c)=P r(g(X)≥c|Y=1) and spe(g,c)=P r(g(X)<c|Y=−1) respectively, the Youden index is formulated as
$$J= \max_{g,c}~\{\text{sen}(g,c) + \text{spe}(g,c)- 1\}. $$

The Youden index normally ranges from 0 to 1, where J=1 corresponds to a perfect separation, and J=0 corresponds to a random guess.

To estimate the Youden index, various modeling strategies have been proposed. Schisterman et al. [19] provided a closed form for the Youden index assuming the conditional distribution of X|Y=±1 follows a multivariate Gaussian distribution. Further relaxing the distributional assumption, kernel smoothing techniques were adopted by Yin and Tian [14] and Fluss et al. [20], where the sensitivity and specificity were estimated in a nonparametric fashion.

Note that the formulation of J can be rewritten as
$$\begin{array}{@{}rcl@{}} J&=& \max_{g,c}~w(1)Pr \big(g(\textbf{X}) \geq c, Y=1 \big) + w(-1)\\ &&Pr \big(g(\textbf{X})< c, Y = -1 \big) -1 \\ &=& \max_{g,c}~ \frac{1}{2} E\Big(w(Y) \big(1 + Y \text{sign}(g(\textbf{X})-c)\big)\Big)-1, \end{array} $$
where w(1)=1/π, w(−1)=1/(1−π), π=P r(Y=1), and sign(u)=1 if u≥0 and −1 otherwise. Denote the ideal combination function g (x) and cut-point c as the ones that maximize J over all possible functionals and cut-points. Following the proof of Proposition 1 in [21], the ideal g (x) and c must satisfy
$$ \text{sign}(g^{*}(\textbf{x}) - c^{*}) = \text{sign}\left(p(\textbf{x}) - \pi\right), $$

where p(x)=P r(Y=1|x) is the conditional probability of disease given the biomarker measurements.

Linear or nonlinear combination

In (2), the ideal g (x) and c are defined based on p(x) that is often unavailable in practice. Hence the expectation in (1) needs to be estimated based on the given sample \((\textbf {x}_{i},y_{i})_{i\,=\,1}^{n}\). Specifically, a natural estimate \(\hat J\) can be obtained as
$$\begin{array}{@{}rcl@{}} \hat J &=& \max_{g,c}~ \frac{1}{2n} \sum_{i\,=\,1}^{n} \hat w(y_{i})(1 + y_{i} \text{sign}(g(\textbf{x}_{i})-c)) -1 \\ &=& \max_{g,c}~ \frac{1}{2|{\mathcal{S}}_{1}|}\sum\limits_{i \in {\mathcal{S}}_{1}} (1+\text{sign}(g(\textbf{x}_{i})-c))\\ &&+\, \frac{1}{2|{\mathcal{S}}_{-1}|}\sum\limits_{i \in {\mathcal{S}}_{-1}} (1-\text{sign}(g(\textbf{x}_{i})-c))-1, \end{array} $$

where \(\hat w(1)=1/\hat \pi =n/|{\mathcal {S}}_{1}|\), \(\hat w(-1)=n/|{\mathcal {S}}_{-1}|\), \({\mathcal {S}}_{1} = \{i:y_{i} = 1\}\), \({\mathcal {S}}_{-1} = \{i: y_{i} = -1\}\), and |·| denotes the set cardinality.

The optimization in (3) is generally intractable without a specified candidate space of g. In literature, linear functional space g(x)=β T x is often used [1014], mainly due to its convenient implementation and natural interpretation. Yet there seems to be lack of scientific support for the use of linear combination of biomarkers.

Consider a toy example, where π=1/2, \(\textbf {X}|Y=1 \sim N_{2}\!\left ((1,1)^{T},I_{2}\right)\) and \(\textbf {X}|Y=-1 \sim N_{2}\!\left ((0,0)^{T},I_{2}\right)\), where I 2 is a 2-dimensional identity matrix. Then for any given x,
$$ p(\textbf{x}) \,=\, \frac{f(\textbf{x}|Y=1)}{f(\textbf{x}|Y=1)+f(\textbf{x}|Y=-1)} = \frac{1}{1+e^{1-\left(x_{(1)}\,+\,\,x_{(2)}\right)}}. $$
where \(\textbf {x}=\left (x_{(1)},x_{(2)}\right)^{T}\). Thus, the ideal combination of biomarkers g (x) can take the linear form g (x)=x 1+x 2, leading to sign(g (x)−c)=sign(p(x)−1/2) with c=1. However, if the biomarkers are heterocedastic in the positive and negative groups, the ideal combination would be no longer linear. For instance, when \(\textbf {X}|Y=1 \sim N_{2}\left ((1,1)^{T},I_{2}\right)\) but \(\textbf {X}|Y=-1 \sim N_{2}\left ((0,0)^{T},2I_{2}\right)\),
$$\begin{array}{*{20}l} p(\textbf{x}) &= \frac{f(\textbf{x}|Y=1)}{f(\textbf{x}|Y=1)+f(\textbf{x}|Y=-1)}\\ &= \frac{2}{2+e^{1-\left(x_{(1)}\,+\,x_{(2)}\right)\,+\,\left(x_{(1)}^{2}\,+\,x_{(2)}^{2}\right)/4}}. \end{array} $$

Clearly, the ideal combination of biomarkers is a quadratic function \({g^{*}(\textbf {x})= \frac {x_{(1)}^{2}+x_{(2)}^{2}}{4}-(x_{(1)}+x_{(2)})}\) with \({c=\log (2) - 1}\). Furthermore, if the conditional distribution X|Y is unknown, then the ideal combination of biomarkers may take various forms, and thus a pre-specified assumption on linear combination can be too restrictive and lead to suboptimal combinations.

Model-free estimation formulation

To allow more flexible g(x) than linear functions, it is natural to optimize (3) over a bigger functional space consisting of nonlinear functions. The objective function in (3) can be simplified as
$$\min_{g,c} ~ \frac{1}{n} \sum_{i=1}^{n} \hat w(y_{i}) \big (1 - \text{sign}(u_{i}) \big), $$
where u i =y i (g(x i )−c). However, it involves a sign operator, which makes it discontinuous in g and thus difficult to optimize in general [22]. To circumvent the difficulty, surrogate loss functions are often used to facilitate the computation, so that the estimation formulation becomes
$$\min_{g,c} \ \frac{1}{n} \sum_{i=1}^{n} \hat w(y_{i}) L(u_{i}). $$
Popularly used surrogate loss functions include the hinge loss L(u)=(1−u)+ [23], the logistic loss \({L(u)=\log (1+\exp (-u))}\) [24], the ψ-loss \(L(u) = \min ((1-u)_{+},1)\) [22, 25], and the extended ψ δ -loss \(L_{\delta }(u)=\min \left \lbrace \frac {1}{\delta }(\delta -u)_{+},1 \right \rbrace \) [26]. It is showed that all these surrogate loss functions are Fisher consistent in estimating the 0-1 loss 1−sign(u). The general proofs are given in [22, 27, 28], and thus omitted here. Figure 1 displays the 0-1 loss, the hinge loss, the logistic loss, the ψ-loss, and the ψ 0.5-loss as functions of u. For illustration, we focus on the ψ δ -loss in the sequel considering its extendability to a more flexible framework with covariate effects as discussed in Section 6.
Fig. 1

Various loss functions, including the 0–1 loss, the hinge loss, the logistic loss, the ψ loss and the ψ 0.5 loss

With the ψ δ -loss, the proposed model-free estimation framework for (g(x),c) is formulated as
$$ \min_{g \in {\mathcal{H}}_{K},c \in \mathbb{R}} \frac{1}{n} \sum_{i=1}^{n} \hat{w}(y_{i}) L_{\delta}(y_{i}(g(\textbf{x}_{i})-c)) + \lambda {\mathcal{J}}(g), $$

where λ is a tuning parameter, \({\mathcal {H}}_{K}\) is set as a RKHS associated with a pre-specified kernel function K(·,·), and \({\mathcal {J}}(g)=\frac {1}{2} \|g\|^{2}_{{\mathcal {H}}_{K}}\) is the RKHS norm penalizing the complexity of g(x). The popular kernel functions include the linear kernel K(u,v) = u T v, the m-th order polynomial kernel \(K(\textbf {u,v})=\left (1+\textbf {u}^{T} \textbf {v}\right)^{m}\), and the Gaussian kernel \(K(\textbf {u,v})=\exp \left \{-\|\textbf {u}-\textbf {v}\|^{2}/2\tau ^{2} \right \}\) with a scale parameter τ 2. When the linear kernel is used, the resultant \({\mathcal {H}}_{K}\) contains all linear functions; when the Gaussian kernel is used, \({\mathcal {H}}_{K}\) becomes much richer and admits more flexible nonlinear functions.

More interestingly, the representer theorem [18] implies that the solution to (4) must be of the form \(\hat g(\textbf {x}) = \sum _{i=1}^{n} a_{i} K(\textbf {x}_{i},\textbf {x})\), and thus \(\| g\|_{{\mathcal {H}}_{K}}^{2}=\textbf {a}^{T} \textbf {K} \textbf {a}\) with a=(a 1,,a n ) T and \(\textbf {K}=(K(\textbf {x}_{i},\textbf {x}_{j}))_{i,j=1}^{n}\). The representor theorem greatly simplifies the optimization task by turning the minimization over a functional space into the minimization over a finite-dimensional vector space. Specifically, the minimization task in (4) becomes
$$\begin{array}{*{20}l} \min_{a \in \mathbb{R}^{n},c \in \mathbb{R}}~ s(\tilde{\mathbf{{a}}}) &= \frac{1}{n} \sum_{i=1}^{n} \hat{w}_{i}(y_{i}) L_{\delta} \left(y_{i} \left(\sum_{j=1}^{n} a_{j} K(\textbf{x}_{i},\textbf{x}_{j})-c \right) \right)\\ &\quad+ \frac{\lambda}{2} \textbf{a}^{T} \textbf{K a}, \end{array} $$

where \(\tilde {\mathbf {a}}=(\mathbf {a}^{T},c)^{T}\) is an (n+1)-dim vector.

The minimization task in (5) involves a non-convex function L δ (·), and thus we employ the difference convex algorithm (DCA; [29]) to tackle the non-convex optimization task. The DCA decomposes the non-convex objective function in to the difference of two convex functions, and iteratively approximates it through a refined convex objective function. It has been widely used for non-convex optimization and delivers superior numerical performance [17, 21, 30]. Its computational complexity is of order o(l o g(1/ε)n 3) [30], where ε denotes the computational precision. The details of solving (5) are similar to that in [21], and thus omitted here.

Results and discussion

Simulation examples

This section examines the proposed estimation method for combining biomarkers in a number of simulated examples. The numerical performance of the proposed kernel machine estimation (KME) method is compared against some existing popular alternatives, including the min-max method (MMM) [12], the parametric method under multivariate normality assumption (MVN) [31], the non-parametric kernel smoothing method (KSM) with Gaussian kernel [14], the stepwise method (SWM) [13], and the other two classification methods in [15], the logistic regression (LR) and the classification tree (TREE).

For illustration, the kernel function used in all methods is set as the linear kernel \(K(\textbf {z}_{1},\textbf {z}_{2})=\textbf {z}_{1}^{T} \textbf {z}_{2}\) and the Gaussian kernel \(K(\textbf {z}_{1},\textbf {z}_{2})=e^{-\|\textbf {z}_{1}-\textbf {z}_{2}\|^{2}/2 \tau ^{2}}\), where the scale parameter τ 2 is set as the median of pairwise Euclidean distances between the positive and negative instances within the training set [32]. The tuning parameter λ for our proposed method is selected by 5-fold cross validation that maximizes the empirical Youden index
$${\kern20pt} \begin{aligned} \tilde{J} &= \frac{1}{5}\sum\limits_{k = 1}^{5} \left(\frac{\sum\limits_{i \in V_{k}} I(y_{i}=-1)I(\hat{g}(\textbf{x}_{i}) \leq c)}{\sum\limits_{i \in V_{k}} I(y_{i}=-1)}\right.\\ &\quad \left. - \frac{\sum\limits_{i \in V_{k}} I(y_{i}=1) I(\hat{g}(\textbf{x}_{i}) \leq c)}{\sum\limits_{i \in V_{k}} I(y_{i}=1)}\right), \end{aligned} $$

where I(·) is an indicator function and V k is the validation set of k-th folder. The maximization is conducted via a grid search, where the grid for selecting λ is set as \(\left \{10^{(s-41)/10}; s=1,\cdots,81\right \}\). The optimal solutions of MVN and KSM are searched by routine optim() in R as suggested in Ying and Tian [14]. SWM and MMM are based on the grid search with the same grid. TREE is tuned by default in R. Furthermore, for the proposed KME method, δ is set as 0.1 for all simulated examples as suggested in Hedayat et al. [26].

Four simulated examples are examined. Example 1 is similar to Example 5.1.1 in [14]. Example 2 modifies Example 1 by using multivariate Gamma distribution, which appears to be a popular model assumption in literature [19]. Examples 3 and 4 are similar to Setting 2 in [17] and Example II(b) in [33], which simulate data from logistic models with nonlinear effect terms

Example 1.

A random sample {(X i ,Y i );i=1,,n} is generated as follows. First, Y i is generated from Bernoulli(0.5). Second, if Y i =1, then X i is generated from MVN(μ 1,Σ 1), where μ 1=(0.4,1.0,1.5,1.2) T and Σ 1=0.3I 4 + 0.7J 4 with I 4 a 4-dimensional identity matrix and J 4 a 4×4 matrix of all 1’s; if Y i =−1, then X i is generated from MVN(μ 2,Σ 1) with μ 2=(0,0,0,0) T .

Example 2.

A random sample {(X i ,Y i );i=1,,n} is generated as follows. First, Y i is generated from Bernoulli(0.5). Second, if Y i =1, then X i is generated from a multivariate gamma distribution with mean μ 1=(0.55,0.7,0.85,1) T and covariance matrix Σ 1=0.25J 4+diag(0.025,0.1,0.175,0.25); if Y i =−1, then X i is generated from multivariate gamma distribution with mean μ 2=(0.55,0.55,0.55,0.55) T and covariance matrix Σ 2=0.025I 4+0.25J 4. The multivariate gamma distributed samples are generated with normal copula.

Example 3.

A random sample {(X i ,Y i );i=1,,n} is generated as follows. First, X i is generated from MVN(μ,Σ), where μ=(0,0,0,0) T and Σ=0.3I 4+0.7J 4. Second, Y i is generated from a logistic model with \(\text {logit}(p(\textbf {x})) = x_{(1)} + x_{(2)}^{2} + x_{(3)}^{3} + x_{(4)}^{4} - 1.5\).

Example 4.

A random sample {(X i ,Y i );i=1,,n} is generated as follows. First, X i is generated from t 4(μ,Σ), where μ=(0,0,0,0) T and Σ=I 4. Second, Y i is generated from a logistic model with \(\text {logit}(p(\textbf {x})) = 8\Big (\text {sin}(0.5\pi x_{(1)}) + \text {cos}(\pi x_{(1)} x_{(2)})+ x_{(3)}^{2}\) + \(3x_{(3)} x_{(4)} + x_{(4)}^{2}\Big).\)

In all examples, the sample sizes for training n tr and testing n te are set as n tr =100,250,500 and n te =2000, respectively. Each scenario is replicated 100 times. The averaged empirical Youden index \(\hat {J}\) estimated on the testing sets, as well as the corresponding standard deviations, are summarized in Table 1.
Table 1

Simulation examples: estimated means and standard deviations (in parentheses) of the empirical Youden index J over 100 replications





Example 1



0.604 (0.0042)

0.628 (0.0019)

0.641 (0.0018)


0.572 (0.0063)

0.604 (0.0029)

0.623 (0.0023)


0.455 (0.0032)

0.470 (0.0021)

0.483 (0.0020)


0.633 (0.0018)

0.638 (0.0014)

0.647 (0.0012)


0.388 (0.0180)

0.458 (0.0104)

0.490 (0.0106)


0.555 (0.0065)

0.594 (0.0044)

0.611 (0.0035)


0.628 (0.0022)

0.639 (0.0017)

0.646 (0.0017)


0.490 (0.0068)

0.525 (0.0047)

0.559 (0.0029)

Example 2



0.636 (0.0075)

0.690 (0.0025)

0.710 (0.0015)


0.612 (0.0054)

0.654 (0.0045)

0.696 (0.0016)


0.609 (0.0033)

0.622 (0.0025)

0.622 (0.0022)


0.573 (0.0065)

0.571 (0.0047)

0.563 (0.0040)


0.214 (0.0281)

0.046 (0.0164)

0.047 (0.0171)


0.447 (0.0094)

0.426 (0.0078)

0.429 (0.0065)


0.648 (0.0054)

0.675 (0.0028)

0.678 (0.0025)


0.433 (0.0052)

0.512 (0.0039)

0.555 (0.0036)

Example 3



0.296 (0.0091)

0.367 (0.0053)

0.389 (0.0049)


0.511 (0.0052)

0.568 (0.0028)

0.592 (0.0022)


0.423 (0.0035)

0.434 (0.0021)

0.443 (0.0018)


0.344 (0.0050)

0.371 (0.0045)

0.377 (0.0041)


0.192 (0.0085)

0.193 (0.0084)

0.202 (0.0086)


0.370 (0.0057)

0.406 (0.0028)

0.417 (0.0025)


0.307 (0.0043)

0.316 (0.0030)

0.320 (0.0026)


0.424 (0.0059)

0.477 (0.0042)

0.528 (0.0031)

Example 4



0.103 (0.0102)

0.150 (0.0098)

0.209 (0.0089)


0.529 (0.0078)

0.626 (0.0050)

0.682 (0.0028)


0.184 (0.0084)

0.227 (0.0034)

0.236 (0.0026)


0.109 (0.0071)

0.152 (0.0056)

0.189 (0.0054)


0.188 (0.0050)

0.213 (0.0035)

0.220 (0.0028)


0.255 (0.0078)

0.293 (0.0050)

0.307 (0.0039)


0.002 (0.0023)

0.004 (0.0008)

0.011 (0.0007)


0.257 (0.0143)

0.364 (0.0111)

0.368 (0.0101)

It is evident that our proposed methods, linear kernel machine estimation method (LKME) and Gaussian kernel machine estimation method (GKME), yield competitive performance in all examples. The performance of MVN, SWM, and LR is competitive in Example 1 as the data within each class indeed follows a Gaussian distribution sharing a common covariance structure, and thus the linear combination is optimal. Their performance becomes less competitive in other examples when linear combination is no longer optimal. It is evident that in Examples 3 and 4, with nonlinear patterns specified, the GKME outperforms all other methods. Especially, in Example 4, the performance of GKME is outstanding due to a strong nonlinear pattern specified. In general, the performance of KSM is less competitive. It could be due to the over-fitting issue when applying the Gaussian kernel to estimate sensitivity and specificity. With similar exhaustive grid search, the performance of SWM is better than MMM in Examples 1 and 4 but worse in Examples 2 and 3. As for the two classification methods, LR yields competitive performance in Examples 1 and 2 and becomes less competitive when logistic models with nonlinear patterns are applied in Examples 3 and 4. The performance of TREE is modest considering the nature of recursive partition.

Furthermore, it is of interest to conduct a numerical comparison on the performance of various surrogate loss functions in estimating the Youden index J. Figure 2 displays their estimated empirical Youden index \(\hat {J}\) in Example 3 with training size 500 over 100 replications. It is evident that the performances of all loss functions are similar.
Fig. 2

The boxplot of the empirical Youden index J for the hinge loss, the logistic loss, ψ-loss, and ψ 0.1-loss in Example 3 with n tr =500 over 100 replications

Real application

In this section, our proposed method is applied to a study of liver disorder. The dataset consists of 345 male subjects with 200 subjects in the control group and 145 subjects in the case group. For each subject, there are five blood tests (mean corpuscular volume, alkaline phosphotase, alamine aminotransferase, aspartate aminotransferase, and gamma-glutamyl transpeptidase) which are thought to be sensitive to liver disorders that may be related to excessive alcohol consumption, and another covariate with the average daily alcoholic beverages consumption information. The corresponding empirical estimates of the Youden index of all six markers are 0.141, 0.178, 0.174, 0.144, 0.240, and 0.121, respectively. The dataset was created by BUPA Medical Research Ltd., and is publicly available at University of California at Irvine Machine Learning Repository ( ).

The total 345 samples are randomly split into a training set of 200 samples and a testing set of 145 samples. We also set δ=0.1 and select the tuning parameter λ by 5-fold cross validation targeting on maximizing (6). The experiment is replicated 100 times, and Fig. 3 summarizes the averaged performance measures of our proposed method, MMM, MVN, KSM, SWM, LR, and TREE.
Fig. 3

Real application: boxplot of the empirical Youden index J over 100 replications

It is evident that our proposed method delivers competitive performance in comparison with other methods. It is also interesting to notice the significant improvement on diagnostic accuracy by combining biomakers nonlinearly. It is encouraging to note that our proposed methods with Gaussian kernel outperforms all other methods.

Combining biomarkers with covariate-adjusted formulation

In many situations, the accuracy of diagnostic tests could be largely influenced by various factors, which population-based cut-point c does not take into account. To incorporate the effect of covariates, a natural idea is to consider personalized cut-point function c(z) as proposed in [21]. The covariate-adjusted formulation of J is then expressed as
$$\begin{array}{@{}rcl@{}} J &=& \max_{g,c}~ \frac{1}{2} E\Big(w(Y,\textbf{Z}) \big(1 + Y \text{sign}(g(\textbf{X})-c(\textbf{Z})\big)\Big)-1, \end{array} $$

where w(1,z)=1/π z , w(−1,z)=1/(1−π z ), and π z =P r(Y=1|Z=z).

Under this extended framework, the hinge loss, the logistic loss, and the ψ-loss are not longer Fisher consistent in estimating sign(g(x)−c(z)), as the candidate function is restricted to the form of g(x)−c(z) [26]. Proposition 1 shows that the surrogate ψ δ -loss can still achieve the Fisher consistency when δ approaches 0.

Proposition 1.

Denote \({\mathcal {D}}_{g,c,\epsilon } = \{\tilde {\mathbf {x}}: g(\textbf {x})-c(\textbf {z}) \geq 0 \ \text {and} \ |p_{\textbf {z}}(\textbf {x}) -\pi _{\textbf {z}}| \geq \epsilon \}\), where \(\tilde {\textbf {x}} = (\textbf {x},\textbf {z})\), and \(p_{\textbf {z}}(\textbf {x}) = Pr(Y=1|\tilde {\mathbf {x}}) \). Given any ε>0, let \((g^{*}_{\delta }, c^{*}_{\delta })={\textit {argmin}}_{g,c} E \big (w(Y,\textbf {Z}) L_{\delta }(Y(g(\textbf {X})-c(\textbf {Z}))) \big)\), then as \(\delta \rightarrow 0\),
$$Pr\left({\mathcal{D}}_{g^{*}_{\delta}, c^{*}_{\delta},\epsilon} \triangle {\mathcal{D}}_{g^{*},c^{*},\epsilon}\right) \rightarrow 0, $$
where denotes the symmetric difference of two sets.
With the surrogate ψ δ -loss, the covariate-adjusted estimation formulation becomes
$$\begin{array}{@{}rcl@{}} \min_{g \in {\mathcal{H}}_{K_{1}}, c \in {\mathcal{H}}_{K_{2}}} \frac{1}{n} \sum_{i=1}^{n}& \hat{w}_{i}(y_{i},\textbf{z}_{i}) L_{\delta} \left(y_{i} \left (g(\textbf{x}_{i})-c(\textbf{z}_{i}) \right) \right)\\ &+ \frac{\lambda}{2} \left(\|g\|_{K_{1}}^{2} + \|c\|_{K_{2}}^{2}\right), \end{array} $$

where K 1(·,·) and K 2(·,·) are two per-specified kernel functions, \({\mathcal {H}}_{K_{1}}\) and \({\mathcal {H}}_{K_{2}}\) are their corresponding RKHS’s, and \(\|g\|_{K_{1}}^{2}\) and \(\|c\|_{K_{2}}^{2}\) are the corresponding RKHS norms. The optimization in (8) can be solved by DCA as for the population-based framework, and the details are omitted here.


This paper proposes a flexible model-free framework for combining multiple biomarkers. As opposed to most existing methods focusing on the optimal linear combinations, the framework admits both linear and nonlinear combinations. The superior numerical performance of the proposed approach is demonstrated in a number of simulated examples and a real application to the liver disorder study, especially when the sample size is relatively large. Furthermore, the proposed method is especially efficient with a relatively large number of biomarkers present, where most existing methods relying on grid search are often inefficient. An extension of the proposed framework to the covariate-adjusted formulation is also included. Further development on estimating confidence interval using perturbation resampling procedure [34] and variable selection for biomarkers are still under investigation.



JW’s research is partly supported by HK GRF Grant 11302615, CityU SRG Grant 7004244 and CityU Startup Grant 7200380. The authors thank the Associate Editor and two referees for their constructive comments and suggestions.

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated.

Authors’ Affiliations

Gilead Sciences Inc.
Division of Biostatistics, Department of Population Health, New York University
Astellas Pharma Inc.
Department of Mathematics, City University of Hong Kong


  1. Shapiro D. The interpretation of diagnostic tests. Stat Methods Med Res. 1999; 8:113–34.View ArticlePubMedGoogle Scholar
  2. Zhou X, McClish D, Obuchowski N. Statistical methods in diagnostic medicine. New York: Wiley; 2002.View ArticleGoogle Scholar
  3. Pepe M. The statistical evaluation of medical tests for classification and prediction. Oxford, UK: Oxford University Press; 2003.Google Scholar
  4. Youden W. An index for rating diagnostic tests. Cancer. 1950; 3:32–5.View ArticlePubMedGoogle Scholar
  5. Bamber D. The area above the ordinal dominance graph and the area below the receive operating characteristic graph. J Math Psychol. 1975; 12:387–415.View ArticleGoogle Scholar
  6. Aoki K, Misumi J, Kimura T, Zhao W, Xie T. Evaluation of cutoff levels for screening of gastric cancer using serum pepsinogens and distributions of levels of serum pepsinogens I, II and of PG I/PG II ratios in a gastric cancer case-control study. J Epidemiol. 1997; 7:143–51.View ArticlePubMedGoogle Scholar
  7. Perkins N, Schisterman E. The inconsistency of optimal cutpoints obtained using two criteria based on the receiver operating characteristic curve. J Epidemiol. 2006; 163:670–5.View ArticleGoogle Scholar
  8. Sidransky D. Emerging molecular markers of cancer. Nat Rev Cancer. 2002; 2:210–9.View ArticlePubMedGoogle Scholar
  9. Kumar S, Mohan A, Guleria R. Biomarkers in cancer screening, research and detection: present and future: a review. Biomarkers. 2006; 11:385–405.View ArticlePubMedGoogle Scholar
  10. Su J, Liu J. Linear combinations of multiple diagnostic markers. J Am Stat Assoc. 1993; 88:1350–5.View ArticleGoogle Scholar
  11. Pepe M, Thompson M. Combining diagnostic test results to increase accuracy. Biostatistics. 2000; 1:123–40.View ArticlePubMedGoogle Scholar
  12. Liu C, Liu A, Halabi S. A min-max combination of biomarkers to improve diagnostic accuracy. Stat Med. 2011; 30:2005–14.PubMed CentralView ArticlePubMedGoogle Scholar
  13. Kang L, Liu A, Tian L. Linear combination methods to improve diagnostic/prognostic accuracy on future observations. Stat Methods Med Res. 2013; 22. In press.Google Scholar
  14. Yin J, Tian L. Optimal linear combinations of multiple diagnostic biomarkers based on Youden index. Stat Med. 2014; 33:1426–40.View ArticlePubMedGoogle Scholar
  15. Kouskoumvekaki I, Yang Z, Jónsdóttir S, Olsson L, Panagiotu G. Identification of biomarkers for genotyping Aspergilli using non-linear methods for clustering and classification. BMC Bioinformatics. 2008; 9:59.PubMed CentralView ArticlePubMedGoogle Scholar
  16. Turck C. Biomarkers for psychiatric disorders. USA: Springer-Verlag; 2009.View ArticleGoogle Scholar
  17. Huang Y, Fong Y. Identifying optimal biomarker combinations for treatment selection via a robust kernel method. Biometrics. 2014; 70:891–901.PubMed CentralView ArticlePubMedGoogle Scholar
  18. Wahba G. Spline models for observational data: CBMS-NSF Regional Conference Series in Applied Mathematics; 1990.Google Scholar
  19. Schisterman E, Perkins N, Liu A, Bondell H. Optimal cut-point and its corresponding Youden Index to discriminate individuals using pooled blood samples. Epidemiology. 2005; 16:73–81.View ArticlePubMedGoogle Scholar
  20. Fluss R, Faraggi D, Reiser B. Estimation of the Youden index and its associated cutoff point. Biom J. 2005; 47:458–72.View ArticlePubMedGoogle Scholar
  21. Xu T, Wang J, Fang Y. A model-free estimation for the covariate-adjusted Youden index and its associated cut-point. Stat Med. 2014. in press.Google Scholar
  22. Shen X, Tseng G, Zhang X, Wong W. On ψ-learning. J Am Stat Assoc. 2003; 98:724–34.View ArticleGoogle Scholar
  23. Vapnik V. Statistical learning theory. Chichester, UK: Wiley; 1998.Google Scholar
  24. Zhu J, Hastie T. Kernel logistic regression and the import vector machine. J Comput Graph Stat. 2005; 14:185–205.View ArticleGoogle Scholar
  25. Liu Y, Shen X. Multicategory psi-learning. J Am Stat Assoc. 2006; 101:500–9.View ArticleGoogle Scholar
  26. Hedayat AS, Wang J, Xu T. Minimum clinically important difference in medical studies. Biometrics. 2014; 71:33–41.View ArticlePubMedGoogle Scholar
  27. Wang J, Shen X. Probability estimation for large-margin classifiers. Biometrika. 2008; 95:149–67.View ArticleGoogle Scholar
  28. Lin Y. Support vector machines and the Bayes rule in classification. Data Mining Knowl Discov. 2002; 6:259–75.View ArticleGoogle Scholar
  29. An L, Tao P. Solving a class of linearly constrained indefinite quadratic problems by D.C. algorithms. J Glob Optim. 1997; 11:253–85.View ArticleGoogle Scholar
  30. Liu S, Shen X, Wong W. Computational development of ψ-learning. In: Proceedings of the SIAM International Conference on Data Mining. Newport, CA: 2005. p. 1–12.Google Scholar
  31. Schisterman EF, Perkins N. Confidence intervals for the Youden index and corresponding optimal cut-point. Communications in Statistics-Simulation and Computation. 2007; 36:549–63.View ArticleGoogle Scholar
  32. Jaakkola T, Diekhans M, Haussler D. Using the Fisher kernel method to detect remote protein homologies. In: Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology: 1999. p. 149–158.Google Scholar
  33. Fong Y, Yin S, Huang Y. Combining biomarkers linearly and nonlinearly for classification using the area under the ROC curve; 2014.
  34. Jiang B, Zhang X, Cai T. Estimating the confidence interval for prediction errors of support vector machine classifiers. J Mach Learn Res. 2008; 9:521–40.Google Scholar


© Xu et al. 2015