 Research Article
 Open Access
 Open Peer Review
A Wild Bootstrap approach for the selection of biomarkers in early diagnostic trials
 Antonia Zapf^{1}Email author,
 Edgar Brunner^{1} and
 Frank Konietschke^{2}
https://doi.org/10.1186/s128740150025y
© Zapf et al.; licensee BioMed Central. 2015
 Received: 15 October 2014
 Accepted: 25 March 2015
 Published: 1 May 2015
Abstract
Background
In early diagnostic trials, particularly in biomarker studies, the aim is often to select diagnostic tests among several methods. In case of metric, discrete, or even ordered categorical data, the area under the receiver operating characteristic (ROC) curve (denoted by AUC) is an appropriate overall accuracy measure for the selection, because the AUC is independent of cutoff points.
Methods
For selection of biomarkers the individual AUC’s are compared with a predefined threshold. To keep the overall coverage probability or the multiple typeI error rate, simultaneous confidence intervals and multiple contrast tests are considered. We propose a purely nonparametric approach for the estimation of the AUC’s with the corresponding confidence intervals and statistical tests. This approach uses the correlation among the statistics to account for multiplicity. For small sample sizes, a WildBootstrap approach is presented. It is shown that the corresponding intervals and tests are asymptotically exact.
Results
Extensive simulation studies indicate that the derived WildBootstrap approach keeps and exploits the nominal typeI error at best, even for high accuracies and in case of small samples sizes. The strength of the correlation, the type of covariance structure, a skewed distribution, and also a moderate imbalanced casecontrol ratio do not have any impact on the behavior of the approach. A real data set illustrates the application of the proposed methods.
Conclusion
We recommend the new Wild Bootstrap approach for the selection of biomarkers in early diagnostic trials, especially for high accuracies and small samples sizes.
Keywords
 AUC
 Diagnostic study
 Resampling
 Simultaneous intervals
 Wild bootstrap
Background
The aim of early diagnostic trials, particularly of biomarker studies, is often to select the most promising markers from a candidate set. For convenience, all different kinds of diagnostic tests, e.g., imaging techniques or biomarkers, will be denoted by diagnostic tests throughout the paper. In these studies, response variables are often not binary, but measured on a continuous, discrete or even ordinal scale and a cutoff value c has not yet been chosen. Therefore, the sensitivity (i.e. true positive proportion) and the specificity (true negative proportion) both being computed based on c cannot be used as selection criteria. In contrast, the Receiver Operating Characteristic (ROC) curve illustrates the overall diagnostic performance because it is independent of the chosen cutoff values (see, e.g., DeLong, DeLong and ClarkPearson [1]). Because the ROC curve of a diagnostic test is invariant with respect to any monotone transformation of the test measurement scale, it is an adequate measure for comparing diagnostic tests being measured even on different scales. The Area Under the ROCcurve (AUC) represents an accuracy measure which is independent from the selected cutoff value c and which is invariant under any monotone transformation of the data. Therefore, it is an appropriate selection criterion for promising diagnostic tests, and in particular Xia et al. [2] (p. 286) state in their tutorial about translational biomarker discovery in clinical metabolomics that the “AUC is widely used for performance comparison across different biomarker models”.
Thus, the remaining question is which biomarkers have sufficient diagnostic accuracy. There is no consensus about the threshold for sufficent diagnostic accuracy. Xia et al. [2] characterize a biomarker with an AUC<0.7 as a quite “weak” biomarker. In their study about a bloodbased biomarker panel for stratifying current risk for colorectal cancer Marshall et al. [4] accept a candidate model with an AUC>0.75 as a predictive model. In contrast, Broadhurst and Kell [5] refer to an AUC>0.9 as excellent and to an AUC>0.8 as good. Depending on previous knowledge or expectations a threshold for the AUC as indicator for sufficient diagnostic accuracy should be chosen during the planning of the trial.
Note that the aim of such trials is not to test multiple hypotheses formulated in terms of AUC differences across the biomarkers, but to verify sufficient diagnostic accuracy for all biomarkers individually. Then comparing the lower limit of the confidence interval for the estimated AUC with this threshold indicates whether or not the diagnostic test has sufficient diagnostic accuracy. The “Guideline on the choice of the noninferiority margin” of the European Medicines Agency [6] recommends to demonstrate noninferiority by use of twosided 95% or onesided 97.5% confidence intervals.
If several diagnostic tests are evaluated in the same trial, it is important to adjust the confidence intervals for multiplicity. Otherwise there is a high risk that the accuracy of some diagnostic tests is overestimated. Xia et al. [2] (p.288) point out that “The probability of finding a random association between a given metabolite and the outcome increases with the total number of comparisons”. Furthermore they note that the Bonferroni correction is a simple but very conservative method. If the diagnostic tests are repeatedly measured on the same subjects, hence, these measurements are correlated in general. Therefore it is of highly practical importance to take into account these correlations in the estimation of the diagnostic accuracy.
The multiplicity expert group of the ‘Statisticians in the Pharmaceutical Industry’ [7] (p.258) states that “The participants did, however, agree that for noninferiority and equivalence trials, compatible simultaneous CIs for the primary endpoint(s) should be presented in all cases”. Furthermore Strassburger and Bretz [8] recommend the use of singlestep procedures if the aim is not to reject as many hypotheses as possible. Therefore we will confine ourselves to simultaneous confidence intervals from singlestep procedures which are compatible with the results obtained by hypotheses tests. Among others, Hothorn et al. [9] proposed parametric simultaneous confidence intervals, which correspond to multiple contrast tests. However, since these parametric approaches are limited to normally distributed data, Konietschke et al. [10] proposed nonparametric multiple contrast tests and compatible asymptotic simultaneous confidence intervals for relative treatment effects for independent samples (based on some theoretical results developed by Brunner et al. [11]). In the particular case of two samples (cases and controls) the relative treatment effect is equivalent to the AUC (see Bamber [12]). In this article we will use this approach in the framework of diagnostic studies, but for paired samples in a multivariate layout.
The challenge in early diagnostic trials is often that smaller sample sizes and higher AUC’s occur. For example in the systematic review of 10 studies about the diagnostic accuracy of pleural fluid NTproBNP for pleural effusions of cardiac origin, performed by Janda and Swiston [13], the median total sample size was 104 (mean 112), and the pooled AUC was 98%. Wang et al. [14] reported in another systematic review about cardiac testing for coronary artery disease in potential kidney transplant recipients AUC’s between 0.78 and 0.92. Kottas et al. [15] found that the Logit tranformation based confidence interval for a single AUC leads to slightly conservative results for small sample sizes. Here we suggest Wild Bootstrap based simultaneous confidence intervals to obtain robust methods for small sample sizes and potentially quite large AUC’s. Hereby, we generalize the method proposed by Arlot et al. [16] for multivarite highdimensional normal data.
In this article nonparametric simultaneous confidence intervals for multiple AUC’s in diagnostic studies are presented. Asymptotic intervals will be derived as well as intervals using the Wild Bootstrap approach. The properties of these simultaneous intervals are investigated in a simulation study regarding the typeI error rate and the statistical power. Furthermore, the results of all intervals are given for the example data set presented before in this section. In the next section we present the methods, including the statistical model with the corresponding hypotheses, and the point estimators with their asymptotic distribution. Furthermore multiple contrast tests and corresponding simultaneous confidence intervals (with or without Logit transformation) are derived, and the Wild Bootstrap approach is presented (in particular for small sample sizes). The results of a simulation study including robustness evaluations, and the application of the methods to the example presented above are given in the Results section. Finally, all results are summarized and discussed, and a recommendation is given.
Methods
Statistical model and hypotheses
where d denotes the number of diagnostic tests. The partition of the data in cases (i=1) or controls (i=0) is based on the gold or reference standard, which is assumed to represent the true disease status of the subjects. In order to allow for continuous, discrete or even ordered categorical data in a unified way, we use the normalized version of the marginal distribution functions, i.e., \(F_{i}^{(\ell)}(x) = \tfrac 12\left (F_{i}^{(+,\ell)}(x) + F_{i}^{(,\ell)}(x)\right)\), where \(F_{i}^{(+,\ell)}(x) = P\left (X_{i1}^{(\ell)}\leq x \right)\) denotes the rightcontinuous and \(F_{i}^{(,\ell)}(x) = P\left (X_{i1}^{(\ell)}< x\right)\) denotes the leftcontinuous version of the distribution function respectively. In the context of nonparametric models, the normalized version of the distribution function was first mentioned by Kruskal [17] and generally dates back to Lévy [18]. Later on, it was used by Ruymgaart [19], Munzel [20], Brunner and Puri [21], Kaufmann et al. [22], among others, to derive asymptotic results for rank statistics including the case of ties. We note that \(F_{i}^{(\ell)}(x)\) may be arbitrary distribution functions, with the exception of the trivial case that both distributions are onepoint distributions (see Lange and Brunner [23]).
The withinsubject design given in (1), which means that all diagnostic tests are performed in each individual, is recommended in the EMA guideline about diagnostic agents [24] and refers to Design 1 in Brunner and Zapf [25].
For a convenient derivation of asymptotic results, the AUC’s are collected in the vector AUC = (AUC^{(1)}, …,AUC^{(d)})^{′}.
with strong control of the familywise error rate (FWER) α simultaneously. The noninferiority margin AUC_{0} is assumed to have been fixed during the planning phase of the trial. Thus, the set of promising diagnostic tests consists of all markers, whose corresponding AUC^{(ℓ)} have been declared to be larger than AUC_{0} by an adequate multiple testing procedure.
Point estimators and asymptotic distribution
can easily be computed using the means \(\overline {R}_{i.}^{(\ell)} = n_{i}^{1} \sum _{s=1}^{n_{i}} R_{\textit {is}}^{(\ell)}\) of the (mid) ranks \(R_{\textit {is}}^{(\ell)}\), i=0,1. Here, \(R_{\textit {is}}^{(\ell)}\) denotes the rank of \(X_{\textit {is}}^{(\ell)}\) among all N=n_{0}+n_{1} observations \(X_{01}^{(\ell)}, \ldots, X_{0n_{0}}^{(\ell)}\), \( X_{11}^{(\ell)}, \ldots, X_{1n_{1}}^{(\ell)}\) per marker ℓ=1,…,d. Further let \(\mathbf {R}_{\textit {is}}=\left (R_{\textit {is}}^{(1)},\ldots,R_{\textit {is}}^{(d)}\right)'\) denote the vectors of the midranks and let \(\widehat {\mathbf {AUC}}=\left (\widehat {AUC}^{(1)},\ldots,\widehat {AUC}^{(d)}\right)'\) denote the vector of the point estimators.
Here, \( \overline {\mathbf {Z}}_{i\cdot } = \frac {1}{n_{i}} \sum _{s=1}^{n_{i}}\mathbf {Z}_{is}\) denotes the vector of means of the normed placements. For more details we refer to Brunner et al. [11] and Kaufmann et al. [22].
Test statistics and confidence intervals
where \(\widehat {v}^{(\ell,\ell)}\) denotes the diagonal elements of \(\widehat {V}_{N}\), defined in (9). In particular, each statistic is studentized with an individual consistent variance estimator and thus, the set of hypotheses and test statistics \(\mathbf {\Omega } = \left \{ \left (H_{0}^{(\ell)}, T^{(\ell)}\right), \ell =1,\ldots,d \right \}\) constitutes a jointtesting family in the sense of Gabriel [26]. Attention should be paid to the fact that the estimated variance \(\widehat {v}^{(\ell,\ell)}\) is equal to zero if \(\widehat {AUC}^{(\ell)}=0\) or 1. Thus, the test statistic T^{(ℓ)} can not be computed. One possibility to solve this problem is to modify the data slightly (see the analysis of the example in the Results section).
The global null hypothesis H_{0}:AUC≤AUC_{0}·1 as defined in (4) will be rejected, if max{T^{(1)},…,T^{(d)}}>z_{1−α/d,1} or, equivalently, if the maximum of the lower limits of the confidence intervals \(\max \{{CI}_{Bonf,l}^{(1)},\ldots, {CI}_{Bonf,l}^{(d)}\} > {AUC}_{0}\). Here 1=(1,…,1)^{′} denotes a ddimensional vector of 1s. The Bonferroni method is, however, a quite conservative selection approach (see Results section for more details). The reason for this is that the apparent correlations among the different pivotal quantitites T^{(1)},…,T^{(d)} are not taken into account by this method.
Multiple contrast tests and simultaneous confidence intervals
In order to use the correlation in the selection approach, it is our idea to apply the multiple contrast test principle (denoted by MCP), which uses the correlation among different test statistics. The key point of these procedures is to use the joint distribution of a set of statistics to adjust for multiplicity. Thus, the asymptotic multivariate distribution of the vector T=(T^{(1)},…,T^{(d)})^{′} is required. The details are stated in the next theorem.
Theorem1.
Under the assumption that N→∞ such that N/n_{ i }≤N_{0}<∞, i=0,1, the vector T follows, asymptotically, a multivariate normal distribution with expectation 0 and correlation matrix R, where R=[r^{(ℓ,m)}]_{ℓ,m=1…,d}, and \(r^{(\ell,m)} = \tfrac {v^{(\ell,m)}}{\sqrt {v^{(\ell,\ell)} v^{(m,m)}}}\).
The global null hypothesis will be rejected if max{T^{(1)},…,T^{(d)}}>z_{1−α,1}(R) or if \(\max \{{CI}_{MCP,l}^{(1)},\ldots, {CI}_{MCP,l}^{(d)}\} > {AUC}_{0}\). The correlation matrix R, however, is unknown and must be replaced by a consistent estimator \(\widehat {\mathbf {R}}\). We propose to replace R by \(\widehat {\mathbf {R}}\) in the considerations above, where \(\widehat {\mathbf {R}} = [\widehat {r}^{(\ell,m)}]_{\ell,m=1,\ldots,d}\) and \(\widehat {r}^{(\ell,m)}= \tfrac {\widehat {v}^{(\ell,m)}}{\sqrt {\widehat {v}^{(\ell,\ell)} \widehat {v}^{(m,m)}}}\), respectively.
Simulation studies indicate, however, that the speed of convergence of T to a multivariate normal distribution is quite slow, particularly when smaller sample sizes and larger numbers of diagnostic tests are considered. In a variety of applications, see e.g. Zou and Yue [28] or Konietschke et al. [10], it turns out that the use of adequate transformations (e.g., the Logittransformation) tend to increase the speed of convergence. Therefore, simultaneous confidence intervals with Logit transformation will be derived in the next section.
Multiple contrast tests and simultaneous confidence intervals with Logit transformation
where \(\widehat {s}^{(\ell,\ell)}\) denotes the ℓth diagonal element of \(\widehat {\mathbf {S}}_{N}.\) The joint distribution of the vector \(\widetilde {\mathbf {T}}=(\widetilde {T}^{(1)},\ldots, \widetilde {T}^{(d)})'\) is given in the next theorem.
Theorem2.
If N→∞ such that N/n_{ i }→f_{ i }<∞, then the vector \(\widetilde {\mathbf {T}}\,=\,(\widetilde {T}^{(1)},\ldots, \widetilde {T}^{(d)})'\) follows, asymptotically, a multivariate normal distribution with expectation 0 and correlation matrix R, where R is given in Theorem 1.
where \(expit(y) = \tfrac {exp(y)}{1+\exp (y)}\) denotes the inverse Logittransformation. The global null hypothesis H_{0}:AUC≤AUC_{0}·1 will be rejected, if \(\max \left \{\widetilde {T}^{(1)},\ldots,\widetilde {T}^{(d)}\right \} \geq z_{1\alpha,1}(\widehat {\mathbf {R}})\), or if \(\max \{{CI}_{Logit,l}^{(1)},\ldots, {CI}_{Logit,l}^{(d)}\} > {AUC}_{0}\). Since the Logitfunction is monotone, the procedure asymptotically controls the familywise error rate in the strong sense [26].
Small sample approximations with Wild Bootstrap
In the previous section approaches for the selection of diagnostic tests based on the AUC’s have been derived. The procedures are based on the asymptotic joint distribution of the vectors T or \(\widetilde {\mathbf {T}}\), respectively. The proposed approaches for selection of diagnostic tests are valid for large sample sizes. In order to investigate the accuracies of the procedures in terms of (i) controlling the preassigned typeI error level under the null hypothesis, (ii) maintaining the nominal coverage probability of the corresponding simultaneous confidence intervals, and (iii) their powers to detect certain alternatives, extensive simulation studies were conducted.
These simulation studies indicate, however, that both the statistics T in (12) and \(\widetilde {\mathbf {T}}\) in (15) tend to result in liberal or conservative decisions in case of smaller sample sizes (N≤100) and larger AUC (AUC≥0.8). The results are in concordance with the simulation results proposed for univariate statistics by Kottas et al. [15] or Qin and Hotilovac [30]. Therefore, we propose a Wild Bootstrap approach to approximate their sampling distributions for small sample sizes.
Empirical typeI error (theoretical 2.5% ) of the normal Bootstrap for d=5 and N=50 with varying casecontrolratio and varying AUC
ccr  AUC  

0.5  0.6  0.7  0.8  0.9  
1:1  1.68%  2.28%  3.10%  5.00%  7.80% 
1:4  1.90%  2.96%  4.70%  6.40%  12.10% 

Rademacher weights: \(P(W_{\textit {is}}=1) = P(W_{\textit {is}}=1)=\frac 12\).

Standard normal weights: \(\phantom {\dot {i}\!}W_{01},\ldots,W_{1n_{1}} \sim N(0,1)\).

Uniform weights: \(W_{01},\ldots, W_{1n_{1}} \sim U\left [\frac {\sqrt {12}}{2}, \frac {\sqrt {12}}{2}\right ]\).
mimics the distribution of both the vectors T and \(\widetilde {\mathbf {T}}\), asymptotically.
Theorem3.
If N→∞ such that \(\tfrac {N}{n_{i}}\) converges to some finite constant f_{ i }, then the conditional distribution of T^{∗} given the data X converges in probability to the multivariate normal distribution with expectation 0 and correlation matrix R.
 1.
Given the data X, compute the point estimators \(\widehat {\mathbf {AUC}}\) and \(\widehat {\mathbf {V}}_{N}\) as given in (5) and (9), respectively.
 2.
Generate N=n_{0}+n_{1} random weights \(W_{01},\ldots, W_{1n_{1}}\phantom {\dot {i}\!}\) as described in (18)
 3.
Compute \(A^{\ast }_{j}:=\max \{T^{\ast (1)},\ldots,T^{\ast (d)}\}\) as given in (20).
 4.
Repeat the steps 2.  3. nboot times (e.g. nboot=10,000) and obtain the values \(A_{1}^{\ast },\ldots, A_{\textit {nboot}}^{\ast }\).
 5a.
Compare each \(A_{j}^{\ast }\) with \(\max \left \{\widetilde {\mathbf {T}}\right \}\). Then the individual pvalue for \(H_{0}^{(\ell)}:AUC^{(\ell)}\leq {AUC}_{0}\) is obtained from \(\tfrac {1}{nboot}\sum _{j=1}^{nboot}\mathcal {I}\{\widetilde {T}^{(\ell)}\geq A^{\ast }_{j}\}\), where \(\mathcal {I}\{\cdot \}\) denotes the indicator function.
 5b.
Estimate the quantile z_{1−α,1}(R) by the onesided (1−α)quantile \(z^{\ast }_{1\alpha,1}\) of \(A_{1}^{\ast },\ldots, A_{\textit {nboot}}^{\ast }\) to obtain the onesided (1−α) simultaneous confidence intervals given by
Results
Simulation results
We performed a simulation study to investigate the properties of the different approaches. All simulations were conducted with R environment, version 2.15.2. (R Development Core Team, 2010), each with 5, 000 simulation runs and 5, 000 bootstrap repetitions. The nominal typeI error was set to 2.5% onesided and the global null hypothesis according to (4) was rejected, if at least one of the onesided pvalues was smaller than α=2.5%. This means, the family wise error rate in the strong sense (FWER) is controlled, and the onesided empirical typeI error should be closed to 2.5%. It is also possible to use the corresponding confidence intervals for decision. Then the global null hypothesis is rejected if the lower limit of at least one confidence interval was above AUC_{0}.

The true AUC (0.5,…, 0.9)

The number of diagnostic tests d (5, 10, 20)

The total sample size N (50, 100, 200)

The casecontrol ratio ccr (1:1, 1:2, 1:4, 1:9)

The true correlation between the diagnostic tests ρ (0.3, 0.6, 0.9)

The covariance structure in the data (compound symmetry, unstructured, and diagonal matrix with heterogeneous variances and positive or negative pairing)

The distribution of the data (normal, skewed = lognormal, ordinal)
The different parameter constellations and all simulation results can be seen in the Additional file 2. Due to computational complexity, and its weak behavior in standard situations, we did not further investigate the conventional Bootstrap in our simulation study.
In a first step, this standard scenario was used for the comparison of the three random weights for the Wild Bootstrap: Rademacher (WBRade), standard normal (WBNormal) and uniform (WBUnif) weights. The results are displayed in the Additional file 3. For an AUC of 0.5 the three weights lead to nearly the same empirical typeI error and are quite conservative (empirical α≈0.015). For larger AUC’s the results are less conservative and for AUC’s above 0.8 the empirical typeI error is around 2.5%. The Wild Bootstrap approach with uniform weights is, however, more conservative, while the standard normal and the Rademacher weights lead nearly to the same results. Therefore, and to present the simulation results more clearly, we only consider the standard normal weights in the following. The simulation results for the other weights are provided in the Additional file 2.
The strength of the correlation, the type of the covariance structure and a skewed distribution do not have any impact on the behavior of the test (see figures and tables in the Additional files 2, 4, 5, and 6).
Ordinal data was generated using discretised normal distributions with a given AUC. For this data, representing a 5point grading scale, the empirical typeI error decreases with increasing AUC (AUC=0.5: Logit = 2.3%, WBNormal = 2.2% to AUC=0.9: Logit = 1.7%, WBNormal = 1.6%). For details see Additional file 2.
The power was calculated for one example scenario (N=200, d=5, ccr=1:1, ρ=0.9, AUC_{0}=0.7), where the empirical typeI error of the Logit and of the WBNormal approach was nearly the same. The true AUC is increasing from 0.7 (which is equal to AUC_{0}) to 0.85, according ΔAUC=0,…,0.15. The power of the two approaches is basically the same. For an ΔAUC of 0.1 (i.e. AUC=0.8 vs. AUC_{0}=0.7) the power is greater than 80% (see Additional file 2).
Results for the analysis of the example
Number of selected biomarkers of the MCP, the Logit, and the WBNormal approach for different thresholds (based on onesided 97.5% confidence intervals)
Threshold  MCP  Logit  WBNormal 

0.8  4  3  4 
0.85  4  3  3 
0.9  3  2  3 
0.95  2  1  2 
Discussion
It is widely discussed in the literature, whether the typeI error should be adjusted for multiplicity and whether the Bonferroni correction is an appropriate approach. Among many others, Wittes [44] states that lack of adjustment can lead to a misinterpretation of the study results as well as Bonferroni adjustment can do. Furthermore Perneger [45] states that “In summary, Bonferroni adjustments have, at best, limited applications in biomedical research, and should not be used when assessing evidence about specific hypotheses”. Nevertheless, in practice often Bonferroni adjusted or even unadjusted confidence intervals for the single AUC’s are used (see for example [43]). Konietschke et al. [10] proposed nonparametric multiple contrast tests and simultaneous confidence intervals for adequate correction of the typeI error, which take the dependencies within the data into account. Furthermore the authors recommended the transformation method (for example the Logittransformation) to get less liberal results. However, Qin and Hotilovac [30] noticed that the Logittransformed intervals are conservative for high accuracies. The reason is that the estimator \(logit(\widehat {AUC})\) is quite unstable if \(\widehat {AUC}\) is close to 0 or 1 because of a possibly larger variance. Obuchowski and Lieber [46] compared different confidence intervals for the AUC and concluded that for small sample sizes none of them provides adequate coverage for high accuracies.
Conclusion
In this article we derived a Wild Bootstrap approach, which exploits the typeI error much better than the Logitapproach, even for high accuracies and small samples. Neither the strength of correlation, nor the structure of the covariance matrix, nor a skewed distribution, nor a moderate imbalanced casecontrol ratio has any impact on this desirable property of the Wild Bootstrap approach. Corresponding to these results we recommend to use the Wild Bootstrap approach with standard normally distributed weights for the selection of biomarkers in early diagnostic trials with the AUC as selection criterion.
Declarations
Acknowledgements
This work was supported by the Federal Ministry of Education and Research [05M10MGB]. The authors thank Prof. Ballmann from the DRKKinderklinik Siegen for supporting this work by providing study data.
Authors’ Affiliations
References
 DeLong E, DeLong D, ClarkPearson D. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988; 44:837–45.View ArticlePubMedGoogle Scholar
 Xia J, Broadhurst D, Wilson M, Wishart D. Translational biomarker discovery in clinical metabolomics: an introductory tutorial. Metabolomics. 2013; 9:280–99.View ArticlePubMedGoogle Scholar
 Derichs N, Sanz J, Von Kanel T, Stolpe C, Zapf A, Tümmler B, et al. Intestinal current measurement for diagnostic classification of patients with quastionable cystic fibrosis: validation and reference data. Thorax. 2010; 65:594–9.View ArticlePubMedGoogle Scholar
 Marshall K, Mohr S, Khettabi F, Nossova N, Chao S, Bao W, et al. A bloodbased biomarker panel for stratifying current risk for colorectal cancer. Int J Cancer. 2010; 126:1177–86.PubMedGoogle Scholar
 Broadhurst D, Kell D. Statistical strategies for avoiding false discoveries in metabolomics and related experiments. Metabolomics. 2006; 2:171–96.View ArticleGoogle Scholar
 EMA. Guideline on the choice of the noninferiority margin. Doc. Ref. EMEA/CPMP/EWP/2158/99. 2005. www.ema.europa.eu/ema/pages/includes/document/open\_document.jsp?webContentId=WC500003636 (date of last access 13/04/15).
 Phillips A, Fletcher C, Atkinson G, Channon E, Douiri A, Jaki T, et al. Multiplicity: discussion points from the statisticians in the pharmaceutical industry multiplicity expert group. Pharm Stat. 2013; 12:255–9.View ArticlePubMedGoogle Scholar
 Strassburger K, Bretz F. Compatible simultaneous confidence bounds for the Holm procedure and other Bonferronibased closed tests. Stat Med. 2008; 27:4919–27.View ArticleGoogle Scholar
 Hothorn T, Bretz F, Westfall P. Simultaneous inference in general parametric models. Biometrical J. 2008; 50:346–63.View ArticleGoogle Scholar
 Konietschke F, Hothorn L, Brunner E. Rankbased multiple test procedures and simultaneous confidence intervals. Electron J Stat. 2012; 6:738–59.View ArticleGoogle Scholar
 Brunner E, Munzel U, Puri M. The multivariate nonparametric BehrensFisher problem. J Stat Planning Inference. 2002; 108:37–53.View ArticleGoogle Scholar
 Bamber D. The area above the ordinal dominance graph and the area below receiver operating characteristic graph. J Math Psychol. 1975; 12:387–415.View ArticleGoogle Scholar
 Janda S, Swiston J. Diagnostic accuracy of pleural fluid NTproBNP for pleural effusions of cardiac origin: a systematic review and metaanalysis. BMC Pulmonary Med. 2010; 10:58.View ArticleGoogle Scholar
 Wang L, Fahim M, Hayen A, Mitchell R, Baines L, Lord S. Cardiac testing for coronary artery disease in potential kidney transplant recipients. Cochrane Database Syst Rev. 2011; 12. DOI: 10.1002/14651858.CD008691.pub2.Google Scholar
 Kottas M, Kuss O, Zapf A. A modified Wald interval for the area under the ROC curve (AUC) in diagnostic casecontrol studies. BMC Med Res Methodology. 2014; 14:26.View ArticleGoogle Scholar
 Arlot S, Blanchard G, Roquain E. Some nonasymptotic results on resampling in high dimension, I: confidence regions. Ann Stat. 2010; 38:51–82.View ArticleGoogle Scholar
 Kruskal W. A nonparametric test for the several sample problem. Ann Math Stat. 1952; 23:525–40.View ArticleGoogle Scholar
 Lévy P. Calcul des Probabilitées. Paris: GauthiersVillars, Éditeurs; 1925.Google Scholar
 Ruymgaart F. A unified approach to the asymptotic distribution theory of certain midrank statistics In: Raoult JP, editor. Statistique Non Parametrique Asymptotique vol. Lecture Notes on Mathematics, No. 821. Springer, Berlin Heidelberg: 1980. p. 1–18.Google Scholar
 Munzel U. Linear rank score statistics when ties are present. Stat Probability Lett. 1999; 41:389–95.View ArticleGoogle Scholar
 Brunner E, Puri M. Nonparametric methods in factorial designs. Stat Pap. 2001; 42:1–52.View ArticleGoogle Scholar
 Kaufmann J, Werner C, Brunner E. Nonparametric methods for analysing the accuracy of diagnostic tests with multiple readers. Stat Methods Med Res. 2005; 14:129–46.View ArticlePubMedGoogle Scholar
 Lange K, Brunner E. Sensitivity, specificity and ROCcurves in multiple reader diagnostic trials  a unified, nonparametric approach. Stat Methodology. 2012; 9:490–500.View ArticleGoogle Scholar
 EMA. Guideline on clinical evaluation of diagnostic agents. Doc. Ref. CPMP/EWP/1119/98/Rev. 1. 2010. www.ema.europa.eu/ema/pages/includes/document/open\_document.jsp?webContentId=WC500003580 (date of last access 13/04/15).
 Brunner E, Zapf A. Nonparametric ROC analysis for diagnostic trials In: Balkrishnan N, editor. Methods and Applications of Statistics in Clinical Trials vol. Volume 2: Planning, Analysis, and Inferential Methods. Hoboken, New Jersey: John Wiley & Sons: 2014. p. 471–83.Google Scholar
 Gabriel K. Simultaneous test procedures  some theory of multiple comparisons. Ann Math Stat. 1969; 40:224–50.View ArticleGoogle Scholar
 Bretz F, Landgrebe J, Brunner E. Multiplicity issues in microarray experiments. Methods Inf Med. 2005; 44:431–7.PubMedGoogle Scholar
 Zou G, Yue L. Using confidence intervals to compare several correlated areas under the receiver operating characteristic curves. Stat Med. 2012; 32:5077–90.View ArticleGoogle Scholar
 Ferguson T. A Course in Large Sample Theory. London: Chapman & Hall; 1996.View ArticleGoogle Scholar
 Qin G, Hotilovac L. Comparison of nonparametric confidence interval for the area under the ROC curve of a continuousscale diagnostic test. Stat Methods Med Res. 2008; 17:207–21.Google Scholar
 Efron B. Bootstrap methods: Another look at the Jackknife. Ann Stat. 1979; 7:1–26.View ArticleGoogle Scholar
 Wu C. Jackknife, Bootstrap and other resampling methods in regression analysis. Ann Stat. 1986; 14:1261–95.View ArticleGoogle Scholar
 Mammen E. When does Bootstrap work? Asymptotic results and simulations. New York: Springer; 1992.View ArticleGoogle Scholar
 Beran R. Diagnosing Bootstrap success. Ann Inst Stat Mathematics. 1997; 49:1–24.View ArticleGoogle Scholar
 Janssen A. Nonparametric symmetry tests for statistical functionals. Math Methods Stat. 1999; 8:320–43.Google Scholar
 Kreiss J, Paparoditis E. Bootstrap for dependent data: a review, with discussion, and a rejoinder. J Korean Stat Soc. 2011; 40:357–78.View ArticleGoogle Scholar
 Kreiss J, Paparoditis E. Bootstrapping locally stationary processes. J R Stat Soc  Ser B. 2014; 77:267–90.View ArticleGoogle Scholar
 Konietschke F, Pauly M. Bootstrapping and permuting paired ttest type statistics. Stat Comput. 2014; 24:283–96.View ArticleGoogle Scholar
 Lin D. Nonparametric inference for cumulative incidence functions in competing risks studies. Stat Med. 1997; 16:901–10.View ArticlePubMedGoogle Scholar
 Beyersmann J, di Termini S, Pauly M. Weak convergence of the Wild Bootstrap for the AalenJohansen estimator of the cumulative incidence function of a competing risk. Scand J Stat. 2014; 40:387–402.View ArticleGoogle Scholar
 Pauly M. Weighted resampling of martingale difference arrays with applications. Electron J Stat. 2011; 5:41–2.View ArticleGoogle Scholar
 Dobler D, Pauly M. How to Bootstrap AalenJohansen processes for competing risks? Handicaps, solutions, limitations. Electron J Stat. 2014; 8:2779–803.View ArticleGoogle Scholar
 Shiotani A, Murao T, Kimura Y, Matsumoto H, Kamada T, Kusunoki H, et al. Identification of serum mirnas as novel noninvasive biomarkers for detection of high risk for early gastric cancer. Br J Cancer. 2013; 109:2323–30.View ArticlePubMedPubMed CentralGoogle Scholar
 Wittes J. Clinical trials must cope better with multiplicity. Nat Med. 2012; 18:1607.View ArticlePubMedGoogle Scholar
 Perneger T. What’s wrong with Bonferroni adjustments. Br Med J. 1998; 316:1236–8.View ArticleGoogle Scholar
 Obuchowski N, Lieber M. Confidence intervals for the receiver operating characteristic area in studies with small samples. Academic Radiology. 1998; 5:561–71.View ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.