This article has Open Peer Review reports available.
A modified Wald interval for the area under the ROC curve (AUC) in diagnostic case-control studies
© Kottas et al.; licensee BioMed Central Ltd. 2014
Received: 16 October 2013
Accepted: 7 February 2014
Published: 19 February 2014
The area under the receiver operating characteristic (ROC) curve, referred to as the AUC, is an appropriate measure for describing the overall accuracy of a diagnostic test or a biomarker in early phase trials without having to choose a threshold. There are many approaches for estimating the confidence interval for the AUC. However, all are relatively complicated to implement. Furthermore, many approaches perform poorly for large AUC values or small sample sizes.
The AUC is actually a probability. So we propose a modified Wald interval for a single proportion, which can be calculated on a pocket calculator. We performed a simulation study to compare this modified Wald interval (without and with continuity correction) with other intervals regarding coverage probability and statistical power.
The main result is that the proposed modified Wald intervals maintain and exploit the type I error much better than the intervals of Agresti-Coull, Wilson, and Clopper-Pearson. The interval suggested by Bamber, the Mann-Whitney interval without transformation, and the interval of the binormal AUC are very liberal. For small sample sizes the Wald interval with continuity correction has a coverage probability comparable to that of the Mann-Whitney interval with logit transformation (LT interval) and higher power. For large sample sizes the results of the LT interval and of the Wald interval without continuity correction are comparable.
If individual patient data is not available, but only the estimated AUC and the total sample size, the modified Wald intervals can be recommended as confidence intervals for the AUC. For small sample sizes the continuity correction should be used.
The result of a diagnostic test is in general not binary (positive/negative) but a quantitative parameter (such as a biomarker). If an appropriate threshold for the quantitative parameter has not yet been defined, the receiver operating characteristic (ROC) curve, and in particular the area under this curve, are appropriate for evaluating the overall accuracy of the diagnostic test. The ROC curve is a plot of sensitivity (true positive rate) against one minus specificity (false positive rate) for each possible threshold value of the biomarker of interest. In the case of complete separation of cases and controls, the area under the ROC curve (AUC) is equal to one. For a diagnostic test that is no better than chance, the AUC is 0.5. In early phase diagnostic studies the aim is, among other things, to get a first impression of the overall diagnostic accuracy.
The sample sizes in such studies are typically small. For example, in the systematic review by Cochrane and Ebmeier of diffusion tensor imaging (DTI) as a candidate biomarker for the diagnosis of Parkinson disease, the median total sample size of the 21 selected studies was 32 (mean = 39). The largest study in the systematic review by Wang et al. of cardiac testing for coronary artery disease in potential kidney transplant recipients included 219 patients.
A case-control study design with comparable sizes in the two groups is usually chosen [1, 4] (i.e. case-control ratio ≈ 1:1). Controls are generally healthy volunteers, patients with benign disease, or patients with a disease within the scope of the differential diagnosis (see for example [5–7]).
Diagnostic tests or biomarkers in such studies often yield large values for the AUC. For example, in the systematic review by Wang et al. the AUCs of the different diagnostic tests were between 0.78 and 0.91.
Many different confidence intervals have been proposed for the AUC. In 1975 Bamber suggested a variance estimator and a corresponding confidence interval for the AUC, which was the starting point for many authors. In 2008 Qin and Hotilovac compared nine nonparametric intervals. Their conclusion was that the empirical likelihood-based interval and the Mann-Whitney interval with logit transformation lead to good coverage accuracy. The Mann-Whitney interval without transformation was not recommended by the authors; however, it is used in the ROC statement of PROC LOGISTIC in SAS. A parametric approach is the AUC under the binormal ROC curve (see for example the book of Pepe).
Because all confidence intervals for the AUC are relatively complicated to implement, and some of them either do not maintain or do not exploit the type I error probability α for small sample sizes or large values of the AUC, we investigated alternatives. Our basic approach was to use simple two-sided confidence intervals for a single proportion, because the AUC can be interpreted as a probability (namely that a randomly chosen diseased individual has a larger value for the biomarker than a randomly chosen non-diseased individual, see for example , formula (1.3)). The simplest confidence interval is the Wald interval, which tends to yield anti-conservative results. As an alternative we propose a conservative version with a modified variance estimator, based on Bamber. Newcombe compared seven confidence intervals for a single proportion and recommended the Wilson interval ("Of the methods that perform well, only the score method is calculator friendly."). Wilson’s score interval is still suggested, particularly for proportions close to 0 or 1 (see for example the article of He et al.). Agresti and Coull recommended a modified Wald interval, which behaves similarly to the Wilson interval for a two-sided type I error of 5% but has a simpler formula. The Clopper-Pearson interval is another alternative. It is an exact interval but tends to yield conservative results.
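The interpretation of the AUC as a probability can be illustrated with a minimal Python sketch. It estimates the AUC as the proportion of case-control pairs in which the case has the larger biomarker value. The function name, the toy data, and the convention of counting ties as one half (the usual Mann-Whitney convention) are assumptions made for this illustration and are not taken from the article.

```python
def empirical_auc(cases, controls):
    """Proportion of (case, control) pairs in which the case value is larger.

    Ties are counted as one half (assumed Mann-Whitney convention).
    """
    wins = 0.0
    for x1 in cases:
        for x0 in controls:
            if x1 > x0:
                wins += 1.0
            elif x1 == x0:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Hypothetical biomarker values, for illustration only.
controls = [0.2, 0.5, 0.9, 1.1]
cases = [0.7, 1.0, 1.4, 1.8]
print(empirical_auc(cases, controls))
```

With complete separation of the two groups the function returns 1, and with identical groups it returns 0.5, in line with the two reference values mentioned above.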
In this article we compare the modified, conservative Wald confidence interval (with and without continuity correction) with the Mann-Whitney interval with logit transformation as the main reference. Furthermore, Bamber’s interval, the Mann-Whitney interval without transformation, and the interval for the binormal AUC are included. From the family of intervals for a single proportion, Wilson’s score interval (with and without continuity correction), the Agresti-Coull interval, and the Clopper-Pearson interval are added. In line with the recommendations of Burton et al., we compare the intervals in terms of coverage probability, interval length, and statistical power. The aim of this article is to determine whether one of the intervals is an appropriate alternative to the Mann-Whitney interval with logit transformation and, if so, in which situations it performs well. In the next section we describe the statistical model and the different confidence intervals. Then the results of the simulation study and of an example are presented. Finally, the results are summarised and discussed, and recommendations are given.
Confidence intervals for the AUC
Notation needed for the formulas of the different confidence intervals in Table 2
Logit transformation of the AUC
Back transformation of the logit transformation
z = z_{1-α/2}: 1-α/2 quantile of the standard normal distribution
Empirical standard deviation of AUC
Standard error of AUC by Bamber
AUC for the binormal ROC curve (s_i, i = 0, 1, as empirical estimators of σ_i)
Empirical standard deviation of AUC*
Factor for the Wilson interval
Modified AUC (for the A-C interval)
Estimated number of successes (for the C-P interval)
f(1 - α/2, df1, df2): 1-α/2 quantile of the F distribution with df1 and df2 degrees of freedom
Formulas for the different confidence intervals from section Methods
Confidence interval (denotation)
Result for AUC = 1
... with continuity correction (Wilson-cc)
lower: (2(n + z^2))
upper: (2(n + z^2))
Clopper-Pearson (C-P)
lower: (k · f(α/2, 2k, 2(n - k + 1))) / (n - k + 1 + k · f(α/2, 2k, 2(n - k + 1)))
upper: ((k + 1) · f(1 - α/2, 2(k + 1), 2(n - k))) / (n - k + (k + 1) · f(1 - α/2, 2(k + 1), 2(n - k)))
Modified Wald (Wald)
... with continuity correction (Wald-cc)
A parametric approach is the binormal ROC curve (denoted Binormal), which assumes normal distributions for the test results of the cases and of the controls. The area under the resulting curve is called the binormal AUC. The binormal AUC is estimated using the empirical estimators of the distribution parameters (for the formula see Table 2, for details see for example the book of Pepe).
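For orientation, the closed form of the AUC under the binormal model can be written as follows. The symbols μ_0, μ_1 for the group means and x̄_0, x̄_1 for their sample means are notation assumed here, while s_0 and s_1 denote the empirical estimators of σ_0 and σ_1 as in the notation table. This is the standard textbook expression for P(X_1 > X_0) with independent normal X_0 and X_1, given as a sketch rather than as a reproduction of the article's Table 2.

```latex
\mathrm{AUC}_{\text{binormal}}
  = \Phi\!\left(\frac{\mu_1 - \mu_0}{\sqrt{\sigma_0^2 + \sigma_1^2}}\right),
\qquad
\widehat{\mathrm{AUC}}_{\text{binormal}}
  = \Phi\!\left(\frac{\bar{x}_1 - \bar{x}_0}{\sqrt{s_0^2 + s_1^2}}\right)
```

Here Φ is the distribution function of the standard normal distribution.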
Confidence intervals for a single proportion
The Wilson score interval and the Wilson interval with continuity correction (denoted Wilson and Wilson-cc) are known for their good properties for proportions near 0 or 1. The formulas are more complicated than that of the Wald interval, but only the quantile, the total sample size n, the point estimator and constants are needed (see Table 2). The intervals can also be calculated when the AUC equals 1, and the limits are always range-preserving. The 95% interval of Agresti and Coull (denoted A-C), a Wald interval that adds two "successes" and two "failures", behaves similarly to the Wilson interval but has a simpler formula (see Table 2). In the usual setting in which it is applied, the exact confidence interval of Clopper and Pearson (denoted C-P) maintains the type I error by definition. However, this property does not hold here because the AUC is a probability relating two independent groups rather than a group and a subgroup. The interval can be calculated with a finite formula (see for example the article of Agresti and Coull). In the case of the interval cannot be calculated. The corresponding formulas for all intervals are given in Table 2.
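As a minimal sketch of how two of these single-proportion intervals are applied to the AUC, the following Python code evaluates the standard Wilson score interval (without continuity correction) and the 95% Agresti-Coull interval, using the estimated AUC as the proportion and the total sample size as n. The function names and the example values are assumptions for illustration; the formulas are the usual textbook forms and are not copied from the article's Table 2.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(auc_hat, n, alpha=0.05):
    """Standard Wilson score interval, treating auc_hat as a proportion of n."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    denom = 1 + z * z / n
    centre = (auc_hat + z * z / (2 * n)) / denom
    half = z * sqrt(auc_hat * (1 - auc_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def agresti_coull_interval(auc_hat, n):
    """95% Agresti-Coull interval: Wald interval after adding 2 successes and 2 failures."""
    z = NormalDist().inv_cdf(0.975)
    n_mod = n + 4
    p_mod = (auc_hat * n + 2) / n_mod  # auc_hat * n plays the role of the number of successes
    half = z * sqrt(p_mod * (1 - p_mod) / n_mod)
    return p_mod - half, p_mod + half


# Hypothetical published values: estimated AUC of 0.9 from n = 40 subjects.
print(wilson_interval(0.9, 40))
print(agresti_coull_interval(0.9, 40))
```

Both functions need only the point estimate and the total sample size. The Wilson limits stay within [0, 1] even for an estimated AUC of 1, whereas the Agresti-Coull limits can exceed 1 for estimates close to 1.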
Modified Wald intervals
The Wald confidence interval is very easy to calculate and in general has good properties. However, it is known to become anti-conservative for small sample sizes. Therefore we propose a Wald interval with a modified variance estimator. In addition to the variance estimator (used in the Bamber interval, see above), Bamber also gave the maximum variance for the case of continuous X 0 and X 1 with monotonic posterior (i.e. the larger the measured value, the larger the probability that the disease is present). According to Bamber the estimated asymptotic maximum variance is (for balanced sample sizes, derived from [8, 22]).
The formulas for the corresponding Wald intervals with and without continuity correction (denoted Wald and Wald-cc) are given in Table 2. One advantage of the Wald interval with continuity correction is that it can also be calculated for an estimated AUC equal to 1. The upper and the lower limit of the Wald interval without continuity correction would both be equal to 1 for AUC = 1. The Wald intervals are not range-preserving, i.e. their limits can fall outside the interval from 0 to 1.
The simulation program was implemented in SAS/IML and 10 000 simulation runs were used. The binormal intervals were calculated only for the first 1 000 simulation runs because of their high computation time. First we generated normally distributed data, independently for the two groups, with mean μ = 0 and with separate variances for the controls and the cases. Then the values for the cases were shifted to obtain the true AUC (AUC 0).
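The data-generating step can be sketched in Python as follows. This is not the authors' SAS/IML program. The shift is computed from the standard binormal relation AUC = Φ(shift / sqrt(σ0² + σ1²)), which is an assumption about how the target AUC was obtained; σ0 and σ1 are treated as standard deviations here, and all function and parameter names are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def simulate_groups(n_per_group, auc0, sigma0=1.0, sigma1=1.0, rng=None):
    """Draw normal data for controls and cases so that the true AUC equals auc0."""
    rng = rng or random.Random()
    # Assumed binormal relation: auc0 = Phi(shift / sqrt(sigma0^2 + sigma1^2)).
    shift = sqrt(sigma0 ** 2 + sigma1 ** 2) * NormalDist().inv_cdf(auc0)
    controls = [rng.gauss(0.0, sigma0) for _ in range(n_per_group)]
    cases = [rng.gauss(0.0, sigma1) + shift for _ in range(n_per_group)]
    return controls, cases, shift


def empirical_auc(cases, controls):
    """Proportion of pairs with case > control (ties count one half)."""
    wins = sum(1.0 if x1 > x0 else 0.5 if x1 == x0 else 0.0
               for x1 in cases for x0 in controls)
    return wins / (len(cases) * len(controls))


# One simulated data set with 20 cases and 20 controls and a true AUC of 0.8.
controls, cases, shift = simulate_groups(20, 0.8, rng=random.Random(2014))
print(shift, empirical_auc(cases, controls))
```

With small groups the empirical AUC of a single run scatters noticeably around the true value, which is why the intervals are judged over many simulation runs.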
Varied factors in the simulation study
Results in paragraph
True AUC (AUC 0)
0.7,0.8,0.9 (each ±0.01)
Sample size (n)
Interval length, coverage probability
Variance of the cases (σ 1), σ 0 = 1
Ordinal with five categories
Variance of the cases (σ 1), σ 0 = 1
AUC under the alternative hypothesis (AUC 1)
Results and discussion
We first simulated data for the nine combinations of AUC 0 and total sample size n (with 1:1 case-control ratio). Under specific conditions the LT and M-W intervals and the C-P interval cannot be computed (see Methods). The LT interval could not be computed only for the combination of small sample size and high AUC (n = 40, AUC 0 = 0.9), and there only for 14 of the 10 000 simulation runs. For the same scenario the C-P interval could not be computed for 125 simulation runs. The C-P interval could also not be computed for two simulation runs with n = 40 and AUC 0 = 0.8.
For interval length, across the nine scenarios the Wald intervals tend to be the widest, while the A-C and the Wilson interval tend to be the narrowest. A box plot of the length of the different intervals is given in the Additional file 3: Figure S1. The simulation runs which did not yield intervals (141 runs overall, see above) were ignored.
The Wilson interval without continuity correction has a coverage probability of nearly 95% for an AUC 0 of 0.9, independent of the sample size. For lower values of AUC 0, however, the coverage of the Wilson interval drops to 92%. The Agresti-Coull interval, the Wilson interval with continuity correction, and the Clopper-Pearson interval tend to be liberal for an AUC 0 of 0.7 (coverage between 92% and 94%), and become quite conservative for higher AUCs (coverage up to 98%).
The modified Wald interval without continuity correction is liberal for small sample sizes (93%-94% coverage); for larger sample sizes the coverage is comparable to that of the LT interval. However, for a large sample size of n = 200 and a high AUC 0 of 0.9 the Wald interval becomes conservative (coverage of 97%). The coverage probability of the continuity-corrected Wald interval is very similar to that of the LT interval, but for larger sample sizes the Wald-cc interval becomes conservative (coverage up to 98%).
Because overall the LT and the Wald intervals maintained the type I error best, we restricted subsequent investigations to these three intervals.
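How a coverage probability of this kind is estimated can be sketched as follows: simulate many data sets with a known true AUC, compute the interval in each run, and count how often it contains the true value. The sketch uses the standard Wilson score interval because its formula is fully specified; it is a simplified stand-in for the authors' SAS/IML program, with equal unit variances, hypothetical function names, and a number of runs chosen only to keep the example fast.

```python
import random
from math import sqrt
from statistics import NormalDist

Z = NormalDist().inv_cdf(0.975)  # quantile for a two-sided 95% interval


def empirical_auc(cases, controls):
    """Proportion of pairs with case > control (ties count one half)."""
    wins = sum(1.0 if x1 > x0 else 0.5 if x1 == x0 else 0.0
               for x1 in cases for x0 in controls)
    return wins / (len(cases) * len(controls))


def wilson_interval(p, n):
    """Standard Wilson score interval for a proportion p based on n observations."""
    denom = 1 + Z * Z / n
    centre = (p + Z * Z / (2 * n)) / denom
    half = Z * sqrt(p * (1 - p) / n + Z * Z / (4 * n * n)) / denom
    return centre - half, centre + half


def estimate_coverage(auc0, n_total, runs, seed):
    """Share of simulation runs whose interval contains the true AUC."""
    rng = random.Random(seed)
    m = n_total // 2  # balanced design, 1:1 case-control ratio
    shift = sqrt(2.0) * NormalDist().inv_cdf(auc0)  # assumes sigma0 = sigma1 = 1
    hits = 0
    for _ in range(runs):
        controls = [rng.gauss(0.0, 1.0) for _ in range(m)]
        cases = [rng.gauss(shift, 1.0) for _ in range(m)]
        lower, upper = wilson_interval(empirical_auc(cases, controls), n_total)
        hits += lower <= auc0 <= upper
    return hits / runs


print(estimate_coverage(auc0=0.9, n_total=40, runs=1000, seed=1))
```

An interval is called liberal when this share falls below the nominal 95% and conservative when it lies above it.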
The LT and the Wald intervals are robust with respect to non-normal distributions, which is important because biomarkers often follow a skewed distribution. This is not surprising, because the numerator of the Mann-Whitney test statistic, which serves as point estimator, is based on the ranks of the measurements. Therefore the estimators, and accordingly the LT interval, are invariant under any monotone transformation. The Wald interval is based only on the point estimator and the sample size and is therefore invariant as well.
Because test results can also be ordinal (especially in studies involving imaging techniques), we investigated the coverage probability after categorizing the normally distributed data into five categories (using the percentiles 20, 40, 60, and 80). For continuous data, the median coverage probability of the LT and of the Wald interval is about 95%, while the median coverage of the Wald-cc interval is 96%, and the range of the LT interval is smaller than the range of the Wald intervals. For ordinal data the median coverage probability of the LT interval increases only from 95.3% to 95.4%, but the range becomes as large as the range of the Wald intervals. The median coverage of the Wald intervals increases from 95.3% to 95.6% for Wald, and from 96.1% to 96.6% for Wald-cc, while the range does not change much. The corresponding figure is given in the Additional file 4: Figure S2.
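The categorization step can be sketched in Python as follows. The sketch assumes that the 20th, 40th, 60th and 80th percentiles are taken from the pooled data of both groups, which the text does not state explicitly; the function name and the category codes 1 to 5 are likewise chosen for illustration.

```python
from bisect import bisect_right
from statistics import quantiles


def to_five_categories(values, pooled):
    """Map continuous values to ordinal categories 1..5 using pooled quintile cut points."""
    cuts = quantiles(pooled, n=5)  # 20th, 40th, 60th and 80th percentiles
    return [bisect_right(cuts, v) + 1 for v in values]


# Hypothetical continuous measurements of controls and cases.
controls = [0.1, 0.4, 0.6, 0.9, 1.2]
cases = [0.8, 1.1, 1.5, 1.9, 2.3]
pooled = controls + cases
print(to_five_categories(controls, pooled))
print(to_five_categories(cases, pooled))
```

After this step many case-control pairs are tied, so a point estimator that counts ties as one half is needed for the categorized data.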
Results of the example from section Example
The aim of this article was to investigate whether a modified Wald interval (with or without continuity correction), which is quite easy to implement, is an alternative to the Mann-Whitney interval with logit transformation (LT) as a confidence interval for the AUC in diagnostic studies. The simulation study shows that for small sample sizes (here n = 40) the Wald interval with continuity correction is as good as the LT interval regarding the coverage probability, and has much more power than the LT interval. For large sample sizes (here n = 100, 200) the Wald interval without continuity correction is comparable to the LT interval regarding the coverage probability for an AUC 0 up to 0.8, and has slightly more power. For an AUC 0 of 0.9 the Wald interval becomes slightly conservative. The LT interval as well as the Wald intervals are robust to unimodal departures from normality. However, while the LT interval is quite robust to unbalanced sample sizes and also applicable to ordinal data, the Wald intervals cannot be recommended for very unbalanced or ordinal data. Neither the Wald intervals nor the LT interval are robust to variance heterogeneity.
The other intervals investigated (Mann-Whitney, Bamber, Binormal, Wilson, Wilson with continuity correction, Agresti-Coull, and Clopper-Pearson) cannot be recommended. In particular, the Mann-Whitney interval, which is used in the ROC statement of PROC LOGISTIC in SAS (referred to there as a Wald interval), Bamber’s interval, and the interval for the binormal AUC are much too liberal. This is especially disappointing for the binormal AUC interval, because it was the only parametric interval under study and had the advantage that the simulation data were generated under its true underlying normal model.
For rather balanced (ratio 1:1 to 1:2) diagnostic case-control studies (which are suitable for proof-of-concept and phase II studies according to the European guideline) the modified Wald intervals are a reasonable alternative to the LT interval. For studies with small sample sizes (about 50 overall) we recommend the Wald interval with continuity correction; for studies with large sample sizes (n ≥ 100) we recommend the Wald interval without continuity correction.
Moreover, an advantage of the Wald intervals is that, in general, they can be computed from published data (only the point estimator and the total sample size are needed), while the LT interval requires individual patient data.
We thank David Couper for the language editing.
1. EMA: Guideline on clinical evaluation of diagnostic agents. Doc. Ref. CPMP/EWP/1119/98/Rev. 1. 2010.
2. Cochrane C, Ebmeier K: Diffusion tensor imaging in parkinsonian syndromes: a systematic review and meta-analysis. Neurology. 2013, 80 (9): 857-864. 10.1212/WNL.0b013e318284070c.
3. Wang L, Fahim M, Hayen A, Mitchell R, Baines L, Lord S, Craig J, Webster A: Cardiac testing for coronary artery disease in potential kidney transplant recipients. Cochrane Database Syst Rev. 2011, 12: 1-105.
4. Ziegler A, König I, Schulz-Knappe M: Challenges in planning and conducting diagnostic studies with molecular biomarkers. Dtsch Med Wochenschr. 2013, 138: 2-13.
5. Ostroff R, Mehan M, Stewart A, Ayers D, Brody E, Williams S, Levin S, Black B, Harbut M, Carbone M, Gobaraju C, Pass H: Early detection of malignant pleural mesothelioma in asbestos-exposed individuals with a non-invasive proteomics-based surveillance tool. PLoS One. 2012, 7 (10): e46091. 10.1371/journal.pone.0046091.
6. Lim R, Lappas M, Riley C, Borregaard N, Moller H, Ahmed N, Rice G: Investigation of human cationic antimicrobial protein-18 (hCAP-18), lactoferrin and CD163 as potential biomarkers for ovarian cancer. J Ovarian Res. 2013, 6 (1): 5. 10.1186/1757-2215-6-5.
7. Dellon E, Chen X, Miller C, Woosley J, Shaheen N: Diagnostic utility of major basic protein, eotaxin-3, and leukotriene enzyme staining in eosinophilic esophagitis. Am J Gastroenterol. 2012, 107: 1503-1511. 10.1038/ajg.2012.202.
8. Bamber D: The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. J Math Psychol. 1975, 12: 387-415.
9. Qin G, Hotilovac L: Comparison of non-parametric confidence intervals for the area under the ROC curve of a continuous-scale diagnostic test. Stat Methods Med Res. 2008, 17: 207-221.
10. Pepe M: The Statistical Evaluation of Medical Tests for Classification and Prediction. 2003, Oxford: Oxford University Press.
11. Brunner E, Puri M: Nonparametric methods in factorial designs. Stat Papers. 2001, 42: 1-52. 10.1007/s003620000039.
12. Newcombe R: Confidence Intervals for Proportions and Related Measures of Effect Size. 2013, London: Chapman & Hall/CRC Biostatistics Series.
13. Newcombe R: Two-sided confidence intervals for the single proportion: comparison of seven methods. Stat Med. 1998, 17: 857-872. 10.1002/(SICI)1097-0258(19980430)17:8<857::AID-SIM777>3.0.CO;2-E.
14. He X, Wu S: Confidence intervals for the binomial proportion with zero frequency. Pharma SUG. 2009, 10-2009.
15. Agresti A, Coull B: Approximate is better than "exact" for interval estimation of binomial proportions. Am Stat. 1998, 52 (2): 119-126.
16. Burton A, Altman D, Royston P, Holder R: The design of simulation studies in medical statistics. Stat Med. 2006, 25: 4279-4292. 10.1002/sim.2673.
17. Ruymgaart F: A unified approach to the asymptotic distribution theory of certain midrank statistics. Lecture Notes on Mathematics, Statistique Non Parametrique Asymptotique, No 821. 1980, Berlin: Springer, 1-18.
18. Brunner E, Munzel U, Puri M: The multivariate nonparametric Behrens-Fisher problem. J Stat Plan Inference. 2002, 108: 37-53. 10.1016/S0378-3758(02)00269-0.
19. SAS Institute Inc: SAS/STAT® 9.3 User’s Guide. 2011, Cary, North Carolina: SAS Institute Inc.
20. Wilson E: Probable inference, the law of succession, and statistical inference. J Am Stat Assoc. 1927, 22: 209-212. 10.1080/01621459.1927.10502953.
21. Clopper C, Pearson E: The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika. 1934, 26 (4): 404-413. 10.1093/biomet/26.4.404.
22. Birnbaum Z, Klose O: Bounds for the variance of the Mann-Whitney statistic. Ann Math Stat. 1957, 28: 933-945.
23. Wieand S, Gail M, James B, James K: A family of non-parametric statistics for comparing diagnostic markers with paired and unpaired data. Biometrika. 1989, 76: 585-592. 10.1093/biomet/76.3.585.
The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1471-2288/14/26/prepub
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.