The result of a diagnostic test is in general not binary (positive/negative) but a quantitative parameter (such as a biomarker). If an appropriate threshold for the quantitative parameter has not yet been defined, the receiver operating characteristic (ROC) curve and in particular the area under this curve, are appropriate for evaluating the overall accuracy of the diagnostic test
[1]. The ROC curve is a plot of sensitivity (true positive rate) and one minus specificity (true negative rate) for each possible threshold value of the biomarker of interest. In the case of complete separation of cases and controls, the area under the ROC curve (AUC) is equal to one. For a diagnostic test, which is no better than chance, the AUC is 0.5. In early phase diagnostic studies, amongst others, the aim is in general to get a first impression of the overall diagnostic accuracy.

Early phase diagnostic studies often exhibit three characteristics:

- 1.
The sample sizes are small. For example, in the systematic review by Cochrane and Ebmeier of diffusion tensor imaging (DTI) as a candidate biomarker for the diagnosis of Parkinson disease, the median total sample size of the 21 selected studies was 32 (mean = 39) [2]. The largest study in the systematic review by Wang et al. of cardiac testing for coronary artery disease in potential kidney transplant recipients included 219 patients [3].

- 2.
A case-control study design with comparable sizes in the two groups is chosen [1, 4] (i.e. case-control ratio ≈1:1). Controls are generally healthy volunteers, patients with benign disease, or patients with a disease within the scope of the differential diagnosis (see for example [5–7]).

- 3.
Diagnostic tests or biomarkers yield large values for the AUC. For example in the systematic review by Wang et al. the AUC’s of the different diagnostic tests were between 0.78 and 0.91 [3].

Many different confidence intervals have been proposed for the AUC. Bamber suggested in 1975 a variance estimator and corresponding confidence interval for the AUC, which was the starting point for many authors
[8]. Qin and Hotilovac compared in 2008 nine nonparametric intervals
[9]. Their conclusion was that the empirical likelihood-based interval and the Mann-Whitney interval with Logit transformation lead to good coverage accuracy. The Mann-Whitney interval without transformation was not recommended by the authors, however, it is used in the ROC statement of PROC LOGISTIC in SAS. A parametric approach is the AUC under the binormal ROC curve (see for example the book of Pepe
[10]).

But because all confidence intervals for the AUC are relatively complicated to implement, and some of them either do not maintain or do not exploit the type I error probability *α* for small sample sizes or large values for the AUC, we investigated alternatives. Our basic approach was to use simple two-sided confidence intervals for a single proportion, because the AUC can be interpreted as a probability (that a randomly chosen diseased individual has a larger value for the biomarker than a randomly chosen non-diseased individual, see for example
[11], formula (1.3)). The simplest confidence interval is the Wald interval, which tends to yield anti-conservative results
[12]. As an alternative we propose a conservative version with a modified variance estimator, based on Bamber
[8]. Newcombe compared seven confidence intervals for a single proportion and recommended the Wilson interval ("Of the methods that perform well, only the score method is calculator friendly.")
[13]. Wilson’s score interval is still suggested, particularly for proportions close to 0 or 1 (see for example the article of He et al.
[14]). Agresti and Coull recommended a modified Wald interval, which has similar behaviour to the Wilson interval for a two-sided type I error of 5%, but a simpler formula
[15]. The Clopper-Pearson interval is another alternative. It is an exact interval but tends to yield conservative results.

In this article we compare the modified, conservative Wald confidence interval (with and without continuity correction) with the Mann-Whitney interval with Logit transformation interval as main reference. Furthermore Bamber’s interval, the Mann-Whitney interval without transformation, and the binormal AUC are included. For the family of intervals for a single proportion Wilson’s score interval (with and without continuity correction), the Agresti-Coull interval, and the Clopper-Pearson interval are added. In line with the recommendations of Burton et al.
[16] we compare the intervals in terms of coverage probability, interval length, and statistical power. The aim of this article is to determine if one of the intervals is an appropriate alternative to the Mann-Whitny interval with Logit transformation; and if so, in which situations it performs well. In the next section we describe the statistical model and the different confidence intervals. Then the results of the simulation study and of an example are presented. Finally, the results are summarised and discussed, and recommendations are given.