 Research Article
 Open Access
 Open Peer Review
Modelling multiple thresholds in meta-analysis of diagnostic test accuracy studies
 Susanne Steinhauser^{1, 2},
 Martin Schumacher^{1} and
Gerta Rücker^{1}
https://doi.org/10.1186/s12874-016-0196-1
© The Author(s) 2016
 Received: 5 May 2016
 Accepted: 26 July 2016
 Published: 12 August 2016
Abstract
Background
In meta-analyses of diagnostic test accuracy, routinely only one pair of sensitivity and specificity per study is used. However, for tests based on a biomarker or a questionnaire, often more than one threshold and the corresponding values of true positives, true negatives, false positives and false negatives are known.
Methods
We present a new meta-analysis approach using this additional information. It is based on the idea of estimating the distribution functions of the underlying biomarker or questionnaire within the non-diseased and diseased individuals. Assuming a normal or logistic distribution, we estimate the distribution parameters in both groups by applying a linear mixed effects model to the transformed data. The model accounts for across-study heterogeneity and dependence of sensitivity and specificity. In addition, a simulation study is presented.
Results
We obtain a summary receiver operating characteristic (SROC) curve as well as the pooled sensitivity and specificity at every specific threshold. Furthermore, an optimal threshold across studies can be determined by maximizing the Youden index. We demonstrate our approach using two meta-analyses, of B-type natriuretic peptide in heart failure and of procalcitonin as a marker for sepsis.
Conclusions
Our approach uses all the available information and results in an estimation not only of the performance of the biomarker but also of the threshold at which the optimal performance can be expected.
Keywords
 Diagnostic accuracy study
 Meta-analysis
 Biomarker
 Threshold
 ROC curve
Background
Systematic reviews of diagnostic test accuracy (DTA) studies give an overview of the performance of a diagnostic test, e.g. one based on a biomarker or a questionnaire. Meta-analysis of DTA studies is traditionally based on one pair of sensitivity and specificity (Se, Sp) per study. Thus each study contributes a two-by-two table, containing the numbers of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN). The aims are twofold: on the one hand, one wants to estimate the pooled sensitivity and specificity with confidence regions; the assumption here is that all studies used similar thresholds for the biomarker underlying the test. On the other hand, if varying thresholds were used in the studies, a summary receiver operating characteristic (SROC) curve is estimated to describe how sensitivity and specificity change as the threshold varies [1].
There are a number of published systematic reviews in which several studies reported more than one threshold with the corresponding values of sensitivity and specificity, and the thresholds themselves were also provided (see for example [2–6]). When using the standard bivariate meta-analysis model, however, one threshold value per study must be selected, and the additional information is ignored. In many cases the selected threshold is optimal with respect to the Youden index, which may lead to a too optimistic evaluation of the biomarker [5, 7, 8]. Thus, it is advantageous to use all the available data. As Leeflang et al. noted, ‘At present, the routinely used models for DTA meta-analysis utilise data on a single sensitivity and specificity pair for each study. Hence, current models do not fully utilise all of the available data. Some progress has been made in this area [9], but more general and robust methods are required’ [10].
Our motivation to work on a new approach also stems from our experience that clinicians often ask at which threshold of the biomarker the diagnostic test performs best, and they expect meta-analysis to answer this question. Methods to determine such an optimal threshold across all studies are therefore urgently needed. We note that methods focussing on ROC curves while ignoring the underlying biomarker are not appropriate to answer this question.
There are already existing approaches that make use of more than one pair of sensitivity and specificity per study. An early approach by Dukic and Gatsonis used ordinal regression accounting for varying numbers of thresholds [11], including a Bayesian hierarchical approach. The multivariate random effects approach proposed by Hamza et al. [9] is a generalization of the standard bivariate model that assumes an equal number of thresholds per study. Putter et al. [12], showing a case with common thresholds, used methods from survival analysis, modelling the marker distributions with a Poisson correlated gamma frailty model. Martínez-Camblor [13] suggested a nonparametric approach directly averaging the within-study ROC curves. Riley and coauthors also proposed two multivariate regression models, both in a diagnostic and in a prognostic context; one of these (option (ii) in [14], subsection 3.2 in [15]) models a functional relationship and is related to our approach. The problem of incomplete reporting of thresholds is discussed in [8].
We present a new approach for meta-analyses of DTA studies adapted to this more extensive type of data. It leads to pooled estimates of sensitivity and specificity as well as to an SROC curve. Furthermore, an optimal threshold across studies can be determined. The fundamental idea is to estimate the distribution functions of the biomarker within the diseased and non-diseased individuals using a linear mixed effects model.
The article is structured as follows. In the next section, after reviewing the standard models, we present our new approach, including determination of an SROC curve and finding an optimal threshold. In the results section we describe the results of a simulation study and apply our approach to two metaanalyses from the literature. After the discussion section we end with conclusions.
Methods
Standard models for meta-analysis of DTA studies
The hierarchical model was originally presented in a Bayesian framework [16, 17]. Its parameters are Θ and Λ, together with their variances, and a shape parameter β which is related to the variance ratio of the two distributions. Θ represents the average logit probability of a positive test result (‘positivity’ [16, 18]) across all studies and groups of patients. The study-specific θ_{ s } are drawn from a normal distribution with mean Θ; they model differences in ‘positivity’ that are due to different thresholds across studies. Λ is the average difference of the expectations of the distributions on the logit scale, that is, a log diagnostic odds ratio, and models accuracy.
Thus the two-dimensional nature of the data is preserved and the variability between the studies is taken into account with random effects. It has been shown that in case of no covariates, the hierarchical model and the bivariate model are equivalent [18, 21].
These standard models are based on the assumption that each study in a meta-analysis contributes only one pair of sensitivity and specificity. This leads to an SROC curve that is not uniquely defined, as there are many different ways to define the straight line in logit space [21]. Furthermore, the SROC curve might be overestimated, as most studies report a somehow optimal pair of sensitivity and specificity [7]. If studies present more than one threshold, the meta-analyst needs to reduce the data and select a single threshold. This procedure does not use the full information [10] and may also lead to bias. As the underlying threshold is ignored in the models, no optimal threshold can be determined.
New parametric approach based on several thresholds per study
The novel approach we present is characterized by the estimation of the cumulative distribution functions of the biomarker the test is based on, within the non-diseased and diseased individuals, respectively [22]. This approach is applicable if several studies of a meta-analysis report more than one threshold and the corresponding values of sensitivity and specificity. More specifically, for each threshold reported by a study to be included in the meta-analysis, we need the threshold itself and the numbers of TP, FP, TN and FN.
We consider a continuous biomarker that is observed in each individual of two groups, non-diseased and diseased. Given a fixed threshold of the biomarker, without loss of generality, a test result is defined as positive if the observed value exceeds the threshold. We focus on the probability of a negative test result within the non-diseased individuals (specificity) and within the diseased (1−sensitivity). Specificity and 1−sensitivity are interpreted as functions of the threshold x: the specificities provide data points of the cumulative distribution function (cdf) of the biomarker for the non-diseased individuals, and the 1−sensitivities provide data points of the cdf for the diseased individuals. We make a distributional assumption for the biomarker; for example, we may assume a normal or logistic distribution. As an aside, we note that we could equivalently model the ‘survival’ functions instead of the cdfs, which would mean focusing on 1−specificity and sensitivity, as in the ROC curve.
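To make this concrete, consider a small numerical sketch (in Python for illustration; the paper's own computations use R, and the parameter values below are invented): under a normal assumption, the specificity at threshold x is the cdf of the non-diseased group evaluated at x, and the sensitivity is one minus the cdf of the diseased group.

```python
from math import erf, sqrt

def norm_cdf(x, mu, sigma):
    """Normal cdf via the error function (no external packages needed)."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

# Hypothetical marker distributions (illustrative values, not the paper's data):
mu0, sd0 = 0.0, 1.0   # non-diseased
mu1, sd1 = 2.5, 1.0   # diseased

def spec(x):   # P(marker <= x | non-diseased): the cdf of the non-diseased
    return norm_cdf(x, mu0, sd0)

def sens(x):   # P(marker > x | diseased): 1 - cdf of the diseased
    return 1.0 - norm_cdf(x, mu1, sd1)

x = 1.25  # a threshold midway between the two means
print(round(spec(x), 3), round(sens(x), 3))  # → 0.894 0.894
```

With equal standard deviations and a threshold halfway between the means, sensitivity and specificity coincide, as expected by symmetry.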
For each study, an arbitrary number of thresholds (not necessarily equal across studies) and the numbers of TP, FP, FN and TN at each threshold are assumed to be known. With these data we aim to estimate the parameters of the distribution functions of the biomarker within the non-diseased and diseased individuals, respectively.
For study s and threshold x_{ si }, let p_{0si} denote the observed proportion of negative test results among the non-diseased individuals (the specificity) and p_{1si} the corresponding proportion among the diseased (1−sensitivity). In the most general model, the transformed proportions are modelled as linear functions of the threshold,

h(p_{0si}) = α_{0} + a_{0s} + (β_{0} + b_{0s}) x_{ si } + e_{ si }
h(p_{1si}) = α_{1} + a_{1s} + (β_{1} + b_{1s}) x_{ si } + f_{ si },

where h is the transformation, for example the logit or probit function, α_{0}, α_{1} are fixed intercepts and β_{0}, β_{1} fixed slopes.
The random intercepts of non-diseased and diseased individuals are denoted a_{0s} and a_{1s}, respectively, and the random slopes of non-diseased and diseased individuals b_{0s} and b_{1s}, respectively. Whereas diseased and non-diseased individuals within the same study are not correlated, the across-study correlation must be modeled (parameters ρ_{1},…,ρ_{6}). The residual errors e_{ si } and f_{ si } are independent of the random intercepts and slopes.
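The linearity that the model exploits can be seen in a minimal sketch (Python, illustrative values; the paper fits the full mixed model with lmer() in R): under a logistic assumption, logit(F(x)) = (x − μ)/s is exactly linear in the threshold, so ordinary least squares on noise-free ‘reported’ specificities recovers the distribution parameters of one group.

```python
from math import exp, log

def logistic_cdf(x, mu, s):
    """cdf of a logistic(mu, s) distribution."""
    return 1.0 / (1.0 + exp(-(x - mu) / s))

def logit(p):
    return log(p / (1.0 - p))

# Hypothetical specificities at several thresholds, generated noise-free
# from a logistic(mu=1.0, s=0.8) marker distribution (invented values):
mu_true, s_true = 1.0, 0.8
xs = [-1.0, 0.0, 1.0, 2.0, 3.0]
ys = [logit(logistic_cdf(x, mu_true, s_true)) for x in xs]

# Ordinary least squares for y = alpha + beta * x (closed form):
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
beta = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
        / sum((x - xbar) ** 2 for x in xs))
alpha = ybar - beta * xbar

# invert alpha = -mu/s, beta = 1/s to recover the parameters
mu_hat, s_hat = -alpha / beta, 1.0 / beta
print(round(mu_hat, 6), round(s_hat, 6))  # → 1.0 0.8
```

The mixed model of the paper does the same per group, but jointly for both groups and with random study effects on intercept and slope.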
Linear mixed effects models listed according to their random effects structure
Model  Specification
DIDS   Different random intercepts and different random slopes
CIDS   Common random intercept and different random slopes: a_{0s}=a_{1s}=a_{ s }
DICS   Different random intercepts and common random slope: b_{0s}=b_{1s}=b_{ s }
CICS   Common random intercept and common slope: a_{0s}=a_{1s}=a_{ s }, b_{0s}=b_{1s}=b_{ s }
DS     Different random slopes: a_{0s}=a_{1s}=0
CS     Common random slope: a_{0s}=a_{1s}=0, b_{0s}=b_{1s}=b_{ s }
DI     Different random intercepts: b_{0s}=b_{1s}=0
CI     Common random intercept: a_{0s}=a_{1s}=a_{ s }, b_{0s}=b_{1s}=0
To choose between models, we first decided whether to use the simplified models or the general ones. Then we applied the REML (restricted maximum likelihood) criterion [23, 24], which selects the most suitable model from a range of models with the same fixed effects but differing random effects. Finally, the model with the smallest REML criterion was selected.
Thus, it is necessary that the β_{ j } (j=0,1) are positive to obtain positive dispersions. That means that specificity and 1−sensitivity, i.e. the probabilities of a negative test result, should increase with increasing threshold in both groups across all studies. If a logistic distribution is assumed, the \(\hat \sigma _{j}~(j = 0,1)\) have to be multiplied by \(\pi /\sqrt {3}\) to obtain standard deviations. Note that fixing β_{0}=β_{1} in the linear regression models amounts to assuming that the biomarker distributions of non-diseased and diseased individuals have equal variances.
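The back-transformation from the fitted coefficients to distribution parameters follows from the fitted line h = α + βx crossing zero at the distribution's median: μ_{ j } = −α_{ j }/β_{ j } and scale 1/β_{ j }, with the π/√3 factor for the logistic case. A small sketch with hypothetical coefficients (not taken from the paper's examples):

```python
from math import pi, sqrt

def distribution_parameters(alpha, beta, logistic=True):
    """Convert regression coefficients back to location and spread.
    beta must be positive, otherwise the implied dispersion is negative."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    mu = -alpha / beta      # location: where the fitted line crosses zero
    scale = 1.0 / beta      # dispersion (scale parameter)
    # for a logistic scale parameter, SD = scale * pi / sqrt(3)
    sd = scale * pi / sqrt(3.0) if logistic else scale
    return mu, sd

# hypothetical coefficients for one group (illustrative only):
print(distribution_parameters(-1.25, 1.25))
```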
For estimation we used the lmer() function in R [25] with REML estimation and inverse variance weights scaled to mean one [26]. To avoid problems with zero values, we added a continuity correction of 0.5 to the numbers TN_{ si }, TP_{ si }, FN_{ si } and FP_{ si }. In the case of the logit transformation, the delta method (with continuity correction) leads to the variance estimates (TN_{ si }+0.5)^{−1}+(FP_{ si }+0.5)^{−1} (disease-free) and (TP_{ si }+0.5)^{−1}+(FN_{ si }+0.5)^{−1} (diseased) and the corresponding inverse variance weights. For the probit transformation h=Φ^{−1}, the delta method leads to analogous weights, see the R code provided in Additional file 1.
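The stated delta-method variances and the mean-one scaling can be sketched as follows (Python for illustration; the cell counts are invented, and the paper's implementation is the R code in Additional file 1):

```python
def logit_var(a, b):
    """Delta-method variance of a logit-transformed proportion a/(a+b),
    with a continuity correction of 0.5 added to each cell."""
    return 1.0 / (a + 0.5) + 1.0 / (b + 0.5)

# one hypothetical threshold from one study:
TN, FP, TP, FN = 80, 20, 45, 5
var0 = logit_var(TN, FP)   # disease-free group (specificity)
var1 = logit_var(TP, FN)   # diseased group (1 - sensitivity)

weights = [1.0 / var0, 1.0 / var1]        # inverse-variance weights
mean_w = sum(weights) / len(weights)
weights = [w / mean_w for w in weights]   # scaled to mean one, as in the paper
print(var0, var1, weights)
```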
To demonstrate our models on the examples, we used only models of the general form, i.e. models in which the fixed slopes of non-diseased and diseased individuals may differ, because these performed better in the simulation study (models indicated by ‘*’). To choose one model from this range, we selected the one with the smallest REML criterion. We used a weighting parameter λ_{ w } of 0.5, meaning that sensitivity and specificity were weighted equally.
SROC curve and optimal threshold
Once the model parameters are estimated, the underlying distribution functions are determined. From these, one can read off the pooled sensitivity and specificity values at every threshold and also specify confidence regions. An SROC curve and an optimal threshold are also derived.
Sensitivity, specificity, confidence regions We derived confidence intervals as follows. From the given lmer() object, we extracted the estimates (hats omitted) of α_{0},α_{1},β_{0},β_{1},Var(α_{0}),Var(α_{1}),Var(β_{0}),Var(β_{1}),Cov(α_{0},β_{0}),Cov(α_{1},β_{1}).
Confidence bands were obtained by adding/subtracting the standard errors times the normal quantile z_{0.975} to the transformed estimates and backtransforming the confidence limits using h^{−1}.
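The back-transformation step can be sketched as follows (Python for illustration; the estimate and standard error are invented, and h is taken to be the logit so that h^{−1} is the expit function):

```python
from math import exp

def expit(y):
    """Inverse of the logit transformation."""
    return 1.0 / (1.0 + exp(-y))

def ci_backtransform(est_logit, se, z=1.959964):
    """95% pointwise limits: est +/- z*SE on the logit scale, then expit."""
    return expit(est_logit - z * se), expit(est_logit + z * se)

# hypothetical logit-scale estimate and standard error:
lo, hi = ci_backtransform(1.5, 0.3)
print(round(lo, 3), round(hi, 3))
```

Because expit is monotone, the back-transformed limits are guaranteed to lie in (0, 1) and to bracket the point estimate.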
From the estimated distribution functions, the pooled specificity and sensitivity at threshold x are Sp(x) = F_{μ0,σ0}(x) and Se(x) = 1−F_{μ1,σ1}(x); letting x run through all possible thresholds traces out the SROC curve, and the Youden index is J(x) = Se(x)+Sp(x)−1 = F_{μ0,σ0}(x)−F_{μ1,σ1}(x). Here F_{μ,σ} is the distribution function with location and scaling parameters μ and σ, e.g., Φ_{μ,σ} under the normal assumption with mean μ and standard deviation σ [1].
For the logistic distribution assumption of the biomarker, no analytical solution of the maximization problem of the Youden index has been found. Thus we implemented a fixed point iteration to compute the optimal threshold. For a discrete ordinal scale, the maximum can be found by maximizing the Youden index on the finite set of possible thresholds. For the normal distribution assumption, we also derived a confidence interval for the optimal threshold using the delta method which is implemented in our R code, see Additional file 1 (for details of the derivation, see (§3.3.6.3 [22])).
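For the normal case, the maximizer of the Youden index J(x) = F_{μ0,σ0}(x) − F_{μ1,σ1}(x) is the point where the two densities intersect; with unequal standard deviations this reduces to a quadratic equation. A sketch (Python for illustration, with invented parameter values; this is the textbook density-intersection argument, not the authors' R implementation), cross-checked against a grid search:

```python
from math import erf, log, sqrt

def norm_cdf(x, mu, sd):
    return 0.5 * (1.0 + erf((x - mu) / (sd * sqrt(2.0))))

def optimal_threshold(mu0, sd0, mu1, sd1):
    """Maximizer of J(x) = F0(x) - F1(x) for two normal distributions:
    the intersection point of the densities that lies between the means."""
    if sd0 == sd1:
        return 0.5 * (mu0 + mu1)
    # equate the two normal log-densities -> quadratic a*x^2 + b*x + c = 0
    a = 1.0 / sd0**2 - 1.0 / sd1**2
    b = 2.0 * (mu1 / sd1**2 - mu0 / sd0**2)
    c = mu0**2 / sd0**2 - mu1**2 / sd1**2 + 2.0 * log(sd0 / sd1)
    disc = sqrt(b * b - 4.0 * a * c)
    roots = [(-b - disc) / (2.0 * a), (-b + disc) / (2.0 * a)]
    return next(r for r in roots if min(mu0, mu1) <= r <= max(mu0, mu1))

x_opt = optimal_threshold(0.0, 1.0, 2.5, 2.0)

# cross-check against a brute-force grid search over the Youden index
grid = [i / 1000.0 for i in range(-1000, 3500)]
x_grid = max(grid, key=lambda x: norm_cdf(x, 0.0, 1.0) - norm_cdf(x, 2.5, 2.0))
print(x_opt, x_grid)
```

With equal standard deviations the optimum is simply the midpoint of the two means.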
Simulation study
The simulation scenarios were defined by the following settings:

- Number of studies: 10, 20, 30
- True overall normal distributions of the biomarker:
  - Mean: 0 / 2.5 (non-diseased / diseased)
  - Standard deviation: 1.5 / 1.5 (‘same’) or 1 / 2 (‘different’) (non-diseased / diseased)
- Random noise: to obtain study-specific distributions, random noise was added to the true overall distributions; its extent was determined by a visual comparison with the examples.
  - Noise on the mean: N(0,τ^{2}) with τ = 0 (‘no heterogeneity’), 0.5 (‘moderate heterogeneity’), 1 (‘large heterogeneity’) or 1.5 (‘huge heterogeneity’), symmetrically truncated so that the mean of the study-specific distribution of the diseased individuals was greater than that of the non-diseased
  - Noise on the standard deviation: N(0,τ^{2}) with τ = 0, 0.3, 0.4 or 0.5, likewise symmetrically truncated in order to guarantee non-negative study-specific standard deviations
- Total number of individuals per study: Lognormal(5, 1)
- Proportion of diseased individuals: N(0.5, 0.04), truncated to the interval (0.2, 0.8)
- Number of thresholds per study: Pois(λ = 1.3 or 2), rejecting zeros, or fixed to 5
- Values of thresholds: spaced equidistantly between the 40 % quantile of the study-specific distribution of the non-diseased individuals and the 60 % quantile of the study-specific distribution of the diseased individuals
- True sensitivity and specificity: once the distributions were fixed, the true sensitivity and specificity were derived as the areas under the respective densities to either side of the threshold
- True optimal threshold: the point where the densities intersect, that is, the point where the Youden index is maximized, with sensitivity and specificity weighted equally
- Models: CI, DS, CICS, CIDS, *CI, *DS, *CICS, *CIDS

We did not include the most complex models DIDS and *DIDS because the data were mostly insufficient for them. For the computational implementation of the linear mixed effects models we used the lmer() function of the R package lme4 (version 1.1-7) with REML estimation. For weighting of the studies we used inverse variance weights scaled to mean one.
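The data-generating scheme above can be sketched for a single study as follows (Python for illustration, not the authors' R simulation code; the exact truncation rules are simplified, and reading N(0.5, 0.04) as variance 0.04, i.e. SD 0.2, is our assumption):

```python
import random
from math import exp
from statistics import NormalDist

random.seed(20160812)

def rpois(lam):
    """Poisson draw via Knuth's method (the random module has no Poisson)."""
    limit, k, p = exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p < limit:
            return k
        k += 1

# 'same SDs' scenario with 'moderate' heterogeneity, as defined above
mu0, mu1, sd0, sd1 = 0.0, 2.5, 1.5, 1.5
tau_mu, tau_sd = 0.5, 0.3

# study-specific parameters: redraw until diseased mean > non-diseased mean
while True:
    s_mu0 = mu0 + random.gauss(0.0, tau_mu)
    s_mu1 = mu1 + random.gauss(0.0, tau_mu)
    if s_mu1 > s_mu0:
        break
s_sd0 = abs(sd0 + random.gauss(0.0, tau_sd))  # keep SDs positive
s_sd1 = abs(sd1 + random.gauss(0.0, tau_sd))

n_total = max(2, round(random.lognormvariate(5.0, 1.0)))   # study size
prev = min(0.8, max(0.2, random.gauss(0.5, 0.2)))          # prevalence
n_dis = round(n_total * prev)

k = 0
while k == 0:                 # number of thresholds: Pois(1.3), no zeros
    k = rpois(1.3)

# thresholds equidistant between the 40% quantile (non-diseased)
# and the 60% quantile (diseased) of the study-specific distributions
q_lo = NormalDist(s_mu0, s_sd0).inv_cdf(0.4)
q_hi = NormalDist(s_mu1, s_sd1).inv_cdf(0.6)
thresholds = [q_lo + (i + 1) * (q_hi - q_lo) / (k + 1) for i in range(k)]

def rbinom(n, p):             # simple binomial draw
    return sum(random.random() < p for _ in range(n))

# 2x2 table at each threshold ('positive' = marker above threshold)
tables = []
for x in thresholds:
    TP = rbinom(n_dis, 1.0 - NormalDist(s_mu1, s_sd1).cdf(x))
    FP = rbinom(n_total - n_dis, 1.0 - NormalDist(s_mu0, s_sd0).cdf(x))
    tables.append((TP, n_dis - TP, FP, n_total - n_dis - FP))
print(n_total, n_dis, thresholds, tables)
```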
We investigated bias, mean squared error (MSE) and coverage of the distribution parameters μ_{0},μ_{1},σ_{0} and σ_{1} and of sensitivity and specificity at three points: at 0, at the true optimal threshold and at 2.5. Furthermore, we investigated bias and MSE for the optimal threshold. In addition, we documented how often error messages occurred, particularly how often a negative slope was observed (making model estimation impossible), and the percentage of runs where a warning message signaled that convergence could not be achieved.
Results
Results of the simulation study
Distribution parameters The results for the distribution parameters will not be discussed in detail, as they were very similar to those for sensitivity and specificity, and estimating sensitivity and specificity correctly is the primary goal. There were outliers in the bias of the distribution parameters reaching values of up to 100 when few thresholds per study were available, but generally the bias decreased markedly with an increasing number of thresholds.
Optimal threshold In the meta-analysis an overall optimal threshold was estimated. The bias of this optimal threshold was small but increased slightly with increasing heterogeneity (see the bottom panel of Fig. 2, with a Poisson distribution parameter λ=1.3 for the number of thresholds). It was smaller in the case of equal standard deviations (plots at the bottom) than in the case of different standard deviations, where the bias of models forcing equal slopes was markedly higher than that of models allowing for different slopes. The bias decreased with an increasing number of thresholds per study (not shown). The MSE behaved similarly to the bias and thus will not be discussed.
Examples
To illustrate our approach we applied it to two data sets of published meta-analyses, both with a continuous marker. We obtain pooled sensitivity and specificity, an optimal threshold and an SROC curve. For both examples we chose the logit transformation to allow comparing the results with those in the original publications.
Example 1: Diagnostic accuracy of B type natriuretic peptides in heart failure
In a recent meta-analysis Roberts et al. investigated the diagnostic accuracy of, among others, B-type natriuretic peptide in heart failure and found 26 studies, several of which reported more than one threshold ([2], Fig. 1). To use the standard bivariate model, they grouped the data according to recommended thresholds and performed two meta-analyses (100 ng/L and 100–500 ng/L). For thresholds ≥ 500 ng/L, Roberts et al. performed no meta-analysis because there were only four studies, which showed much heterogeneity.
Sensitivity and specificity for selected thresholds, based on model *DICS and compared to the results of Roberts et al. [2] (for thresholds greater than or equal to 500 ng/L, Roberts et al. performed no meta-analysis)
                    New model                                   Roberts et al. [2]
Threshold [ng/L]    Sensitivity         Specificity             Sensitivity         Specificity
100                 0.94 [0.92, 0.95]   0.63 [0.53, 0.72]       0.95 [0.93, 0.96]   0.63 [0.52, 0.73]
226                 0.84 [0.80, 0.87]   0.84 [0.77, 0.90]       0.85 [0.81, 0.88]   0.86 [0.79, 0.91]
500                 0.64 [0.56, 0.71]   0.94 [0.90, 0.97]       –                   –
1000                0.41 [0.31, 0.52]   0.98 [0.96, 0.99]       –                   –

All values are point estimates with 95 % confidence intervals.
Example 2: Procalcitonin as a diagnostic marker for sepsis
Wacker et al. [5] published a systematic review on the diagnostic accuracy of procalcitonin as a marker for sepsis. Although 11 of the 31 primary studies had reported sensitivity and specificity at different (up to five) thresholds, the authors chose one pair of sensitivity and specificity per study for their meta-analysis using the bivariate model. They obtained a pooled sensitivity of 0.77 [0.72; 0.81] and a specificity of 0.79 [0.74; 0.84].
We extracted data for additional thresholds from the primary studies and found 54 data points in total for 26 different values of the threshold.
Discussion
We have described and evaluated a new approach for meta-analysis of diagnostic test accuracy studies, where several studies report more than one threshold and the corresponding values of sensitivity and specificity. The approach uses a common parametric assumption (normal or logistic) for the distribution of a continuous biomarker. The idea is to estimate the distribution functions of the biomarker, one distribution function within the non-diseased and one within the diseased study population. This is achieved by the use of a mixed effects model with study as a random factor.
We applied our approach to a number of examples with both continuous biomarkers and ordinal questionnaires; here we report results for two continuous biomarkers. In both examples we found large heterogeneity between the studies. Nevertheless, our approach led to convincing results: the distribution functions and the pooled sensitivity and specificity with their confidence intervals seemed reasonable and were similar to already published results.
Our new approach for metaanalysis of DTA studies has its strengths and limitations.
Strengths Our approach uses multiple pairs of sensitivity and specificity and their corresponding thresholds per study. In comparison with traditional approaches, this has several advantages: we use all the given information and do not need to select one pair of sensitivity and specificity per study. After assuming a distribution type, we need no additional assumptions for the SROC curve. In contrast to the alternative approaches of Hamza et al. [9] and Putter et al. [12], our approach can deal with a varying number of thresholds per study; we found varying numbers of thresholds in most of the systematic reviews that provided multiple thresholds at all.
The models are based on a parametric assumption. The assumption of a normally or logistically distributed biomarker with different parameters for the non-diseased and diseased individuals is very common [1]. It seemed a natural idea to estimate the underlying distributions. Directly and without further assumptions, we obtain all desired quantities: sensitivity and specificity, the SROC curve, the Youden index and the optimal threshold. By using a mixed effects model we acknowledge the diversity of the studies, while the data of each study have in principle the same structure. By admitting correlated random effects, we respect the bivariate character of the study data.
The logit and the probit transformation often provided similar results. By log-transforming the biomarker values, we can also handle skewed distributions (log-logistic or log-normal). In fact, every cumulative distribution function F can be transformed into a linear model by using the transformation h=F^{−1}. In this way our basic idea can be extended to other distributions.
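The log-transformation remark can be verified in a small sketch (Python, hypothetical parameters): if the marker is log-normal, then F(x) = Φ((log x − μ)/σ), so the probit-transformed cdf is exactly linear in log(x), with intercept −μ/σ and slope 1/σ.

```python
from math import log
from statistics import NormalDist

mu, sigma = 1.0, 0.5  # hypothetical log-normal parameters

def lognorm_cdf(x):
    """cdf of a lognormal(mu, sigma) marker."""
    return NormalDist().cdf((log(x) - mu) / sigma)

probit = NormalDist().inv_cdf  # h = F^{-1} for the standard normal

xs = [1.0, 2.0, 4.0]
ys = [probit(lognorm_cdf(x)) for x in xs]

# check linearity in log(x): y = (log x - mu)/sigma up to numerical error
for x, y in zip(xs, ys):
    assert abs(y - (log(x) - mu) / sigma) < 1e-7

print("slope:", 1.0 / sigma, "intercept:", -mu / sigma)
```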
Standard approaches such as the bivariate model, and also the approach by Martínez-Camblor [13], are based solely on knowledge of pairs of sensitivity and specificity or ROC curves, without making use of threshold information. Often this information is missing in the primary studies. However, we found a number of reviews where it was present or could be extracted from the primary studies in hindsight. Our approach establishes a link between threshold information and the ROC curve and allows us to determine an optimal threshold across all studies. This is important information for clinicians: in clinical routine it is of interest not only which biomarker is best for a specific illness, but also at which threshold an optimal discrimination between non-diseased and diseased individuals can be achieved. Knowledge of a summary ROC curve alone does not allow inference on the biomarker.
Whereas most physiological biomarkers can be seen as continuous, questionnaires or psychological scales often take only integer values and are therefore ordinally scaled. However, in practice they are often analyzed as continuous, also in meta-analyses [29]. Our approach could probably be used for psychological scales as well, but we did not systematically investigate this.
Limitations Our model, like the standard bivariate model, is a two-stage approach, based on the estimated transformed study-specific sensitivities and specificities and using inverse variance weights, while ignoring the uncertainty of their variances at the study level. It is thus a linear mixed model, not a generalized linear mixed model. A related problem is the need for a continuity correction, at least in the case of zeros in the two-by-two tables, which has been criticized [30].
The approach differs from others in that we did not use a binomial model for sensitivity and specificity at the study level. This would have led to two binomial parameters per threshold, with additional requirements of monotonicity and correlation ([14], option (i)). We think it is more natural to look at the distributions, and we refer to an analogous situation in survival analysis, where it is standard to consider a time-to-event variable instead of jointly modeling binary outcomes such as, say, ‘one-year mortality’, ‘two-year mortality’ and ‘five-year mortality’.
Some care has to be taken concerning the concept of an optimal threshold across studies. It is only reasonable if a biomarker value has the same meaning in all studies and does not differ because of laboratory conditions. If the thresholds are very heterogeneous, this has to be doubted. Of course, the question also arises to what extent it is reasonable to pool sensitivity and specificity at all if the studies are very inhomogeneous.
A weak point of the method is that it can estimate decreasing proportions of negative test results with an increasing threshold. Whereas this is impossible within a single study, it may happen when data from several studies are combined. Thus, if the between-study heterogeneity is huge and the number of thresholds is low, a valid (positive) regression slope cannot be assured, and we do not recommend our method for such data. The problem becomes less relevant if there is sufficient threshold information.
If the studies report too few data points, some of the linear mixed effects models may not be applicable, as the number of parameters to be estimated may be too large. For some models and data sets, numerical instability occurred. In addition, the fixed point iteration for the optimal threshold under the logistic distribution assumption did not converge in a few cases.
Model selection remains a challenge. We investigated several approaches, including the Akaike information criterion (AIC) and the conditional AIC (cAIC), which allows comparing mixed models with different fixed effects [23, 31]. However, we encountered problems with the cAIC, as the ordering of models surprisingly depended on how the weights were scaled, which seemed implausible. We thus decided to apply the REML criterion [23].
Our R code offers a broad range of models, and users may decide which model or which selection criteria they want to use.
Further potential extensions of the method are the derivation of confidence intervals for the optimal threshold under a logistic distribution assumption and accounting for the uncertainty of the optimal threshold in the confidence intervals of sensitivity and specificity at this point. Also, a nonparametric analogue has not been investigated so far.
Simulation study The simulation study showed that the quality of the estimates deteriorates with increasing heterogeneity. Generally, reasonable results of the new approach can only be expected for the heterogeneity levels ‘no’ and ‘moderate’. However, since the distribution estimates for almost all data examples have been convincing, we assume that in practice heterogeneity is mostly moderate. Martínez-Camblor [13] in his simulation study considered only levels of heterogeneity smaller than the ‘moderate’ level used here. Bias and MSE of the estimates decreased with an increasing number of thresholds per study.
For data with at most moderate heterogeneity, the linear mixed models allowing for different fixed slopes (denoted with ‘*’) are to be preferred. They led to smaller bias and MSE in scenarios where the standard deviations were different, and to equivalent bias and MSE in scenarios where the standard deviations were the same.
In most circumstances the bias of sensitivity and specificity was smallest for the most complex models examined, the CIDS and *CIDS models (common random intercept and different random slopes). On the other hand, we observed that the more complex the mixed effects model, the more convergence problems occurred in the lmer() function.
Unfortunately, the coverage of the estimates of the distribution parameters, as well as of sensitivity and specificity, was far from satisfactory. This may be due to the existing bias, but more probably to incorrect confidence intervals. For the confidence intervals we assumed the parameters to be approximately normally distributed, but possibly the normal quantiles led to confidence intervals that were too narrow. We also note that the estimates of the ‘true’ sensitivity and specificity depend on how well the distributions and their point of intersection could be estimated. Further, the two-stage model we employed did not account for the uncertainty of estimating sensitivities and specificities at the first level.
Finally, a possible reason for the poor coverage is that we used the standard errors of the fixed effects part of the parameters for estimating the standard errors of the regressions. Methods for integrating the random effects variance into the estimation of confidence intervals, if possible in a one-stage framework, have still to be developed.
Our simulation study was not designed to compare our approach to competing methods. Extensive simulations comparing different methods should be performed in the future.
Conclusions
Although our new approach can still be improved in some aspects, it accounts for the heterogeneity of the studies and the bivariate character of the data and includes multiple thresholds per study, possibly differing in number. We proposed a total of 16 linear mixed models which differ in their fixed and random effects structure for estimating the distribution functions. For model selection, we considered only the models allowing for differing fixed slopes, as they led to better results in the simulation study, and applied the REML criterion. However, we would prefer to select the model of choice according to a selection criterion in a single step.
Our approach is feasible if all studies used equal measurement methods and if most studies provide information on more than one threshold. Then we benefit from its advantage that both an SROC curve and an optimal threshold can be determined. This is the setting for which we recommend the new approach.
Abbreviations
AIC, Akaike information criterion; cAIC, conditional Akaike information criterion; cdf, cumulative distribution function; CI, common random intercept; CICS, common random intercept and common slope; CS, common random slope; CIDS, common random intercept and different random slopes; DI, different random intercepts; DICS, different random intercepts and common random slope; DIDS, different random intercepts and different random slopes; DS, different random slopes; DTA, diagnostic test accuracy; FN, false negative; FP, false positive; MSE, mean squared error; REML, restricted maximum likelihood; ROC, receiver operating characteristic; SD, standard deviation; Se, sensitivity; Sp, specificity; SROC, summary receiver operating characteristic; TN, true negative; TP, true positive
Declarations
Acknowledgements
Andrea Feigel is gratefully acknowledged for the extraction of the metaanalysis data.
Funding
Gerta Rücker’s work was supported by the Deutsche Forschungsgemeinschaft (grant number RU 1747/1-1).
Availability of data and materials
R code for the analysis is provided as supplemental material (Additional files 1 and 2). The data of the first example (B-type natriuretic peptide in heart failure) are presented in the given reference [2]. Part of the data of the second example (procalcitonin for sepsis) is given in [5]; in addition, we extracted data for multiple thresholds from the primary references given there. The data of both examples are attached as Additional files 3 and 4.
Authors’ contributions
SS wrote the manuscript and the R functions, contributed to the statistical modelling and did all simulations. GR conceived the idea, contributed to the modelling, the calculations and the R code, wrote an earlier draft of the manuscript and made the final revision. MS contributed to the modelling. All authors contributed to the writing and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable. This is statistical-methodological work. All data are based on previously published material (see section ‘Availability of data and materials’).
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
References
 1. Pepe MS. The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford, UK: Oxford University Press; 2004.
 2. Roberts E, Ludman AJ, Dworzynski K, Al-Mohammad A, Cowie MR, McMurray JJ, Mant J, on behalf of the NICE Guideline Development Group for Acute Heart Failure. The diagnostic accuracy of the natriuretic peptides in heart failure: systematic review and diagnostic meta-analysis in the acute care setting. Br Med J. 2015;350:910.
 3. Aertgeerts B, Buntinx F, Kester A. The value of the CAGE in screening for alcohol abuse and alcohol dependence in general clinical populations: a diagnostic meta-analysis. J Clin Epidemiol. 2004;57(1):30–9.
 4. Zhelev Z, Hyde C, Youngman E, Rogers M, Fleming S, Slade T, Coelho H, Jones-Hughes TVN. Diagnostic accuracy of single baseline measurement of Elecsys Troponin T high-sensitive assay for diagnosis of acute myocardial infarction in emergency department: systematic review and meta-analysis. BMJ. 2015;350:h15. doi:10.1136/bmj.h15.
 5. Wacker C, Prkno A, Brunkhorst FM, Schlattmann P. Procalcitonin as a diagnostic marker for sepsis: a systematic review and meta-analysis. Lancet Infect Dis. 2013;13(5):426–35.
 6. Vouloumanou EK, Plessa E, Karageorgopoulos DE, Mantadakis E, Falagas ME. Serum procalcitonin as a diagnostic marker for neonatal sepsis: a systematic review and meta-analysis. Intensive Care Med. 2011;37(5):747–62.
 7. Rücker G, Schumacher M. Summary ROC curve based on the weighted Youden index for selecting an optimal cut-point in meta-analysis of diagnostic accuracy. Stat Med. 2010;29:3069–78.
 8. Riley RD, Ahmed I, Ensor J, Takwoingi Y, Kirkham A, Morris RK, Noordzij JP, Deeks JJ. Meta-analysis of test accuracy studies: an exploratory method for investigating the impact of missing thresholds. Syst Rev. 2015;4:12.
 9. Hamza TH, Arends LR, van Houwelingen HC, Stijnen T. Multivariate random effects meta-analysis of diagnostic tests with multiple thresholds. BMC Med Res Methodol. 2009;9:73.
 10. Leeflang MM, Deeks JJ, Takwoingi Y, Macaskill P. Cochrane diagnostic test accuracy reviews. Syst Rev. 2013;2:82. doi:10.1186/2046-4053-2-82.
 11. Dukic V, Gatsonis C. Meta-analysis of diagnostic test accuracy assessment studies with varying number of thresholds. Biometrics. 2003;59:936–46. doi:10.1111/j.0006-341X.2003.00108.x.
 12. Putter H, Fiocco M, Stijnen T. Meta-analysis of diagnostic test accuracy studies with multiple thresholds using survival methods. Biometrical J. 2010;52(1):95–110.
 13. Martínez-Camblor P. Fully non-parametric receiver operating characteristic curve estimation for random-effects meta-analysis. Stat Methods Med Res. 2014. doi:10.1177/0962280214537047.
 14. Riley R, Takwoingi Y, Trikalinos T, Guha A, Biswas A, Ensor J, Morris RK, Deeks J. Meta-analysis of test accuracy studies with multiple and missing thresholds: a multivariate-normal model. J Biometrics Biostat. 2014;5:196. doi:10.4172/2155-6180.1000196.
 15. Riley RD, Elia EG, Malin G, Hemming K, Price MP. Multivariate meta-analysis of prognostic factor studies with multiple cut-points and/or methods of measurement. Stat Med. 2015;34(17):2481–96.
 16. Rutter CM, Gatsonis CA. A hierarchical regression approach to meta-analysis of diagnostic test accuracy evaluations. Stat Med. 2001;20:2865–84.
 17. Macaskill P. Empirical Bayes estimates generated in a hierarchical summary ROC analysis agreed closely with those of a full Bayesian analysis. J Clin Epidemiol. 2004;57(9):925–32.
 18. Harbord RM, Deeks JJ, Egger M, Whiting P, Sterne JA. A unification of models for meta-analysis of diagnostic accuracy studies. Biostatistics. 2007;8:239–51.
 19. Reitsma JB, Glas AS, Rutjes AW, Scholten RJ, Bossuyt PM, Zwinderman AH. Bivariate analysis of sensitivity and specificity produces informative summary measures in diagnostic reviews. J Clin Epidemiol. 2005;58(10):982–90.
 20. Chu H, Cole SR. Bivariate meta-analysis of sensitivity and specificity with sparse data: a generalized linear mixed approach. J Clin Epidemiol. 2006;59:1331–3.
 21. Arends LR, Hamza TH, van Houwelingen JC, Heijenbrok-Kal MH, Hunink MG, Stijnen T. Bivariate random effects meta-analysis of ROC curves. Med Decis Making. 2008;28(5):621–38.
 22. Steinhauser S. Determining optimal cut-offs in the meta-analysis of diagnostic test accuracy studies. Master’s thesis, University of Freiburg; 2015. https://www.freidok.unifreiburg.de/data/10827. Accessed 5 Aug 2016.
 23. Müller S, Scealy JL, Welsh AH. Model selection in linear mixed models. Stat Sci. 2013;28(2):135–67. doi:10.1214/12-STS410. arXiv:1306.2427v1.
 24. Bates D, Mächler M, Bolker BM, Walker SC. Fitting linear mixed-effects models using lme4. 2014. arXiv e-print. http://arxiv.org/abs/1406.5823. Accessed 5 Aug 2016.
 25. R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2014. http://www.Rproject.org. Accessed 5 Aug 2016.
 26. Bates D, Maechler M, Bolker B, Walker S. lme4: Linear mixed-effects models using Eigen and S4. 2014. R package version 1.1-7. http://CRAN.Rproject.org/package=lme4. Accessed 5 Aug 2016.
 27. Perkins NJ, Schisterman EF. The Youden index and the optimal cut-point corrected for measurement error. Biometrical J. 2005;47(4):428–41.
 28. Rücker G, Schumacher M. Procalcitonin as a diagnostic marker for sepsis. Lancet Infect Dis. 2013;13:1012–3.
 29. Takwoingi Y, Riley RD, Deeks JJ. Meta-analysis of diagnostic accuracy studies in mental health. Evidence-Based Mental Health. 2015. doi:10.1136/eb-2015-102228.
 30. Kuss O. Statistical methods for meta-analyses including information from studies without any events—add nothing to nothing and succeed nevertheless. Stat Med. 2014. doi:10.1002/sim.6383.
 31. Greven S, Kneib T. On the behaviour of marginal and conditional AIC in linear mixed models. Biometrika. 2010;97(4):773–89.