A general framework for comparative Bayesian meta-analysis of diagnostic studies

Background: Selecting the most effective diagnostic method is essential for patient management and public health interventions. This requires evidence of the relative performance of alternative tests or diagnostic algorithms. Consequently, there is a need for diagnostic test accuracy meta-analyses that allow the comparison of the accuracy of two or more competing tests. Such meta-analyses are, however, complicated by the paucity of studies that directly compare the performance of diagnostic tests. A second complication is that the diagnostic accuracy of the tests is usually determined by comparing the index test results with those of a reference standard. These reference standards are presumed to be perfect, i.e. to classify diseased and non-diseased subjects without error. In practice this assumption is rarely valid, and most reference standards produce some false positive or false negative results. When an imperfect reference standard is used, the estimated accuracy of the tests of interest may be biased, as may the comparisons between these tests.

Methods: We propose a model that allows the comparison of the accuracy of two diagnostic tests using direct (head-to-head) comparisons as well as indirect comparisons through a third test. In addition, the model allows and corrects for imperfect reference tests. The model is inspired by the mixed-treatment comparison meta-analyses developed for the meta-analysis of randomized controlled trials. As the model is estimated using Bayesian methods, it can incorporate prior knowledge on the diagnostic accuracy of the reference tests used.

Results: We show the bias that can result from using inappropriate methods in the meta-analysis of diagnostic tests and how our method provides more correct estimates of the difference in diagnostic accuracy between two tests. As an illustration, we apply this model to a dataset on visceral leishmaniasis diagnostic tests, comparing the accuracy of the RK39 dipstick with that of the direct agglutination test.

Conclusions: Our proposed meta-analytic model can improve the comparison of the diagnostic accuracy of competing tests in a systematic review. This holds, however, only if the studies, and especially the information on the reference tests used, are reported in sufficient detail. More specifically, the type and exact procedures of the reference tests are needed, including any cut-offs used and the number of subjects excluded from full reference test assessment. If this information is lacking, it may be better to limit the meta-analysis to direct comparisons.

Electronic supplementary material: The online version of this article (doi:10.1186/s12874-015-0061-7) contains supplementary material, which is available to authorized users.


3.1: Introduction
To assess the performance of the different models and to uncover possible bias from combining data without proper control for study-specific effects or without adjustment for the use of imperfect reference standards, we performed a simulation study with two scenarios.
In each scenario, the aim is to compare two diagnostic tests T 1 and T 2 with sensitivities S 1 = 90% and S 2 = 85% and specificities C 1 = 85% and C 2 = 90%. Comparisons between the tests are made by estimating the relative accuracy through the difference in S and C (S D12 and C D12 ), the relative S and C (S RR12 and C RR12 ), or the corresponding odds ratios (S OR12 and C OR12 ).
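For concreteness, the three contrast measures can be computed directly from the simulated true sensitivities (a minimal sketch; the sign convention, T 2 relative to T 1 , is inferred from the true contrast values quoted with the tables):

```python
import math

S1, S2 = 0.90, 0.85  # true sensitivities of T1 and T2

s_d12 = S2 - S1                                           # difference: -5.0%
log_s_rr12 = math.log(S2 / S1)                            # log relative risk
log_s_or12 = math.log((S2 / (1 - S2)) / (S1 / (1 - S1)))  # log odds ratio

print(round(s_d12, 3), round(log_s_rr12, 3), round(log_s_or12, 2))
```

The specificity contrasts follow the same arithmetic with C 1 = 85% and C 2 = 90%, yielding the opposite signs.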
The first scenario presents a situation where the same imperfect reference test is used in all studies. The second scenario describes the situation where systematic bias may occur through the use of differing reference tests across studies. The two scenarios are described in detail below.
In each simulation study, we generated 150 sets of 20 diagnostic studies, with each study using two or more of a number of possible diagnostic tests. Each simulated diagnostic study has a moderate sample size of 200 subjects and a disease prevalence of 50%. We analyzed each simulated data set using the models described in the manuscript with the logit as the link function g(.). We present the parameter estimates and standard errors and graphically depict the bias in Ŝ D12 , Ĉ D12 , Ŝ RR12 , Ĉ RR12 , Ŝ OR12 , and Ĉ OR12 . We also evaluated the models using coverage probabilities (the proportion of replications in which the credible interval contained the true value) and power (the proportion of replications in which the true difference in S and C between the two tests of interest was detected).
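The coverage and power definitions above can be tallied across replications as in this sketch (an illustrative helper with toy intervals, not the original analysis code):

```python
def coverage_and_power(intervals, true_value):
    """Coverage: fraction of replications whose credible interval contains
    the true value.  Power: fraction of replications whose interval
    excludes 0, i.e. a difference between the two tests is detected."""
    n = len(intervals)
    coverage = sum(lo <= true_value <= hi for lo, hi in intervals) / n
    power = sum(lo > 0 or hi < 0 for lo, hi in intervals) / n
    return coverage, power

# toy example: three replications' 95% credible intervals for S_D12
cis = [(-0.09, -0.01), (-0.12, 0.02), (-0.01, 0.06)]
cov, pwr = coverage_and_power(cis, -0.05)
```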
All models were estimated with MCMC methods (Gibbs sampling) using OpenBUGS 3.0.3, called from within R 3.0.1 through the BRugs library. For the simulation study, convergence was checked using the Gelman-Rubin diagnostic statistic R̂, extending the simulations until R̂ ≤ 1.05.
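The convergence check can be sketched as follows; the paper relied on OpenBUGS's built-in diagnostic, so this reimplementation of the basic (non-split) potential scale reduction factor is purely illustrative:

```python
import numpy as np

def gelman_rubin(chains):
    """Basic Gelman-Rubin R-hat for a set of equal-length MCMC chains
    of a scalar parameter, shape (m chains, n draws)."""
    chains = np.asarray(chains)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)        # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()  # within-chain variance
    var_hat = (n - 1) / n * W + B / n      # pooled variance estimate
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(0)
chains = rng.normal(size=(3, 1000))        # three well-mixed toy chains
print(gelman_rubin(chains))                # close to 1 for mixed chains
```

A value at or below 1.05, the threshold used in the paper, indicates the chains have mixed well for that parameter.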

3.2: Scenario 1
3.2.1: Setup
In scenario 1, we simulated a setting without systematic bias but where an imperfect reference test is used to assess the diagnostic accuracy of the index tests of interest. Across the 20 diagnostic studies in the simulated meta-analysis there are 3 index tests. T 1 and T 2 are the index tests of interest as described above. T 3 is a third, less accurate index test, with S 3 = 80% and C 3 = 80%. The reference test, T 4 , is a reference standard similar to a parasitological technique or culture in infectious diseases, with high specificity (C 4 = 95%) but lower sensitivity (S 4 = 80%). We allowed the study-specific S ij and C ij of all tests to vary across studies using (0, .5) normal distributions on the logit scale and induced a correlation (ρ = 0.5) between S i2 and S i4 .
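The random-effects mechanism above might be sketched as follows, assuming the (0, .5) notation denotes mean 0 and standard deviation 0.5 on the logit scale (the helper names are illustrative, not from the authors' code):

```python
import numpy as np

def logit(p):
    return np.log(p / (1.0 - p))

def inv_logit(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(42)
n_studies, sd, rho = 20, 0.5, 0.5
S2_mean, S4_mean = 0.85, 0.80  # true sensitivities of T2 and T4

# correlated logit-scale random effects for S_i2 and S_i4
cov = sd**2 * np.array([[1.0, rho], [rho, 1.0]])
re = rng.multivariate_normal([0.0, 0.0], cov, size=n_studies)

# study-specific sensitivities back on the probability scale
S_i2 = inv_logit(logit(S2_mean) + re[:, 0])
S_i4 = inv_logit(logit(S4_mean) + re[:, 1])
```

The remaining study-specific sensitivities and specificities would be drawn the same way but with independent effects.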
Each simulated diagnostic study uses two or more of the 4 possible tests. The availability of the tests is shown in Table 1. In 10 of the 20 studies (studies 1 to 10) a direct comparison between T 1 and T 2 was possible. In five studies (11, 12, 16, 17, and 18) an indirect comparison through the third index test T 3 was possible. The remaining studies only offered information on the diagnostic accuracy of either T 1 or T 2 through comparison with the reference test results.

3.2.2: Analysis
We applied the 5 models described in the methods section and listed in Table 2 of the manuscript to each simulated meta-analysis. Models 1 to 3, which rely on the assumption that a perfect reference test is used, were estimated both using the true disease status as reference test and using the imperfect reference standard T 4 . For each analysis, we included only those studies that were informative about the contrast of interest, T 1 versus T 2 : only the direct comparisons (10 of the 20 simulated studies) for model 2; the studies including at least two of T 1 , T 2 , and T 3 for model 3 (15 of the 20 simulated studies); and all simulated studies for models 1, 4, and 5. In all models, we ignored the correlation between T 2 and the reference test T 4 .
Uninformative priors were used for the hyperparameters related to the index tests (T 1 , T 2 , and T 3 ). In practice, some information on the diagnostic accuracy of the reference test is usually available; we therefore used informative priors for S 4 and C 4 that were consistent with the simulated values. Specifically, normal priors were used indicating with 95% certainty that the average S 4 lies in the interval [70%, 95%] and the average C 4 in the interval [90%, 99%].

Table 2 shows the estimates and standard errors of the parameters of interest; Table 3 shows the coverage probabilities and the observed power to detect the difference of 5% in S and C between T 1 and T 2 . Figure 1 presents the bias in Ŝ D12 and Ĉ D12 (Figure 1.a), Ŝ RR12 and Ĉ RR12 (Figure 1.b), and Ŝ OR12 and Ĉ OR12 (Figure 1.c).
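Informative priors of this kind can be calibrated by matching a normal distribution on the logit scale to the stated 95% interval (a sketch assuming the priors are placed on the logit scale, consistent with the logit link used elsewhere in the model; the helper is hypothetical):

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def normal_prior_from_interval(lo, hi, z=1.96):
    """Mean and sd of a logit-scale normal whose central 95% interval
    maps back to [lo, hi] on the probability scale."""
    mu = 0.5 * (logit(lo) + logit(hi))
    sigma = (logit(hi) - logit(lo)) / (2 * z)
    return mu, sigma

mu_S4, sd_S4 = normal_prior_from_interval(0.70, 0.95)  # prior for S4
mu_C4, sd_C4 = normal_prior_from_interval(0.90, 0.99)  # prior for C4
```

By construction, inverse-logit of mu ± 1.96·sigma recovers the stated interval endpoints on the probability scale.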

3.2.3: Results
When a true gold standard reference test was available, models 1 to 3 provided unbiased estimates of the S j and C j . The contrasts in S j and C j expressed as an OR, difference in proportions, or RR were also estimated without bias (Table 2, 'Using Gold Standard' columns). Coverage probabilities were close to 95% (Table 3, 'Using Gold Standard' columns). Model 1, which takes into account all studies, had the smallest standard errors and consequently the highest power to detect a difference in S and C between T 1 and T 2 .
When T 4 was taken as the reference standard, the S and especially the C of both index tests were underestimated (Table 2, 'Using Imperfect Reference Standard' columns), with very low coverage probabilities (Table 3, 'Using Imperfect Reference Standard' columns). The coverage probabilities for the contrasts between T 1 and T 2 were, however, better, especially for S D12 , C D12 (Figure 1.a, models 1-3), S RR12 , and C RR12 (Figure 1.b, models 1-3). Substantial bias was observed only when estimating C OR12 (Figure 1.c, models 1-3), the contrast in specificities between T 1 and T 2 expressed as an odds ratio, with coverage probabilities below 80% (Table 3, 'Using Imperfect Reference Standard' columns).
Allowing for an imperfect reference test resulted in generally unbiased estimates of S and C and of the contrasts between T 1 and T 2 (Table 2 and Figure 1: models 4 and 5). Coverage probabilities were close to 95% for the contrasts between T 1 and T 2 (Table 3: models 4 and 5). The latent class approach appeared to have removed the bias in estimating C OR12 that was induced by using the imperfect reference standard T 4 as a gold standard.

3.3: Scenario 2
3.3.1: Setup
In scenario 2, we simulated the situation of two index tests that are assessed in primary studies that tend to use different reference standards. In this case, T 3 is a highly specific but less sensitive reference standard with S 3 = 80% and C 3 = 95%, and T 4 is a highly sensitive but less specific reference standard with S 4 = 95% and C 4 = 80%. We again allowed the S ij and C ij of all tests to vary across studies using (0, .5) normal distributions on the logit scale. We created the possibility of systematic bias by preferentially using T 3 as reference test when index test T 1 was assessed and T 4 as reference test when index test T 2 was assessed. In 5 of the 10 studies that allowed direct comparisons, T 3 was the reference standard; in the other 5 studies, T 4 was the reference standard. When only T 1 was assessed, and not T 2 , T 3 was the reference standard; when only T 2 was assessed, and not T 1 , T 4 was the reference standard (Table 4).
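The confounded design described above can be written out explicitly (an illustrative sketch of one assignment consistent with Table 4, not the authors' simulation code):

```python
# Scenario 2 design: the reference test is confounded with the index test
# assessed (T3 preferred when T1 is studied, T4 when T2 is studied).
design = {}
for study in range(1, 21):
    if study <= 5:
        design[study] = {"index": ["T1", "T2"], "reference": "T3"}
    elif study <= 10:
        design[study] = {"index": ["T1", "T2"], "reference": "T4"}
    elif study <= 15:
        design[study] = {"index": ["T1"], "reference": "T3"}
    else:
        design[study] = {"index": ["T2"], "reference": "T4"}

# every study assessing only T1 uses the highly specific reference T3
only_t1 = [s for s, d in design.items() if d["index"] == ["T1"]]
```

This is precisely the structure that makes the variation in reference standard a confounder for the comparison of T 1 and T 2 .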

3.3.2: Analysis
As the aim of this scenario was to assess the effects of differing imperfect reference standards, we did not analyze these data for the case in which a true gold standard was available. Models 4.i and 5.i were fitted allowing the reference tests T 3 and T 4 to differ in diagnostic accuracy; models 4.ii and 5.ii were fitted under the assumption that T 3 and T 4 were equal. Models 4.i and 5.i thus correspond to the situation where the researchers knew of the differences in reference standard used across studies, so that the variation in reference standard was a known source of bias. Models 4.ii and 5.ii correspond to the situation where the researchers were unaware of the differences in reference standards among studies, so that the variation in reference standard was an unknown source of bias.
Uninformative priors were used for the hyperparameters related to the index tests (T 1 and T 2 ) and informative priors for the hyperparameters related to the reference tests (T 3 and T 4 ). Table 5 shows the estimates and standard errors of the parameters of interest; Table 6 shows the coverage probabilities and the observed power to detect the difference of 5% in S and C between T 1 and T 2 . Figure 2 presents the bias in Ŝ D12 and Ĉ D12 (Figure 2.a), Ŝ RR12 and Ĉ RR12 (Figure 2.b), and Ŝ OR12 and Ĉ OR12 (Figure 2.c).

3.3.3: Results
When T 3 and T 4 were assumed to be perfect reference standards and the analysis was limited to direct comparisons between index tests T 1 and T 2 (model 2), we obtained results similar to scenario 1. The differences in accuracy between T 1 and T 2 in terms of S OR12 and C OR12 were underestimated (Table 5 and Figure 2.c: model 2), while the estimates of S D12 , C D12 , S RR12 , and C RR12 were less biased (Table 5 and Figure 2.a-2.b: model 2).
Estimates from model 1, i.e. independently estimating the diagnostic accuracy of T 1 and T 2 , resulted in further underestimation of C 1 (Table 5: Ĉ 1 = 73.6% in model 1 vs 75.4% in model 2), as T 1 tended to be assessed in studies using the less sensitive reference test T 3 , resulting in more apparent false positives for T 1 . Similarly, in this analysis S 2 was more strongly underestimated (Table 5: Ŝ 2 = 73.6% in model 1 vs 75.4% in model 2), as T 2 tended to be assessed in studies using the less specific reference test T 4 , resulting in more apparent false negatives for T 2 . As T 1 was the more sensitive test in the simulation and T 2 the more specific, this resulted in an overestimation of the differences in diagnostic accuracy between T 1 and T 2 . This is apparent for S D12 and C D12 (Figure 2.a, model 1) and for S RR12 and C RR12 (Figure 2.b, model 1). The bias is not apparent for S OR12 and C OR12 (Figure 2.c, model 1), likely because the two biases described above counteract each other.
When correcting for the use of imperfect reference tests through latent class analysis (models 4.i and 5.i), unbiased estimates of the differences in diagnostic accuracy between T 1 and T 2 were obtained (Table 5 and Figure 2). When the data were analyzed ignoring the differences between the reference tests, however, the differences in diagnostic accuracy between T 1 and T 2 were again overestimated (Table 5 and Figure 2: models 4.ii and 5.ii).

3.4: Conclusions
This simulation study indicated that the proposed models, and especially models 4 and 5, can provide unbiased estimates of the relative accuracy of two tests while allowing for imperfect reference tests in a meta-analysis and correcting for bias due to confounding induced by differences in reference standard. Ignoring some aspects of the data-generating mechanism, for example the correlation between T 2 and the reference test, did not lead to noticeable bias. Ignoring differences among the reference tests did, however, lead to important bias. Interestingly, when estimating the difference or relative risk in S and C between two tests, incorrectly assuming that the reference test was perfect did not necessarily invalidate the meta-analysis results, especially when limiting the analysis to direct comparisons only.

Table 1 Design of the simulation study - scenario 1. Availability of each index test (T 1 , T 2 , T 3 ) and of the reference test T 4 is indicated by X. 150 simulated datasets of 20 diagnostic studies were generated.

Study nr.   T1   T2   T3   T4
1 to 5      X    X    X    X
6 to 10     X    X         X
11 to 12    X         X    X
13 to 15    X              X
16 to 18         X    X    X
19 to 20         X         X

Table 2 Parameter estimates and standard errors from the proposed meta-analytical models applied in the simulation study - scenario 1. Models 1 to 3 are applied both using the true disease status ("Gold Standard" columns) and disease status estimated from the results of T 4 ("Imperfect Reference Standard" columns). Models 4 and 5 allow for the use of an imperfect reference standard through latent class analysis. The simulated values are S1 = 90%, S2 = 85%, S3 = 80%, S4 = 80%, C1 = 85%, C2 = 90%, C3 = 80%, C4 = 95%; the corresponding true contrasts are −5.0% for S D12 , 5.0% for C D12 , −0.057 for log(S RR12 ), 0.057 for log(C RR12 ), −0.46 for log(S OR12 ), and 0.46 for log(C OR12 ).

[Table 2 body not reproduced; its columns are grouped under "Using Gold Standard" and "Using Imperfect Reference Standard".]

Table 3 Power and coverages of the 95% credible intervals for the S and C estimates from the proposed meta-analytical models applied in the simulation study - scenario 1. Models 1 to 3 are applied both using the true disease status ("Gold Standard" columns) and disease status estimated from the results of T 4 ("Imperfect Reference Standard" columns). [Table body not reproduced.]
Table 4 Design of the simulation study - scenario 2. Availability of each index test (T 1 , T 2 ) and of the reference tests (T 3 , T 4 ) is indicated by X.

Study nr.   T1   T2   T3   T4
1 to 5      X    X    X
6 to 10     X    X         X
11 to 15    X         X
16 to 20         X         X

Table 5 Parameter estimates and standard errors from the proposed meta-analytical models applied in the simulation study - scenario 2. For models 1 and 2, disease status was estimated from the results of the reference test (T 3 or T 4 ) ("Imperfect Reference Standard" columns). Models 4 and 5 are applied both assuming it is known that the reference tests differ across studies (model 4.i and model 5.i) and ignoring the difference in reference tests (model 4.ii and model 5.ii). Model 3 is not applied as there is no third index test in the simulation. The simulated values are S1 = 90%, S2 = 85%, S3 = 80%, S4 = 95%, C1 = 85%, C2 = 90%, C3 = 95%, C4 = 80%; the corresponding true contrasts are −5.0% for S D12 , 5.0% for C D12 , −0.057 for log(S RR12 ), 0.057 for log(C RR12 ), −0.46 for log(S OR12 ), and 0.46 for log(C OR12 ).

Table 6 Power and coverages of the 95% credible intervals for the S and C estimates from the proposed meta-analytical models applied in the simulation study - scenario 2. For models 1 and 2, disease status was estimated from the results of the reference test (T 3 or T 4 ) ("Imperfect Reference Standard" columns). Models 4 and 5 are applied both assuming it is known that the reference tests differ across studies (model 4.i and model 5.i) and ignoring the difference in reference tests (model 4.ii and model 5.ii). Model 3 is not applied as there is no third index test in the simulation.

Figure 1 Bias in estimates of the contrasts in diagnostic accuracy from the proposed meta-analytical models applied in the simulation study - scenario 1. For models 1 to 3, disease status was estimated from the results of T 4 . The plots present the bias in Ŝ D12 and Ĉ D12 (first row), Ŝ RR12 and Ĉ RR12 (second row), and Ŝ OR12 and Ĉ OR12 (third row).

Figure 2 Bias in estimates of the contrasts in diagnostic accuracy from the proposed meta-analytical models applied in the simulation study - scenario 2. Models 4 and 5 were applied both assuming it is known that the reference tests differ across studies (4.i and 5.i) and ignoring the difference in reference tests (4.ii and 5.ii). The plots present the bias in Ŝ D12 and Ĉ D12 (first row), Ŝ RR12 and Ĉ RR12 (second row), and Ŝ OR12 and Ĉ OR12 (third row).