 Software
 Open Access
MetaBayesDTA: codeless Bayesian meta-analysis of test accuracy, with or without a gold standard
BMC Medical Research Methodology volume 23, Article number: 127 (2023)
Abstract
Background
The statistical models developed for meta-analysis of diagnostic test accuracy studies require specialised knowledge to implement. This is especially true since recent guidelines, such as those in Version 2 of the Cochrane Handbook of Systematic Reviews of Diagnostic Test Accuracy, advocate more sophisticated methods than previously. This paper describes a web-based application, MetaBayesDTA, that makes many advanced analysis methods in this area more accessible.
Results
We created the app using R, the Shiny package and Stan. It allows for a broad array of analyses based on the bivariate model, including extensions for subgroup analysis, meta-regression and comparative test accuracy evaluation. It also conducts analyses that do not assume a perfect reference standard, including allowing for the use of different reference tests.
Conclusions
Due to its user-friendliness and broad array of features, MetaBayesDTA should appeal to researchers with varying levels of expertise. We anticipate that the application will encourage higher uptake of more advanced methods, which ultimately should improve the quality of test accuracy reviews.
Background
Background to meta-analysis of test accuracy
In medicine, tests are used to screen, monitor and diagnose medical conditions, and therefore it is imperative that these tests produce accurate results. This ‘accuracy’ refers to their sensitivity and specificity. The former is the probability that a test can correctly identify patients who have the disease and the latter is the probability that the test can correctly identify patients who do not have the disease. To evaluate their accuracy, studies and analyses are carried out to compare the results of the test under evaluation (called the ‘index’ test) against some existing test, which is assumed to be perfect (called the ‘reference’ or ‘gold standard’ test). Index tests typically have lower accuracy than the gold standard; however, they are often quicker, cheaper and/or less invasive.
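As a minimal illustration of these two definitions, here is a short Python sketch computing sensitivity and specificity from a single study's two-by-two table; all counts are hypothetical, chosen purely for illustration:

```python
# Sensitivity and specificity from a single study's 2x2 table.
# The TP/FP/FN/TN counts below are hypothetical, purely for illustration.

def sensitivity(tp, fn):
    # P(test positive | diseased)
    return tp / (tp + fn)

def specificity(tn, fp):
    # P(test negative | not diseased)
    return tn / (tn + fp)

tp, fp, fn, tn = 90, 30, 10, 70
se = sensitivity(tp, fn)  # 0.9: the test identifies 90 of the 100 diseased patients
sp = specificity(tn, fp)  # 0.7: the test clears 70 of the 100 non-diseased patients
```

Meta-analysis pools such study-level pairs across studies, rather than computing them from a single table.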
Standard methods for the meta-analysis of test accuracy assume that the gold standard test is perfect; that is, 100% sensitive and specific. These models dichotomize the data into diseased and non-diseased according to the results of the reference test, and include the bivariate model of Reitsma et al. [1] and the hierarchical summary receiver operating characteristic (HSROC) model of Rutter & Gatsonis [2]. These models have been shown to be equivalent in practice when no covariates are included [3]. Models which do not assume a perfect gold standard have also been developed [4,5,6]. These models, often referred to as latent class models (LCMs), assume that each test measures the same latent disease status, with each individual assumed to belong to either the diseased or non-diseased class. These methods can also model the correlation between tests within each disease class (i.e. the conditional dependence between tests). All of the aforementioned methods take into account the correlation between sensitivity and specificity across studies.
Why is this application needed?
The models discussed in the previous section require statistical programming expertise using software such as R or Stata. Cochrane is an organisation that helps support evidence-based decisions about health interventions, such as diagnostic and screening tests. Whilst Cochrane provides the free software RevMan [7], which uses the Moses-Littenberg method [8], this method fails to appropriately account for random effects and the correlation between sensitivity and specificity across studies. Carrying out meta-analysis of test accuracy using online applications has a lower user burden since no programming is needed. Not only does this make such methods accessible to a broader array of people, it also streamlines the workflow for more experienced data analysts.
Other web applications for the meta-analysis of test accuracy include MetaDTA [9,10,11] and BayesDTA [12]. The former uses frequentist methods and implements the bivariate model [1], allowing for risk of bias and quality assessment data to be incorporated into the results plots. The latter uses Bayesian methods and incorporates both the bivariate model [1] and the LCM [4, 5]. Similarly to BayesDTA, our application, MetaBayesDTA [13], runs Bayesian versions of both the bivariate model [1] and the LCM [4, 5], and is powered by Stan [14], a Bayesian model fitting software. However, unlike BayesDTA, our application can also conduct subgroup analysis and meta-regression for the bivariate model, and can be used to conduct a comparative meta-analysis of test accuracy for two or more tests using categorical meta-regression (assuming the same variances between tests), using methods recommended in chapter 11 of version 2 of the Cochrane handbook for systematic reviews of diagnostic test accuracy [15]. Furthermore, for the LCM, rather than assuming all studies use the same reference test, it can model multiple reference tests. It also allows users to compare the fit of different LCMs. A full comparison between MetaBayesDTA, MetaDTA and BayesDTA is shown in Table 1.
Implementation
Aims
Our objective was to make a web application which would be accessible to a wide variety of researchers and enable them to conduct a robust Bayesian statistical analysis for meta-analysis of test accuracy, including subgroup analysis, meta-regression, comparative test accuracy, and meta-analysis of test accuracy without assuming a perfect reference test, all without requiring any experience in R [16] or Stan [14]. It is also aimed at researchers who can use R and/or Stan (e.g. some data analysts, statisticians and clinical researchers) but would still want to use a web application for efficiency.
Software
We used the statistical programming language R [16] to create our web application, using a variety of packages. One such package is Shiny [17], which enables R users to create web applications without needing knowledge of web development languages such as HTML and JavaScript. Another is rstan [18], which enables users to fit Bayesian statistical models in R using Stan [14], and is what we used to fit both the bivariate and LCM models in the application. A new user interface format was developed using the R packages shinydashboard [19] and shinyWidgets [20]. This allows the app to have a clean layout, with many of the menus hidden unless the user chooses to display them.
Results
In this section, we demonstrate the application using a motivating example dataset containing a total of 13 studies from a Cochrane meta-analysis [21], which assessed the accuracy of the Informant Questionnaire on Cognitive Decline in the Elderly (IQCODE), a screening test used to detect adults who may have clinical dementia within secondary care settings.
Data
The ‘Data’ tab (see Fig. 1) allows users to upload their data. The number of columns the datasets must have will vary depending on whether quality assessment data and/or covariate data is included. Datasets involving no quality assessment or covariate data will have six columns, those involving quality assessment data thirteen, those involving covariates at least seven, and those involving both quality assessment and covariate data at least fourteen. The quality assessment data which can be included is from quality assessment carried out using the QUADAS-2 (QUality Assessment of Diagnostic Accuracy Studies, version 2) tool [22]. This tool has four domains: (i) patient selection, (ii) index test, (iii) reference standard and (iv) flow of patients through the study and timing of the index test(s) and reference standard.
The ‘File Upload’ subtab is preloaded with an example dementia dataset from the Cochrane meta-analysis [21], which is described in more detail in the ‘Example datasets’ subtab in the application. The ‘Data for Analysis’ subtab shows the dataset currently being used.
We will use this dataset to demonstrate the application throughout the remainder of this section. To analyse the data, the Cochrane meta-analysis [21] used the bivariate model and found pooled summary estimates of 0.91 (95% CI [confidence interval] = (0.86, 0.94)) and 0.66 (95% CI = (0.56, 0.75)) for the sensitivity and specificity, respectively.
Perfect gold standard
The ‘Perfect gold standard’ page consists of three tabs: meta-analysis, meta-regression and subgroup analysis. All three tabs use the bivariate model proposed by Reitsma et al. [1], employing the variation with binomial likelihoods proposed by Chu and Cole [23].
Meta-analysis
The ‘Meta-analysis’ subtab is split into two halves. The left half consists of the following tabs: ‘priors’, ‘run model’, ‘study-level outcomes’, ‘parameter estimates’, ‘parameters for RevMan’, and ‘model diagnostics’. The right half has the tabs ‘sROC [summary Receiver Operating Characteristic] plot’, ‘Forest Plots’ and ‘Prevalence’.
Since all of the models in the app are Bayesian, prior distributions need to be specified. The ‘priors’ subtab (see Fig. 2) is where users specify prior distributions. The priors can be changed if some prior information is available, and they can be specified in terms of the logit-transformed (“logit”) sensitivity and specificity, or directly on the probability scale. The default prior distributions are weakly informative. More specifically, for the pooled logit sensitivity and logit specificity, we used a normal distribution with mean zero and SD of 1.5 (N(0, 1.5)), which is equivalent to a 95% prior interval (that is, the interval formed by the 2.5% and 97.5% quantiles of the prior distribution) of (0.05, 0.95) on the probability scale. For the between-study SDs (standard deviations) we used a truncated (at zero) normal with zero mean and unit SD (\(N_{ \ge 0 }(0, 1)\)). This prior allows for a very large amount of between-study heterogeneity if the data demands; for example, if the pooled sensitivity is found to be 0.80, then this prior assumes that the study-specific sensitivities will be in the range (0.069, 0.996) with 95% probability. Finally, for the between-study correlation we used an LKJ (Lewandowski-Kurowicka-Joe) [24] prior with shape parameter of 2 (LKJ(2)), which gives a 95% prior interval of \((-0.8, 0.8)\). In general, we suggest leaving all of these prior distributions at the defaults. However, if it is known that the sensitivity or specificity for the test under evaluation may be very high (e.g. \(> 95\%\)), then a prior which places more prior probability on these values would be more appropriate than the default N(0, 1.5) prior; for instance, a prior of N(3, 1.5), which is equivalent to a 95% prior interval of (0.500, 0.998) on the probability scale.
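The mapping between a logit-scale normal prior and its probability-scale 95% interval can be checked with a few lines of code; this is a standard-library sketch for readers' own checking, not the app's internal code:

```python
import math
from statistics import NormalDist

def expit(x):
    # inverse logit: maps the logit scale back to the probability scale
    return 1.0 / (1.0 + math.exp(-x))

def prior_interval_prob_scale(mean, sd, level=0.95):
    # Quantiles of a N(mean, sd) prior on the logit scale,
    # mapped back to the probability scale.
    alpha = (1 - level) / 2
    dist = NormalDist(mean, sd)
    return expit(dist.inv_cdf(alpha)), expit(dist.inv_cdf(1 - alpha))

print(prior_interval_prob_scale(0, 1.5))  # ≈ (0.05, 0.95), the default prior
print(prior_interval_prob_scale(3, 1.5))  # roughly (0.5, 0.997), for high-accuracy tests
```

The same helper can be used to sanity-check any custom prior before entering it in the ‘priors’ subtab.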
Users can examine the specified prior distributions by clicking on the ‘Click to run prior model’ button; the prior medians and 95% prior intervals are then shown in a table (see bottom of Fig. 2). Plots of the prior distributions are also displayed (below the table; not shown in Fig. 2).
Users can run the model by clicking on the ‘Click to run model’ button within the ‘Run model’ subtab. In this subtab, users can also run a sensitivity analysis; more specifically, any number of studies can be excluded from the analysis to assess the influence of particular studies on the overall pooled estimates.
The ‘study-level outcomes’ subtab displays key study information that is also displayed in the ‘Data’ tab, as well as the sensitivity and specificity in each study and the study weights, that is, the amount that each study contributes to the overall sensitivity and specificity estimates, calculated using the method from Burke et al. [25]. The ‘parameter estimates’ subtab (see Fig. 3) consists of a table with the posterior medians and 95% posterior intervals (otherwise known as credible intervals [CrIs]) for key summary parameters, including logit sensitivities and specificities, diagnostic odds ratio and likelihood ratios, between-study correlation and standard deviations, and HSROC parameters. The HSROC parameters are estimated from the bivariate model parameters using the relations shown in Harbord et al. [3].
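The diagnostic odds ratio and likelihood ratios reported in this subtab are deterministic functions of sensitivity and specificity; the app summarises them from posterior draws, while the sketch below simply shows the point-estimate formulas:

```python
def likelihood_ratios(se, sp):
    # LR+ = P(T+ | diseased) / P(T+ | not diseased)
    # LR- = P(T- | diseased) / P(T- | not diseased)
    return se / (1 - sp), (1 - se) / sp

def diagnostic_odds_ratio(se, sp):
    # DOR: odds of a positive test in the diseased over the non-diseased,
    # equivalently LR+ / LR-
    return (se / (1 - se)) * (sp / (1 - sp))

# Using the pooled IQCODE estimates quoted in the text (se = 0.91, sp = 0.66):
lr_pos, lr_neg = likelihood_ratios(0.91, 0.66)  # ≈ 2.68 and 0.14
dor = diagnostic_odds_ratio(0.91, 0.66)         # ≈ 19.6
```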
The ‘parameters for RevMan’ subtab consists of the parameter estimates (posterior medians) needed by Cochrane’s RevMan software to build ROC plots, for people who want to include the analysis results in a Cochrane review. The ‘Model diagnostics’ subtab contains important diagnostics that users must check to assess whether the model is valid. These include the Stan sampler diagnostics [14, 26] (divergent transitions and iterations which have exceeded the maximum treedepth; both counts should be 0), split Rhat statistics (which should be less than 1.05), and posterior density and trace plots [14].
The sROC plot is displayed in the ‘sROC plot’ subtab (see Fig. 4). This plot displays the summary estimates, 95% credible and prediction regions, and study-specific sensitivities and specificities. The plot has a range of customization options; for instance, it allows users to change the size of the summary estimates and study-specific points, and to display the sROC curve, disease prevalence and percentage study weights of each study. It is also interactive: users can click on the study-level points and study-level information will appear over the plot. This is demonstrated in Fig. 4, where the bottom-left point, corresponding to the Jorm et al. [27] study, has been clicked on. This plot, as well as the other plots produced by the application, can be downloaded. Risk of bias and quality assessment information, if available in the dataset, can also be displayed on the plot (see supplementary material Fig. 1).
The ‘forest plots’ subtab contains the forest plots, which show the sensitivity and specificity in each study as well as the corresponding 95% confidence intervals. The ‘prevalence’ subtab contains a tree diagram which puts the summary estimates into context: it shows how many patients would test positive and negative for a given disease prevalence, and then, out of those who test positive and negative, how many are diseased and non-diseased. There is also another tree diagram option, which first splits the population by disease status and then by test result.
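The first tree-diagram option can be mimicked with a short calculation; the cohort size and prevalence below are illustrative choices, combined with the pooled IQCODE estimates quoted earlier:

```python
def tree_counts(n, prevalence, se, sp):
    # Expected counts when a cohort of n patients is cross-classified
    # by disease status and test result.
    diseased = n * prevalence
    non_diseased = n - diseased
    tp = diseased * se        # diseased and test positive
    fn = diseased - tp        # diseased but test negative (missed cases)
    tn = non_diseased * sp    # non-diseased and test negative
    fp = non_diseased - tn    # non-diseased but test positive (false alarms)
    return {"positive": {"diseased": tp, "non_diseased": fp},
            "negative": {"diseased": fn, "non_diseased": tn}}

# 1000 patients, 25% prevalence, pooled IQCODE estimates (se = 0.91, sp = 0.66):
counts = tree_counts(1000, 0.25, 0.91, 0.66)
# About 227.5 + 255 = 482.5 patients are expected to test positive,
# of whom only 227.5 are actually diseased.
```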
We analysed the IQCODE dementia dataset discussed previously using our application, using the Bayesian bivariate model assuming a perfect gold standard. We used the default prior distributions (see Fig. 2) and obtained virtually the same results as the frequentist analysis conducted in the original study: sensitivity and specificity estimates of 0.91 (95% credible interval [CrI] = (0.85, 0.95)) and 0.66 (95% CrI = (0.55, 0.77)), respectively (see Fig. 3). An sROC plot showing the results is shown in Fig. 4.
Meta-regression
The ‘Meta-regression’ tab is where users can run the bivariate model including a categorical or continuous covariate, in an attempt to explain any between-study heterogeneity, and consists of subtabs similar to those of the ‘Meta-analysis’ tab. The ‘Run model’, ‘study-level outcomes’, ‘Model Diagnostics’ and ‘sROC plot’ subtabs are the same as those in the ‘Meta-analysis’ tab.
Rather than a ‘priors’ subtab, it has a ‘Model set up & priors’ tab, since users also need to select the covariate to use. Furthermore, if using a continuous covariate, users need to specify the value to use for centering (the default is the mean of the observed covariate values) and the value of the covariate at which to calculate the summary accuracy estimates. For the default priors, for continuous meta-regression we used N(0, 1.5) priors for the pooled logit sensitivity and specificity intercepts and N(0, 1) priors for the pooled logit sensitivity and specificity coefficients. For categorical meta-regression, we used N(0, 1.5) priors for the pooled logit sensitivities and specificities at each level of the covariate. For both continuous and categorical meta-regression, similarly to the model with no covariates, we used \(N_{ \ge 0 }(0, 1)\) and LKJ(2) priors for the between-study SDs and correlations, respectively. In general, we suggest leaving these priors at the default values in most cases. However, sometimes it will make sense to change them. For example, as mentioned in the “Meta-analysis” section, for categorical meta-regression, if it is known that the sensitivity or specificity for the test under evaluation may be very high \(( > 95\% )\), then a prior which places more prior probability on these values would be more appropriate than the default N(0, 1.5) prior. For continuous meta-regression, the default N(0, 1) prior for the coefficient terms will generally allow the coefficient to have a large influence if the data allows. However, for covariates which are on a small scale, such as disease prevalence, it might make more sense to try priors which are less informative than the default.
For example, if the covariate is (centered) disease prevalence, the mean value of the disease prevalence is 0.10, and the sensitivity at this value is found to be 0.80, then this prior assumes that the sensitivity at a prevalence 0.10 higher (i.e. a prevalence of 0.20) lies in the interval (0.77, 0.83) with 95% probability, whereas a prior of N(0, 5) assumes an interval of (0.60, 0.92) with 95% probability. In this case, the latter would be more appropriate than the default prior if disease prevalence is thought to be (or cannot be ruled out to be) strongly positively or negatively associated with test accuracy.
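The interval arithmetic in this example can be reproduced directly; the sketch below uses only the Python standard library, with the baseline sensitivity, coefficient SDs and covariate shift taken from the worked numbers above:

```python
import math
from statistics import NormalDist

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    return math.log(p / (1 - p))

def implied_sens_interval(baseline_se, coef_sd, delta, level=0.95):
    # Prior interval for the sensitivity after a covariate shift of `delta`,
    # when the logit-scale slope has a N(0, coef_sd) prior.
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    base = logit(baseline_se)
    shift = z * coef_sd * delta
    return expit(base - shift), expit(base + shift)

print(implied_sens_interval(0.80, 1, 0.10))  # ≈ (0.77, 0.83) under the default N(0, 1)
print(implied_sens_interval(0.80, 5, 0.10))  # ≈ (0.60, 0.91) under the wider N(0, 5)
```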
The ‘parameter estimates’ subtab contents vary depending on whether continuous or categorical meta-regression is being carried out. For continuous meta-regression, there is one table showing the parameters which do not vary, regardless of the covariate value the user chooses for calculating the summary estimates, and another table containing the parameters that do vary. For categorical meta-regression (see Fig. 5), there is one table containing the parameters shared between studies, such as the between-study correlation and standard deviations, and another table showing the group-specific parameters, such as the sensitivity and specificity at each level (i.e. group) of the categorical covariate. Furthermore, there is also a table which displays the pairwise differences and ratios between the pooled sensitivity and specificity estimates (see Fig. 6).
The ‘accuracy vs covariate’ subtab contains a plot which displays the summary sensitivity and specificity posterior medians and 95% credible intervals against the selected covariate. For categorical meta-regression, there is a posterior median and 95% credible interval for each category of the covariate, whereas for continuous meta-regression there is a smooth line corresponding to the posterior median, with 95% credible interval bands, as the covariate spans its observed range.
We conducted a categorical meta-regression using the type of IQCODE test (either the 16-, 26- or 32-item version) as the covariate. The results for the 16-item and 26-item groups were very similar (see Fig. 5): for the 16-item group we obtained sensitivity and specificity estimates of 0.91 (95% CrI = (0.82, 0.96)) and 0.64 (95% CrI = (0.50, 0.77)), and for the 26-item group we obtained 0.89 (95% CrI = (0.72, 0.96)) and 0.65 (95% CrI = (0.45, 0.82)). For the 32-item group, we obtained a similar sensitivity of 0.92 (95% CrI = (0.58, 0.99)), but a very different specificity of 0.87 (95% CrI = (0.58, 0.97)); however, this was based on only one study. Looking at the pairwise differences (see Fig. 6), we can see that the 95% credible intervals contain 0 for all of the sensitivities and specificities, indicating that none of the differences are significant, even for the comparisons to the 32-item group, despite the posterior medians being relatively large. Similarly, the pairwise ratios all contain 1, implying that none of them are significant. An sROC plot showing the results is shown in Fig. 7.
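In a Bayesian analysis, such pairwise comparisons fall out of the MCMC output: subtract (or divide) the paired draws and read off empirical quantiles. Below is a self-contained sketch in which the posterior draws are simulated, not taken from the app:

```python
import random

def pairwise_difference_ci(draws_a, draws_b, level=0.95):
    # Empirical credible interval for (a - b) from paired posterior draws.
    diffs = sorted(a - b for a, b in zip(draws_a, draws_b))
    n = len(diffs)
    lo_idx = int(n * (1 - level) / 2)
    hi_idx = int(n * (1 + level) / 2) - 1
    return diffs[lo_idx], diffs[hi_idx]

random.seed(1)
# Hypothetical posterior draws for two groups' sensitivities (illustration only):
se_group_a = [random.gauss(0.91, 0.02) for _ in range(4000)]
se_group_b = [random.gauss(0.89, 0.04) for _ in range(4000)]
lo, hi = pairwise_difference_ci(se_group_a, se_group_b)
# Here the interval contains 0, so the difference is not credible at the 95% level.
```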
Subgroup analysis
Our app also allows users to run subgroup analyses for categorical covariate data. This runs a separate bivariate meta-analysis for each subgroup, obviating the need for users to partition their data and run the analysis multiple times. Such analyses differ from including the subgrouping variable as a categorical covariate and using the meta-regression facility outlined in the previous section, because here separate random effect variances are estimated for each group, whereas in the regression they are assumed to be the same and estimated jointly. The ‘subgroup analysis’ tab contains the same subtabs as the ‘meta-regression’ tab, and the subtabs will look mostly the same as when running a categorical meta-regression. The key difference is that in the ‘parameter estimates’ subtab, there is just one table showing the parameters for each subgroup, since there are no parameters shared between the subgroups. For the prior distributions, similarly to the standard meta-analysis model, we recommend keeping them at the default values in most cases. However, sometimes it will make sense to change them. For example, if it is known that the sensitivity or specificity for the test under evaluation may be very high \(( > 95\% )\), then a prior which places more prior probability on these values would be more appropriate than the default N(0, 1.5) prior.
We conducted a subgroup analysis for the type of IQCODE test used, either the 16-, 26- or 32-item version. Only one study used the 32-item version, so no analysis could be conducted for this subgroup. Four studies used the 26-item version and eight studies used the 16-item version. The results for these two subgroups were very similar. More specifically, for the 26-item subgroup we obtained sensitivity and specificity estimates of 0.88 (95% CrI = (0.77, 0.94)) and 0.65 (95% CrI = (0.49, 0.79)), and for the 16-item subgroup we obtained 0.91 (95% CrI = (0.86, 0.95)) and 0.63 (95% CrI = (0.51, 0.74)). These results can be compared to the meta-regression demonstration in the previous section, which made the stronger assumption that the between-study heterogeneity is the same across groups. An sROC plot showing the results of the subgroup analysis is shown in Fig. 8.
Imperfect gold standard
In addition to meta-analysis of test accuracy assuming a perfect gold standard using the bivariate model discussed in the “Perfect gold standard” section, our app also allows users to run meta-analysis of test accuracy without assuming a perfect gold standard, using LCMs [4, 5], within the “Imperfect gold standard” tab. This tab has the following subtabs: ‘model set up & priors’, ‘Run model’, ‘study-level outcomes’, ‘parameter estimates’, ‘model diagnostics’, and ‘sROC plot’.
The ‘Model set up & priors’ subtab for the LCM has more options than that of the bivariate model (see Fig. 9). This is because, in contrast to the bivariate model, which only estimates accuracy for the index test, the LCM estimates accuracy for both the index and the reference test(s), as well as the disease prevalence in each study. Users can choose various modelling options. More specifically, they can choose whether the reference and index test sensitivities and specificities are fixed between studies (i.e. “fixed effects”) or allowed to vary between studies (i.e. “random effects”). They can also choose whether to assume conditional independence between tests. In practice, conditional independence is typically not a reasonable assumption to make, since it implies that the test results are uncorrelated within the diseased and non-diseased groups [28]. However, sometimes it is not possible to run a model which does not assume conditional independence because it might be non-identifiable [29]; that is, there might be two (or more) sets of parameter values that fit the data equally well. For instance, the model may estimate the sensitivity of a test to be equal to both 0.20 and 0.80. This is more likely to occur when the number of parameters being estimated from the model is greater than what is possible for the given dataset (although it can also occur when it is possible to estimate all parameters). One way to lower the chance of this happening is to introduce more informative prior information; for instance, information about the accuracy of the reference test(s) is often known and can be obtained by searching the relevant literature and by consulting clinicians. Therefore, we would recommend using prior distributions based on such data, as opposed to the default N(0, 1.5) priors for the logit-transformed sensitivities and specificities. We would generally suggest leaving the other priors at the default values.
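To make the conditional independence assumption concrete, the sketch below computes the probability of each cross-classification cell for one index and one reference test under a two-class latent class model; the prevalence and accuracy values are hypothetical:

```python
def lcm_cell_probs(prev, se_index, sp_index, se_ref, sp_ref):
    # P(T_index = t1, T_ref = t2) under a two-class latent class model with
    # conditional independence: within each latent class the tests are
    # independent, so each joint cell probability is a prevalence-weighted
    # product over the two classes.
    probs = {}
    for t1 in (1, 0):
        for t2 in (1, 0):
            p_diseased = (se_index if t1 else 1 - se_index) * (se_ref if t2 else 1 - se_ref)
            p_healthy = ((1 - sp_index) if t1 else sp_index) * ((1 - sp_ref) if t2 else sp_ref)
            probs[(t1, t2)] = prev * p_diseased + (1 - prev) * p_healthy
    return probs

# Hypothetical values: prevalence 0.3, index (se 0.9, sp 0.7), reference (se 0.8, sp 0.95)
probs = lcm_cell_probs(0.3, 0.9, 0.7, 0.8, 0.95)
# The four cell probabilities sum to 1; multiplying by a study's sample size
# gives the expected cell counts that the likelihood is fitted to.
```

A conditional dependence model adds within-class covariance terms to these products, which is why it needs more information from the data (or the priors) to be identified.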
In addition to the Stan sampler diagnostics, Rhat statistics, and posterior density and trace plots, the ‘Model diagnostics’ subtab has two plots which allow users to assess the fit of the model: the correlation residual plot [30] and the frequency table probability residual plot. It also has a table which shows the overall deviance and the study-specific deviances.
We conducted an analysis using LCMs which do not assume a perfect gold standard. The studies included a variety of reference standards: four studies used the Diagnostic and Statistical Manual of Mental Disorders version III, revised (DSM-III-R) [31]; seven studies used version IV (DSM-IV) [32]; one study used the National Institute of Neurological and Communicative Diseases and Stroke/Alzheimer’s Disease and Related Disorders Association (NINCDS-ADRDA) [33] criteria; and one study used a combination of the DSM-III-R [31] and the International Classification of Diseases, version 10 (ICD-10) [34] criteria. Rather than assuming all the reference tests have the same accuracy (as is commonly done in practice), our application allows us to model the differences between the various reference tests using meta-regression. To incorporate prior knowledge into the model, we used information from an umbrella review [35] (i.e., a review of systematic reviews and meta-analyses). This umbrella review found that clinical dementia diagnostic criteria had a sensitivity range of 0.53 to 0.93 and a specificity range of 0.55 to 0.99. For the sensitivities and specificities of all of the reference tests, we used priors corresponding to a 95% prior interval of (0.43, 0.96).
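Translating an interval such as (0.43, 0.96) into a logit-scale normal prior is a small algebra exercise; the helper below is a standard-library sketch for readers, not the app's internal code:

```python
import math
from statistics import NormalDist

def logit(p):
    return math.log(p / (1 - p))

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def normal_prior_from_interval(p_lo, p_hi, level=0.95):
    # Find the N(mean, sd) prior on the logit scale whose `level` prior
    # interval, mapped back to the probability scale, is (p_lo, p_hi):
    # the interval endpoints on the logit scale pin down mean and sd.
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    lo, hi = logit(p_lo), logit(p_hi)
    return (lo + hi) / 2, (hi - lo) / (2 * z)

mean, sd = normal_prior_from_interval(0.43, 0.96)
# Round trip: expit(mean ± z * sd) recovers (0.43, 0.96)
```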
Analysis assuming conditional independence
We first analysed the data using a model which assumed conditional independence between the index test (IQCODE) and the reference tests. We assumed that the reference test accuracies were fixed between studies and assumed random effects for the IQCODE. For the IQCODE, we obtained sensitivity and specificity estimates of 0.94 (95% CrI = (0.89, 0.98)) and 0.77 (95% CrI = (0.62, 0.89)). The IQCODE was estimated to have a higher sensitivity but lower specificity than all of the reference tests. These results suggest that the analysis assuming a perfect gold standard conducted previously underestimates the sensitivity of the IQCODE by around 3% and underestimates the specificity by around 11%. An sROC plot of the results is shown in Fig. 10. The posterior distribution plots (see supplementary material Fig. 2) are satisfactory for all parameters since they are all unimodal (i.e. they have one peak), and the trace plots are also satisfactory for all parameters since they indicate that the chains overlap considerably and hence have mixed well (see supplementary material Fig. 3). Furthermore, all other sampler diagnostics were satisfactory (i.e., all Rhat statistics were less than 1.05 and there were no divergent transitions or iterations which exceeded the maximum treedepth [14]). Attempts to run a model assuming conditional independence between tests, with random effects for both the reference tests and the index test, resulted in unsatisfactory posterior distribution plots (see supplementary material Fig. 4). More specifically, some of the posterior distribution plots for the accuracy parameters were bimodal, that is, they have two peaks, meaning they would estimate the accuracy as being two different values and indicating that the model is non-identifiable. The correlation residual plot (see top of Fig. 11) suggests the conditional independence model provides a satisfactory fit to the data, since all of the 95% CrIs cross the zero line.
However, whilst the fit is good overall, the frequency table probability residual plot (see bottom plot of Fig. 11) shows that the 95% CrIs of four studies do not overlap the zero line. We found the median and mean overall deviance of this model to be 54.8 and 54.5, respectively.
Analysis assuming conditional dependence
As previously mentioned, despite obtaining a good correlation residual plot (see Fig. 11), conditional independence is typically not considered a reasonable assumption in clinical practice. Therefore, we attempted to fit a model without assuming conditional independence between the IQCODE and the reference tests. Similarly to the conditional independence model, there was not enough information to identify all model parameters under the conditional dependence assumption if random effects were assumed for all tests (see supplementary material Fig. 5); therefore, we made the stronger assumption of fixed effects for the reference tests to identify the model. For the IQCODE, we obtained sensitivity and specificity estimates of 0.89 (95% CrI = (0.82, 0.95)) and 0.71 (95% CrI = (0.58, 0.84)). Both of these estimates are lower than those from the model assuming conditional independence (see “Analysis assuming conditional independence” section), and suggest that the analysis assuming a perfect gold standard slightly overestimated the sensitivity, by around 2%, and underestimated the specificity, by around 5%. An sROC plot of the results is shown in Fig. 12. Furthermore, although the model with conditional independence provided a satisfactory fit (see Fig. 11), the conditional dependence model clearly provides a better fit (see Fig. 13), since it moves the summary estimates of the residual correlations and table frequency probability residuals closer to 0, and the median deviance has decreased from 54.8 to 43.8 (mean from 54.5 to 44.8).
Discussion
In this paper, we presented MetaBayesDTA, an extensively expanded web-based R Shiny [17] application based on MetaDTA [9]. The application enables users to conduct Bayesian meta-analysis of diagnostic test accuracy studies, either assuming a perfect reference test or modelling an imperfect one, without having to install any software or have any knowledge of R [16] or Stan [14] programming.
The application uses the bivariate model [1] to conduct analyses assuming a perfect reference test, and users can also conduct univariate meta-regression and subgroup analysis. It uses LCMs [4, 5] to conduct analyses without assuming a perfect gold standard: users can run models assuming conditional independence or dependence, choose whether to model the reference and index test sensitivities and specificities as fixed or random effects, and model multiple reference tests using a meta-regression covariate for the type of reference test. The application allows users to input their own prior distributions, which is particularly useful for the LCMs since information about the accuracy of the reference test(s) is often known. Similarly to MetaDTA [9], the tables and figures can be downloaded, and the graphs are highly customizable. Furthermore, risk of bias and quality assessment results from the QUADAS-2 [22] tool can be incorporated into the sROC plot; integrating risk of bias into the main analysis reduces the tendency to treat it as an afterthought. Sensitivity analysis allowing users to remove selected studies can also be carried out easily for all models.
As we discussed in the “why is this application needed?” section (see Table 1), our app offers improvements over both BayesDTA [12] and MetaDTA [9,10,11]. Namely, for the bivariate model, unlike both BayesDTA and MetaDTA, our app allows subgroup analysis and univariate meta-regression (with either a categorical or a continuous covariate) to be carried out, which also allows users to easily conduct comparative test accuracy meta-analysis to compare two or more tests to one another. Furthermore, unlike BayesDTA, for the LCM, our app can assess model fit using the correlation residual plot [30], and it can model multiple reference tests using a categorical covariate for the type of reference test. This is important since studies included in a meta-analysis of test accuracy often use different reference tests, whose accuracy can vary greatly. Even though our app is more complicated than MetaDTA (it can run five different models rather than one, and its graphs have more customization options), it has a cleaner layout, and many of the menus are hidden unless the user clicks on them to display more options, thanks to the shinydashboard [19] and shinyWidgets [20] R packages. In general, there are benefits to using Bayesian methods, as opposed to frequentist ones, for meta-analysis of test accuracy. For instance, being able to include informative prior information is particularly useful for the imperfect gold standard model, where parameter identifiability is often an issue. Furthermore, Bayesian methods generally outperform frequentist methods when there are few studies in a meta-analysis (which is often the case), as frequentist methods are more likely to underestimate the between-study heterogeneity [36].
Our web application has some limitations, which point to future developments. For example, relating to meta-analysis of test accuracy without assuming a perfect gold standard (LCM), whilst users can model the data without assuming conditional independence between tests, the application does not offer functionality to impose restrictions on the correlation structure. A potential improvement would therefore be to allow users to impose such restrictions, for example assuming the same correlation in the diseased and non-diseased groups, and/or forcing the correlations between the tests to be positive. Another limitation of the LCM is that it can only model different reference tests using categorical meta-regression, and therefore assumes that all of the reference tests have the same between-study variances. Although this is often an advantage compared to conducting a subgroup analysis for each reference test, sometimes it might make sense to run a more complex model which assumes separate between-study variances for some reference tests and fixed effects for reference tests observed in only a few (e.g., 5) studies; adding this functionality is a potential update.
For the bivariate model, a potential update for both subgroup analysis and categorical meta-regression would be to allow users to specify different priors for each of the groups. Furthermore, for meta-regression, although our application allows users to see the pairwise differences and ratios between the different categories of a categorical covariate (making it possible to use for comparative test accuracy of multiple tests), it only shows these for the meta-regression which assumes the variances are the same across all tests. However, in some instances it might make sense for the variances of some (or all, which would be equivalent to conducting a subgroup analysis) of the tests to be different, so a future update would be to also display the pairwise differences and ratios for the subgroup analysis, and to allow users to assume independent variances for some tests but shared variances across others.
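Computing such pairwise differences and ratios from a fitted Bayesian model is itself straightforward: they are formed draw by draw from the posterior samples, so the credible intervals automatically propagate the joint uncertainty. A minimal sketch (the posterior draws here are simulated stand-ins, not output from MetaBayesDTA):

```python
import numpy as np

rng = np.random.default_rng(1)
n_draws = 4000

# Stand-in posterior draws of logit-sensitivity for two tests
logit_sens = {"test_A": rng.normal(1.5, 0.2, n_draws),
              "test_B": rng.normal(1.0, 0.2, n_draws)}
sens = {k: 1.0 / (1.0 + np.exp(-v)) for k, v in logit_sens.items()}

# Pairwise comparison, formed draw by draw on the probability scale
diff = sens["test_A"] - sens["test_B"]
ratio = sens["test_A"] / sens["test_B"]

# Posterior medians and 95% credible intervals
diff_summary = np.percentile(diff, [2.5, 50, 97.5])
ratio_summary = np.percentile(ratio, [2.5, 50, 97.5])
```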
Another limitation is that our application only allows subgroup analysis and meta-regression (besides modelling different reference tests) to be conducted using the bivariate model, which assumes a perfect gold standard. A potential improvement would be to allow users to run subgroup analyses and meta-regression for the LCM. Furthermore, the application requires users to have some knowledge of Bayesian model diagnostics to check that the models have fitted correctly, although it does contain information (in the “model diagnostics” tabs) which explains how to interpret some of these diagnostics and directs users to online resources, so that users do not have to find this information themselves.
It is important to note that this app is a beta version, so it is expected that there may be some bugs. We therefore welcome any user feedback; this can be given by completing the user feedback questionnaire (a link is provided in a pop-up box which appears when accessing MetaBayesDTA), or by emailing the first author of this paper. Responses to this questionnaire will inform future updates of the application, ensuring that the user-friendliness of MetaBayesDTA increases over time and that it becomes a widely used diagnostic test accuracy meta-analysis web application, as MetaDTA [9] has. A number of features included in MetaBayesDTA were added as a result of user and stakeholder feedback, including the imperfect gold standard models, the meta-regression and subgroup analysis, the “hidden” menus and options that make the interface look cleaner and less intimidating, and the Bayesian capabilities of the application.
In general, one could argue that easy-to-use apps could lead to the over-application of complex methods even when they are not appropriate. Web applications such as the one presented in this paper allow less experienced researchers to conduct complex analyses which would otherwise be inaccessible to them, lowering the amount of knowledge needed to perform the analysis and therefore increasing the chance of invalid results being published. We therefore recommend that a statistician (with knowledge of how to check Bayesian model diagnostics) be part of the review team. Furthermore, we have implemented a number of features in our application to minimise the risk of misleading research outputs being produced. These include: informative pop-up boxes which give information about setting up appropriate prior distributions and remind users to check the sampler diagnostics every time they run a new model; guidance in the “sampler diagnostics” tab so that users can interpret the sampler diagnostics; and appropriate restrictions (e.g., whenever random effects are used, the 95% prediction regions are always displayed on the sROC plots; we do not allow only 95% credible regions to be displayed, as this would not portray the between-study heterogeneity and could be misleading).
One could also argue that the widespread usability of apps could stimulate the uptake of more appropriate methods, so that better methods become standard practice more quickly. This could have important impacts on clinical practice; for instance, the fact that our app allows one to easily conduct a meta-analysis of test accuracy without assuming a gold standard, and without assuming that the same reference test is used across all studies, opens up many new datasets to synthesis, since many studies are conducted using different imperfect reference tests.
Conclusions
In this paper, we presented MetaBayesDTA [13], a user-friendly, interactive web application which allows users to conduct Bayesian meta-analysis of test accuracy, with or without a gold standard. The application uses methods which were previously only available through statistical programming languages such as R [16].
This application could have a wide-ranging impact across academia, guideline writing, policy making, and industry. For example, when no perfect reference test is available, the estimates of test accuracy can change quite notably when the perfect reference test assumption is relaxed, leading to potentially different conclusions about the accuracy of a test, which could ultimately change which tests are used in clinical practice. Furthermore, the ability of the app to easily conduct comparative test accuracy meta-analysis means that clinicians will be better able to determine which tests perform best.
Availability and requirements
Project name: MetaBayesDTA
Project home page: https://crsu.shinyapps.io/MetaBayesDTA/
Operating system(s): Platform independent
Programming language: R, Stan
Other requirements: Web browser (R Shiny officially supports Google Chrome, Mozilla Firefox, Safari, or Internet Explorer)
License: Not applicable.
Any restrictions to use by nonacademics: None
Availability of data and materials
The web application (and the dataset used for analysis) is available at: https://crsu.shinyapps.io/MetaBayesDTA/.
The data and the R and Stan code for the web application are available at: https://github.com/CRSUApps/MetaBayesDTA.
Abbreviations
HSROC: hierarchical summary receiver operating characteristic

LCM: latent class model

IQCODE: Informant Questionnaire on Cognitive Decline in the Elderly

QUADAS-2: QUality Assessment of Diagnostic Accuracy Studies, version 2

CI: Confidence Interval

sROC: summary Receiver Operating Characteristic

SD: Standard Deviation

LKJ: Lewandowski-Kurowicka-Joe

CrI: Credible Interval

DSM-III-R: Diagnostic and Statistical Manual of Mental Disorders, version III, revised

DSM-IV: Diagnostic and Statistical Manual of Mental Disorders, version IV

NINCDS-ADRDA: National Institute of Neurological and Communicative Diseases and Stroke/Alzheimer’s Disease and Related Disorders Association

ICD-10: International Classification of Diseases, version 10
References
Reitsma JB, Glas AS, Rutjes AWS, Scholten RJPM, Bossuyt PM, Zwinderman AH. Bivariate analysis of sensitivity and specificity produces informative summary measures in diagnostic reviews. J Clin Epidemiol. 2005. https://doi.org/10.1016/j.jclinepi.2005.02.022.
Rutter CM, Gatsonis CA. A hierarchical regression approach to meta-analysis of diagnostic test accuracy evaluations. Stat Med. 2001. https://doi.org/10.1002/sim.942.
Harbord RM, Deeks JJ, Egger M, Whiting P, Sterne JAC. A unification of models for meta-analysis of diagnostic accuracy studies. Biostatistics. 2007. https://doi.org/10.1093/biostatistics/kxl004.
Chu H, Chen S, Louis TA. Random effects models in a meta-analysis of the accuracy of two diagnostic tests without a gold standard. J Am Stat Assoc. 2009. https://doi.org/10.1198/jasa.2009.0017.
Menten J, Boelaert M, Lesaffre E. Bayesian meta-analysis of diagnostic tests allowing for imperfect reference standards. Stat Med. 2013. https://doi.org/10.1002/sim.5959.
Dendukuri N, Schiller I, Joseph L, Pai M. Bayesian Meta-Analysis of the Accuracy of a Test for Tuberculous Pleuritis in the Absence of a Gold Standard Reference. Biometrics. 2012. https://doi.org/10.1111/j.1541-0420.2012.01773.x.
Barrett JK, Farewell VT, Siannis F, Tierney J, Higgins JPT. Two-stage meta-analysis of survival data from individual participants using percentile ratios. Stat Med. 2012;31(30):4296–4308. https://doi.org/10.1002/sim.5516.
Littenberg B, Moses LE. Estimating Diagnostic Accuracy from Multiple Conflicting Reports: A New Meta-analytic Method. Med Decis Making. 1993;13(4):313–321. PMID: 8246704. https://doi.org/10.1177/0272989X9301300408.
Freeman SC, Kerby CR, Patel A, Cooper NJ, Quinn T, Sutton AJ. Development of an interactive web-based tool to conduct and interrogate meta-analysis of diagnostic test accuracy studies: MetaDTA. BMC Med Res Methodol. 2019;19(1):81.
Patel A, Cooper N, Freeman S, Sutton A. Graphical enhancements to summary receiver operating characteristic plots to facilitate the analysis and reporting of meta-analysis of diagnostic test accuracy data. Res Synth Methods. 2021;12(1):34–44. https://doi.org/10.1002/jrsm.1439.
Freeman SC, Kerby CR, Patel A, Cooper NJ, Quinn T, Sutton AJ. MetaDTA. 2019. www.crsu.shinyapps.io/dta_ma. Accessed Sept 2022.
Yao M, Schiller I, Dendukuri N. BayesDTA. 2021. https://bayesdta.shinyapps.io/metaanalysis/. Accessed Sept 2022.
Cerullo E, Sutton AJ, Jones HE, Wu O, Quinn TJ, Cooper NJ. MetaBayesDTA. 2022. https://crsu.shinyapps.io/MetaBayesDTA/. Accessed Sept 2022.
Stan Modeling Language Users Guide and Reference Manual. 2020. https://mc-stan.org/docs/2_25/reference-manual. Accessed Sept 2022.
Takwoingi Y, Dendukuri N, Schiller I, Rücker G, Jones H, Partlett C, Macaskill P, et al. Chapter 11: Undertaking meta-analysis. Draft version (27 September 2021) for inclusion in: Deeks JJ, Bossuyt PMM, Leeflang MMG, Takwoingi Y, editors. Cochrane Handbook for Systematic Reviews of Diagnostic Test Accuracy, Version 2. Cochrane; 2021.
R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. 2021. https://www.R-project.org. Accessed Sept 2022.
Chang W, Cheng J, Allaire J, Sievert C, Schloerke B, Xie Y, et al. shiny: Web Application Framework for R. 2022. R package version 1.7.2. https://CRAN.R-project.org/package=shiny. Accessed Sept 2022.
Stan Development Team. RStan: the R interface to Stan. 2022. R package version 2.21.5. https://mc-stan.org/. Accessed Sept 2022.
Chang W, Borges Ribeiro B. shinydashboard: Create Dashboards with ‘Shiny’. 2021. R package version 0.7.2. https://CRAN.R-project.org/package=shinydashboard. Accessed Sept 2022.
Perrier V, Meyer F, Granjon D. shinyWidgets: Custom Inputs Widgets for Shiny. 2022. R package version 0.7.2. https://CRAN.R-project.org/package=shinyWidgets. Accessed Sept 2022.
Harrison J, Fearon P, NoelStorr A, McShane R, Stott D, Quinn T. Informant Questionnaire on Cognitive Decline in the Elderly (IQCODE) for the diagnosis of dementia within a secondary care setting. Cochrane Database Syst Rev. 2015;(3). https://doi.org/10.1002/14651858.CD010772.pub2.
QUADAS-2: A Revised Tool for the Quality Assessment of Diagnostic Accuracy Studies. Ann Intern Med. 2011;155(8):529–536. PMID: 22007046. https://doi.org/10.7326/0003-4819-155-8-201110180-00009.
Chu H, Cole SR. Bivariate meta-analysis of sensitivity and specificity with sparse data: a generalized linear mixed model approach. J Clin Epidemiol. 2006;59(12):1331–2; author reply 1332–3.
Lewandowski D, Kurowicka D, Joe H. Generating random correlation matrices based on vines and extended onion method. J Multivar Anal. 2009. https://doi.org/10.1016/j.jmva.2009.04.008.
Burke DL, Ensor J, Snell KIE, van der Windt D, Riley RD. Guidance for deriving and presenting percentage study weights in meta-analysis of test accuracy studies. Res Synth Methods. 2018;9(2):163–178. https://doi.org/10.1002/jrsm.1283.
Betancourt M. A Conceptual Introduction to Hamiltonian Monte Carlo. 2018. https://arxiv.org/abs/1701.02434.
Jorm AF, Scott R, Cullen JS, MacKinnon AJ. Performance of the Informant Questionnaire on Cognitive Decline in the Elderly (IQCODE) as a screening test for dementia. Psychol Med. 1991;21(3):785–90. https://doi.org/10.1017/S0033291700022418.
Vacek PM. The Effect of Conditional Dependence on the Evaluation of Diagnostic Tests. Biometrics. 1985. https://doi.org/10.2307/2530967.
Jones G, Johnson WO, Hanson TE, Christensen R. Identifiability of Models for Multiple Diagnostic Testing in the Absence of a Gold Standard. Biometrics. 2010;66(12):855–63.
Qu Y, Tan M, Kutner MH. Random Effects Models in Latent Class Analysis for Evaluating Accuracy of Diagnostic Tests. Biometrics. 1996. https://doi.org/10.2307/2533043.
APA. Diagnostic and Statistical Manual of Mental Disorders. 3rd ed. Washington DC: American Psychiatric Association; 1987.
APA. Diagnostic and Statistical Manual of Mental Disorders. 4th ed. Washington DC: American Psychiatric Association; 1994.
McKhann G, Drachman D, Folstein M, Katzman R, Price D, Stadlan EM. Clinical diagnosis of Alzheimer’s disease. Neurology. 1984;34(7):939. https://doi.org/10.1212/WNL.34.7.939.
WHO. The ICD10 Classification of Mental and Behavioural Disorders. 10th ed. Geneva: World Health Organisation; 1993.
Gaugler JE, Kane RL, Johnston JA, Sarsour K. Sensitivity and specificity of diagnostic accuracy in Alzheimer’s disease: a synthesis of existing evidence. 2013. https://doi.org/10.1177/1533317513488910.
Williams DR, Rast P, Bürkner PC. Bayesian Meta-Analysis with Weakly Informative Prior Distributions. 2018. https://doi.org/10.31234/osf.io/7tbrm.
Acknowledgements
The authors would like to thank Olivia Carter for carefully proofreading the manuscript.
Funding
The work was carried out whilst EC was funded by a National Institute for Health Research (NIHR) Complex Reviews Support Unit (project number 14/178/29) and by an NIHR doctoral research fellowship (project number NIHR302333). The views and opinions expressed herein are those of the authors and do not necessarily reflect those of the NIHR, NHS or the Department of Health. The NIHR had no role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript. This project is funded by the NIHR Applied Research Collaboration East Midlands (ARC EM). The views expressed are those of the authors and not necessarily those of the NIHR or the Department of Health and Social Care.
Author information
Authors and Affiliations
Contributions
EC coded the software application and wrote the manuscript. AS, TQ and NC conceived the project. All authors critically reviewed and revised the manuscript. All authors read and approved the final manuscript. All authors take responsibility for the accuracy and integrity of the work.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Cerullo, E., Sutton, A.J., Jones, H.E. et al. MetaBayesDTA: codeless Bayesian meta-analysis of test accuracy, with or without a gold standard. BMC Med Res Methodol 23, 127 (2023). https://doi.org/10.1186/s12874-023-01910-y
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12874-023-01910-y
Keywords
 Meta-Analysis
 Diagnostic test accuracy
 Application
 Imperfect gold standard
 Latent class