Graphical presentation of diagnostic information

Whiting, Penny F; Sterne, Jonathan AC; Westwood, Marie E; Bachmann, Lucas M; Harbord, Roger; Egger, Matthias; Deeks, Jonathan J

doi:10.1186/1471-2288-8-20

Correspondence
Open access
Published: 11 April 2008

BMC Medical Research Methodology volume 8, Article number: 20 (2008)

Abstract

Background

Graphical displays of results allow researchers to summarise and communicate the key findings of their study. Diagnostic information should be presented in an easily interpretable way, which conveys both test characteristics (diagnostic accuracy) and the potential for use in clinical practice (predictive value).

Methods

We discuss the types of graphical display commonly encountered in primary diagnostic accuracy studies and systematic reviews of such studies, and systematically review the use of graphical displays in recent diagnostic primary studies and systematic reviews.

Results

We identified 57 primary studies and 49 systematic reviews. Fifty-six percent of primary studies and 53% of systematic reviews used graphical displays to present results. Dot plots or box-and-whisker plots were the most commonly used graphs in primary studies and were included in 22 (39%) studies. ROC plots were the most common type of plot in systematic reviews and were included in 22 (45%) reviews. One primary study and five systematic reviews included a probability-modifying plot.

Conclusion

Graphical displays are currently underused in primary diagnostic accuracy studies and systematic reviews of such studies. Diagnostic accuracy studies need to include multiple types of graphic in order to provide both a detailed overview of the results (diagnostic accuracy) and to communicate information that can be used to inform clinical practice (predictive value). Work is required to improve graphical displays, to better communicate the utility of a test in clinical practice and the implications of test results for individual patients.

Background

Readers of a research report evaluating a diagnostic test may wish to assess the test's characteristics (diagnostic accuracy) or evaluate the impact that its use has on diagnostic decisions (predictive value) for individual patients. Graphical displays of results of test accuracy studies allow researchers to summarise and communicate the key findings of their study. We discuss the types of graphical display commonly encountered in primary diagnostic accuracy studies and systematic reviews of such studies, and systematically review the use of graphical displays in recent diagnostic systematic reviews and primary studies. Table 1 defines the various measures of diagnostic accuracy used.

Table 1 Definitions of measures of diagnostic accuracy

Types of graphical display

Primary studies

Figure 1 illustrates four types of graphical display commonly used to present data on diagnostic accuracy for primary diagnostic accuracy studies. We used data from a study of the biochemical tumour marker CA-19-9 antigen to diagnose pancreatic cancer to construct these graphs [1].

Dot plots (Figure 1a) and box-and-whisker plots (Figure 1b)

Dot plots are used for test results that take many values, and display the distribution of results in patients with and without the target condition. Box-and-whisker plots summarise these distributions: the central box covers the interquartile range, with the median indicated by the line within the box. The whiskers extend either to the minimum and maximum values or to the most extreme values within 1.5 times the interquartile range of the quartiles, in which case more extreme values are plotted individually [2]. Sometimes an indication of the threshold used to define a positive test result is included, for example by adding a horizontal line or shading at the relevant point. Such plots can clearly summarise a large volume of data, but they display only differences in the distribution of test values between patients with and without the target condition; they do not directly display the diagnostic performance of the test.
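The quantities drawn in a box-and-whisker plot can be sketched in a few lines of Python. This is a minimal illustration, assuming the 1.5 interquartile range convention described above and one of several possible quartile definitions (the "inclusive" method of the standard library); the function name and the test values are hypothetical and are not taken from the CA-19-9 study.

```python
import statistics

def box_whisker_summary(values):
    """Return the quartiles, whisker ends and individually plotted points.

    Assumption: whiskers reach the most extreme values lying within
    1.5 interquartile ranges of the quartiles; anything beyond is an outlier.
    """
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,
    }

# Hypothetical test results for one patient group (illustrative only).
summary = box_whisker_summary([5, 8, 12, 15, 20, 22, 30, 35, 400])
print(summary)
```

Running the same function separately on patients with and without the target condition gives the two boxes that are drawn side by side.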

Although the CA-19-9 antigen test to diagnose pancreatic cancer (used to construct Figure 1) is an example of continuous data, it is also possible to construct similar graphs for categorical test results providing that the number of categories is reasonably large. Alternatively, for smaller numbers of categories, similar information can be conveyed using paired bar charts/histograms. Paired histograms show the distribution of test results in patients with the target condition above the x-axis and the distribution in patients without the target condition below the x-axis. These types of graphical display are less commonly used. It is not possible to construct any of these graphs for truly dichotomous test results. However, truly dichotomous tests rarely occur in practice. Examples of dichotomous tests include dipstick tests that change colour if the target condition is said to be present (although these are based on an underlying implicit threshold) or the presence/absence of certain clinical symptoms.

Receiver operating characteristic (ROC) plot (Figure 1c)

ROC plots show values of sensitivity and specificity at all of the possible thresholds that could be used to define a positive test result [3]. Typically, sensitivity (true positive rate) is plotted against 1-specificity (false positive rate): each point represents a different threshold in the same group of patients. Stepped lines are used for continuous test results while sloping lines are used for ordered categories. ROC curves may be derived directly from the observed sensitivity and specificity corresponding to different test thresholds, or by fitting curves based on parametric [4], semi-parametric [5, 6], or non-parametric methods [7]. The area under the ROC curve (AUC) is a summary of diagnostic performance, and for an informative test takes values between 0.5 and 1 (values below 0.5 would indicate a test that performs worse than chance). The more accurate the test, the more closely the curve approaches the top left hand corner of the graph (AUC = 1). A test that provides no diagnostic information (AUC = 0.5) will produce a straight line from the bottom left to the top right. ROC curves may be restricted to a range of sensitivities or specificities of clinical interest.
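As a rough illustration of how an empirical ROC curve and its AUC follow from the raw test values, the sketch below computes one (1-specificity, sensitivity) point per candidate threshold and integrates with the trapezoidal rule. It assumes that higher values indicate disease and that a result is positive when it is at or above the threshold; the function names and the data are hypothetical, and this is the simple non-fitted approach, not any of the fitted-curve methods cited above.

```python
def roc_points(diseased, non_diseased):
    """Empirical ROC points, one per observed value used as a threshold.

    Assumption: a result counts as positive when value >= threshold.
    Returns (1 - specificity, sensitivity) pairs from (0, 0) to (1, 1).
    """
    thresholds = sorted(set(diseased) | set(non_diseased), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        sensitivity = sum(v >= t for v in diseased) / len(diseased)
        false_positive_rate = sum(v >= t for v in non_diseased) / len(non_diseased)
        points.append((false_positive_rate, sensitivity))
    return points

def auc_trapezoid(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical test values (illustrative only).
with_condition = [10, 40, 55, 80]
without_condition = [5, 12, 20, 45]

points = roc_points(with_condition, without_condition)
auc = auc_trapezoid(points)
print(points)
print(auc)  # 0.75 for these illustrative values
```

For these values 12 of the 16 possible pairs of one patient with and one without the condition are correctly ordered by the test, which matches the area of 0.75.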

ROC plots show how estimated sensitivity and specificity vary according to the threshold chosen, and can be used to identify suitable thresholds for clinical practice if the points on the curve are labelled with the corresponding threshold as in Figure 1c, which shows for example that the sensitivity and specificity corresponding to a threshold of 39.3 are 76% and 90%, respectively. Confidence intervals can be added to indicate the uncertainty in estimates of test performance at each point. ROC plots also allow comparison of the performance of several tests independently of choice of threshold, by plotting data sets for multiple tests in the same ROC space. However, they are thought to be difficult to interpret as they describe the characteristics of the test in a way which does not relate directly to its usefulness in clinical practice; research has shown that ROC plots are generally poorly understood by clinicians [8].

Flow charts (Figure 1d)

These depict the flow of patients through the study: for example how many patients were eligible, how many entered the study, how many of these had the target condition, and the numbers testing positive and negative. Such charts require categorisation of test results, for example as "positive" and "negative". Although flow charts do not directly present diagnostic accuracy data, addition of percentages to the test result boxes (as in Figure 1d) can be used to report test sensitivity (68/90 = 76%) and specificity (46/51 = 90%). Charts that first separate individuals according to test result before classification by disease status may similarly be used to depict positive and negative predictive values. The STARD (standards for reporting of diagnostic accuracy) statement, an initiative to improve the reporting of diagnostic test accuracy studies similar to the CONSORT statement for clinical trials, recommends the inclusion of a flow diagram in all reports of primary diagnostic accuracy studies [9]. This should illustrate the design of the study and provide information on the numbers of participants at each stage of the study as well as the results of the study. The example flow chart in Figure 1d is not a full STARD flow diagram as we do not have data on numbers of withdrawals or uninterpretable results from this study. It does, however, show the design (diagnostic case-control) and results of the study.
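The percentages quoted for the flow chart can be reproduced from the counts given in the text. The sketch below is only a worked check of that arithmetic; the function name is hypothetical, and the false negative and false positive counts are derived by subtraction from the totals of 90 and 51.

```python
def accuracy_from_counts(tp, fn, tn, fp):
    """Sensitivity and specificity from the four cells of a 2x2 table."""
    sensitivity = tp / (tp + fn)   # positives among those with the condition
    specificity = tn / (tn + fp)   # negatives among those without it
    return sensitivity, specificity

# Counts from the text: 68 of 90 patients with the target condition tested
# positive, and 46 of 51 patients without it tested negative.
sens, spec = accuracy_from_counts(tp=68, fn=90 - 68, tn=46, fp=51 - 46)
print(f"sensitivity = {sens:.0%}, specificity = {spec:.0%}")
# prints "sensitivity = 76%, specificity = 90%"
```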

Systematic reviews

Figure 2 illustrates two graphical displays commonly used to present data on diagnostic accuracy in diagnostic systematic reviews. Data from a systematic review of dipstick tests for urinary nitrite and leukocyte esterase to diagnose urinary tract infections were used to construct these graphs [10].

Forest plots (Figure 2a)

Forest plots are commonly used to display results of meta-analysis. They display results from the individual studies together with, optionally, a summary (pooled) estimate. Point estimates are shown as dots or squares (sometimes sized according to precision or sample size) and confidence intervals as horizontal lines [11]. The pooled estimate is displayed as a diamond whose centre represents the estimate and whose tips represent the limits of the confidence interval.

For diagnostic accuracy studies, measures of test performance (sensitivity, specificity, predictive values, likelihood ratios or diagnostic odds ratio) are plotted on the horizontal axis. Diagnostic test performance is often described by pairs of summary statistics (e.g. sensitivity and specificity; positive and negative likelihood ratios), and these are depicted side-by-side. Between-study heterogeneity can readily be assessed by visual examination. Results may be sorted by one of a pair of test performance measures, usually that which is most important to the clinical application of the test. A disadvantage of paired forest plots is that they do not directly display the inverse association between the two measures that commonly results from variations in threshold between studies.
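Each row of a paired forest plot needs a point estimate and a confidence interval for each measure. The sketch below shows one way to compute these for a proportion such as sensitivity or specificity, assuming a Wilson score interval; the source does not say which interval method is used, so this choice, the function name and the default 95% level are assumptions made for illustration.

```python
import math

def wilson_interval(successes, total, z=1.96):
    """Wilson score confidence interval for a proportion.

    Assumption: z = 1.96 gives an approximate 95% interval.
    Returns (point_estimate, lower_limit, upper_limit).
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denominator = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denominator
    half_width = (
        z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denominator
    )
    return p, centre - half_width, centre + half_width

# Illustrated with the sensitivity counts quoted earlier in the text (68/90).
estimate, lower, upper = wilson_interval(68, 90)
print(f"{estimate:.2f} ({lower:.2f} to {upper:.2f})")
```

In a forest plot the first value would be drawn as the dot or square and the other two as the ends of the horizontal line.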

ROC plots and summary ROC (SROC) curves (Figure 2b)

ROC plots can be used to present the results of diagnostic systematic reviews, but differ from those used in primary studies as each point typically represents a separate study or data set within a study (individual studies may contribute more than one point). A summary ROC (SROC) curve can be estimated using one of several methods [12–15] and quantifies test accuracy and the association between sensitivity and specificity based on differences between studies. As with forest plots, ROC plots provide an overview of the results of all included studies. However, unless there are very few studies, it is not feasible to display confidence intervals as the plot would become cluttered. Results for several tests can be displayed on the same plot, facilitating test comparisons. It is also possible to display pooled estimates of sensitivity and specificity together with associated confidence intervals or prediction regions. ROC plots may also be used to investigate possible explanations for differences in estimates of accuracy between studies, for example those arising from differences in study quality. Figure 3 shows results for a recent review that we conducted on the accuracy of magnetic resonance imaging (MRI) for the diagnosis of multiple sclerosis (MS) [16]. By using different symbols to illustrate studies that did (diagnostic cohort studies) and did not (other study designs) include an appropriate patient spectrum we were able to show that studies that included an inappropriate patient spectrum grossly overestimated both sensitivity and specificity.

Other plots

Various other graphical methods have been developed to display the results of systematic reviews and meta-analyses [17, 18]. Although not generally developed specifically for diagnostic test reviews these can be adapted to display the results of such reviews. Funnel plots [19] and Galbraith plots [20] are often used to assess evidence for publication bias or small study effects in systematic reviews of the effects of medical interventions assessed in randomized controlled trials. However, their application to systematic reviews of diagnostic test accuracy studies is problematic [20]. Diagnostic odds ratios are typically far from 1, and it has been shown that, for data of this type, sampling variation can lead to artefactual associations between log odds ratios and their standard errors [21]. It is therefore recommended that the effective sample size funnel plot be used in reviews of test accuracy studies [20].
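The quantities plotted in an effective sample size funnel plot can be sketched as follows. This assumes the commonly used definition of effective sample size, 4·n1·n2/(n1 + n2) for n1 participants with and n2 without the target condition, plotted as 1/sqrt(ESS) against the log diagnostic odds ratio; the source does not spell the formula out, so it should be checked against reference [20]. The function name is hypothetical, and the 0.5 continuity correction for zero cells is a common convention rather than something stated in the text.

```python
import math

def funnel_plot_coordinates(tp, fn, fp, tn):
    """Log diagnostic odds ratio and 1/sqrt(effective sample size) for one study.

    Assumptions: ESS = 4 * n1 * n2 / (n1 + n2); 0.5 is added to every cell
    when any cell is zero so that the odds ratio is defined.
    """
    n_diseased = tp + fn
    n_non_diseased = fp + tn
    ess = 4 * n_diseased * n_non_diseased / (n_diseased + n_non_diseased)
    if 0 in (tp, fn, fp, tn):
        tp, fn, fp, tn = (cell + 0.5 for cell in (tp, fn, fp, tn))
    dor = (tp * tn) / (fp * fn)
    return math.log(dor), 1 / math.sqrt(ess), dor, ess

# Illustrated with the 2x2 counts implied by the flow chart figures quoted
# earlier (68/90 test positive with the condition, 46/51 negative without).
log_dor, inv_root_ess, dor, ess = funnel_plot_coordinates(tp=68, fn=22, fp=5, tn=46)
print(round(dor, 1), round(ess, 1))
```

A diagnostic odds ratio this far from 1 is the situation in which, as noted above, plots based on the standard error of the log odds ratio can show artefactual asymmetry.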

Predictive value

A number of graphical displays aim to put results of diagnostic test evaluations into clinical context, based either on primary studies or systematic reviews. Two graphical displays commonly used for this purpose are the likelihood ratio nomogram (Figure 4a) and the probability-modifying plot (Figure 4b). Each allows the reader to estimate the post-test probability of the target condition in an individual patient, based on a selected pre-test probability. To use the likelihood ratio nomogram, the reader needs an estimate of the likelihood ratios for the test. The reader then draws a line from the selected pre-test probability through the appropriate likelihood ratio on the central axis, and reads off the post-test probability of disease where the line meets the third axis. The probability-modifying plot depicts separate curves for positive and negative test results. The reader draws a vertical line from the selected pre-test probability to the appropriate likelihood ratio curve and then reads the post-test probability off the vertical scale. Both graph types are based on a single estimate of test accuracy (likelihood ratio), although it is possible to plot separate curves on the probability-modifying plot or lines on the nomogram to depict confidence intervals around the estimated likelihood ratios. Each assumes constant likelihood ratios across the range of pre-test probabilities. However, this assumption may be violated in practice [22], because populations in which the test is used may have a different spectrum of disease from those in which estimates of test accuracy were derived.
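Both displays are graphical versions of the same calculation: convert the pre-test probability to odds, multiply by the likelihood ratio, and convert back to a probability. The sketch below illustrates this under the constant likelihood ratio assumption just described. The function names are hypothetical, the likelihood ratios are derived from the sensitivity and specificity counts quoted earlier in the text (68/90 and 46/51), and the pre-test probability of 20% is chosen arbitrarily for illustration.

```python
def likelihood_ratios(sensitivity, specificity):
    """Positive and negative likelihood ratios from sensitivity and specificity."""
    positive_lr = sensitivity / (1 - specificity)
    negative_lr = (1 - sensitivity) / specificity
    return positive_lr, negative_lr

def post_test_probability(pre_test_probability, likelihood_ratio):
    """Post-test probability via odds; assumes a constant likelihood ratio."""
    if not 0 < pre_test_probability < 1:
        raise ValueError("pre-test probability must lie strictly between 0 and 1")
    pre_test_odds = pre_test_probability / (1 - pre_test_probability)
    post_test_odds = pre_test_odds * likelihood_ratio
    return post_test_odds / (1 + post_test_odds)

positive_lr, negative_lr = likelihood_ratios(68 / 90, 46 / 51)

# Hypothetical pre-test probability of 20% (illustrative only).
after_positive = post_test_probability(0.20, positive_lr)
after_negative = post_test_probability(0.20, negative_lr)
print(f"after a positive result: {after_positive:.1%}")
print(f"after a negative result: {after_negative:.1%}")
```

Evaluating `post_test_probability` over a grid of pre-test probabilities, once with each likelihood ratio, gives the two curves of a probability-modifying plot.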