- Technical advance
- Open Access
- Open Peer Review
The agreement chart
- Shrikant I Bangdiwala^{1} and
- Viswanathan Shankar^{2}
https://doi.org/10.1186/1471-2288-13-97
© Bangdiwala and Shankar; licensee BioMed Central Ltd. 2013
- Received: 8 January 2013
- Accepted: 23 July 2013
- Published: 29 July 2013
Abstract
Background
When assessing the concordance between two methods of measurement of ordinal categorical data, summary measures such as Cohen’s (1960) kappa or Bangdiwala’s (1985) B-statistic are used. However, a picture conveys more information than a single summary measure.
Methods
We describe how to construct and interpret Bangdiwala’s (1985) agreement chart and illustrate its use in visually assessing concordance in several example clinical applications.
Results
The agreement charts provide a visual impression that no summary statistic can convey, and summary statistics reduce the information to a single characteristic of the data. However, the visual impression is personal and subjective, and not usually reproducible from one reader to another.
Conclusions
The agreement chart should be used to complement the summary kappa or B-statistics, not to replace them. The graphs can be very helpful to researchers as an early step to understand relationships in their data when assessing concordance.
Keywords
- Intra- and inter-observer agreement
- Concordance
- Kappa statistic
- B-statistic
Background
When two raters independently classify the same n items into the same k ordinal categories, one wishes to assess their concordance. Such situations are common in clinical practice; for example, when one wishes to compare two diagnostic or classification methods because one is more expensive or cumbersome than the other, or when one wishes to assess how well two clinicians agree in blindly classifying patients into disease likelihood categories.
Example 1
Table 1. Cross-tabulations of multiple sclerosis diagnosis by two independent neurologists, comparing concordance with different sets of patients [Westlund & Kurland (1953)]

(A) Winnipeg patients (rows: New Orleans neurologist; columns: Winnipeg neurologist)

| | Certain | Probable | Possible | No | Total |
|---|---|---|---|---|---|
| Certain | 38 | 5 | 0 | 1 | 44 |
| Probable | 33 | 11 | 3 | 0 | 47 |
| Possible | 10 | 14 | 5 | 6 | 35 |
| No | 3 | 7 | 3 | 10 | 23 |
| Total | 84 | 37 | 11 | 17 | 149 |

(B) New Orleans patients (rows: New Orleans neurologist; columns: Winnipeg neurologist)

| | Certain | Probable | Possible | No | Total |
|---|---|---|---|---|---|
| Certain | 5 | 3 | 0 | 0 | 8 |
| Probable | 3 | 11 | 4 | 0 | 18 |
| Possible | 2 | 13 | 3 | 4 | 22 |
| No | 1 | 2 | 4 | 14 | 21 |
| Total | 11 | 29 | 11 | 18 | 69 |
One can assess concordance between the neurologists naively by calculating the proportion of observations in the diagonal cells; but more commonly, one uses either Cohen’s [3] kappa statistic or Bangdiwala’s [4] B-statistic, both of which account for chance agreement. The choice between and interpretation of these two statistics were reviewed in Muñoz & Bangdiwala (1997) [5] and Shankar & Bangdiwala (2008) [6], which also discuss the methodology behind both statistics.
One can account for partial agreement by considering the weighted versions of these two statistics, which assign weights to the off-diagonal cell frequencies in their calculations; we used quadratic weights throughout this manuscript. For Table 1A, the Winnipeg patients, the statistics are kappa = 0.208 (weighted kappa = 0.525) and B = 0.272 (weighted B = 0.825), while for Table 1B, the New Orleans patients, the statistics are kappa = 0.297 (weighted kappa = 0.626) and B = 0.285 (weighted B = 0.872). These values would be considered ‘fair’ to ‘moderate’, but they are not meaningfully different between the Winnipeg and New Orleans patients.
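These summary values can be reproduced directly from the cell counts. The sketch below (in Python; the function name is ours, not from the original software) computes Cohen’s kappa, the quadratically weighted kappa, and the unweighted B-statistic for Table 1A. The weighted B-statistic requires the partial-agreement areas of the agreement chart and is omitted here.

```python
# Agreement statistics for a k x k contingency table (rows = rater 1, cols = rater 2).
# Cohen's kappa = (po - pe) / (1 - pe); Bangdiwala's B = sum(n_ii^2) / sum(r_i * c_i),
# where r_i and c_i are the row and column marginal totals.

def agreement_stats(table):
    k = len(table)
    n = sum(sum(row) for row in table)
    r = [sum(row) for row in table]                              # row marginals
    c = [sum(table[i][j] for i in range(k)) for j in range(k)]   # column marginals
    po = sum(table[i][i] for i in range(k)) / n                  # observed agreement
    pe = sum(r[i] * c[i] for i in range(k)) / n ** 2             # chance-expected agreement
    kappa = (po - pe) / (1 - pe)
    B = sum(table[i][i] ** 2 for i in range(k)) / sum(r[i] * c[i] for i in range(k))
    # quadratically weighted kappa, using squared-distance disagreement weights
    d_obs = sum((i - j) ** 2 * table[i][j] for i in range(k) for j in range(k))
    d_exp = sum((i - j) ** 2 * r[i] * c[j] for i in range(k) for j in range(k)) / n
    kappa_w = 1 - d_obs / d_exp
    return kappa, kappa_w, B

# Table 1A: Winnipeg patients (rows: New Orleans neurologist, cols: Winnipeg neurologist)
table_1a = [
    [38,  5, 0,  1],
    [33, 11, 3,  0],
    [10, 14, 5,  6],
    [ 3,  7, 3, 10],
]
kappa, kappa_w, B = agreement_stats(table_1a)
print(f"kappa = {kappa:.3f}, weighted kappa = {kappa_w:.3f}, B = {B:.3f}")
# -> kappa = 0.208, weighted kappa = 0.525, B = 0.272
```

Running the same function on Table 1B reproduces the statistics reported above for the New Orleans patients.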
Example 2
Table 2. Cross-tabulations of cardiovascular disease cause of death by two independent classification methodologies in the Lipid Research Clinics Program Mortality Follow-up Study (LRC-FUS), comparing elderly (≥65 years) and non-elderly deaths [Bangdiwala et al. (1989)]
(A) Elderly (≥65 years) deaths (rows: expert panel; columns: nosologist)

| | CVD | Non-CVD | Total |
|---|---|---|---|
| CVD | 172 | 11 | 183 |
| Non-CVD | 35 | 50 | 85 |
| Total | 207 | 61 | 268 |

(B) Non-elderly (<65 years) deaths (rows: expert panel; columns: nosologist)

| | CVD | Non-CVD | Total |
|---|---|---|---|
| CVD | 122 | 10 | 132 |
| Non-CVD | 5 | 18 | 23 |
| Total | 127 | 28 | 155 |
For Table 2A, the elderly deaths, the summary concordance measures are kappa = 0.57 and B = 0.74, while for Table 2B, the non-elderly deaths, these measures are kappa = 0.65 and B = 0.87. These can be interpreted according to Muñoz & Bangdiwala [5] as ‘substantial’ to ‘almost perfect’ agreement, but they are not meaningfully different between elderly and non-elderly deaths.
Example 3
Table 3. Cross-tabulations of numbers of mammograms according to risk categories for breast cancer, comparing concordance among scales of mammographic density patterns [Garrido-Estepa et al. (2010)]

(A) Wolfe classification scale (rows: first measure; columns: second measure)

| | N1 | P1 | P2 | DY | Total |
|---|---|---|---|---|---|
| N1 | 12 | 9 | 0 | 0 | 21 |
| P1 | 4 | 139 | 13 | 5 | 161 |
| P2 | 0 | 7 | 101 | 14 | 122 |
| DY | 0 | 2 | 13 | 56 | 71 |
| Total | 16 | 157 | 127 | 75 | 375 |

(B) Tabár classification scale (rows: first measure; columns: second measure)

| | II | III | IV | V | Total |
|---|---|---|---|---|---|
| II | 12 | 9 | 0 | 0 | 21 |
| III | 4 | 170 | 16 | 8 | 198 |
| IV | 0 | 4 | 114 | 6 | 124 |
| V | 0 | 8 | 9 | 15 | 32 |
| Total | 16 | 191 | 139 | 29 | 375 |

(C) BI-RADS classification scale (rows: first measure; columns: second measure)

| | AEF | SFD | HD | ED | Total |
|---|---|---|---|---|---|
| AEF | 147 | 13 | 0 | 0 | 160 |
| SFD | 14 | 101 | 10 | 0 | 125 |
| HD | 0 | 14 | 48 | 6 | 68 |
| ED | 0 | 0 | 3 | 19 | 22 |
| Total | 161 | 128 | 61 | 25 | 375 |

(D) Boyd classification scale (rows: first measure; columns: second measure)

| | A | B | C | D | E | F | Total |
|---|---|---|---|---|---|---|---|
| A | 6 | 4 | 0 | 0 | 0 | 0 | 10 |
| B | 4 | 56 | 11 | 0 | 0 | 0 | 71 |
| C | 0 | 16 | 50 | 13 | 0 | 0 | 79 |
| D | 0 | 0 | 14 | 102 | 9 | 0 | 125 |
| E | 0 | 0 | 0 | 14 | 48 | 6 | 68 |
| F | 0 | 0 | 0 | 0 | 3 | 19 | 22 |
| Total | 10 | 76 | 75 | 129 | 60 | 25 | 375 |
The reported values of agreement between the first and second readings were 0.73, 0.72, 0.76 and 0.68 for the kappa statistic for the Wolfe, Tabár, BI-RADS and Boyd scales, respectively, and 0.71, 0.75, 0.74 and 0.58 for the B-statistic for the same scales. These kappa and B values fall between ‘substantial’ and ‘almost perfect’ under Muñoz & Bangdiwala’s interpretations, indicating strong concordance, but with no meaningful differences among the 4 classification scales with respect to concordance between the first and second measures. In Table 3 we note that discrepancies for the BI-RADS and the Boyd classifications occur only between contiguous risk categories, while for Wolfe and Tabár they are sometimes two risk categories apart. The weighted versions of the statistics [kappas of 0.84, 0.71, 0.90 and 0.92 for the Wolfe, Tabár, BI-RADS and Boyd scales, respectively, and B-statistics of 0.96, 0.95, 0.97 and 0.98 for the same scales] are thus much closer to unity than the un-weighted versions.
The above three examples illustrate the wide need for assessing agreement in the clinical field, and the utility of alternative summary statistics. While both statistics can summarize the agreement information numerically, a graph can ‘tell a story’. The agreement chart [4] is a two-dimensional graph for visually assessing the agreement between two observers rating the same n units into the same k discrete ordinal categories.
Methods
Constructing the agreement chart
- (i) Draw an n × n square.
- (ii) Draw k rectangles, with dimensions based on the row and column marginal totals, inside the n × n square; position each rectangle so that its lower left vertex touches the upper right vertex of the previous rectangle, proceeding from the (0,0) position to the (n,n) point of the large square.
- (iii) Draw k shaded squares, with dimensions based on the diagonal cell frequencies, inside the corresponding rectangles, positioned according to the off-diagonal cell frequencies from the same row and column.
- (iv) ‘Partial agreement’ areas can be similarly placed within the rectangles, with decreasing shading for cells further away from the diagonal cells.
The statistical software SAS (version 9.3) has incorporated the agreement plot as a default chart (see PROC FREQ under the AGREE and KAPPA syntax). The agreement chart is also implemented in the open-source R software in the vcd package [9]. The agreement chart provides a visual representation for comparing the concordance in paired categorical data. The visual image is affected if the order of the categories is permuted, and thus its use is recommended exclusively for ordinal variables. In the case of perfect agreement, the k rectangles determined by the marginal totals are all perfect squares, and the shaded squares determined by the diagonal cell entries exactly fill the rectangles, producing a B-statistic value of 1. Lesser agreement is visualized by comparing the area of the blackened squares to the area of the rectangles, while observer bias is visualized by examining the ‘path of rectangles’ and how it deviates from the 45° diagonal line within the larger n × n square.
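The construction steps above reduce to simple arithmetic on cumulative marginal totals. The following sketch (in Python; the function name and the exact square-offset convention are our assumptions, as implementations differ slightly in how they position the shaded squares) computes the rectangle and shaded-square coordinates for Table 2A and confirms that the B-statistic equals the ratio of total shaded area to total rectangle area:

```python
# Geometry of Bangdiwala's agreement chart for a k x k table
# (rows = rater 1 on the y axis, columns = rater 2 on the x axis).
# Rectangle i spans the cumulative marginal totals; the shaded square of
# side n_ii is offset inside it by the same-column and same-row
# off-diagonal counts for the earlier categories (step iii above).

def chart_geometry(table):
    k = len(table)
    r = [sum(row) for row in table]                              # row marginals
    c = [sum(table[i][j] for i in range(k)) for j in range(k)]   # column marginals
    rects, squares = [], []
    cx = cy = 0  # cumulative marginals = lower-left corner of rectangle i
    for i in range(k):
        rects.append((cx, cy, c[i], r[i]))         # (x, y, width, height)
        dx = sum(table[j][i] for j in range(i))    # earlier categories, same column
        dy = sum(table[i][j] for j in range(i))    # earlier categories, same row
        squares.append((cx + dx, cy + dy, table[i][i]))  # (x, y, side)
        cx += c[i]
        cy += r[i]
    return rects, squares

# Table 2A: elderly deaths (rows: expert panel, cols: nosologist)
table_2a = [[172, 11], [35, 50]]
rects, squares = chart_geometry(table_2a)
shaded = sum(s ** 2 for (_, _, s) in squares)
total = sum(w * h for (_, _, w, h) in rects)
print(f"B = shaded area / rectangle area = {shaded / total:.2f}")
# -> B = shaded area / rectangle area = 0.74
```

With perfect agreement the off-diagonal counts vanish, each rectangle becomes a square exactly filled by its shaded square, and the ratio (and hence B) is 1, as described above.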
Results
Examples with charts
Example 1 - Westlund & Kurland (1953) multiple sclerosis
Example 2 – Bangdiwala et al. (1989) cardiovascular disease
Example 3 – Garrido-Estepa et al. (2010) breast cancer risk categories
Discussion
When assessing the concordance between two methods of measurement of ordinal categorical data, summary measures are often used. We believe that a picture conveys more information than a single summary measure; this manuscript therefore introduces the ‘agreement chart’ and describes how it is constructed and interpreted. The objective of this manuscript is to illustrate its use in visually assessing concordance through several clinical examples, as a way to foster its wider use in clinical applications.
In the examples presented, the information obtained from the charts led to interpretations of the data that were not obtainable from summary measures of agreement. In Example 1 [Westlund & Kurland (1953) [2]], comparing multiple sclerosis classification by two neurologists, examination of the agreement chart uncovered issues and patterns of disagreement that affected the reliability and validity of the assessments between raters. In Example 2 [Bangdiwala et al. (1989) [7]], the use of the agreement chart uncovered the importance of using a panel of expert cardiologists to classify cause of death as CVD or non-CVD, as opposed to relying on nosological assessment from death certificates, in elderly versus non-elderly populations. Finally, in Example 3 [Garrido-Estepa et al. (2010) [8]], the agreement charts uncovered differences among the scales that are not reflected in the numerical summaries - identifying ‘drift’ bias between first and second measures, and differences in preferences for the lowest ‘low-risk’ category for some scales.
The main advantage of the agreement chart is that it is a visual representation of agreement, while existing methods to study agreement are based either on summary measures or on modeling approaches. In addition, it is able to provide insight into how disagreements are affecting the comparability of the two raters/observers. There are no other graphs for visually assessing agreement, and it is easily implementable in standard statistical software. The only major disadvantage is that it is limited to ordinal scale variables, since if the categories were on a nominal scale, permuting their order would affect the visual interpretation of agreement.
The utility of the agreement chart lies not only in assessing agreement, but also in allowing insight into disagreements. With multiple categories, appropriate shading of rectangles within the agreement chart helps visualize patterns of disagreement. Differences in the marginal distributions of ratings by the two observers are visualized by focusing on the ‘path of rectangles’. Various patterns in the path of rectangles may indicate differences in location or variability between the two raters/observers’ preferences for categories.
Conclusions
Ideally, data should be presented graphically, since “graphics can be more precise and revealing than conventional statistical computations” [10]. Graphs provide a visual impression that no summary statistic can convey, and summary statistics reduce the information to a single characteristic of the data. However, visual impression can be personal and subjective, and not usually reproducible from one reader to another, and graphs occupy more space in a manuscript than a simple summary statistic. Given the choice of measurement scale for the analysis, graphs should be used to complement the summary statistics, not to replace them. However, graphs can be very helpful to researchers as an early step to understand relationships in their data. In this manuscript we used clinically relevant examples to illustrate the additional information provided by the agreement chart when assessing concordance.
Declarations
Acknowledgment
The University of North Carolina at Chapel Hill’s Libraries and the Division of Biostatistics, Albert Einstein College of Medicine, Bronx, NY, provided support for open access publication.
References
- Landis JR, Koch GG: The measurement of observer agreement for categorical data. Biometrics. 1977, 33 (1): 159-174. doi:10.2307/2529310.
- Westlund KB, Kurland LT: Studies on multiple sclerosis in Winnipeg, Manitoba, and New Orleans, Louisiana. II. A controlled investigation of factors in the life history of the Winnipeg patients. Am J Hyg. 1953, 57 (3): 397-407.
- Cohen J: A coefficient of agreement for nominal scales. Educ Psychol Meas. 1960, 20: 37-46. doi:10.1177/001316446002000104.
- Bangdiwala SI: A graphical test for observer agreement. International Statistical Institute Centenary Session 1985. Amsterdam: International Statistical Institute, 1985, 307-308.
- Munoz SR, Bangdiwala SI: Interpretation of Kappa and B statistics measures of agreement. J Appl Stat. 1997, 24 (1): 105-112. doi:10.1080/02664769723918.
- Shankar V, Bangdiwala SI: Behavior of agreement measures in the presence of zero cells and biased marginal distributions. J Appl Stat. 2008, 35 (4): 445-464. doi:10.1080/02664760701835052.
- Bangdiwala SI, Cohn R, Hazard C, Davis CE, Prineas RJ: Comparisons of cause of death verification methods and costs in the Lipid Research Clinics Program Mortality Follow-up Study. Control Clin Trials. 1989, 10 (2): 167-187. doi:10.1016/0197-2456(89)90029-9.
- Michel CE, Solomon AW, Magbanua JP, Massae PA, Huang L, Mosha J, West SK, Nadala EC, Bailey R, Wisniewski C, et al: Field evaluation of a rapid point-of-care assay for targeting antibiotic treatment for trachoma control: a comparative study. Lancet. 2006, 367 (9522): 1585-1590. doi:10.1016/S0140-6736(06)68695-9.
- Meyer D, Zeileis A, Hornik K: vcd: Visualizing Categorical Data. R package version 1.2-13. 2012.
- Tufte ER: The visual display of quantitative information. Cheshire, CT: Graphics Press, 1983.
Pre-publication history
- The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1471-2288/13/97/prepub
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.