Observer agreement paradoxes in 2x2 tables: comparison of agreement measures
- Viswanathan Shankar
- Shrikant I Bangdiwala
https://doi.org/10.1186/1471-2288-14-100
© Shankar and Bangdiwala; licensee BioMed Central Ltd. 2014
- Received: 23 June 2014
- Accepted: 12 August 2014
- Published: 28 August 2014
Abstract
Background
Various measures of observer agreement have been proposed for 2x2 tables. We examine the behavior of these alternative measures under a range of marginal distributions and compare them with Cohen’s kappa.
Methods
The alternative measures of observer agreement and the corresponding agreement charts were calculated under various scenarios of marginal distributions (symmetrical or not, balanced or not) and of degree of diagonal agreement, and their behaviors were compared. In particular, two paradoxes previously identified for kappa were examined: (1) low kappa values despite high observed agreement under highly symmetrically imbalanced marginals, and (2) higher kappa values for asymmetrically imbalanced than for symmetrically imbalanced marginal distributions.
Results
Kappa and alpha behave similarly and are affected by the marginal distributions more than the B-statistic, the AC1-index and delta. Delta and kappa provide similar values when the marginal totals are asymmetrically imbalanced, or symmetrical but not excessively imbalanced. The AC1-index and the B-statistic provide similar results when the marginal distributions are symmetrically imbalanced and the observed agreement is greater than 50%. Also, the B-statistic and the AC1-index provide values closer to the observed agreement when the subjects are classified mostly in one of the diagonal cells. Finally, the B-statistic is consistent and more stable than kappa under both types of paradoxes studied.
Conclusions
The B-statistic behaved better than the other measures under all scenarios studied, as well as with varying prevalences, sensitivities and specificities. We therefore recommend using the B-statistic, along with its corresponding agreement chart, as an alternative to kappa when assessing agreement in 2x2 tables.
Keywords
- Rater agreement
- 2x2 table
- Cohen’s kappa
- Aickin’s alpha
- B-statistic
- Delta
- AC1-index
Background
Cohen’s kappa [4] corrects the observed agreement for the agreement expected by chance:

κ = (PO − Pe) / (1 − Pe),

where PO is the proportion of overall observed agreement and Pe is the proportion of overall chance-expected agreement. The kappa statistic thus ranges from −Pe/(1 − Pe) to 1.
Table 1. Generic 2x2 table format for assessing agreement between two raters classifying N units into the same 2 categories

|  |  | Rater B |  |  |
|---|---|---|---|---|
|  |  | Yes | No | Total |
| Rater A | Yes | x11 | x12 | g1 |
|  | No | x21 | x22 | g2 |
|  | Total | f1 | f2 | N |
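To make the notation concrete, the following minimal Python sketch (ours, not from the original paper; the helper name is hypothetical) computes PO, Pe and kappa from the four cell counts of Table 1:

```python
def kappa_2x2(x11, x12, x21, x22):
    """Cohen's kappa for a 2x2 table, in the notation of Table 1."""
    n = x11 + x12 + x21 + x22
    g1, g2 = x11 + x12, x21 + x22   # row totals (Rater A)
    f1, f2 = x11 + x21, x12 + x22   # column totals (Rater B)
    po = (x11 + x22) / n                  # overall observed agreement
    pe = (f1 * g1 + f2 * g2) / n ** 2     # overall chance-expected agreement
    return (po - pe) / (1 - pe)

# Scenario 1 of Table 2: PO = .85, Pe ≈ .50, kappa ≈ .70
print(round(kappa_2x2(40, 9, 6, 45), 2))
```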
Cicchetti and Feinstein [8] suggested resolving the paradoxes by using two separate indexes (ppos and pneg) to quantify agreement in the positive and negative decisions; these are analogous to sensitivity and specificity from a diagnostic testing perspective.
Byrt et al. [6] characterized a 2x2 table by its bias index, BI = (x12 − x21)/N, which reflects disagreement between the raters on the relative frequency of the two categories, and its prevalence index, PI = (x11 − x22)/N, which reflects how unevenly the agreed-upon subjects fall into the two categories. Note that BI = 0 if and only if the marginal distributions are equal. PI ranges from −1 to +1 and is equal to zero when both categories are equally probable. Similarly, Lantz and Nebenzahl [12] proposed that one should report supporting indicators along with kappa: PO, a symmetry indicator, and ppos. Unfortunately, reporting of multiple indices is often not done.
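These supporting indices are simple functions of the cell counts; a minimal Python sketch (again ours, with a hypothetical helper name) is:

```python
def supporting_indices(x11, x12, x21, x22):
    """ppos/pneg (Cicchetti & Feinstein) and BI/PI (Byrt et al.) for a 2x2 table."""
    n = x11 + x12 + x21 + x22
    ppos = 2 * x11 / (2 * x11 + x12 + x21)  # agreement index for positive decisions
    pneg = 2 * x22 / (2 * x22 + x12 + x21)  # agreement index for negative decisions
    bi = (x12 - x21) / n   # bias index: 0 iff marginal distributions are equal
    pi = (x11 - x22) / n   # prevalence index: 0 when both categories equally probable
    return ppos, pneg, bi, pi

# Scenario 2 of Table 2 (symmetrically imbalanced marginals):
print(supporting_indices(80, 10, 5, 5))  # approx (0.914, 0.40, 0.05, 0.75)
```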
This manuscript considers the various alternative single indexes for observer agreement in 2x2 tables, and examines their behavior under different scenarios of marginal distributions, balanced or not, symmetrical or not. It attempts not only to shed more light on how these measures address the paradoxes identified by Feinstein and Cicchetti [7], but also to examine their behavior in the broader range of situations encountered in 2x2 tables.
Methods
Different agreement indices
In addition to Cohen’s kappa, we consider the following statistics: Bangdiwala’s B-statistic [13, 14], the Prevalence-Adjusted Bias-Adjusted Kappa (PABAK) [6], Aickin’s alpha [15], Andrés and Marzo’s delta [16, 17] and Gwet’s AC1-index [18].
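For reference, minimal Python sketches of the closed-form statistics follow their standard published formulas (alpha and delta require iterative maximum-likelihood computation and are not sketched here); this is our illustration, not code from the original paper:

```python
def pabak(x11, x12, x21, x22):
    """PABAK: prevalence- and bias-adjusted kappa, a linear function of PO."""
    po = (x11 + x22) / (x11 + x12 + x21 + x22)
    return 2 * po - 1

def b_statistic(x11, x12, x21, x22):
    """Bangdiwala's B: squared diagonal cells relative to the areas of the
    marginal rectangles in the agreement chart."""
    g1, g2 = x11 + x12, x21 + x22   # row totals
    f1, f2 = x11 + x21, x12 + x22   # column totals
    return (x11 ** 2 + x22 ** 2) / (f1 * g1 + f2 * g2)

def ac1(x11, x12, x21, x22):
    """Gwet's AC1 for two categories."""
    n = x11 + x12 + x21 + x22
    po = (x11 + x22) / n
    pi1 = ((x11 + x12) + (x11 + x21)) / (2 * n)  # mean marginal proportion, category 1
    pe = 2 * pi1 * (1 - pi1)
    return (po - pe) / (1 - pe)

# Scenario 3 of Table 2 (Paradox 1): kappa is negative, but B and AC1 stay high
print(round(b_statistic(90, 5, 5, 0), 3), round(ac1(90, 5, 5, 0), 2))  # 0.895 0.89
```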
Agreement chart for hypothetical data from Table 1 assessing agreement between two raters classifying N units into the same two categories.
Table 2. Scenarios studied in this manuscript: cell frequencies, marginals, observed proportion of agreement, bias index and prevalence index

| Scenario | Type of table | x11 | x12 | x21 | x22 | g1 | g2 | f1 | f2 | PO | BI | PI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Paradox 1 |  |  |  |  |  |  |  |  |  |  |  |  |
| 1 | Symmetrical balance | 40 | 9 | 6 | 45 | 49 | 51 | 46 | 54 | .85 | .03 | -.05 |
| 2 | Symmetrical imbalance | 80 | 10 | 5 | 5 | 90 | 10 | 85 | 15 | .85 | .05 | .75 |
| 3 | Perfect symmetrical imbalance | 90 | 5 | 5 | 0 | 95 | 5 | 95 | 5 | .90 | 0 | .90 |
| Paradox 2: PO set at 0.60 |  |  |  |  |  |  |  |  |  |  |  |  |
| 4 | Symmetrical imbalance | 45 | 15 | 25 | 15 | 60 | 40 | 70 | 30 | .60 | -.10 | .30 |
| 5 | Asymmetrical imbalance | 25 | 35 | 5 | 35 | 60 | 40 | 30 | 70 | .60 | .30 | .10 |
| 6 | Perfect symmetrical imbalance | 40 | 20 | 20 | 20 | 60 | 40 | 60 | 40 | .60 | 0 | .20 |
| 7 | Asymmetrical imbalance | 40 | 35 | 5 | 20 | 75 | 25 | 45 | 55 | .60 | .30 | .20 |
| 8 | Asymmetrical imbalance | 30 | 30 | 10 | 30 | 60 | 40 | 40 | 60 | .60 | .20 | 0 |
| PO set at 0.90 |  |  |  |  |  |  |  |  |  |  |  |  |
| 9 | Perfect symmetrical imbalance | 85 | 5 | 5 | 5 | 90 | 10 | 90 | 10 | .90 | 0 | .80 |
| 10 | Symmetrical imbalance | 70 | 10 | 0 | 20 | 80 | 20 | 70 | 30 | .90 | .10 | .50 |
| PO low (≤50%) |  |  |  |  |  |  |  |  |  |  |  |  |
| 11 | Perfect symmetrical balance | 25 | 25 | 25 | 25 | 50 | 50 | 50 | 50 | .50 | 0 | 0 |
| 12 | Asymmetrical imbalance | 30 | 30 | 20 | 20 | 60 | 40 | 50 | 50 | .50 | .10 | .10 |
| 13 | Perfect symmetrical balance | 20 | 30 | 30 | 20 | 50 | 50 | 50 | 50 | .40 | 0 | 0 |
| 14 | Perfect symmetrical balance | 5 | 45 | 45 | 5 | 50 | 50 | 50 | 50 | .10 | 0 | 0 |

Marginal totals follow the notation of Table 1 (g1, g2: row totals for Rater A; f1, f2: column totals for Rater B).
Results
Paradox 1
Agreement charts (a-c) for scenarios 1-3 of Table 2, addressing Paradox 1.
Table 3. Estimates of observed proportion of agreement (PO), chance-expected proportion (Pe) and agreement measures, by scenario

| Scenario | Type of table | PO | Pe | κ | B | PABAK | AC1 | α | Δa | Δn |
|---|---|---|---|---|---|---|---|---|---|---|
| Paradox 1 |  |  |  |  |  |  |  |  |  |  |
| 1 | Symmetrical balance | .85 | .50 | .70 | .72 | .70 | .70 | .70 | .68 | .68 |
| 2 | Symmetrical imbalance | .85 | .78 | .32 | .82 | .70 | .81 | .55 | .69 | .68 |
| 3 | Perfect symmetrical imbalance | .90 | .905 | -.05 | .895 | .80 | .89 | - | .78 | .77 |
| Paradox 2: PO set at 0.60 |  |  |  |  |  |  |  |  |  |  |
| 4 | Symmetrical imbalance | .60 | .54 | .13 | .41 | .20 | .27 | .15 | .21 | .20 |
| 5 | Asymmetrical imbalance | .60 | .46 | .26 | .40 | .20 | .21 | .33 | .32 | .31 |
| 6 | Perfect symmetrical imbalance | .60 | .52 | .17 | .38 | .20 | .23 | .18 | .19 | .19 |
| 7 | Asymmetrical imbalance | .60 | .475 | .24 | .42 | .20 | .23 | .32 | .32 | .31 |
| 8 | Asymmetrical imbalance | .60 | .48 | .23 | .38 | .20 | .20 | .25 | .24 | .24 |
| PO set at 0.90 |  |  |  |  |  |  |  |  |  |  |
| 9 | Perfect symmetrical imbalance | .90 | .82 | .44 | .88 | .80 | .88 | .68 | .78 | .77 |
| 10 | Symmetrical imbalance | .90 | .62 | .74 | .85 | .80 | .84 | - | .83 | .82 |
| PO low (≤50%) |  |  |  |  |  |  |  |  |  |  |
| 11 | Perfect symmetrical balance | .50 | .50 | 0 | .25 | 0 | 0 | 0 | 0 | 0 |
| 12 | Asymmetrical imbalance | .50 | .50 | 0 | .26 | 0 | -.11 | 0 | .01 | .01 |
| 13 | Perfect symmetrical balance | .40 | .50 | -.20 | .16 | -.20 | -.20 | - | -.19 | -.19 |
| 14 | Perfect symmetrical balance | .10 | .50 | -.80 | .01 | -.80 | -.80 | -.18 | -.77 | -.77 |

Δa and Δn denote the two estimates of delta discussed in the text (asymptotic with an increment of one, and non-asymptotic); "-" indicates the estimate was not available for that scenario.
Measures of agreement as a function of prevalence, for (a) both sensitivity and specificity set at 95%, (b) sensitivity of 70% and specificity of 95%, (c) sensitivity of 95% and specificity of 70%, and (d) both sensitivity and specificity set at 60%.
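A sweep of this kind can be reproduced under the assumption (ours, not stated in the extract) that one rater acts as a reference standard and the other classifies with fixed sensitivity and specificity; the hypothetical helper below builds the expected 2x2 table at each prevalence and reuses kappa_2x2 and b_statistic from the earlier sketches:

```python
def expected_table(prev, se, sp, n=100):
    """Expected 2x2 cell counts for a rater with sensitivity se and specificity sp,
    compared against a reference standard at the given prevalence."""
    x11 = n * prev * se              # true positives
    x12 = n * (1 - prev) * (1 - sp)  # false positives
    x21 = n * prev * (1 - se)        # false negatives
    x22 = n * (1 - prev) * sp        # true negatives
    return x11, x12, x21, x22

# Panel (a): sensitivity = specificity = 95%; kappa falls at extreme prevalence
for prev in (0.1, 0.3, 0.5, 0.7, 0.9):
    cells = expected_table(prev, 0.95, 0.95)
    print(prev, round(kappa_2x2(*cells), 2), round(b_statistic(*cells), 2))
```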
Paradox 2
Agreement charts (a-e) for scenarios 4-8 of Table 2, addressing Paradox 2.
Agreement charts (a-b) for scenarios 9-10 of Table 2, addressing Paradox 2 with high PO.
Kappa, alpha and delta show higher agreement for asymmetrical imbalance (scenarios 5 and 7) than for symmetrically imbalanced marginal totals (scenarios 4 and 6), contrary to what is desired. The B-statistic behaves slightly better: it gives lower values for asymmetry when comparing scenario 4 to scenario 5, and although it gives higher values for symmetry than for asymmetry in scenarios 6 versus 7, the discrepancy is smaller than for the other statistics. The AC1-index shows a similar trend. Comparing the degrees of symmetry (scenarios 9-10), we expect perfect symmetrical imbalance (scenario 9) to show higher agreement than imperfect symmetrical imbalance (scenario 10). PABAK does not change with changes in prevalence or bias, since it is a simple function of PO (scenarios 4-10). We note that kappa and delta give higher values of agreement for imperfect than for perfect symmetry, while the B-statistic and AC1-index behave as one would prefer (scenario 9 vs. 10). The B-statistic and AC1-index perform better than the other statistics when PO is larger (scenarios 9-10 vs. scenarios 4-6). When the bias index is greater than or equal to the prevalence index (scenarios 1, 5, 7, 8, 11, 12, 13 and 14), the AC1-index is almost the same as PABAK. The slightly poorer performance of the B-statistic for lower PO values is seen when the bias index is greater than the prevalence index (scenarios 4 vs. 5 and 6 vs. 8). In scenarios 4-8, with PO = 0.60, most indices perform poorly, with values substantially lower than PO; however, the B-statistic is closest to PO. Thus, the B-statistic resolves Paradox 2 when PO is large and comes closer than the other statistics when PO is smaller.
Agreement charts (a-d) for scenarios 11-14 of Table 2, addressing PO ≤ 0.5.
Discussion
While all statistics examined are affected by lack of symmetry and by imbalances in the marginal totals, the B-statistic comes closest to resolving the paradoxes identified by Feinstein and Cicchetti [7] and Byrt et al. [6]. Alpha behaves similarly to kappa and is thus greatly affected by imbalances and lack of symmetry in the marginal totals. The B-statistic and AC1-index were less affected by such imbalances and asymmetry, and were also less sensitive to extreme values of the prevalence. Delta behaves in a manner intermediate between the B-statistic and kappa. Delta uses an arbitrary additional category for its calculation in the 2x2 setting, which makes it less realistic; however, its asymptotic estimate with an increment of one is close to the non-asymptotic estimate. The B-statistic came closer to resolving both paradoxes than any of the other indices, and thus we recommend the B-statistic when assessing agreement in 2x2 tables. However, as Nelson and Pepe [10] suggest, visual representations ‘provide more meaningful descriptions than numeric summaries’ (p. 493), and thus we recommend additionally providing the corresponding agreement chart to illustrate the agreement as well as the constraints imposed by the symmetry and balance of the marginal totals and cell frequencies. The B-statistic is easy to calculate and, together with the agreement chart, supports interpretation of both the agreement and disagreement patterns between the raters.
Conclusions
The B-statistic behaved better than the other measures under all scenarios of marginal distributions studied, balanced or not, symmetrical or not, as well as under varying prevalences, sensitivities and specificities. We recommend using the B-statistic, along with its corresponding agreement chart, as an alternative to kappa when assessing agreement in 2x2 tables.
Declarations
Acknowledgements
The Division of Biostatistics, Albert Einstein College of Medicine, Bronx, NY, provided support for open access publication.
Ethical committee approval for research involving human or animal subjects, material and data
Not applicable.
References
- Banerjee M, Capozzoli M, McSweeney L, Sinha D: Beyond kappa: a review of interrater agreement measures. Can J Stat. 1999, 27(1): 3-23. doi:10.2307/3315487
- Kraemer HC, Periyakoil VS, Noda A: Kappa coefficients in medical research. Stat Med. 2002, 21(14): 2109-2129. doi:10.1002/sim.1180
- Landis JR, King TS, Choi JW, Chinchilli VM, Koch GG: Measures of agreement and concordance with clinical research applications. Stat Biopharm Res. 2011, 3(2). doi:10.1198/sbr.2011.10019
- Cohen J: A coefficient of agreement for nominal scales. Educ Psychol Meas. 1960, 20: 37-46. doi:10.1177/001316446002000104
- Brennan RL, Prediger DJ: Coefficient kappa: some uses, misuses, and alternatives. Educ Psychol Meas. 1981, 41(3): 687-699. doi:10.1177/001316448104100307
- Byrt T, Bishop J, Carlin JB: Bias, prevalence and kappa. J Clin Epidemiol. 1993, 46(5): 423-429. doi:10.1016/0895-4356(93)90018-V
- Feinstein AR, Cicchetti DV: High agreement but low kappa: I. The problems of two paradoxes. J Clin Epidemiol. 1990, 43(6): 543-549. doi:10.1016/0895-4356(90)90158-L
- Cicchetti DV, Feinstein AR: High agreement but low kappa: II. Resolving the paradoxes. J Clin Epidemiol. 1990, 43(6): 551-558. doi:10.1016/0895-4356(90)90159-M
- Kraemer HC: Ramifications of a population model for κ as a coefficient of reliability. Psychometrika. 1979, 44(4): 461-472. doi:10.1007/BF02296208
- Nelson JC, Pepe MS: Statistical description of interrater variability in ordinal ratings. Stat Methods Med Res. 2000, 9(5): 475-496. doi:10.1191/096228000701555262
- Thompson WD, Walter SD: A reappraisal of the kappa coefficient. J Clin Epidemiol. 1988, 41(10): 949-958. doi:10.1016/0895-4356(88)90031-5
- Lantz CA, Nebenzahl E: Behavior and interpretation of the kappa statistic: resolution of the two paradoxes. J Clin Epidemiol. 1996, 49(4): 431-434. doi:10.1016/0895-4356(95)00571-4
- Bangdiwala SI: The Agreement Chart. 1988, Chapel Hill: The University of North Carolina
- Bangdiwala SI, Shankar V: The agreement chart. BMC Med Res Methodol. 2013, 13: 97. doi:10.1186/1471-2288-13-97
- Aickin M: Maximum likelihood estimation of agreement in the constant predictive probability model, and its relation to Cohen’s kappa. Biometrics. 1990, 46: 293-302. doi:10.2307/2531434
- Andrés AM, Femia-Marzo P: Chance-corrected measures of reliability and validity in 2x2 tables. Commun Stat Theory Methods. 2008, 37(5): 760-772. doi:10.1080/03610920701669884
- Andrés AM, Marzo PF: Delta: a new measure of agreement between two raters. Br J Math Stat Psychol. 2004, 57(1): 1-19. doi:10.1348/000711004849268
- Gwet KL: Computing inter-rater reliability and its variance in the presence of high agreement. Br J Math Stat Psychol. 2008, 61(1): 29-48. doi:10.1348/000711006X126600
- Bangdiwala SI: A Graphical Test for Observer Agreement. Proceedings of the 45th International Statistical Institute Meeting, Amsterdam. 1985, 307-308
- Meyer D, Zeileis A, Hornik K: The vcd package. 2007. Retrieved October 3, 2007
- Friendly M: Visualizing Categorical Data. 2000, Cary, NC: SAS Institute
- Guggenmoos-Holzmann I: How reliable are chance-corrected measures of agreement?. Stat Med. 1993, 12(23): 2191-2205. doi:10.1002/sim.4780122305
- Guggenmoos-Holzmann I: The meaning of kappa: probabilistic concepts of reliability and validity revisited. J Clin Epidemiol. 1996, 49(7): 775-782. doi:10.1016/0895-4356(96)00011-X
Pre-publication history
The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1471-2288/14/100/prepub
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.