Homogeneity score test of AC_{1} statistics and estimation of common AC_{1} in multiple or stratified interrater agreement studies
BMC Medical Research Methodology volume 20, Article number: 20 (2020)
Abstract
Background
Cohen’s κ coefficient is often used as an index to measure the agreement of interrater determinations. However, κ varies greatly depending on the marginal distribution of the target population and overestimates the probability of agreement occurring by chance. To overcome these limitations, an alternative and more stable agreement coefficient was proposed, referred to as Gwet’s AC_{1}. When it is desired to combine results from multiple agreement studies, such as in a meta-analysis, or to perform stratified analysis with subject covariates that affect agreement, it is of interest to compare several agreement coefficients and present a common agreement index. A homogeneity test of κ was developed; however, there are no reports on homogeneity tests for AC_{1} or on an estimator of common AC_{1}. In this article, a homogeneity score test for AC_{1} is therefore derived for the case of two raters with binary outcomes from K independent strata, and its performance is investigated. An estimation of the common AC_{1} between strata and its confidence intervals is also discussed.
Methods
Two homogeneity tests are provided: a score test and a goodness-of-fit test. In this study, the confidence intervals are derived by asymptotic, Fisher’s Z transformation and profile variance methods. Monte Carlo simulation studies were conducted to examine the validity of the proposed methods. An example using clinical data is also provided.
Results
Type I error rates of the proposed score test were close to the nominal level when conducting simulations with small and moderate sample sizes. The confidence intervals based on Fisher’s Z transformation and the profile variance method provided coverage levels close to nominal over a wide range of parameter combinations.
Conclusions
The method proposed in this study is considered to be useful for summarizing evaluations of consistency performed in multiple or stratified interrater agreement studies, for meta-analysis of reports from multiple groups and for stratified analysis.
Background
To evaluate the reliability when two raters classify objects as either positive (+) or negative (−), Cohen’s κ [1] and the intraclass version of κ, which is identical to Scott’s π [2], have often been used. Let p_{a} be the agreement probability, and p_{1} and p_{2} the probabilities classified as (+) by rater 1 and 2 respectively. Then Cohen’s κ (κ_{Cohen}) and Scott’s π (κ_{Scott}) are defined as follows:
\( {\kappa}_{Cohen}=\frac{p_a-{p}_{e(c)}}{1-{p}_{e(c)}},\kern1em {\kappa}_{Scott}=\frac{p_a-{p}_{e(s)}}{1-{p}_{e(s)}},\qquad (1) \)
where \( {p}_{e(c)}={p}_1{p}_2+\left(1-{p}_1\right)\left(1-{p}_2\right),\kern0.5em {p}_{e(s)}={p}_{+}^2+{\left(1-{p}_{+}\right)}^2 \) and p_{+} = (p_{1} + p_{2})/2. The p_{e(c)} and p_{e(s)} are the probabilities of agreement expected by chance for Cohen’s κ and Scott’s π respectively. The p_{e(c)} assumes that the probabilities of positive classification differ between two raters, while the p_{e(s)} assumes that these two probabilities are the same. Landis and Koch provided benchmarks of the strength of consistency as follows: values ≤0 as poor, 0.00 to 0.20 as slight, 0.21 to 0.40 as fair, 0.41 to 0.60 as moderate, 0.61 to 0.80 as substantial and 0.81 to 1.00 as almost perfect agreement [3]. Although the authors acknowledge the arbitrary nature of their benchmarks, they recommended their benchmark scale as a useful guideline for practitioners.
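As a minimal sketch of the two definitions above, the following Python snippet computes Cohen’s κ and Scott’s π from the four cell counts of a 2 × 2 table and attaches the Landis and Koch label. The function names and the example counts are hypothetical and chosen only for illustration; they do not come from the study.

```python
def cohen_kappa(n11, n10, n01, n00):
    """Cohen's kappa; n11 = both raters +, n10 = rater 1 + only, etc."""
    n = n11 + n10 + n01 + n00
    p_a = (n11 + n00) / n                 # observed agreement
    p1 = (n11 + n10) / n                  # rater 1 positive rate
    p2 = (n11 + n01) / n                  # rater 2 positive rate
    p_e = p1 * p2 + (1 - p1) * (1 - p2)   # chance agreement p_e(c)
    return (p_a - p_e) / (1 - p_e)


def scott_pi(n11, n10, n01, n00):
    """Scott's pi (intraclass kappa), using the pooled positive rate p_+."""
    n = n11 + n10 + n01 + n00
    p_a = (n11 + n00) / n
    p_plus = (2 * n11 + n10 + n01) / (2 * n)
    p_e = p_plus ** 2 + (1 - p_plus) ** 2  # chance agreement p_e(s)
    return (p_a - p_e) / (1 - p_e)


def landis_koch(value):
    """Benchmark label as quoted in the text (values <= 0 are 'poor')."""
    if value <= 0:
        return "poor"
    for upper, label in ((0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")):
        if value <= upper:
            return label
    return "almost perfect"


# Hypothetical table: 40 (+,+), 10 (+,-), 5 (-,+), 45 (-,-)
kc = cohen_kappa(40, 10, 5, 45)
ks = scott_pi(40, 10, 5, 45)
print(round(kc, 3), round(ks, 3), landis_koch(kc))
```

The two coefficients differ only through the chance-agreement term, so they coincide whenever the two raters have the same marginal positive rate.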
Many extensions have been made to Cohen’s κ including those for agreement in the cases of ordinal data [4], multiple raters [5,6,7,8,9], comparisons of correlated κ’s [10,11,12,13] and stratified data [14, 15]. However, as Feinstein and Cicchetti showed, Cohen’s κ depends strongly on the marginal distributions and therefore behaves paradoxically [16]. This behavior can be explained by the bias effect and the prevalence effect, on which various discussions have been undertaken [16,17,18]. A number of alternative measures of agreement have also been proposed, such as Holley and Guilford’s G [19], Aickin’s α [20], Andres and Marzo’s delta [21], Marasini’s s* [22, 23] and Gwet’s AC_{1} [24] and AC_{2} [25].
Gwet showed that AC_{1} has better statistical properties (bias and variance) than Cohen’s κ, Scott’s π and the G-index under a limited set of simulations for two raters with binary outcomes [24]. Shankar and Bangdiwala compared Cohen’s κ, Scott’s π, Prevalence Adjusted Bias Adjusted Kappa (PABAK) [26], AC_{1} and the B-statistic [27], which is not a kappa-type chance-corrected measure, in the case of two raters and binary outcomes and showed that AC_{1} has better properties than other kappa-type measures [28]. In addition, AC_{1} has been utilized in the field of medical research over the past decade [29,30,31,32,33,34,35]. Therefore, in this study we have limited our discussion to AC_{1} in the case of two raters with binary outcomes.
First, a brief review of the concept of Gwet’s AC_{1} is provided. Consider the situation in which two raters independently classify a randomly selected subject as positive (+) or negative (−). Gwet defined two events: G = {the two raters agree} and R = {at least one rater performs random rating}. The probability of agreement expected by chance is then p_{e} = P(G ⋂ R) = P(G | R)P(R). A random rating would lead to the classification of an individual into each category with the same probability \( \frac{1}{2} \), and it follows that \( P\left(G|R\right)=2\times \left(\frac{1}{2}\right)\times \left(\frac{1}{2}\right)=\frac{1}{2} \). As for P(R), this probability cannot be obtained from data. Therefore, Gwet proposed approximating it with a normalized measure of randomness Ψ, defined as follows:
\( \Psi =\frac{\pi_{+}\left(1-{\pi}_{+}\right)}{\frac{1}{2}\left(1-\frac{1}{2}\right)}=4{\pi}_{+}\left(1-{\pi}_{+}\right),\qquad (2) \)
where π_{+} is the probability that a randomly chosen rater classifies a randomly chosen subject into the + category. Thus, the approximated probability of chance agreement is represented by
\( {p}_e^{\ast }=P\left(G|R\right)\Psi =2{\pi}_{+}\left(1-{\pi}_{+}\right).\qquad (3) \)
AC_{1} is thus defined as follows:
\( \gamma =\frac{p_a-{p}_e^{\ast }}{1-{p}_e^{\ast }},\qquad (4) \)
where p_{a} is the probability of agreement. Although p_{e} is approximated by \( {p}_e^{\ast } \), Gwet showed that the bias of γ, the difference between γ and the true interrater reliability, is equal to or less than that of Cohen’s κ, Scott’s π and the G-index under some assumptions in the case of two raters with binary outcomes. Gwet also provided an estimator \( {\hat{\gamma}}^{\ast } \) of γ and its variance for multiple raters and multiple categories based on the randomization approach, which requires the selection of subjects to be random in such a way that all possible subject samples have the exact same chance of being selected. However, it is advantageous to employ a model-based approach when, for example, the evaluation of the effect of subject covariates on agreement is of interest. Therefore, in the case of two raters with binary outcomes, Ohyama [36] assumed the underlying probability that a subject is rated as (+) and its marginal homogeneity of the two raters, and then constructed the likelihood. The maximum likelihood estimator of γ, which is shown to be identical to the estimator given by Gwet, was derived. The likelihood-based confidence intervals for AC_{1}, inclusion of subject covariates, hypothesis testing and sample size determination were also discussed [36].
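The construction of AC_{1} for two raters and a binary outcome can be sketched in a few lines of Python. The snippet estimates π_{+} as the pooled positive rate of the two raters, forms the chance-agreement term 2π_{+}(1 − π_{+}) and returns the chance-corrected agreement. The function name and the example table are hypothetical.

```python
def gwet_ac1(n11, n10, n01, n00):
    """Gwet's AC1 for two raters and a binary outcome (2x2 cell counts)."""
    n = n11 + n10 + n01 + n00
    p_a = (n11 + n00) / n                       # observed agreement
    pi_plus = (2 * n11 + n10 + n01) / (2 * n)   # pooled positive rate
    p_e_star = 2 * pi_plus * (1 - pi_plus)      # approximated chance agreement
    return (p_a - p_e_star) / (1 - p_e_star)


# Hypothetical low-prevalence table: 2 (+,+), 5 (+,-), 5 (-,+), 88 (-,-)
print(round(gwet_ac1(2, 5, 5, 88), 3))
```

For this hypothetical table the observed agreement is 0.90 and AC_{1} is about 0.885, whereas Cohen’s κ for the same counts works out to about 0.23, which illustrates the prevalence effect discussed above.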
In this article, we discuss stratification analyses as another approach to adjust the effect of subject covariates on agreement. For example, a clinical assessment of whether a patient has a particular disease symptom may be influenced by overall severity of the disease. In such a case, we consider stratification based on the severity of the disease. Another example is a multicenter interrater agreement study, in which the classifications for subjects are conducted independently in each center. These situations require several independent agreement statistics. Then the main purpose of the analyses would be testing whether the degree of interrater agreement can be regarded as homogeneous across strata, such as centers and severities of the disease.
For κ, Fleiss has been at the forefront of the idea of χ^{2} test-based interclass consistency with large sample variances [37] and further studies by Donner, Eliasziw and Klar [14], Nam [15, 38] and Wilding, Consiglio and Shan [39] have developed the homogeneity test of κ across covariate levels. However, there are no reports on homogeneity tests for AC_{1} or on an estimator of common AC_{1}. Therefore, in this article, we derive the homogeneity score test for AC_{1} from K independent strata and investigate its performance. An estimation of the common AC_{1} between strata and its confidence intervals is also discussed. Finally, an example application of our approach to clinical trial data is provided.
Methods
Homogeneity tests
Score test
Consider K independent strata involving n_{k} subjects for k = 1, …, K. In each stratum, two raters independently classify subjects as either positive (+) or negative (−). Let X_{kij} = 1 if subject i (= 1, …, n_{k}) in the kth stratum is classified as “+” by rater j (= 1, 2) and X_{kij} = 0 otherwise. Suppose that P(X_{kij} = 1 | i) = u_{ki}, \( E\left({u}_{ki}\right)={\pi}_k\ \mathrm{and}\kern0.5em Var\left({u}_{ki}\right)={\sigma}_k^2 \). The γ of the kth stratum is then expressed as follows [36]:
\( {\gamma}_k=1-\frac{2\left\{{\pi}_k\left(1-{\pi}_k\right)-{\sigma}_k^2\right\}}{1-2{\pi}_k\left(1-{\pi}_k\right)}.\qquad (5) \)
Let the number of observed pairs in the three categories of the kth stratum (both raters “+”, the two raters disagree, both raters “−”) be x_{1k}, x_{2k} and x_{3k} and their corresponding probabilities be P_{1k}(γ_{k}), P_{2k}(γ_{k}) and P_{3k}(γ_{k}). The data of the kth stratum are then given as shown in Table 1.
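The log-likelihood function is given by
\( l\left(\boldsymbol{\gamma},\boldsymbol{\pi} \right)=\sum \limits_{k=1}^K{l}_k\left({\gamma}_k,{\pi}_k\right), \)
where γ = (γ_{1}, …, γ_{K})^{'}, π = (π_{1}, …, π_{K})^{'}, l_{k}(γ_{k}, π_{k}) = x_{1k} log P_{1k}(γ_{k}) + x_{2k} log P_{2k}(γ_{k}) + x_{3k} log P_{3k}(γ_{k}),
\( {P}_{1k}\left({\gamma}_k\right)={\pi}_k-\frac{A_k\left(1-{\gamma}_k\right)}{2},\qquad (6) \)
\( {P}_{2k}\left({\gamma}_k\right)={A}_k\left(1-{\gamma}_k\right),\qquad (7) \)
\( {P}_{3k}\left({\gamma}_k\right)=\left(1-{\pi}_k\right)-\frac{A_k\left(1-{\gamma}_k\right)}{2},\qquad (8) \)
and A_{k} = 1 − 2π_{k}(1 − π_{k}).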
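The maximum likelihood estimators of γ_{k} and π_{k} are then given by
\( {\hat{\gamma}}_k=1-\frac{x_{2k}}{n_k\left\{1-2{\hat{\pi}}_k\left(1-{\hat{\pi}}_k\right)\right\}}\qquad (9) \)
and
\( {\hat{\pi}}_k=\frac{2{x}_{1k}+{x}_{2k}}{2{n}_k},\qquad (10) \)
respectively.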
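The per-stratum estimators can be sketched as follows. The snippet assumes that x_{1k}, x_{2k} and x_{3k} count the pairs rated (+,+), discordant and (−,−) respectively, so that equating the observed proportions with the three cell probabilities gives the pooled positive rate for π_{k} and one minus the discordant proportion divided by A_{k} for γ_{k}. The function name and counts are hypothetical.

```python
def ac1_mle(x1, x2, x3):
    """Closed-form estimates (gamma_hat, pi_hat) for one stratum.

    Assumed coding: x1 = pairs rated (+,+), x2 = discordant pairs,
    x3 = pairs rated (-,-).
    """
    n = x1 + x2 + x3
    pi_hat = (2 * x1 + x2) / (2 * n)            # pooled positive rate
    a_hat = 1 - 2 * pi_hat * (1 - pi_hat)       # A_k evaluated at pi_hat
    gamma_hat = 1 - x2 / (n * a_hat)            # from P_2k = A_k (1 - gamma_k)
    return gamma_hat, pi_hat


# Hypothetical stratum: 40 (+,+), 15 discordant, 45 (-,-)
gamma_hat, pi_hat = ac1_mle(40, 15, 45)
print(round(gamma_hat, 4), pi_hat)
```

With these counts the result agrees with the direct calculation (p_{a} − p_{e}^{∗})/(1 − p_{e}^{∗}) using p_{a} = 0.85 and π_{+} = 0.475, consistent with the statement that the maximum likelihood estimator coincides with Gwet’s estimator.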
The first and second derivatives of the log-likelihood function and the Fisher information matrix are given in the Appendix. The aim of this study is to test the homogeneity of the agreement coefficients among K strata, and thus the null hypothesis to test is represented by H_{0} : γ_{k} = γ_{0} (k = 1, 2, ..., K). The score test statistic for the null hypothesis is derived as follows (see Appendix):
\( T\left({\overset{\sim }{\gamma}}_0,\overset{\sim }{\boldsymbol{\pi}}\right)=\sum \limits_{k=1}^K\frac{{\overset{\sim }{R}}_k^2{\overset{\sim }{D}}_k}{n_k\left({\overset{\sim }{B}}_k{\overset{\sim }{D}}_k-{\overset{\sim }{C}}_k^2\right)},\qquad (11) \)
where \( {\overset{\sim }{B}}_k,{\overset{\sim }{C}}_k,{\overset{\sim }{D}}_k\ \mathrm{and}\ {\overset{\sim }{R}}_k \) are obtained by substituting the maximum likelihood estimators \( {\overset{\sim }{\gamma}}_0 \) and \( {\overset{\sim }{\pi}}_k \) under the null hypothesis into
\( {B}_k=\frac{1}{P_{1k}}+\frac{4}{P_{2k}}+\frac{1}{P_{3k}}, \)
\( {C}_k=\frac{1}{P_{1k}}-\frac{1}{P_{3k}}+\left(1-{\gamma}_k\right)\left(1-2{\pi}_k\right){B}_k, \)
\( {D}_k=\frac{1}{P_{1k}}+\frac{1}{P_{3k}}+\left(1-{\gamma}_k\right)\left(1-2{\pi}_k\right)\left(\frac{1}{P_{1k}}-\frac{1}{P_{3k}}+{C}_k\right), \)
\( {R}_k=\frac{x_{1k}}{P_{1k}}-\frac{2{x}_{2k}}{P_{2k}}+\frac{x_{3k}}{P_{3k}}. \)
\( T\left({\overset{\sim }{\gamma}}_0,\overset{\sim }{\boldsymbol{\pi}}\right) \) is asymptotically distributed as a χ^{2} with K − 1 degrees of freedom. The homogeneity hypothesis is rejected at level α when \( T\left({\overset{\sim }{\gamma}}_0,\overset{\sim }{\boldsymbol{\pi}}\right)\ge {\chi}_{\left(1-\alpha \right),K-1}^2 \), where \( {\chi}_{\left(1-\alpha \right),K-1}^2 \) is the 100 × (1 − α) percentile point of the χ^{2} distribution with K − 1 degrees of freedom.
Note that, since 0 ≤ P_{1k}(γ_{k}), P_{2k}(γ_{k}), P_{3k}(γ_{k}) ≤ 1 and P_{1k}(γ_{k}) + P_{2k}(γ_{k}) + P_{3k}(γ_{k}) = 1, substituting (6), (7) and (8) into these equations, the admissible range of γ_{k} with respect to π_{k} is obtained as follows [36]:
\( 1-\frac{2\min \left({\pi}_k,1-{\pi}_k\right)}{A_k}\le {\gamma}_k\le 1.\qquad (12) \)
When obtaining the maximum likelihood estimators \( {\overset{\sim }{\gamma}}_0 \) and \( {\overset{\sim }{\pi}}_k \) under the null hypothesis by numerical calculation, initial values need to be set to satisfy this condition.
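The whole score-test procedure can be sketched in Python as follows. This is an illustrative implementation under stated assumptions, not the authors’ R code: the cell probabilities are taken as P_{1k} = π_{k} − A_{k}(1 − γ_{k})/2, P_{2k} = A_{k}(1 − γ_{k}) and P_{3k} = 1 − π_{k} − A_{k}(1 − γ_{k})/2, the statistic is assembled as the sum over strata of R_{k}^{2}D_{k}/{n_{k}(B_{k}D_{k} − C_{k}^{2})}, and the restricted estimates are found by a nested grid plus golden-section search, which is a swapped-in optimizer chosen only to keep the sketch dependency-free. Inadmissible parameter values are handled by returning a log-likelihood of minus infinity. All function names and data are hypothetical.

```python
import math


def cell_probs(gamma, pi):
    """(P1, P2, P3): both +, discordant, both -, for one stratum."""
    a = 1.0 - 2.0 * pi * (1.0 - pi)
    half = 0.5 * (1.0 - gamma) * a
    return pi - half, 2.0 * half, (1.0 - pi) - half


def loglik(gamma, pi, x):
    """Trinomial log-likelihood kernel; -inf outside the admissible range."""
    p = cell_probs(gamma, pi)
    if min(p) <= 0.0:
        return -math.inf
    return sum(xh * math.log(ph) for xh, ph in zip(x, p))


def argmax_1d(f, lo, hi, grid=100, tol=1e-10):
    """Coarse grid search followed by golden-section refinement."""
    step = (hi - lo) / grid
    best = max((lo + step * (i + 0.5) for i in range(grid)), key=f)
    a, b = max(lo, best - step), min(hi, best + step)
    r = (math.sqrt(5.0) - 1.0) / 2.0
    c, d = b - r * (b - a), a + r * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:                      # maximum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + r * (b - a)
            fd = f(d)
        else:                            # maximum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - r * (b - a)
            fc = f(c)
    return 0.5 * (a + b)


def score_test(strata):
    """Return (T, gamma0_tilde) for a list of (x1, x2, x3) count triples."""
    def best_pi(gamma, x):
        return argmax_1d(lambda p: loglik(gamma, p, x), 0.0, 1.0)

    def profile(gamma):
        return sum(loglik(gamma, best_pi(gamma, x), x) for x in strata)

    g0 = argmax_1d(profile, -1.0, 1.0)   # restricted MLE of the common AC1
    t = 0.0
    for x in strata:
        n = sum(x)
        pi = best_pi(g0, x)              # restricted MLE of pi_k
        p1, p2, p3 = cell_probs(g0, pi)
        e = (1.0 - g0) * (1.0 - 2.0 * pi)
        b = 1 / p1 + 4 / p2 + 1 / p3
        c = 1 / p1 - 1 / p3 + e * b
        d = 1 / p1 + 1 / p3 + e * (1 / p1 - 1 / p3 + c)
        r = x[0] / p1 - 2 * x[1] / p2 + x[2] / p3
        t += r * r * d / (n * (b * d - c * c))
    return t, g0


# Hypothetical two-stratum data; compare T with 3.841 (chi-square, 1 df, 5%).
t_stat, gamma0 = score_test([(45, 5, 50), (40, 12, 48)])
print(round(t_stat, 3), round(gamma0, 3))
```

When all strata have identical counts the restricted and unrestricted estimates coincide and the statistic is numerically zero, which gives a quick sanity check of any implementation.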
Goodness-of-fit test
Donner, Eliasziw and Klar proposed a goodness-of-fit approach for testing homogeneity of kappa statistics in the case of two raters with binary outcomes [40]. This procedure can also be applied to AC_{1} statistics. Given that the frequencies x_{1k}, x_{2k}, x_{3k}, k = 1, …, K in Table 1 follow a multinomial distribution conditional on n_{k}, estimated probabilities under H_{0} are given by \( {\hat{P}}_{hk}\left({\overset{\sim }{\gamma}}_0\right) \), which is obtained by replacing π_{k} by \( {\hat{\pi}}_k \) and γ_{k} by \( {\overset{\sim }{\gamma}}_0 \) in P_{hk}(γ_{k}); h = 1, 2, 3; k = 1, …, K. Then the goodness-of-fit statistic is derived as follows:
\( {\chi}_G^2=\sum \limits_{k=1}^K\sum \limits_{h=1}^3\frac{{\left\{{x}_{hk}-{n}_k{\hat{P}}_{hk}\left({\overset{\sim }{\gamma}}_0\right)\right\}}^2}{n_k{\hat{P}}_{hk}\left({\overset{\sim }{\gamma}}_0\right)}.\qquad (13) \)
Under H_{0}, \( {\chi}_G^2 \) follows an approximate χ^{2} distribution with K − 1 degrees of freedom. The homogeneity hypothesis is rejected at level α when \( {\chi}_G^2\ge {\chi}_{\left(1-\alpha \right),K-1}^2 \), where \( {\chi}_{\left(1-\alpha \right),K-1}^2 \) is the 100 × (1 − α) percentile point of the χ^{2} distribution with K − 1 degrees of freedom.
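A minimal sketch of the goodness-of-fit statistic is given below. It assumes the usual Pearson form, summing (observed − expected)^{2}/expected over the three categories and the K strata, with expected counts built from the unrestricted estimate of π_{k} and a common AC_{1} value supplied by the caller. The function name and its `gamma0` argument are hypothetical, and the restricted estimate of γ_{0} has to be computed separately.

```python
def gof_statistic(strata, gamma0):
    """Pearson-type goodness-of-fit statistic for homogeneity of AC1.

    strata : list of (x1, x2, x3) = (both +, discordant, both -) counts
    gamma0 : estimate of the common AC1 under the null hypothesis
    """
    chi2 = 0.0
    for x1, x2, x3 in strata:
        n = x1 + x2 + x3
        pi_hat = (2 * x1 + x2) / (2 * n)          # unrestricted estimate of pi_k
        a = 1 - 2 * pi_hat * (1 - pi_hat)
        half = 0.5 * (1 - gamma0) * a
        probs = (pi_hat - half, 2 * half, 1 - pi_hat - half)
        for obs, p in zip((x1, x2, x3), probs):
            chi2 += (obs - n * p) ** 2 / (n * p)  # (O - E)^2 / E
    return chi2


# Hypothetical data: two identical strata evaluated at two candidate values.
data = [(40, 15, 45), (40, 15, 45)]
print(gof_statistic(data, 0.5))
```

If the supplied common value equals the per-stratum estimate in every stratum, the expected counts reproduce the observed counts and the statistic is zero.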
Estimation of common AC_{1}
If the assumption of homogeneity is reasonable, the estimate of γ_{0} can be used as an appropriate summary measure of reliability. The maximum likelihood estimators \( {\overset{\sim }{\gamma}}_0 \) and \( {\overset{\sim }{\pi}}_k \) are obtained by maximizing the log-likelihood function \( {l}_0\left({\gamma}_0,\boldsymbol{\pi} \right)=\sum \limits_{k=1}^K{l}_k\left({\gamma}_0,{\pi}_k\right) \). Since an analytical solution cannot be obtained from this function, numerical iterative calculations are used. The variance \( Var\left({\overset{\sim }{\gamma}}_0\right) \) of \( {\overset{\sim }{\gamma}}_0 \) can be expressed as follows (see Appendix):
where \( {B}_k^{(0)},{C}_k^{(0)},{D}_k^{(0)} \) are values using γ_{k} = γ_{0} in B_{k}, C_{k}, D_{k} respectively, and
A simple 100 × (1 − α) % confidence interval using the asymptotic normality of \( {\overset{\sim }{\gamma}}_0 \) can be expressed as follows:
\( {\overset{\sim }{\gamma}}_0\pm {Z}_{\alpha /2}\sqrt{\hat{Var}\left({\overset{\sim }{\gamma}}_0\right)}, \)
where Z_{α/2} is the α/2 upper quantile of the standard normal distribution and \( \hat{Var}\left({\overset{\sim }{\gamma}}_0\right) \) is obtained by substituting \( {\overset{\sim }{\gamma}}_0 \) and \( {\overset{\sim }{\pi}}_k \) into (14). Hereafter, this method is referred to as the simple asymptotic (SA) method. Since Eq. (14) depends on γ_{0}, the SA method may not have the correct coverage rate, and the normality of the sampling distribution of \( {\overset{\sim }{\gamma}}_0 \) may be improved using Fisher’s Z transformation. This method is referred to below as the Fisher’s Z transformation (FZ) method (see Appendix).
As an alternative method, we employ the profile variance approach, which has been shown to perform well in the case of the intraclass κ for binary outcome data [41,42,43]. This approach also performs well for AC_{1} in the case of two raters with binary outcomes [36]. The confidence interval based on the profile variance can be obtained by solving the following inequality for γ_{0}:
\( \frac{{\left({\overset{\sim }{\gamma}}_0-{\gamma}_0\right)}^2}{\overset{\sim }{Var}\left({\overset{\sim }{\gamma}}_0\right)}\le {Z}_{\alpha /2}^2, \)
where \( \overset{\sim }{Var}\left({\overset{\sim }{\gamma}}_0\right) \) is given by substituting \( {\overset{\sim }{\pi}}_k \) into π_{k} in (15). Hereafter, this method is referred to as the profile variance (PV) method (see Appendix).
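The SA and PV intervals can be sketched as follows. The variance formula used here is an assumption: it is the inverse of the Fisher information for γ_{0} after profiling out each π_{k}, written with B_{k}, C_{k} and D_{k} as the reciprocal of the sum over strata of n_{k}A_{k}^{2}(B_{k}D_{k} − C_{k}^{2})/(4D_{k}), and it may differ in form from the expression in the Appendix. The PV limits are found by bisection on (γ̃_{0} − γ_{0})^{2} − Z^{2}Var(γ_{0}) with π_{k} held at its restricted estimate. The FZ method is not reproduced because its details are given only in the Appendix. Function names and inputs are hypothetical.

```python
import math

Z975 = 1.959964  # upper 2.5% point of the standard normal distribution


def common_var(gamma0, pis, ns):
    """Assumed asymptotic variance of the common AC1 estimate."""
    info = 0.0
    for pi, n in zip(pis, ns):
        a = 1 - 2 * pi * (1 - pi)
        half = 0.5 * (1 - gamma0) * a
        p1, p2, p3 = pi - half, 2 * half, 1 - pi - half
        e = (1 - gamma0) * (1 - 2 * pi)
        b = 1 / p1 + 4 / p2 + 1 / p3
        c = 1 / p1 - 1 / p3 + e * b
        d = 1 / p1 + 1 / p3 + e * (1 / p1 - 1 / p3 + c)
        info += n * a * a * (b * d - c * c) / (4 * d)
    return 1.0 / info


def sa_interval(gamma0, pis, ns, z=Z975):
    """Simple asymptotic interval: estimate +/- z * standard error."""
    se = math.sqrt(common_var(gamma0, pis, ns))
    return gamma0 - z * se, gamma0 + z * se


def pv_interval(gamma0, pis, ns, z=Z975):
    """Profile variance interval: variance re-evaluated at each candidate."""
    def h(g):
        return (gamma0 - g) ** 2 - z * z * common_var(g, pis, ns)

    def bisect(inside, outside):
        # h < 0 at 'inside' (within the interval), h > 0 at 'outside'
        for _ in range(100):
            mid = 0.5 * (inside + outside)
            if h(mid) < 0:
                inside = mid
            else:
                outside = mid
        return 0.5 * (inside + outside)

    eps = 1e-9
    # Admissible range of gamma0 given the pi_k (all cell probabilities > 0)
    low = max(1 - 2 * min(p, 1 - p) / (1 - 2 * p * (1 - p)) for p in pis) + eps
    high = 1 - eps
    lower = bisect(gamma0, low) if h(low) > 0 else low
    upper = bisect(gamma0, high) if h(high) > 0 else high
    return lower, upper


# Hypothetical inputs: common estimate 0.7, two strata of 100 subjects each.
print(sa_interval(0.7, [0.475, 0.475], [100, 100]))
print(pv_interval(0.7, [0.475, 0.475], [100, 100]))
```

Because the assumed variance shrinks as γ_{0} approaches 1, the PV limits in this hypothetical case are shifted downwards relative to the SA limits, the same direction as seen in the intervals reported for the clinical example.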
Numerical evaluations
We conducted Monte Carlo simulations to investigate the performance of the proposed homogeneity tests and to evaluate the estimate of common AC_{1} and its confidence intervals under the following conditions: the number of strata in the simulation is K = 2 or 3; and random observations are generated from the trinomial distributions according to the probabilities of (6), (7) and (8) by giving the values of γ_{k} and π_{k}. The balanced and unbalanced cases were considered for the values of π_{k} and n_{k}. The values of γ_{k} and π_{k} are set within the theoretical range of Eq. (12) derived in the preceding section. Ten thousand iterations were carried out for each parameter combination.
When π_{k} is close to 0 or 1 and n_{k} is small, there are cases in which the generated data include zero cells. In such cases, B_{k}, C_{k}, D_{k} and R_{k} cannot be estimated. Thus, when zero cells were generated, we adopted the approach of adding 0.5 to the frequency of each combination by two raters, (+,+), (+,−), (−,+), (−,−). This simple method was discussed by Agresti [44] and was adopted in a previous study [39].
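The data-generation step and the zero-cell adjustment can be sketched as follows. The sketch assumes the three-category probabilities π − A(1 − γ)/2, A(1 − γ) and 1 − π − A(1 − γ)/2, and it interprets the adjustment as adding 0.5 to each of the four rater combinations, so the discordant count, which pools (+,−) and (−,+), increases by 1. Both the interpretation and the function names are assumptions made for illustration.

```python
import random


def simulate_stratum(gamma, pi, n, rng):
    """Draw (x1, x2, x3) from the trinomial model for one stratum."""
    a = 1 - 2 * pi * (1 - pi)
    half = 0.5 * (1 - gamma) * a
    p1, p2 = pi - half, 2 * half          # P(both +), P(discordant)
    counts = [0, 0, 0]
    for _ in range(n):
        u = rng.random()
        if u < p1:
            counts[0] += 1
        elif u < p1 + p2:
            counts[1] += 1
        else:
            counts[2] += 1
    return tuple(counts)


def add_half_if_zero(x):
    """Add 0.5 to each of the four rater combinations when a cell is zero."""
    if 0 in x:
        return (x[0] + 0.5, x[1] + 1.0, x[2] + 0.5)
    return x


rng = random.Random(2020)                 # fixed seed for reproducibility
sample = simulate_stratum(0.9, 0.1, 20, rng)
print(sample, add_half_if_zero(sample))
```

Small strata with π_{k} near 0 and γ_{k} near 1, as in this call, are exactly the settings in which zero cells are most likely to arise.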
Results
Empirical type I error rate for the homogeneity test
The type I error rates of the homogeneity tests with a significance level of 0.05 were examined. The sample size was set at n_{k} = n = 20, 50, 80 for balanced settings and (n_{1}, n_{2}, n_{3}) = (20, 50, 80) for unbalanced settings. The error rate obtained by the score test is expressed as SCORE and the error rate obtained by the goodness-of-fit test is expressed as GOF. Table 2 summarizes the results for K = 2.
Overall, the proposed score test did not show any significant type I error rate inflation, but it was very conservative when sample size was small and γ_{0} was close to 1.
In the case of n = 20 when γ_{0} = 0.1, 0.3 or 0.5, the type I error rates of SCORE were maintained at the nominal level of 0.05 regardless of whether π_{k} was balanced or unbalanced, but when γ_{0} = 0.7 or 0.9, the type I error rates were slightly conservative. Especially when γ_{0} = 0.9, the rate was significantly conservative to the extent of being less than 0.01. In the case of n = 50, the type I error rates were maintained at the nominal level of 0.05 except when γ_{0} = 0.9. Finally in the case of n = 80, the type I error rates were almost maintained at the nominal level. In contrast, the type I error rate of GOF tended to be larger than that of SCORE and in many cases it was not maintained at the nominal level.
The results obtained for K = 3 are shown in Table S1 and Table S2 in Additional file 1.
Additional file 2 provides the R code for the simulation of the empirical type I error rate.
Empirical power of the homogeneity test
The empirical power of the score test was investigated only for the case of K = 2, by setting γ_{1} = 0.1, 0.3, 0.5 and γ_{2} − γ_{1} = 0.3, 0.4. The values of π_{k} and n_{k} were set as in the type I error simulation. The results are shown in Table 3. The power tended to increase as the value of γ_{1} increased under fixed values of π and γ_{2} − γ_{1}.
The empirical power of the GOF test was also examined under the same simulation conditions as the score test. The results are also shown in Table 3. However, the GOF had a large type I error rate inflation (Table 2) and was invalid as a test.
Additional file 2 also provides the R code for the simulation of the empirical power.
Bias and mean square error for common AC_{1}
We evaluated the bias and mean square error (MSE) of the maximum likelihood estimator for the common AC_{1}, \( {\overset{\sim }{\gamma}}_0 \). The balanced and unbalanced conditions for π_{k} and the balanced condition for n_{k} were considered. The results are shown in Table 4. The bias of \( {\overset{\sim }{\gamma}}_0 \) tended to decrease as γ_{0} increased, and \( {\overset{\sim }{\gamma}}_0 \) was almost unbiased throughout. As expected, the bias and MSE tended to decrease as the sample size increased.
Additional file 3 provides the R code for the simulation of the bias and mean square error for common AC_{1}.
Confidence intervals for common AC_{1}
We conducted a simulation study to evaluate the performances of the three confidence intervals presented in the previous section. The coverage rates of the 95% confidence interval were examined. The balanced and unbalanced conditions for π_{k} and the balanced condition for n_{k} were considered. The results are shown in Table 5. The coverage rate of the SA method was generally lower than 0.95 under many conditions, with the exception of the value being close to 0.99 in the case of n_{1} = n_{2} = 20 and γ_{0} = 0.9. The FZ method and PV method greatly improved the coverage rates, bringing them close to the nominal level. However, the coverage rate of the PV method was closer to the nominal level than that of the FZ method in most cases under the conditions examined. The coverage rates of each method were also evaluated in the case of K = 3 and under unbalanced n_{k} conditions; both the FZ method and the PV method achieved coverage rates near 0.95 (results not shown).
Additional file 4 provides the R code for the simulation of the confidence intervals for common AC_{1}.
An example
As an example, we used data from a randomized clinical trial called the Silicone Study, which was conducted to investigate the effectiveness of silicone fluids versus gases in the management of proliferative vitreoretinopathy (PVR) by vitrectomy [45]. The PVR classification, determined at the baseline visit, defines the severity of the disease as a continuum of increasing pathology graded as C3, D1, D2 or D3. The presence or absence of retinal injury in the superior nasal cavity was evaluated clinically by the operating ophthalmic surgeon and photographically by an independent fundus photograph reading center [46].
The data and results are summarized in Table 6. For reference, the results of the homogeneity score test proposed by Nam for the intraclass κ are also provided [15]. The probabilities of agreement in each stratum ranged from 0.800 to 0.880 and differed little. However, the values of κ in each stratum ranged from 0.117 to 0.520 and differed greatly. This might be due to the prevalence effect caused by the small values of π. In contrast, the values of γ ranged from 0.723 to 0.861 and did not differ greatly among strata.
The proposed homogeneity score statistic \( T\left({\overset{\sim }{\gamma}}_0,\overset{\sim }{\boldsymbol{\pi}}\right) \) was 2.060 (p-value = 0.560) and the homogeneity hypothesis was not rejected. The estimate of common AC_{1} was 0.808 and its 95% confidence intervals were 0.743–0.873 (SA method), 0.732–0.864 (FZ method) and 0.730–0.862 (PV method). Also, the score statistic for testing the homogeneity of κ’s [15] was 2.700 (p-value = 0.440) and the common κ was 0.352.
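The reported p-values can be checked with a short standard-library sketch of the χ^{2} upper-tail probability. It assumes K = 4 strata (the four PVR grades C3, D1, D2 and D3), hence K − 1 = 3 degrees of freedom, and uses the standard recurrence that builds the tail probability for integer degrees of freedom from the 1- and 2-degree-of-freedom cases. The function name is hypothetical.

```python
import math


def chi2_sf(x, dof):
    """Upper-tail probability P(X >= x) for chi-square with integer dof."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if dof % 2 == 0:
        q, k = math.exp(-half), 2                 # closed form for 2 df
    else:
        q, k = math.erfc(math.sqrt(half)), 1      # closed form for 1 df
    while k < dof:
        # Q(k + 2, x) = Q(k, x) + (x/2)^(k/2) * exp(-x/2) / Gamma(k/2 + 1)
        q += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return q


print(round(chi2_sf(2.060, 3), 3))   # score statistic for AC1
print(round(chi2_sf(2.700, 3), 3))   # score statistic for kappa
```

With 3 degrees of freedom, both statistics fall well below the 5% critical value, in line with the non-rejection of homogeneity reported above.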
Additional file 5 provides the R code for the clinical data example.
To investigate the sensitivity of the indicators to π_{k}, we hypothetically considered more balanced and less balanced π_{k} under fixed p_{a} and n_{k} in each stratum. The generated data set and analysis results are summarized in Table S3 in Additional file 1. AC_{1} was less sensitive than κ to changes in the value of π, and the common AC_{1} was not affected as much as the common κ even when the balance of π was lost.
Discussion
It is well known that Cohen’s κ depends strongly on the marginal distributions, and Gwet proposed alternative and more stable measures of agreement, AC_{1} for nominal data and its extended agreement AC_{2} for ordinal data [24, 25]. A number of alternative measures have also been proposed, as in Holley and Guilford’s G [19], Aickin’s α [20], Andres and Marzo’s delta [21] and Marasini’s s* [22, 23]. Gwet [24] and Shankar and Bangdiwala [28] compared some measures and showed that AC_{1} has better properties than other kappa-type measures. In addition, AC_{1} has been utilized in the field of medical research over the past decade [29,30,31,32,33,34,35]. However, statistical inference procedures for AC_{1} have not been discussed sufficiently. Therefore, Ohyama expressed AC_{1} using population parameters to develop a likelihood-based inference procedure and constructed confidence intervals of AC_{1} based on profile variances and likelihood ratios. Inclusion of subjects’ covariates, hypothesis testing and sample size estimation were also presented [36]. In the present study, the case of stratified data was discussed as one development of Ohyama [36] for two raters with binary outcomes. Furthermore, tests were derived for the homogeneity of AC_{1} between K independent strata and the inference of common AC_{1} was discussed.
In the numerical evaluation of type I error, both tests were conservative when the sample size was small and γ_{0} was 0.9, but the conservativeness was relaxed when the sample size was as large as 80. In other settings of simulation, the score test performed well while GOF sometimes could not achieve the nominal level. Therefore, we recommend using the score test for testing the homogeneity of AC_{1} among K strata. Note that, when zero cells are observed, the homogeneity score test statistic cannot be calculated. In such cases in our simulation study, we simply added 0.5 to the data set, which had no serious effect on the performance of the proposed score test in our simulation settings.
If the homogeneity assumption is reasonable, it may be desired to provide an estimate of the common AC_{1} as a summary measure of reliability. In the present study, we proposed an estimator of common AC_{1} and constructed its confidence intervals based on the SA, FZ, and PV methods. We also evaluated the performance of each numerically. The bias and MSE tended to be small as the sample size increased, and the results were nearly 0 when n = 80. The PV method provides coverage levels close to nominal in most situations, while the SA method tends to provide a shortage of coverage and the FZ method tends to provide excess coverage in some situations. Therefore, we recommend the PV method for calculating confidence intervals.
As in the PVR example, AC_{1} in each stratum is less affected by the prevalence or marginal probability than κ is. This suggests that the proposed homogeneity test and the general framework of common AC_{1} estimation are also essentially more stable than those of κ.
There were some limitations in this study. First, as described above, the performance of the proposed score test was very conservative when γ_{0} = 0.9 and sample size was small. An exact approach might be an alternative method in such cases.
Next, in this study, the cases were limited to two raters with binary outcomes in each stratum. However, in the evaluation of medical data, it is often the case that multiple raters classify subjects into nominal or ordered categories. Our proposed method may be extended to the case of multiple raters with binary outcomes using the likelihood function for multiple raters. In the cases of two raters with nominal outcomes, Agresti [47] proposed a quasi-symmetry model with kappa as a parameter, and this technique may be extended to AC_{1} in the case of stratified data.
Finally, continuous covariates need to be categorized adequately to apply the proposed approach. The regression model proposed by Ohyama [36] can be used to assess the effect of continuous covariates on AC_{1}, but it is limited to the case of two raters with binary data. Nelson and Edwards [48] and Nelson, Mitani and Edwards [49] proposed a method for constructing a measure of agreement using generalized linear mixed-effects models by introducing continuous latent variables representing the subject’s true disease status and for flexibly incorporating rater and subject covariates. These approaches might be applicable to AC_{1} and AC_{2}.
Conclusion
The method proposed in this study is considered to be useful for summarizing evaluations of consistency performed in multiple or stratified interrater agreement studies. In addition, the proposed method can be applied not only to medical or epidemiological research but also to assessment of the degree of consistency of characteristics, such as biometrics, psychological measurements, and data in the behavioral sciences.
Availability of data and materials
The program codes are shared in additional files. Clinical data referred to are from Barlow, et al. ref. [45].
Abbreviations
FZ: Fisher’s Z transformation
GOF: goodness-of-fit
MLE: maximum likelihood estimator
MSE: mean squared error
PV: profile variance
PVR: proliferative vitreoretinopathy
SA: simple asymptotic
References
 1.
Cohen J. Coefficient of agreement for nominal scales. Educ Psychol Meas. 1960;20:37–40. https://doi.org/10.1177/001316446002000104.
 2.
Scott WA. Reliability of content analysis; the case of nominal scale coding. Public Opin Q. 1955;19:321–5. https://doi.org/10.1086/266577.
 3.
Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33(1):159–74. https://doi.org/10.2307/2529310.
 4.
Cohen J. Weighted kappa: nominal scale agreement provision for scaled disagreement or partial credit. Psychol Bull. 1968;70(4):213–20. https://doi.org/10.1037/h0026256.
 5.
Fleiss JL. Measuring nominal scale agreement among many raters. Psychol Bull. 1971;76(5):378–82. https://doi.org/10.1037/h0031619.
 6.
Landis JR, Koch GG. An application of hierarchical kappatype statistics in the assessment of majority agreement among multiple observers. Biometrics. 1977;33(2):363–74. https://doi.org/10.2307/2529786.
 7.
Kraemer HC. Extension of the kappa coefficient. Biometrics. 1980;36(2):207–16. https://doi.org/10.2307/2529972.
 8.
Davies M, Fleiss JL. Measuring agreement for multinomial data. Biometrics. 1982;38(4):1047–51. https://doi.org/10.2307/2529886.
 9.
Berry KJ, Mielke PW. A generalization of Cohen's kappa agreement measure to interval measurement and multiple raters. Educ Psychol Meas. 1988;48(4):921–33. https://doi.org/10.1177/0013164488484007.
 10.
Oden NL. Estimating kappa from binocular data. Stat Med. 1991;10(8):1303–11. https://doi.org/10.1002/sim.4780100813.
 11.
McKenzie DP, Mackinnon AJ, Péladeau N, Onghena P, Bruce PC, Clarke DM, et al. Comparing correlated kappas by resampling: is one level of agreement significantly different from another? J Psychiatr Res. 1996;30(6):483–92. https://doi.org/10.1016/S00223956(96)000337.
 12.
Barnhart HX, Williamson JM. Weighted leastsquares approach for comparing correlated kappa. Biometrics. 2002;58(4):1012–9. https://doi.org/10.1111/j.0006341X.2002.01012.x.
 13.
Gwet KL. Testing the difference of correlated agreement coefficients for statistical significance. Educ Psychol Meas. 2016 Aug;76(4):609–37. https://doi.org/10.1177/0013164415596420.
 14.
Donner A, Eliasziw M, Klar N. Testing the homogeneity of kappa statistics. Biometrics. 1996;52(1):176–83. https://doi.org/10.2307/2533154.
 15.
Nam JM. Homogeneity score test for the intraclass version of the kappa statistics and samplesize determination in multiple or stratified studies. Biometrics. 2003;59(4):1027–35. https://doi.org/10.1111/j.0006341X.2003.00118.x.
 16.
Feinstein AR, Cicchetti DV. High agreement but low kappa: I. the problems of two paradoxes. J Clin Epidemiol. 1990;43(6):543–9. https://doi.org/10.1016/08954356(90)90158L.
 17.
Thompson WD, Walter SD. A reappraisal of the kappa coefficient. J Clin Epidemiol. 1988;41(10):949–58. https://doi.org/10.1016/08954356(88)900315.
 18.
Vach W. The dependence of Cohen's kappa on the prevalence does not matter. J Clin Epidemiol. 2005 Jul;58(7):655–61. https://doi.org/10.1016/j.jclinepi.2004.02.021.
 19.
Holley JW, Guilford JP. A note on the G index of agreement. Educ Psychol Meas. 1964;24(4):749–53. https://doi.org/10.1177/001316446402400402.
 20.
Aickin M. Maximum likelihood estimation of agreement in the constant predictive probability model, and its relation to Cohen’s kappa. Biometrics. 1990;46(2):293–302. https://doi.org/10.2307/2531434.
 21.
Andrés AM, Marzo PF. Delta: a new measure of agreement between two raters. Br J Math Stat Psychol. 2004;57(Pt 1):1–19. https://doi.org/10.1348/000711004849268.
 22.
Marasini D, Quatto P, Ripamonti E. The ordinal interrater agreement for the evaluation of University courses. Stat Appl. 2014;XII(1):5–16. https://doi.org/10.1400/229464.
 23.
Marasini D, Quatto P. Ripamonti. Assessing the interrater agreement through weighted indexes. Stat Methods Med Res. 2016;25(6):2611–33. https://doi.org/10.1177/0962280214529560.
 24.
Gwet KL. Computing interrater reliability and its variance in the presence of high agreement. Br J Math Stat Psychol. 2008;61(Pt 1):29–48. https://doi.org/10.1348/000711006X126600.
 25.
Gwet KL. Handbook of interrater reliability: the definitive guide to measuring the extent of agreement among raters. 4th ed. Gaithersburg: Advanced Analytics, LLC; 2014.
 26.
Byrt T, Bishop J, Carlin JB. Bias, prevalence and kappa. J Clin Epidemiol. 1993;46(5):423–9. https://doi.org/10.1016/0895-4356(93)90018-V.
 27.
Bangdiwala SI. A graphical test for observer agreement, vol. 1. Amsterdam: Proc 45th Int Stats Institute Meeting; 1985. p. 307–8.
 28.
Shankar V, Bangdiwala SI. Observer agreement paradoxes in 2x2 tables: comparison of agreement measures. BMC Med Res Methodol. 2014;14:100. https://doi.org/10.1186/1471-2288-14-100.
 29.
Alencar LM, Zangwill LM, Weinreb RN, Bowd C, Vizzeri G, Sample PA, et al. Agreement for detecting glaucoma progression with the GDx guided progression analysis, automated perimetry, and optic disc photography. Ophthalmology. 2010;117(3):462–70. https://doi.org/10.1016/j.ophtha.2009.08.012.
 30.
Marks D, Comans T, Thomas M, Ng SK, O'Leary S, Conaghan PG, et al. Agreement between a physiotherapist and an orthopaedic surgeon regarding management and prescription of corticosteroid injection for patients with shoulder pain. Man Ther. 2016;26:216–22. https://doi.org/10.1016/j.math.2016.10.001.
 31.
Pollock M, Fernandes RM, Hartling L. Evaluation of AMSTAR to assess the methodological quality of systematic reviews in overviews of reviews of healthcare interventions. BMC Med Res Methodol. 2017;17(1):48. https://doi.org/10.1186/s12874-017-0325-5.
 32.
Veldhoen S, Weng AM, Knapp J, Kunz AS, Stäb D, Wirth C, et al. Self-gated non-contrast-enhanced functional lung MR imaging for quantitative ventilation assessment in patients with cystic fibrosis. Radiology. 2017;283(1):242–51. https://doi.org/10.1148/radiol.2016160355.
 33.
Zee J, Hodgin JB, Mariani LH, Gaut JP, Palmer MB, Bagnasco SM, et al. Reproducibility and feasibility of strategies for morphologic assessment of renal biopsies using the Nephrotic syndrome study network digital pathology scoring system. Arch Pathol Lab Med. 2018;142(5):613–25. https://doi.org/10.5858/arpa.2017-0181-OA.
 34.
Hansen D, Hansen E, Retegan C, Morphet J, Beiles CB. Validation of data submitted by the treating surgeon in the Victorian audit of surgical mortality. ANZ J Surg. 2019;89(1–2):16–9. https://doi.org/10.1111/ans.14910.
 35.
Wennberg S, Karlsen LA, Stalfors J, Bratt M, Bugten V. Providing quality data in health care - almost perfect interrater agreement in the Norwegian tonsil surgery register. BMC Med Res Methodol. 2019;19(1):6. https://doi.org/10.1186/s12874-018-0651-2.
 36.
Ohyama T. Statistical inference of agreement coefficient between two raters with binary outcomes. Commun Stat Theory Methods. 2019. https://doi.org/10.1080/03610926.2019.1576894.
 37.
Fleiss JL. Statistical methods for rates and proportions. 2nd ed. Hoboken: Wiley; 1981.
 38.
Nam JM. Testing the intraclass version of kappa coefficient of agreement with binary scale and sample size determination. Biom J. 2002;44(5):558–70. https://doi.org/10.1002/1521-4036(200207)44:5<558::AID-BIMJ558>3.0.CO;2-5.
 39.
Wilding GE, Consiglio JD, Shan G. Exact approaches for testing hypotheses based on the intraclass kappa coefficient. Stat Med. 2014;33(17):2998–3012. https://doi.org/10.1002/sim.6135.
 40.
Donner A, Eliasziw M. A goodness-of-fit approach to inference procedures for the kappa statistic: confidence interval construction, significance-testing and sample size estimation. Stat Med. 1992;11(11):1511–9. https://doi.org/10.1002/sim.4780130809.
 41.
Lee JJ, Tu ZN. A better confidence interval for kappa on measuring agreement between two raters with binary outcomes. J Comput Graph Stat. 1994;3:301–21. https://doi.org/10.2307/1390914.
 42.
Donner A, Zou G. Interval estimation for a difference between intraclass kappa statistics. Biometrics. 2002;58(1):209–15. https://doi.org/10.1111/j.0006-341X.2002.00209.x.
 43.
Zou G, Donner A. Confidence interval estimation of the intraclass correlation coefficient for binary outcome data. Biometrics. 2004;60(3):807–11. https://doi.org/10.1111/j.0006-341X.2004.00232.x.
 44.
Agresti A. Categorical data analysis. 2nd ed. Hoboken: Wiley; 2002. https://doi.org/10.1002/0471249688.
 45.
Barlow W, Lai MY, Azen SP. A comparison of methods for calculating a stratified kappa. Stat Med. 1991;10(9):1465–72. https://doi.org/10.1002/sim.4780100913.
 46.
Silicone Study Group. Proliferative vitreoretinopathy. The Silicone Study Group. Am J Ophthalmol. 1985;99(5):593–5. https://doi.org/10.1016/S0002-9394(14)77967-X.
 47.
Agresti A. An agreement model with kappa as parameter. Stat Prob Lett. 1989;7(4):271–3. https://doi.org/10.1016/0167-7152(89)90104-1.
 48.
Nelson KP, Edwards D. Measures of agreement between many raters for ordinal classifications. Stat Med. 2015;34(23):3116–32. https://doi.org/10.1002/sim.6546.
 49.
Nelson KP, Mitani AA, Edwards D. Assessing the influence of rater and subject characteristics on measures of agreement for ordinal ratings. Stat Med. 2017;36(20):3181–99. https://doi.org/10.1002/sim.7323.
 50.
Bickel PJ, Doksum KA. Mathematical statistics: basic ideas and selected topics. 1st ed. San Francisco: Holden-Day; 1977.
 51.
Rothman KJ. Estimation of confidence limits for the cumulative probability of survival in life table analysis. J Clin Epidemiol. 1978;31(8):557–60. https://doi.org/10.1016/0021-9681(78)90043-7.
 52.
Afifi AA, Elashoff RM, Lee JJ. Simultaneous nonparametric confidence intervals for survival probabilities from censored data. Stat Med. 1986;5:653–62. https://doi.org/10.1002/sim.4780050612.
Acknowledgements
We thank Professor T. Kakuma and Dr. T. Yanagawa for providing useful advice on the model construction and evaluation method. We also thank reviewers and editors for constructive and useful advice for improving this article.
Funding
No grant support or other funding was received.
Author information
Contributions
CH and TO designed the concept of this research. CH conducted the simulation, analyzed a clinical example and drafted the manuscript. TO supervised this study and critically reviewed the manuscript. Both authors have read and approved the manuscript.
Corresponding author
Correspondence to Chikara Honda.
Ethics declarations
Ethics approval and consent to participate
Not applicable, because the study involved the development of statistical methods. The retrospective data used in the clinical example were originally published by Barlow et al. [45].
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
First and second derivatives of the log-likelihood function
The first and second derivatives of l(γ, π), where γ = (γ_{1}, ⋯, γ_{K})′ and π = (π_{1}, ⋯, π_{K})′, with respect to γ_{k} and π_{k} are obtained as follows:
where
Since E(x_{hk}) = n_{k}P_{hk}(γ_{k}) (h = 1, 2, 3),
where
Thus, the Fisher information matrix is given as follows:
Score test statistic for the null hypothesis
Define the score function U as
The score statistic for testing the null hypothesis H_{0} : γ_{1} = ⋯ = γ_{K} = γ_{0} is asymptotically distributed as χ^{2} with K − 1 degrees of freedom and is expressed as
where \( {\overset{\sim }{\gamma}}_0 \) and \( {\overset{\sim }{\pi}}_k \) are the maximum likelihood estimators under H_{0}, and the score function vector is expressed as \( U\left({\overset{\sim }{\gamma}}_0,\overset{\sim }{\boldsymbol{\pi}}\right)=\frac{1}{2}{\left({\overset{\sim }{A}}_1{\overset{\sim }{R}}_1,\dots, {\overset{\sim }{A}}_K{\overset{\sim }{R}}_K,0,\dots, 0\right)}^{\prime }. \)
The upper left K × K submatrix of \( I{\left(\overset{\sim }{\boldsymbol{\gamma}},\overset{\sim }{\boldsymbol{\pi}}\right)}^{-1} \) is expressed as follows:
so that the score test statistic in Eq. (11) is obtained.
Confidence interval for γ_{0} based on Fisher’s Z transformation
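Fisher’s Z transformation of \( {\overset{\sim }{\gamma}}_0 \) is defined by
\( \overset{\sim }{z}=\frac{1}{2}\log \left[\left(1+{\overset{\sim }{\gamma}}_0\right)/\left(1-{\overset{\sim }{\gamma}}_0\right)\right]. \)
Using the delta method, the asymptotic variance of \( \overset{\sim }{z} \) is represented by
\( Var\left(\overset{\sim }{z}\right)= Var\left({\overset{\sim }{\gamma}}_0\right)/{\left(1-{\gamma}_0^2\right)}^2, \)
where \( Var\left({\overset{\sim }{\gamma}}_0\right) \) is given by Eq. (14). Then a confidence interval for z = 0.5 log[(1 + γ_{0})/(1 − γ_{0})] is obtained by
\( \overset{\sim }{z}\pm {z}_{\alpha /2}\sqrt{\hat{Var}\left(\overset{\sim }{z}\right)}, \)
where z_{α/2} denotes the upper α/2 point of the standard normal distribution and \( \hat{Var}\left(\overset{\sim }{z}\right) \) is defined by substituting \( {\overset{\sim }{\gamma}}_0 \) and \( {\overset{\sim }{\pi}}_k \) into \( Var\left(\overset{\sim }{z}\right) \). Thus the confidence interval for γ_{0} based on the FZ method is obtained by applying the inverse transformation γ = (e^{2z} − 1)/(e^{2z} + 1) to the lower and upper limits for z.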
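The Fisher's Z steps described here can be sketched numerically. The following Python fragment is a minimal illustration under stated assumptions, not the authors' program: the function name `fisher_z_interval` and its arguments are hypothetical, and the variance of the common estimate (Eq. (14) in the article) is taken as an input rather than computed.

```python
import math
from statistics import NormalDist


def fisher_z_interval(gamma_hat, var_gamma, level=0.95):
    """Confidence interval for a coefficient in (-1, 1) via Fisher's Z.

    gamma_hat : point estimate of the common coefficient (assumed given)
    var_gamma : estimated variance of gamma_hat (assumed given, e.g. from Eq. (14))
    """
    if not -1.0 < gamma_hat < 1.0:
        raise ValueError("gamma_hat must lie strictly between -1 and 1")
    if var_gamma < 0.0:
        raise ValueError("var_gamma must be non-negative")

    # Two-sided standard normal critical value, e.g. about 1.96 for 95%.
    z_crit = NormalDist().inv_cdf(0.5 + level / 2.0)

    # Transform: z = 0.5 * log((1 + gamma) / (1 - gamma)) = atanh(gamma).
    z_hat = math.atanh(gamma_hat)

    # Delta method: dz/dgamma = 1 / (1 - gamma^2), so
    # Var(z) = Var(gamma) / (1 - gamma^2)^2.
    se_z = math.sqrt(var_gamma) / (1.0 - gamma_hat ** 2)

    z_low = z_hat - z_crit * se_z
    z_up = z_hat + z_crit * se_z

    # Back-transform: gamma = (exp(2z) - 1) / (exp(2z) + 1) = tanh(z).
    return math.tanh(z_low), math.tanh(z_up)


if __name__ == "__main__":
    # Illustrative inputs only; they are not taken from the article.
    print(fisher_z_interval(0.8, 0.01))
```

Because the interval is built on the z scale and mapped back with tanh, its limits always stay inside (−1, 1) and the interval is asymmetric around the estimate when the estimate is far from zero.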
Derivation of Eq. (14)
By taking the second-order partial derivatives of the log-likelihood function l_{0}(γ_{0}, π) and then taking expectations, we obtain
where \( {B}_k^{(0)},{C}_k^{(0)} \) and \( {D}_k^{(0)} \) are the values of B_{k}, C_{k} and D_{k}, respectively, evaluated at γ_{k} = γ_{0}. Let
where I_{γπ} = (I_{γ1}, …, I_{γK}) and I_{ππ} = diag(I_{kk}). When P_{1k} ≠ 0, P_{2k} ≠ 0 and P_{3k} ≠ 0 for all k, I_{ππ} is a nonsingular matrix, and the element of the inverse matrix of I corresponding to I_{γγ}, which is the variance of \( {\overset{\sim }{\gamma}}_0 \), is given by
Since
where \( {p}_{a,k}={\gamma}_0\left(1-{p}_{e,k}^{\ast}\right)+{p}_{e,k}^{\ast } \) is the probability of agreement in the kth stratum and \( {p}_{e,k}^{\ast }=2{\pi}_k\left(1-{\pi}_k\right) \) is the probability of chance agreement in the kth stratum, and using the variance formula for the single stratum case given by Ohyama [36], I^{γγ} can be reduced to the rightmost expression in Eq. (14).
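The two stratum-level quantities defined here, the chance-agreement probability 2π_k(1 − π_k) and the agreement probability γ_0(1 − p*_{e,k}) + p*_{e,k}, can be checked with a few lines of Python. This is a minimal sketch that assumes only those two formulas; the function names are hypothetical and do not come from the article's additional file.

```python
def chance_agreement(pi_k):
    """Chance-agreement probability p*_e = 2 * pi * (1 - pi) for one stratum."""
    if not 0.0 <= pi_k <= 1.0:
        raise ValueError("pi_k must lie in [0, 1]")
    return 2.0 * pi_k * (1.0 - pi_k)


def agreement_probability(gamma_0, pi_k):
    """Agreement probability p_a = gamma_0 * (1 - p*_e) + p*_e."""
    p_e = chance_agreement(pi_k)
    return gamma_0 * (1.0 - p_e) + p_e


def coefficient_from_agreement(p_a, pi_k):
    """Invert the relation above: gamma = (p_a - p*_e) / (1 - p*_e)."""
    p_e = chance_agreement(pi_k)
    # p*_e is at most 0.5, so the denominator is at least 0.5.
    return (p_a - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    # Illustrative values only: gamma_0 = 0.6 and pi_k = 0.5.
    p_a = agreement_probability(0.6, 0.5)
    print(p_a, coefficient_from_agreement(p_a, 0.5))
```

Since 2π_k(1 − π_k) never exceeds 0.5, the denominator 1 − p*_{e,k} stays bounded away from zero for every value of π_k.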
Profile variance approach
The profile variance of a statistic is the estimated variance in which the parameter targeted by the statistic is left unspecified rather than replaced by its estimate [41].
In this study, \( {\overset{\sim }{\pi}}_k \) is substituted for π_{k} in Eq. (15), and the profile variance \( \overset{\sim }{Var}\left({\overset{\sim }{\gamma}}_0\right) \) is then obtained from Eq. (14). Since \( {\overset{\sim }{\gamma}}_0 \) is asymptotically normally distributed with mean γ_{0} and variance given by Eq. (14), we have
Thus we can obtain the confidence limits as the two admissible roots of Eq. (17).
Since Eq. (17) is a cubic equation in γ_{0} and is complicated to solve analytically, we calculated the confidence limits numerically. The program is given in the additional file. Other examples of the profile variance approach for obtaining confidence intervals can be found in the literature. For example, Bickel and Doksum [50] reported a confidence interval based on the profile variance for the one-sample binomial proportion. Rothman [51] and Afifi, Elashoff and Lee [52] described profile-variance-type confidence intervals for survival probabilities.
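The idea of finding confidence limits as roots of a profile-variance equation can be illustrated on the simpler one-sample binomial case mentioned above. The Python sketch below is not the authors' program and does not implement Eq. (17), whose coefficients depend on Eq. (14). It solves (p̂ − p)² = z² p(1 − p)/n for p by bisection and compares the result with the closed-form roots of that quadratic, commonly known as the Wilson score interval. All function names are hypothetical.

```python
import math
from statistics import NormalDist


def profile_variance_limits(p_hat, n, level=0.95, tol=1e-12):
    """Limits for a binomial proportion from the profile-variance equation.

    The variance p * (1 - p) / n is kept as a function of the unknown p,
    and the limits are the two roots of (p_hat - p)^2 = z^2 * p * (1 - p) / n.
    """
    if not 0.0 <= p_hat <= 1.0 or n <= 0:
        raise ValueError("need 0 <= p_hat <= 1 and n > 0")
    z = NormalDist().inv_cdf(0.5 + level / 2.0)

    def g(p):
        # Negative inside the interval, positive outside it.
        return (p_hat - p) ** 2 - z * z * p * (1.0 - p) / n

    def bisect(a, b):
        # Plain bisection; assumes a root of g lies in [a, b].
        fa = g(a)
        while b - a > tol:
            m = 0.5 * (a + b)
            fm = g(m)
            if (fa > 0.0) == (fm > 0.0):
                a, fa = m, fm
            else:
                b = m
        return 0.5 * (a + b)

    return bisect(0.0, p_hat), bisect(p_hat, 1.0)


def wilson_closed_form(p_hat, n, level=0.95):
    """Closed-form roots of the same quadratic equation, for comparison."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)) / denom
    return centre - half, centre + half


if __name__ == "__main__":
    # Illustrative inputs only; they are not taken from the article.
    print(profile_variance_limits(0.3, 50))
    print(wilson_closed_form(0.3, 50))
```

In the binomial case the equation is quadratic and has a closed-form solution, so the numerical search is used here only to mirror the approach taken for the cubic Eq. (17).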
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Honda, C., Ohyama, T. Homogeneity score test of AC_{1} statistics and estimation of common AC_{1} in multiple or stratified interrater agreement studies. BMC Med Res Methodol 20, 20 (2020). https://doi.org/10.1186/s12874-019-0887-5
Keywords
 Common AC_{1}
 Consistency evaluation
 Gwet’s AC_{1}
 Homogeneity test
 Interrater agreement
 Stratified study