Our review of the PubMed repository identified 82 eligible agreement studies published in the medical literature between 2018 and 2020, covering a variety of disease areas. Sample sizes ranged widely, and the typical sample size varied by clinical field, statistical method and type of endpoint.
Continuous endpoints were the most common, and for these Bland–Altman LoA was the most frequent statistical approach, with a median sample size of 89 (IQR 35 to 124). The predominance of Bland–Altman LoA is consistent with the review by Zaki et al. [2]. Also consistent with their review, we observed continued use of the correlation coefficient, despite it being deemed inappropriate for the assessment of agreement [6], although at a lower frequency than they reported.
We found Kappa statistics to be the most common approach for categorical variables, with a median sample size of 71 (IQR 50 to 233). Kappa is commonly used to assess agreement on binary and ordinal scales [7]. Studies with categorical variables tended to have larger sample sizes than those focussing mainly on continuous variables. This finding of larger sample sizes for categorical compared with continuous outcomes is consistent with research in the context of pilot studies [8] and definitive outcome trials, as inferred from the target standardised effect sizes reported by Rothwell et al. [9].
We found that all included studies reported a sample size, but only one-third provided a justification for it, and of those, not all reported use of statistical sample size formulae. Kottner et al. [1] recommended that sample size justification be made explicit in agreement studies to ensure transparency and credibility. Despite this, Farzin et al. [10] found that a sample size justification was given in only nine of 280 agreement studies (3%) published in diagnostic imaging journals, markedly lower than the proportion observed in the present review.
Variation in the quality of sample size reporting has been examined in the context of clinical trials: 95% of the trials published in high-impact journals reviewed by Charles et al. [11] reported a sample size calculation, but only 53% reported all the parameters required for replication. Copsey et al. [12] reported a lower proportion, with 67% of trials describing a sample size calculation and only 21% reporting all of its components. Tulka et al. [13] reported that just 42% of trials justified their sample size, and only 21% described a complete sample size calculation. Sample size reporting in clinical trials might be expected to be of higher quality following publication of the first CONSORT guidance in 1996 [14]. These trial reviews show higher proportions of studies reporting details of sample size estimation than the agreement-study reviews, but indicate that inadequate reporting remains prevalent. The higher proportion reported by Charles et al. [11] likely reflects their review being restricted to the highest-impact medical journals.
Some authors suggest general rules of thumb for sample sizes in agreement studies: for example, Liao [4] recommended a minimum sample size of 32 and McAlinden et al. [15] a minimum of 100 for agreement studies measuring continuous variables. A preferred approach, where possible, is to use specific calculations that take into account the research question and the intended statistical method of analysis. Formulae to determine minimum sample size requirements are available for different statistical methods, including Bland–Altman LoA [16, 17], ICC [18] and Kappa coefficients [19], amongst others.
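As an illustration of what such a calculation can look like in the Bland–Altman case, the sketch below uses the widely cited approximation from Bland and Altman's original work that each limit of agreement is estimated with standard error roughly √(3s²/n), where s is the standard deviation of the paired differences. This is a minimal illustration of the general idea rather than a reproduction of the cited formulae [16, 17]; the function name and interface are hypothetical.

```python
import math
from statistics import NormalDist

def loa_sample_size(halfwidth_in_sd: float, conf: float = 0.95) -> int:
    """Smallest n such that the confidence interval around each Bland-Altman
    limit of agreement has half-width <= halfwidth_in_sd standard deviations
    of the differences, using the approximation SE(LoA) ~ sqrt(3 * s^2 / n).
    (Illustrative sketch; not the formulae from the cited references.)"""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)  # e.g. 1.96 for 95%
    # half-width = z * sqrt(3/n) * s <= halfwidth_in_sd * s
    # => n >= 3 * (z / halfwidth_in_sd)^2
    return math.ceil(3 * (z / halfwidth_in_sd) ** 2)

print(loa_sample_size(0.5))  # CI half-width of 0.5 SDs -> n = 47
```

Under this approximation, a study aiming for 95% confidence intervals no wider than half a standard deviation around each limit would need around 47 subjects, which sits within the interquartile range of sample sizes observed for Bland–Altman studies in our review.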
Some agreement studies may be constrained by the sample size available, for example when embedded within studies powered on a different outcome, or the pre-determined target sample may not be achieved for financial, temporal or other reasons. Nevertheless, the target and actual sample sizes should still be described and justified. The quality of agreement studies could be improved by following the Guidelines for Reporting Reliability and Agreement Studies (GRRAS) recommendations [1], which call for an explanation of the chosen sample size and explicit reporting of the numbers of raters, subjects/objects and replicate observations.
A strength of this review is that it is the first to investigate how typical sample sizes in recent medical agreement studies differ by clinical field, type of endpoint and statistical method. A team of statisticians assessed the studies, improving the accuracy of data review and extraction and reducing bias. Limitations include the use of only one electronic repository: studies not indexed in PubMed would not have been captured. Relatively few search terms were used, so some relevant studies may have been missed, and searches were limited to the English language, so studies in other languages were not included.