On Jones et al.’s method for extending Bland-Altman plots to limits of agreement with the mean for multiple observers

Background To assess the agreement of continuous measurements between a number of observers, Jones et al. introduced limits of agreement with the mean (LOAM) for multiple observers, representing how much an individual observer can deviate from the mean measurement of all observers. Besides the graphical visualisation of LOAM, suggested by Jones et al., it is desirable to supply LOAM with confidence intervals and to extend the method to the case of multiple measurements per observer. Methods We reformulate LOAM under the assumption the measurements follow an additive two-way random effects model. Assuming this model, we provide estimates and confidence intervals for the proposed LOAM. Further, this approach is easily extended to the case of multiple measurements per observer. Results The proposed method is applied on two data sets to illustrate its use. Specifically, we consider agreement between measurements regarding tumour size and aortic diameter. For the latter study, three measurement methods are considered. Conclusions The proposed LOAM and the associated confidence intervals are useful for assessing agreement between continuous measurements. Supplementary Information The online version contains supplementary material available at 10.1186/s12874-020-01182-w.


Background
Clinical decisions regarding diagnosis or treatment are often based on one or more measured quantities such as blood pressure, tumour size, or the diameter of an aorta. To understand the limitations of using such measurements in clinical practice, it is important to quantify how much the measurements may vary.
For almost three decades, Bland-Altman plots have been the standard method for graphical assessment of agreement between continuous measurements made by two observers or methods on a number of subjects [1].
In particular, Bland-Altman plots are often used to assess how well a new measurement method compares to a current standard method. However, if the goal is to assess the variability of measurements made by different observers it is preferable to consider more than two observers.
This prompted Jones et al. to suggest an extension of Bland-Altman's graphical method for assessing limits of agreement between two observers to the limits of agreement with the mean (LOAM) for multiple observers [2]. Jones et al.'s LOAM have the advantage that they quantify agreement between measurements on the same scale as the measurements themselves, in contrast to the intra-class correlation (ICC) that has no unit of measure and always takes value between 0 and 1.
In more detail, consider a study where a continuous quantity is observed on a subjects by b observers (or methods). We let y ij denote an observation from a random variable Y ij , which models the measurement performed on the i th subject by the j th observer for i = 1, …, a and j = 1, …, b. Assuming no preferred observer, Jones et al. suggested to assess the agreement between measurements made by different observers by investigating how much the measurements vary around the subjectspecific average [2]. More formally, they were interested in how much the differences D ij ¼ Y ij − Y iÁ are likely to vary, where Y iÁ denotes the average measurement for subject i across the b observers. For visualising the data, Jones et al. propose to consider a plot of the observed differences d ij ¼ y ij − y iÁ against the observed subjectspecific average y iÁ . We will refer to this as an agreement plot. For an example of an agreement plot see Fig. 1 below. An agreement plot can, for example, help to detect whether the spread of the differences is associated to the size of the measurements, or, at least when a and b are not too large, whether some observers tend to always make large, small, or more varying measurements.
Further, Jones et al. equipped the agreement plot with horizontal lines representing the estimated 95% LOAM, which are given by ±1.96s, where s is the estimate of the residual standard deviation in a two-way analysis of variance (ANOVA) including subject and observer as fixed effects. Thus, s is only a measure of the residue variation left after accounting for possible subject and observer effects. On one hand, if there is a non-negligible observer effect, this should be included in the variability of the differences d ij when constructing the LOAM. On the other hand, in the (unrealistic) case of no variation due to observer the 95% LOAM lines suggested by Jones et al. are biased and inefficiently estimated, as it would be custom to refit the ANOVA model without the adjustment for observer effect and adjust the degrees of freedom for s accordingly.
In conclusion, although the method has gained an increasing interest over the years, Jones et al. did not provide a way to: 1) assess the variation of the LOAM estimate, 2) integrate variation due to different observers, and 3) extend the method to multiple observations per observer.
In this paper, we suggest formalising Jones et al.'s approach under a simple two-way random effects model which allows us to formulate a coherent statistical inference procedure for the LOAM. In addition, we provide not only an implementation in the statistical programming software R, but also simple formulae which can be implemented in, e.g., statistical programming languages, Excel, or automatic web-modules for data collection.

A revised version of the limits of agreement with the mean
We propose to derive LOAM assuming a random effects model for the measurements. Assuming a statistical model provides a theoretical framework in which the LOAM can be constructed in a transparent way and fur- thermore enables us to supply estimates and confidence intervals (CIs) for the LOAM.

Statistical model
In the following we assume the measurements follow a two-way random effects model given by where μ describes the overall mean, and A i , B j , and E ij are independent random variables following zero-mean normal distributions with variances σ 2 A , σ 2 B , and σ 2 E , respectively.
Under this model, measurements made by different observers are uncorrelated if they are on different subjects, while they are positively correlated with covariance σ 2 A for the same subjects. Further, the covariance between measurements made by the same observer for different subjects is σ 2 B . Note that the measurements are assumed to be homoscedastic, i.e. has common variance, where the common variance is given by σ 2 A þ σ 2 B þ σ 2 E : That is, the variance is split into three components: the inter-subject, inter-observer, and residual variance. Here we follow the convention of referring to the residual variance σ 2 E as the intra-observer variance. Further, note that we assume a balanced data setup, where each observer has evaluated all the subjects.

Proposed limits of agreement with the mean
Under the two-way random effects model stated in Eq.
(1), the difference between an individual measurement and the subject-specific mean, D ij , is normally distributed with mean zero and variance ðσ 2 B þ σ 2 E Þðb − 1Þ=b . Thus, under this model we expect 95% of these differences to be within the limits AE1:96 We propose the above as the 95% LOAM. To estimate σ 2 B and σ 2 E under the suggested two-way random effects model, we use the unbiased and consistent ANOVA estimates (see, e.g., Chapter 4 of Searle et al. [3]), given bŷ where MSB = SSB/ν B and MSE = SSE/ν E , with denoting the sums of squares for the observer and residual term, and ν B = b − 1 and ν E = (a − 1)(b − 1). Further, y iÁ , y • j , and y •• denote the subject-specific, observer-specific, and overall average, respectively. Using the estimates of σ 2 B and σ 2 E from Eq. (3), we obtain the following estimate of the 95% LOAM: where N = ab is the total number of measurements. For comparison, Jones et al.'s estimate of the LOAM is given by which does not include variation due to observers.

Confidence intervals
Instead of simply reporting the estimated LOAM given by Eq. (4), it is more informative to report CIs. However, as the distribution of the LOAM is quite complicated, we only supply approximate CIs. Graybill and Wang propose a method for constructing (approximate) efficient CIs for linear combinations of variances [4]. To construct CIs for the LOAM in Eq. (2), we first use the method by Graybill and Wang to construct a CI for the term inside the square root of the LOAM. Next, that CI is transformed into a CI for the upper LOAM by taking the square root and then multiplying by 1.96 (see Additional file 1 for details). The resulting approximate (and asymmetric) 95% CI for the upper 95% LOAM is given by Graybill and Wang for other choices of l x and h x [4]). Here F α; m, n is the α-quantile for the Fdistribution with m numerator and n denominator degrees of freedom. A 95% CI for the lower 95% LOAM is simply obtained by negation of the end points of the CI for the upper LOAM, that is, Simulations under the two-way random effects model in Eq. (1) indicate that the coverage probability for the approximate CI is in reality quite close to the wanted 95% even with a low number of observers (see Figure 1 in Additional file 2).

Sample size calculations
When planning an agreement study, it is often desirable to investigate how many measurements are necessary to obtain a certain level of precision in terms of a specified width of the CI for the LOAM. From Eq. (5) it is clear that the value of L and H determine the width of the CI for the LOAM; specifically, the CI gets narrower as L and H approaches zero. In turn, this happens when b is increased, since l x and h x approaches zero, when ν x increases for both x = B and x = E. Thus, to obtain a higher precision we have to increase the number of observers, b, while it is not enough to increase the number of subjects.
Therefore, assume we have a fixed number of subjects a we want to include in a future study to assess agreement between measurements. To determine the number of observers necessary to obtain a desired width W of the 95% CI, we require initial estimates of σ 2 B and σ 2 E , saŷ σ 2 B;0 andσ 2 E;0 , which can be obtained from, e.g., a pilot study. Exploiting the relations SSE ¼ ν Eσ we can express the width of the CI in Eq. (5) in terms of the variance estimates rather than the sum of squares. Further, we let the estimates be given by the initial estimatesσ 2 B;0 andσ 2 E;0 , and set the width equal to W. That is, we want to solve the following equation with respect to b: where Note that ν B , ν E , l B , l E , h B , and h E all depend on b. The equation can then be solved numerically with respect to b to find the number of observers needed to obtain an expected width W of the 95% CI for the 95% LOAM.

Inference on the variance components
In order to assess the extent of the inter-subject, interobserver, and intra-observer variations, we suggest to consider a 95% CI for σ A , σ B , and σ E , respectively.
If the ANOVA estimateσ 2 . Using the statistical delta method (see Additional file 3), we obtain the following approximate 95% CI for σ B : Results from a small simulation study investigating how well the actual coverage of the approximate confidence interval matches the desired coverage probability and how this depends on b and the true values of σ B and σ E can be found in the additional files (see Figure 2 in Additional file 2). In general, the approximation improves as b increases.
It might happen the estimateσ 2 B is negative due to negative correlation between observations made by the same observer on different subjects which will indicate a misspecification of the two-way random effects model formulated in Eq. (1). Negativity can also arise by sampling variation of the unbiased ANOVA estimates, we have used in this paper. Although it is tempting to suggest settingσ 2

B
to zero in such a case, this would introduce bias in the estimation. We therefore suggest to report the negative estimates, and recommend the researcher to comment on the possibility of negatively correlated measurements, and if that does not seem realistic, to assess whether the CIs are too wide to provide any clinically meaningful conclusion. It should be assessed whether more observers should be included to improve the precision of the estimate or whether the model is wrongly specified.
As the distribution ofσ 2 E is known in closed form, an exact asymmetric 95% CI can easily be constructed for σ E (see Additional file 3) and is given bŷ q and χ 2 α;ν E is the α-quantile of a χ 2 -distribution with ν E degrees of freedom.
To provide some context for the scale ofσ B andσ E ; it may also be constructive to ðy iÁ − y ÁÁ Þ 2 . The estimate of σ A may be accompanied by an (approximate) 95% CI, which can be constructed using the statistical delta method (see Additional file 3): Performing an agreement analysis To investigate agreement between observers, we propose first to make the agreement plot with the estimate and CI for the 95% LOAM from Sections 2.1.2-2.1.3, and to calculate the empirical means and standard deviations for the measurements conditional on observer or subject. Inspection of the agreement plot and the empirical means across subject, conditional on observer can be used to reveal whether any observers tend to make unusually large or small measurements. Further, the agreement plot and the conditional empirical standard deviations can be used to check whether the assumption of homoscedasticity of the random model is fulfilled. If the model in Eq. (1) is fitted using statistical software it is often possible to extract residuals and predictions of the observer and subject effects which can be used to check the model assumptions further. Specifically, one may, e.g., consider plots of the residuals against the fitted values, observer number, and subject number, respectively, to further investigate the homoscedasticity assumption. Further, a normal quantilequantile plot of the residuals as well as of the predictions of the observer and subject effects, respectively, can be used to investigate the normality assumptions. However, if the number of observers or subjects is low, an inspection of how the predictions are distributed may be pointless. See, for example, Section 4.3 in Pinheiro and Bates for a more detailed explanation and illustration of model diagnostics [5]. If it is concluded that the model assumptions are unreasonable, one could consider an appropriate transformation of the data or formulate a variance model to handle heteroscedasticity of the outcome [5] or one could consider using a generalised, linear, and mixed model to handle non-normal distribution of outcomes [6]. If the model seems reasonable, we report the estimate and CI for the LOAM. The clinician can then compare the estimated LOAM and associated CI to a clinically acceptable difference between measurements evaluated on the same subject. Whether or not the agreement between measurements is satisfactory depends both on the scale and clinical purpose of the measurements.
Next, we may calculate CIs for σ B and σ E , and use these along with the point estimates (σ 2 B andσ 2 E ) to compare the order of magnitude of the inter-observer variation with the intra-observer variation. In the rare case where the observer variation is negligible, the observer effect could in principle be removed from the random model, requiring that the CIs for the LOAM are adjusted accordingly (see Additional file 4).
The agreement analysis may be supplemented with an estimate and CI for the ICC, which is another measure for agreement based on the variance components. Various forms of ICCs are listed in McGraw and Wong for a range of models [7]. The two-way random effects model proposed in this paper corresponds to Case 2A in McGraw and Wong, with subject as row effect and observer as column effect, and ICC(A, 1) can then be used to assess absolute agreement of the measurements [7]. The plug-in estimate of ICC(A, 1) is easily calculated using the estimated variance components: We refer to Table 7 in McGraw and Wong for an approximate CI for ICC(A,1) [7].

Multiple measurements on each subject per observer
The proposed LOAM and their estimates and CIs can easily be extended to the case where each observer performs multiple measurements on every subject. If each observer performs c measurements on each subject, we extend the two-way random effects model to: where Y ijk is the k th measurement performed by the j th observer on the i th subject for i = 1, …, a, j = 1, …, b, and k = 1, …, c. Note that, conditional on observer and subject, the c repeated measurements are assumed to be independent and identically distributed.
Mimicking the arguments for the single measurement case, but now considering the differences D ijk ¼ Y ijk − Y iÁÁ ; we propose the following 95% LOAM: Again σ 2 A ; σ 2 B ; and σ 2 E are estimated by the ANOVA estimates (see, e.g., Chapter 4 of Searle et al. [3]), which are given bŷ Note that the overall, subject-specific, and observerspecific averages (y ⋯ ; y iÁÁ , and y Á jÁ ) are now also averaging across the multiple measurement index. With these definitions of SSB, SSE, ν B , and ν E and with N = abc, the LOAM estimate and CIs still have the form given by Eq. (4)- (5). For the sample size calculation summarised in Eq. (6)-(7), we furthermore replace a with ac.
Further, CIs for σ A , σ B , and σ E are obtained by Eq. (8)- (10), except that a is replaced with ac, b is replaced by bc, and the definition ofσ 2 A ;σ 2 B ;σ 2 E ; ν A ; ν B , and ν E has changed to the above.
Note that all formulas for the multiple measurement case reduce to those for the single measurement case, when c = 1.
As for the single measurement setup, the observations may be visualised using an agreement plot, where the observed differences d ijk ¼ y ijk − y iÁÁ are plotted against the subject-specific averages y iÁÁ .

Data and software
The statistical programming language R, version 3.6.1 [8], was used to analyse the data in the paper. An R-package, R-scripts, and the aortic data for the LOAM calculations in the present paper can be obtained from the GitHub repository: https://github.com/HaemAalborg/loamr.

Example 1
In a study b = 5 thoracic radiologists measured the diameter (in centimetres) of a = 40 lung tumours from computed tomography scans [9]. This study was also used as an example in Jones et al. [2]. Table 1 shows the empirical mean and standard deviation of the measurements across subject, conditional on radiologist, and Fig. 1 displays the agreement plot. Estimates and CIs of the 95% LOAM, ICC, σ A , σ B , and σ E are listed in Table 2. Neither the agreement plot nor the conditional empirical mean indicate any observer systematically making unusually small or large measurements. Further, there is no indication of heteroscedasticity in relation to change in observer or to the size of the tumour.
The estimated 95% LOAM are ±1.1 cm (95% CI: 1.0 cm to 1.8 cm); the estimate is identical with the 95% LOAM calculated by Jones et al.'s method when rounding to one decimal place. The inter-observer standard deviation estimate is 0.3 cm (95% CI: 0.1 cm to 0.5 cm), while the intra-observer standard deviation estimate is 0.6 cm (95% CI: 0.5 cm to 0.6 cm). Although on a scale comparable to the intra-observer variation, the interobserver variation is smaller, supporting the practice where lung nodule measurements are performed by different radiologists. We may also note that the intersubject variation (unsurprisingly) is larger than both the inter-and intra-observer variation.

Example 2
Borgbjerg et al. consider three methods (OTO, LTL, and ITI) for assessing the maximum antero-posterior abdominal aortic diameter [10]. A total of b = 12 radiologists measured the aortic diameter c = 2 times on a = 50 still abdominal aortic images to assess which of the three methods were most reliable.
Using the methods described in Section 2.2 for multiple measurements, we calculate estimates and CIs for the 95% LOAM, σ A , σ B , and σ E (see Table 3) and make an agreement plot (see Fig. 2). The inter-subject variation is large compared to both the inter-and intraobserver variation. The inter-observer variation is of the same order of magnitude as the intra-observer variation and should not be excluded. The LTL method has the largest estimated LOAM, meaning that measurements made by this method tend to vary more. Conversely, the ITI method has the smallest LOAM suggesting that this method has the highest reproducibility when taking into account both the inter-observer and intra-observer variation However, the wide CIs for the LOAM indicate that more observers may be needed to assess this properly. We found significantly less intra-observer variation for the LTL and ITI compared to the OTO method. This finding is in line with the conclusion by Borgbjerg et al. which suggests that it is advantageous to employ either   the ITI or LTL method when repeated measurements are performed by the same observer [10].

Discussion
In this study, we have defined the LOAM under the assumption of a two-way random effects model, with additive observer and subject effects. This allowed us to formulate a simple statistical inference procedure which can be easily implemented. The theory could be altered to cover various situations where the assumptions of the paper are not fulfilled. First, we include observers as a random effect, meaning that we consider the observers in a study to be a random sample from a larger population of observers that we want to make inference about. It is, however, not unlikely to have a study where the considered observers constitute the whole population of interest, in which case it may be more appropriate to include observers as a fixed effect. The LOAM presented in this paper is based on the variance of the difference between an individual measurement and the subject-specific mean. Under a model with observers as fixed effect, such a LOAM will no longer measure variation due to change of observer. Depending on the purpose of the agreement study, the estimated observer effects could then be included in a reformulation of the LOAM or considered separately. However, we believe that many studies are performed to investigate agreement not only between the specific observers but rather within a larger population of observers, encouraging the choice of model in this paper.
Second, one could imagine a situation where it is relevant to include an interaction term between subjects and observers, that is, modelling that observers may react differently upon the subjects. For single measurements this interaction effect is confounded with the residual error, but for multiple measurements this effect could in principle be modelled and the LOAM adjusted accordingly.
Third, the methods and formulae of this paper rely on the assumption of a balanced data setup, where all observers have evaluated all the subjects the same number of times. However, in practice it is not unlikely to encounter an unbalanced data set as measurements may get lost or not all observers were able to perform all measurements. An unbalanced setup is definitely more complicated to handle but some advances can be made. A new expression for the LOAM may be found under a two-way random model allowing unbalanced data, while existing methods for finding estimates of the variance components can be used to estimate the adjusted LOAM (see, e.g., [3,11]). However, it is in general not possible to obtain closed form expressions for the confidence intervals for the LOAM and variance components.
Fourth, as indicated in Section 2.1.5 it might happen that the estimateσ 2 B is negative due to negative correlation between observations made by the same observer on different subjects which will indicate a misspecification of the two-way random effects model formulated in Eq. (1). It is possible to generalise the theory by considering marginal modelling [12]. It was further indicated in Section 2.1.5 that negativity can also arise by sampling variation of the unbiased ANOVA estimates, we have used in this paper. Various approaches have been suggested to remedy this problem as well [13].
Pursuing these generalisations will, however, make modelling and implementation much more involved, and thereby violate our goal to formulate an easily implementable framework.

Conclusions
Our results show it is possible to formulate measures for the agreement with the mean between multiple