Power estimation of tests in log-linear non-uniform association models for ordinal agreement

Background: Log-linear association models have been extensively used to investigate the pattern of agreement between ordinal ratings. In 2007, log-linear non-uniform association models were introduced to estimate, from a cross-classification of two independent raters using an ordinal scale, varying degrees of distinguishability between distant and adjacent categories of the scale.
Methods: In this paper, a simple simulation-based method is proposed to estimate the power of non-uniform association models to detect heterogeneities across distinguishabilities between adjacent categories of an ordinal scale, illustrating some possible scale defects.
Results: Different scenarios of distinguishability patterns were investigated, as well as different scenarios of marginal heterogeneity within rater. For a sample size of N = 50, the probabilities of detecting heterogeneities within the tables were lower than .80, whatever the number of categories. In addition, even for large samples, marginal heterogeneities within raters led to a decrease in power estimates.
Conclusion: This paper provides some answers about how many objects have to be classified by two independent observers (or by the same observer at two different times) to be able to detect a given scale structure defect. Our results also highlight the importance of marginal homogeneity within raters to ensure optimal power when using non-uniform association models.


Background
Initially developed in psychometrics to assess the severity of behavioral troubles or disturbances [1][2][3], ordinal rating scales (ORS) are now essential tools in health research and health care, for example to measure clinical outcomes such as symptom grading [4], pathologists' findings [5], disease severity [6], treatment response [7][8][9], as well as health-related quality of life [10,11]. When the same objects are classified twice on a scale, differences in perception from one observer to another, or of the same observer at two successive times, lead to inter-rater and intra-rater variability. For patients, reproducibility of ratings made using an ORS is a major issue because their classification into one of the different categories may have important consequences for their therapeutic follow-up and possibly for their quality of life. There are two main components of reproducibility. The first component is marginal homogeneity between raters, which concerns the differences between the raters' marginal distributions and refers to the tendency of one rater to make classifications higher or lower than those of the other rater. The second component is category distinguishability, that is to say the ability of observers to distinguish between categories. Recently, non-uniform association (NUA) models were proposed by Valet et al. [12] to estimate degrees of distinguishability between adjacent categories of an ORS. These models make it possible to test different patterns of distinguishability and thus to give information on the quality of the scale structure.
When designing a reproducibility study with two observers (or one observer at two different times) assessing the same objects on an ORS, two major questions have to be answered: How many objects have to be classified by the two observers to be able to detect a given heterogeneous pattern of distinguishability between adjacent categories? Is it important to select these objects in an attempt to approximate some marginal distributions? In this study, simulations were used to estimate the power of non-uniform association models to detect heterogeneities across distinguishabilities between adjacent categories as a function of typical distinguishability patterns and of the total number of objects classified, assuming homogeneous marginal distributions within reader and between readers. Then, for the same numbers of objects classified twice, the influence of different patterns of marginal heterogeneity within reader on the power estimates was studied.

Log-linear non-uniform association models
Log-linear modelling and parameters interpretation
Classifications of N objects by two independent raters A and B (or by the same rater at two different times) using an ORS with I categories can be summarized in an I × I contingency table. In this table, let us define counts $n_{ij}$ as the numbers of objects rated i (i = 1,..., I) by observer A and j (j = 1,..., I) by observer B, and suppose that these counts have a full multinomial distribution with expected mean $m_{ij} = N \times \pi_{ij}$, where N is the sample size and $\pi_{ij}$ is a probability distribution on the cells of the I × I table. Log-linear modelling expresses the logarithm of these $m_{ij}$ as a linear combination of parameters that represent rater effects on categories, as well as sources of agreement and disagreement. For the independence model, which assumes that ratings are statistically independent, the model is written as:
$$\log m_{ij} = \mu + \lambda_i^A + \lambda_j^B \qquad (1)$$
where μ is the overall effect and $\lambda_i^A$ and $\lambda_j^B$ are the effects of raters A and B on categories i and j, respectively. For this model, agreement between raters is expected to be due to chance only.
When analyzing agreement in ordered contingency tables, we can usually expect an association between ratings due to the natural ordering of the scale. As described by several authors [12][13][14][15], this association between ratings is expected to increase as the distance between categories increases. For instance, on a five-level severity scale, if an object is rated "1" by A, the probability for this object to be rated "5" by B is very low [16]. This association can be expressed through the odds ratio $\tau_{ij} = m_{ii} m_{jj} / (m_{ij} m_{ji})$. An odds ratio equal to 1 indicates that the two ratings are independent. From the odds ratio $\tau_{ij}$, Darroch and McCloud defined $\nu_{ij} = 1 - \tau_{ij}^{-1}$ as the degree of distinguishability (DD) between two categories of an ORS, that is to say the readers' ability to distinguish between these two categories [17]. A DD value close to 1 indicates an almost perfect distinguishability between the two corresponding categories, whereas a DD value close to 0 indicates that these two categories are very hard to distinguish.
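The two definitions above can be sketched in a few lines of Python. This is only an illustration: the function names `odds_ratio` and `distinguishability` and the 3 × 3 table of expected counts are hypothetical and do not come from the paper.

```python
def odds_ratio(m, i, j):
    """Odds ratio tau_ij = m_ii * m_jj / (m_ij * m_ji), with 0-based category indices."""
    return (m[i][i] * m[j][j]) / (m[i][j] * m[j][i])


def distinguishability(m, i, j):
    """Degree of distinguishability nu_ij = 1 - 1 / tau_ij (Darroch and McCloud)."""
    return 1.0 - 1.0 / odds_ratio(m, i, j)


# Hypothetical table of expected counts (illustrative values only).
expected = [
    [30.0, 10.0, 2.0],
    [10.0, 30.0, 10.0],
    [2.0, 10.0, 30.0],
]

tau_12 = odds_ratio(expected, 0, 1)         # adjacent categories 1 and 2
nu_12 = distinguishability(expected, 0, 1)
tau_13 = odds_ratio(expected, 0, 2)         # distant categories 1 and 3
```

In this hypothetical table the odds ratio is larger for the distant pair (1, 3) than for the adjacent pair (1, 2), which is the behaviour expected on an ordered scale.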
Uniform Association (UA) and Non-Uniform Association (NUA) models
In order to take this association into account, Goodman introduced the uniform association (UA) model. In 2007, Valet et al. [12] proposed an equivalent but simpler parameterization of the UA model, referred to hereafter as equation (2), where i = 1,..., I and j = 1,..., I. From the UA model, odds ratios are written as $\tau_{ij} = e^{\beta (i-j)^2}$. Hence, DDs between two categories i and j are written as $\nu_{ij} = 1 - e^{-\beta (i-j)^2}$, so that the DDs between categories vary according to the distance between them. However, as pointed out by Valet et al. [12], the DDs between adjacent categories are supposed to be constant, which can be a limiting a priori hypothesis, since it assumes that the categories of the scale are regularly spaced in terms of distinguishabilities, a rather satisfying property for an ORS. They proposed log-linear non-uniform association (NUA) models to take into account the variations of the DDs between both distant and adjacent categories of an ORS. For ORS with I ≥ 3, NUA models are defined by equation (3), with association parameters $b_{k,k+1}$ (k = 1,..., I − 1). For this model, the DDs are written as functions of these parameters, illustrating the possible DD variations between categories, even between adjacent ones. NUA models are a generalization of UA models. Indeed, the UA model is a particular case of a NUA model in which the parameters $b_{k,k+1}$ are all equal (do not depend on k). Comparing the log-likelihoods of the data under the UA and NUA models allows us to test DD homogeneity between adjacent categories and can provide useful information on scale structure. See Valet et al. [12,16] for a complete description of the NUA models and the possible patterns of distinguishability that can be tested.
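As a check on the odds ratio stated for the UA model, the short derivation below starts from Goodman's classical linear-by-linear form with integer scores. This textbook form is assumed here for illustration only; it is not necessarily the parameterization used by Valet et al. [12], but it leads to the same odds ratios and DDs.

```latex
\begin{align*}
\log m_{ij} &= \mu + \lambda_i^A + \lambda_j^B + \beta\, i\, j, \\
\log \tau_{ij} &= \log m_{ii} + \log m_{jj} - \log m_{ij} - \log m_{ji} \\
               &= \beta\,(i^2 + j^2 - 2ij) = \beta\,(i-j)^2, \\
\nu_{ij} &= 1 - \tau_{ij}^{-1} = 1 - e^{-\beta (i-j)^2}.
\end{align*}
```

The overall effect and the rater effects cancel in the odds ratio, so only the association term remains. For adjacent categories this gives $\tau_{k,k+1} = e^{\beta}$, so that, for example, $\beta = \log(3)$ corresponds to a DD of 2/3.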

Power estimation of tests in NUA models
To investigate the ability of NUA models to detect heterogeneities within the DDs between adjacent categories, a simple method was proposed to simulate ordered contingency tables resulting from the use of ORS having different patterns of distinguishability between their adjacent categories. Hereafter, tests were defined for a null hypothesis $H_0$ corresponding to the UA model defined by equation (2), and alternative hypotheses $H_1$ corresponding to NUA models defined by equation (3). Different scenarios of DD heterogeneity were proposed to illustrate different typical scale structures. In all situations, marginal homogeneity between readers was assumed (equation (4)), that is, both readers were assumed to share the same marginal distribution.
Simulation of I × I contingency tables from the NUA models
The total sample size N was fixed, but the row and column totals were not. Counts $n_{ij}$ were drawn from a full multinomial distribution $M(\pi_{ij}, N)$. In order to simulate different patterns of DD heterogeneity between adjacent categories, theoretical probabilities $\pi_{ij}$ were defined, using equation (3), as a function of the parameters of the NUA model (equation (5)). When N and the association parameters $b_{k,k+1}$ (k = 1,..., I − 1) are fixed, the probabilities $\pi_{ij}$ depend only on the unknown parameters μ and $l_i$ (i = 1,..., I). These I + 1 unknown parameters can be defined as the solutions of a non-linear system of I + 1 equations (system (6)). The first set of equations of this system allows us to control the marginal probability distribution during simulations, i.e. to control the marginal probabilities $\pi_i^S$ (the superscript "S" stands for simulations). The second condition of the system ensures that μ remains the overall effect [18]. As the number of equations is equal to the number of unknown parameters, the system can be easily solved using classical root-finding algorithms for nonlinear systems, such as the well-known Newton-Krylov method [19,20].
However, in this paper, a more recent method proposed by Lacruz et al. [21] was used. This "non-monotone spectral residual" method finds roots of nonlinear systems without using gradient information, and it was shown to be competitive with, and frequently better than, usual algorithms.
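The idea of fixing the marginal probabilities while keeping a given association structure can be sketched as follows. This is a minimal illustration under loudly stated assumptions, not the authors' implementation. First, the association term is assumed to be $\delta_{ij} = -\frac{1}{2}\left(\sum_k \sqrt{b_{k,k+1}}\right)^2$, with the sum taken over the adjacent pairs lying between categories i and j; this form is chosen only because it reproduces the two properties stated in the text (adjacent odds ratios equal to $e^{b_{k,k+1}}$, and the UA odds ratios when all parameters are equal), and the authors' equation (3) may differ. Second, the overall-effect condition is assumed to be $\sum_i l_i = 0$. Third, a simple damped symmetric scaling iteration is swapped in for the spectral residual solver of [21]. The names `nua_kernel` and `cell_probabilities` are hypothetical.

```python
import math


def nua_kernel(b):
    """Association kernel exp(delta_ij) under the ASSUMED form
    delta_ij = -0.5 * (sum of sqrt(b_k) between categories i and j) ** 2.
    Requires b_k >= 0, i.e. adjacent odds ratios >= 1."""
    # Place each category on a latent line using cumulative sums of sqrt(b_k).
    x = [0.0]
    for bk in b:
        x.append(x[-1] + math.sqrt(bk))
    return [[math.exp(-0.5 * (xi - xj) ** 2) for xj in x] for xi in x]


def cell_probabilities(b, target, n_iter=500, tol=1e-12):
    """Find pi_ij = s_i * a_ij * s_j whose row and column sums equal `target`."""
    a = nua_kernel(b)
    size = len(target)
    s = [1.0] * size
    for _ in range(n_iter):
        a_s = [sum(a[i][j] * s[j] for j in range(size)) for i in range(size)]
        # Damped update: geometric mean of the current s_i and target_i / (A s)_i.
        new_s = [math.sqrt(s[i] * target[i] / a_s[i]) for i in range(size)]
        change = max(abs(new_s[i] - s[i]) for i in range(size))
        s = new_s
        if change < tol:
            break
    pi = [[s[i] * a[i][j] * s[j] for j in range(size)] for i in range(size)]
    # Recover mu and l_i on the probability scale, ASSUMING sum(l_i) = 0.
    log_s = [math.log(v) for v in s]
    mean_log_s = sum(log_s) / size
    mu = 2.0 * mean_log_s
    l_eff = [v - mean_log_s for v in log_s]
    return pi, mu, l_eff


# Hypothetical scenario: adjacent odds ratios 3, 3, 12, 3 and uniform marginals.
b = [math.log(3), math.log(3), math.log(12), math.log(3)]
target = [0.2] * 5
pi, mu, l_eff = cell_probabilities(b, target)
row_sums = [sum(row) for row in pi]
tau_34 = pi[2][2] * pi[3][3] / (pi[2][3] * pi[3][2])
```

Because the resulting table is symmetric, row and column marginals coincide, which matches the assumption of marginal homogeneity between readers. On the count scale, log(N) would be added to `mu`. A general-purpose root finder could be used instead; for instance, SciPy's `scipy.optimize.root` offers a `df-sane` method based on the same family of spectral residual algorithms.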
Many different scenarios of distinguishability patterns can be simulated, using different sets of $\{b_{k,k+1}; k = 1, \ldots, I-1\}$ in the NUA model. If we aimed to test all possible patterns of distinguishability, we would have to compare the null UA model (all $b_{k,k+1}$ are equal) with NUA models for all possible combinations of association parameters, i.e. to test all possible equalities between association parameters. For example, testing the equality of exactly B (B = 2,..., I − 1) association parameters in a NUA model with I − 1 association parameters would already yield a large number of comparisons. However, our aim was not to simulate all possible patterns of distinguishability exhaustively, but credible patterns corresponding to typical scale structures in inter- or intra-observer variation studies. Therefore, as defined in Valet et al. [12], only combinations of "symmetric" and "close" association parameters were considered, that is to say NUA models where equality of some symmetric and close association parameters was assumed, respectively.

Definition of alternative hypotheses
For simplicity, we will consider hereafter contingency tables resulting from the use of ORS with I = 5 categories. The generalization to any I × I contingency table is straightforward. To exemplify our simulation scenarios, examples of the different values of association parameters that can be simulated in the case of a 5 × 5 contingency table are described in table 1.
Starting from the UA model where all association parameters are equal ($H_0$ hypothesis), a different value for just one association parameter ($H_1^1$ hypotheses) can be used to account for a scale defect between two categories only (categories are regularly spaced along the scale in terms of distinguishabilities, except two). Equal values for symmetric association parameters (for instance, it is easier to distinguish extreme categories than intermediate categories) or for close association parameters (for instance, it is easier to distinguish lower categories on the scale than upper categories) can also be used, as described by hypotheses $H_1^2$. Finally, taking different values for all association parameters ($H_1^3$ hypothesis) illustrates an ORS where all categories are irregularly spaced in terms of distinguishabilities.

Distribution of marginal probabilities
In addition to the different sets of distinguishability values, i.e. different sets $\{b_{k,k+1}; k = 1, \ldots, 4\}$ illustrating the different alternative hypotheses that can be tested, different sets of marginal probabilities $\{\pi_i^S; i = 1, \ldots, 5\}$ were assumed for each alternative hypothesis, to investigate the possible effects of marginal distribution heterogeneity within reader on the ability of NUA models to detect significant DD heterogeneities. These distributions were chosen in order to illustrate different realistic marginal distributions that can be observed in contingency tables resulting from the classification of objects on an ORS. These different sets of marginal probabilities are described in table 2. The first set corresponds to a homogeneous distribution of marginal probabilities. Then, the next three sets correspond to homogeneous distributions except for one category with a low prevalence. The fourth and the fifth sets correspond to homogeneous distributions except for two extreme or intermediate categories with low prevalences. The last set corresponds to a heterogeneous marginal distribution.

Power and Type I error estimation
For each specific set of $\{b_{k,k+1}; k = 1, \ldots, 4\}$ and $\{\pi_i^S; i = 1, \ldots, 5\}$, parameters μ and $l_i$ were calculated using the non-linear system defined by (6). Probabilities $\pi_{ij}$ of the multinomial distribution were calculated from equation (5), using the specific set of $\{b_{k,k+1}; k = 1, \ldots, 4\}$ and the previously calculated values of μ and $l_i$. Then, 10000 simulations of 5 × 5 contingency tables summarizing classifications of N objects were drawn. The same null hypothesis of equal DDs between all adjacent categories was used. For this null hypothesis, a common value $b_{1,2} = b_{2,3} = b_{3,4} = b_{4,5} = \log(3)$ was chosen, corresponding to similar association between adjacent ratings ($\tau_{1,2} = \tau_{2,3} = \tau_{3,4} = \tau_{4,5} = 3$) and hence similar DDs between all adjacent categories. To account for different null hypotheses, we also considered the common values $b_{1,2} = b_{2,3} = b_{3,4} = b_{4,5} = \log(2)$ and $b_{1,2} = b_{2,3} = b_{3,4} = b_{4,5} = \log(4)$. For each simulation, the log-likelihoods of the UA model ($H_0$) and of the NUA model defined by $H_1$ were calculated. As proposed by several authors [12,18], the $G^2$ likelihood-ratio statistic was used to compare these two models. Indeed, we used the difference statistic $\Delta G^2 = G^2_{UA} - G^2_{NUA}$, which is referred to a $\chi^2$ distribution with $\Delta df$ degrees of freedom. For hypotheses $H_1^1$, $H_1^2$ and $H_1^3$, differences $\Delta df$ were equal to 1, 1 and 3, respectively. For each scenario, power was estimated as the proportion of significant NUA models when applied to contingency tables simulated under the same alternative hypothesis. Type I error α was estimated as the proportion of significant NUA models when applied to contingency tables simulated under the null hypothesis.
Table 1. Examples of association parameters and distinguishability patterns between adjacent categories from NUA models in a 5 × 5 contingency table (columns: Hypothesis, Association parameters, Distinguishability patterns).
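The building blocks of the Monte Carlo procedure described in this section (drawing a table from a full multinomial distribution, computing a $G^2$ statistic, and estimating power or type I error as a rejection proportion) can be sketched as follows. The fitting of the UA and NUA models is deliberately left out, so this is not a complete power study. The function names are hypothetical, and the 5% significance level is an assumption, since the level used by the authors is not stated in the text.

```python
import math
import random


def draw_table(pi, n, rng):
    """Draw one I x I table of counts from a full multinomial M(pi, n)."""
    size = len(pi)
    cells = [(i, j) for i in range(size) for j in range(size)]
    weights = [pi[i][j] for i, j in cells]
    table = [[0] * size for _ in range(size)]
    for i, j in rng.choices(cells, weights=weights, k=n):
        table[i][j] += 1
    return table


def g2(observed, fitted):
    """G2 = 2 * sum n_ij * log(n_ij / m_ij); empty cells contribute 0."""
    total = 0.0
    for obs_row, fit_row in zip(observed, fitted):
        for n_ij, m_ij in zip(obs_row, fit_row):
            if n_ij > 0:
                total += n_ij * math.log(n_ij / m_ij)
    return 2.0 * total


# Upper 5% points of the chi-square distribution (ASSUMED significance level).
CHI2_CRIT_05 = {1: 3.841, 3: 7.815}


def rejection_rate(delta_g2_values, delta_df):
    """Proportion of simulated tables for which Delta G2 exceeds the critical value."""
    crit = CHI2_CRIT_05[delta_df]
    return sum(1 for v in delta_g2_values if v > crit) / len(delta_g2_values)


rng = random.Random(1)
uniform_pi = [[1.0 / 25] * 5 for _ in range(5)]    # hypothetical cell probabilities
table = draw_table(uniform_pi, 250, rng)
total_count = sum(sum(row) for row in table)
rate = rejection_rate([0.5, 4.2, 10.0, 3.0], delta_df=1)   # hypothetical Delta G2 values
```

Applied to $\Delta G^2$ values obtained from tables simulated under an alternative hypothesis, `rejection_rate` estimates power; applied to tables simulated under the null hypothesis, it estimates the type I error.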

Results
All simulations and power estimations were performed using R software [22]. Association parameters were equal to log(3) under the null hypothesis (i.e. OR equal to 3) and, for each alternative hypothesis, the values K of the tested OR ranged from 1 to 16, which corresponds to association parameters ranging from log(1) = 0 to log(16) = 2.77. Thus, for a specific alternative hypothesis, each specific set of association parameters $\{b_{k,k+1}; k = 1, \ldots, 4\}$ contained some fixed parameters equal to log(3), depicting the null hypothesis, and some varying parameters ranging from 0 to 2.77, depicting the alternative hypotheses. Simulation results are first displayed in Figure 1, which illustrates, for each simulated scenario, the power estimates of tests with alternative hypotheses corresponding to the different NUA models tested. In other words, this figure represents the probability of finding significant heterogeneities within the DDs between adjacent categories, according to the total sample size N, for three different alternative hypotheses and for different values K of the tested OR. The left panel (Figure 1, examples a. to c.) corresponds to simulated scenarios with homogeneous marginal distributions within rater, whereas the right panel (Figure 1, examples d. to f.) corresponds to simulated scenarios with three different sets of heterogeneous marginal distributions. We can observe that power estimates were consistently lower in scenarios with heterogeneous marginal distributions (right panel) than in those with homogeneous marginal distributions (left panel). In some cases, the influence of marginal distribution heterogeneity was even drastic and strongly penalized the ability of NUA models to detect significant heterogeneities within DDs between adjacent categories (Figure 1, example d.). For total sample sizes of N ≤ 100, we can also note that none of the simulated scenarios provided power estimates greater than 80%.
Conversely, except for the example given in Figure 1, example d., power estimates were greater than 80% for tested OR K ≥ 12, for all the tested hypotheses. Power estimates are also given in table 3. As in Figure 1, this table shows power estimates as a function of N, the three different alternative hypotheses, and the different values K of the tested OR. In a similar way, the left panel corresponds to simulated scenarios with a homogeneous marginal distribution, whereas the right panel corresponds to different situations of heterogeneity within marginal distributions. For example, from the null hypothesis that all OR are equal to 3, i.e. DDs between all adjacent categories equal to 2/3, the power estimates of the test corresponding to i) an alternative given by $H_1^{11}: b_{1,2} \neq b_{2,3} = b_{3,4} = b_{4,5}$, ii) a homogeneous marginal distribution, and iii) a total sample size equal to N = 250, are greater than 80% for OR greater than or equal to 10.
In other words, for N = 250, NUA models are able to detect, with a probability greater than 80%, a DD between adjacent categories 1 and 2 greater than 1 − 1/10 = .90. For the left panel of this table and for the $H_1^{11}$ hypothesis of a different DD between the first two adjacent categories as compared to the others, NUA models are able to detect with a probability greater than 80%: a null DD or DDs greater than .92 for N ≥ 200, and DDs greater than .94 for N ≥ 150. In a similar way, for N = 200, NUA models are able to detect different DDs between close and symmetric adjacent categories ($H_1^{21}$ and $H_1^{22}$, respectively) with a probability greater than 80% for a null DD or DDs greater than .90.
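The DD values quoted in this paragraph follow from the relation $\nu = 1 - \tau^{-1}$ applied to a tested odds ratio K between two adjacent categories. A short worked conversion is given below, with values rounded to two decimals; reading .92 and .94 as K = 12 and K = 16 is an assumption based on this rounding.

```latex
\begin{align*}
\nu_{k,k+1} &= 1 - \frac{1}{K}, \\
K = 1  &\;\Rightarrow\; \nu = 0 \quad \text{(null DD)}, \\
K = 3  &\;\Rightarrow\; \nu = 2/3 \approx .67 \quad \text{(null hypothesis)}, \\
K = 10 &\;\Rightarrow\; \nu = .90, \\
K = 12 &\;\Rightarrow\; \nu \approx .92, \\
K = 16 &\;\Rightarrow\; \nu \approx .94.
\end{align*}
```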
It is clear that table 3 does not provide power estimates for all possible values of the association parameters tested, and hence for all decimal values between log(K) = 0 and log(K) = 2.77. However, interpolation of the power estimate for a specific value of the association parameter is straightforward. From table 3, suppose for example that we want to calculate the required sample size for a com-
In a similar way, tables 4 and 5 provide power estimates for the same three alternative hypotheses, this time considering the null hypotheses that all OR are equal to 2 and 4, respectively. These tables allow the reader to estimate power for different null hypotheses through interpolation. Supplementary tables are also provided for 4 × 4 (Additional file 1: table S1) and 6 × 6 (Additional file 1: table S2) contingency tables. In addition, results for different alternative hypotheses, as well as for different scenarios and sample sizes, can be provided on request to the authors.
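The interpolation mentioned above can be sketched as a simple linear interpolation between two tabulated points. All numerical values below are hypothetical placeholders and are not taken from tables 3 to 5. Linearity between neighbouring table entries is only an approximation, and the helper name `interpolate` is illustrative.

```python
import math


def interpolate(x, x0, y0, x1, y1):
    """Linear interpolation of y at x between the points (x0, y0) and (x1, y1)."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# HYPOTHETICAL power estimates for tested OR K = 8 and K = 10 at a fixed N.
power_k8, power_k10 = 0.70, 0.82
# Interpolate on the association-parameter scale, i.e. on log(K), for K = 9.
power_k9 = interpolate(math.log(9), math.log(8), power_k8, math.log(10), power_k10)

# HYPOTHETICAL power estimates for N = 200 and N = 250 at a fixed K.
power_n200, power_n250 = 0.76, 0.84
# Inverse use: approximate sample size giving a power of 0.80.
n_required = interpolate(0.80, power_n200, 200, power_n250, 250)
```

In practice the interpolated sample size would be rounded up to the next integer, and a dedicated simulation at that sample size remains the safer check.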

Discussion
Results given by Figure 1 and tables 3 to 5 highlighted the strong influence that marginal heterogeneity within reader may have on power estimates of tests in NUA models. Conversely, when assuming marginal homogeneity within reader, NUA models are able to detect, from a null hypothesis of a DD equal to 2/3 between all adjacent categories and for a reasonable value of N = 200, a null DD (between two or three categories) with a probability greater than 80%. For a five-level scale with an equal DD of 2/3 between its adjacent categories, NUA models are hence able to detect two or more confusing categories with a satisfying power. In the same way, for N = 200, NUA models are able to detect with good power two or more adjacent categories (close or symmetric) for which the DDs are greater than or equal to .92.
In our simulations of contingency tables resulting from cross-classifications of the same objects twice on an ordinal rating scale, marginal homogeneity between readers was assumed, which can be seen as a limiting constraint. However, as described
For each simulation, the algorithm of Lacruz et al. [21] was used to estimate parameters μ and $l_i$. Like many other systems, this system of non-linear equations appeared to be very sensitive to initial values. In order to handle this problem and to avoid local optima, the solutions μ and $l_i$ of each system associated with a specific value K of the tested OR were used as initial values for the following system, with the next tested K value.
In this simulation study we presented three alternative hypotheses illustrating different patterns of distinguishability between adjacent categories. The first tested hypothesis $H_1^{11}$ (DD between categories 1 and 2 different from the others), the corresponding symmetric hypothesis (DD between categories 4 and 5 different from the others), and the last hypothesis $H_1^{22}$ (DDs between extreme adjacent categories different from the others) allow the detection of significant differences between extreme adjacent categories (1 and 2, 4 and 5, or both) and the other intermediate ones. This is a usual pattern in ordinal rating scales, as the first category often corresponds to "no intensity" and the last one often corresponds to the "highest intensity" of the measured phenomenon. These two extreme adjacent categories are more likely to be distinguishable than the others because they correspond to extreme situations. Finally, the second hypothesis $H_1^{21}$ (DDs between close adjacent categories from 1 to 3 different from the others) and the corresponding symmetric one (DDs between close adjacent categories from 3 to 5 different from the others) allow the detection of higher or lower DDs between some close adjacent categories of the scale. This can also be a typical pattern, corresponding for example to an ordinal scale where some consecutive grades show many similarities and may be hard to distinguish.

Conclusions
In this paper we proposed a new, simple simulation-based method to estimate the power of tests in log-linear non-uniform association models. To this aim, we first presented a method to simulate contingency tables resulting from cross-classifications of the same objects, using ordinal rating scales having different patterns of distinguishability between their adjacent categories. Then, taking typical situations of scale structures, we proposed a table summarizing the main effects of sample size, alternative hypotheses and marginal distributions on power estimates for the detection of DD heterogeneities within the scale structure. Results were given for three typical alternative hypotheses, in the case of 5 × 5 contingency tables.
In health research, assessments of disease severity or of patients' well-being are increasingly performed using ordinal rating scales. One of the major components of an ordinal scale is the distinguishability between its adjacent categories. Using a simple simulation-based method, this paper provides some answers about how many objects have to be classified by two observers to be able to detect a given scale structure defect, which may be of prime interest to improve ordinal scale quality and hence the other assessments made using this scale.