An accurate test for homogeneity of odds ratios based on Cochran’s Q-statistic

Background A frequently used statistic for testing homogeneity in a meta-analysis of K independent studies is Cochran’s Q. For a standard test of homogeneity the Q statistic is referred to a chi-square distribution with K−1 degrees of freedom. For the situation in which the effects of the studies are logarithms of odds ratios, the chi-square distribution is much too conservative for moderate size studies, although it may be asymptotically correct as the individual studies become large. Methods Using a mixture of theoretical results and simulations, we provide formulas to estimate the shape and scale parameters of a gamma distribution to fit the distribution of Q. Results Simulation studies show that the gamma distribution is a good approximation to the distribution for Q. Conclusions Use of the gamma distribution instead of the chi-square distribution for Q should eliminate inaccurate inferences in assessing homogeneity in a meta-analysis. (A computer program for implementing this test is provided.) This hypothesis test is competitive with the Breslow-Day test both in accuracy of level and in power. Electronic supplementary material The online version of this article (doi:10.1186/s12874-015-0034-x) contains supplementary material, which is available to authorized users.


Background
The combination of the results of several similar studies has many applications in statistical practice, notably in the meta-analysis of medical and social science studies and also in multi-center medical trials. An important first step in such a combination is to decide whether the several studies are sufficiently similar. This decision is often accomplished via a so-called test of homogeneity. The outcomes of the studies may be expressed in a variety of effect measures, such as: sample means; odds ratios, relative risks or risk differences arising from 2 × 2 tables; standardized mean differences of two arms of the studies; and many more. A variety of statistics for use in tests of homogeneity have been proposed; some are specific to the type of effect measure, and some are applicable to several measures. This paper has its main focus on the test statistic first introduced by Cochran [1] and [2] and its application to testing homogeneity when the effects of interest are *Correspondence: e.kulinskaya@uea.ac.uk 1 School of Computing Sciences, University of East Anglia, NR4 7TJ Norwich, UK Full list of author information is available at the end of the article odds ratios arising from experiments with dichotomous outcomes in treatment and control arms. Cochran's Q statistic is defined by Q = i w i ( θ i − θ w ) 2 where θ i is the effect estimator of the ith study, θ w = i w i θ i / i w i is the weighted average of the estimators of the effects, and the weight w i is the inverse of the variance estimator of ith effect estimator. The use of inverse variance weights has the appealing feature of weighting larger and more accurate studies more heavily in the weighted mean θ w and in the statistic Q. This statistic was investigated for the case that the study effects are normally distributed sample means by Cochran and also by Welch [3] and James [4]. Perhaps the first application of the Q statistic to testing homogeneity of the logarithm of odds ratios is due to Woolf in 1955 [5]. DerSimonian and Laird [6] extended the use of Q for studies with binomial outcomes to difference of proportions as well as to log odds ratios in the context of the random effects model in which the studies are assumed to be sampled from a hypothetical population of potential studies. However, the use of Q in a test of homogeneity is the same whether a random effects or fixed effects model is used.
Under fairly general conditions, in the absence of heterogeneity, Q will follow asymptotically (as the individual studies become large) the chi-square distribution with K − 1 degrees of freedom where K is the number of studies. It is common practice to assume that Q has this null distribution, regardless of the sizes of the individual studies or the effect measure. But this null distribution is inaccurate (except asymptotically), and its use causes inferences based on Q to be inaccurate. This conclusion of inaccuracy should also apply to inferences based on any statistics which are derived from Q, such as the I 2 statistic (see [7] and [8]). Little is known of a theoretical nature about the null distribution of Q under non-asymptotic conditions. In our previous work, together with Bjørkestøl, we have provided improved approximations to the null distribution of Q when the effect measure of interest is the standardized mean difference [9] and the risk difference [10]. In this paper we use a combination of theoretical and simulation results to estimate the mean and variance of Q when the effects are logarithms of odds ratios. We use these estimated moments to approximate the null distribution of Q by a gamma distribution and then apply that distribution in a homogeneity test based on Q (to be denoted Q γ ) that is substantially more accurate than the use of the chi-square distribution. We also compare the accuracy and power of this test with those of other homogeneity tests, such as that of Breslow and Day [11]. Briefly, both the accuracy and the power of our test are comparable to those of the Breslow-Day test (see Sections "Accuracy of the level of the homogeneity test" and "Power of the homogeneity test)".
After introducing notation and the main assumptions in Section "Notation and assumptions", we proceed to our study of the moments of Q for log odds ratios in Section "The mean and variance of Q" and to their estimation in Section "Estimating the moments and distribution of Q LOR ". Results of our simulations of the achieved level and power of the standard Q test, the Breslow-Day test and the proposed improved test of homogeneity based on Q γ are given in Sections "Accuracy of the level of the homogeneity test" and "Power of the homogeneity test". Section "Example: a meta-analysis of Stead et al. (2013)" contains an example from the medical literature to illustrate our results and to compare them to other tests. Section "Conclusions" contains a discussion and summary of our conclusions. We provide information on the design of our simulations in the Appendix; and more results of the simulations for various sample sizes, including unbalanced designs and unequal effects, are contained in the accompanying 'Further Appendices', together with additional information about the derivation of our procedures. Our R program for calculation of the Q γ test of homogeneity can be downloaded from the Journal website.

Notation and assumptions
We assume that there are K studies each with two arms, which we call 'treatment' and 'control' and use the subscripts T and C. The sizes of the arms of the ith study are n Ti and n Ci ; let N i = n Ti + n Ci and let q i = n Ci /N i . Data in the arms have binomial distributions with probabilities p Ti and p Ci . The effect of interest is the logarithm of the odds ratio The null hypothesis to be tested is the equality of the odds ratios (or equivalently their logarithms) across the several studies, i.e., θ 1 = · · · = θ K := θ.
To estimate θ i , we follow Gart, Pettigrew and Thomas [12] who showed that if x successes occur from the binomial distribution Bin(n; p), then among the estimators of , the estimator with a = 1/2 has minimum asymptotic bias; and indeed, this is the only choice of a for which all terms for the bias in the expansion of L a (x) having order O(1/n) vanish. Gart et al. [12] also show that and suggest the use of the following unbiased estimator of the variance: Accordingly, if x i and y i are the number of successes in the treatment and control arms of the ith study, we estimate θ i by We estimate the variance of θ i by A weight w i is assigned to the ith study as the inverse of the variance of θ i , and the weight is estimated by w i = Var[ θ i ] −1 . The weighted average of the log odds ratio effects is given by θ w = i w i θ i / i w i . Then Cochran's Q statistic is defined as the weighted sum of the squared deviations of the individual effects from the average; that is, The "standard" version of the Q statistic, denoted Q stand does not add 1/2 to the number of events in both arms when calculating log-odds unless this is required to define their variances.
The distribution of Q under the null hypothesis of equality of the effects θ i depends on the value of the common effect θ, the number of studies K and the sample sizes n Ti and n Ci . However additional information is needed to specify a unique distribution for Q. For example, the common effect θ = 0 (that is, the probabilities for the treatment and control arms are equal), could arise with all probabilities equal to 1/2 (in both arms of all studies) or with some of the studies having probabilities of 1/4 in both arms and others having probabilities of 1/3 in both arms. To uniquely specify a distribution for Q, we need to introduce a 'nuisance' parameter ζ i for each study. It is convenient to take ζ i = log[ p Ci /(1 − p Ci )] to be the log odds for the control arm of the ith study and to estimate it as described above, i.e., ζ i = L 1/2 (y i ).

The mean and variance of Q
The Q statistic has long been known to behave asymptotically, as the sample sizes become large, as a chi-square distributed random variable with mean K −1 and variance, which is necessarily twice the mean, 2(K − 1). However, the choice of effect (e.g., log odds ratio, sample mean, standardized mean difference) has a substantial impact on the distribution of Q for small to moderate sample sizes, which in turn affects the use of Q as a statistic for a test of homogeneity. For this section, we shall use the notation Q SM for Q when the effect is a normally distributed sample mean and Q LOR when the effect is the logarithm of the odds ratio.
Assuming that the data from the studies are distributed N(μ, σ 2 i ), Welch [3] and James [4] first studied the moments of Q SM under the null hypothesis of homogeneity; using the normality properties, they calculated asymptotic expansions for the mean and variance of Q SM , and Welch matched these moments to those of a re-scaled F-distribution to create a homogeneity test now known as the Welch test. It is useful, for comparison with Q LOR , to examine Welch's mean and variance for Q SM . Omitting terms of order 1/n 2 i and smaller, Welch found where W is the sum of the "theoretical" weights n i /σ 2 i . Notice the following facts about these moments. 1) They converge to the chi-square moments as the sample sizes increase. 2) Both moments are larger than the corresponding chi-square moments. We shall call the difference between the moments of Q and the corresponding chisquare moments: 'corrections' . 3) The variance is more than twice the mean. 4) The moments depend on the nuisance parameters σ 2 i , which are estimated independently of the effects of interest (the sample means).
Based on a combination of theoretical expansions and extensive simulations, we have determined that, when the effect entering into the definition of Q is the log odds ratio, the mean and variance of Q LOR (under the null hypothesis of equal odds ratios) have the following properties. 1) They each converge to the corresponding chi-square moments of K − 1 and 2(K − 1) as the sample sizes increase. 2) Both moments are less than the corresponding chi-square moments. That is, the 'corrections' are negative rather than positive as for Q SM . 3) The variance is not only less than the chi-square variance, it is less than twice the mean. 4) The moments depend on nuisance parameters, which are not independent of the effects.
The two plots of Figure 1 show the relation of the variance of Q LOR to its mean for a representative set of simulations. (See Appendix A for a complete description of the simulations conducted). The two plots have identical data, but the points are colored according to the value of N in the left plot and according to the value of K in the right plot. The mean and variance of Q LOR have been divided by K − 1 in order to place the data on the same scale. The main message of the right plot (and a key finding of our simulations) is that this re-scaling is effective-the different values of K (5, 10, 20 and 40) are fairly uniformly distributed throughout the plot, indicating that after this re-scaling the moments of Q LOR have little dependence on the number of studies.
In the plots, we see that the mean of Q LOR is less than K − 1, that the variance of Q LOR is less than 2(K − 1), and that the variance is less than twice the mean. We also see in the left plot that the departure of the mean and variance from the chi-square values of K − 1 and 2(K − 1) (that is, the 'corrections') are greater for the study size N = 90 (i.e., 45 in each arm) than for the study size N = 150. It is not evident from the graphs, but the 'corrections' needed are also greater when the binomial probabilities p T and p C are more distant from the central value of 1/2.

Estimating the moments and distribution of Q LOR
In this section, we outline a method for estimating the mean and variance of Q LOR . The method involves fairly complicated formulas, but in the Appendix we provide more details and a link to a program in R for carrying out the calculations.
Kulinskaya et al. [10] presented a very general expansion for the mean of Q for arbitrary effect measures in terms of the first four central moments of the effect and nuisance parameters as well as the weight function expressed in terms of these parameters.
Necessary formulas for the application of this expansion to the first moment of Q LOR can be found in Appendix B.2. The resulting expansion provides an approximation to the mean of Q LOR , which we will denote E th [ Q LOR ] where the subscript 'th' indicates that this expectation is entirely theoretical. It depends on the number of studies K, the sample sizes of the separate arms of the studies, the estimated values of the nuisance parameters ζ i , the values of the estimated weights and the estimated value of the effect θ under the null hypothesis.
When we compared E th [ Q LOR ] with the simulated values for the mean of Q LOR , we found that it does an excellent job of identifying the situations where 'corrections' are needed to the chi-square moment, but that it over-estimates the size of the 'correction' by a constant percentage of slightly more than 1/3 (R 2 = 97.0%). More precisely, denoting the mean of Q LOR by E[ Q LOR ], we have the relation Although this equation is based partly on theoretical calculations and partly on the results of simulations (the "0.687" factor), we note that after deciding on the use of the "0.687" factor we conducted new simulations to verify that it was not just a random consequence of the original simulations. More details on our simulations for this formula can be found in Appendix B.1.
Kulinskaya et al. [10] also deduced a very general theoretical expansion for the second moment of Q, but when we applied this expansion to Q LOR and compared it to our simulations, we found that the expansion is much too inaccurate to be of any use. We conjecture that this inaccuracy is due to non-uniform convergence of the expansions with respect to both the number of studies K and the values of the binomial parameters. Accordingly we have chosen to estimate the variance of Q LOR using a quadratic regression formula from our simulations, as seen in Figure 1, but using more complete data than shown in those plots. As in the regression for the mean of Q LOR we fitted a formula for the variance and then checked it against additional simulations (See the Appendix B.2 for more details on our procedures). Our formula for estimating Var[ Q LOR ] is The quadratic regression fit, using 487 of our more than 1400 simulations, had an R 2 value of 98.5%. In using this equation, we first need to calculate E[ Q LOR ] using Equation 6. This quadratic regression is depicted by the black curve on the plot (b) of Figure 1.
Although we do not have a theoretical justification for using a quadratic relation between the mean and variance of Q, such a functional relation between the mean and the variance of Q is often found under various conditions. For examples, in the asymptotic chi-square distribution of Q, the variance (twice the mean) is a linear function of the mean; and in the normally distributed sample mean situation of Equations (4) and (5), a little algebra shows that again the variance is a linear function of the mean. Further, in a common one-way random effects model, Biggerstaff and Tweedie [13] show that the variance of Q is a quadratic function of the mean.
Our simulations show that the family of gamma distributions fits the distribution of Q LOR quite well. By matching the mean and variance of Q LOR with the mean and variance of a gamma distribution, we arrive at an approximation for the distribution of Q LOR which can be used to conduct a test of homogeneity for the equality of log odds ratios using Q LOR as the test statistic. (The shape parameter α of the gamma distribution is estimated The accuracy of this test statistic and a comparison with other test statistics are discussed in the next section.

Accuracy of the level of the homogeneity test
In this section we present the results of extensive simulations designed to analyze the accuracy of the levels of the test of homogeneity of log odds ratios using the Q statistic together with the gamma distribution estimated from the data by the methods of Section "Estimating the moments and distribution of Q LOR ". We denote this test by Q γ . The use of simulations to determine the accuracy of various different tests of homogeneity of log odds ratios has often been discussed in the literature. See, for example, Schmidt et al. [14], Bhaumik et al. [15], Bagheri et al. [16], Lui and Chang [17], Gavaghan et al. [18], Reis et al. [19], Paul and Donner [20,21], and Jones et al. [22]. Our simulations included comparisons with some of the tests proposed by these authors. The comparisons of ours confirmed (as several of the above authors also discovered) that the Breslow-Day [11] (denoted by BD) is often the best available among the previously considered tests.
The Breslow-Day test for homogeneity of odds-ratios is based on the statistic where x j , X j (ψ) and Var(x j |ψ) denote the observed number, the expected number and the asymptotic variance of the number of events in the treatment arm of the jth study given the overall Mantel-Haenszel odds ratioψ, respectively. Its distribution is approximated by the χ 2 distribution with K − 1 degrees of freedom. We found that using the Tarone [23] correction to the Breslow-Day test had such small differences from BD that the two were virtually equivalent. In addition to the BD and Tarone tests, we simulated proposals by Lui and Chang [17] for testing the homogeneity of log odds ratios based on the normal approximation to the distribution of the z-, square-root and log-transformed Q stand statistic. The logtransformation was also suggested by Bhaumik et al. [15]. We do not report these results due to our conclusion that none were superior to BD. Accordingly, in our comparative graphs below, we compare our Q γ test with BD and with the commonly used test (denoted Q χ 2 ), which uses the standard statistic Q stand (calculated without adding 1/2 to the numbers of events when calculating log-odds) together with the chi-square distribution.
Our simulations for testing the null hypothesis of equal odds ratios (all conducted subsequent to the adoption of the regressions of Equations 6 and 7) are of two types. For the first type, the parameters of all studies are identical; these simulations include the following parameters: number of studies K = 5, 10, 20 and 40; total study sizes N = 90, 150, and 210; proportion of the study size in the control arm q = 1/3, 1/2, 2/3; null hypothesis value of the log odds ratio θ = 0, 0.5, 1, 1.5, 2, and 3; and the log odds of the control arm ζ = -2.2 (p C = 0.1), -1.4 (p C = 0.2) and -0.4 (p C = 0.4). The second type of simulation fixes the null hypothesis values of equal log odds ratio at θ = 0, 0.5, 1, 1.5, 2, and 3, but the individual studies are quite heterogeneous concerning all other parameters. For example, for a null value of θ = 0.5 and K = 5 studies, one configuration with an average study size of 150 has different sample sizes of 96, 108, 114, 120, 312, each divided equally between the two arms (q = 1/2) and different control arm probabilities p C of 0.15, 0.3, 0.45, 0.6, and 0.75; note that the condition θ = 0.5 when used with the five different control arm probabilities then uniquely specifies five probabilities p T for the treatment arms. A complete description of the heterogeneous simulations can be found in Appendix A. When K = 5, 10 and 20, all simulations were replicated 10,000 times and thus approximate 95% confidence intervals for the achieved levels are ±0.004; but when K = 40, the simulations were replicated only 1,000 times, giving approximate 95% confidence intervals for the levels of ±0.014.
The first panel of graphs (see Figure 2) shows the achieved levels, at the nominal level of 0.05, for the three tests plotted against the different null values of θ in the range 0 to 3 under the configuration in which all K studies have identical parameters and the study sizes are N = 90 with the subjects split equally between the two arms (q = 1/2). The twelve graphs in the panel use K = 5, 10, 20 and 40; and p C = 0.1, 0.2, and 0.4. Note that the achieved levels for both BD and Q γ are almost always in the range 0.04 to 0.06, with BD slightly better for many situations, but with Q γ occasionally slightly better. The test Q χ 2 is almost always inferior; and when p C = 0.1, it is much too conservative (not rejecting the null hypothesis frequently enough); indeed, when θ = 0, the achieved levels for Q χ 2 are less than 0.01. In the four right graphs, when p C = 0.4, we see that all three tests perform well when 0 ≤ θ ≤ 1.5; these parameters correspond to p T = 0.4, 0.52, 0.64 and 0.75. We also note that in the fairly extreme situation when θ = 3 and p C = 0.4 (and hence p T = 0.93) the quality of all the tests worsens, however BD performs best here and Q χ 2 performs very badly.
These results for the test Q χ 2 are perhaps more easily understood when expressed in terms of the natural parameters, the binomial probabilities p C and p T , rather than the log odds ratio θ. We see that Q χ 2 is extremely conservative whenever either binomial parameter is far from the central values of 0.5, but that its performance is reasonable when the binomial parameters are relatively close to the central values of 0.5. Figure 2 is representative of a number of additional panels of graphs for equal study sizes which can be found in Appendix B.1, Figures 9 and 10. There we have included panels of graphs first for balanced arms with study sizes of 150 and 210. These panels are quite similar to the one presented in Figure 2 except that all levels become closer to the nominal level of 0.05 as the study size increases from 90 to 150 to 210. This behavior is consistent with the known fact that the tests are asymptotically correct as the study sizes tend to ∞. However, we note that even when N = 210, the test Q χ 2 is still quite conservative when p C = 0.1.
Appendix B.1 contains two additional panels of graphs (Figures 11 and 12) which are analogous to the panel in Figure 2 except that the two arms of each study are unbalanced. In the first of these, all studies have twice the number of subjects in the treatment arm (q = 1/3) and the second is reversed with all studies having twice the number of subjects in the control arm (q = 2/3). The results are similar to those of Figure 2 with the following modified conclusions. When q = 1/3 and p c = 0.1, the Q χ 2 test is particularly conservative, rejecting the null hypothesis less than 1% of the time, independent of the number of studies K. Generally both the BD test and the Q γ tests are reasonably close to nominal level, but the BD test is mostly (but not always) somewhat better than the Q γ test. When θ = 3, all tests experience a decline in accuracy, with the BD test mostly superior. Figure 3 is a typical example showing the achieved levels for one set of configurations in which all the studies are distinct. Here the studies are of average size 150. When K = 5, the total study sizes are 96, 108, 114, 120, 312; in selecting these sizes, we have followed a suggestion of Sánchez-Meca and Marín-Martínez [24] who selected study sizes having the skewness 1.464, which they considered typical for meta-analyses in behavioral and health sciences. For a given θ the five studies had different values for the control arm and treatment arm probabilities (see Appendix for details). For K = 10, 20 and 40, the parameters for K = 5 were repeated 2, 4 and 8 times respectively. We see that BD and Q γ are fairly close in outcome with achieved levels almost always between 0.045 and 0.055, while the levels for Q χ 2 mostly cluster around 0.04. Note that the performance of Q χ 2 is somewhat better than seen in Figure 2 for two reasons. First, the study sizes are larger (average of 150 rather than all having size 90); and second, because the binomial parameters vary among the different studies, many of them are closer to the central values of 0.5 where we have seen that the performance of the Q χ 2 test improves.
It is worth noting that when we conducted simulations for the average sample size of 90 for the same scenario (so that the sample sizes were 36, 48, 54, 60, 252), we discovered that the Breslow-Day test does not perform well Figure 3 Achieved levels for heterogeneous studies, N = 150. Comparison of achieved levels, at the nominal level of 0.05, for the three tests Q γ (solid line), BD (dot-dash), and Q χ 2 (dash) plotted against the log odds ratio θ for heterogeneous studies. Here the studies have average size 150 divided equally between arms, but the study sizes and the binomial parameters vary for each study. In the left graphs, the smallest control probabilities are paired with the smallest study sizes. In the right graphs, the smallest control probabilities are paired with the largest study sizes. and may even not be defined for large numbers of studies K due to the sparsity of the data. This is the reason that, for comparative purposes, we use larger sample sizes in Figure 3 than used in Figure 2.

Power of the homogeneity test
In this section we report on the results from our (limited) simulations of power of the three tests: the Q γ , BD and Q χ 2 tests. Power comparisons are not really appropriate when the levels are inaccurate and differ across the tests. Unfortunately it is impossible to equalize the levels or adjust for the differences. Nevertheless we consider power comparisons at a nominal level of 0.05 to be important to inform the practice. We have performed simulations only for the case of K identical studies with balanced sample sizes (q = 1/2). The values for the total study sizes N, the number of studies K, control arm probabilities p C and the common log-odds ratio θ were identical to those used in simulating the levels for the identical studies given in Section "Accuracy of the level of the homogeneity test". For each combination of N, K, p C , θ, according to the random effects model of meta-analysis, we simulated K within-studies log odds ratios θ i from the N(θ, τ 2 ) distribution for the values of the heterogeneity parameter τ from 0 to 0.9 in the increments of 0.1. Given the values of p C and θ i , we next calculated the probabilities in the treatment groups p Ti and simulated the numbers of the study outcomes from the binomial distributions Bin(n i , p C ) and Bin(n i , p Ti ) for i = 1, · · · , K. All simulations were replicated 1000 times.
The first panel of graphs (see Figure 4) shows the power for the three tests when θ = 0 plotted against the different values of heterogeneity parameter τ in the range 0 to 0.9 under the configuration in which all K studies have identical parameters, the study sizes are N = 90 with the subjects split equally between the two arms (q = 1/2). The twelve graphs in the panel use K = 5, 10, 20 and 40; and p C = 0.1, 0.2, and 0.4.
Note that the power for both BD and Q γ are almost always higher than for Q χ 2 , with the difference being especially pronounced for p C = 0.1. The inferiority of Q χ 2 is due to its conservativeness noted in the Section "Accuracy of the level of the homogeneity test". There is no clear-cut winner between the BD and the Q γ , with BD slightly better for some situations, but slightly worse for others. In the three right graphs, when p C = 0.4, we see that all three tests perform equally well.
The second panel of graphs (see Figure 5) shows the power for the three tests when θ = 3. The power of the Q χ 2 test is still the lowest of the three tests. But here the power of the Q γ test appears to be somewhat higher then for the BD when p C = 0.1, about the same when p C = 0.2, and noticeably lower in the extreme situation when p C = 0.4. These differences in power between the BD and Q γ tests are both the consequences of the fact that the Q γ test is somewhat liberal for p C = 0.1 and somewhat conservative for p C = 0.4, as can be seen from Figure 2. The BD test is the closest to the nominal level in these circumstances.

Example: a meta-analysis of Stead et al. (2013)
This section illustrates the theory of Sections "The mean and variance of Q" and "Estimating the moments and distribution of Q LOR " and gives an indication of the improvement in accuracy of the homogeneity test. The calculations can be performed using our computer program (Additional files 1, 2 and 3).
We use the data from the review by Stead et al. [25] of clinical trials on the use of physician advice for smoking cessation. Comparison 03.01.04 [25], p.65 considered the subgroup of interventions involving only one visit.
We use odds ratio in our analysis below although relative risk was used in the original review. The first version of the review was published in 2001. Update 2, published in 2004, included 17 studies for this comparison. Summary data and the results from the standard analysis of these 17 trials are found in Figure 6, produced by the R package meta [26]. Note that meta does not add 1/2 to the number of events in calculation of the log-odds, and therefore calculates the standard statistic Q stand for the test of homogeneity.
The   For the data in these two examples, the gamma approximation results in lower and more accurate p-values than the p-values of both the standard chi-square approximation and the Breslow-Day test. However, in our more extensive simulations there were cases in which the Breslow-Day test was superior. Note that this example has fairly low numbers of events (between 1% and 5% for many studies), which, as mentioned at the end of Section "Accuracy of the level of the homogeneity test", is a situation where the Breslow-Day test may struggle.  Figure 7 shows the fit of our estimated gamma distribution (α = 8.90 and β = 1.66). Note that the fit is quite good throughout the entire empirical distribution. On the other hand, Figure 8 shows that the empirical distribution of Q departs substantially from the chi-square distribution with 16 df, again throughout the entire distribution.

Conclusions
Cochran's Q statistic is a popular choice for conducting a homogeneity test in meta-analysis and in multi-center trials. However users must be cautious in referring Q to a chi-square distribution when the study sizes are small or moderate. Here we have studied the distribution of Q when the effects of interest are (the logarithms of ) odds ratios between two arms of the individual studies. We have shown that the distribution of Q in these circumstances does not follow a chi-square distribution, especially if the binomial probability in at least one of the two arms is far from the central value of 0.5, say outside the interval [ 0.3, 0.7]. Further, the convergence of the distribution of Q to the asymptotically correct chi-square distribution is relatively slow as the sizes of the studies increase.
The mean and variance of Q (when the effects are log odds ratios and under the null hypothesis of homogeneity) are often substantially less than the corresponding chi-square values. We have provided formulas for estimating these moments and have found that matching these moments to those of a gamma distribution provides a good fit to the distribution of Q. The use of this distribution for Q yields a reasonably good test of homogeneity (denoted Q γ ) which is competitive with the well known Breslow-Day test both in accuracy of level and in power. However, this Q γ test does not seem to be superior (either in accuracy of level or in power) to the Breslow-Day test. Accordingly we recommend that the simpler Breslow-Day test be used routinely for testing the homogeneity of odds ratios.
We note that when the data are very sparse, the Breslow-Day test does not perform well and may even not be defined. We have met this difficulty in our unequal simulations described in Section "Accuracy of the level of the homogeneity test". The Q γ test is always well defined and is recommended for use in such situations.
In our study of the moments of Q for log odds ratios, we found that the variance of Q can be well approximated by a function of the mean of Q. Thus when fitting a gamma distribution to Q, at least approximately, the resulting distribution comes from a one parameter sub-family of the gamma family of distributions. The chi-square distributions also form a one parameter sub-family of the gamma family, but our conclusion is that it is the wrong sub-family to apply to Q. Intuitively, one would expect that a two parameter family of distributions would be needed because two independent binomial parameters (p T and p C ) for each study enter into the definition of Q. Thus it would be of interest to have a theoretical explanation of this property of Q, but we have been unable to provide this explanation.
The Q statistic with its distribution approximated by the chi-square distribution is widely used not only for testing homogeneity, but perhaps a more widespread and more important use is its application to estimate the random variance component τ 2 in a random effects model. Numerous moment-based estimation techniques, such as the very popular DerSimonian-Laird [6,27] and Mandel-Paule [28,29] methods use the first moment (K − 1) and the chi-square percentiles applied to the distribution of Q to provide, respectively, point and interval estimation of τ 2 . The latter is achieved through 'profiling' the distribution of Q, i.e., inverting the Q test (see Viechtbauer [27]). From our previous work with Bjørkestøl on the homogeneity test for standardized mean differences [9] and for the risk differences [10], it is clear that the non-asymptotic distribution of Q strongly depends on the effect of interest. This conclusion is confirmed here for Q when the effects are log odds ratios. The use of the correct moments and improved approximations to the distribution of Q for the point and interval estimation of τ 2 for a variety of different effect measures may provide greatly improved estimators, especially for small values of heterogeneity and will be the subject of our further work.

Appendix A: Information about the simulations
All of our simulations for assessing the accuracy of the levels and the power of various homogeneity tests used K studies with K = 5, 10, 20 and 40. All simulations were replicated 10,000 times for K = 5, 10 and 20, and (due to time considerations) only 1000 times for K = 40, unless stated otherwise. The set of simulations with all studies having identical parameters were as follows: study size N = 90, 150 and 210; proportion of each study in the control arm q= 1/2, 1/3 and 2/3; log odds ratio (null hypothesis) θ = 0, 0.5, 1.0, 1.5, 2.0 and 3.0; and binomial probabilities in the control arm p C = 0.1, 0.2 and 0.4. It is easier and more intuitive to select values of p C than to select values of the actual nuisance parameter For the simulations using unequal parameters among the various studies, the parameter choices can be described as follows. For K = 5, we use three vectors of study sizes: < N >=< 36, 48, 54, 60, 252 >; < 96, 108, 114, 120, 312 >; and < 163, 173, 178, 184, 352 >. These three vectors have average study sizes 90, 150 and 210 respectively, which corresponds to the study sizes of the equal simulations. The null hypothesis values of the log odds ratio θ are 0, 0.5, 1.0, 1.5, 2 and 3. For each fixed value of θ, we chose five values of p C with the goal of keeping p T away from 1.0 (see below for these values). Denote the vector of these values of p C by < P > and the vector of the same values but in reverse order by <∼ P >. From θ and < P >, it is easy to calculate the corresponding values of p T ; although these are not needed here, we include the approximate range of p T for information purposes. For K = 5, we conducted simulations for each value of θ pairing the first value of < N > with the first value of < P >, etc. which we denote 'order = 1', and then we pair the first value of < N > with the first value of <∼ P >, etc, which we denote 'order = 2'. By reversing the orders, we first pair the largest study size with the largest binomial probability and then pair the largest study size with the smallest binomial probability. We used balanced studies for these simulations (i.e., q = 1/2). For K = 10, we repeat these pairings twice, and for K = 20 and K = 40 the vectors of study sizes and control arm probabilities are repeated 4 and 8 times respectively.
We conducted many additional simulations with unequal size studies, some with all control probabilities equal except for 20% of the studies which had different control probabilities, and some with one or more of the studies being unbalanced (q = 1/3 and q = 2/3). These simulations did not add substantial information to our conclusions, so they are not reported here.
For the power simulations we only considered the case of K studies with the above identical parameters (including the values of the common log odds ratio θ) and balanced sample sizes (q = 1/2). For each combination of N, K, p C , θ, according to the random effects model of meta-analysis, we simulated K within-studies log odds ratios θ i from the N(θ, τ 2 ) distribution for the values of the heterogeneity parameter τ from 0 to 0.9 in the increments of 0.1. Given the values of p C and θ i , we next calculated the probabilities in the treatment groups p Ti and simulated the numbers of the study outcomes from the binomial distributions Bin(n i , p C ) and Bin(n i , p Ti ) for i = 1, · · · , K. All simulations were replicated 1000 times.

B.1 Additional graphs for accuracy of level and for power
The first two figures of this Appendix are similar to Figure 2 of the main article with the change being that the study sizes are 150 (instead of 90) in Figure 9 and 210 in Figure 10. These panels are quite similar to the one presented in Figure 2 except that all levels become closer to the nominal level of 0.05 as the study size increases from 90 to 150 to 210. This behavior is consistent with the known fact that the tests are asymptotically correct as the study sizes tend to ∞. However, we note that even when N = 210, the test Q χ 2 is still quite conservative when p C = 0.1. Figures 11 and 12 contain additional panels of graphs analogous to that in Figure 2 of the main article with the exception that the two arms of each study are unbalanced.
In the first of these, all studies have twice the number of subjects in the treatment arm (q = 1/3) and the second is reversed with all studies having twice the number of subjects in the control arm. The results are similar to those of Figure 2 with the following modified conclusions. When q = 1/3 and p C = 0.1, the Q χ 2 test is particularly conservative, rejecting the null hypothesis less than 1% of the time, independent of the number of studies K. Generally both the BD test and the Q γ test are reasonably close to nominal level, but the BD test is mostly (but not always) somewhat better than the Q γ test. When θ = 3, all tests experience a decline in accuracy, with the BD test mostly superior.
The final two figures in this appendix are analogous to Figures 4 and 5 in the main article, comparing the power of the three tests Q γ , BD and Q χ 2 when the log odds ratio is 0 and 3 respectively. The panels here (Figures 13 and 14) differ in that the sample sizes have been increased from N = 90 to N = 150. Qualitatively the plots here are quite similar to those in the main article, with the main difference, as would be expected, being that the power when N = 150 is somewhat greater than when N = 90. As before, Q γ and BD have similar power while Q χ 2 is most inferior in the two cases: θ = 0 and p C = 0.1; and θ = 3 Figure 11 Achieved levels for homogeneous studies, N = 90, q = 1/3. Achieved levels for the three tests Q γ (solid line), BD (dot-dash), and Q χ 2 (dash) plotted against the log odds ratio θ. Here all studies have the same parameters: 90 subjects in each study with unequal arms with 60 in the treatment arm (N = 90 and q = 1/3).

Figure 12
Achieved levels for homogeneous studies, N = 90, q = 2/3. Achieved levels for the three tests Q γ (solid line), BD (dot-dash), and Q χ 2 (dash) plotted against the log odds ratio θ . Here all studies have the same parameters: 90 subjects in each study with unequal arms with 30 in the treatment arm (N = 90 and q = 2/3). and p C = 0.4. These two cases share the property that one or both of the binomial probabilities is far from the central value of 0.5; in the first case, p C = p T = 0.1 and in the second case, p T = 0.93.

B.2 Information about formulas for mean and variance of Q LOR
In this appendix we present additional information concerning the data and methods that entered into Equations 6 and 7 which provide formulas for estimating the mean and variance of Q LOR under the null hypothesis of equal odds ratios. The data for Equation 6 include 648 parameter combinations in which all K studies had identical parameters. The parameters are: K = 5, 10, 20, 40; N = 90, 150, 210; q = 1/3, 1/2, 2/3; p C =0.1, 0.2, 0.4; and θ = 0, 0.5, 1, 1.5, 2, 3. The simulations for K = 40 were replicated 1,000 times, and the other simulations were replicated 10,000 times.
For each combination of parameters, we calculated an estimate of the mean of Q LOR (to be denoted simply Q in this section) using the theoretical expansion of Kulinskaya et al. [10]. We denote this quantity by E th [ Q]. For each parameter combination, we also found the mean of Q from the simulations, which we denote by Qbar. These two quantities were then divided by K − 1 to place the data on a scale common for all K. A scatter plot with a fitted line is found in Figure 15. Note that the fitted line (which has an R 2 value of 97.0%) essentially goes through the point (1, 1); the importance of the fitted line going through (1,1) is that both estimates agree when there is zero 'correction' from the re-scaled chi-square moment. Thus we subtracted 1 from both variables in Figure 15 and fit a regression through the origin, yielding a relation which we use to adjust the 'corrections' to the chi-square first moments K − 1 which are given by the the expansion E th [ Q]. This relation is found in Equation 6 of the main paper. (The four outliers in the lower left of Figure 15 belong to the extreme parameter values θ = 3, N = 90, q = 2/3, p T = 0.93, p C = 0.4 and for the four values of K = 5, 10, 20 and 40; omitting them made very little difference in the regression, so they were included in the analysis). Simulations for all of the parameter configurations that entered into Equation 6 of the main paper were redone, and these new simulations were the ones used in analyzing the accuracy of our test Q γ .

Figure 13
Power when the log odds ratio θ = 0 and N = 150. Power for the three tests Q γ (solid line), BD (dot-dash), and Q χ 2 (dash) plotted against τ , the square root of the random effect variance. Here all studies have the parameters: 150 subjects in each study with equal arms of 75 each (N = 150 and q = 1/2) and the log odds ratio θ = 0.
To arrive at the relation in Equation 7, we used simulations for 486 parameter combinations in which all K studies have the same parameters: K = 5, 10, 20; N = 90, 150, 210; q = 1/3, 1/2, 2/3; p C = 0.1, 0.2, 0.4; and θ = 0, 0.5, 1, 1.5, 2, 3, each replicated 10,000 times. For each parameter combination, let Qbar be the mean of the 10,000 values of Q and VarQbar be the variance of these 10,000 values of Q, and re-scale these values by dividing by K − 1. Figure 16 contains a scatter plot of these data together with a quadratic function fit. The quadratic fit has an R 2 value of 98.5%. We have used this regression in Equation 7 of the main article. We note again that simulations for all of the parameter configurations that entered into Equation 7 of the main paper were redone, and these new simulations were the ones used in analyzing the accuracy of our test Q γ .

B.3 The general expansion for the first moment of Q applied to Q LOR
The general expansion for the first moment of Q (denoted E th [ Q] in Section "Estimating the moments and distribution of Q LOR ") as found in Kulinskaya et al. [10] is reproduced at the end of this appendix. In the formulas below, we use the notation i = θ i − θ i and Z i = ζ i − ζ i ; also, we express the weight estimators as functions of the parameter estimators w i = f i ( θ i , ζ i ). The theoretical weights under the null hypothesis are then w i = f i (θ, ζ i ). For the weights as defined in Equation 2 of the main artlcle, some algebra produces the formula for the weight function The formulas below require that the central moments of θ i and ζ i satisfy the following order conditions: and O(E[ 4 i ] ) = 1/n 2 i and similar conditions for the central moments of ζ i . These order conditions for the specific case of the estimators of the log odds ratio (as defined in Section "Notation and assumptions") follow from the work of Gart et al. [12]. However, instead of using the approximations for the central moments given by Gart et al., our R-program calculates these exactly.