There is a large literature discussing the relative merits of using *RR*, *DIF*, and *OR* as outcome measures [10]-[14]. Our results concerning generalizability of *DIF* and *RR*, but not *OR*, in the presence of an unobserved binary covariate with no interaction, add important new information to this discussion.

Because the analyst must weigh all the issues, we think it is helpful to present our perspective on some of the other factors that affect the choice of outcome measure. We believe the outcome measure should reflect the underlying model if it is known. We also agree that one should consider how well a model of constant *RR*, *DIF*, or *OR* fits the data [10].

It is sometimes argued that *DIF* and *RR* should *not* be used because extrapolated estimates might violate the constraints that 0 < *DIF* < 1 and *RR* > 0 [10]. (For example, suppose that in 9 trials the probability of outcome in the control group is .1 and the probability of outcome in the intervention group is .6, so *DIF* = .5. Also suppose that in 1 additional trial, the probability of outcome in the control group is .65 and the probability of outcome in the intervention group is .95, so *DIF* = .3. If all trials are of equal size, a weighted estimate of *DIF* with weights inversely proportional to the variance yields *DIF*_{avg} = .47. The estimated probability of outcome in the intervention group of the last trial would then be .65 + *DIF*_{avg} = 1.12, which exceeds 1 and thus violates the constraint on *DIF*.) In contrast to many other investigators, we are not concerned with this extrapolation problem. In many meta-analyses the extrapolated estimates will not violate the constraints. If an extrapolated estimate violates a constraint, it could be a valuable indication that the model is inappropriate when applied to all the studies. If the constraint is violated only slightly, it might be sensible to fit a model that constrains *DIF* and *RR* to lie in valid ranges [11].
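
The arithmetic in the parenthetical example can be checked with a short sketch. It assumes the variance of *DIF* in each trial is the usual binomial variance of a difference of two independent proportions, evaluated at the stated probabilities, and it uses a hypothetical arm size of 100 (any common size gives the same weighted average, because it cancels). The function name `inverse_variance_dif` is illustrative only.

```python
def inverse_variance_dif(trials):
    """Inverse-variance weighted average of DIF.

    trials: list of (p_control, p_intervention, n_per_arm) tuples.
    Assumption: Var(DIF) = p0(1-p0)/n + p1(1-p1)/n for each trial.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for p0, p1, n in trials:
        variance = p0 * (1 - p0) / n + p1 * (1 - p1) / n
        weight = 1.0 / variance
        weighted_sum += weight * (p1 - p0)
        weight_total += weight
    return weighted_sum / weight_total


# Nine trials with DIF = .5 and one trial with DIF = .3.
# The arm size of 100 is a hypothetical value, not taken from the text.
trials = [(0.10, 0.60, 100)] * 9 + [(0.65, 0.95, 100)]

dif_avg = inverse_variance_dif(trials)

# Extrapolated intervention-group probability in the last trial.
extrapolated = 0.65 + dif_avg

print(f"DIF_avg = {dif_avg:.3f}")
print(f"extrapolated probability = {extrapolated:.3f}")
```

Under these assumptions the weighted average comes out at about .476, which agrees with the quoted .47 up to rounding, and the extrapolated probability exceeds 1.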

Sometimes it is argued that *RR* should not be used because its value changes if the labels of the binary outcome are reversed [10]. In particular, if *RR* is constant with one set of labels, it is typically not constant if the labels are reversed. However, because the labels have an important meaning (e.g., survive or die), we are not concerned that *RR* changes with label reversal. In contrast, in latent class models the class labels are arbitrary, so it is helpful to check the computations by verifying that the results are the same if the labels are reversed. A more serious criticism of *RR* is its sensitivity to small counts [12]. We agree with this criticism and do not recommend using *RR* when the counts in one group are small.
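
The label-reversal point can be illustrated numerically. The sketch below uses two hypothetical trials (the probabilities are invented for illustration and do not come from any cited study) in which *RR* is constant for the outcome as labeled, and then recomputes *RR* for the complementary outcome. For comparison it also shows that *OR* is simply inverted under label reversal.

```python
import math


def relative_risk(p_control, p_intervention):
    """RR of the outcome, intervention versus control."""
    return p_intervention / p_control


def odds_ratio(p_control, p_intervention):
    """OR of the outcome, intervention versus control."""
    odds_intervention = p_intervention / (1 - p_intervention)
    odds_control = p_control / (1 - p_control)
    return odds_intervention / odds_control


# Hypothetical (p_control, p_intervention) pairs chosen so that RR = 2 in both.
trials = [(0.10, 0.20), (0.30, 0.60)]

# RR with the original labels (probability of the outcome).
rr_original = [relative_risk(p0, p1) for p0, p1 in trials]

# RR with reversed labels (probability of the complementary outcome).
rr_reversed = [relative_risk(1 - p0, 1 - p1) for p0, p1 in trials]

# OR with original and reversed labels.
or_original = [odds_ratio(p0, p1) for p0, p1 in trials]
or_reversed = [odds_ratio(1 - p0, 1 - p1) for p0, p1 in trials]

print("RR, original labels:", rr_original)
print("RR, reversed labels:", rr_reversed)
print("OR, original labels:", or_original)
print("OR, reversed labels:", or_reversed)
```

In this example *RR* is the same in both trials with the original labels but differs between the trials once the labels are reversed, whereas each reversed *OR* is the reciprocal of the original *OR*.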

We agree with much of the literature that, in terms of interpretation, *RR* and *DIF* are preferable to *OR*. According to Sackett et al. [14], "because very few clinicians are facile at dealing with odds and relative odds, ORs are not useful in their original form at the bedside or examining room". Walter [10] writes, "The OR is undeniably the most difficult measure to intuit, so it is likely to be less useful than RD [*DIF*] or RR for communicating risk".

Besides the choice of outcome measure, other factors affect the appropriateness of combining results from randomized trials and should be considered by the analyst. One factor is the variation in all-or-none compliance among trials. To reduce the variation from this factor, one can fit a model based on inherent compliance (i.e., with baseline subgroups "always-takers", "compliers", and "never-takers") [15, 16]. These models have been applied to meta-analyses involving *DIF* as an outcome [17, 18]. Related models for *RR* [19, 20] could be used for meta-analyses involving *RR*. Our graphic supporting the use of *DIF* and *RR* would directly apply to "compliers", who are the subgroup of interest in these models for all-or-none compliance.

Another factor affecting the combination of results from randomized trials is the variation in treatment (e.g., variation in doses or levels of ancillary care). Despite the theoretical results in this paper, a large empirical study comparing the use of *RR* and *OR* in meta-analyses found little difference in heterogeneity between the two measures [21]. A likely explanation is that the impact of variations in treatment was larger than the bias from using *OR*.