There is a large literature discussing the relative merits of using RR, DIF, and OR as outcome measures [10]-[14]. Our results concerning generalizability of DIF and RR, but not OR, in the presence of an unobserved binary covariate with no interaction, add important new information to this discussion.
Because the analyst must weigh all these issues, we think it is helpful to present our perspective on some of the other factors that affect the choice of outcome measure. We believe the outcome measure should reflect the underlying model if it is known. We also agree that one should consider how well a model of constant RR, DIF, or OR fits the data [10].
It is sometimes argued that DIF and RR should not be used because extrapolated estimates might violate the constraints that estimated probabilities lie between 0 and 1 and that RR > 0 [10]. (For example, suppose that in 9 trials the probability of outcome in the control group is .1 and the probability of outcome in the intervention group is .6, so DIF = .5. Also suppose that in 1 additional trial the probability of outcome in the control group is .65 and the probability of outcome in the intervention group is .95, so DIF = .3. If all trials are of equal size, a weighted estimate of DIF with weights inversely proportional to the variance yields DIF_avg = .47. The estimated probability of outcome in the last trial would then be .65 + DIF_avg = 1.12, which violates this constraint.) In contrast to many other investigators, we are not concerned about this extrapolation problem. In many meta-analyses the extrapolated estimates will not violate the constraints. If an extrapolated estimate violates a constraint, it could be a valuable indication that the model is inappropriate when applied to all the studies. If the constraint is violated only slightly, it might be sensible to fit a model that constrains DIF and RR to lie in valid ranges [11].
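To make the arithmetic concrete, the following minimal Python sketch reproduces the example above. The per-arm sample size n is an assumption introduced for illustration; with equally sized trials it cancels out of the inverse-variance weights.

```python
# A minimal sketch of the extrapolation example above (illustrative values).

def dif_and_var(p0, p1, n):
    """Risk difference and its binomial variance estimate for one trial."""
    dif = p1 - p0
    var = p0 * (1 - p0) / n + p1 * (1 - p1) / n
    return dif, var

n = 100  # assumed common sample size per arm (cancels from the weights)
trials = [(0.10, 0.60)] * 9 + [(0.65, 0.95)]  # (control risk, intervention risk)

stats = [dif_and_var(p0, p1, n) for p0, p1 in trials]
weights = [1.0 / v for _, v in stats]
dif_avg = sum(w * d for (d, _), w in zip(stats, weights)) / sum(weights)

print(f"DIF_avg = {dif_avg:.3f}")                   # about .47 to .48
print(f"extrapolated risk = {0.65 + dif_avg:.3f}")  # exceeds 1
```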
Sometimes it is argued that RR should not be used because its value changes if the labels of the binary outcome are reversed [10]. In particular, if RR is constant with one set of labels it is typically not constant if the labels are reversed. However, because the labels have an important meaning (e.g. survive or die), we are not concerned that RR changes with label reversal. In contrast, in latent class models, the class labels are arbitrary, so it is helpful to check the computations by verifying that the results are the same if the labels are reversed. A more serious criticism of RR is sensitivity to small counts [12]. We agree with this criticism and do not recommend using RR with small counts in one group.
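For concreteness, here is a small Python sketch of the label-reversal point, using hypothetical risks in two trials:

```python
# Hypothetical risks: RR for "die" is constant across the two trials,
# but the RR for the reversed label "survive" is not.

trials = [(0.10, 0.20), (0.30, 0.60)]  # (control risk, intervention risk)

for p0, p1 in trials:
    rr_die = p1 / p0                   # RR with the original labels
    rr_survive = (1 - p1) / (1 - p0)   # RR after reversing the outcome labels
    print(f"RR(die) = {rr_die:.2f}, RR(survive) = {rr_survive:.2f}")

# RR(die) = 2.00 in both trials, but RR(survive) = 0.89 versus 0.57,
# so constancy of RR does not survive label reversal.
```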
We agree with much of the literature that, in terms of interpretation, RR and DIF are preferable to OR. According to Sackett et al. [14], "because very few clinicians are facile at dealing with odds and relative odds, ORs are not useful in their original form at the bedside or examining room". Walter [10] writes, "The OR is undeniably the most difficult measure to intuit, so it is likely to be less useful than RD [DIF] or RR for communicating risk".
Besides the choice of outcome measure, other factors affect the appropriateness of combining results from randomized trials and should be considered by the analyst. One factor is the variation in all-or-none compliance among trials. To reduce the variation from this factor, one can fit a model based on inherent compliance (i.e., with baseline subgroups "always-takers", "compliers", and "never-takers") [15, 16]. These models have been applied to meta-analyses involving DIF as an outcome [17, 18]. Related models for RR [19, 20] could be used for meta-analyses involving RR. Our graphic supporting the use of DIF and RR would directly apply to "compliers", who are the subgroup of interest in these models for all-or-none compliance.
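As a rough illustration of the complier subgroup, the following Python sketch applies the standard complier-average-causal-effect identity for a single trial with all-or-none noncompliance in the intervention arm only. The numbers are hypothetical, and the identity assumes randomization and the exclusion restriction; the models in [15-18] are more general.

```python
# Hypothetical single-trial numbers; not from the cited studies.
p_control = 0.30    # outcome risk in the control arm
p_treat_arm = 0.21  # outcome risk in the intervention arm (intention to treat)
p_comply = 0.60     # fraction of the intervention arm that took the treatment

itt_dif = p_treat_arm - p_control  # intention-to-treat risk difference
cace_dif = itt_dif / p_comply      # DIF among compliers (exclusion restriction)
print(f"ITT DIF = {itt_dif:.2f}, complier DIF = {cace_dif:.2f}")
```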
Another factor affecting the combination of results from randomized trials is the variation in treatment (e.g., variation in doses or levels of ancillary care). Despite the theoretical results in this paper, a large empirical study comparing RR and OR in meta-analyses found little difference in heterogeneity between the two measures [21]. A likely explanation is that the impact of variations in treatment was larger than the bias from using OR.
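To indicate what such an empirical comparison involves, the following Python sketch computes Cochran's Q and I^2 for log RR and log OR across the same set of trials; the 2x2 counts are made up for illustration and are not from [21].

```python
import math

# Made-up trial counts: (events_treat, n_treat, events_control, n_control).
trials = [(30, 100, 20, 100), (45, 150, 25, 150), (12, 80, 10, 80)]

def q_and_i2(effects, variances):
    """Cochran's Q and I^2 for inverse-variance weighted effects."""
    w = [1.0 / v for v in variances]
    mean = sum(wi * e for wi, e in zip(w, effects)) / sum(w)
    q = sum(wi * (e - mean) ** 2 for wi, e in zip(w, effects))
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return q, i2

log_rr, var_rr, log_or, var_or = [], [], [], []
for a, n1, c, n0 in trials:
    b, d = n1 - a, n0 - c
    log_rr.append(math.log((a / n1) / (c / n0)))
    var_rr.append(1 / a - 1 / n1 + 1 / c - 1 / n0)  # variance of log RR
    log_or.append(math.log((a * d) / (b * c)))
    var_or.append(1 / a + 1 / b + 1 / c + 1 / d)    # variance of log OR

print("log RR: Q = %.2f, I^2 = %.2f" % q_and_i2(log_rr, var_rr))
print("log OR: Q = %.2f, I^2 = %.2f" % q_and_i2(log_or, var_or))
```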