Estimating risk ratio from any standard epidemiological design by doubling the cases
BMC Medical Research Methodology volume 22, Article number: 157 (2022)
Abstract
Background
Despite the ease of interpretation and communication of a risk ratio (RR), and several other advantages in specific settings, the odds ratio (OR) is more commonly reported in epidemiological and clinical research. This is due to the familiarity of the logistic regression model for estimating adjusted ORs from data gathered in a cross-sectional, cohort or case-control design. The preservation of the OR (but not RR) in case-control samples has contributed to the perception that it is the only valid measure of relative risk from case-control samples. For cohort or cross-sectional data, a method known as ‘doubling-the-cases’ provides valid estimates of RR and an expression for a robust standard error has been derived, but is not available in statistical software packages.
Methods
In this paper, we first describe the doubling-of-cases approach in the cohort setting and then extend its application to case-control studies by incorporating sampling weights and deriving an expression for a robust standard error. The performance of the estimator is evaluated using simulated data, and its application illustrated in a study of neonatal jaundice. We provide an R package that implements the method for any standard design.
Results
Our work illustrates that the doubling-of-cases approach for estimating an adjusted RR from cross-sectional or cohort data can also yield valid RR estimates from case-control data. The approach is straightforward to apply, involving simple modification of the data followed by logistic regression analysis. The method performed well for case-control data from simulated cohorts with a range of prevalence rates. In the application to neonatal jaundice, the RR estimates were similar to those from relative risk regression, whereas the OR from naive logistic regression overestimated the RR despite the low prevalence of the outcome.
Conclusions
By providing an R package that estimates an adjusted RR from cohort, cross-sectional or case-control studies, we have enabled the method to be easily implemented with familiar software, so that investigators are not limited to reporting an OR and can examine the RR when it is of interest.
Background
The familiarity and wide adoption of logistic regression analysis for binary outcomes has resulted in the independent effect of a risk factor being most commonly reported as an adjusted odds ratio (OR) from logistic regression. The ease of communication and interpretation of a risk ratio (RR, also known as relative risk) is well recognized [1] and it is common for investigators to present and discuss the OR as an approximation to a RR for a rare outcome. However, each of these estimators comes with some consequences [2] and their advantages and disadvantages have been discussed extensively in the epidemiological literature. An important limitation of the OR, that is not shared by the RR, is the non-collapsibility that is the subject of ongoing discussion [3]. As a result of this property, the OR can vary across subgroups defined by a variable unrelated to the exposure, which imposes limitations on its interpretation. Another disadvantage of the OR that is not shared by the RR is that it is sensitive to the choice of scale [4]. Thus there are situations where an adjusted RR can provide a better understanding of the data and research findings [5] and overcome the limitations of only reporting an OR [6]. Of particular concern in global public health is the misinterpretation of the OR as a RR, supporting exaggerated claims of the magnitude of associations [7].
If the underlying disease process follows a relative risk model, and not a logistic model, methods have long been available for estimating the RR from cohort or cross-sectional data: using log-binomial regression [8], or if this has convergence issues, Poisson regression [9] or Cox regression [10]. In the early years of case-control studies, a simple “correction” to the OR was proposed to yield a less biased estimate of the RR [11], but this was later shown to be biased in the presence of confounding [9]. A paper discussing eight methods of estimating the RR [12] from cohort or cross-sectional data presented an intriguing approach referred to as “doubling-the-cases”, motivated in the early 1980s by Miettinen [13], where manipulation of the data enables the RR to be estimated using standard logistic regression. Assuming the outcome in the data is coded as 1 for cases and 0 for non-cases, the data set is expanded with an additional record for each case, in which the outcome is changed to 0, and a logistic regression analysis of this expanded data set provides an unbiased estimate of the RR. However, the naive standard error reported by the logistic regression is only valid for low incidence rates, and is otherwise biased upwards, representing the additional uncertainty that has been added to the data by having the same individual covariate profile associated with being both a case and a non-case. A robust sandwich estimator, first proposed in the early 1990s [14], corrects for the doubling of cases in the modified data, and has since been shown to perform well in simulation studies [12, 15]. However, statistical software packages do not provide an estimate of this standard error, so that a valid measure of precision is not easily available for the RR estimate.
As a result of this computational challenge for cohort and cross-sectional studies, and the lack of methodology and software for case-control sampling, the simple and intuitive doubling-of-cases approach is absent from the standard toolbox of health researchers.
The early work that developed the robust standard error [14] demonstrated that the doubling-of-cases approach can also be applied to case-cohort data. Since the subcohort is a random sample of the whole cohort, it can be easily shown that the logistic regression of the expanded case-cohort data provides a valid estimate of the RR, and the prevalence can be recovered from the intercept using the subcohort sampling fraction. Unlike the subcohort in a case-cohort study, when a case-control sample is drawn from a cohort, these data are not representative of the larger cohort, resulting in the distortion of the estimate of RR (but not of the OR). However, if the sampling fractions are known, the cohort can be represented by up-weighting the observed data using sampling weights [16]. Since the doubling-of-cases approach uses the standard logistic regression model, it is straightforward to accommodate such sampling weights for valid estimation of the RR from case-control samples. However, additional work is required to incorporate the weights when correcting for the overestimation of variability due to the doubling of cases.
In this paper, we describe the doubling-of-cases approach in the cohort setting and then extend its application to the estimation of adjusted RR from case-control data, where the controls are selected either by random or stratified sampling. We derive an expression for the robust standard error and facilitate the use of the method by implementing it as an R package. We evaluate the performance of the approach using simulated data, and illustrate its application in the analysis of the effect of preterm birth on the risk of neonatal jaundice.
Methods
Doubling of cases in cohort studies
To introduce the doubling-of-cases approach for estimating the RR, first consider a crude analysis using a cohort of N subjects, with a binary disease indicator Y (1 for cases and 0 for non-cases) and a binary exposure X (e for exposed and \(\bar {e}\) for unexposed). As illustrated in Fig. 1, the doubling-of-cases approach involves expanding the cohort, by including each case twice, where the outcome on the second record is coded as a non-case. Such modification does not change the number of cases in the expanded cohort (where the outcome is denoted by \(Y^{*}\)), but increases the number of non-cases to N (see Fig. 1 and details in Table 1). Hence, the crude OR computed from the expanded cohort is identical to the RR from the original cohort.
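The equivalence between the crude RR and the crude OR of the expanded cohort can be checked with a few lines of arithmetic. The following is a minimal sketch in Python, not the authors' R package; the cohort counts are hypothetical and chosen only for illustration.

```python
def risk_ratio(cases_exp, n_exp, cases_unexp, n_unexp):
    """Crude RR from the original cohort counts."""
    return (cases_exp / n_exp) / (cases_unexp / n_unexp)


def naive_odds_ratio(cases_exp, n_exp, cases_unexp, n_unexp):
    """Crude OR from the original cohort counts."""
    return (cases_exp / (n_exp - cases_exp)) / (cases_unexp / (n_unexp - cases_unexp))


def expanded_odds_ratio(cases_exp, n_exp, cases_unexp, n_unexp):
    """Crude OR after doubling the cases.

    Every case is kept as a case and added once more as a non-case, so the
    number of non-cases in each exposure group becomes the group size.
    """
    noncases_exp = (n_exp - cases_exp) + cases_exp
    noncases_unexp = (n_unexp - cases_unexp) + cases_unexp
    return (cases_exp / noncases_exp) / (cases_unexp / noncases_unexp)


# Hypothetical cohort: 300 exposed (90 cases) and 700 unexposed (70 cases).
counts = (90, 300, 70, 700)
rr = risk_ratio(*counts)
or_naive = naive_odds_ratio(*counts)
or_expanded = expanded_odds_ratio(*counts)
print(f"RR = {rr:.3f}, naive OR = {or_naive:.3f}, expanded OR = {or_expanded:.3f}")
```

With these counts the RR is 3 and the naive OR is about 3.86, while the OR from the expanded cohort reproduces the RR.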
Mantel-Haenszel OR from expanded cohort
In the presence of an additional categorical confounder, Z, the adjusted RR can be computed from the cohort using the Mantel-Haenszel approach, which is a weighted average of the RRs within each of the strata defined by Z [17]. Similarly, the Mantel-Haenszel OR from the expanded cohort is a weighted average of the ORs within each of the expanded strata (which are shown in Table 1 to be identical to the RRs in the original strata) using weights \(w^{*k} = \left (N^{k}_{\bar {e}1} N^{k}_{e.}\right) / \left (N^{k} + N^{k}_{.1}\right)\) for the OR from the kth expanded stratum, which differ from the weights used to compute the Mantel-Haenszel RR [17]: \(w^{k} = \left (N^{k}_{\bar {e}1} N^{k}_{e.}\right) / N^{k}\) for the RR from the kth stratum. It will be shown below that both the Mantel-Haenszel RR and the expanded data Mantel-Haenszel OR are estimating the same underlying parameter, the true adjusted RR.
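The two weighted averages can be compared numerically. The sketch below is an illustration in Python under stated assumptions: the two strata are hypothetical and were constructed so that the stratum-specific RR is 2 in both, while the crude RR is confounded. In general the two estimators use different weights and need not coincide exactly; they agree here because the RR is homogeneous across strata.

```python
# Each stratum: (exposed cases, exposed total, unexposed cases, unexposed total).
# Hypothetical counts; the stratum-specific RR equals 2 in both strata.
strata = [(20, 100, 20, 200), (20, 50, 30, 150)]


def mh_risk_ratio(strata):
    """Mantel-Haenszel RR: stratum RRs averaged with weights c * n1 / n."""
    num = den = 0.0
    for a, n1, c, n0 in strata:
        n = n1 + n0
        num += a * n0 / n
        den += c * n1 / n  # weight w^k
    return num / den


def mh_odds_ratio_expanded(strata):
    """Mantel-Haenszel OR of the expanded strata.

    After doubling the cases, the non-cases in each exposure group number the
    whole group and the stratum total grows by the number of cases.
    """
    num = den = 0.0
    for a, n1, c, n0 in strata:
        total = (n1 + n0) + (a + c)
        num += a * n0 / total
        den += c * n1 / total  # weight w^{*k}
    return num / den


def crude_risk_ratio(strata):
    """RR ignoring the stratification variable."""
    a = sum(s[0] for s in strata)
    n1 = sum(s[1] for s in strata)
    c = sum(s[2] for s in strata)
    n0 = sum(s[3] for s in strata)
    return (a / n1) / (c / n0)


mh_rr = mh_risk_ratio(strata)
mh_or_expanded = mh_odds_ratio_expanded(strata)
crude_rr = crude_risk_ratio(strata)
print(f"crude RR = {crude_rr:.3f}, MH RR = {mh_rr:.3f}, expanded MH OR = {mh_or_expanded:.3f}")
```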
Logistic regression of expanded cohort
The doubling-of-cases approach in regression analysis of cohort and case-cohort studies was first described in 1993 [14], and more recently was referred to as expanded data logistic regression [15]. Here we will briefly describe the approach by generalising the expanded data Mantel-Haenszel OR introduced above.
Assume the following relative risk log-binomial regression model for the probability of being a case for an individual with exposure X in stratum Z:
\[\log P(Y = 1 \mid X, Z) = \alpha + \beta X + \gamma Z \qquad (1)\]
where \(\alpha\) is the intercept and \(\exp (\beta)\) represents the adjusted RR (with adjustment for Z) [18–21]. When the cohort is expanded by doubling the cases, the prevalence in each exposure group in the original cohort becomes the odds in that exposure group in the expanded cohort (see Table 2). Hence, a log-linear model for the prevalence in the original cohort gives rise to a log-linear model for the odds, i.e., a logistic regression model, in the expanded cohort:
\[\log \frac{P(Y^{*} = 1 \mid X, Z)}{1 - P(Y^{*} = 1 \mid X, Z)} = \alpha + \beta X + \gamma Z \qquad (2)\]
which estimates the same regression coefficients as the log-binomial regression model in Eq. (1). The robust sandwich-type standard error (SE), derived by the same authors [14] to correct for the overestimation of variability by the naive SE from this logistic regression, is described in the next section.
Robust sandwich-type SE for expanded data logistic regression
It can be readily seen from Table 2 that the probability of the modified outcome being 1 in the expanded cohort is:
\[p^{*} = P(Y^{*} = 1 \mid X, Z) = \frac{p}{1 + p}, \quad \text{where } p = P(Y = 1 \mid X, Z) \qquad (3)\]
For the relative risk regression model defined in Eq. (1), the following pseudo log-likelihood was used [14] for estimating the regression coefficient, β, and its variability:
\[l = \sum_{i=1}^{N} \left[ Y_{i} \log p^{*}_{i} + (1 - Y_{i}) \log (1 - p^{*}_{i}) \right] + \sum_{i=1}^{N} Y_{i} \log (1 - p^{*}_{i}) \qquad (4)\]
where the subscript i indicates the ith subject in the original cohort of N subjects. This pseudo log-likelihood is exactly the log-likelihood of the logistic regression of the expanded cohort, where the first component (in the square brackets) represents the regular log-likelihood contribution from the N subjects in the cohort, and the second component corresponds to the additional ‘non-cases’ created by doubling the cases. Hence, the regular maximum likelihood estimate from logistic regression analysis of the expanded data provides a valid estimate for β = ln(RR).
To describe the robust sandwich-type SE that was proposed [14] for the estimated ln(RR), it is useful to introduce a column vector to collectively denote the covariates observed from the ith subject in the original cohort: \(x_{i}=(1,X_{i},Z_{i})^{T}\), where the first element corresponds to the intercept term in Eq. (1). The components in constructing the sandwich-type SE are derived from the following first-order derivative of the pseudo log-likelihood, l, with respect to the regression coefficients:
\[U = \sum_{i=1}^{N} r^{*}_{i} x_{i} \qquad (5)\]
where \(r^{*}_{i} = Y_{i}(1 - p^{*}_{i}) - p^{*}_{i}\) is derived from the error terms (i.e., the difference between the observed outcome and the estimated probability) from the logistic regression of the expanded cohort. For a case in the original cohort, where \(Y_{i} = 1\), \(r^{*}_{i} = (1 - p^{*}_{i}) + (- p^{*}_{i})\) is the summation of the error terms corresponding to the two records in the expanded cohort, one as a case and the other coded as a non-case but with the same covariates (and hence the same probability \(p^{*}_{i}\)). For a non-case where \(Y_{i} = 0\), \(r^{*}_{i} = - p^{*}_{i}\) is the error term corresponding to the single record in the expanded data for this subject.
The proposed robust covariance matrix for the regression coefficients, (β,γ)^{T} is then:
\[H_{1}^{-1} H_{2} H_{1}^{-1} \qquad (6)\]
where \(H_{1}^{-1}\) is the inverse of the Hessian matrix of l, estimated by the naive covariance matrix from the logistic regression of the expanded cohort, and \(H_{2}\) is the covariance matrix of U, estimated by:
\[\hat{H}_{2} = \sum_{i=1}^{N} \left( \hat {r}^{*}_{i} \right)^{2} x_{i} x_{i}^{T} \qquad (7)\]
and the \(\hat {r}^{*}_{i}\) terms are computed from the residuals of the expanded data logistic regression as described above.
Doubling of cases in case-control studies
When a case-control sample is drawn from a cohort, the sample prevalence is solely dependent on the case:control ratio. However, a case-control sample can be regarded as “intentionally missing” data, and provided the sampling fractions are known, valid cohort estimates (including the RR) can be obtained by up-weighting the sample observations using inverse probability weights to “reconstruct” the cohort. It is common for all cases in the cohort to be sampled into the case-control study, and for controls to be matched to cases on one or more characteristics. In such studies, the weight is 1 for the cases and the weights for controls are calculated as the inverse of the sampling fraction of the non-cases within the matching strata. If controls are selected by simple random sampling, the weights are simply the inverse of the overall sampling fraction of non-cases in the cohort.
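The weight calculation described above amounts to a ratio of frequency counts within each matching stratum. A minimal sketch in Python follows; the function name `control_weights` and the stratum counts are hypothetical and serve only to illustrate the arithmetic.

```python
from collections import Counter


def control_weights(cohort_noncases, sampled_controls):
    """Inverse sampling fraction of non-cases within each matching stratum.

    Both arguments are sequences of stratum labels, one label per subject.
    """
    available = Counter(cohort_noncases)
    sampled = Counter(sampled_controls)
    return {stratum: available[stratum] / sampled[stratum] for stratum in sampled}


# Hypothetical cohort with 900 non-cases, of whom 100 are sampled as controls.
cohort_noncases = ["male"] * 360 + ["female"] * 540
sampled_controls = ["male"] * 60 + ["female"] * 40

matched_weights = control_weights(cohort_noncases, sampled_controls)
case_weight = 1.0  # all cases are sampled, so each case represents itself

# Under simple random sampling a single overall weight would be used instead.
overall_weight = len(cohort_noncases) / len(sampled_controls)

# The weighted controls reconstruct the non-cases of the cohort.
reconstructed = sum(matched_weights[s] for s in sampled_controls)
print(matched_weights, overall_weight, reconstructed)
```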
Weighted logistic regression of expanded case-control data
As a direct extension of expanded data logistic regression for estimating the RR in cohort studies, we propose a weighted logistic regression of expanded data from a case-control study. As before, each case in the case-control sample is doubled, but the analysis of the expanded data is conducted with a weighted logistic regression, where the weight of each individual in the expanded data is inherited from the sampling fractions that yielded the original case-control sample. Note that doubling of cases is a part of the analytical approach and does not affect the sampling of case-control data or the calculation of sampling fractions. Using similar arguments as for cohort data [14], we propose a robust sandwich-type SE for the estimate of the β parameter in the logistic regression model, i.e., the estimated ln(RR), and describe it in the next section.
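For a crude analysis, the effect of the weights can be seen without fitting any model. The sketch below, in Python, uses a hypothetical cohort and an idealised case-control sample in which exactly 1 in 7 non-cases is drawn from each exposure group; real samples would vary around these counts.

```python
def odds_ratio(cases_exp, noncases_exp, cases_unexp, noncases_unexp):
    return (cases_exp / noncases_exp) / (cases_unexp / noncases_unexp)


# Hypothetical cohort: 300 exposed (90 cases) and 700 unexposed (70 cases).
cases_exp, cases_unexp = 90, 70
cohort_rr = (cases_exp / 300) / (cases_unexp / 700)
cohort_or = odds_ratio(cases_exp, 210, cases_unexp, 630)

# Hypothetical case-control sample: all cases, and 1 in 7 of the non-cases.
controls_exp, controls_unexp = 30, 90
case_weight, control_weight = 1.0, 7.0

# The OR is preserved in the unexpanded, unweighted sample.
sample_or = odds_ratio(cases_exp, controls_exp, cases_unexp, controls_unexp)


def expanded_or(w_control):
    """OR after doubling the cases; the duplicate records keep the case weight."""
    noncases_exp = w_control * controls_exp + case_weight * cases_exp
    noncases_unexp = w_control * controls_unexp + case_weight * cases_unexp
    return odds_ratio(case_weight * cases_exp, noncases_exp,
                      case_weight * cases_unexp, noncases_unexp)


weighted_expanded_or = expanded_or(control_weight)
unweighted_expanded_or = expanded_or(1.0)
print(cohort_rr, sample_or, weighted_expanded_or, unweighted_expanded_or)
```

In this example the weighted expanded OR equals the cohort RR of 3, whereas doubling the cases without the weights gives about 1.71, which estimates neither the RR nor the OR (about 3.86).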
Robust SE for expanded data weighted logistic regression of case-control data
Consider the analysis of a case-control sample of n subjects drawn from a cohort of size N. Assuming all cases and a simple random sample of controls are included, the sampling weight (denoted by w) of each case in this case-control sample is 1, and for each control it is the number of controls in the cohort divided by the number of sampled controls. For matched case-control samples, the sampling weights for controls are the ratios of available controls to sampled controls within each stratum defined by the matching factors. An unbiased estimate of ln(RR) can be obtained from this case-control sample by using the doubling-of-cases approach, provided the individual sampling weights are incorporated in the analysis. More specifically, the pseudo log-likelihood becomes a weighted pseudo log-likelihood:
\[l_{w} = \sum_{i=1}^{n} w_{i} \left\{ \left[ Y_{i} \log p^{*}_{i} + (1 - Y_{i}) \log (1 - p^{*}_{i}) \right] + Y_{i} \log (1 - p^{*}_{i}) \right\} \qquad (8)\]
which is the log-likelihood corresponding to the weighted logistic regression analysis of the expanded case-control sample. The first-order derivative of \(l_{w}\) is:
\[U_{w} = \sum_{i=1}^{n} w_{i} r^{*}_{i} x_{i} \qquad (9)\]
Following the derivation of Eq. (6) for cohort designs, we propose the following as a robust covariance matrix for the estimates from a weighted analysis:
\[H_{w1}^{-1} H_{w2} H_{w1}^{-1} \qquad (10)\]
where \(H_{w1}^{-1}\) denotes the inverse of the Hessian matrix of \(l_{w}\) and is estimated by the naive covariance matrix from the (weighted) logistic regression of the expanded case-control data, and \(H_{w2}\) is the covariance matrix of \(U_{w}\), estimated by:
\[\hat{H}_{w2} = \sum_{i=1}^{n} w_{i}^{2} \left( \hat {r}^{*}_{i} \right)^{2} x_{i} x_{i}^{T} \qquad (11)\]
Simulation study
To evaluate our proposed estimator and robust SE for the RR from case-control data, we simulated a cohort consisting of N=1000 subjects, where 400 subjects were male (Z=1) and the remainder were female. To generate a confounding effect of sex, the probability of being exposed (X=1) was 0.4 for males and 0.2 for females, and the outcome was generated from the following log-binomial model:
The intercept term was assigned values corresponding to prevalence rates of approximately 10%, 20%, 30% and 40%. We considered true values of RR = 1, 1.25, 1.5 and 2. For each simulated cohort, we implemented four designs: a 1:1 and 1:2 case-control ratio, each with controls selected randomly or matched on sex.
For the simulated cohort data, we estimated the RR using the log-binomial regression model (the true data-generating model), the expanded data logistic regression model, and other simple/naive estimators: the Mantel-Haenszel RR, expanded data Mantel-Haenszel OR, and the naive logistic regression model (where the estimated OR is viewed as an approximation for the RR). The case-control data was analysed by weighted logistic regression of the expanded data and by logistic regression of the original case-control sample. Although an unweighted logistic regression analysis with adjustment for matching factors is valid for estimating the OR of other covariates, we chose to perform a weighted analysis of matched case-control data to also enable valid estimation of the coefficients of the matching factors. The distributions of the estimates from the doubling-of-cases approaches over 2000 simulation cycles under each scenario were examined on boxplots, where they were compared to the estimates from the correct analysis (Mantel-Haenszel RR or log-binomial model) and the naive estimates. The performance of the method was evaluated by averaging the bias, empirical SE and robust SE, and computing the coverage of the (robust) 95% confidence interval, the type I error rate (when the true RR was 1) and power (when the true RR was not 1).
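The performance measures listed above can be computed from the per-cycle estimates and robust SEs. The following Python sketch is one plausible way to do so and is not taken from the authors' code; it assumes the measures are evaluated on the ln(RR) scale with Wald intervals, and the four replicate values are hypothetical.

```python
import math


def summarise(log_rr_estimates, robust_ses, true_log_rr, z=1.96):
    """Bias, empirical SE, mean robust SE, coverage and rejection rate.

    The rejection rate of the Wald test of ln(RR) = 0 is the type I error
    rate when the true RR is 1 and the power otherwise.
    """
    n = len(log_rr_estimates)
    mean_est = sum(log_rr_estimates) / n
    empirical_se = math.sqrt(
        sum((e - mean_est) ** 2 for e in log_rr_estimates) / (n - 1)
    )
    pairs = list(zip(log_rr_estimates, robust_ses))
    covered = sum(abs(e - true_log_rr) <= z * s for e, s in pairs)
    rejected = sum(abs(e) > z * s for e, s in pairs)
    return {
        "bias": mean_est - true_log_rr,
        "empirical_se": empirical_se,
        "mean_robust_se": sum(robust_ses) / n,
        "coverage": covered / n,
        "rejection_rate": rejected / n,
    }


# Hypothetical results from four simulation cycles with a true RR of 1.
summary = summarise([0.10, -0.05, 0.30, 0.02], [0.1, 0.1, 0.1, 0.1], true_log_rr=0.0)
print(summary)
```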
Illustrative example
We analysed risk factors for neonatal jaundice in infants born to Swedish women between 1992 and 2002 [22]. From the singleton live births recorded by the Swedish Medical Birth Register during this calendar period, we excluded infants at risk of neonatal jaundice due to known maternal alloimmunisation or potential alloimmunisation due to a history of transfusion, resulting in 657,264 infants for analysis. In addition to the sex and prematurity of the infant, information was available for maternal age, body mass index (BMI), parity (nulliparous or multiparous) and smoking status. After excluding births with missing information on maternal BMI or smoking status, the final cohort consisted of 547,466 births. Maternal BMI was dichotomised at 25, and maternal age was dichotomised at 35 years. We assessed the association of neonatal jaundice with the six factors described above and the presence of an interaction between preterm birth and parity by analysing the full cohort and a 1:2 case-control sample matched on maternal age and the sex of the infant. The cohort data was analysed using naive logistic regression, log-binomial regression and expanded data logistic regression models. The matched case-control sample was analysed using weighted logistic regression and expanded data weighted logistic regression.
Implementation
All analyses were performed using R (version 4.0.1). We implemented the expanded data (weighted) logistic regression model as an R package named DoublingOfCases (available from: https://github.com/nyilin/DoublingOfCases). The naive logistic regression model was implemented by the glm function with family = binomial(link = "logit"), and for the weighted logistic regression, the inverse sampling weights were specified via the weights option. The log-binomial regression model was implemented by the glm function with family = binomial(link = "log").
Results
Simulation study
For the simulated cohorts, the expanded data Mantel-Haenszel OR and expanded data logistic regression OR performed well, providing estimates similar to the Mantel-Haenszel RR and the log-binomial RR respectively, regardless of the prevalence in the cohort or the true value of the RR (see Figs. 2 and 3A). The bias in the naive OR increased as expected with larger values of RR and prevalence. Simulation scenarios with a prevalence rate of 40% and true RR of 1.5 or 2 approached the boundary of the parameter space of relative risk models, with the maximum event probability close to 0.80 and 0.95 respectively. The log-binomial regression model failed to converge in 2 and 1432 of the 2000 simulation cycles in these two scenarios respectively, but in the cycles where it converged, it provided valid estimates of the RR (see Appendix Table 5 for detailed simulation results).
The “Cohort” column in Fig. 4 summarises the good performance of the expanded data logistic regression estimator in all simulation scenarios, with estimated RR close to the true value, coverage close to 95%, type I error close to 5% and power comparable to that of the log-binomial regression model. The robust SE of the estimated RR from the expanded data logistic regression model was similar to the empirical SE, and similar to the variability from the log-binomial regression model when the latter converged (see Appendix Table 5). The naive logistic regression model had a type I error close to 5% and power comparable with the expanded data logistic regression model in all scenarios, as might be expected. Although the estimated OR was a reasonable approximation to the RR (with small bias and coverage close to 95%) when the exposure had no effect (i.e., when RR = 1) or when the prevalence was low (10%), there was an increase in bias and decrease in coverage with increasing prevalence, especially when estimating a larger RR.
A similar performance was observed for the weighted logistic regression and expanded data weighted logistic regression models when applied to case-control data. Figure 3B presents the distributions of the RR estimates from 1:1 and 1:2 matched case-control studies, and the performance in terms of bias, coverage, type I error and power is illustrated for random and matched 1:1 sampling in the second and third columns of Fig. 4 (details in Appendix Table 6).
Illustrative example
A total of 21,441 (3.9%) of the infants in the cohort of 547,466 births were diagnosed with neonatal jaundice. The majority of these cases were first-born infants, with only 3148 born to multiparous mothers. The crude OR associated with preterm birth was 28.0 but the crude RR was only 16.6, and the stronger association among multiparous mothers (crude OR=32.2 and crude RR=20.4) compared to nulliparous mothers (crude OR=23.4 and crude RR=13.1) suggested a possible interaction effect between these two factors. The large difference between the crude OR and RR suggests that the adjusted OR estimated from a naive logistic regression analysis would not be a reasonable approximation to the adjusted RR.
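The gap between the crude OR and the crude RR follows from the risks in the two exposure groups: for a 2×2 table, OR = RR × (1 − p0) / (1 − RR × p0), where p0 is the risk among the unexposed. The Python sketch below applies this identity to the rounded crude values quoted above; the implied risks are therefore approximate and are not figures reported in the study.

```python
def odds_ratio_from_rr(rr, p0):
    """Crude OR implied by a crude RR and the risk p0 among the unexposed."""
    p1 = rr * p0
    return (p1 / (1 - p1)) / (p0 / (1 - p0))


def implied_baseline_risk(odds_ratio, rr):
    """Risk among the unexposed implied by a crude OR and RR (inverse of the above)."""
    return (odds_ratio - rr) / (rr * (odds_ratio - 1))


# Rounded crude estimates for preterm birth quoted in the text.
crude_or, crude_rr = 28.0, 16.6
p0 = implied_baseline_risk(crude_or, crude_rr)
p1 = crude_rr * p0
print(f"implied risk: term = {p0:.3f}, preterm = {p1:.3f}")
```

Under these rounded inputs the implied risk is roughly 2.5% among term infants and roughly 42% among preterm infants. The outcome is rare overall, but it is common in the exposed group, which is why the OR departs so far from the RR.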
The log-binomial and expanded data logistic regression models provided similar estimates for the association of neonatal jaundice with each of the factors studied, except for a somewhat larger estimate for the association with overweight from the expanded data logistic regression model. Both models identified premature delivery as a strong risk factor for neonatal jaundice, with an estimated relative risk of approximately 13-fold among nulliparous mothers and 20-fold among multiparous mothers (see Table 3). Compared to infants of mothers with maternal BMI below 25, infants of overweight mothers (BMI ≥ 25) had an approximately 20%–26% higher risk of neonatal jaundice. Multiparity was associated with a decreased risk. Despite the low prevalence of the outcome in this population, the OR from a naive logistic regression model considerably overestimated the RR for preterm birth, almost by a factor of 2 for nulliparous mothers and 1.5 for multiparous mothers. Similar estimates were obtained by analysing the matched case-control sample using the weighted logistic and expanded data weighted logistic regression models, by incorporating the sampling weights (see Table 4).
Discussion
Despite the attractive properties of the RR, there has been wide adoption of methods for estimating the OR, due in part to its mathematical and statistical properties, such as the reciprocity with respect to the choice of reference group [23] and the avoidance of predicted probabilities greater than 1. But the OR also has some unattractive properties not shared by the RR. Although the Mantel-Haenszel RR can be computed for simple tabular data, the more general log-binomial regression model for estimating an adjusted RR is not as widely known as the corresponding logistic regression model for estimating an adjusted OR. As a result of this familiarity, and the straightforward interpretation and ease of communication of the RR, investigators often present an adjusted OR as an approximation to the adjusted RR for rare outcomes. This has been further encouraged by articles in the medical literature that continue to present the OR as a feature of the case-control study design [24, 25], although there are methods of sampling that offer estimates of RR [26]. In addition to non-rare outcomes, there are other situations where the RR estimate may be useful or more appropriate, and a recent tutorial article on best practice encourages researchers to examine their results in more than one way [5] when there are valid alternatives. We have provided such an alternative, the doubling-of-cases approach, that is intuitively appealing and utilises the familiar logistic regression model after a simple modification to the data. Although it has been known for some time that this method provides a valid adjusted RR from cohort or cross-sectional data, the standard error of the estimate is not available from statistical software packages. In this paper, we first provided an introduction to this method in the context of cohort or cross-sectional data, and then extended the approach to data collected in a case-control design, deriving a robust estimate of the standard error.
In contrast to the optional use of weighted logistic regression to improve precision or enable estimation of coefficients of matching factors [16, 27], a weighted analysis is required for valid estimation of a RR from case-control data. Where the case-control study has been implemented in a well-defined population or cohort, these weights are easily available from simple frequency distributions. To make the method accessible to data analysts, we have implemented it as an R package (available from https://github.com/nyilin/DoublingOfCases) that seamlessly estimates adjusted RRs from cohort, cross-sectional and case-control studies.
Using simulated data, we demonstrated that the expanded data weighted logistic regression of a case-control sample, with or without matching, produced similar estimates to the adjusted RR estimated from the full cohort. Our simulation studies also demonstrated the overestimation of a RR by the OR from a simple logistic regression model, even when the outcome is rare, especially for strong effects. In contrast, the weighted logistic regression model of the expanded data generated valid estimates for the RR, even for common outcomes. Our proposed robust SE for the RR estimated from case-control data performed well in estimating the variability of the adjusted RR.
In an application to neonatal jaundice, we found a positive association with preterm birth (which was stronger among multiparous mothers) and maternal overweight, and a negative association with multiparity, consistent with the literature [22, 28]. Although it is often assumed that the OR is a reasonable approximation for the RR when studying a rare outcome, this example demonstrated that the OR can considerably overestimate the RR of a rare event when assessing a very strong association: although the prevalence of the outcome (neonatal jaundice) in the cohort was only 3.9%, the adjusted ORs for preterm (23.5 and 32.5 among nulliparous and multiparous mothers, respectively) were considerably larger than the adjusted RRs estimated from log-binomial regression (12.9 and 20.1) or expanded data logistic regression (13.0 and 20.4).
In our simulation study, we encountered a practical difficulty that is known in the implementation of log-binomial regression models in statistical software packages: the algorithm may fail to converge. In our simulation scenario of moderate effect (true RR=2) and high prevalence (40%), the log-binomial regression failed to identify valid starting values for the coefficients in more than 70% of the iterations. While this could be resolved by using crude RR estimates as starting values (data not shown), such issues may not be so easily overcome in practice. For example, Deddens and Petersen [29] created a simple numerical example with outcome Y=(0,0,0,0,1,0,1,1,1,1) and a single exposure X=(1,2,3,4,5,6,7,8,9,10), where the R implementation (via the glm function) failed to converge even when the true estimates were used as starting values. This difficulty, and sometimes inability, to reach convergence in maximising the likelihood of the log-binomial regression model, has been widely discussed in the literature [12, 15], and a computationally expensive approach to alleviate the problem has been made available in SAS [30]. An alternative approach that avoids convergence issues when estimating the RR is the Poisson regression model (with robust SE), which has a similar good performance to that of expanded data logistic regression when applied to cohort data [12, 15], or to case-control data that incorporates sampling weights (see Fig. 5 in Appendix). The Poisson regression model approximates the binomial distribution of the binary outcome using a Poisson distribution, whose statistical properties may not be familiar to many applied data analysts, making them reluctant to embark on such an analysis.
In contrast, the doubling-of-cases approach is easily accessible as it leverages the simple equivalence between the RR from the original data and the OR from the expanded data that is common to crude and adjusted analyses, and uses one of the most common analytical tools in epidemiology, the logistic regression model.
A potential practical limitation of the doubling-of-cases method for matched case-control data is that it is necessary to know the sampling fractions within the matching strata, as these are needed to enable the analysis to ‘reconstruct’ the background population/cohort from the case-control sample. The availability of this information will depend on whether the case-control study was conducted within a well-defined population, the nature and extent of the matching factors and the available data resources. Where a study is conducted using national or regional health registers, and cases and controls matched on basic demographic data (such as sex and age category), then the necessary information will be available from population statistics offices. The sampling fractions will also be known for studies that identify cases and controls from electronic medical records. However, the necessary data may be difficult or impossible to obtain for case-control studies that are implemented in the course of clinical work in low-resource settings with limited data infrastructure.
Another limitation of the doubling-of-cases approach, in common with Poisson regression, is the potential bias in the estimated ln(RR) when some subjects have estimated probabilities greater than or equal to 1. In the small numerical example from Deddens and Petersen [29] mentioned above, both the Poisson regression and the expanded data logistic regression had estimated probabilities larger than 1 for the 9th and 10th observations and both methods overestimated the RR to some extent: compared to the correct estimate (with 95% CI) of 1.23 (1.01 - 1.51), the Poisson regression with robust SE estimated the RR as 1.38 (1.13 - 1.70) and the expanded data logistic regression estimate was 1.44 (1.14 - 1.82). Although our illustrative example did not have large estimated probabilities (maximum 0.71), RR estimates are also known to be potentially biased when estimating a strong association with exposure [15], as occurred in the expanded data logistic regression analysis of the very strong association of prematurity with neonatal jaundice. Although the doubling-of-cases approach may result in some bias in the estimates of the RR in such settings, it can still be used by data analysts as a simple first approach. Large estimated probabilities may suggest that the log-linear assumption is inadequate, in which case the regression analysis should consider transformations of continuous covariates and/or interactions between covariates to more appropriately model the underlying data-generating mechanism.
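The small numerical example from Deddens and Petersen can be worked through directly. The sketch below is written in Python rather than with the DoublingOfCases package; it fits the expanded data logistic regression by Newton-Raphson and forms a plain sandwich SE from the summed residuals of each original subject. The function names are hypothetical, and the sandwich estimator is assumed to carry no finite-sample correction, so the resulting interval may differ slightly from the one quoted above.

```python
import math

# Data of the numerical example quoted in the text.
Y = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]


def accumulate(x, y, a, b):
    """Score, information and residual cross-products at (a, b).

    Subject i contributes r_i * (1, x_i) to the score, where
    r_i = y_i - (1 + y_i) * p_i sums the residuals of its one or two
    records in the expanded data.
    """
    u = [0.0, 0.0]
    h = [0.0, 0.0, 0.0]  # information: h00, h01, h11
    m = [0.0, 0.0, 0.0]  # sum of r^2 * x x^T: m00, m01, m11
    for xi, yi in zip(x, y):
        p = 1 / (1 + math.exp(-(a + b * xi)))
        r = yi - (1 + yi) * p
        v = (1 + yi) * p * (1 - p)
        u[0] += r
        u[1] += r * xi
        for k, f in enumerate((1, xi, xi * xi)):
            h[k] += v * f
            m[k] += r * r * f
    return u, h, m


def fit(x, y, iterations=50):
    """Newton-Raphson for the logistic regression of the expanded data."""
    a = b = 0.0
    for _ in range(iterations):
        u, h, _ = accumulate(x, y, a, b)
        det = h[0] * h[2] - h[1] ** 2
        a += (h[2] * u[0] - h[1] * u[1]) / det
        b += (h[0] * u[1] - h[1] * u[0]) / det
    return a, b


a_hat, b_hat = fit(X, Y)
_, h, m = accumulate(X, Y, a_hat, b_hat)
det = h[0] * h[2] - h[1] ** 2
i10, i11 = -h[1] / det, h[0] / det  # slope row of the inverse information
naive_se = math.sqrt(i11)
robust_se = math.sqrt(i10 ** 2 * m[0] + 2 * i10 * i11 * m[1] + i11 ** 2 * m[2])

rr = math.exp(b_hat)
fitted_risk = [math.exp(a_hat + b_hat * xi) for xi in X]
print(f"RR = {rr:.2f}, naive SE = {naive_se:.3f}, robust SE = {robust_se:.3f}")
```

The fitted odds in the expanded data play the role of the estimated risk in the original data, which is how estimated probabilities above 1 can arise for the largest values of X.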
Conclusions
As a result of the method presented in this paper and the provision of a software package for its implementation, investigators can choose whether to report an adjusted OR or RR, or both, regardless of the study design. The method offers a simple and formal way of justifying the reporting of an adjusted OR as an approximate RR, regardless of the prevalence. Another important advantage is that it facilitates the comparison of findings to published RRs and the inclusion of estimates in meta-analyses that may be challenged by the mixed reporting of OR and RR.
Appendix
Tables 5 and 6 present the detailed simulation results that were visualised in Fig. 4.
Figure 5 presents the results of a supplemental simulation study, in which each simulated cohort was analysed using Poisson regression, and each simple and matched case-control sample using weighted Poisson regression with inverse probability weighting. As illustrated in Fig. 5, the performance of the (weighted) Poisson regression was comparable with that of the doubling-of-cases approach in all scenarios investigated.
Availability of data and materials
Data sharing is not applicable to this article as no new data were created or analyzed in this study. The R package created for this application, named DoublingOfCases, is available for download from: https://github.com/nyilin/DoublingOfCases.
Abbreviations
BMI: Body Mass Index
OR: Odds Ratio
RR: Relative Risk
SE: Standard Error
References
1. Nurminen M. To use or not to use the odds ratio in epidemiologic analyses? Eur J Epidemiol. 1995; 11:365–71. https://doi.org/10.1007/BF01721219.
2. Tamhane A, Westfall A, Burkholder G, Cutter G. Prevalence odds ratio versus prevalence ratio: choice comes with consequences. Stat Med. 2016; 35(30):5730–35. https://doi.org/10.1002/sim.7059.
3. Greenland S. Noncollapsibility, confounding, and sparse-data bias. Part 1: The oddities of odds. J Clin Epidemiol. 2021; 138:178–81. https://doi.org/10.1016/j.jclinepi.2021.06.007.
4. Norton E, Dowd B. Log odds and the interpretation of logit models. Health Serv Res. 2018; 53(2):859–78. https://doi.org/10.1111/1475-6773.12712.
5. Norton E, Dowd B, Maciejewski M. Odds ratios - current best practice and use. JAMA. 2018; 320(1):84–85. https://doi.org/10.1001/jama.2018.6971.
6. Chatterjee A, Woodruff H, Wu G, Lambin P. Limitations of only reporting the odds ratio in the age of precision medicine: A deterministic simulation study. Front Med (Lausanne). 2021; 8(640854). https://doi.org/10.3389/fmed.2021.640854.
7. Gallis J, Turner E. Relative measures of association for binary outcomes: Challenges and recommendations for the global health researcher. Ann Glob Health. 2019; 85(1):1–12. https://doi.org/10.5334/aogh.2581.
8. Robbins A, Chao S, Fonseca V. What’s the relative risk? A method to directly estimate risk ratios in cohort studies of common outcomes. Ann Epidemiol. 2002; 12:452–4. https://doi.org/10.1016/s1047-2797(01)00278-2.
9. McNutt LA, Wu C, Xue X, Hafner J. Estimating the relative risk in cohort studies and clinical trials of common outcomes. Am J Epidemiol. 2003; 157:940–43. https://doi.org/10.1093/aje/kwg074.
10. Lee J, Chia K. Estimation of prevalence rate ratios for cross sectional data: an example in occupational epidemiology. Br J Ind Med. 1993; 50:861–64.
11. Zhang J, Yu K. What’s the relative risk? A method of correcting the odds ratio in cohort studies of common outcomes. JAMA. 1998; 280(19):1690–1. https://doi.org/10.1001/jama.280.19.1690.
12. Knol M, Le Cessie S, Algra A, Vandenbroucke J, Groenwold R. Overestimation of risk ratios by odds ratios in trials and cohort studies: alternatives to logistic regression. Can Med Assoc J. 2012; 184(8):895–99. https://doi.org/10.1503/cmaj.101715.
13. Miettinen O. Design options in epidemiologic research. An update. Scand J Work Environ Health. 1982; 8:7–14.
14. Schouten E, Dekker J, Kok F, Le Cessie S, Van Houwelingen H, Pool J, Vandenbroucke J. Risk ratio and rate ratio estimation in case-cohort designs: Hypertension and cardiovascular mortality. Stat Med. 1993; 12(18):1733–45. https://doi.org/10.1002/sim.4780121808.
15. Blizzard L, Hosmer D. Parameter estimation and goodness-of-fit in log binomial regression. Biom J. 2006; 48(1):5–22. https://doi.org/10.1002/bimj.200410165.
16. Reilly M, Torrang A, Klint A. Reuse of case-control data for analysis of new outcome variables. Stat Med. 2005; 24:4009–19. https://doi.org/10.1002/sim.2398.
17. Deeks J, Altman D, Bradburn M. Statistical Methods for Examining Heterogeneity and Combining Results from Several Studies in Meta-Analysis: John Wiley & Sons, Ltd; 2008, pp. 285–312. Chap. 15. https://doi.org/10.1002/9780470693926.ch15.
18. Deddens J, Petersen M. Approaches for estimating prevalence ratios. Occup Environ Med. 2008; 65:501–6. https://doi.org/10.1136/oem.2007.034777.
19. Wacholder S. Binomial regression in GLIM, estimating risk ratios and risk differences. Am J Epidemiol. 1986; 123:174–84.
20. Zocchetti C, Consonni D, Bertazzi P. Re: Estimation of prevalence rate ratios from cross-sectional data (letter). Int J Epidemiol. 1995; 24:1064–105.
21. Skov T, Deddens J, Petersen M, Endahl L. Prevalence proportion ratios: estimation and hypothesis testing. Int J Epidemiol. 1998; 27:91–95.
22. Lee B, Le Ray I, Sun J, Wikman A, Reilly M, Johansson S. Haemolytic and nonhaemolytic neonatal jaundice have different risk factor profiles. Acta Paediatr. 2016; 105(12):1444–50. https://doi.org/10.1111/apa.13470.
23. Sonis J. Odds ratios vs risk ratios. JAMA. 2018; 320(19):2041. https://doi.org/10.1001/jama.2018.14417.
24. Irony T. Case-control studies: Using “real-world” evidence to assess association. JAMA. 2018; 320:1027–28. https://doi.org/10.1001/jama.2018.12115.
25. Dupepe E, Kicielinski K, Gordon A, Walters B. What is a case-control study? Neurosurgery. 2019; 84(4):819–26. https://doi.org/10.1093/neuros/nyy590.
26. Blakely T, Pearce N, Lynch J. Case-control studies. JAMA. 2019; 321(8):806–07. https://doi.org/10.1001/jama.2018.20253.
27. Reilly M, Pepe M. A mean score method for missing and auxiliary covariate data in regression models. Biometrika. 1995; 82(2):299–314.
28. Norman M, Åberg K, Holmsten K, Weibel V, Ekéus C. Predicting nonhemolytic neonatal hyperbilirubinemia. Pediatrics. 2015; 136(6):1087–94. https://doi.org/10.1542/peds.2015-2001.
29. Deddens J, Petersen M. Re: ‘estimating the relative risk in cohort studies and clinical trials of common outcomes’. Am J Epidemiol. 2004; 159:213–15.
30. Deddens J, Petersen M, Lei X. Estimation of prevalence ratios when proc genmod does not converge. In: Proceedings of the 28th Annual SAS Users Group International Conference (March 30–April 2): 2003. http://www2.sas.com/proceedings/sugi28/270-28.pdf. Accessed 25 May 2022.
Acknowledgments
We thank Jay Achar for fruitful discussions of this work.
Funding
This work was supported by Cancerfonden (the Swedish Cancer Society) contract number 16 0497. The funding body had no role in the study design, data analysis and interpretation, or writing the manuscript. Open access funding provided by Karolinska Institute.
Author information
Contributions
MR conceived the study. AL and YN conducted the simulations and data analyses, and YN developed the R code. YN drafted the manuscript, which was reviewed and revised by all authors. All authors approved the final version of the manuscript for submission.
Ethics declarations
Ethics approval and Consent to participate
N/A, no new data were used for this study.
Consent for publication
N/A.
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Ning, Y., Lam, A. & Reilly, M. Estimating risk ratio from any standard epidemiological design by doubling the cases. BMC Med Res Methodol 22, 157 (2022). https://doi.org/10.1186/s12874-022-01636-3
DOI: https://doi.org/10.1186/s12874-022-01636-3
Keywords
 Doubling-of-cases
 Expanded data logistic regression
 Log-binomial regression
 Poisson regression
 Relative risk
 Weighted analysis