A compelling demonstration of why traditional statistical regression models cannot be used to identify risk factors from case data on infectious diseases: a simulation study
BMC Medical Research Methodology volume 22, Article number: 146 (2022)
Abstract
Background
Regression models are often used to explain the relative risk of infectious diseases among groups. For example, overrepresentation of immigrants among COVID-19 cases has been found in multiple countries. Several studies apply regression models to investigate whether different risk factors can explain this overrepresentation among immigrants without considering dependence between the cases.
Methods
We study the appropriateness of traditional statistical regression methods for identifying risk factors for infectious diseases by means of a simulation study. We model infectious disease spread by a simple, population-structured version of an SIR (susceptible-infected-recovered) model, which is one of the most famous and well-established models for infectious disease spread. The population is thus divided into different subgroups. We vary the contact structure between the subgroups of the population. We analyse the relation between individual-level risk of infection and group-level relative risk. We analyse whether Poisson regression estimators can capture the true, underlying parameters of transmission. We assess both the quantitative and qualitative accuracy of the estimated regression coefficients.
Results
We illustrate that there is no clear relationship between differences in individual characteristics and group-level overrepresentation: small differences on the individual level can result in arbitrarily high overrepresentation. We demonstrate that individual risk of infection cannot be properly defined without simultaneous specification of the infection level of the population. We argue that the estimated regression coefficients are not interpretable and show that it is not possible to adjust for other variables by standard regression methods. Finally, we illustrate that regression models can result in the significance of variables unrelated to infection risk in the constructed simulation example (e.g. ethnicity), particularly when a large proportion of contacts is within the same group.
Conclusions
Traditional regression models which are valid for modelling risk between groups for noncommunicable diseases are not valid for infectious diseases. By applying such methods to identify risk factors of infectious diseases, one risks ending up with wrong conclusions. Output from such analyses should therefore be treated with great caution.
Background
Identifying overrepresented groups in infectious disease case statistics is important to guide targeted interventions. Previous studies have shown that interventions are most effective if they are targeted towards high-risk groups [1, 2]. If the elevated risk of infection can be attributed to intervenable causes, then targeted interventions can be implemented to eliminate these causes as part of the mitigation process.
During the COVID-19 outbreak, many studies have investigated potential risk factors for infection by using traditional statistical methods on data of the occurrence of infection in different groups [3,4,5,6,7,8,9,10,11]. As a motivating example in this article, we will consider studies investigating overrepresentation of foreign-born, immigrants, and certain ethnic minorities among individuals infected with COVID-19 [3,4,5,6,7,8,9,10]. Some suggested explanations include that these individuals are disproportionately overrepresented in specific groups of the population with higher risk of infection; they typically live in more crowded households, have lower socioeconomic status, have less access to health care and insurance, are at elevated risk for other underlying diseases, and are overrepresented in occupations with high exposure [3, 4, 12, 13].
To understand whether overrepresentation in such individual risk factors can explain the overrepresentation in cases, different studies have applied statistical regression models [3,4,5,6,7,8,9]. Common to these studies is that they find an effect of ethnicity or country of birth, even after adjusting for confounding/mediating factors associated with an elevated risk of infection, like socioeconomic status and household size. In these studies, the research question is often framed in terms of the direct effect of ethnicity on infection; the authors therefore want to control for known mediators and confounders. In our example, we thus also consider a mediating variable, but the numerical results would be identical in a situation where one would control for a confounding variable instead.
Traditional statistical regression methods can be applied to identify risk factors associated with different medical conditions, which in turn can provide insights into identifying potential causes for the condition. These methods are, as we will show, not in general suitable for infectious diseases. Infectious diseases differ from noncommunicable diseases as there is no simple relationship between an increased risk of infection and the number of individuals infected.
Another related problem is that data from infectious diseases violate the crucial assumption for regression models of independent observations. This dependence is not easy to adjust for through, for example, cluster or time series analyses. Because transmissible diseases are acquired through contacts, the contact pattern and social network are the most important explanatory variables for infectious diseases [14]. Hence it is not only your individual risk factors that are important for your risk of infection, but also the properties/risk factors of your social network. This study will separate between individual risk factors that only include covariates related to the individual, and properties related to the social network. We will refer to the first as individual risk factors, and the latter as properties/risk factors of the individual’s contacts. As the disease outcome on an individual directly depends on the outcomes (and thus exposures) of other individuals in the population, we can both have direct effects on the individual due to their individual risk factors, and indirect effects of different exposure variables through the population due to the risk factors of the individual’s contacts [15,16,17]. In this study, we focus on estimating direct effects of exposures.
In this study, we will employ state-of-the-art statistical methods to analyse data on communicable diseases by standard regression methods. By standard/traditional regression methods, we refer to readily available methods in statistical software which account for neither the detailed contact structure nor the dynamics of disease transmission. Though we refer to the methods as regression methods, other statistical techniques like statistical tests and ANOVA analyses are subject to the same problems if the contact structure and transmission dynamics are not considered.
One of the key reasons for using regression models is to adjust relationships for confounders or mediators, allowing estimates of direct effects of an exposure. In studies addressing the causes of why ethnic minorities are disproportionately affected by coronavirus disease, this is a main aim. In regression models, one can estimate the effect of a change in a variable on another variable, adjusted for other variables, in terms of the regression coefficient. Hence, by interpreting the regression coefficient, one can answer questions like: what would the risk of becoming sick in group A be compared to group B if the factors C, which are differentially represented in the groups, had been equally distributed? For infectious diseases, adjusting for potential confounders can be even more problematic than investigating univariate relationships because kinships, households, and social and cultural structures shape human-to-human contact patterns, and hence the infection dynamics.
Assortative mixing, meaning a preference for individuals to have contacts with others that share characteristics or origins, is common in social networks. For example, this has been shown for traits like gender, age, occupation, religion, obesity, smoking, number of contacts, happiness, and negative vaccine sentiments [18,19,20,21,22,23]. Importantly, preferential mixing by ethnicity and immigrant status is well documented [18, 24, 25]. Hence, since certain immigrants/ethnic minorities belong to a high-risk group for infection, and typically have strong social ties within their group, individuals from these groups can be at higher risk of infection, even if they exhibit low-risk characteristics at the individual level.
In this study, we have constructed a simulation experiment to investigate the consequences of applying traditional statistical regression methods to analyse individual risk of infectious diseases. We use a simple, population-structured SIR model [1] (susceptible-infectious-recovered), a general and well-established model for the spread of respiratory infections with acquired immunity, assuming random mixing within the population groups. We show that both point estimates and significance tests from regression models are likely to be wrong when applied to infectious diseases, and that it is necessary to take the data generation process into account. Although this study is motivated by the studies on risk factors for COVID-19 among immigrants, our results and conclusions are more general and relevant in other settings, specifically when the contact pattern is assortative.
Methods
Framework
We consider analyses where the goal is to use observed counts or prevalence of infection in various groups to learn about the underlying parameters of disease transmission. In this paper we analyse a simplified SIR model where we divide the population first into two and then into four subgroups. The two-group setting is the most parsimonious setting considered and is used to illustrate the relationship between individual and group-level risks, whether one can use a fitted regression model for contrafactual predictions to generalise to other group sizes, and the dependence between the incidence ratio in the two groups. The four-group setting is motivated by the studies of ethnicity and COVID-19 and is used to study the estimated regression coefficients from observational data.
We assume that we have data on the number of infected and total population sizes for each subgroup (individual-level data will give similar results) at timepoint t. In this simplified setup, either four or sixteen key parameters control the transmission dynamics between the two or four subgroups, respectively. These parameters, \({\beta }_{ij}\), typically derived from contact studies, are summarised in a Who-Acquires-Infection-from-Whom matrix as usual in infectious disease modelling, where \({\beta }_{ij}\) is the rate of transmission from an individual in group \(j\) to an individual in group \(i\). We assume here that all groups have the same duration of infectiousness.
An important distinction is that “high-risk” can in this setup be due to at least three different factors that alone or in combination can explain an overrepresentation of cases in one of the subgroups. We can decompose each \({\beta }_{ij}\) as \({\beta }_{ij}={inf}_{j}\times {c}_{ij}\times {susc}_{i}\), where \({inf}_{j}\) is the infectivity of group \(j\), \({c}_{ij}\) is the number of contacts group \(i\) has with group \(j\) per time unit, and \({susc}_{i}\) is the susceptibility of group \(i\). Hence, we in general expect higher incidence in a group with higher susceptibility, or if the group has overall more contacts. We also expect higher incidence in a group with higher infectivity, if there are more contacts within than between groups. These three different potential causes of increased infection prevalence may have different effects on the overall disease dynamics in the population. In this study, we assume for simplicity that \({inf}_{j}=1\) in all simulations.
We define \({c}_{ij}\) as the total contact rate between groups \(i\) and \(j\), hence depending on the population sizes in the groups. It is easier to specify our scenarios in terms of \({p}_{ij}\), which is the relative frequency of contacts between individuals in groups \(i\) and \(j\). We specify a situation in which individuals have twice as many contacts within their group as between groups by \({p}_{ij}=\left(\begin{array}{cc}2/3& 1/3\\ 1/3& 2/3\end{array}\right).\)
This matrix is related to \({c}_{ij}\) as follows: \({c}_{ij}=\frac{{C}_{i}\,{p}_{ij}}{\sum_{k}{p}_{ik}{N}_{k}/N},\)
where \({C}_{i}\) is the total relative contact rate for group \(i\). Hence, the \({c}_{ij}\) are calculated from the \({p}_{ij}\) in a way that ensures that the total number of contacts per time unit for everyone in group \(i\) is given by \({C}_{i}\). If all the groups have the same number of contacts, we have \({C}_{i}=1\). In all settings except the last one (defined as case 4), we use \({C}_{i}=1\), such that all groups have the same total contact rate.
In all simulations, we start with 0.1% infected in each group as our initial conditions.
Simulated population
Two subgroups
We simulate data in a population with \(N=100 000\) citizens. We first split the population into two subgroups \(\mathrm{A}\) and \(\mathrm{B}\), and we assume a higher susceptibility in group \(\mathrm{B}\) than in group \(\mathrm{A}\), such that \(sus{c}_{B}=a\cdot {susc}_{A}\), where \(a\ge 1\). We let \({N}_{A}=90 000\) and \({N}_{B}=10 000\) be the number of individuals in groups \(\mathrm{A}\) and \(\mathrm{B}\), respectively. The relative contact matrix is defined by
\(\left(\begin{array}{cc}{p}_{AA}& {p}_{AB}\\ {p}_{BA}& {p}_{BB}\end{array}\right),\) where \({p}_{ij}\), \(i,j\in \{A,B\}\), is the relative number of contacts group \(i\) has with individuals in group \(j\).
As an example of a contact structure in this population, let \({C}_{A}={C}_{B}=1, {p}_{AB}={p}_{BA}=1/3, {p}_{AA}={p}_{BB}=2/3\). This corresponds to a setting where all individuals have the same total number of contacts, but twice as many contacts within their own subgroup. Plugging in the quantities we get \({c}_{AA}=1.05, {c}_{AB}=0.53, {c}_{BA}=0.91, {c}_{BB}=1.82.\)
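The conversion from relative contact frequencies to contact rates can be checked numerically. The sketch below assumes that each row of \({p}_{ij}\) is rescaled so that the population-weighted contact rate of group \(i\) equals \({C}_{i}\); this normalisation is inferred from the numbers quoted above rather than taken from the authors' code, and the function name `contact_rates` is hypothetical.

```python
def contact_rates(p, group_sizes, total_rates):
    """Rescale relative contact frequencies p[i][j] into contact rates c[i][j].

    Assumed normalisation: sum_j c[i][j] * N_j / N == C_i for every group i.
    """
    n_total = sum(group_sizes)
    fractions = [n / n_total for n in group_sizes]
    c = []
    for row, c_i in zip(p, total_rates):
        # Population-weighted sum of the relative frequencies in this row.
        weighted = sum(p_ij * f for p_ij, f in zip(row, fractions))
        c.append([c_i * p_ij / weighted for p_ij in row])
    return c


# Two-group example: twice as many contacts within as between groups.
p = [[2 / 3, 1 / 3], [1 / 3, 2 / 3]]
c = contact_rates(p, group_sizes=[90_000, 10_000], total_rates=[1.0, 1.0])
for row in c:
    print([round(value, 2) for value in row])
```

Under this assumption the rounded output matches the four contact rates given for the two-group example, which suggests the normalisation is at least consistent with the text.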
Cases
We simulate two cases in the two-group setting, case 1 and case 2.
Case 1: We let \(a=1.2\). We vary the basic reproduction number \({R}_{0}\) (or, equivalently, \({susc}_{A}\)) and the timepoint for which we compare the outcome (T). We assume no contact between the groups, so \(\left(\begin{array}{cc}{p}_{AA}& {p}_{AB}\\ {p}_{BA}& {p}_{BB}\end{array}\right)=\left(\begin{array}{cc}1& 0\\ 0& 1\end{array}\right)\). We compute the incidence rate ratio (hereafter denoted as relative risk) resulting from the simulations, that is, the proportion infected in subgroup \(B\) divided by the proportion infected in subgroup \(A\). For noncommunicable diseases, one would expect a one-to-one relationship between individual risk of disease and relative risk, i.e. a relative risk of 1.2.
Case 2: We set \({susc}_{A}\) such that for \(a=1\), \({R}_{0}=0.9\), and then vary \(a\). See supplementary material for the expression for \({R}_{0}\). We assume random mixing between the groups, so \(\left(\begin{array}{cc}{p}_{AA}& {p}_{AB}\\ {p}_{BA}& {p}_{BB}\end{array}\right)=\left(\begin{array}{cc}1/2& 1/2\\ 1/2& 1/2\end{array}\right)\). We study the results for time point T = 200 days, which for most parameter choices in the paper corresponds to after the epidemic has burnt out. We compare two different outcomes:

A) We investigate whether the predictions of a Poisson regression model (see Sect. Poisson regression model) fitted to simulated data on the number infected in each group from a setting with a difference between the two groups can be used to predict the total number of infections in the setting with no difference between the groups. Specifically, the fitted regression model is used to predict the proportion infected in the contrafactual scenario where the entire population belongs to the low-risk group \(A\), that is, \({N}_{A}=100 000, {N}_{B}=0\). We compare this prediction with the simulations when the whole population is in \(A\). For noncommunicable diseases, one would expect no discrepancy between the simulations and the predictions from the fitted model.

B) We plot the proportion infected in the low-risk group when we vary the susceptibility in the high-risk group. If individual risks of infection only depended on individual characteristics, one would expect the proportion infected in the low-risk group to be independent of the properties of the high-risk group.
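Why a 20% difference in susceptibility need not give a relative risk of 1.2 in case 1 can be illustrated without simulation. The sketch below is a deterministic illustration, not the authors' stochastic model: it uses the standard final-size relation \(z=1-{e}^{-{R}_{0}z}\) for a closed, randomly mixing SIR epidemic with a negligible initial seed, treats the two non-interacting groups as separate epidemics, and assumes the reproduction number in group \(B\) is \(a\) times that in group \(A\). The function names are hypothetical.

```python
import math


def final_size(r0, iterations=200):
    """Positive root z of z = 1 - exp(-r0 * z), found by bisection; 0 if r0 <= 1."""
    if r0 <= 1.0:
        return 0.0
    low, high = 1e-9, 1.0
    for _ in range(iterations):
        mid = (low + high) / 2
        # f(z) = z - 1 + exp(-r0 z) is negative below the root, positive above.
        if mid - 1 + math.exp(-r0 * mid) < 0:
            low = mid
        else:
            high = mid
    return (low + high) / 2


def relative_risk(r0_a, a=1.2):
    """Ratio of final proportions infected in group B versus group A."""
    z_a = final_size(r0_a)
    z_b = final_size(a * r0_a)
    return math.inf if z_a == 0.0 else z_b / z_a


for r0_a in (1.05, 1.5, 5.0):
    print(r0_a, relative_risk(r0_a))
```

In this simplified setting the ratio is far above 1.2 when \({R}_{0}\) is just above 1 and approaches 1 when \({R}_{0}\) is large, because nearly everyone is infected in both groups.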
Four subgroups
We split the population into four groups, \({A}_{h}\), \({A}_{l}\), \({B}_{h}\), and \({B}_{l}\), inspired by the recent analyses of ethnicity and risk of COVID-19 infection. We assume two ethnicity groups \(A\) and \(B\), and that one ethnicity group (\(B\)) is disproportionately represented in a high-risk group. Hence, the two ethnicity groups are divided into two risk groups, with one risk group (\(h\)) having a higher risk than the other (\(l\)). This could for example represent a high-risk occupation. As before, we let \({N}_{A}=90 000\) individuals, and \({N}_{B}=10 000\) individuals. We further assume that 10% and 50% of the individuals in ethnicity groups \(A\) and \(B\) are in the high-risk group, respectively. Hence, we let group \({A}_{h}\) be the \({N}_{{A}_{h}}=9 000\) individuals in ethnicity group \(A\) with high individual risk, group \({A}_{l}\) be the \({N}_{{A}_{l}}=81 000\) individuals in ethnicity group \(A\) with low individual risk, group \({B}_{h}\) be the \({N}_{{B}_{h}}=5 000\) individuals in ethnicity group \(B\) with high individual risk, and finally \({B}_{l}\) be the \({N}_{{B}_{l}}=5 000\) individuals in ethnicity group \(B\) with low individual risk.
The contact matrix is defined by the contacts between the risk levels \(h\) and \(l\), and the contacts between the ethnicity groups \(A\) and \(B\). Hence, let
\(\left(\begin{array}{cc}{p}_{hh}& {p}_{hl}\\ {p}_{lh}& {p}_{ll}\end{array}\right)\) be the relative contact matrix between the risk levels, where \({p}_{ij}\), \(i,j\in \{h,l\}\), is the relative number of contacts risk group \(i\) has with individuals in risk group \(j\). Further, let
\(\left(\begin{array}{cc}{p}_{AA}& {p}_{AB}\\ {p}_{BA}& {p}_{BB}\end{array}\right)\) be the relative contact matrix between ethnicity groups \(A\) and \(B\), where \({p}_{ij}\), \(i,j\in \{A,B\}\), is the relative number of contacts ethnicity group \(i\) has with individuals in ethnicity group \(j\). The relative contact matrix between the groups \({A}_{h}\), \({A}_{l}\), \({B}_{h}\), and \({B}_{l}\) is then defined by the outer product of these two matrices, such that \({p}_{{X}_{r}{Y}_{s}}={p}_{XY}\,{p}_{rs}\) for \(X,Y\in \{A,B\}\) and \(r,s\in \{h,l\}\).
We vary the amount of mixing within and between the ethnicity groups (assortativity). We define random mixing in ethnicity groups as a contact structure where every individual has the same probability of being in contact with any other person irrespective of ethnicity groups, while we denote an assortative contact structure as homogeneous mixing. We simulate with different levels of assortative mixing, defined by the relative number of contacts within to between the ethnicity groups. An assortativity of 1 then corresponds to random mixing with \({p}_{AA}={p}_{AB}={p}_{BA}={p}_{BB}=1/2\). An assortativity of 2 is defined as twice as many contacts within ethnicity groups, that is, \({p}_{AA}={p}_{BB}=2/3, {p}_{AB}={p}_{BA}=1/3\), where we divide by 3 for normalisation. More generally, an assortativity of \(x\) is defined by \(x\) times as many contacts within as between ethnicity groups, such that \({p}_{AA}={p}_{BB}=x/(x+1), {p}_{AB}={p}_{BA}=1/(x+1)\).
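The definition of assortativity and the combination of the two \(2\times 2\) matrices can be sketched in a few lines. The sketch implements the combination as a Kronecker product with group order \({A}_{h}\), \({A}_{l}\), \({B}_{h}\), \({B}_{l}\); this reading of "outer product" and the function names are assumptions made for illustration.

```python
def assortative_matrix(x):
    """Relative contact matrix with x times as many contacts within as between groups."""
    within = x / (x + 1)
    between = 1 / (x + 1)
    return [[within, between], [between, within]]


def kronecker(a, b):
    """Kronecker product of two matrices given as nested lists."""
    return [
        [a_ij * b_kl for a_ij in a_row for b_kl in b_row]
        for a_row in a
        for b_row in b
    ]


p_ethnicity = assortative_matrix(2)          # rows/columns ordered A, B
p_risk = [[0.5, 0.5], [0.5, 0.5]]            # rows/columns ordered h, l (case 3)
p_four = kronecker(p_ethnicity, p_risk)      # order: A_h, A_l, B_h, B_l
for row in p_four:
    print([round(value, 3) for value in row])
```

With an assortativity of 2 and random mixing between risk levels, each within-ethnicity entry is \(1/3\) and each between-ethnicity entry is \(1/6\).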
Cases
We simulate two different cases in the four-group setting, cases 3 and 4, with varying definitions of the high-risk individuals. We set the parameters in both cases such that \({R}_{0}=1.3\). We study the results at T = 200 days.
Case 3: In case 3, the high-risk individuals are defined by a higher susceptibility than the low-risk individuals, such that \({susc}_{{A}_{h}}={susc}_{{B}_{h}}=a\cdot {susc}_{{A}_{l}}=a\cdot {susc}_{{B}_{l}}\), with \(a\ge 1\). We assume \({p}_{ij}=1/2\) for all \(i,j\in \{h,l\}\) and vary \(a\). This definition of elevated risk could correspond to closer contact, genetic or biological differences, fewer hygienic precautions, or a mixture of those effects. We compute how much of the relative risk of ethnicity group \(B\) compared to \(A\) can be explained by a higher proportion of high-risk individuals (see Sect. Unexplained relative risk). For noncommunicable diseases, we would expect no unexplained relative risk. For \(a=2\) we fit a Poisson regression model on the outcome of the simulations, adjusting for risk level and ethnicity. We investigate how the confidence interval (CI) of the regression coefficient related to ethnicity varies with assortativity.
Case 4: In case 4, we assume \({susc}_{A_h}={susc}_{A_l}={susc}_{B_h}={susc}_{B_l}=1\), but we let the high-risk individuals have more contacts than the low-risk individuals. The additional contacts are with other high-risk individuals, such that \({p}_{hh}=d\cdot {p}_{hl}=d\cdot {p}_{lh}=d\cdot {p}_{ll}\), where \(d\ge 1\). We assume \({p}_{hl}={p}_{lh}={p}_{ll}=1\) and let \(d=3\). To account for the increased number of contacts among the high-risk individuals, we now use \({C}_{{A}_{h}}={C}_{{B}_{h}}=1.28, {C}_{{A}_{l}}={C}_{{B}_{l}}=1\). As for case 3, we fit a Poisson regression model to the simulation outcome and investigate the CI for the ethnicity regression coefficient when varying assortativity. This definition of elevated risk could be due to larger households, less adherence to social distancing advice, or an occupation that requires more contacts.
Transmission model
We simulate disease spread by a stochastic SIR model [1]. Let \({S}_{i}\), \({I}_{i},\) and \({R}_{i}\) be the number of susceptible (\(S\)), infectious (\(I\)), and recovered (\(R\)) individuals in group \(i\). The following set of difference equations describes the disease development over time: \({S}_{i}\left(t+\Delta t\right)= {S}_{i}\left(t\right)-{X}_{1},\) \({I}_{i}\left(t+\Delta t\right)= {I}_{i}\left(t\right)+{X}_{1}-{X}_{2},\) where \({X}_{1}\sim \mathrm{Binom}\left({S}_{i}\left(t\right),\Delta t\sum_{j}\frac{{susc}_{i}{p}_{ij} {inf}_{j}{I}_{j}}{N}\right),\) and \({X}_{2}\sim \mathrm{Binom}\left({I}_{i}\left(t\right), \Delta t\,\gamma \right),\) where \(1/\gamma =3\) days is the assumed duration of the infectious period, \(\Delta t=0.2\) is the timestep used in the simulations, and \(i=A, B\) in the two-group setting, and \(i={A}_{h}\), \({A}_{l}\), \({B}_{h}\), \({B}_{l}\) in the four-group setting. We assume a constant population size so that \({R}_{i}={N}_{i}-{S}_{i}-{I}_{i}\).
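One time step of such a chain-binomial model can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the argument `mixing` stands for whichever contact matrix enters the force of infection, the binomial draws are made by summing Bernoulli trials (simple but slow; a vectorised sampler from a numerical library would be the practical choice at \(N=100 000\)), and the population sizes and parameter values in the usage example are hypothetical.

```python
import random


def draw_binomial(rng, n, prob):
    """Draw from Binomial(n, prob) by summing Bernoulli trials (simple, not fast)."""
    return sum(1 for _ in range(n) if rng.random() < prob)


def sir_step(s, i, r, mixing, susc, inf, gamma, dt, rng):
    """Advance all groups by one time step of length dt; returns new (s, i, r)."""
    n_total = sum(s) + sum(i) + sum(r)
    groups = range(len(s))
    new_s, new_i, new_r = [], [], []
    for g in groups:
        # Force of infection on group g from all groups j.
        force = sum(susc[g] * mixing[g][j] * inf[j] * i[j] for j in groups) / n_total
        infections = draw_binomial(rng, s[g], min(1.0, dt * force))
        recoveries = draw_binomial(rng, i[g], min(1.0, dt * gamma))
        new_s.append(s[g] - infections)
        new_i.append(i[g] + infections - recoveries)
        new_r.append(r[g] + recoveries)
    return new_s, new_i, new_r


# Small hypothetical population so the pure-Python sampler stays fast.
rng = random.Random(1)
s, i, r = [890, 90], [10, 10], [0, 0]
mixing = [[1.0, 1.0], [1.0, 1.0]]
for _ in range(100):
    s, i, r = sir_step(s, i, r, mixing, susc=[0.5, 0.6], inf=[1.0, 1.0],
                       gamma=1 / 3, dt=0.2, rng=rng)
print(s, i, r)
```

Recoveries are drawn from the number infectious at the start of the step, so newly infected individuals cannot recover within the same step and all compartments stay non-negative.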
Analysis of simulation results
Poisson regression model
We fit a Poisson regression model on the outcome of the disease simulation, here exemplified in the four-group setting. We include ethnicity and risk level as covariates, resulting in the following regression model \(\mathrm{log}{\mu }_{i}=\mathrm{log}{N}_{i}+{\beta }_{0}+{\beta }_{e}{e}_{i}+{\beta }_{r}{r}_{i}\), where \({Y}_{i}\sim \mathrm{Poisson}\left({\mu }_{i}\right)\) denotes the number of infected individuals in group \(i\), \({e}_{i}\) and \({r}_{i}\) denote ethnicity and risk group status for group \(i\), respectively, and as before, \({N}_{i}\) are the population sizes of each group, which enter through the offset \(\mathrm{log}{N}_{i}\). We are interested in the estimated regression coefficient \({\beta }_{e}\), which quantifies the effect of ethnicity. We let ethnicity group \(B\) and \(l\) be the reference levels for ethnicity and risk level, respectively. We assume a standard significance level of 0.05.
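To make the structure of this model concrete, the sketch below fits a Poisson regression with a log population offset by Newton-Raphson, using only the standard library. In practice one would use a statistical package; the helper names `solve` and `fit_poisson`, the coding of the indicators, and the counts are all hypothetical and chosen so that the true ethnicity effect is zero.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b_k] for row, b_k in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(m[k][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for k in range(col + 1, n):
            factor = m[k][col] / m[col][col]
            for c in range(col, n + 1):
                m[k][c] -= factor * m[col][c]
    x = [0.0] * n
    for k in reversed(range(n)):
        x[k] = (m[k][n] - sum(m[k][c] * x[c] for c in range(k + 1, n))) / m[k][k]
    return x


def fit_poisson(x, y, offset, iterations=50):
    """Fit log(mu) = offset + x @ beta by Newton-Raphson; column 0 of x is the intercept."""
    n_par = len(x[0])
    beta = [0.0] * n_par
    beta[0] = math.log(sum(y) / sum(math.exp(o) for o in offset))  # intercept-only start
    for _ in range(iterations):
        mu = [math.exp(o + sum(b * v for b, v in zip(beta, row)))
              for o, row in zip(offset, x)]
        score = [sum(row[p] * (y_k - mu_k) for row, y_k, mu_k in zip(x, y, mu))
                 for p in range(n_par)]
        info = [[sum(row[p] * row[q] * mu_k for row, mu_k in zip(x, mu))
                 for q in range(n_par)] for p in range(n_par)]
        beta = [b + d for b, d in zip(beta, solve(info, score))]
    return beta


# Rows: A_h, A_l, B_h, B_l. Columns: intercept, e (1 = non-reference ethnicity),
# r (1 = high risk). Hypothetical counts: 20% infected if high risk, 10% otherwise.
sizes = [9_000, 81_000, 5_000, 5_000]
design = [[1, 1, 1], [1, 1, 0], [1, 0, 1], [1, 0, 0]]
counts = [1_800, 8_100, 1_000, 500]
beta_0, beta_e, beta_r = fit_poisson(design, counts, [math.log(n) for n in sizes])
print(math.exp(beta_e), math.exp(beta_r))
```

For these constructed counts the fitted rate ratio for ethnicity is 1 and for risk level is 2, which is the behaviour one would expect when observations are independent and risk depends only on individual characteristics.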
Unexplained relative risk
Since a higher proportion of ethnicity group \(B\) belongs to the high-risk group, we expect a larger infected proportion in ethnicity group \(B\). We denote the explained relative risk in ethnicity group \(B\) compared to \(A\) as the relative risk which can be explained by a higher proportion in the high individual risk group. This can be computed from the proportion of high-risk individuals in each ethnic group, together with the observed relative risk between the high- and low-risk groups. Specifically, let \({R}_{{A}_{h}}, {R}_{{A}_{l}}, {R}_{{B}_{h}},\) and \({R}_{{B}_{l}}\) be the total proportion of infected individuals in each group. The explained relative risk, \(ER\), is then given as \(ER=(0.5+0.5{RR}^{r})/(0.9+0.1{RR}^{r})\), where \({RR}^{r}=({R}_{{A}_{h}}+{R}_{{B}_{h}})/({R}_{{A}_{l}}+{R}_{{B}_{l}})\) is the observed relative risk between the high- and low-risk groups. The unexplained relative risk, \(UR\), in ethnic group \(B\) is then given by the observed relative risk in ethnic group \(B\), minus the explained relative risk, that is \(UR=({R}_{{B}_{h}}+{R}_{{B}_{l}})/({R}_{{A}_{h}}+{R}_{{A}_{l}})-ER\). For noncommunicable diseases, we would expect no unexplained relative risk.
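The calculation can be sketched as a small function. The sketch works with infected counts and divides by the relevant group sizes before any ratio is formed; this normalisation is an interpretation, since the formulas leave it implicit, and the function name and the example counts are hypothetical.

```python
def unexplained_relative_risk(infected, sizes):
    """Return (explained, unexplained) relative risk of ethnicity B versus A.

    infected and sizes are dicts keyed by "A_h", "A_l", "B_h", "B_l".
    Assumption: counts are divided by group sizes before any ratio is formed.
    """
    def proportion(groups):
        return sum(infected[g] for g in groups) / sum(sizes[g] for g in groups)

    # Observed relative risk between the high- and low-risk levels.
    rr_risk = proportion(["A_h", "B_h"]) / proportion(["A_l", "B_l"])
    share_high_a = sizes["A_h"] / (sizes["A_h"] + sizes["A_l"])
    share_high_b = sizes["B_h"] / (sizes["B_h"] + sizes["B_l"])
    explained = ((1 - share_high_b) + share_high_b * rr_risk) / (
        (1 - share_high_a) + share_high_a * rr_risk
    )
    observed = proportion(["B_h", "B_l"]) / proportion(["A_h", "A_l"])
    return explained, observed - explained


sizes = {"A_h": 9_000, "A_l": 81_000, "B_h": 5_000, "B_l": 5_000}
# Hypothetical counts with no ethnicity effect: 20% infected if high risk, 10% otherwise.
infected = {"A_h": 1_800, "A_l": 8_100, "B_h": 1_000, "B_l": 500}
explained, unexplained = unexplained_relative_risk(infected, sizes)
print(explained, unexplained)
```

With the group sizes of the four-group setting the shares of high-risk individuals are 0.1 and 0.5, so the expression for the explained relative risk reduces to the one given in the text, and the constructed counts give an unexplained relative risk of zero.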
Estimating from the data-generation model using approximate Bayesian computation
In addition to analysing the simulated data with regression models, we apply a simple Markov Chain Monte Carlo approximate Bayesian computation (ABC-MCMC) algorithm [26] to estimate the risk parameters in the groups from the simulations. We use these simulations to explore if we can obtain the correct parameters when the true data-generating model is accounted for. Details are provided in the supplementary material.
Results
Case 1
Figure 1 shows the relative risk obtained in the simulations when \({R}_{0}\) is varied and the individual risk in group \(B\) is 20% higher than the individual risk in group \(A\) for different simulation times T. We note that there is no simple relationship between the difference in individual risk and the overrepresentation at group level in the simulations; when varying \({R}_{0}\), the overrepresentation varies between 0 and 10. There is a large spread of relative risk values for each value of \({R}_{0}\) due to the stochastic nature of the transmission model. As \({R}_{0}\) becomes large, the overrepresentation becomes small, as most of the population is infected. For higher \({R}_{0}\), the overrepresentation is in general larger earlier in the outbreak (low T) than later. For lower \({R}_{0}\), the overrepresentation is in general larger later in the outbreak (high T) than early in the outbreak. The disease dynamics for this set of models are provided in the supplementary material, section Infection dynamics in cases 1 and 2.
Case 2
Figure 2a shows the discrepancy between the predictions from the regression model and the model simulations when we use the regression model fitted on data based on two groups with different susceptibilities to predict the total number of cases if we only had one group (low risk), when the difference in susceptibility is varied. We note that the regression model’s predictions significantly overestimate the proportion of infected when everyone belongs to the low-risk group. The problem increases when the relative susceptibility between the groups (\(a\)) increases. When there is no difference between the high- and low-risk groups, the point predictions from the regression model perform well in this simple two-group setting. However, the figure clearly shows that the prediction intervals from individual simulations do not adequately cover the spread in simulations, indicating that the Poisson regression underestimates the uncertainty. For noncommunicable diseases and other settings where Poisson regression is applicable, we would expect no discrepancy.
Figure 2b shows how the proportion of infected in the low-risk group depends on the properties of the high-risk group. The larger the susceptibility in the high-risk group, the larger the proportion infected in the low-risk group. We note that we need a large enough \(a\) to sustain an epidemic in the high-risk group. We also note a saturation effect in \(a\), such that above a certain level, the fraction infected in the low-risk group is almost constant when \(a\) increases. The proportion infected in the low-risk group increased from approximately 0 to almost 0.3 as the susceptibility of the high-risk group was increased. The disease dynamics for case 2 are provided in the supplementary material, section Infection dynamics in cases 1 and 2.
Note that we have chosen to illustrate the results for T = 200, but for other time points, the results would likely be different, as illustrated in Fig. 1.
Cases 3 and 4
Figure 3a shows how the unexplained relative risk for ethnicity group \(B\) increased when we increased the relative risk between the high- and low-risk groups. The effect is more prominent when the assortative mixing within ethnic groups is larger. For noncommunicable diseases, we expect no unexplained relative risk.
Figures 3b and c show the CI for the ethnicity regression coefficient when the proportion of contacts within the same ethnicity groups increases for cases 3 and 4, respectively. Since we know the truth in the simulation, we know that ethnicity does not affect the individual risk of infection. For random mixing, the regression analysis resulted in the truth: no effect of ethnicity (CIs centred at 1). However, the higher the tendency for contacts within the ethnicity groups, the higher the estimated effect of ethnicity. With a large enough degree of assortative mixing, we find an ethnicity coefficient significantly different from 1. Note that we are focussing on the significance of the coefficients and not on the effect sizes. As illustrated for case 1 (cf. Figure 1), the effect sizes are highly dependent upon which time point is used to compare the relative risks, as these cannot be interpreted as estimates of individual-level properties. This is the case for both cases 3 and 4. The effect is larger when we increase the susceptibility in the high-risk group than when we increase the number of contacts, but the results for the two different definitions of the high-risk group are qualitatively very similar.
In the supplementary material we show that we can apply the ABC-MCMC algorithm together with the simulation model to accurately estimate the transmission parameters. To accurately estimate some of the parameters, the true transmission model and the other parameters are needed. For example, one needs to assume a value, or a range of possible values, for the assortativity.
Validity of traditional methods
The results above show that, in general, one cannot use traditional statistical methods to estimate the individual-level effects from population-level infectious disease outcomes. Nevertheless, if we are interested in identifying factors that are associated with increased risk, such methods can be adequate and can give a good approximation when the dynamics and feedback characterising the spread of infectious diseases are not important.
To illustrate, we consider a deterministic version of the SIR model with two subgroups, with the same population sizes in the groups, and a relative difference in susceptibility by a factor \(a\) in group \(B\) compared to group \(A\). We further assume that we start with the same number infected in each group. In this example, the number of infectious individuals \({I}_{i}\) is given by the following differential equations:
Under random mixing, we find \({\lambda }_{i}=sus{c}_{i}I\), where \(I={I}_{A}+{I}_{B}\) is the total number of infected individuals in the population. During the early phase of an outbreak (\(S\approx N\)), we can approximate the number of new cases in \(\Delta t\) by:
where \(in{c}_{A}\) and \(in{c}_{B}\) are the incidences in groups \(A\) and \(B\), respectively. We then find \(RR=in{c}_{B}/in{c}_{A}=a.\) This approximation holds until \({S}_{i}/N\) becomes different in the two groups. The difference in susceptibility means that this ratio will change at different rates in the two groups. Therefore, after some time, the incidence ratio of observed cases will diverge from \(a\).
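The step from the differential equations to \(RR=a\) can be sketched as follows. The explicit form used here is an assumption: the standard deterministic SIR form with force of infection \({\lambda }_{i}\), chosen to be consistent with \({\lambda }_{i}={susc}_{i}I\) under random mixing and with the role of \({S}_{i}/N\) described in the text. Group sizes are equal and \({susc}_{B}=a\cdot {susc}_{A}\).

```latex
\frac{dI_i}{dt} = \lambda_i \frac{S_i}{N} - \gamma I_i,
\qquad \lambda_i = susc_i \, I \quad \text{(random mixing)}

inc_A \approx susc_A \, I \, \frac{S_A}{N} \, \Delta t,
\qquad
inc_B \approx a \, susc_A \, I \, \frac{S_B}{N} \, \Delta t

RR = \frac{inc_B}{inc_A} \approx a \, \frac{S_B}{S_A} \approx a
\quad \text{while } S_A \approx S_B
```

Under this form the common factor \(I\) cancels in the ratio, which is why the individual-level difference \(a\) is recovered only as long as the susceptible fractions in the two groups remain approximately equal.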
One of the most important settings where traditional statistical methods are used to estimate relative risks of infection is in randomised controlled trials to estimate vaccine effect [27]. In this setting traditional methods still work: even if mixing in the whole population is not random, it has been shown that as long as the two groups being compared have the same contact structure, we are in a regime similar to the one described above [28, 29]. Randomisation ensures that this requirement is fulfilled. As above, one might still get a biased estimate if a large fraction of the population is infected during the trial, such that \(S_i/N\) changes. Typically, these trials take place over a reasonably short time, so the traditional methods are still likely to be valid.
In the situation without random mixing, for example with twice as many contacts within as between groups, the main difference from the approximate relations above is that \(\lambda_i\) is no longer proportional to \(I\). Although we could recover the individual effect from the population effect at the very beginning, the relation breaks down as soon as \(I_A \ne I_B\), which occurs after the first generations of disease spread. Figure 4 shows the ratio of the daily incidence in the two groups, simulated using the stochastic model with \(a = 0.5\). We consider three contact structures: assortativity of 2, random mixing, and no contact between groups (\(p_{AA} = p_{BB} = 1\), \(p_{AB} = p_{BA} = 0\)). With random mixing, the incidence ratio is approximately 0.5 for about 25 days, while for assortative mixing, the incidence ratio drops almost immediately to about 0.4. The drop is caused by indirect protection in group \(B\), whose members have more contacts with individuals who are less likely to be infected. The setting with no contact between the groups shows an extreme effect of indirect protection: for this model \(R_0 < 1\) in group \(B\), and hence there is no large outbreak in that group. For the randomly mixing case, or when comparing two groups with the same contact patterns, it is possible to relate the population incidence ratio analytically to the individual-level difference in susceptibility [28, 30].
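The qualitative behaviour of the three contact structures can be reproduced with a small deterministic sketch. It is not the stochastic model behind Figure 4: the Euler integration, the function name, and the values of `beta`, `gamma`, `n` and `i0` are assumptions made for illustration, so the numbers will differ from those in the figure. Here `p_within` is the proportion of a group's contacts that fall within the group (0.5 for random mixing between two equal-sized groups, 2/3 for assortativity of 2, and 1 for no contact between groups).

```python
def incidence_ratio(p_within, a=0.5, beta=0.36, gamma=0.2,
                    n=100_000, i0=10, days=60, dt=0.05):
    """Daily incidence ratio inc_B / inc_A in a deterministic two-group SIR model.

    Group B is less susceptible by the factor a. Parameter values are
    illustrative assumptions, not those used in the paper.
    """
    susc = {"A": 1.0, "B": a}
    s = {"A": float(n - i0), "B": float(n - i0)}
    i = {"A": float(i0), "B": float(i0)}
    ratios = []
    for _ in range(days):
        new_cases = {"A": 0.0, "B": 0.0}
        for _ in range(round(1 / dt)):
            # Force of infection on each group, from infectious counts at step start.
            force = {}
            for g, other in (("A", "B"), ("B", "A")):
                exposure = p_within * i[g] + (1.0 - p_within) * i[other]
                force[g] = beta * susc[g] * exposure / n
            for g in ("A", "B"):
                infections = force[g] * s[g] * dt
                s[g] -= infections
                i[g] += infections - gamma * i[g] * dt
                new_cases[g] += infections
        ratios.append(new_cases["B"] / new_cases["A"])
    return ratios


random_mixing = incidence_ratio(0.5)
assortative = incidence_ratio(2 / 3)
no_contact = incidence_ratio(1.0)
for day in (0, 9, 29):
    print(day + 1, round(random_mixing[day], 3),
          round(assortative[day], 3), round(no_contact[day], 3))
```

Under random mixing the ratio stays close to \(a\) while few susceptibles have been depleted. With any assortativity it falls below \(a\) as soon as the infectious counts in the two groups diverge, even though the individual-level difference in susceptibility is unchanged.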
Discussion
Our simulations show that traditional regression methods are not, in general, valid for infectious diseases. We show that the results from applying such methods to data on infectious diseases to identify risk factors are not interpretable. One can even risk ending up with qualitatively wrong results by using them. The main reason is that traditional regression methods do not consider the dependency between the cases. For the risk of acquiring directly transmissible infections, the indirect effects are the most important: unless you are exposed to the infection, your individual risk profile is irrelevant.
The interdependence between the individuals in transmission chains leads to a complex relationship between increases in transmissibility and the total number infected that depends on many parameters that either need to be measured or assumed.
We summarise the main take-home messages related to the use of traditional statistical methods as follows:

1. There is no obvious relationship between increased individual-level risk of infection and the observed number of cases on population level. This relation must be interpreted through untangling the transmission dynamics.
2. The regression coefficients are not interpretable. Estimates from regression analysis on observed relative risk will not estimate the underlying parameters of the transmission process.
3. There is no one-to-one relationship between individual risk and population risk. Estimates like regression coefficients cannot be interpreted in the standard way as how the incidence would change if we removed or changed a risk factor.
4. Individual risks are not identifiable as they depend on the properties of the rest of the population. The proportion of infected in the low-risk group increased significantly by only changing the susceptibility of the high-risk group.
5. As regression coefficients are not interpretable, this also means that adjusting for confounders and mediators is not possible in traditional ways. We cannot assess how much of the overrepresentation can be explained by other, correlated covariates.
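The fourth point can be illustrated with a minimal deterministic sketch: a two-group SIR model with equal group sizes and random mixing, in which only the susceptibility of the high-risk group is changed. The function name and the values of `beta`, `gamma` and the initial infected fraction are assumptions for illustration and are not taken from the simulation study.

```python
def final_attack_rates(susc_high, beta=0.3, gamma=0.2, i0=1e-4, days=600, dt=0.1):
    """Final fraction infected per group in a two-group SIR model with
    equal group sizes and random mixing (illustrative parameter values)."""
    susc = {"low": 1.0, "high": susc_high}
    s = {g: 1.0 - i0 for g in susc}  # susceptible fraction of each group
    i = {g: i0 for g in susc}        # infectious fraction of each group
    for _ in range(round(days / dt)):
        # Population-wide prevalence seen by everyone under random mixing.
        prevalence = 0.5 * (i["low"] + i["high"])
        for g in susc:
            infections = beta * susc[g] * prevalence * s[g] * dt
            s[g] -= infections
            i[g] += infections - gamma * i[g] * dt
    return {g: 1.0 - s[g] for g in susc}


baseline = final_attack_rates(susc_high=1.0)
raised = final_attack_rates(susc_high=2.0)
print("low-risk attack rate, baseline:", round(baseline["low"], 3))
print("low-risk attack rate, high-risk susceptibility doubled:", round(raised["low"], 3))
```

The individual-level properties of the low-risk group are identical in both runs, yet its attack rate rises when the other group becomes more susceptible. The risk of a low-risk individual is therefore not defined without specifying the rest of the population.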
In a simple setting of two subgroups, one can conclude that if the relative observed risk between the two groups is different from one, then there is a difference between the two groups. However, it is not possible to use regression analysis that assumes independence between the observations to assess the size of the effect or whether the difference is statistically significant. In a more complex setting of more than two groups, we cannot conclude anything about the relative individual risks of infection between two groups, as there is potential confounding or mediation that we cannot adjust for, depending on the research question.
In our simulation experiment, we find a significant effect of ethnicity, even though the example was constructed such that the behaviour was identical in the two ethnicity groups. The only difference was the proportion of individuals belonging to a high-risk group, which we adjusted for in the regression analysis. The larger the tendency for mixing within ethnicity groups, the larger the estimated effect of ethnicity, while for random mixing, we correctly found no effect of ethnicity. It is thus clear that by applying a standard regression model, we have not properly adjusted for the mediation through the risk level variable and thus we have not been able to estimate the direct effect of ethnicity.
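The mechanism can be sketched with a deterministic four-group model (two ethnicity groups, each split into low-risk and high-risk individuals). This is not the stochastic simulation or the Poisson regression used in the study: as a simple stand-in for regression adjustment, the sketch computes a Mantel-Haenszel risk ratio for ethnicity, stratified by risk level. The group labels `E1` and `E2`, the group sizes, the high-risk shares, and the values of `beta`, `gamma` and `days` are hypothetical choices for illustration. Susceptibility depends only on risk level, and mixing is assortative only by ethnicity.

```python
def mh_ethnicity_rr(p_within, beta=0.25, gamma=0.2, days=40, dt=0.05):
    """Mantel-Haenszel risk ratio (E2 vs E1) of cumulative infection,
    stratified by risk level. All parameter values are illustrative."""
    n_eth = 100_000                         # size of each ethnicity group
    high_share = {"E1": 0.1, "E2": 0.5}     # proportion of high-risk individuals
    susc = {"low": 1.0, "high": 2.0}        # susceptibility depends on risk only
    size = {}
    for e, share in high_share.items():
        size[(e, "high")] = n_eth * share
        size[(e, "low")] = n_eth * (1.0 - share)
    s = {g: 1.0 - 1e-4 for g in size}       # susceptible fraction per subgroup
    i = {g: 1e-4 for g in size}             # infectious fraction per subgroup
    cum = {g: 0.0 for g in size}            # cumulative infected fraction
    for _ in range(round(days / dt)):
        # Prevalence within each ethnicity group at the start of the step.
        prev = {e: sum(size[(e, r)] * i[(e, r)] for r in susc) / n_eth
                for e in high_share}
        for (e, r) in size:
            other = "E2" if e == "E1" else "E1"
            force = beta * susc[r] * (p_within * prev[e]
                                      + (1.0 - p_within) * prev[other])
            infections = force * s[(e, r)] * dt
            s[(e, r)] -= infections
            i[(e, r)] += infections - gamma * i[(e, r)] * dt
            cum[(e, r)] += infections
    numerator = denominator = 0.0
    for r in susc:
        n1, n0 = size[("E2", r)], size[("E1", r)]
        cases1, cases0 = cum[("E2", r)] * n1, cum[("E1", r)] * n0
        numerator += cases1 * n0 / (n1 + n0)
        denominator += cases0 * n1 / (n1 + n0)
    return numerator / denominator


rr_random = mh_ethnicity_rr(p_within=0.5)
rr_assortative = mh_ethnicity_rr(p_within=0.8)
print("adjusted RR, random mixing:", round(rr_random, 3))
print("adjusted RR, assortative mixing:", round(rr_assortative, 3))
```

With random mixing, individuals of the same risk level face the same force of infection regardless of ethnicity, so the stratified ratio equals one. With assortative mixing, individuals in the group with more high-risk members are exposed to a higher prevalence, and the adjusted ratio departs from one although ethnicity has no direct effect in the model.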
Multiple studies conclude an effect of ethnicity/country of birth/immigration status on the risk of infection of COVID-19, even after adjusting for socioeconomic variables and other potential risk factors [3,4,5,6,7,8,9]. As illustrated, regression models cannot be used to adjust for these variables, even if the variables were perfect measures of what one wishes to control for. We therefore conclude that great caution should be applied when interpreting such results.
Though we have chosen to illustrate our points by the example of the overrepresentation of COVID-19 among certain ethnic groups, our results will be valid in general when analysing infectious disease case data with traditional regression methods. Another important example arises in analysing vaccine efficacy from observational studies. Regression models have been used in observational studies to evaluate vaccine efficacy during COVID-19 [31,32,33]. As we have illustrated in this paper, it is generally not feasible to estimate the individual vaccine efficacy from aggregated observed counts of infection and compare infection risks. It is also not possible to adjust for confounding effects like age, a well-documented strong confounder with vaccination due to age-prioritised COVID-19 vaccination strategies. Age mixing is also known to be assortative, as seen for example from the POLYMOD study [34]. The effects of assortativity on bias in observational vaccination studies were also studied by simulation in [35]. However, randomised controlled trials on vaccine efficacy do not suffer from the same problem, as the independence assumption is more reasonable in a randomised controlled trial. Hence, there are settings where traditional techniques are applicable. If traditional techniques are to be used, however, they should be accompanied by an argument about why the situation under study falls into such a regime.
Methods for causal inference under different dependence structures have recently been developed, and these could be used to estimate effects of interventions like vaccination on infectious diseases [16, 17, 36, 37]. However, these methods typically require either specific experimental designs such that one can assume partial interference, that is, that the population can be divided into independent groups, or knowledge of the underlying social network.
In this study, we have only investigated the effect of violation of the independence assumption. A more general problem with many of these studies is the lack of a clear definition of risk factors, and an unclear specification of the research question. Whether a factor is found to be a risk factor or not for a specific condition depends on how both the research question and the risk factor are defined [38]. For example, risk factors for becoming infected in the future may differ from risk factors for having been infected, as certain groups may have gained high immunity levels. To draw conclusions about the (causal) meaning of a risk factor from an analysis, a proper understanding of the interplay between the different factors is necessary, for example through a directed acyclic graph of the problem. This is necessary to avoid so-called Table 2 fallacies, where effects in multiple regression analysis can be misinterpreted, as discussed in [39]. Another general problem is measurement error and confounding [40], as many variables like socioeconomic status are hard to measure, and there is strong correlation between many covariates. The conclusions may also depend on the model specification, for example through the response distribution and the assumed shape of the covariate effects. Moreover, in this study we have focussed on estimating the direct effect of an exposure on the disease outcome. One could also be interested in other parameters, like the total or indirect effect of an exposure [15,16,17]. This can for example be of key interest when analysing the effect of an intervention like vaccines, where one might be interested both in the direct protective effect of the vaccine on the individual and in the indirect effect of protection through, for example, herd immunity. In an infectious disease modelling framework these indirect effects can be estimated from the individual-level direct effects.
This study only considered how the overrepresentation might vary when the number of contacts and susceptibility differ between groups, assuming different underlying contact structures. Other factors which could affect overrepresentation are different importation rates, infectivity, and duration of infectious period. It is straightforward to extend our framework also to consider the effect of these three factors.
The population structure assumed in this study is overly simplified, and random mixing is a very strong assumption that does not reflect a realistic contact structure well. Similarly, we assume little heterogeneity between individuals within the same subgroup (e.g. the same duration of infectiousness, susceptibility, infectivity, and contact pattern). Our aim has thus not been to perform an exhaustive overview of parameters and how they may affect the results. The model is constructed primarily for a theoretical, academic purpose with focus on parsimony to illustrate a point. Our model is thus not meant to be used to draw conclusions about causes of the observed overrepresentation of COVID-19 cases among ethnic minorities.
The fact that regression methods that do not consider transmission do not apply to infectious diseases is neither surprising nor novel. The bias of traditional measures like risk ratios and odds ratios was demonstrated in a study from 1991 [41] inspired by the AIDS epidemic, among several other studies [42,43,44]. However, recent statistical analyses of COVID-19 case data show the necessity of a reminder. Moreover, to our knowledge, our study is the first to consider this in a regression setting.
Different methods suitable for inference on infectious diseases have been proposed [45,46,47,48]. The problem with many of these methods is that they are hard to use and may require detailed data about the contact pattern of the population. Such data are rarely available, particularly for diseases which do not necessarily require direct physical contact but may spread through the environment through aerosols and droplets (e.g. COVID-19). In this paper, we show that if one can specify an assumed data-generating model, it could be possible to estimate some of the parameters of interest. Such methods require a lot of data, especially about contact structures, and must be tailored to each study context. For intervenable factors such as vaccines, experimental designs have been suggested which would allow the use of classical statistical methods to analyse the effects of the factors, as mentioned above. However, for immutable properties (like ethnicity), we believe that it is better to use purely descriptive analyses than to apply regression methods where the coefficients are not interpretable. In particular, one should not draw conclusions or interpret the effects from such studies. Hence, there is a need for popularised, easy-to-use methods applicable to inference on infectious diseases, which can be applied to the data typically available. There is also a need for continuous surveys to collect data on social contacts and behaviour. In recent years, and particularly during COVID-19, new, alternative data streams like mobile phone data have been used to inform models, enabling detailed real-time information on behaviour.
In this article we only consider using regression models to identify and measure risk factors for infection. There are many other applications of regression models in infectious diseases that are not affected by the problems discussed here, including using regression to estimate the growth rate of cases and for anomaly detection in surveillance.
Conclusions
We conclude that using standard methods like Poisson regression models to study overrepresentation of different groups does not make sense for infectious diseases. If the methods developed for non-communicable diseases are used to analyse infectious diseases, one can risk ending up with the wrong qualitative conclusions.
Availability of data and materials
All the source code used in the study is publicly available at: https://github.com/Gulfa/pitfalls. All analyses were performed in R [49] using the odin.dust R package [50]. Data sharing is not applicable to this article as no datasets were generated or analysed during the current study.
Abbreviations
SIR: Susceptible-infectious-recovered
CI: Confidence interval
ABC-MCMC: Approximate Bayesian computation Markov chain Monte Carlo
References
Keeling MJ, Rohani P. Modeling infectious diseases in humans and animals. Princeton, NJ: Princeton University Press; 2008. p. 16–26.
Barthélemy M, Barrat A, Pastor-Satorras R, Vespignani A. Dynamical patterns of epidemic outbreaks in complex heterogeneous networks. J Theor Biol. 2005;235(2):275–88. https://doi.org/10.1016/j.jtbi.2005.01.011.
Mathur R, Rentsch CT, Morton CE, Hulme WJ, Schultze A, MacKenna B, et al. Ethnic differences in SARS-CoV-2 infection and COVID-19-related hospitalisation, intensive care unit admission, and death in 17 million adults in England: an observational cohort study using the OpenSAFELY platform. Lancet. 2021;397:1711–24. https://doi.org/10.1016/S0140-6736(21)00634-6.
Rodriguez-Diaz CR, Guilamo-Ramos V, Mena L, Hall E, Honermann B, Crowley JS, et al. Risk for COVID-19 infection and death among Latinos in the United States: Examining heterogeneity in transmission dynamics. Ann Epidemiol. 2020;52:46–53. https://doi.org/10.1016/j.annepidem.2020.07.007.
Sundaram ME, Calzavara A, Mishra S, Kustra R, Chan AK, Hamilton AM, et al. Individual and social determinants of SARS-CoV-2 testing and positivity in Ontario, Canada: a population-wide study. CMAJ. 2021;193(20):E723–4. https://doi.org/10.1503/cmaj.202608.
Indseth T, Elgersma IH, Strand BH, Telle K, Labberton AS, Arnesen T, et al. Covid-19 blant personer født utenfor Norge, justert for yrke, trangboddhet, medisinsk risikogruppe, utdanning og inntekt [Covid-19 among persons born outside Norway, adjusted for occupation, household crowding, medical risk group, education and income]. Report, Norwegian Institute of Public Health, Norway, April 2021.
Millett GA, Jones AT, Benkeser D, Baral S, Mercer L, Beyrer C, et al. Assessing differential impacts of COVID-19 on black communities. Ann Epidemiol. 2020;47:37–44. https://doi.org/10.1016/j.annepidem.2020.05.003.
Rostila M, Cederström A, Wallace M, Brandén M, Malmberg B, Andersson G. Disparities in Coronavirus disease 2019 mortality by country of birth in Stockholm, Sweden: A total population-based cohort study. Am J Epidemiol. 2021;kwab057. https://doi.org/10.1093/aje/kwab057.
Drefahl S, Wallace M, Mussino E, Aradhya S, Kolk M, Brandén M, et al. A population-based cohort study of sociodemographic risk factors for COVID-19 deaths in Sweden. Nat Commun. 2020;11(5097). https://doi.org/10.1038/s41467-020-18926-3.
Seligman B, Ferranna M, Bloom DE. Social determinants of mortality from COVID-19: a simulation study using NHANES. PLoS Med. 2021;18(1): e1003490. https://doi.org/10.1371/journal.pmed.1003490.
Zhang M. Estimation of differential occupational risk of COVID-19 by comparing risk factors with case data by occupational group. Am J Ind Med. 2021;64(1):39–47. https://doi.org/10.1002/ajim.23199.
Organisation for Economic Co-operation and Development. What is the impact of the COVID-19 Pandemic on Immigrants and Their Children? Report, OECD Publishing, October 2020. https://www.oecd.org/coronavirus/policy-responses/what-is-the-impact-of-the-covid-19-pandemic-on-immigrants-and-their-children-e7cbb7de/.
Hooper MW, Nápoles AM, Pérez-Stable EJ. COVID-19 and racial/ethnic disparities. JAMA. 2020;323(24):2466–7. https://doi.org/10.1001/jama.2020.8598.
Johnson KM, Alarcón J, Watts DM, Rodriguez C, Velasquez C, Sanchez J, et al. Sexual networks of pregnant women with and without HIV infection. AIDS. 2003;17(4):605–12. https://doi.org/10.1097/00002030-200303070-00016.
Halloran ME, Haber M, Longini IM Jr, Struchiner CJ. Direct and indirect effects in vaccine efficacy and effectiveness. Am J Epidemiol. 1991;133(4):323–31. https://doi.org/10.1093/oxfordjournals.aje.a115884.
Halloran ME, Hudgens MG. Dependent happenings: a recent methodological review. Curr Epidemiol Rep. 2016;3(4):297–305. https://doi.org/10.1007/s40471-016-0086-4.
Hudgens MG, Halloran ME. Toward causal inference with interference. J Am Stat Assoc. 2008;103(482):832–42. https://doi.org/10.1198/016214508000000292.
McPherson M, Smith-Lovin L, Cook JM. Birds of a feather: homophily in social networks. Annu Rev Sociol. 2001;27(1):415–44. https://doi.org/10.1146/annurev.soc.27.1.415.
Christakis NA, Fowler JH. The spread of obesity in a large social network over 32 years. N Engl J Med. 2007;357(4):370–9. https://doi.org/10.1056/nejmsa066082.
Christakis NA, Fowler JH. The collective dynamics of smoking in a large social network. N Engl J Med. 2008;358(21):2249–58. https://doi.org/10.1056/NEJMsa0706154.
Newman ME. Assortative mixing in networks. Phys Rev Lett. 2002;89(20): 208701. https://doi.org/10.1103/PhysRevLett.89.208701.
Bollen J, Gonçalves B, Ruan G, Mao H. Happiness is assortative in online social networks. Artif Life. 2011;17(3):237–51. https://doi.org/10.1162/artl_a_00034.
Salathé M, Vu DQ, Khandelwal S, Hunter DR. The dynamics of health behavior sentiments on a large online social network. EPJ Data Sci. 2013;2(1):4. https://doi.org/10.1140/epjds16.
McMillan C. Tied Together: Adolescent Friendship networks, Immigrant Status, and Health Outcomes. Demography. 2019;56(39):1075–103. https://doi.org/10.1007/s13524-019-00770-w.
Barstad A, Molstad CS. Integrering av innvandrere i Norge. Statistics Norway, Norway: Report; 2020.
Marjoram P, Molitor J, Plagnol V, Tavaré S. Markov chain Monte Carlo without likelihoods. PNAS. 2003;100(26):15324–8. https://doi.org/10.1073/pnas.0306899100.
Polack FP, Thomas SJ, Kitchin N, Absalon J, Gurtman A, Lockhart S, et al. Safety and efficacy of the BNT162b2 mRNA Covid-19 vaccine. N Engl J Med. 2020;383:2603–14. https://doi.org/10.1056/NEJMoa2034577.
Haber M, Halloran MR, Longini IM Jr, Watelet L. Estimation of vaccine efficacy in nonrandomly mixing populations. Biom J. 1995;37(1):25–38. https://doi.org/10.1002/bimj.4710370103.
Sävje F, Aronow PM, Hudgens MG. Average treatment effects in the presence of unknown interference. Ann Stat. 2021;49(2):673–701. https://doi.org/10.1214/20-AOS1973.
Haber M, Longini IM Jr, Halloran ME. Measures of the effects of vaccination in a randomly mixing population. Int J Epidemiol. 1991;20(1):300–10. https://doi.org/10.1093/ije/20.1.300.
Starrfelt J, Danielsen AS, Kacelnik O, Børseth AW, Seppälä E, Meijerink H. High vaccine effectiveness against COVID-19 infection and severe disease among residents and staff of long-term care facilities in Norway, November–June 2021. Preprint at medRxiv. 2021. https://doi.org/10.1101/2021.08.08.21261357.
Emborg HD, Valentiner-Branth P, Schelde AB, Nielsen KF, Gram MA, Moustsen-Helms IR, et al. Vaccine effectiveness of the BNT162b2 mRNA COVID-19 vaccine against RT-PCR confirmed SARS-CoV-2 infections, hospitalisations and mortality in prioritised risk groups. Preprint at medRxiv. 2021. https://doi.org/10.1101/2021.05.27.21257583.
Seppälä E, Veneti L, Starrfelt J, Danielsen AS, Bragstad K, Hungnes O, et al. Vaccine effectiveness against infection with the Delta (B.1.617.2) variant, Norway, April to August 2021. Euro Surveill. 2021;26(35):2100793. https://doi.org/10.2807/1560-7917.ES.2021.26.35.2100793.
Mossong J, Hens N, Jit M, Beutels P, Auranen K, Mikolajczyk R, Massari M, et al. Social contacts and mixing patterns relevant to the spread of infectious diseases. PLoS Med. 2008;5(3):e74. https://doi.org/10.1371/journal.pmed.0050074.
Zivich PN, Volfovsky A, Moody J, Aiello AE. Assortativity and Bias in Epidemiologic Studies of Contagious Outcomes: A simulated Example in the Context of Vaccination. Am J Epidemiol 2021;kwab167. https://doi.org/10.1093/aje/kwab167.
Tchetgen Tchetgen EJ, Fulcher IR, Shpitser I. Auto-G-computation of causal effects on a network. J Am Stat Assoc. 2021;116(534):833–44. https://doi.org/10.1080/01621459.2020.1811098.
Ogburn EL, Sofrygin O, Diaz I, Van der Laan MJ. Causal inference for social network data. arXiv preprint 2017;arXiv:1705.08527v5.
Huitfeldt A. Is caviar a risk factor for being a millionaire? BMJ. 2016;355: i6536. https://doi.org/10.1136/bmj.i6536.
Westreich D, Greenland S. The table 2 fallacy: presenting and interpreting confounder and modifier coefficients. Am J Epidemiol. 2013;177(4):292–8. https://doi.org/10.1093/aje/kws412.
Phillips AN, Smith GD. How independent are “independent” effects? Relative risk estimation when correlated exposures are measured imprecisely. J Clin Epidemiol. 1991;44(11):1223–31. https://doi.org/10.1016/0895-4356(91)90155-3.
Koopman JS, Longini IM, Jacquez JA, Simon CP, Ostrow DG, Martin WR, et al. Assessing risk factors for transmission of infection. Am J Epidemiol. 1991;133(12):1199–209. https://doi.org/10.1093/oxfordjournals.aje.a115832.
Morozova O, Cohen T, Crawford FW. Risk ratios for contagious outcomes. J R Soc Interface. 2018;15:20170696. https://doi.org/10.1098/rsif.2017.0696.
O’Hagan JJ, Lipsitch M, Hernán MA. Estimating the per-exposure effect of infectious disease interventions. Epidemiology. 2014;25(1):134–8. https://doi.org/10.1097/EDE.0000000000000003.
Pitzer VE, Basta NE. Linking data and models: the importance of statistical analyses to inform models for the transmission dynamics of infections. Epidemiology. 2012;23(4):520–2. https://doi.org/10.1097/EDE.0b013e31825902ab.
Cai X, Loh WW, Crawford FW. Identification of causal intervention effects under contagion. J Causal Inference. 2021;9(1):9–38. https://doi.org/10.1515/jci-2019-0033.
Kenah E. Semiparametric relative-risk regression for infectious disease transmission data. J Am Stat Assoc. 2015;110(509):313–25. https://doi.org/10.1080/01621459.2014.896807.
Rampey AH Jr, Longini IM Jr, Haber M, Monto AS. A discrete-time model for the statistical analysis of infectious disease incidence data. Biometrics. 1992;48(1):117–28. https://doi.org/10.2307/2532743.
Haber M, Longini IM Jr, Cotsonis GA. Models for the statistical analysis of infectious disease data. Biometrics. 1988;44(1):163–73. https://doi.org/10.2307/2531904.
R Core Team 2020. R: A Language and Environment for Statistical Computing. Version 4.0.2. Vienna, Austria: R Foundation for Statistical Computing, 2020.
FitzJohn R, Lees J. odin.dust: Compile Odin to Dust. R package version 0.2.7; 2021. https://github.com/mrcide/odin.dust
Acknowledgements
Not applicable.
Funding
This work was supported by the Norwegian Research Council project number 312721. SE acknowledges partial support from the Norwegian Research Council centre Big Insight project 237718.
Author information
Contributions
S.E. participated in conceptualisation, formal analysis, methodology, visualisation, writing the original draft and reviewing and editing the manuscript. G.R. participated in conceptualisation, formal analysis, methodology, visualisation, writing the original draft and reviewing and editing the manuscript. B.F.dB. participated in conceptualisation, methodology, writing the original draft and reviewing and editing the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Engebretsen, S., Rø, G. & de Blasio, B.F. A compelling demonstration of why traditional statistical regression models cannot be used to identify risk factors from case data on infectious diseases: a simulation study. BMC Med Res Methodol 22, 146 (2022). https://doi.org/10.1186/s12874-022-01565-1
DOI: https://doi.org/10.1186/s12874-022-01565-1
Keywords
 Relative risk
 Communicable diseases
 Infectious diseases
 Regression models
 Overrepresentation