Framework
We consider analyses where the goal is to use observed counts or prevalence of infection in various groups to learn about the underlying parameters of disease transmission. In this paper we analyse a simplified SIRmodel where we divide the population into first two and then extend to four subgroups. The twogroup setting is the most parsimonious setting considered and is used to illustrate the relationship between individual and grouplevel risks, whether one can use a fitted regression model for contrafactual predictions to generalise to other group sizes, and the dependence between the incidence ratio in the two groups. The fourgroup setting is motivated by the studies of ethnicity and COVID19 and is used to study the estimated regression coefficients from observational data.
We assume that we have data on the number of infected and total population sizes for each subgroup (individuallevel data will give similar results) at timepoint t. In this simplified setup, either four or sixteen key parameters control the transmission dynamics between the two or four subgroups, respectively. These parameters, \({\beta }_{ij}\), typically derived from contact studies, are summarised in a WhoAcquiresInfectionfromWhom matrix as usual in infectious disease modelling, where \({\beta }_{ij}\) is the rate of transmission from an individual in group \(j\) to an individual in group \(i\). We assume here that all groups have the same duration of infectiousness.
An important distinction is that “highrisk” can in this setup be due to at least three different factors that alone or in combination can explain an overrepresentation of cases in one of the subgroups. We can decompose each \({\beta }_{ij}\) as \({\beta }_{ij}={inf}_{j}\times {c}_{ij}\times {susc}_{i}\), where \({inf}_{j}\) is the infectivity of group \(j\), \({c}_{ij}\) is the number of contacts group \(i\) has with group \(j\) per time unit, and \(sus{c}_{i}\) is the susceptibility of group \(i\). Hence, we in general expect higher incidence in a group with higher susceptibility, or if the group has overall more contacts. We also expect higher incidence in a group with higher infectivity, if there are more contacts within than between groups. These three different potential causes of increased infection prevalence may have different effects on the overall disease dynamics in the population. In this study, we assume for simplicity that \({inf}_{j}=1\) in all simulations.
We define \({c}_{ij}\) as the total contact rate between groups \(i\) and \(j\), hence depending on the population sizes in the groups. It is easier to specify our scenarios in terms of \({p}_{ij}\), which is the relative frequency of contacts between individuals in groups \(i\) and \(j\). We specify a situation in which individuals have twice as many contacts within their group as between groups by \({p}_{ij}=\left(\begin{array}{cc}2/3& 1/3\\ 1/3& 2/3\end{array}\right).\)
This matrix is related to \({c}_{ij}\) as follows:
$${c}_{ij}=\frac{{p}_{ij}}{{w}_{i}},{w}_{i}=\frac{{\sum }_{j}{p}_{ij}{N}_{j}}{N{C}_{i}},$$
where \({C}_{i}\) is the total relative contact rate for group \(i\). If all the groups have the same number of contacts, we have \({C}_{i}=1\). Hence, the \({c}_{ij}\) are calculated from the \({p}_{ij}\) in a way that ensures that the total number of contacts per time unit for everyone in group \(i\) is given by \({C}_{i}\). In all settings except the last one (defined as case 4), we use \({C}_{i}=1\), such that all groups have the same total contact rate. Hence, when \({C}_{i}=1\), the \({c}_{ij}\) are calculated from the \({p}_{ij}\), ensuring that all groups have the same total number of contacts in the population per time unit.
In all simulations, we start with 0.1% infected in each group as our initial conditions.
Simulated population
Two subgroups
We simulate data in a population with \(N=100 000\) citizens. We first split the population into two subgroups \(\mathrm{A}\) and \(\mathrm{B}\), and we assume a higher susceptibility in group \(\mathrm{B}\) than in group \(\mathrm{A}\), such that \(sus{c}_{B}=a\cdot {susc}_{A}\), where \(a\ge 1\). We let \({N}_{A}=90 000\) and \({N}_{B}=10 000\) be the number of individuals in groups \(\mathrm{A}\) and \(\mathrm{B}\), respectively. The relative contact matrix is defined by
\(\left(\begin{array}{cc}{p}_{AA}& {p}_{AB}\\ {p}_{BA}& {p}_{BB}\end{array}\right),\) where \({p}_{ij}\), \(i,j\in A,B\) is the relative number of contacts group \(i\) has with individuals in group \(j\).
As an example of a contact structure in this population, let \({C}_{A}={C}_{B}=1, {p}_{AB}={p}_{BA}=1/3,{ p}_{AA}={p}_{BB}=2/3\). This corresponds to a setting where all individuals have the same total number of contacts, but twice as many contacts within their own subgroup. Plugging in the quantities we get \({{c}_{AA}=1.05, c}_{AB}=0.53, {c}_{BA}=0.91,{c}_{BB}=1.82.\)
Cases
We simulate two cases in the twogroup setting, case 1 and case 2.
Case 1: We let \(a=1.2\). We vary the basic reproduction number \({R}_{0}\) (or, equivalently, \({susc}_{A}\)) and the timepoint for which we compare the outcome (T). We assume no contact between the groups, so \(\left(\begin{array}{cc}{p}_{AA}& {p}_{AB}\\ {p}_{BA}& {p}_{BB}\end{array}\right)=\left(\begin{array}{cc}1& 0\\ 0& 1\end{array}\right)\). We compute the incidence rate ratio (hereafter denoted as relative risk) resulting from the simulations, that is, the proportion infected in subgroup \(B\) divided by the proportion infected in subgroup \(A\). For noncommunicable diseases, one would expect a onetoone relationship between individual risk of disease and relative risk, i.e. a relative risk of 1.2.
Case 2: We set \({susc}_{A}\) such that for \(a=1\), \({R}_{0}=0.9\), and then vary \(a\). See supplementary material for the expression for \({R}_{0}\). We assume random mixing between the groups, so \(\left(\begin{array}{cc}{p}_{AA}& {p}_{AB}\\ {p}_{BA}& {p}_{BB}\end{array}\right)=\left(\begin{array}{cc}1/2& 1/2\\ 1/2& 1/2\end{array}\right)\). We study the results for time point T = 200 days, which for most parameter choices in the paper corresponds to after the epidemic has burnt out. We compare two different outcomes:

A)
We investigate whether the predictions of a Poisson regression model (see Sect. Poisson regression model) fitted to simulated data on the number infected in each group from a setting with a difference between the two groups can be used to predict the total number of infections in the setting with no difference between the groups. Specifically, the fitted regression model is used to predict the proportion infected in the contrafactual scenario where the entire population belongs to the lowrisk group \(A\), that is, \({N}_{A}=100 000, {N}_{B}=0\). We compare this prediction with the simulations when the whole population is in \(A\). For noncommunicable diseases, one would expect no discrepancy between the simulations and the predictions from the fitted model.

B)
We plot the proportion infected in the lowrisk group when we vary the susceptibility in the highrisk group. If individual risks of infection only depended on individual characteristics, one would expect the proportion infected in the lowrisk group to be independent of the properties of the highrisk group.
Four subgroups
We split the population into four groups, \({A}_{h}\), \({A}_{l}\), \({B}_{h}\), and \({B}_{l}\), inspired by the recent analyses of ethnicity and risk of COVID19 infection. We assume two ethnicity groups \(A\) and \(B\), and that one ethnicity group (\(B\)) is disproportionately represented in a highrisk group. Hence, the two ethnicity groups are divided into two risk groups, with one risk group (\(h\)) having a higher risk than the other (\(l\)). This could for example represent a highrisk occupation. As before, we let \({N}_{A}=90 000\) individuals, and \({N}_{B}=10 000\) individuals. We further assume that 10% and 50% of the individuals in ethnicity groups \(A\) and \(B\) are in the highrisk group, respectively. Hence, we let group \({A}_{h}\) be the \({N}_{{A}_{h}}=9 000\) individuals in ethnicity group \(A\) with the high individual risk, group \({A}_{l}\) be the \({N}_{{A}_{l}}=81 000\) individuals in ethnicity group \(A\) with low individual risk, group \({B}_{h}\) be the \({N}_{{B}_{h}}=5 000\) individuals in ethnicity group \(B\) with high individual risk, and finally \({B}_{l}\) be the \({N}_{{B}_{l}}=5 000\) individuals in ethnicity group \(B\) with low individual risk.
The contact matrix is defined by the contacts between the risk levels \(h\) and \(l\), and the contacts between the ethnicity groups \(A\) and \(B\). Hence, let
\(\left(\begin{array}{cc}{p}_{hh}& {p}_{hl}\\ {p}_{lh}& {p}_{ll}\end{array}\right)\) be the relative contact matrix between the risk levels, where \({p}_{ij}\), \(i,j\in h,l\) is the relative number of contacts risk group \(i\) has with individuals in risk group \(j\). Further, let
\(\left(\begin{array}{cc}{p}_{AA}& {p}_{AB}\\ {p}_{BA}& {p}_{BB}\end{array}\right)\) be the relative contact matrix between ethnicity groups \(A\) and \(B\), where \({p}_{ij}\), \(i,j\in A,B\) is the relative number of contacts ethnicity group \(i\) has with individuals in ethnicity group \(j\). The relative contact matrix between the groups \({A}_{h}\), \({A}_{l}\), \({B}_{h}\), and \({B}_{l}\) is then defined by the outer product of these two matrices, such that
$$\left(\begin{array}{cc}\begin{array}{cc}{p}_{{A}_{h}{A}_{h}}& {p}_{{A}_{h}{A}_{l}}\\ {p}_{{A}_{l}{A}_{h}}& {p}_{{A}_{l}{A}_{l}}\end{array}& \begin{array}{cc}{p}_{{A}_{h}{B}_{h}}& {p}_{{{A}_{h}B}_{l}}\\ {p}_{{A}_{l}{B}_{h}}& {p}_{{A}_{l}{B}_{l}}\end{array}\\ \begin{array}{cc}{p}_{{B}_{h}{A}_{h}}& {p}_{{B}_{h}{A}_{l}}\\ {p}_{{B}_{l}{A}_{h}}& {p}_{{B}_{l}{A}_{l}}\end{array}& \begin{array}{cc}{p}_{{B}_{h}{B}_{h}}& {p}_{{B}_{h}{B}_{l}}\\ {p}_{{B}_{l}{B}_{h}}& {p}_{{B}_{l}{B}_{l}}\end{array}\end{array}\right)=\left(\begin{array}{cc}\begin{array}{cc}{p}_{hh}{p}_{AA}& {p}_{hl}{p}_{AA}\\ {p}_{lh}{p}_{AA}& {p}_{ll}{p}_{AA}\end{array}& \begin{array}{cc}{p}_{hh}{p}_{AB}& {p}_{hl}{p}_{AB}\\ {p}_{lh}{p}_{AB}& {p}_{ll}{p}_{AB}\end{array}\\ \begin{array}{cc}{p}_{hh}{p}_{BA}& {p}_{hl}{p}_{BA}\\ {p}_{lh}{p}_{BA}& {p}_{ll}{p}_{BA}\end{array}& \begin{array}{cc}{p}_{hh}{p}_{BB}& {p}_{hl}{p}_{BB}\\ {p}_{lh}{p}_{BB}& {p}_{ll}{p}_{BB}\end{array}\end{array}\right).$$
We vary the amount of mixing within and between the ethnicity groups (assortativity). We define random mixing in ethnicity groups as a contact structure where every individual has the same probability of being in contact with any other person irrespective of ethnicity groups, while we denote an assortative contact structure as homogeneous mixing. We simulate with different levels of assortative mixing, defined by the relative number of contacts within to between the ethnicity groups. An assortativity of 1 then corresponds to random mixing with \({p}_{AA}={p}_{AB}={p}_{BA}={p}_{BB}=1/2\). An assortativity of 2 is defined as twice as many contacts within ethnicity groups, that is, \({p}_{AA}={p}_{BB}=2/3, {p}_{AB}={p}_{BA}=1/3\), where we divide by 3 for normalisation. More generally, an assortativity of \(x\) is defined by \(x\) times as many contacts within as between ethnicity groups, such that \({p}_{AA}={p}_{BB}=x/(x+1), {p}_{AB}={p}_{BA}=1/(x+1)\).
Cases
We simulate two different cases in the fourgroup setting, cases 3 and 4, with varying definitions of the highrisk individuals. We set the parameters in both cases such that \({R}_{0}=1.3\). We study the results at T = 200 days.
Case 3: In case 3, the highrisk individuals are defined by a higher susceptibility than the lowrisk individuals, in such manner that \({susc}_{{A}_{h}}={susc}_{{B}_{h}}=a\cdot sus{c}_{{A}_{l}}={a\cdot susc}_{{B}_{l}}\), with \(a\ge 1\). We assume \({p}_{ij}=1/2\) for all \(i,j\in h,l\) and vary \(a\). This definition of elevated risk could correspond to closer contact, genetic or biological differences, fewer hygienic precautions, or a mixture of those effects. We compute how much of the relative risk of ethnicity group \(B\) compared to \(A\) can be explained by a higher proportion of highrisk individuals (see Sect. Unexplained relative risk). For noncommunicable diseases, we would expect no unexplained relative risk. For \(a=2\) we fit a Poisson regression model on the outcome of the simulations, adjusting for risk level and ethnicity. We investigate how the confidence interval (CI) of the regression coefficient related to ethnicity varies with assortativity.
Case 4: In case 4, we assume \({susc}_{A_h}={susc}_{A_l}={susc}_{B_h}={susc}_{B_l}=1\), but we let the highrisk individuals have more contacts than the lowrisk individuals. The additional contacts are with other highrisk individuals, such that \({p}_{hh}=d\cdot {p}_{hl}={d\cdot p}_{lh}=d\cdot {p}_{ll}\), where \(d\ge 1\). We assume \({p}_{hl}={p}_{lh}={p}_{ll}=1\) and let \(d=3\). To account for the increased number of contacts, we now use \({C}_{{A}_{h}}={C}_{{B}_{h}}=1, {C}_{{A}_{l}}={C}_{{B}_{l}}=1.28\). As for case 3, we fit a Poisson regression model to the simulation outcome and investigate the CI for the ethnicity regression coefficient when varying assortativity. This definition of elevated risk could be due to larger households, less adherence to social distancing advice, or an occupation that requires more contacts.
Transmission model
We simulate disease spread by a stochastic SIRmodel [1]. Let \({S}_{i}\), \({I}_{i},\) and \({R}_{i}\) be the number of susceptible (\(S\)), infectious (\(I\)), and recovered (\(R\)) individuals in group \(i\). The following set of difference equations describes the disease development over time \({S}_{i}\left(t+\Delta t\right)= {S}_{i}\left(t\right){X}_{1},\) \({I}_{i}\left(t+\Delta t\right)= {I}_{i}\left(t\right)+{X}_{1}{X}_{2},\) where \({X}_{1}\sim \mathrm{Binom}\left({S}_{i}\left(t\right),\Delta t\sum_{j}\frac{{susc}_{i}{p}_{ij} in{f}_{j}{I}_{j}}{N}\right),\) and \({X}_{2}\sim \mathrm{Binom}\left({I}_{i}\left(t\right), {\Delta }_{t}\gamma \right),\) where \(1/\gamma =3\) days is the assumed duration of the infectious period, \(\Delta t=0.2\) is the timestep used in the simulations, and \(i=A, B\) in the twogroup setting, and \(i={A}_{h}\), \({A}_{l}\), \({B}_{h}\), \({B}_{l}\) in the fourgroup setting. We assume a constant population size so that \({R}_{i}={N}_{i}{S}_{i}{I}_{i}\).
Analysis of simulation results
Poisson regression model
We fit a Poisson regression model on the outcome of the disease simulation, here exemplified in the fourgroup setting. We include ethnicity and risk level as covariates, resulting in the following regression model \({\mathrm{log}{\mu }_{i}=\mathrm{log}{N}_{i}+{\beta }_{0}+\beta }_{e}{e}_{i}+{\beta }_{r}{r}_{i}\), where \({Y}_{i}\sim \mathrm{Poisson}\left({\mu }_{i}\right)\) denotes the number of infected individuals in group \(i\), \({e}_{i}\) and \({r}_{i}\) denote ethnicity and risk group status for group \(i\), respectively, and as before, \({N}_{i}\) are the population sizes of each group used as an offset. We are interested in the estimated regression coefficient \({\beta }_{e}\), which quantifies the effect of ethnicity. We let ethnicity group \(B\) and \(l\) be the reference levels for ethnicity and risk level, respectively. We assume a standard significance level of 0.05.
Unexplained relative risk
Since a higher proportion of ethnicity group \(B\) belongs to the highrisk group, we expect a larger infected proportion in ethnicity group \(B\). We denote the explained relative risk in ethnicity group \(B\) compared to \(A\) as the relative risk which can be explained by a higher proportion in the high individual risk group. This can be computed from the proportion of highrisk individuals in each ethnic group, together with the observed relative risk between the high and lowrisk group. Specifically, let \({R}_{{A}_{h}}, {R}_{{A}_{l}, }, {R}_{{B}_{h}},\) and \({R}_{{B}_{l}}\) be the total proportion of infected individuals in each group. The explained relative risk, \(ER\), is then given as \(ER=(0.5+0.5{RR}^{r})/(0.9+0.1{RR}^{r})\), where \({RR}^{r}=({R}_{{A}_{h}}+{R}_{{B}_{h}})/({R}_{{A}_{l}}+{R}_{{B}_{l}})\), is the observed relative risk between the high and lowrisk groups. The unexplained relative risk, \(UR\), in ethnic group \(B\) is then given by the observed relative risk in ethnic group \(B\), minus the explained relative risk, that is \(UR=({R}_{{B}_{h}}+{R}_{{B}_{l}})/({R}_{{A}_{h}}+{R}_{{A}_{l}})ER\). For noncommunicable diseases, we would expect no unexplained relative risk.
Estimating from the datageneration model using approximate Bayesian computation
In addition to analysing the simulated data with regression models, we apply a simple Markov Chain Monte Carlo approximate Bayesian computation (ABCMCMC) algorithm [26] to estimate the risk parameters in the groups from the simulations. We use these simulations to explore if we can obtain the correct parameters when the true datagenerating model is accounted for. Details are provided in the supplementary material.