Framework
We consider analyses where the goal is to use observed counts or prevalence of infection in various groups to learn about the underlying parameters of disease transmission. In this paper we analyse a simplified SIR-model where we divide the population into first two and then extend to four sub-groups. The two-group setting is the most parsimonious setting considered and is used to illustrate the relationship between individual and group-level risks, whether one can use a fitted regression model for contrafactual predictions to generalise to other group sizes, and the dependence between the incidence ratio in the two groups. The four-group setting is motivated by the studies of ethnicity and COVID-19 and is used to study the estimated regression coefficients from observational data.
We assume that we have data on the number of infected and total population sizes for each sub-group (individual-level data will give similar results) at timepoint t. In this simplified setup, either four or sixteen key parameters control the transmission dynamics between the two or four sub-groups, respectively. These parameters, \({\beta }_{ij}\), typically derived from contact studies, are summarised in a Who-Acquires-Infection-from-Whom matrix as usual in infectious disease modelling, where \({\beta }_{ij}\) is the rate of transmission from an individual in group \(j\) to an individual in group \(i\). We assume here that all groups have the same duration of infectiousness.
An important distinction is that “high-risk” can in this setup be due to at least three different factors that alone or in combination can explain an overrepresentation of cases in one of the sub-groups. We can decompose each \({\beta }_{ij}\) as \({\beta }_{ij}={inf}_{j}\times {c}_{ij}\times {susc}_{i}\), where \({inf}_{j}\) is the infectivity of group \(j\), \({c}_{ij}\) is the number of contacts group \(i\) has with group \(j\) per time unit, and \(sus{c}_{i}\) is the susceptibility of group \(i\). Hence, we in general expect higher incidence in a group with higher susceptibility, or if the group has overall more contacts. We also expect higher incidence in a group with higher infectivity, if there are more contacts within than between groups. These three different potential causes of increased infection prevalence may have different effects on the overall disease dynamics in the population. In this study, we assume for simplicity that \({inf}_{j}=1\) in all simulations.
We define \({c}_{ij}\) as the total contact rate between groups \(i\) and \(j\), hence depending on the population sizes in the groups. It is easier to specify our scenarios in terms of \({p}_{ij}\), which is the relative frequency of contacts between individuals in groups \(i\) and \(j\). We specify a situation in which individuals have twice as many contacts within their group as between groups by \({p}_{ij}=\left(\begin{array}{cc}2/3& 1/3\\ 1/3& 2/3\end{array}\right).\)
This matrix is related to \({c}_{ij}\) as follows:
$${c}_{ij}=\frac{{p}_{ij}}{{w}_{i}},{w}_{i}=\frac{{\sum }_{j}{p}_{ij}{N}_{j}}{N{C}_{i}},$$
where \({C}_{i}\) is the total relative contact rate for group \(i\). If all the groups have the same number of contacts, we have \({C}_{i}=1\). Hence, the \({c}_{ij}\) are calculated from the \({p}_{ij}\) in a way that ensures that the total number of contacts per time unit for everyone in group \(i\) is given by \({C}_{i}\). In all settings except the last one (defined as case 4), we use \({C}_{i}=1\), such that all groups have the same total contact rate. Hence, when \({C}_{i}=1\), the \({c}_{ij}\) are calculated from the \({p}_{ij}\), ensuring that all groups have the same total number of contacts in the population per time unit.
In all simulations, we start with 0.1% infected in each group as our initial conditions.
Simulated population
Two sub-groups
We simulate data in a population with \(N=100 000\) citizens. We first split the population into two sub-groups \(\mathrm{A}\) and \(\mathrm{B}\), and we assume a higher susceptibility in group \(\mathrm{B}\) than in group \(\mathrm{A}\), such that \(sus{c}_{B}=a\cdot {susc}_{A}\), where \(a\ge 1\). We let \({N}_{A}=90 000\) and \({N}_{B}=10 000\) be the number of individuals in groups \(\mathrm{A}\) and \(\mathrm{B}\), respectively. The relative contact matrix is defined by
\(\left(\begin{array}{cc}{p}_{AA}& {p}_{AB}\\ {p}_{BA}& {p}_{BB}\end{array}\right),\) where \({p}_{ij}\), \(i,j\in A,B\) is the relative number of contacts group \(i\) has with individuals in group \(j\).
As an example of a contact structure in this population, let \({C}_{A}={C}_{B}=1, {p}_{AB}={p}_{BA}=1/3,{ p}_{AA}={p}_{BB}=2/3\). This corresponds to a setting where all individuals have the same total number of contacts, but twice as many contacts within their own sub-group. Plugging in the quantities we get \({{c}_{AA}=1.05, c}_{AB}=0.53, {c}_{BA}=0.91,{c}_{BB}=1.82.\)
Cases
We simulate two cases in the two-group setting, case 1 and case 2.
Case 1: We let \(a=1.2\). We vary the basic reproduction number \({R}_{0}\) (or, equivalently, \({susc}_{A}\)) and the timepoint for which we compare the outcome (T). We assume no contact between the groups, so \(\left(\begin{array}{cc}{p}_{AA}& {p}_{AB}\\ {p}_{BA}& {p}_{BB}\end{array}\right)=\left(\begin{array}{cc}1& 0\\ 0& 1\end{array}\right)\). We compute the incidence rate ratio (hereafter denoted as relative risk) resulting from the simulations, that is, the proportion infected in sub-group \(B\) divided by the proportion infected in sub-group \(A\). For non-communicable diseases, one would expect a one-to-one relationship between individual risk of disease and relative risk, i.e. a relative risk of 1.2.
Case 2: We set \({susc}_{A}\) such that for \(a=1\), \({R}_{0}=0.9\), and then vary \(a\). See supplementary material for the expression for \({R}_{0}\). We assume random mixing between the groups, so \(\left(\begin{array}{cc}{p}_{AA}& {p}_{AB}\\ {p}_{BA}& {p}_{BB}\end{array}\right)=\left(\begin{array}{cc}1/2& 1/2\\ 1/2& 1/2\end{array}\right)\). We study the results for time point T = 200 days, which for most parameter choices in the paper corresponds to after the epidemic has burnt out. We compare two different outcomes:
-
A)
We investigate whether the predictions of a Poisson regression model (see Sect. Poisson regression model) fitted to simulated data on the number infected in each group from a setting with a difference between the two groups can be used to predict the total number of infections in the setting with no difference between the groups. Specifically, the fitted regression model is used to predict the proportion infected in the contrafactual scenario where the entire population belongs to the low-risk group \(A\), that is, \({N}_{A}=100 000, {N}_{B}=0\). We compare this prediction with the simulations when the whole population is in \(A\). For non-communicable diseases, one would expect no discrepancy between the simulations and the predictions from the fitted model.
-
B)
We plot the proportion infected in the low-risk group when we vary the susceptibility in the high-risk group. If individual risks of infection only depended on individual characteristics, one would expect the proportion infected in the low-risk group to be independent of the properties of the high-risk group.
Four sub-groups
We split the population into four groups, \({A}_{h}\), \({A}_{l}\), \({B}_{h}\), and \({B}_{l}\), inspired by the recent analyses of ethnicity and risk of COVID-19 infection. We assume two ethnicity groups \(A\) and \(B\), and that one ethnicity group (\(B\)) is disproportionately represented in a high-risk group. Hence, the two ethnicity groups are divided into two risk groups, with one risk group (\(h\)) having a higher risk than the other (\(l\)). This could for example represent a high-risk occupation. As before, we let \({N}_{A}=90 000\) individuals, and \({N}_{B}=10 000\) individuals. We further assume that 10% and 50% of the individuals in ethnicity groups \(A\) and \(B\) are in the high-risk group, respectively. Hence, we let group \({A}_{h}\) be the \({N}_{{A}_{h}}=9 000\) individuals in ethnicity group \(A\) with the high individual risk, group \({A}_{l}\) be the \({N}_{{A}_{l}}=81 000\) individuals in ethnicity group \(A\) with low individual risk, group \({B}_{h}\) be the \({N}_{{B}_{h}}=5 000\) individuals in ethnicity group \(B\) with high individual risk, and finally \({B}_{l}\) be the \({N}_{{B}_{l}}=5 000\) individuals in ethnicity group \(B\) with low individual risk.
The contact matrix is defined by the contacts between the risk levels \(h\) and \(l\), and the contacts between the ethnicity groups \(A\) and \(B\). Hence, let
\(\left(\begin{array}{cc}{p}_{hh}& {p}_{hl}\\ {p}_{lh}& {p}_{ll}\end{array}\right)\) be the relative contact matrix between the risk levels, where \({p}_{ij}\), \(i,j\in h,l\) is the relative number of contacts risk group \(i\) has with individuals in risk group \(j\). Further, let
\(\left(\begin{array}{cc}{p}_{AA}& {p}_{AB}\\ {p}_{BA}& {p}_{BB}\end{array}\right)\) be the relative contact matrix between ethnicity groups \(A\) and \(B\), where \({p}_{ij}\), \(i,j\in A,B\) is the relative number of contacts ethnicity group \(i\) has with individuals in ethnicity group \(j\). The relative contact matrix between the groups \({A}_{h}\), \({A}_{l}\), \({B}_{h}\), and \({B}_{l}\) is then defined by the outer product of these two matrices, such that
$$\left(\begin{array}{cc}\begin{array}{cc}{p}_{{A}_{h}{A}_{h}}& {p}_{{A}_{h}{A}_{l}}\\ {p}_{{A}_{l}{A}_{h}}& {p}_{{A}_{l}{A}_{l}}\end{array}& \begin{array}{cc}{p}_{{A}_{h}{B}_{h}}& {p}_{{{A}_{h}B}_{l}}\\ {p}_{{A}_{l}{B}_{h}}& {p}_{{A}_{l}{B}_{l}}\end{array}\\ \begin{array}{cc}{p}_{{B}_{h}{A}_{h}}& {p}_{{B}_{h}{A}_{l}}\\ {p}_{{B}_{l}{A}_{h}}& {p}_{{B}_{l}{A}_{l}}\end{array}& \begin{array}{cc}{p}_{{B}_{h}{B}_{h}}& {p}_{{B}_{h}{B}_{l}}\\ {p}_{{B}_{l}{B}_{h}}& {p}_{{B}_{l}{B}_{l}}\end{array}\end{array}\right)=\left(\begin{array}{cc}\begin{array}{cc}{p}_{hh}{p}_{AA}& {p}_{hl}{p}_{AA}\\ {p}_{lh}{p}_{AA}& {p}_{ll}{p}_{AA}\end{array}& \begin{array}{cc}{p}_{hh}{p}_{AB}& {p}_{hl}{p}_{AB}\\ {p}_{lh}{p}_{AB}& {p}_{ll}{p}_{AB}\end{array}\\ \begin{array}{cc}{p}_{hh}{p}_{BA}& {p}_{hl}{p}_{BA}\\ {p}_{lh}{p}_{BA}& {p}_{ll}{p}_{BA}\end{array}& \begin{array}{cc}{p}_{hh}{p}_{BB}& {p}_{hl}{p}_{BB}\\ {p}_{lh}{p}_{BB}& {p}_{ll}{p}_{BB}\end{array}\end{array}\right).$$
We vary the amount of mixing within and between the ethnicity groups (assortativity). We define random mixing in ethnicity groups as a contact structure where every individual has the same probability of being in contact with any other person irrespective of ethnicity groups, while we denote an assortative contact structure as homogeneous mixing. We simulate with different levels of assortative mixing, defined by the relative number of contacts within to between the ethnicity groups. An assortativity of 1 then corresponds to random mixing with \({p}_{AA}={p}_{AB}={p}_{BA}={p}_{BB}=1/2\). An assortativity of 2 is defined as twice as many contacts within ethnicity groups, that is, \({p}_{AA}={p}_{BB}=2/3, {p}_{AB}={p}_{BA}=1/3\), where we divide by 3 for normalisation. More generally, an assortativity of \(x\) is defined by \(x\) times as many contacts within as between ethnicity groups, such that \({p}_{AA}={p}_{BB}=x/(x+1), {p}_{AB}={p}_{BA}=1/(x+1)\).
Cases
We simulate two different cases in the four-group setting, cases 3 and 4, with varying definitions of the high-risk individuals. We set the parameters in both cases such that \({R}_{0}=1.3\). We study the results at T = 200 days.
Case 3: In case 3, the high-risk individuals are defined by a higher susceptibility than the low-risk individuals, in such manner that \({susc}_{{A}_{h}}={susc}_{{B}_{h}}=a\cdot sus{c}_{{A}_{l}}={a\cdot susc}_{{B}_{l}}\), with \(a\ge 1\). We assume \({p}_{ij}=1/2\) for all \(i,j\in h,l\) and vary \(a\). This definition of elevated risk could correspond to closer contact, genetic or biological differences, fewer hygienic precautions, or a mixture of those effects. We compute how much of the relative risk of ethnicity group \(B\) compared to \(A\) can be explained by a higher proportion of high-risk individuals (see Sect. Unexplained relative risk). For non-communicable diseases, we would expect no unexplained relative risk. For \(a=2\) we fit a Poisson regression model on the outcome of the simulations, adjusting for risk level and ethnicity. We investigate how the confidence interval (CI) of the regression coefficient related to ethnicity varies with assortativity.
Case 4: In case 4, we assume \({susc}_{A_h}={susc}_{A_l}={susc}_{B_h}={susc}_{B_l}=1\), but we let the high-risk individuals have more contacts than the low-risk individuals. The additional contacts are with other high-risk individuals, such that \({p}_{hh}=d\cdot {p}_{hl}={d\cdot p}_{lh}=d\cdot {p}_{ll}\), where \(d\ge 1\). We assume \({p}_{hl}={p}_{lh}={p}_{ll}=1\) and let \(d=3\). To account for the increased number of contacts, we now use \({C}_{{A}_{h}}={C}_{{B}_{h}}=1, {C}_{{A}_{l}}={C}_{{B}_{l}}=1.28\). As for case 3, we fit a Poisson regression model to the simulation outcome and investigate the CI for the ethnicity regression coefficient when varying assortativity. This definition of elevated risk could be due to larger households, less adherence to social distancing advice, or an occupation that requires more contacts.
Transmission model
We simulate disease spread by a stochastic SIR-model [1]. Let \({S}_{i}\), \({I}_{i},\) and \({R}_{i}\) be the number of susceptible (\(S\)), infectious (\(I\)), and recovered (\(R\)) individuals in group \(i\). The following set of difference equations describes the disease development over time \({S}_{i}\left(t+\Delta t\right)= {S}_{i}\left(t\right)-{X}_{1},\) \({I}_{i}\left(t+\Delta t\right)= {I}_{i}\left(t\right)+{X}_{1}-{X}_{2},\) where \({X}_{1}\sim \mathrm{Binom}\left({S}_{i}\left(t\right),\Delta t\sum_{j}\frac{{susc}_{i}{p}_{ij} in{f}_{j}{I}_{j}}{N}\right),\) and \({X}_{2}\sim \mathrm{Binom}\left({I}_{i}\left(t\right), {\Delta }_{t}\gamma \right),\) where \(1/\gamma =3\) days is the assumed duration of the infectious period, \(\Delta t=0.2\) is the time-step used in the simulations, and \(i=A, B\) in the two-group setting, and \(i={A}_{h}\), \({A}_{l}\), \({B}_{h}\), \({B}_{l}\) in the four-group setting. We assume a constant population size so that \({R}_{i}={N}_{i}-{S}_{i}-{I}_{i}\).
Analysis of simulation results
Poisson regression model
We fit a Poisson regression model on the outcome of the disease simulation, here exemplified in the four-group setting. We include ethnicity and risk level as covariates, resulting in the following regression model \({\mathrm{log}{\mu }_{i}=\mathrm{log}{N}_{i}+{\beta }_{0}+\beta }_{e}{e}_{i}+{\beta }_{r}{r}_{i}\), where \({Y}_{i}\sim \mathrm{Poisson}\left({\mu }_{i}\right)\) denotes the number of infected individuals in group \(i\), \({e}_{i}\) and \({r}_{i}\) denote ethnicity and risk group status for group \(i\), respectively, and as before, \({N}_{i}\) are the population sizes of each group used as an offset. We are interested in the estimated regression coefficient \({\beta }_{e}\), which quantifies the effect of ethnicity. We let ethnicity group \(B\) and \(l\) be the reference levels for ethnicity and risk level, respectively. We assume a standard significance level of 0.05.
Unexplained relative risk
Since a higher proportion of ethnicity group \(B\) belongs to the high-risk group, we expect a larger infected proportion in ethnicity group \(B\). We denote the explained relative risk in ethnicity group \(B\) compared to \(A\) as the relative risk which can be explained by a higher proportion in the high individual risk group. This can be computed from the proportion of high-risk individuals in each ethnic group, together with the observed relative risk between the high and low-risk group. Specifically, let \({R}_{{A}_{h}}, {R}_{{A}_{l}, }, {R}_{{B}_{h}},\) and \({R}_{{B}_{l}}\) be the total proportion of infected individuals in each group. The explained relative risk, \(ER\), is then given as \(ER=(0.5+0.5{RR}^{r})/(0.9+0.1{RR}^{r})\), where \({RR}^{r}=({R}_{{A}_{h}}+{R}_{{B}_{h}})/({R}_{{A}_{l}}+{R}_{{B}_{l}})\), is the observed relative risk between the high- and low-risk groups. The unexplained relative risk, \(UR\), in ethnic group \(B\) is then given by the observed relative risk in ethnic group \(B\), minus the explained relative risk, that is \(UR=({R}_{{B}_{h}}+{R}_{{B}_{l}})/({R}_{{A}_{h}}+{R}_{{A}_{l}})-ER\). For non-communicable diseases, we would expect no unexplained relative risk.
Estimating from the data-generation model using approximate Bayesian computation
In addition to analysing the simulated data with regression models, we apply a simple Markov Chain Monte Carlo approximate Bayesian computation (ABC-MCMC) algorithm [26] to estimate the risk parameters in the groups from the simulations. We use these simulations to explore if we can obtain the correct parameters when the true data-generating model is accounted for. Details are provided in the supplementary material.