A standard approach for describing infectious disease transmission is the use of compartmental models, such as Susceptible-Infected-Recovered (SIR)-like models [5]. In general, these models divide the population into subpopulations with disjoint properties. Transition rates allow for flows between the subpopulations and, in combination with the initial values of the subpopulations, define the time evolution of the system. The ordinary differential equation (ODE) representation of the compartmental scheme we use is the well-known Susceptible-Exposed-Infected-Recovered (SEIR) model [6]:
$$\begin{array}{*{20}l} \begin{array}{rrrrrl} \dot{S} &=& -\beta(t)\cdot I\cdot S/N & & \\ \dot{E} &=& \beta(t)\cdot I\cdot S/N & -\delta\cdot E & \\ \dot{I} &=& & \delta\cdot E & -\gamma\cdot I \\ \dot{R} &=& & & \gamma\cdot I \end{array} \end{array} $$
(1)
with N=S+E+I+R denoting the entire population and where the dot notation indicates time derivatives. Furthermore, β, δ and γ denote the infection rate, the rate at which exposed individuals become infectious and the rate at which infectious individuals recover or die, respectively. The rationale for choosing this model class is that it is concise, which is important for frequent evaluation, and that it allows for a more flexible infection time course compared with the standard SIR model.
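For illustration, the following minimal sketch integrates the SEIR system of Eq. 1 with a constant infection rate; all parameter values and initial conditions are illustrative assumptions, not fitted results:

```python
# Minimal sketch of the SEIR model from Eq. 1 with a constant infection
# rate beta; parameter values are purely illustrative assumptions.
import numpy as np
from scipy.integrate import solve_ivp

def seir_rhs(t, x, beta, delta, gamma, N):
    """Right-hand side of the SEIR ODE system (Eq. 1)."""
    S, E, I, R = x
    dS = -beta * I * S / N
    dE = beta * I * S / N - delta * E
    dI = delta * E - gamma * I
    dR = gamma * I
    return [dS, dE, dI, dR]

N = 83_000_000                                 # population size (roughly Germany)
beta, delta, gamma = 0.4, 1 / 5.5, 1 / 3.0     # assumed rates per day
x0 = [N - 200, 100, 100, 0]                    # illustrative initial state [S, E, I, R]

sol = solve_ivp(seir_rhs, (0, 120), x0, args=(beta, delta, gamma, N),
                t_eval=np.arange(0, 121), method="LSODA")
S, E, I, R = sol.y
```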
A special characteristic of the current pandemic is the massive political and social reaction. In contrast to, e.g., the annual influenza season, during which social and professional life used to proceed largely as usual, the COVID-19 pandemic has led to vast political interventions and personal restrictions aiming mainly at the reduction of infections [7]. Within the SEIR scheme, these changes over time can be described by a time-dependent infection rate β(t), which translates to an effective time-dependent reproduction number \(R(t)=\frac {\beta (t)\cdot S}{\gamma \cdot N}\). The latter quantifies how many other people are infected on average by a single infectious individual and determines at which rate the number of currently infectious individuals is growing (R(t)>1) or decaying (R(t)<1). It should be noted that, although β(t) is extrapolated as remaining constant (see Eq. 4), R(t) is not necessarily constant, because R(t) includes the monotonically decreasing susceptible fraction \(\frac {S(t)}{N}\). The dynamics of all additional states can be found, for one example, in the supplement (Additional file 1).
Several studies deal with the problem of a time-dependent infection rate in different ways. For example, at the beginning of the COVID-19 pandemic, the impact of different non-pharmacological interventions (NPIs) was examined by implementing β(t) via different variants of (smoothed) step functions [8–11]. Often, these approaches are restricted to time ranges in which the infection rate is assumed to be constant, or monotonically decreasing or increasing, respectively.
In contrast, we aim for a more general approach which enables the infection rate to vary flexibly, i.e. to decrease and/or increase repeatedly within the considered time range. This is necessary for an accurate description of the COVID-19 transmission dynamics, since they are influenced by many factors that may vary over the course of the ongoing COVID-19 pandemic:
1. Various NPIs are implemented, repealed and reintroduced iteratively [12].
2. The population’s compliance with regulative measures changes over time [13].
3. Seasonal effects, e.g. weather conditions, lead to changes in infection risk [14].
4. Mutations alter the physiological mechanisms underlying the disease transmission and other aspects [15].
5. Vaccinations reduce the population’s susceptible fraction [16].
6. Air pollution may enhance COVID-19 severity [17].
Quantifying the effects of the above points on the infection rate is hardly feasible and, within an evolving pandemic, virtually impossible. Therefore, we omit an explicit formulation of the above effects and instead estimate an effective infection rate. In order to fit a strictly positive, time-dependent infection rate simultaneously with the SEIR model’s parameters, we introduce the following parametrization for the infection rate:
$$\begin{array}{*{20}l} \beta(t) = b\cdot \frac{1}{1+\mathrm{e}^{-f(t)}}\:\:, \end{array} $$
(2)
where the argument of the exponential function is given by an interpolating cubic spline
$$\begin{array}{*{20}l} f(t) = \text{cubic\_spline}\left(t,\{\tau_{i},u_{i}\}_{i\in\{1,\dots,n\}}\right)\:\:. \end{array} $$
(3)
We utilize joint estimation of input spline and ODE parameters as introduced for biological systems in [18]. The composition of the interpolating spline (3) with the logistic function (2) allows for a nearly arbitrary time dependence, while still ensuring that the infection rate β(t) is strictly positive, smooth and bounded by a maximal value b. The cubic spline curve is determined by estimated parameters ui=f(τi) that represent its values at fixed and evenly spaced dates τi for i∈{1, …, n−2}, which cover the time range of observed data. We chose n=15, which corresponds to roughly one degree of freedom per month and turned out to be a reasonable choice during the development process. In general, there is a trade-off: the spline should be flexible enough to describe all infection waves, but must not overfit in any of the fitted regions.
In our model, the last two spline knots are placed after the date tLast of the last data point: τn−1=tLast+50d and τn=tLast+300d. The value un−1 is fitted to allow for some flexibility in the most recent regime, whereas un=0 is fixed for numerical stability and to reflect the end of the pandemic after at least 300 days.
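The following sketch illustrates the parametrization of Eqs. 2 and 3 together with the knot layout described above; the knot values ui and the bound b are placeholder assumptions here, whereas in the actual procedure they are estimated jointly with the ODE parameters:

```python
# Sketch of the infection-rate parametrization of Eqs. 2-3: an interpolating
# cubic spline f(t) through knots (tau_i, u_i), passed through a logistic
# function and scaled by b.
import numpy as np
from scipy.interpolate import CubicSpline

t_last = 396.0                      # day index of the last data point (assumed)
n = 15
tau = np.concatenate([np.linspace(0.0, t_last, n - 2),   # evenly spaced knots over the data range
                      [t_last + 50.0, t_last + 300.0]])  # two extrapolation knots
u = np.random.default_rng(0).normal(size=n)              # placeholder knot values u_i
u[-1] = 0.0                                              # u_n fixed to zero

f = CubicSpline(tau, u)             # f(t) = cubic_spline(t, {tau_i, u_i}), Eq. 3
b = 0.5                             # assumed upper bound for beta(t)

def beta(t):
    """Time-dependent infection rate beta(t) = b / (1 + exp(-f(t))), Eq. 2."""
    return b / (1.0 + np.exp(-f(t)))
```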
The predictions for the infection dynamics are primarily determined by the time-dependent infection rate β(t). In general, assumptions about the future development of β(t) are difficult to justify, as many different factors contribute to it. For illustrative purposes, several different assumptions could be made and visualised, as done e.g. in various online simulator tools [19]. For example, one such scenario study nicely illustrates the effectiveness of a Test-Trace-Isolate strategy [20].
For a data-driven approach focused on short-term forecasts, we need to be more practical: For extrapolation purposes, we fix
$$ \beta(t>t_{\text{Last}}) = \beta(t_{\text{Last}}) $$
(4)
i.e. we assume the infection rate to remain constant from the day on which the last data point was reported. Alternatively, some functional form incorporating the first or even higher-order derivatives could be used for β(t>tLast). As it is a priori unclear which functional form and additional assumptions might be appropriate, we decided on the simplest ansatz and fix it to β(tLast). Note also that, even though β is fixed for t>tLast, the predicted dynamics continue to evolve, since the model structure introduces an inherent delay.
Data-driven approach
Typically, there exists a multitude of model classes and structures which can be used to describe the same phenomenon. However, it is generally not possible to transfer results about estimated parameters between different models in a straightforward manner due to their differing mechanistic structures. To circumvent this problem, we here rely on a purely data-driven approach, meaning that no prior knowledge about parameter values is incorporated into the optimization procedure. The only three a priori fixed parameters are the initial numbers of individuals in the susceptible, exposed and recovered states: Sinit, Einit and Rinit. Time point zero t0 is set to the first day with at least a total of 100 reported cases in order to ensure the well-mixing assumption of ODE modeling. Sinit was set to the total population of the respective region as given by the Federal Statistical Office of Germany [21]. Einit was set to γ·Iinit/δ, which is motivated by the assumption that \(\dot {I}\approx 0\) at the beginning of an epidemic, reflecting a slow onset. Rinit is set to zero. The only remaining initial occupation number, Iinit, is estimated from the data.
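A minimal sketch of this initialization, where the region’s population and Iinit are placeholder assumptions (Iinit is in fact estimated from the data):

```python
# Sketch of the a priori fixed initial conditions described above.
gamma, delta = 1 / 3.0, 1 / 5.5      # assumed rates, as in the SEIR sketch
S_init = 11_100_000                  # total population of the region [21] (assumed value)
I_init = 150.0                       # placeholder; estimated from data in the real procedure
E_init = gamma * I_init / delta      # chosen such that dI/dt is approximately 0 at t0
R_init = 0.0
x0 = [S_init, E_init, I_init, R_init]
```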
Link between model and observed data
In order to calibrate the ODE model, it needs to be linked to the observed data. The data we use for calibration is the daily incidence yi published by the reporting date (Meldedatum) ti at the local health authority. Therefore, we introduce the observation function
$$\begin{array}{*{20}l} y(t_{i}) = q\cdot\lambda_{D(t_{i})}\cdot\left(\delta\cdot E(t_{i})\cdot \Delta\right)\:\:, \end{array} $$
(5)
where the parameters can be interpreted as follows:
- q∈[0,1] is the fraction of all infectious individuals that are detected and reported.
- D(ti)∈{1,...,7} is an index for the weekday at date ti where {1,...,7} are naturally identified with the weekdays W={Monday,…,Sunday}.
- λD is a factor for the weekday D that adjusts for the weekly modulation occurring in the IfSG data (see Weekly modulation factors).
- (δ·E(t)·Δ) approximates the influx into the state I(t) of Eq. 1. As the considered data represents daily incidences, we set Δ to 1 day. This approximation of the true incidence quantity \(\int _{t-1}^{t}\delta \cdot E(t^{\prime }) \mathrm {d}t^{\prime }\) is exact if the state E(t) remains constant within that day. Comparison with this exact but computationally much more expensive approach showed minor deviations for real data applications.
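As an illustration, the observation function of Eq. 5 could be evaluated as in the following sketch; the parameter values and the zero-based weekday indexing are assumptions made for this example:

```python
# Sketch of the observation function of Eq. 5, mapping the simulated state
# E(t_i) to the expected reported daily incidence.
import numpy as np

def observed_incidence(E_t, t_i, q, lam, delta, Delta=1.0):
    """y(t_i) = q * lambda_{D(t_i)} * delta * E(t_i) * Delta."""
    D = int(t_i) % 7                 # weekday index D(t_i), here 0-based for convenience
    return q * lam[D] * delta * E_t * Delta

lam = np.ones(7)                     # placeholder weekday factors, summing to 7
y_model = observed_incidence(E_t=5_000.0, t_i=42, q=0.7, lam=lam, delta=1 / 5.5)
```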
The observation function (5) connects the model’s predictions to the reported data. The observations are assumed to scatter around this mean according to a normal distribution:
$$\begin{array}{*{20}l}y_{i} = y(t_{i}) + \epsilon_{i},\quad\epsilon_{i}\sim\mathcal{N}(0,\sigma_{i}^{2})\:\:. \end{array} $$
(6)
As we are dealing with a count process, we use a standard deviation inspired by a Poisson model
$$ \sigma_{i}=C\cdot\sqrt{1+y(t_{i})}\:\: $$
(7)
where the addition of 1 prevents numerical instabilities when the number of infected y(ti) becomes very small. As the standard deviation grows with the square root of the incidence, the variance grows linearly with the expectation value. The error parameter C is fitted jointly with all other parameters.
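Taken together, Eqs. 5–7 define a Gaussian likelihood whose negative logarithm is minimized during fitting. The following sketch shows such an objective under the stated error model; it is an illustration, not the implementation used in the software:

```python
# Sketch of the Gaussian error model of Eqs. 6-7 with a Poisson-inspired
# standard deviation, written as a negative log-likelihood.
import numpy as np

def neg_log_likelihood(y_data, y_model, C):
    """Gaussian negative log-likelihood with sigma_i = C * sqrt(1 + y(t_i))."""
    sigma = C * np.sqrt(1.0 + y_model)                   # Eq. 7
    return 0.5 * np.sum(((y_data - y_model) / sigma) ** 2
                        + np.log(2.0 * np.pi * sigma ** 2))
```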
Investigated time frame
The results of the presented ansatz are calculated on a daily basis. The data used for fitting consist of a time course from the start of the pandemic, March 1st, 2020, through the most recent report, with one data point per day. In this paper, we present the methodology; the shown results were generated on April 1st, 2021, and the fitted data therefore comprise registered infections up to March 31st, 2021. We publish and assess predictions for a forecast horizon of three weeks. This period was selected because we consider the assumption of Eq. 4 not to justify a much longer time frame.
Weekly modulation factors
The IfSG data shows an oscillatory pattern with a period of one week, which can be readily assessed by plotting the distribution of incidences per weekday relative to the rolling 7-day average; an analysis figure is provided in the supplement. The main reason for this pattern is the reporting procedure, which exhibits a major delay during weekends, rather than the actual infection dynamics. Therefore, we account for this effect within the observation function via seven weekday-specific factors λD with the integer D∈{1,...,7}. In order to
1. guarantee that the factors λD essentially do not change the 7-day-incidence and
2. separate the weekly modulation from a global scaling of the observation function, which is realized via the factor q,
we, furthermore, set the constraint that
$$\begin{array}{*{20}l} \sum_{D\in\{1,...,7\}}\lambda_{D}=7\:\:. \end{array} $$
(8)
As a consequence, we are left with six degrees of freedom to describe the weekly effects. For a convenient implementation in the software used, we introduce a Fourier series with six parameters Θweekly={A1,A2,A3,ϕ1,ϕ2,ϕ3}:
$$\begin{array}{*{20}l} \psi(t) = A_{0} + \sum_{k=1}^{3} A_{k}\cdot \cos\left(k\omega t + \phi_{k}\right) \end{array} $$
(9)
where offset and frequency are fixed to
$$\begin{array}{*{20}l} A_{0} = 1,\quad \omega = \frac{2\pi}{7\,\text{days}}\:\:. \end{array} $$
(10)
Instead of fitting the factors λD directly, we rewrite them in terms of equation (9) as
$$\begin{array}{*{20}l} \lambda_{D} = \frac{\psi(D)}{\frac{1}{7}\sum_{j=1}^{7}\psi(j)}\:\: \end{array} $$
(11)
and calibrate the parameters Θweekly. Doing so allows setting the amplitudes A1, A2 and A3 to zero in order to obtain an adjusted curve that does not feature the weekly oscillations and therefore reflects the ideal case of no reporting artifacts in the data.
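The following sketch illustrates Eqs. 9–11, i.e. how the weekday factors λD are derived from the truncated Fourier series; the amplitudes and phases are placeholder assumptions that would be calibrated in practice:

```python
# Sketch of the weekly modulation factors of Eqs. 9-11: a truncated Fourier
# series psi with fixed offset A_0 = 1 and frequency omega = 2*pi/7 days,
# from which the weekday factors lambda_D are derived so that they sum to 7.
import numpy as np

A = np.array([0.2, 0.1, 0.05])        # A_1, A_2, A_3 (placeholders, fitted in practice)
phi = np.array([0.5, -1.0, 2.0])      # phi_1, phi_2, phi_3 (placeholders, fitted in practice)
omega = 2.0 * np.pi / 7.0

def psi(t):
    """psi(t) = 1 + sum_k A_k * cos(k*omega*t + phi_k), Eq. 9."""
    k = np.arange(1, 4)
    return 1.0 + np.sum(A * np.cos(k * omega * t + phi), axis=-1)

D = np.arange(1, 8)                   # weekday indices 1..7
psi_vals = psi(D[:, None])
lam = psi_vals / psi_vals.mean()      # lambda_D, normalized so that sum(lam) = 7 (Eq. 8)
```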
Correction of last data points
The IfSG data published on date tn contains information about the reported cases for all past dates tn, tn−1, …, t1 since the beginning of reporting. However, due to reporting delays between the test facilities, the local health authorities and the RKI, the data update from date tn−1 to tn contains not only cases that were reported to the local health authorities at date tn−1, but also cases reported earlier, at dates tn−2, tn−3 and so on. This means that the number of reported cases published on day tn is underestimated, especially for the most recent dates.
This data artifact can be handled meaningfully in at least two ways. For instance, one could choose to ignore some of the latest data points, since they are most prominently affected. An alternative is to estimate the systematic deviation from historically published data sets and to adjust the data accordingly, thereby avoiding the bias towards smaller incidences in the prediction. For this, one assumes that future data sets will not change reported counts older than four weeks, i.e. older than tn−28. Let \(N_{t_{1}}^{t_{2}}\) denote the number of cases published at time point t1 as reported infections for date t2, where \(N_{t_{1}}^{t_{2}>t_{1}} = 0\) since future cases cannot be reported. From this history of published data sets, one can then learn the correction factor CFk
$$ CF_{k} = \frac{\sum_{\hat t} N_{\hat t}^{\hat t-k}}{\sum_{\hat t} N_{t_{\text{Last}}}^{\hat t-k}} $$
(12)
by which the initially published counts of age k days have to be corrected to obtain the corresponding numbers in the latest data set tn. The factors CFk can then be applied to the newest data set.
This was done separately for Germany and each federal state. We showcase the resulting differences between these two data preprocessing strategies in the Averaging of approaches section. Summary statistics of this quantity are given in the supplement.
For the county level, this adjustment is not as crucial, for two reasons: 1) the case counts are much lower, so that stochasticity can lead to erroneous correction factors, and 2) in our model, the shape of the estimated dynamics is inherited from the federal states.
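For illustration, the correction factors of Eq. 12 could be computed from an archive of published data sets as in the following sketch; the data structure (a mapping from publication day to an array of counts per reporting day) is an assumption made for this example:

```python
# Sketch of the correction factors of Eq. 12.  `published` is assumed to map
# each publication day t_hat (an integer index) to an array of case counts
# per reporting day, built from the archived daily data sets.
import numpy as np

def correction_factors(published, t_last, k_max=28):
    """CF_k from Eq. 12: ratio of initially published to latest counts at age k."""
    latest = published[t_last]
    cf = np.ones(k_max)
    for k in range(k_max):
        num, den = 0.0, 0.0
        for t_hat, counts in published.items():
            if 0 <= t_hat - k < len(counts):
                num += counts[t_hat - k]     # N_{t_hat}^{t_hat - k}
                den += latest[t_hat - k]     # N_{t_last}^{t_hat - k}
        cf[k] = num / den if den > 0 else 1.0
    return cf

# The newest counts for the most recent k days can then be adjusted by 1 / CF_k:
# adjusted[t_last - k] = latest[t_last - k] / cf[k]
```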
Parameter estimation
In general, we follow the maximum likelihood estimation (MLE) approach. As there are a total of 429 regions for which the data has to be fitted and predictions are calculated, we rely on a two-step procedure to reduce computation time which is described in the following paragraphs.
Federal states and Germany
The parameter estimation problem given by the above defined ODE model and the IfSG daily incidence data is solved separately for Germany and each federal state by an MLE approach. The latter has been well established for ODE models [22]. The deviation between data and the model’s observation function as specified in Eq. 5 is minimized, taking into account the error model of Eqs. 6 and (7). The simultaneous parameter estimation of the spline parameters ui follows the lines of [18]. In particular, no explicit regularization term is implemented that penalizes non-vanishing spline curvatures. A full list of parameters Θ and their estimation results \(\hat \Theta \) is shown in the supplement (Additional file 1) for one example, the region of Germany.
County level
Analysis at the rural and urban county level (Land- and Stadtkreise) is important to obtain a spatially resolved picture of the infection dynamics in Germany. The previously described approach is computationally not feasible at this level, since the analysis of 429 regions cannot be completed within 24 hours without access to a sufficiently large computing cluster that is available around the clock without queuing. Moreover, the number of infected individuals at the county level can be so small that inference and prediction based on a purely deterministic model are not appropriate. Therefore, we used the results for the higher-level administrative structure, i.e. the fitted model of the corresponding federal state, as prior information about the dynamics, and scaled it down to the county level for predictions.
More specifically, the county-level data was used to merely estimate two parameters in a county-specific manner: the scaling parameter q from equation (5), which in this context can be related to the proportion of current infections occurring in the county c, and the error parameter C from equation (7) which quantifies the stochasticity of county-level observations analogous to its meaning on the level of federal states. All other parameter values for a county c are taken from the estimated set of parameters \(\hat \Theta _{FS(c)}\) for the corresponding federal state FS(c).
The county-level dynamics might change rapidly as new clusters of infection emerge. For predictions, it is important that such rapid changes are detected by the model calibration procedure, i.e. the fitting of q and C has to account for them. We implemented this requirement by exponentially down-weighting county-level data observed in the past, increasing their standard deviations via
$$ {}\sigma^{2}_{i}\longleftarrow\frac{\sigma^{2}_{i}}{w_{i}}\:\:,\quad w_{i}=A\cdot\sqrt{\left(\exp\left(\frac{t_{i}-t_{\text{Last}}}{\tau}\right)\right)^{2}+\left(\frac{w_{\min}}{A}\right)^{2}}\:\:. $$
(13)
Here, A=7.56 denotes the normalization factor that ensures that the sum of all weights wi equals one. Furthermore, wmin=0.01·A denotes the minimal weight used for data observed in the past. wmin is necessary for numerical reasons: the first summand under the square root decreases exponentially towards zero and would, without the second summand, lead to a divergence of the resulting standard deviation. The value of 0.01 is somewhat arbitrary; it effectively serves as a lower bound on the weights (or, equivalently, an upper bound on the standard deviation) for data points that lie far in the past. A thorough evaluation of this hyperparameter has not been performed, but it is not expected to have a crucial impact on the results. Moreover, we chose τ=7 days as the time constant of this weighting. To be clear, on the county level, σi from equation (7) should be thought of as first being transformed according to the mapping (13) before entering equation (6) as the standard deviation of the Gaussian observation errors.
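A minimal sketch of this weighting step, assuming the day indices and original standard deviations are given as arrays; here the normalization constant A is computed from the data length rather than set to the quoted value of 7.56:

```python
# Sketch of the exponential down-weighting of past county-level data (Eq. 13).
# Weights decay with time constant tau = 7 days; w_min_rel = w_min / A = 0.01
# bounds them from below, and A normalizes the weights to sum to one.
import numpy as np

def weighted_sigma(sigma, t, t_last, tau=7.0, w_min_rel=0.01):
    """Inflate sigma_i for past data: sigma_i^2 -> sigma_i^2 / w_i (Eq. 13)."""
    raw = np.sqrt(np.exp((t - t_last) / tau) ** 2 + w_min_rel ** 2)
    A = 1.0 / raw.sum()               # normalization so that sum(w_i) = 1
    w = A * raw
    return sigma / np.sqrt(w)
```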
As for the analysis of the federal states, the described scaling procedure for the counties is updated on a daily basis, i.e. the county-specific parameters q and C are re-estimated every day. This accounts for time-dependent deviations of the local infection history from the federal state level, i.e. each county has its individual kinetics.
Calculation of uncertainties
To quantify the uncertainty in the predictions of the model, our forecasting tool provides confidence intervals along with the proposed predictions. Here, we describe two main sources of uncertainty: parameter uncertainty and approach uncertainty. The first is captured by simulating all parameter combinations that agree with the observed data, as explained in the Profile likelihood analysis section; the second is incorporated by running the analysis with several models, as detailed in the Averaging of approaches section.
Profile likelihood analysis
For non-linear models, uncertainties for estimated parameters can be determined using the profile likelihood (PL) method which estimates parameter values that still ensure agreement of model and data to a certain confidence level in a pointwise and iterative manner [23]. This approach has been showcased for infectious disease models [24]. Parameter uncertainties naturally translate to prediction uncertainties which can be analyzed systematically [25]. Following the given references, we simulate the data-compatible parameter combinations from the parameter profiles and then take the envelope of the resulting family of curves to obtain confidence intervals.
One could also analyze the uncertainty of a model prediction directly via the prediction profile likelihood method [26]. However, prediction profiles need to be computed via a costly iterative fitting procedure for each predicted quantity and time point separately. By instead reusing the parameter combinations from the profile likelihood method, we can calculate uncertainties for any desired model quantity and time point by simulation alone, which renders this method more efficient for our purposes.
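The resulting procedure can be sketched as follows, where `simulate` stands for solving the model and applying the observation function for one parameter vector (an assumed helper, not part of the described software):

```python
# Sketch of deriving prediction bands from profile-likelihood results: every
# parameter vector found to be compatible with the data during profiling is
# simulated, and the pointwise envelope of the resulting curves yields the band.
import numpy as np

def prediction_band(parameter_sets, simulate, t_grid):
    """Pointwise envelope over trajectories of all data-compatible parameter sets."""
    trajectories = np.array([simulate(theta, t_grid) for theta in parameter_sets])
    return trajectories.min(axis=0), trajectories.max(axis=0)
```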
Averaging of approaches
When utilizing ODE models to describe certain aspects of reality, a multitude of assumptions is implicitly made, which include (but are not limited to) the selected model structure, the noise model of the data and the appropriate data preprocessing. All these decisions together constitute a certain approach. These necessary decisions along the modeling process affect the space of possibly described and therefore also predicted dynamics. To account for this source of uncertainty, we perform the procedure described so far simultaneously for several approaches and merge their results into one comprehensive result. The latter is done by taking the mean / minimum / maximum of the different approaches’ MLE / lower bound / upper bound curves, respectively. Accounting for different modeling decisions prevents overconfidence in the results.
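A minimal sketch of this merging step, assuming the per-approach MLE and bound curves are available as arrays evaluated on a common time grid:

```python
# Sketch of merging the results of several modeling approaches: the reported
# point prediction is the mean of the individual MLE curves, and the merged
# confidence band is the union of the individual bands.
import numpy as np

def merge_approaches(mle_curves, lower_bounds, upper_bounds):
    """Combine per-approach curves into one comprehensive result."""
    mle = np.mean(mle_curves, axis=0)      # mean of the MLE curves
    lower = np.min(lower_bounds, axis=0)   # most pessimistic lower bound
    upper = np.max(upper_bounds, axis=0)   # most optimistic upper bound
    return mle, lower, upper
```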