Data-driven prediction of COVID-19 cases in Germany for decision making

Background The COVID-19 pandemic has led to a high interest in mathematical models describing and predicting the diverse aspects and implications of the virus outbreak. Model results represent an important part of the information base for the decision process on different administrative levels. The Robert-Koch-Institute (RKI) initiated a project whose main goal is to predict COVID-19-specific occupation of beds in intensive care units: Steuerungs-Prognose von Intensivmedizinischen COVID-19 Kapazitäten (SPoCK). The incidence of COVID-19 cases is a crucial predictor for this occupation. Methods We developed a model based on ordinary differential equations for the COVID-19 spread with a time-dependent infection rate described by a spline. Furthermore, the model explicitly accounts for weekday-specific reporting and adjusts for reporting delay. The model is calibrated in a purely data-driven manner by a maximum likelihood approach. Uncertainties are evaluated using the profile likelihood method. The uncertainty about the appropriate modeling assumptions can be accounted for by including and merging results of different modelling approaches. The analysis uses data from Germany describing the COVID-19 spread from early 2020 until March 31st, 2021. Results The model is calibrated based on incident cases on a daily basis and provides daily predictions of incident COVID-19 cases for the upcoming three weeks including uncertainty estimates for Germany and its subregions. Derived quantities such as cumulative counts and 7-day incidences with corresponding uncertainties can be computed. The estimation of the time-dependent infection rate leads to an estimated reproduction factor that is oscillating around one. Data-driven estimation of the dark figure purely from incident cases is not feasible. Conclusions We successfully implemented a procedure to forecast near future COVID-19 incidences for diverse subregions in Germany which are made available to various decision makers via an interactive web application. Results of the incidence modeling are also used as a predictor for forecasting the need of intensive care units. Supplementary Information The online version contains supplementary material available at (10.1186/s12874-022-01579-9).


Modelling choices
For an ODE-model description of current Sars-Cov-2 pandemic at least the following points should be taken into account. This list is most probably not complete. Age structure not taken into account fit age groups of RKI data set Number of parameters grows at least quadratically, but amount of data linearly when adding age groups. Also, computation time becomes an issue Vaccination not taken into account include info on vaccinations For constant vaccnation rate, beta(t) incorporates vaccination e↵ect.
No information on county level available.
Reporting delay ignoring latest two days OR adjusting by historical data ignoring it altogether latest data points a↵ect prediction, hence important.
Infection rate (t) cubic spline with 15 d.o.f piecewise constant function between di↵erent NPIs many unknown e↵ects altering infection rate Weekly oscillation is accounted for assuming no weekly data artifact latest data points a↵ect prediction, hence important.
Mutations not taken into account use additional sequencing data for fitting model no comprehensive data on mutations available for Germany 2 Weekday e↵ect in data Figure 1: Weekday has e↵ect on reporting probability In the data, there is a clear dependence on the weekday. Plotted is the distribution of the fraction (or fold change) of the daily value and the 7-day-rolling average for that day per weekday. If there was no dependence on the weekday in the data, then these fractions should fluctuate around the value of one (dashed black line). Note, that all quantity in this figure were computed based on the published incidence datano ODE modeling has been performed for this. The time course from March 2020 through 1st December 2021 was analyzed for this particular figure. Table 2: Summary statistics of correction factor For every federal state and Germany we calculated correction factors. The correction factor accounts for delayed reporting and is calculated from historic data sets. Here, we display summary statistics depending on number of days that have passed for the day that shall be corrected. Ideally, this value is 1 which corresponds to no correction at all. The table can be read as follows: In the latest data set, the number of incident cases that has been reported for the day before yesterday (delay of two days) needs -on average -to be multiplied by a factor of 1.26 (or increased by 26%). For individual federal states this number might di↵er. A measure of how much it varies across regions is the standard deviation given in column 3.

Model equations: SIR model with a spline input
We present here the full list of equations of the model used for fitting time course data of federal states and Germany.

Dynamic variables
The model contains 4 dynamic variables. The dynamics of those variables evolve according to a system of ordinary di↵erential equations (ODE) as will be defined in the following. The following list indicates the unique variable names and their initial conditions.

Reactions
The model contains 3 reactions. Reactions define interactions between dynamics variables and build up the ODE systems. The following list indicates the reaction laws and their corresponding reaction rate equations. In the reaction rate equations dynamic and input variables are indicated by square brackets. The remaining variables are model parameters that remain constant over time.

ODE system
The specified reation laws and rate equations v determine an ODE system. The time evolution of the dynamical variables is calculated by solving this equation system.
Substituting the reaction rates v i yields: The ODE system was solved by a parallelized implementation of the CVODES algorithm. It also supplies the parameter sensitivities utilized for parameter estimation.

Derived variables
The model contains 6 derived variables. Derived variables are calculated after the ODE system was solved. Dynamic and input variables are indicated by square brackets. The remaining variables are model parameters that remain constant over time. • Derived variable 4: The value of init Susceptibles is the number of inhabitants of Germany and will obviously di↵er dependent on the fitted federal state.

Observables
The following observable and its error function are added to fit the data. The weekday modulation factor D is described in the main paper. Note, that for scaling down the dynamics from federal state level to county level, the error function is adjusted by equation (13) of the main manuscript.
5 Comprehensive results for one region: Germany 5.1 Model variables and trajectories Figure 2: Time courses of states and derived quantities All states and derived variables from the above model definition are plotted here. Over time, the Susceptibles get move though Exposed and Infectious states to reach the Removed state. The InputSpline is transformed to a factor function (b time dependence) and is fixed at time of the last data point by a Heaviside function (1-fixSpline), as can be seen by the first two panels. The di↵erence between R zero t and R t is whether or not they factor in the depletion of Susceptibles or not. The Total confirmed flux variable it the one eventually entering the observation function and is therefore used for calibrating the system. Time point zero is the first day where a total of 100 confirmed cases has been reached. The displayed shadow visualizes the measurement noise that is estimated based on Poisson error model.

Parameter estimation and confidence intervals
In total 19 parameters are estimated from the experimental data. The best fit yields a value of the objective function 2 log(L) = 6326.98 for a total of 397 data points. The model parameters were estimated by maximum likelihood estimation. In Table 3 the estimated parameter values are given. In Table 4, 95% confidence intervals for the estimated parameter values derived by the profile likelihood method as shown in figure 4 are given.  Table 3: Estimated parameter valueŝ ✓ indicates the estimated value of the parameters. ✓ min and ✓ max indicate the upper and lower bounds for the parameters. The log-column indicates if the value of a parameter was logtransformed. If log ⌘ 1 the non-log-column indicates the non-logarithmic value of the estimate.
The estimated-column indicates if the parameter value was estimated (1), or was fixed (0).

Figure 4: Parameter profiles
Uncertainty in predictions is determined by uncertainty in estimated parameters which can be calculated by profile likelihood analysis. Parameters AA N and phi N (N = 1,2,3) determine the modeulation factor per weekday (see figure 21 of the supplement). Spline Parameters spValueN (N = 8,...,14) resemble those spline parameters that will a↵ect uncertainty in predictions. The first spline parameters will only a↵ect the fit of the early time course, and not the predictions. Confidence intervals are determined by the profile intersecting the threshold, see  6 Results for federal states

Predictions for all federal states
Analogous to figure 2 of the main paper, we present here the independent results for all 16 federal states. Note the di↵erent scales of estimated e↵ective R(t) which might point to a degree of freedom that is not incorporated in the data. Further investigation about this model feature might be required. The caption of the figure is repeated here: The incidence data of the entire time course is fitted (panel a) to estimate all dynamic parameters including the time-dependent infection rate that corresponds to R(t)(panel d). Predictions of incidences (panels b and c) and derived quantities (panels e and f) for a zoomed in time span are shown. 95%-confidence intervals (color-shaded areas) are inferred by profile likelihood calculation.

Results for all regions and counties: Landkreise
For all counties, the federal state's trajectory is scaled down and uncertainties for scaling down are computed. Due to computing power and time limitations as results should be computed with the latest data, this method was chosen and turned out to be feasible for daily computations of results. In the following figures we give an example of how results for the county level look like for the relevant time frame which is the most recent couple of months and the prediction for the upcoming weeks. For clarity, we display the trajectories without weekly modulation and after merging the two di↵erent approaches described in the main paper.