Skip to main content

Estimated Covid-19 burden in Spain: ARCH underreported non-stationary time series



The problem of dealing with misreported data is very common in a wide range of contexts for different reasons. The current situation caused by the Covid-19 worldwide pandemic is a clear example, where the data provided by official sources were not always reliable due to data collection issues and to the high proportion of asymptomatic cases. In this work, a flexible framework is proposed, with the objective of quantifying the severity of misreporting in a time series and reconstructing the most likely evolution of the process.


The performance of Bayesian Synthetic Likelihood to estimate the parameters of a model based on AutoRegressive Conditional Heteroskedastic time series capable of dealing with misreported information and to reconstruct the most likely evolution of the phenomenon is assessed through a comprehensive simulation study and illustrated by reconstructing the weekly Covid-19 incidence in each Spanish Autonomous Community.


Only around 51% of the Covid-19 cases in the period 2020/02/23–2022/02/27 were reported in Spain, showing relevant differences in the severity of underreporting across the regions.


The proposed methodology provides public health decision-makers with a valuable tool in order to improve the assessment of a disease evolution under different scenarios.

Peer Review reports


The Covid-19 pandemic that is hitting the world since late 2019 has made evident that having quality data is essential in the decision-making chain, especially in epidemiology but also in many other fields. There is an enormous global concern around this disease, leading the World Health Organization (WHO) to declare public health emergency [1]. Many methodological efforts have been made to deal with misreported Covid-19 data, following ideas introduced in the literature since the late nineties [2,3,4,5,6,7]. These proposals range from the usage of multiplication factors [8] to Markov-based models [9, 10] or spatio-temporal models [11]. Additionally, a new R [12] package able to fitting endemic-epidemic models based on approximative maximum likelihood to underreported count data has been recently published [13]. However, as a large proportion of the cases run asymptomatically [14] and mild symptoms could have been easily confused with those of similar diseases at the beginning of the pandemic, it is reasonable to expect that Covid-19 incidence has been notably underreported. Very recently several approaches based on discrete time series have been proposed [15,16,17] although there is a lack of continuous time series models capable of dealing with misreporting, a characteristic of the Covid-19 data and typically present in infectious diseases modeling. In this sense, a new approach for longitudinal data not accounting for temporal correlations is introduced in [18] and a model capable of dealing with temporal structures using a different approach is presented in [19]. A typical limitation of these kinds of models is the computational effort needed in order to properly estimate the parameters.

Synthetic likelihood is a recent and very powerful alternative for parameter estimation in a simulation based schema when the likelihood is intractable and, conversely, the generation of new observations given the values of the parameters is feasible. The method was introduced in [20] and placed into a Bayesian framework in [21], showing that it could be scaled to high dimensional problems and can be adapted in an easier way than other alternatives like approximate Bayesian computation (ABC). The method takes a vector summary statistic informative about the parameters and assumes it is multivariate normal, estimating the unknown mean and covariance matrix by simulation to obtain an approximate likelihood function of the multivariate normal.


Auto Regressive Conditional Heteroskedasticity (ARCH) models are a well-known approach to fitting time series data where the variance error is believed to be serially correlated. Consider an unobservable process \({X}_{t}\) following an AutoRegressive (\(AR\left(1\right)\)) model with ARCH(1) errors structure, defined by

$${X}_{t}={\phi }_{0}+{\phi }_{1}\cdot {X}_{t-1}+{Z}_{t},$$

where \({Z}_{t}^{2}={\alpha }_{0}+{\alpha }_{1}\cdot {Z}_{t-1}^{2}+{\epsilon }_{t},\) being \({\epsilon }_{t}\sim N\left({\mu }_{\epsilon }\left(t\right),{\sigma }_{\epsilon }^{2}\right)\). The process \({X}_{t}\) represents the actual Covid-19 incidence. In our setting, this process \({X}_{t}\) cannot be directly observed, and all we can see is a part of it, expressed as

$${Y}_{t}=\left\{\begin{array}{c}{X}_{t}\text{ with probability }1-\omega \\ q\cdot {X}_{t}\text{ with probability }\omega ,\end{array}\right.$$

where \(q\) is the overall intensity of misreporting (if \(0<q<1\) the observed process \({Y}_{t}\) would be underreported while if \(q>1\) the observed process \({Y}_{t}\) would be overreported) and \(\omega\) can be interpreted as the overall frequency of misreporting (proportion of misreported observations). To model consistently the spread of the disease, the expectation of the innovations \({\epsilon }_{t}\) is linked to a simplified version of the well-known compartimental Susceptible-Infected-Recovered (SIR) model. At any time \(t\in R\) there are three kinds of individuals: Healthy individuals susceptible to be infected (\(S\left(t\right)\)), infected individuals who are transmitting the disease at a certain speed (\(I\left(t\right)\)) and individuals who have suffered the disease, recovered and cannot be infected again (\(R\left(t\right)\)). As shown in [17], the number of affected individuals at time \(t\), \(A\left(t\right)=I\left(t\right)+R\left(t\right)\) can be approximated by

$$A\left(t\right)=\frac{{M}^{*}\left({\beta }_{0},{\beta }_{1},{\beta }_{2},t\right){A}_{0}{e}^{kt}}{{M}^{*}\left({\beta }_{0},{\beta }_{1},{\beta }_{2},t\right)+{A}_{0}\left({e}^{kt}-1\right)},$$

where \({M}^{*}\left({\beta }_{0},{\beta }_{1},{\beta }_{2},t\right)={\beta }_{0}+{\beta }_{1}\cdot {C}_{1}\left(t\right)+{\beta }_{2}\cdot {C}_{2}\left(t\right)\), being \({C}_{1}\left(t\right)\) and \({C}_{2}\left(t\right)\) dummy variables indicating if time \(t\) corresponds to a period where a mandatory confinment was implemented by the government and if the number of people with at least one dose of a Covid-19 vaccine in Spain was over 50% respectively. At any time \(t\) the condition \(S\left(t\right)+I\left(t\right)+R\left(t\right)=N\) is fulfilled. The expression (3) allow us to incorporate the behavior of the epidemics in a realistic way, defining \({\mu }_{\epsilon }\left(t\right)=A\left(t\right)-A\left(t-1\right)\), the new affected cases produced at time \(t\).

The Bayesian Synthetic Likelihood (BSL) simulations are based on the described model and the chosen summary statistics are the mean, standard deviation and the three first coefficients of autocorrelation of the observed process. Parameter estimation was carried out by means of the BSL [22, 23] package for R [12]. Taking into account the posterior distribution of the estimated parameters, the most likely unobserved process is reconstructed, resulting in a probability distribution at each time point. The prior of each parameter is set to be uniform on the corresponding feasible region of the parameter space and zero elsewhere.


This section presents the results of the analyses using the proposed methodology over a real data set and they are compared to the most common alternatives. The performance of the method is also studied by means of a comprehensive simulation study, with and without covariates.

The performance and an application of the proposed methodology are studied through a comprehensive simulation study and a real dataset on Covid-19 incidence in Spain on this Section.

Simulation study

Although the estimation method is already known and has been tested before, to the best of our knowledge it has never been used in the context of ARCH time series, and therefore a thorough simulation study has been conducted to ensure that the model behaves as expected, including \(ARCH\left(1\right)\), \(AR\left(1\right)\), \(MA\left(1\right)\) and \(ARMA\left(\mathrm{1,1}\right)\) structures for the hidden process \({X}_{t}\) defined as

$$\begin{array}{c}{X}_{t}={\phi }_{0}+{\phi }_{1}\cdot {X}_{t-1}+{Z}_{t},{Z}_{t}^{2}={\alpha }_{0}+{\alpha }_{1}\cdot {Z}_{t-1}^{2}+{\epsilon }_{t}\text{ (ARCH(1))}\\ {X}_{t}={\phi }_{0}+\alpha \cdot {X}_{t-1}+{\epsilon }_{t}\text{ (AR(1))}\\ {X}_{t}={\phi }_{0}+\theta \cdot {\epsilon }_{t-1}+{\epsilon }_{t}\text{ (MA(1))}\\ {X}_{t}={\phi }_{0}+\alpha \cdot {X}_{t-1}+\theta \cdot {\epsilon }_{t-1}+{\epsilon }_{t}\text{ (ARMA(1, 1))}\end{array}$$

where \({\epsilon }_{t}\sim N\left({\mu }_{\epsilon }\left(t\right),{\sigma }_{\epsilon }^{2}\right)\).

The values for the parameters \({\phi }_{1}\), \({\alpha }_{0}\), \({\alpha }_{1}\), \(\alpha\), \(\theta\), \(q\) and \(\omega\) ranged from 0.1 to 0.9 for each parameter. Average absolute bias, average interval length (AIL) and average 95% credible interval coverage are shown in Table 1. To summarize model robustness, these values are averaged over all combinations of parameters, considering their prior distribution is a Dirac’s delta with all probability concentrated in the corresponding parameter value.

Table 1 Model performance measures (average absolute bias, average interval length and average coverage) summary based on a simulation study

For each autocorrelation structure and parameters combination, a random sample of size \(n=1000\) has been generated using the R function arima.sim, and the parameters \(m=log\left({M}^{*}\right)\) and \(\beta\) have been fixed to \(0.2\) and \(0.4\) respectively. Several values for these parameters were considered but no substantial differences in the model performance were observed related to the value of these parameters or sample size, besides a poorer coverage for lower sample sizes, as could be expected.

Real incidence of Covid-19 in Spain

The betacoronavirus SARS-CoV-2 has been identified as the causative agent of an unprecedented world-wide outbreak of pneumonia starting in December 2019 in the city of Wuhan (China) [1], named as Covid-19. Considering that many cases run without developing symptoms or just with very mild symptoms, it is reasonable to assume that the incidence of this disease has been underregistered. This work focuses on the weekly Covid-19 incidence registered in Spain in the period (2020/02/23–2022/02/27). It can be seen in Fig. 1 that the registered data (turquoise) reflect only a fraction of the actual incidence (red). The grey area corresponds to 95% probability of the posterior distribution of the weekly number of new cases (the lower and upper limits of this area represent the percentile 2.5% and 97.5% respectively), and the dotted red line corresponds to its median.

Fig. 1
figure 1

Registered and predicted weekly new Covid-19 cases in each Spanish region

In the considered period, the official sources reported 11,056,797 Covid-19 cases in Spain, while the model predicts a total of 21,639,627 cases (only 51.10% of actual cases were reported). This work also revealed that while the frequency of underreporting is extremely high for all regions (values of \(\widehat{\omega }\) over 0.80 in all cases), the intensity of this underreporting is not uniform across the considered regions: Aragón and Ceuta are the CCAAs with highest underreporting intensity (\(\widehat{q}=0.28\)) while Región de Murcia and País Valencià are the regions where the predicted values are closest to the number of reported cases (\(\widehat{q}=0.46\)). Detailed underreporting parameters estimates for each region can be found in Table 2. Although the main impact of the vaccination programmes can be seen in mortality data, the results of this work also showed a significant decrease in the weekly number of cases as well in all CCAA except Euskadi, as can also be seen in Table 2 through the estimates corresponding to parameter \({\beta }_{2}\). Figure 2 represents the predicted and registered Covid-19 weekly incidence globally for Spain.

Table 2 Estimated underreporting parameters and impact of the considered covariates (\({\beta }_{1}\) and \({\beta }_{2}\) are the coefficients for confinement and vaccination respectively) for each Spanish CCAA. Reported values correspond to the median and percentiles 25% and 75% of the corresponding posterior distribution
Fig. 2
figure 2

Registered and predicted weekly new Covid-19 cases globally in Spain

Figure 2 shows the evolution of the registered (turquoise) and predicted (red) weekly number of Covid-19 cases in Spain in the period 2020/02/23–2022/02/27.

As can be seen in Figs. 1 and 2, there are 4 weeks (2021–12-26, 2022–01-02, 2022–01-09 and 2022–01-16) for which the predicted values coincide with those registered in all simulations, so no underreporting is detected these weeks. This behavior might be due to the breakout of a new variant with different characteristics (for instance producing more symptomatic cases and therefore reducing the underreporting) around these dates.

The registered values predicted by the model can also be obtained as \(\widehat{Y_t}=\left(1-\widehat\omega+\widehat\omega\cdot\widehat q\right)\cdot\widehat{X_t}\), and compared to the actual registered values \({Y}_{t}\). That allows computing standard forecasting error measures as Root Mean Squared Error (RMSE) or Mean Absolute Percentage Error (MAPE). Globally, the RMSE was 113,145.4 and MAPE was around 8%, ranging between 4 to 13% across regions. The specific RMSE and MAPE for each region are described in Table S1 in the Supplementary Material.

The global differences in underreporting magnitudes across regions can be represented by the percentage of reported cases in each CCAA (compared to model estimates), as shown in Fig. 3.

Fig. 3
figure 3

Percentage of reported cases in each CCAA


Although it is very common in biomedical and epidemiological research to get data from disease registries, there is a concern about their reliability, and there have been some recent efforts to standardize the protocols in order to improve the accuracy of health information registries (see for instance [24, 25]). However, as the Covid-19 pandemic situation has made evident, it is not always possible to implement these recommendations in a proper way.

Another work analyzing the cumulated burden of Covid-19 in Spain [26] estimated that only around 21% of the cases were reported in the period 2020/01/01–2020/06/01, but it should be taken into account that it seems reasonable to assume that the underreporting intensity was higher at the early stages of the pandemic, and therefore a lower overall underreporting is expected in the longer period considered in this work. Additionally, the presented methodology allows for a real time monitoring and not only cumulated over a time period.

Having accurate data is key in order to provide public health decision-makers with reliable information, which can also be used to improve the accuracy of dynamic models aimed to estimate the spread of the disease [27] and to predict its behavior.


The proposed methodology can deal with misreported (over- or under-reported) data in a very natural and straightforward way and is able to reconstruct the most likely hidden process, providing public health decision-makers with a valuable tool in order to predict the evolution of the disease under different scenarios.

Using a flexible approach for the underlying hidden process, such as ARCH time series, are a natural extension to recent developments (see for instance [19]) proposed for fitting underreported time series but restricted to the case when the underlying process has an ARMA structure and allow us to model phenomena presenting more complex behavior like Covid-19 in the long time period considered in the present work.

The analysis of the Spanish Covid-19 data shows that in average only around 51% of the cases in the period 2020/02/23–2022/02/27 were reported, and that there are important differences in the severity of underreporting across the Spanish regions. The impact of the vaccination program can also be assessed, achieving a significant decrease in the Covid-19 incidence in almost all regions after 50% of the population had one dose at least (although these results would probably be notably different if including SARS-CoV-2 immunity-escape variants like BA.4 or BA.5, which are currently predominant in many countries), while the impact of the mandatory lockdown could only be detected by the model in 7 out of 19 regions.

The simulation study shows that the proposed methodology behaves as expected and that the parameters used in the simulations, under different autocorrelation structures, can be recovered, even with severely underreported data.

Availability of data and materials

The datasets generated and/or analyzed during the current study are available in the GitHub repository,



Approximate Bayesian computation


Average interval length


AutoRegressive model


AutoRegressive Conditional Heteroskedasticity model


AutoRegressive Moving Average model


Bayesian Synthetic Likelihood


Spanish autonomous community


Credible interval




Moving average model


Mean Absolute Percentage Error


Root Mean Squared Error


Susceptible-Infected-Recovered model


  1. Sohrabi Catrin, Alsafi Zaid, O’Neill Niamh, Khan Mehdi, Kerwan Ahmed, Al-Jabir Ahmed, Iosifidis Christos, Agha Riaz. World Health Organization declares global emergency: a review of the 2019 Novel Coronavirus (COVID-19). Int J Surg. 2020.

    Article  PubMed  PubMed Central  Google Scholar 

  2. Bernard H, Werber D, Höhle M. Estimating the under-reporting of norovirus illness in Germany utilizing enhanced awareness of diarrhoea during a large outbreak of Shiga toxin-producing E. coli O104: H4 in 2011 - a time series analysis. BMC Infect Dis. 2014;14(1):116.

    Article  PubMed  PubMed Central  Google Scholar 

  3. Arendt S, Rajagopal L, Strohbehn C, Stokes N, Meyer J, Mandernach S. Reporting of foodborne illness by U.S. consumers and healthcare professionals. Int J Environ Res Public Health. 2013;10(8):3684–714.

    Article  PubMed  PubMed Central  Google Scholar 

  4. Rosenman KD, Kalush A, Reilly MJ, Gardiner JC, Reeves M, Luo Z. How much work-related injury and illness is missed by the current national surveillance system? J Occup Environ Med. 2006;48(4):357–65.

    Article  PubMed  Google Scholar 

  5. Alfonso JH, Løvseth EK, Samant Y, Holm JØ. Work-related skin diseases in Norway may be underreported: data from 2000 to 2013. Contact Dermatitis. 2015;72(6):409–12.

    Article  PubMed  Google Scholar 

  6. Winkelmann R. Markov Chain Monte Carlo analysis of underreported count data with an application to worker absenteeism. Empir Econom. 1996;21(4):575–87.

    Article  Google Scholar 

  7. Gibbons CL, Mangen M-J, Plass D, Havelaar AH, Brooke RJ, Kramarz P, Peterson KL, et al. Measuring underreporting and under-ascertainment in infectious disease datasets: a comparison of methods. BMC Public Health. 2014;14(1):147.

    Article  PubMed  PubMed Central  Google Scholar 

  8. Stocks T, Britton T, Höhle M. Model selection and parameter estimation for dynamic epidemic models via iterated filtering: application to rotavirus in Germany. Biostatistics. 2018.

    Article  PubMed Central  Google Scholar 

  9. Azmon A, Faes C, Hens N. On the estimation of the reproduction number based on misreported epidemic data. Stat Med. 2014;33(7):1176–92.

    Article  PubMed  Google Scholar 

  10. Magal P, Webb G. The parameter identification problem for SIR epidemic models: identifying unreported cases. J Math Biol. 2018;77(6–7):1629–48.

    Article  PubMed  Google Scholar 

  11. Stoner O, Economou T, Drummond Marques da Silva G. A hierarchical framework for correcting under-reporting in count data. J Am Stat Assoc. 2019.

    Article  Google Scholar 

  12. R Core Team. R: A language and environment for statistical computing. Vienna: R Foundation for Statistical Computing; 2019. (

    Google Scholar 

  13. Bracher J. hhh4underreporting: fitting endemic-epidemic models to underreported data. 2021.

    Google Scholar 

  14. Oran DP, Topol EJ. Prevalence of asymptomatic SARS-CoV-2 infection. Ann Intern Med. 2020.

    Article  PubMed  PubMed Central  Google Scholar 

  15. Fernández-Fontelo A, Cabaña A, Puig P, Moriña D. Under-reported data analysis with INAR-hidden Markov chains. Stat Med. 2016;35(26):4875–90.

    Article  PubMed  Google Scholar 

  16. Fernández-Fontelo A, Cabaña A, Joe H, Puig P, Moriña D. Untangling serially dependent underreported count data for gender-based violence. Stat Med. 2019;38(22):4404–22.

    Article  PubMed  Google Scholar 

  17. Fernández-Fontelo Amanda, Moriña David, Cabaña Alejandra, Arratia Argimiro, Puig Pere. Estimating the real burden of disease under a pandemic situation: the SARS-CoV2 case. PLoS One. 2020;15(12 December):e0242956.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  18. Moriña D, Fernández-Fontelo A, Cabaña A, Puig P, Monfil L, Brotons M, Diaz M. Quantifying the under-reporting of uncorrelated longitudal data: the genital warts example. BMC Med Res Methodol. 2021;21(1):6.

    Article  PubMed  PubMed Central  Google Scholar 

  19. Moriña D, Fernández-Fontelo A, Cabaña A, Puig P. New statistical model for misreported data with application to current public health challenges. Sci Rep. 2021;11(1):23321.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  20. Wood SN. Statistical inference for noisy nonlinear ecological dynamic systems. Nature. 2010;466(7310):1102–4.

    Article  CAS  PubMed  Google Scholar 

  21. Price LF, Drovandi CC, Lee A, Nott DJ. Bayesian synthetic likelihood. J Comput Graph Stat. 2018;27(1):1–11.

    Article  Google Scholar 

  22. An, Ziwen, Leah F. South, and Christopher C. Drovandi. 2019b. BSL: Bayesian synthetic likelihood.

  23. An, Ziwen, Leah F South, and Christopher Drovandi. 2019a. BSL: an r package for efficient parameter estimation for simulation-based models via bayesian synthetic likelihood. arXiv.

  24. Kodra Y, Weinbach J, Posada-De-La-Paz M, AlessioCoi S, Lemonnier L, van Enckevort D, Roos M, et al. Recommendations for improving the quality of rare disease registries. MDPI AG. 2018.

    Article  Google Scholar 

  25. Harkener S, Stausberg J, Hagel C, Siddiqui R. Towards a core set of indicators for data quality of registries. Stud Health Technol Inform. 2019;267:39–45.

    Article  PubMed  Google Scholar 

  26. Moriña D, Fernández-Fontelo A, Cabaña A, Arratia A, Ávalos G, Puig P. Cumulated burden of COVID-19 in Spain from a Bayesian perspective. Eur J Pub Health. 2021;31(4):917–20.

    Article  Google Scholar 

  27. Zhao M, Lin R, Yang W, Lou, et al. Estimating the unreported number of Novel Coronavirus (2019-nCoV) cases in China in the first half of January 2020: a data-driven modelling analysis of the early outbreak. J Clin Med. 2020;9(2):388.

    Article  PubMed  PubMed Central  Google Scholar 

Download references


Not applicable.


Research funded by Fundación MAPFRE. This work was partially supported by grant RTI2018-096072-B-I00 from the Spanish Ministry of Science and Innovation and by the Spanish State Research Agency, through the Severo Ochoa and María de Maeztu Program for Centers and Units of Excellence in R&D (CEX2020–001084-M). A.F-F acknowledges Agencia Estatal de Investigación for the financial support IJC2020-045188I/AEI/10.13039/501100011033. AC was partially financed by PID2021-123733NB-I00 (Ministerio de Ciencia e Innovación, Spain). AC and AA were partially supported by Project “EcoDep” CY-AAP2020-0000000013 (“Investissements d’Avenir” ANR-16-IDEX-0008, France).

Author information

Authors and Affiliations



DM, AF-F, AC, AA and PP participated in the development of the statistical model. DM and PP derived the described properties, and DM implemented the model in R software and conducted the analyses. All authors have read and approved the manuscript.

Corresponding author

Correspondence to David Moriña.

Ethics declarations

Ethics approval and consent to participate

Not applicable (no human or animal experimentation and data and source codes fully available).

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1: Table S1.

Root Mean Squared Error (RMSE) and Mean Absolute Percentage Error (MAPE) for the predicted number of cases in each Spanish CCAA.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Moriña, D., Fernández-Fontelo, A., Cabaña, A. et al. Estimated Covid-19 burden in Spain: ARCH underreported non-stationary time series. BMC Med Res Methodol 23, 75 (2023).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: