The overall distribution of publications at the start of the SARS-CoV-2 and Zika virus epidemics was similar. Epidemiological research was more commonly published than laboratory research, and non-original contributions accounted for a substantial fraction of all publications for both infections. For both infections, case reports and case series, mathematical modelling and phylogenetic studies were prominent at the start of the epidemic, whereas analytical study designs, such as cohort and case-control studies, appeared later. Trials emerged later and accounted for a small proportion of all studies. The volume of publications and the speed at which the evidence evolved were much higher for SARS-CoV-2 than for Zika virus. Modelling studies were more prominent and basic research studies were less common for SARS-CoV-2 than for Zika virus. More studies were published as preprints for SARS-CoV-2, but this proportion declined over time.
Strengths and limitations
Strengths of this study include the comparable and reproducible search strategies for two emerging infectious diseases and the categorisation of study design by a volunteer crowd of epidemiologist reviewers. A limitation is that the design of an epidemiological study is not always clear, and different scientists might classify the same study differently. We tried to address this limitation by screening and training the volunteer scientists, verifying decisions and having a third person resolve disagreements [12]. There are other limitations. First, we only recorded the study design of publications and did not assess their content or methodological quality. To trace the evolution of evidence for specific research questions, in-depth studies are needed. Second, for SARS-CoV-2, the volume of publications meant that we only annotated a sample of records. The total in the first 5 months of the pandemic was, however, higher than for 1 year of publications about Zika virus, and the proportions of different study designs for Zika virus stabilised quickly. Third, the searches do not include all sources of peer-reviewed evidence or preprint sources. Incompleteness of the evidence base should not affect our conclusions as long as other sources account for a stable proportion of publications.
We followed two dimensions of the publication of evidence about two newly emerging infectious diseases: the overall distribution of publication types and changes over time. Similarities in the overall distribution of epidemiological, basic science and non-original publications for SARS-CoV-2 and Zika virus could reflect patterns in the overall trajectory of research about emerging infectious diseases. In the initial phase of an outbreak with a novel pathogen, case reports and case series predominate. These types of study describe and refine the clinical characteristics of the disease [21]. Observations from these studies are commonly used to define research questions and formulate hypotheses about various aspects of transmission and disease. More formal, hypothesis-driven and interventional research follows later [6].
The differences between study designs in the two epidemics are compatible with differences in characteristics of the diseases. The higher proportion of basic research in Zika virus research may have several explanations. First, the occurrence of congenital abnormalities following a vector-borne infection was poorly understood; in vivo and in vitro studies were essential to investigate in utero transmission and mechanisms for neurotoxicity and neuropathology [22]. Second, the establishment of mouse models was more successful in Zika virus research than in SARS-CoV-2 research [23], although efforts are ongoing [24]. Third, the later occurrence of case-control studies and cohort studies in Zika virus research might be caused by the delay to congenital outcomes, compared with the shorter delay to outcomes caused by SARS-CoV-2. Fourth, the prominent role of mathematical modelling studies at the beginning of the SARS-CoV-2 pandemic probably reflects early recognition of the pandemic potential and the need for forecasts of the global spread. Mechanistic models describing the transmission of SARS-CoV-2 are also less complex than those for arboviruses like Zika virus, allowing many researchers to explore transmission dynamics [10]. The higher volume of observational research about SARS-CoV-2 could reflect both the 50-fold higher number of cases than for Zika virus and the severity of the pandemic, whereas Zika virus was largely limited to the Americas and cases of infection were already declining as the research volume started to increase. The increasing role of preprints during the SARS-CoV-2 pandemic coincided with developments in open access publishing and the need for speedy access to outbreak research [25]. The increase in preprint publishing results in faster access to evidence, which will benefit the public health response. The rapid pace of publication in both preprint and peer-reviewed publications means that readers need to carefully appraise the methodological quality of the research.
Other researchers have also studied the evolution of evidence during disease outbreaks. For the SARS outbreak in 2003, Xing et al. (2010) described epidemiological studies from Toronto and Hong Kong [26], whereas we included epidemiological and non-epidemiological articles from all over the world. Xing and colleagues primarily studied the delay in publication during the outbreak and concluded that only a minority (7%) of the publications were published during the outbreak, whereas we investigated the proportion of preprints relative to peer-reviewed publications [26]. For the SARS-CoV-2 pandemic, Liu et al. (2020) performed a bibliometric analysis of the SARS-CoV-2 literature up to March 24, 2020 [27], classifying research by theme rather than by study design. They observed that clinical features of COVID-19 were studied heavily, whereas other research areas, such as mental health, the use of novel technologies and artificial intelligence, and pathophysiology, remained underexplored. In contrast to the manual annotation in our project, Tran et al. (2020) performed automatic latent Dirichlet allocation topic modelling of publications on SARS-CoV-2 published up to April 23, 2020 [28], with findings similar to those of Liu et al. [27]. While we validated the classification of study design manually, Tran et al. did not describe a validation of their automated modelling method. Jones et al. (2020), who classified study designs using categories comparable to ours, showed a similar pattern of study design occurrence during the early SARS-CoV-2 pandemic, in which case reports and narrative reviews were the most frequently published [8]. However, they presented only absolute numbers and did not compare with other outbreaks [8]. Similarly, Fidahic et al. (2020) concluded that, early in the SARS-CoV-2 pandemic, articles were predominantly retrospective case reports and modelling studies [9].
Haghani and Bliemer (2020) compared the SARS, MERS and SARS-CoV-2 literature and showed that around 50% of studies were non-original, which is in line with our results [29]. Unlike our categorisation method, Haghani and Bliemer used the categorisation provided by the citation database Scopus and concluded that studies linked to the public health response are the first to emerge.
Our work has several implications for public health policy and research. The change over time in the types of studies has particular implications for the synthesis of evidence and for public health as more research is published. Policy makers and public health practitioners need to keep up with rapid changes in the state of the evidence, because these changes affect recommendations for control measures. For example, a preprint published in December 2020 about the spread of new, more infectious variants of SARS-CoV-2 in the United Kingdom and South Africa [30] provided the scientific evidence for strengthening control measures in several European countries and changing vaccination policy in South Africa. The earliest studies published might not be the most appropriate to answer specific questions, for example, about causality [31], or to quantify disease characteristics. Triangulation of different sources of evidence using frameworks, such as those based on the Bradford Hill criteria [15], and careful interpretation through explicit acknowledgment of limitations can help. Living systematic reviews are particularly useful because changes to inclusion criteria can be planned and protocols can be amended in advance of an update. For example, quantifying the proportion of asymptomatic SARS-CoV-2 infections in March 2020 relied largely on descriptions from contact investigations in single families [32]. By June 2020, there were also studies at lower risk of selection and measurement biases, such as population screening.
The vast quantity of evidence about emerging infections poses challenges for the efficient handling and evaluation of information. The speed with which the evidence about SARS-CoV-2 has accumulated is unprecedented. We recruited a large team of experienced scientists, but we were still not able to categorise all publications by the time this manuscript was written. Using machine learning methods, such as natural language processing, to classify text is a promising approach for the triage of publication types [18]. We also see potential in a scaled-up version of collaborative crowd-sourcing among experts in the field to increase efficiency and avoid research waste [33]. The technical tools to manage such efforts are available, but guidelines on how best to conduct the live synthesis of evidence should be developed and evaluated further [34].