Outbreaks of publications about emerging infectious diseases: the case of SARS-CoV-2 and Zika virus

Background Outbreaks of infectious diseases generate outbreaks of scientific evidence. In 2016 epidemics of Zika virus emerged, and in 2020, a novel severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) caused a pandemic of coronavirus disease 2019 (COVID-19). We compared patterns of scientific publications for the two infections to analyse the evolution of the evidence. Methods We annotated publications on Zika virus and SARS-CoV-2 that we collected using living evidence databases according to study design. We used descriptive statistics to categorise and compare study designs over time. Results We found 2286 publications about Zika virus in 2016 and 21,990 about SARS-CoV-2 up to 24 May 2020, of which we analysed a random sample of 5294 (24%). For both infections, there were more epidemiological than laboratory science studies. Amongst epidemiological studies for both infections, case reports, case series and cross-sectional studies emerged first, cohort and case-control studies were published later. Trials were the last to emerge. The number of preprints was much higher for SARS-CoV-2 than for Zika virus. Conclusions Similarities in the overall pattern of publications might be generalizable, whereas differences are compatible with differences in the characteristics of a disease. Understanding how evidence accumulates during disease outbreaks helps us understand which types of public health questions we can answer and when. Supplementary Information The online version contains supplementary material available at 10.1186/s12874-021-01244-7.

respiratory route [4]. There are marked differences in the natural history of the two diseases, where microcephaly caused by Zika virus infection only emerges months after infection, COVID-19 occurs acutely. Intensive research efforts for both infections were catalysed by the needs of national, regional and global health agencies to answer key questions on transmission, prevention, and interventions at the individual and community level [5]. During the SARS-CoV-2 pandemic, the accumulation of peer-reviewed and preprint publications has been vast; from April, 2020 onwards, an average of 2000 scientific publications appeared per week. A similar, albeit smaller surge in publications occurred in 2016 during the Zika virus epidemic. The sudden large increases in publications about these conditions over a short time can also be described as outbreaks.
The emergence of a new disease provides an opportunity to examine how research evidence emerges and develops, according to the research question and the feasibility of the study methods. Hierarchies of evidence are often used to rank the value of epidemiological study designs, prioritising experimental methods, but these do not take account of purposes, other than the effects of interventions. Anecdotal observations allow for the discovery and description of phenomena, studies with comparison groups are more appropriate to test hypotheses, and randomised trials test the causal effects of interventions [6]. Early on in the Zika epidemic, questions about causality were important because the link between clusters of babies born with microcephaly and Zika virus infection was not obvious; congenital abnormalities caused by a mosquito-borne virus had never been reported. In an earlier analysis of 346 publications about Zika virus alone, we described the temporal sequence of publication of types of study to investigate causality [7]. Others have assessed the accumulation of study designs over time during the SARS-CoV-2 outbreak and concluded that early in the outbreak, simple observational studies, mathematical modelling studies and narrative reviews were most abundant [8,9]. Here, we proposed a hypothetical sequence: first, anecdotal observations are reported in case reports or case series. Analytical observational studies follow. In parallel, basic research studies investigate the biology and pathogenesis of the disease. Mathematical modelling can provide evidence where direct observations are not available [10]. After a delay, controlled trials examining interventions are published.
The emergence of SARS-CoV-2 allows a comparison with Zika virus between the timing and types of evidence published at the start of an outbreak of a new disease. The objectives of this study were to analyse the patterns of evolution of the evidence over time during the 2016 Zika virus epidemic and the 2020 SARS-CoV-2 pandemic. We compare the sequence of evidence accumulation with the previously hypothesised pattern [7].

Data collection Searches and sources
We used databases that were created for the Zika Open Access Project (ZOAP) [11] and COVID-19 Open Access Project (COAP) [12]. Both databases are maintained by the authors and are used to conduct living systematic reviews [13,14]. For each pathogen, we ran daily automated searches to index and deduplicate records of articles about Zika virus (from January 1, 2016) and SARS-CoV-2 research (from January 1, 2020) in EMBASE via OVID, MEDLINE via PubMed, and the preprint server bioRxiv (for SARS-CoV-2 we also searched medRxiv). These data have been collected and deduplicated daily for several living systematic reviews and detailed methodology is described elsewhere [11, [13][14][15][16]. We specify the search terms in the Additional file 1 Text 1.

Annotation of records with study design
We screened the title and abstract, or full text when the first was insufficient, and annotated each record with its study design. For weeks where the volume SARS-CoV-2 of research was over 400 publications, starting mid-March, we drew a random sample of 400 publications with the R 'sample' function, without replacement. The number of selected publications was a pragmatic decision that balanced an adequate sample size and manageable workload for the number of crowd-volunteers. The annotation of the Zika virus dataset was performed for previous systematic reviews (from January 1, 2016 to December 31, 2016) [13,15]. We first classified publications into the broad groups "epidemiology" or "basic research", "non-original" articles (editorials, viewpoints, and commentaries) and "other". These are groups that we used in an earlier study about Zika virus [7], so for this comparative study, we applied them to the publications about SARS-CoV-2. We subdivided epidemiological and basic research further, based on their study design. We provide details on the classification of the study designs in the Additional file 1 Table S1 and in an online annotation guide [17].

Crowd
To distribute the annotation workload, we recruited a 'crowd' of volunteer scientists [18]. We included researchers with a background in medicine or public health, who qualified by passing a pilot test using an online tool that simulates classification tasks. A demonstration and the source code of the tool are provided online [19].
The crowd members used another online tool for screening, annotation, and verification of each record. A first crowd member screened and annotated a record, and a second crowd member verified the annotated data. Disagreements were resolved by a third member of the team. One person (MJC) distributed tasks centrally and a 'crowd supervisor' (AMI) monitored progress. Crowd members took part in the interpretation of the results.

Reported number of cases
To compare the number of publications against the number of reported cases, we used open-source data on Zika virus and SARS-CoV-2 from https://github.com/ andersen-lab/zika-epidemiology/tree/master/paho_case_ numbers and https://ourworldindata.org/covid-cases., see Availability of data and materials.

Date that a publication becomes available
We defined the date at which a publication became available as the date it was indexed in the MEDLINE or EMBASE database, or when it appeared on the preprint server.

Data analysis
First, we described the evolution of reported cases and publications over time. Second, we described the proportions of study designs, by week, for SARS-CoV-2 and by month for Zika virus, due to the differences in research volume. We omitted the first two weeks of 2020 for SARS-CoV-2 because there were only four publications, making the proportions unstable. To take into account the random sampling of the SARS-CoV-2 research, we provided the Wilson score 95% confidence intervals (CI) for the proportions. Third, we quantified the timing and speed of the accumulation of publications of different study designs: We plotted the time elapsed between the first and twentieth occurrence of publications of each study design. Last, we described the proportion of evidence that was published on preprint servers during the two epidemics, and by study design. All analyses were conducted in R 4.0.1 [20].

Results
Between week one and week 21 (up to May 24) 2020, we indexed 21,990 publications, and a crowd of 25 contributors annotated a sample of 5294 (24%) publications on SARS-CoV-2. For the Zika virus research, we annotated all 2286 identified publications for 2016. Both the volume of the weekly reported cases and number of publications were 30-50-fold higher for SARS-CoV-2 than for Zika virus (Fig. 1).

The proportion of different study designs
In both epidemics, a substantial and reasonably stable proportion of the publications were non-original research. The overall proportion of non-original publications was higher for Zika virus (55%, (Additional file 1 Table 2)) than for SARS-CoV-2 (34% [95% CI: 33-35], (Additional file 1 Table 3)). For publications of original research, the proportion of basic research publications increased over time for Zika virus, but decreased for SARS-Cov-2 research (Fig. 2a).
Within the epidemiological study designs, mathematical modelling studies had a larger role at the beginning of the SARS-CoV-2 pandemic (10.1%, [95% CI: 9.3-11.0]) and compared to the Zika virus outbreak (3.2%). Many of these were published as preprint publications. When we excluded preprint publications, the evolution of evidence over time became more similar between the two epidemics (Additional file 1 Fig. 2). Case reports and case series accounted for approximately 10% of the total body of evidence; 10.7% [95% CI: 9.9-11.6] for SARS-CoV-2 and 9.7% for Zika virus research. Analytical epidemiological study designs became more prevalent  (Fig. 2b).

Accumulation of epidemiological and basic research
Despite the difference in volume, the accumulation of study designs over time for SARS-CoV-2 and Zika virus research show some similarities (Fig. 3). Case reports, case series and cross-sectional studies were the first epidemiological study designs to be reported, together with non-original articles and reviews. Case-control and cohort studies followed later; this delay was more prominent in the Zika virus research. Phylogenetic studies and mathematical modelling studies had a more prominent role early on during the SARS-CoV-2 pandemic than in the Zika virus epidemic. In vivo and in vitro laboratory studies followed between case reports and controlled observational studies. Trials were the last type of study to be published. This pattern of accumulation did not change when we did the same comparison but without preprint publications (Additional file 1 Fig. S3).

The role of preprint publications
The role of preprint publications was more prominent at the start of the SARS-CoV-2 pandemic than the Zika virus epidemic. In January and February 2020, the majority of publications on SARS-CoV-2 were manuscripts on preprint servers (Fig. 4a). Basic research reviews were seldom published on preprint servers, whereas 77% of the mathematical modelling studies were initially made available on preprint servers (Fig. 4b). The proportion of modelling and sequencing studies that were published as preprints was high throughout the first 21 weeks of 2020, whereas other designs reduced over time (Additional file 1 Fig. S4). The proportion of preprints decreased over time.

Discussion
The overall distribution of publications at the start of the SARS-CoV-2 and Zika virus epidemics was similar. Epidemiological research was more commonly published than laboratory research and non-original contributions accounted for a substantial fraction of all publications for both infections. For both infections, case reports and case series, mathematical modelling and phylogenetic studies were prominent at the start of the epidemic, whereas analytical study designs, such as cohort and case-control studies, appeared later. Trials emerged later and accounted for a small proportion of all studies. The volume and speed of evolution were much higher for SARS-CoV-2 than for Zika virus. Modelling studies were more prominent and basic research studies were less common for SARS-CoV-2 than for Zika virus. More studies were published as preprints for SARS-CoV-2, but this proportion declined over time.

Strengths and limitations
Strengths of this study include the comparable and reproducible search strategies for two emerging infectious diseases and categorisation of study design by a volunteer crowd of epidemiologist reviewers. A limitation is that the design of an epidemiological study is not always clear, and different scientists might classify the same study differently. We tried to tackle this limitation by screening and training of the volunteer scientists, verification of decisions and having a third person resolving disagreements [12]. There are other limitations. First, we only recorded the study design of publications and did not assess the content or its methodological quality. To trace the evolution of evidence for specific research questions, in-depth studies are needed. Second, for SARS-CoV-2, the volume of publications meant that we only annotated a sample of records. The total in the first 5 months of the pandemic was, however, higher than for 1 year of publications about Zika virus and the proportions of different study designs for Zika virus stabilised quickly. Third, the searches do not include all sources of peer-reviewed evidence or preprint sources. Incompleteness of the evidence base should not affect our conclusions as long as other sources account for a stable proportion of publications.
We followed two dimensions of the publication of evidence about two newly emerging infectious diseases; the overall distribution of publication types and changes over time. Similarities in the overall distribution of epidemiological, basic science and non-original publications for SARS-CoV-2 and Zika virus could reflect patterns of the overall trajectory of research about emerging infectious diseases. In the initial phase of an outbreak with a novel pathogen, case reports and case series predominate. These types of study describe and refine the clinical characteristics of the disease [21]. Observations from these studies are commonly used to define research questions and formulate hypotheses about various aspects of transmission and disease. More formal, hypothesis-driven and interventional research follows later [6].
The differences between study designs in the two epidemics are compatible with differences in characteristics of the diseases. The higher proportion of basic research in Zika virus research may have several explanations. First, the occurrence of congenital abnormalities following a vector-borne infection was poorly understood; in vivo and in vitro studies were essential to investigate in utero transmission and mechanisms for neurotoxicity and neuropathology [22]. Second, the establishment of mouse models was more successful in Zika virus research than for SARS-CoV-2 research, [23] although efforts are ongoing [24]. Third, the later occurrence of case-control studies and cohort studies in Zika virus, might be caused by the delay to congenital outcomes, compared to the shorter delay in outcomes caused by SARS-CoV-2. Fourth, the prominent role of mathematical modelling studies during the beginning of the SARS-CoV-2 pandemic, probably reflects early recognition of the pandemic potential and the need for forecasts of the global spread. Mechanistic models describing the transmission SARS-CoV-2 are also less complex than for arboviruses like Zika virus, allowing many to explore transmission dynamics [10]. The higher volume of observational research about SARS-CoV-2 research could reflect both the 50-fold higher numbers of cases than for Zika virus and the severity of the pandemic, whereas Zika virus was largely limited to the Americas and cases of infection were already declining as the research volume started to increase. The increasing role of preprints during the SARS-CoV-2 coincided with developments in open access publishing and the need for speedy access to outbreak research [25]. The increase of preprint publishing results in faster access to evidence, which will benefit the public health response. The rapid pace of publication in both preprint and peer-reviewed publications mean that readers need to carefully appraise the methodological quality of the research.
Other researchers have studied the evolution of evidence during disease outbreaks as well. During the SARS outbreak in 2003, Xing et al. (2010) described epidemiological studies from Toronto and Hong Kong [26], whereas we included epidemiological and nonepidemiological articles all over the world. Xing and colleagues primarily studied the publication time delay during the outbreak and concluded that only a minority (7%) of the publications was published during the time outbreak, while we investigated the proportion of the preprints over peer-reviewed publications [26]. For the SARS-CoV-2 pandemic, Liu et al. (2020) performed a bibliometric analysis of the SARS-CoV-2 literature up to March 24, 2020 [27], classifying research by theme, rather than by study design. They observed that clinical features of the COVID-19 were studied heavily, whereas other research areas such as mental health, the use of novel technologies and artificial intelligence, and pathophysiology remained underexplored. In contrary to the manual annotation of our project, Tran et al. (2020) performed automatic Latent Dirichlet allocation topic modelling of publications on SARS-CoV-2, published up to April 23, 2020 [28], with findings similar to those of Liu. et al. [27]. While we validated classification of study design manually, Tran et al. did not describe a validation of their automated modelling method. Jones et al. (2020), who classified study designs using categories comparable to ours, showed a similar pattern of study design occurrence during the early SARS-CoV-2 pandemic, where case reports and narrative reviews were found to be most published [8]. However, they merely present absolute numbers and a comparison with other outbreaks is absent [8]. Similarly, Fidahic et al. (2020) concluded that early in the SARS-CoV-2 pandemic articles were predominantly retrospective case reports and modelling studies [9]. Haghani and Bliemer (2020) compared SARS, MERS and SARS-CoV-2 literature and showed that around 50% of studies were non-original, which is in line with our results [29]. Unlike our categorization method, Haghani and Bliemer used the categorization by the citation database 'SCOPUS' and conclude that the studies linked to public health response are first to emerge .
Our work has several implications for public health policy and research. The change over time in the types of studies has particular implications for synthesis of evidence and for public health as more research is published. Policy makers and public health practitioners need to keep up with rapid changes in the state of the evidence, because these changes affect recommendations for control measures. For example, preprint publication in December 2020, about the spread of new more infectious variants of SARS-CoV-2 in the United Kingdom and South Africa [30], provided the scientific evidence for strengthening control measures in several European Countries and changing vaccination policy in South Africa. The earliest studies published might not be the most appropriate to answer specific questions, for example, about causality [31], or to quantify disease characteristics. Triangulation of different sources of evidence using frameworks, such as those based on the Bradford Hill criteria [15], and careful interpretation through explicit acknowledgment of limitations can help. Living systematic reviews are particularly useful because changes to inclusion criteria can be planned and protocols can be amended in advance of an update. For example, quantifying the proportion of asymptomatic SARS-CoV-2 infections in March 2020 relied largely on descriptions from contact investigations in single families [32]. By June 2020, there were also studies at lower risk of selection and measurement biases, such as population screening .
The vast quantity of evidence about emerging infections poses challenges for the efficient handling and evaluation of information. The speed with which the evidence about SARS-CoV-2 has accumulated is unprecedented. We recruited a large team of experienced scientists, but we were still not able to categorise all publications by the time this manuscript is written. Machine learning methods, such as natural language processing, to classify text is a promising approach for the triage of publication types [18]. We also see potential in a scaled-up version of collaborative crowd-sourcing among experts in the field, to increase efficiency and avoid research waste [33]. The technical tools to manage such efforts are available, but guidelines on how to best conduct the live synthesis of evidence should be developed and evaluated further [34].

Conclusions
The findings of this study show how description of the types and timing of publications during outbreaks of emerging and re-emerging diseases can help us understand which types of public health questions we can answer and when. Further analyses of the generation and accumulations of research evidence during disease outbreaks could help to improve the public health response.  Figure S4. The proportion of preprint publications compared to and peer-reviewed publications for SARS-CoV-2 by study design and week.