Sample size calculation for estimating key epidemiological parameters using serological data and mathematical modelling

Background: Our work was motivated by the need to decide, given serum availability and/or financial resources, which samples in a serum bank to test for different pathogens. Simulation-based sample size calculations were performed to determine the age-based sampling structure and the optimal allocation of a given number of samples across age groups best suited to estimating key epidemiological parameters (e.g., seroprevalence or force of infection) with acceptable precision in a cross-sectional seroprevalence survey.

Methods: Statistical and mathematical models and three age-based sampling structures (survey-based, population-based, and uniform) were used. Our calculations are based on Belgian serological survey data collected in 2001–2003, in which samples were tested for, amongst others, Immunoglobulin G antibodies against measles, mumps, and rubella, for which a national mass immunisation programme was introduced in Belgium in 1985, and against varicella-zoster virus and parvovirus B19, for which the endemic equilibrium assumption is tenable in Belgium.

Results: The optimal age-based sampling structure for a serological survey, as well as the optimal allocation of samples, varied with the epidemiological parameter of interest for a given infection and between infections.

Conclusions: When estimating epidemiological parameters with acceptable precision from a single cross-sectional serological survey, attention should be given to the age-based sampling structure. Simulation-based sample size calculations in combination with mathematical modelling can be used to choose the optimal allocation of a given number of samples over age groups.

Electronic supplementary material: The online version of this article (10.1186/s12874-019-0692-1) contains supplementary material, which is available to authorized users.


Age structures
Comparison of the three age distributions: "survey" (red line) refers to the age structure derived from the serological survey data (all individuals included, aged from 1 to 65 years); "population 2003" (black line) refers to the age structure of the Belgian population in 2003; "uniform" (blue dashed line) refers to a uniform age structure based on 65 age groups (1-year intervals).

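To illustrate how an age structure translates into numbers of samples per age group, the sketch below splits a fixed total over the 65 one-year age groups in proportion to a set of weights. It is only a minimal illustration: the function name `allocate`, the total of 3300 samples, and the linearly declining "population" weights are hypothetical and are not taken from the survey or from Belgian population data.

```python
def allocate(total, weights):
    """Split `total` samples across age groups in proportion to `weights`.

    Uses largest-remainder rounding so that the integer counts sum
    exactly to `total`.
    """
    weight_sum = sum(weights)
    raw = [total * w / weight_sum for w in weights]
    counts = [int(r) for r in raw]  # floor of each share
    leftover = total - sum(counts)
    # Hand the remaining samples to the groups with the largest fractional parts.
    order = sorted(range(len(raw)), key=lambda i: raw[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


ages = list(range(1, 66))             # 65 one-year age groups, ages 1 to 65
uniform = [1.0] * len(ages)           # uniform structure: equal weights
population = [100 - a for a in ages]  # hypothetical weights, illustrative only

n_uniform = allocate(3300, uniform)        # 3300 is an arbitrary example total
n_population = allocate(3300, population)
```

Under the uniform structure every age group receives the same number of samples up to rounding, whereas a population-based structure shifts samples towards the more populous ages.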
MSIRW model with boosting and age-specific waning (MSIRWb-ext AW)
Goeyvaerts et al. (2011) considered different compartmental models to make inferences about basic immunological processes for parvovirus B19. In the MSIRWb-ext AW model, they considered waning of disease-acquired antibodies. Individuals move at a rate $\varepsilon(a)$ from a high immunity state R to a low immunity state W, in which they are still protected from infection but are categorized as seronegative, that is, their antibody levels fall below the serostatus threshold (see Figure S2). In addition, they assumed that low immunity can be boosted by exposure to infectious individuals. The boosting rate is proportional to the force of infection $\lambda(a)$, with proportionality constant $\varphi$, such that the rate at which individuals move back from W to R equals $\varphi \cdot \lambda(a)$. By solving the corresponding set of differential equations, one finds that the fraction in state S equals: [equation not preserved in the extracted text]

The estimation method assumes endemic and demographic equilibrium. The transmission rates $\beta(a, a')$ are assumed to be directly proportional to the age-specific rates of making social contact, $c(a, a')$, with a disease-specific proportionality factor $q$ ("constant proportionality assumption"): $\beta(a, a') = q \cdot c(a, a')$. The contact rates are estimated from the POLYMOD contact survey using a non-parametric model (Goeyvaerts et al. 2010). Since the above integral equation has no closed-form solution, we turn to discrete age classes to estimate the parameters. The Bernoulli log-likelihood for the serological data is maximized through an iterative procedure.

The waning rate is modelled as a piecewise constant function with a cut-off point at a predetermined age: $\varepsilon(a) = \varepsilon_1$ if $a$ is below the cut-off age, and $\varepsilon(a) = \varepsilon_2$ if $a$ is at or above it. Comparing the overall likelihood for different values of the cut-off age (from 5 to 50 years in 5-year steps) led to the choice of 35 years.
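The building blocks named above (a piecewise constant waning rate, transmission rates proportional to contact rates, and a Bernoulli log-likelihood for serostatus) can be sketched as follows. This is a minimal illustration and not the estimation code of Goeyvaerts et al.: all function and argument names are hypothetical, the waning function applies the first rate to every age below the cut-off, and the model-predicted seropositive probabilities are taken as given rather than derived from the differential equations.

```python
import math


def waning_rate(age, cutoff, eps1, eps2):
    """Piecewise constant waning rate: eps1 below the cut-off age, eps2 from it onwards."""
    return eps1 if age < cutoff else eps2


def transmission_rates(q, contact_rates):
    """Constant proportionality assumption: scale every contact rate by the factor q."""
    return [[q * c for c in row] for row in contact_rates]


def bernoulli_loglik(serostatus, prob_positive):
    """Bernoulli log-likelihood of 0/1 serostatus given predicted seropositive probabilities."""
    loglik = 0.0
    for y, p in zip(serostatus, prob_positive):
        p = min(max(p, 1e-12), 1 - 1e-12)  # guard against log(0)
        loglik += y * math.log(p) + (1 - y) * math.log(1 - p)
    return loglik


# Candidate cut-off ages: 5 to 50 years in 5-year steps.
candidate_cutoffs = list(range(5, 55, 5))
```

In a profile-likelihood comparison of the kind described, the model would be refitted for each value in `candidate_cutoffs` and the cut-off with the highest maximized log-likelihood retained.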

Calculations of the key epidemiological parameters
- Calculation of the overall prevalence: $\hat{\pi} = \frac{\sum_a \hat{\pi}_a \times p_a}{\sum_a p_a}$, where $\hat{\pi}_a$ is the estimated age-specific prevalence and $p_a$ is the proportion of individuals aged $a$ in the population distribution.
- Calculation of the overall force of infection: [equation not preserved in the extracted text], where $\hat{\lambda}_a$ is the estimated age-specific force of infection, $\hat{\pi}_a$ is the estimated age-specific prevalence and $p_a$ is the proportion of individuals aged $a$ in the population distribution.
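As a small worked illustration of the overall prevalence, the sketch below computes a population-weighted average of age-specific prevalences. It assumes the normalised form (weighted sum divided by the sum of the weights); the function name and the example values are hypothetical and are not estimates from the survey data.

```python
def overall_prevalence(age_prevalence, age_proportion):
    """Population-weighted average of age-specific prevalences.

    Dividing by the sum of the weights keeps the result a proper average
    even when the proportions cover only part of the age range.
    """
    weighted_sum = sum(p * w for p, w in zip(age_prevalence, age_proportion))
    return weighted_sum / sum(age_proportion)


# Hypothetical example with three age groups.
example = overall_prevalence([0.2, 0.6, 0.9], [0.5, 0.3, 0.2])
```

Because the weights come from the population distribution rather than from the sample, the same age-specific estimates give the same overall prevalence whichever sampling structure produced them; only their precision differs.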

Precision values
In Tables S5-12, the following abbreviations are used:

Table S8. VZV serological data - Exponentially damped model

Figure S4. Parvovirus B19 serological data: mean, median, and 95% confidence interval for the relative boosting factor φ (left) and basic reproduction number R0 (right) over 500 simulations as a function of the total number of sampled individuals (N) for the MSIR model allowing for age-specific waning of disease-acquired antibodies and boosting of low immunity (MSIRWb-ext AW). The "true" value is the value estimated by fitting the model to the observed serological data (with integer age values). The y-axes have different ranges for better legibility.
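The way precision is summarised over repeated simulations as a function of N can be sketched with a much simpler estimator than the models used here. The example below simulates a single binomial prevalence rather than φ or R0, and it uses empirical percentiles of the simulated estimates as the 95% interval. The function name, the "true" prevalence of 0.7, and the sample sizes are hypothetical choices for illustration only.

```python
import random
import statistics


def simulate_precision(true_prev, n, n_sim=500, seed=1):
    """Simulate n_sim surveys of size n and summarise the prevalence estimates."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_sim):
        # Each sampled individual is seropositive with probability true_prev.
        positives = sum(rng.random() < true_prev for _ in range(n))
        estimates.append(positives / n)
    estimates.sort()
    # Empirical 2.5th and 97.5th percentiles of the simulated estimates.
    lower = estimates[int(0.025 * (n_sim - 1))]
    upper = estimates[int(0.975 * (n_sim - 1))]
    return {
        "mean": statistics.mean(estimates),
        "median": statistics.median(estimates),
        "lower": lower,
        "upper": upper,
    }


for n in (100, 500, 1000):
    print(n, simulate_precision(0.7, n))
```

The interval width shrinks as n grows, which is the pattern that a precision figure plotted against N is meant to display.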