Gaussian process emulation to improve efficiency of computationally intensive multidisease models: a practical tutorial with adaptable R code

Sawe, Sharon Jepkorir; Mugo, Richard; Wilson-Barthes, Marta; Osetinsky, Brianna; Chrysanthopoulou, Stavroula A.; Yego, Faith; Mwangi, Ann; Galárraga, Omar

doi:10.1186/s12874-024-02149-x

Research
Open access
Published: 27 January 2024

Gaussian process emulation to improve efficiency of computationally intensive multidisease models: a practical tutorial with adaptable R code

BMC Medical Research Methodology volume 24, Article number: 26 (2024) Cite this article

870 Accesses
2 Altmetric
Metrics details

Abstract

Background

The rapidly growing burden of non-communicable diseases (NCDs) among people living with HIV in sub-Saharan Africa (SSA) has expanded the number of multidisease models predicting future care needs and health system priorities. Usefulness of these models depends on their ability to replicate real-life data and be readily understood and applied by public health decision-makers; yet existing simulation models of HIV comorbidities are computationally expensive and require large numbers of parameters and long run times, which hinders their utility in resource-constrained settings.

Methods

We present a novel, user-friendly emulator that can efficiently approximate complex simulators of long-term HIV and NCD outcomes in Africa. We describe how to implement the emulator via a tutorial based on publicly available data from Kenya. Emulator parameters relating to incidence and prevalence of HIV, hypertension and depression were derived from our own agent-based simulation model and other published literature. Gaussian processes were used to fit the emulator to simulator estimates, assuming presence of noise for design points. Bayesian posterior predictive checks and leave-one-out cross validation confirmed the emulator’s descriptive accuracy.

Results

In this example, our emulator resulted in a 13-fold (95% Confidence Interval (CI): 8–22) improvement in computing time compared to that of more complex chronic disease simulation models. One emulator run took 3.00 seconds (95% CI: 1.65–5.28) on a 64-bit operating system laptop with 8.00 gigabytes (GB) of Random Access Memory (RAM), compared to > 11 hours for 1000 simulator runs on a high-performance computing cluster with 1500 GBs of RAM. Pareto k estimates were < 0.70 for all emulations, which demonstrates sufficient predictive accuracy of the emulator.

Conclusions

The emulator presented in this tutorial offers a practical and flexible modelling tool that can help inform health policy-making in countries with a generalized HIV epidemic and growing NCD burden. Future emulator applications could be used to forecast the changing burden of HIV, hypertension and depression over an extended (> 10 year) period, estimate longer-term prevalence of other co-occurring conditions (e.g., postpartum depression among women living with HIV), and project the impact of nationally-prioritized interventions such as national health insurance schemes and differentiated care models.

Peer Review reports

Background

The need for emulation

In situations where empirical data on disease impact(s) are not universally available or a randomized controlled trial may not be feasible, complex mathematical computer models (referred to as “simulators”) [1,2,3] can characterize disease prevalence, forecast incidence, and help identify cost-effective approaches for meeting short- and long-term health system needs [4,5,6]. Though simulation models play a critical role in helping public health decision-makers synthesize data from multiple sources and compare anticipated outcomes over time, the limitations of these models are not trivial. Simulation models based on central processing units (CPUs), such as grid computing and computing clusters, [7] require significant infrastructure that can incur high costs for hardware and oversight. Even with access to high-computing infrastructure, long run times of several hours for a single simulation and large numbers of input and output parameters can greatly inhibit fitting such models, [8] and in turn restrict analysts to considering only a subset of all possible simulated scenarios [9, 10]. Furthermore, because of their complexity, components of microsimulation models can still be perceived as a black box [11, 12] because their functions and behaviors are often not exhaustively described or immediately accessible at the time of publication, all of which makes it difficult for external users to interpret and adapt model processes for their local context.

Emulators are one tool that can help mitigate these limitations [9]. An emulator, also known as a metal-model, [13] is an approximation of one or more complex mathematical model(s) that is constructed using a training sample of simulator runs [14] and computationally more efficient. Emulators reduce costs by negating the need for super-computing infrastructure and, once developed, can substantially shorten the amount of time needed to implement model runs and interpret results.

In the last decade there has been an accelerated demand for integrated responses [15,16,17,18,19,20] to the growing burden of non-communicable diseases (NCDs) – including cardiovascular disease, cancers, diabetes, and mental illness – among people living with HIV (PLWH) in low and middle income countries (LMICs) [21,22,23,24]. In response to this call, the authors of this paper recently extended an established agent-based model of HIV transmission and treatment impact to include hypertension in two rural settings in Sub-Saharan Africa [24]. The authors’ simulation model was able to generate robust estimates of changing risks across age groups and predict growing population burdens of HIV and hypertension as comorbidities; however, the simulations were resource and time intensive and it is unlikely that novice modelers would be able to adapt the model’s components without input from an expert biostatistician. Other simulators of HIV and non-communicable diseases have met the same challenges [25]. To facilitate a greater understanding and usability of these complex models, we therefore share our experience developing an open-source emulator that approximates estimates from two simulators over an input subdomain of parameters related to HIV, hypertension and depression in a Sub-Saharan African (SSA) country with a generalized HIV epidemic. The tutorial presented in this paper (i) describes the steps involved in emulator development and validation, (ii) illustrates how to interpret the emulator’s predictive accuracy and outputs in relation to those from each simulator using case study data from Kenya, and (iii) provides annotated, open-source and adaptable R code to facilitate the use of the discussed methods in practice. We expect researchers with a basic understanding of Bayesian statistics and some familiarity with R software to be able to implement this protocol independently.

The emulation method described in this paper relies on well-established and validated Gaussian processes [14, 25, 26]. To the best of our knowledge, Gaussian-process emulation has not yet been used to mimic simulation models that predict the burden of HIV-comorbidities over time, nor has it been described in sufficient detail to enable use by non-biostatisticians. Thus, this tutorial is scientifically significant in that is uses a didactic approach to demystify Gaussian process emulation methods, and offers a new tool that can potentially improve public health decision-making with less resources.

Methods: emulator development and validation

Simulator description and source data

Evidence overwhelming indicates that the disproportionate burden of non-communicable diseases among people living with HIV – compared to individuals not living with HIV – will increase rapidly in the coming decades, [27, 28] and that most health systems in Sub-Saharan Africa are not currently equipped to treat the more than 15 million patients who require integrated HIV and NCD care [19]. In this tutorial we focus on two published simulation models that have estimated the future burden of HIV and non-communicable diseases in Kenya, as well as the costs and epidemiological impact of strengthening integrated care systems in the country. The two simulators were selected as examples for this tutorial because of their longer-term (i.e., > = 10 year) forecast periods and available details surrounding their design points. The two simulators were also selected because the rise of non-communicable diseases among persons living with HIV in Kenya is indicative and representative of the rise of NCDs on the continent, which are estimated to overtake infectious diseases worldwide by 2030 [19, 27, 28].

The first model is the authors’ Integrated Modeling of Epidemiologic and Economic Long-term Outcomes in Africa (inMODELA) microsimulation model, [29] which simulates HIV and hypertension in Kenya and South Africa from 2018 to 2028. The model is an extension of the Sexually Transmitted Diseases Simulator (STDSIM), [30, 31] a stochastic agent-based model that simulates transmission of HIV and other sexually transmitted diseases (STDs) through dynamic sexual networks. inMODELA was partly calibrated using population surveillance data of hypertension and HIV from western Kenya. National-level data on the hypertension prevalence were extracted from the 2015 Kenya STEPwise Approach to NCD Risk Factor Surveillance (STEPS) survey [32] while HIV modelling was calibrated using reports from 2007 and 2012 Kenya AIDS Indicator Surveys (KAIS) and the 2016 Kenya County HIV profiles, [33, 34] the most recent national data available at the time of simulator development. HIV was modelled as having four stages: early infection, asymptomatic, symptomatic and AIDS, and treatment with antiretroviral therapy was operationalized as individual ART demand and health system capacity to meet ART demand. Hypertension was modelled as being normotensive or hypertensive (i.e., blood pressure > 140/90 mmHg), accounting for the potential effects of age, gender, and economic development on hypertension risk. Key outputs of the inMODELA model include total annual mortality, incidence, and prevalence, as well as the health system burden of hypertension, HIV, and comorbid HIV and hypertension. Additional details of the inMODELA simulator are available in a separate publication [29].

The second model in this tutorial is an individual-based simulator initially developed by Smit et al. for Zimbabwe [35] and adapted for Kenya [36]. The model estimates current and future births, deaths, HIV disease and treatment, as well as prevalence/incidence of cardiovascular disease, chronic kidney disease, depression, diabetes, hypertension, and other NCDs and cancers among adults for the period of 2018 through 2035. Simulator calibration relied on data from the Joint United Nations Programme on HIV/AIDS, [37] 2016 Global Burden of Disease estimates [38] and other sources [36].

Relevant input parameters used to calibrate the two simulation models are summarized in Supplementary Table 1.

Overview of the emulation process

Figure 1 summarizes the steps that were used to develop and validate an emulator to approximate epidemiological outputs projected over 10 years by the multidisease simulators. To develop the emulator, we first abstract relevant parameters from the more complex simulation models to serve as the emulator’s design points. In this example, parameters related to the prevalence of HIV, hypertension, comorbid HIV and hypertension, and depression among PLWH were selected. Prevalence parameters were ascertained for 2018 and for 2028. Second, we use Gaussian processes (GP) to approximate the mean and variance of each simulator’s outputs, assuming presence of noise for our design points. Third, we use Bayesian posterior predictive analysis to analyse the credibility of future emulator predictions based on the posterior distribution. Lastly, we use leave-one-out cross validation to confirm the emulator’s predictive accuracy and compare emulator estimates to simulation results.

Step 1. Install the program

This tutorial uses the GauPro package [39], the rstanarm package [40] and the loo package [41, 42] in R version 4.1.2 for emulator construction and application. To replicate results from this tutorial, or to adapt this emulator to new simulation data, the latest version of the free R software environment needs to be installed, and can be downloaded from https://www.r-project.org/.

Step 2. Select key design points from simulation model(s)

Not all parameters of a complex simulation model will be needed for emulation. Only those most informative for your research question should be ascertained. In this example using simulator data from Kenya, parameters relating to the prevalence of HIV, hypertension, and comorbid HIV and hypertension were ascertained from the inMODELA simulation model [29] for the period of 2018 through 2028. Parameters relating to the prevalence of depression among people living with HIV were ascertained from the Smit et al. model [36] for the period of 2018 through 2030. (Table 1) To reconcile the different forecasting periods used by the two simulation models, we assumed the prevalence of depression in 2028 to be the average prevalence for years 2025 and 2030. Supplementary Table 2 specifies the design points used to emulate annual HIV and NCD prevalence for the period of 2018 through 2028.

Table 1 Summary of final design points used for emulation, based on example national data from Kenya

Full size table

Update the emulator’s R code to reflect relevant design points and corresponding emulation time period(s) abstracted from simulator data.

Step 3. Fit Gaussian processes to simulation data

A Gaussian process refers to a collection of any finite number of random variables which have a Gaussian (normal) distribution [25]. A GP emulator uses a statistical model to fit a Gaussian process to a dataset, and is defined by (1) a mean function describing the mean at any point of the input space and (2) a covariance function describing the covariance between points [25, 39]. When emulating a stochastic simulator, the unknown function is assumed to be the expectation of the i^th simulator output denoted as f_i(x).

We construct our GP emulator such that, for each simulator output f_i(x), we select active variables (x^A) and then emulate using the following process:

$${f}_i(x)=\sum\nolimits_{i=1}^q{\beta}_i{g}_i\left({x}^A\right)+{\mu}_i\left({x}^A\right)+{\delta}_i(x).$$

(1)

The first part of the emulator, $\sum_{i=1}^q{\beta}_i{g}_i\left({x}^A\right),$ is a polynomial of active inputs x^A running from i = 1, …q chosen from the simulator outputs, β_i are the regression coefficients, and g_i(˙) are the deterministic functions of x^A which are known [43]. The second part of the emulator, μ_i(x^A), represents a collection of any finite number of random variables which have a Gaussian distribution [43]. Supposing that μ_i(x^A) is a Gaussian process with zero mean and known variance, we can define it as:

$$\mu \left({x}^A\right)\sim GP\left(0,c\left({x}^A,{x^A}^{\prime}\right)\right)$$

(2)

where c(x^A, x^A′) is a covariance function that determines the relationship between μ(x^A) and μ(x^A)^′ based on the distance between x^A and x^A′. The Gaussian process used to develop the current emulator has a covariance structure given by:

$$Cov\left({\mu}_i\Big({x}_1^A\right),{\mu}_i\left({x}_2^A\right)\Big)={\sigma}_i^2\exp \left[\frac{-{\left|{x}_1^A-{x}_2^A\right|}^2}{\theta_i^2}\right].$$

(3)

The parameter σ in eq. 3 can be varied to obtain the desired amount of waves in the emulator, whereby a smaller value of σ results in less extreme waves [44]. θ > 0 are unknown correlation length parameters where large values of θ indicate a smooth output function of the i^th input and small values suggest high non-linearity [45]. The last part of the emulator, δ_i(x), models the effects of inactive variables as random noise.

For each output of interest, the emulator provides the expectation E [f_i(x)] and variance var(f_i(x)) at x for every output given by i = 1, 2, …n where x denotes a vector of the emulator inputs. We evaluate f_i(x) as the prevalence value(s) projected from 2018 to 2028 for HIV, hypertension, comorbid HIV and hypertension, and depression. The results are gathered into a vector D: ${D}_i=\Big(f{\left({x}_1^A,\dots, f\left({x}_q^A\right)\right)}^T$ where q represents the number of design points. The emulator is then plotted as an adjusted expectation and variance function of f_j(x): E_D(f_j(x)) and Var_D(f_j(x)).

Step 4. Run graphical posterior predictive checks of emulator fit

In this step, we use graphical displays to check that the disease burden trends produced by our emulator look similar to the simulator data we observed. Our emulator uses Bayesian posterior predictive analysis to assess the credibility of future observable data based on the posterior distribution [46]. We analyze the Bayesian posterior distributions graphically to check predictive accuracy of the posterior distributions, plotting simulated y values from the posterior distribution against the actual values of y. The posterior predictive distribution is defined as:

$$p\left({y}^{rep}|y\right)=\int p\left({y}^{rep}|\theta \right)p\left(y|\theta \right)p\left(\theta \right) d\theta$$

(4)

where y^rep represents future data that could be drawn from the posterior distribution, y is the current simulator data, θ is the model parameter, and p(y| θ) is the sampling density for future data conditional on the parameter [46]. The size of the posterior sample was based on 50 draws which showed to be sufficient enough to generate the y^rep matrix from the posterior predictive distribution [40].

Step 5. Validate the emulator’s accuracy

We use the Bayesian leave-one-out cross validation (LOOCV) technique to confirm the emulator’s predictive accuracy. The LOOCV method applies the log-likelihood evaluation of posterior parameter values, [42] and is appropriate for smaller data sets and when similar distributions exist in the training and testing data [47]. LOOCV assesses the predictive ability of posterior simulations in which the data is iteratively partitioned into either calibration (training) sets or validation (test) sets. LOOCV is one of the most accurate ways to estimate how well a model will perform on unseen, “out-of-sample” data [42]. The calibration set is used to train the model and produce output values that are compared with the test set [48]. We validated the emulator’s accuracy using the loo package in R whereby the computation uses Pareto-smoothed importance sampling (PSIS) to regularize importance weights [42]. Following prior work [42, 49], a Pareto k estimate less than or equal to a 0.70 threshold was used to confirm reliability of the emulator’s performance. The LOOCV approach is specified in eqs. 5–7.

Given that data y₁, …y_n are independently modelled given the parameters θ then, $p\left(y|\theta \right)={\prod}_i^np\left({y}_i|\theta \right)$. Suppose a prior distribution p(θ) gives a posterior distribution p(θ| y) and a posterior predictive distribution p(y˜| y) = ∫ p(y˜_i|θ)p(θ| y)dθ. A predictive accuracy measure for n data points termed as expected log pointwise predictive density (elpd) for a new dataset is given as:

$$\sum\nolimits_1^n\int {p}_t\left(y{\sim}_i\right)\mathit{\log}\ p\left(y{\sim}_i|y\right) dy{\sim}_i$$

(5)

where p_t(y˜_i) is the distribution of the real data generation for y˜_i. The values of p_t(y˜_i) are unknown and therefore cross-validation is used to approximate (5). The Bayesian leave-one-out (LOO) estimate of the predictive fit is:

$${elpd}_{loo}=\sum\nolimits_{i=1}^n\log p\left({y}_i|{y}_{-1}\right)$$

(6)

where

$$p\left({y}_i|{y}_{-1}\right)=\int p\left({y}_i|\theta \right)p\left(\theta |{y}_{-1}\right) d\theta$$

(7)

is the leave-one-out predictive density given the data without the i^th data point.

To train the emulator, we followed the Emulation and History Matching Handbook for R [50] whereby we chose a number of training points equal to ten times the number of parameters. Given that we were estimating only one parameter (prevalence) for each of the four outcomes of interest, this resulted in 10 training points.

Results: emulator interpretation

Using outputs from the two simulation models of HIV, hypertension and depression burden, our emulator shows to be as accurate and computationally more efficient at predicting prevalence of these comorbidities in Kenyan populations over 10-years. On average, the inMODELA simulator took 11 hours and 53 minutes to perform 1000 runs on a high-performance computing cluster (HPC) and used 208 central processing units (CPU) cores, 2 graphics processing units (GPUs), and 1500 gigabytes (GBs) of random-access memory (RAM). The model by Smit et al. was coded in C++, ran using Xcode (Mac), and took a little over 3 minutes per iteration on an 8-core machine. Although run times were not overwhelmingly long, the Smit et al. simulation model faced computational constraints due to the large size of model outputs; this necessitated an additional post-processing step of aggregating and summarizing each batch of 10 model iterations in MatLab. By comparison, the present emulator was developed and validated on a Hewlett-Packard Intel Core i5 laptop with a 64-bit operating system, 2.50 CPU of 2.50 GHz and 8.00 GBs of RAM, and took only a few seconds for a single run using the same statistical program for development, validation and output interpretation.

Figure 2 shows the emulator’s projected mean prevalence over time for the example data in terms of: (A) HIV, (B) hypertension, (C) comorbid HIV & hypertension, and (D) depression among adults living with HIV. In each sub-figure of Fig. 2, black points should be interpreted as the design points (i.e., simulator outputs) used to fit the emulator, dashed red lines show the mean prevalence plotted as a function of the emulator’s predictions, solid blue lines represent the 95% Confidence Intervals (CI) for the emulator’s mean estimates, and the yellow lines are the 95% Confidence Intervals for the simulators’ mean estimates (Confidence Intervals for the annual mean prevalence of depression among PLWH were not publically available). We can then visually understand the emulated trends in disease burden to closely reproduce the trends projected by each simulator. For example, the emulator predicts that the mean prevalence of HIV will be 0.02776 [95% CI: 0.02774–0.02778] in the year 2027 compared to 0.02750 [95% CI: 0.02379–0.03121] predicted by the inMODELA simulator. The pointwise confidence intervals used for statistical uncertainty quantification are wider between each emulation year than at each annual point estimate (e.g., at the midpoint between 2027 and 2028, the mean prevalence of HIV is predicted to be 0.02573 (95% CI: 0.02219, 0.02928). The width of the credibility interval is largest between points and approaches zero near each point estimate due to the Gaussian process regression towards the mean. When considering the prevalence(s) of HIV, hypertension, and comorbid HIV & hypertension in this example, the emulator’s 95% Confidence Intervals for each mean estimate are very similar to those projected by the inMODELA simulator. Given the linearity and lack of noise in the original data, the emulator’s uncertainty ranges are within those of the inMODELA estimates. This indicates that our emulator accurately approximates outputs from the more complex model.

Figure 3 shows the Gaussian posterior predictive distributions for the example Kenyan data for HIV, hypertension, comorbid HIV & hypertension, and depression. The solid black lines can be understood as the distribution of y; the dashed red lines represent the y posterior predictive distribution, and the dotted grey lines represent the distribution of the simulations performed. We observe from panel A, B, C and D in Fig. 3 that the posterior predictive distributions denoted by post-pred y are not far off from the current fitted data denoted by y. Density values differ for each panel due to different prevalence values for the respective disease(s) on the x-axis, with greater imprecision for depression estimates (panel D) because of fewer available simulation data at each projection year.

For each output of interest, the LOOCV yielded Pareto k diagnostic values of < 0.70, indicating practical convergence rates and reliable Monte Carlo error estimates. Validating diagnostics for each of the emulator’s output are provided in Supplementary Figures 1–4.

Supplementary Figure 5a shows the emulator’s projected mean prevalence of HIV when the simulation model assumes that Kenya’s ART coverage targets are achieved (i.e., 90% of people with HIV are aware of their status, and, of those, 90% are enrolled in HIV care^{Footnote 1}) due to the implementation of effective interventions along the care continuum. Supplementary Figure 5b shows the emulator’s projected mean prevalence of hypertension when Kenya’s hypertension treatment targets are reached (i.e., when 50% of persons with confirmed hypertension are enrolled in care and receiving drug therapy and counselling). Supplementary Figures 5c and 5d show the projected disease prevalences when one or both of these treatment targets are achieved. Supplementary Figure 6 shows the Gaussian posterior predictive distributions for these emulations. When considering the prevalence(s) of HIV and hypertension in the presence of hypothetical interventions that help achieve treatment targets, the emulator’s 95% confidence intervals for each mean estimate closely approximate those projected by the inMODELA model. Similar to emulator outputs based on disease prevalence simulations in the absence treatment, the emulator’s uncertainty ranges for prevalence estimates in the presence of wider treatment uptake are within those of the inMODELA simulation estimates. This indicates that our emulator accurately approximates outputs from more complex modelling of the disease-impact of chronic disease intervention(s). We also see a substantial lack of density values for each panel for the respective disease(s) on the x-axis in Supplementary Figure 6, further indicating high precision of the emulator for estimating treatment impacts on chronic disease burden over time. For all emulations of disease burden in the presence of higher intervention uptake, the Pareto k diagnostic values were < 0.70 (Supplementary Figures 7–10).

Discussion

Observed accuracy and efficiency

Applying widely-used Gaussian processes, this tutorial details the steps used to develop a new, user-friendly emulator that can approximate multi-year estimates of chronic disease burden from two computationally intensive simulation models. We demonstrate that the emulator closely reproduced trends in disease burden projected by the published simulators from which our parameters were sourced [29, 36]. In this example, from 2018 through 2028, prevalence of HIV and depression among PLWH in Kenya is projected to decrease by approximately 2 and < 1 percentage points, respectively, with prevalence of hypertension increasing by 5 percentage points over the same 10-year forecast. Successful validation checks (Pareto k estimates < 0.70 for all emulations) confirmed the emulator’s predictive accuracy. Disease prevalence estimates were modelled in only a few seconds on a 64-bit operating system laptop with 2 CPU cores of 8 GBs of RAM using the emulator presented in this paper, while simulations [29] took more than 11 hours to perform 1000 runs on a high-performance computing cluster with 208 CPU cores, 2 GPUs and 1500 GBs of RAM. Thus, our emulator was able to more efficiently model disease burden over time without compromising the statistical accuracy of more computationally intensive simulators. Outputs from sensitivity analyses suggest that the emulator is equally efficient and reliable for approximating simulations of disease burden in the presence of effective treatment interventions.

Future applications

The emulator presented in this paper was developed and validated using 10-year demographic and epidemiological case study data from Kenya. However, the broader Gaussian processes described in this tutorial (and made available via the open source R code) are widely validated in the fields of public and environmental health as reliable methods for emulating results of complex and resource-intensive models, [9, 43, 49] including for those above and beyond Kenyan populations and adults living with chronic conditions. For example, a tutorial using HIV data from Uganda [8] found that history matching and emulation of an 18 output simulator had a 65% probability of fitting all simulator outputs and was several orders of magnitude faster to evaluate. A Bayesian optimization emulator with Gaussian processes [51] has similarly shown to adequately capture the input–output relationship of the OpenMalaria individual-based model (IBM) [52] of malaria transmission while improving upon the simulator’s overall goodness of fit. And a Gaussian process emulator of an IBM of microbial communities [53] has demonstrated an approximately 220-fold increase in computational efficiency, with the percentage of variance explained by the univariate emulator ranging from 83 to 99%. Thus, through the selection of alternative design points, the emulator in this tutorial has the potential to approximate other simulators outside of those for HIV, hypertension and diabetes. As is the goal with any emulation exercise, our emulator offers a statistical model that can be used as a surrogate for chronic disease simulators. Because our simple emulator showed to be valid and more efficient in mimicking 10-year prevalence of HIV comorbidities in the absence of intervention, the next step is to evaluate how the emulator will also be able to mimic future simulators that model disease burden in the presence of targeted interventions. For example, Hamilton et al. developed an agent-based simulation of HIV-1 transmission in Kenya to estimate the potential population-level impact of providing PrEP layered into standard care services over 10 years [54, 55]. These and other longitudinal simulator data offer ideal design points to further test the emulator’s predictive accuracy for modelling intervention effects on HIV treatment and prevention. Several microsimulation models have been developed to characterize and plan for the rapidly growing burden of non-communicable diseases in SSA [56,57,58] and elsewhere [59,60,61]. Also, though still in early stages, recent advances in computing power are now allowing large and complex ABMs to be simulated in reasonable amounts of time using desktop GPUs [62]. Yet, to the best of our knowledge, none of these NCD models have been made available to local public health decision makers through accessible and understandable platforms, which the present emulator may begin to correct.

Limitations and strengths of this tutorial

There are some limitations to the emulation methods presented in this paper. First, Gaussian process emulation was appropriate given the small number of design points being considered in this tutorial. However, GP emulators do not scale well when including many (e.g., > 50) inputs [9, 63, 64], such that several lower-dimensional emulators might be more appropriate when a greater number of simulator outputs are being considered [9]. Relatedly, because our original simulation model data were essential linear, the emulator increased the efficiency of our predictions (i.e., narrower uncertainty ranges). It is possible that our emulator may lose computational efficiency or yield estimates with greater uncertainty when applied to higher dimensional data or to data with asymmetric noise. Incorporating interval calibration [65], Monte Carlo [66], and other methods in future iterations of the emulator could help address this limitation. Second, we fit separate univariate GPs emulators to model each simulator output individually, which neglects any potential correlation [67] between the outputs [53] (e.g., between changes in the prevalence of hypertension and depression among PLWH over time). Future expansions of the emulator can address this issue by using a multivariate Gaussian process [68]. Lastly, as is common to mathematical modelling work, the simulation models selected for this tutorial suffered from incomplete surveillance data which restricted our ability to perform additional emulation procedures such as history matching [69] to reduce simulator input space or additional diagnostic checks [70] that rely on more robust training data.

Despite these limitations, the emulator presented in this paper offers an accessible tool for health policy makers who may not have a background in disease modelling. This will help to build the modelling capacity of local decision makers who are working to build integrated HIV and chronic disease programmes with limited resources [71]. While transparency surrounding microsimulation model development has increased in recent years as these models become more widespread, it is often impractical to document every detail of their functionality [11, 12]. Emulators do not address all transparency concerns related to black box modelling, but they can help address concerns related to communicating results from complex models for wider audiences. Thus, a key strength of this emulator is its simplicity; the step-by-step annotated code that was programmed using free R software and is available via an open-source repository can encourage future use and adaption at no cost to the user.

Secondly, our emulator uses Bayesian inferences rather than a frequentist approach in the posterior predictive analysis, which maximizes both the prior and currently available data. Third, we used the leave-one-out cross validation technique which offers an unbiased and reliable diagnostic check when similar distributions exist in the training and test data [47, 49, 70]. Fourth, our emulator fills a gap in the health decision-making toolbox in that it is one of the first to model the dual burden of HIV and hypertension and of HIV and depression for a country with a generalized HIV epidemic and growing non-communicable disease burden.

Conclusion

This emulation tutorial responds to calls from international donors and global health researchers [72] to “make modelling tools and analytic packages publicly available to wider audiences” and facilitate “training of decision makers to understand model outputs, particularly uncertainty and confidence intervals”. We envision future expansions of this emulator to be able to estimate changes in HIV and NCD burden with greater coverage of National Health Insurance schemes and in the presence of integrated care interventions [73, 74], and to estimate the cost-effectiveness of integrated interventions based on current [6, 75] and emerging data.

Availability of data and materials

All annotated code used to develop and run the emulator presented in this paper is freely available via the following digital repository: https://doi.org/10.7910/DVN/LUBYHQ

Notes

The simulation model was developed and published in 2019, at which time the UNAIDS 90–90-90 targets were in effect in Kenya.

Abbreviations

CPU:: Central Processing Unit
GB:: Gigabyte
GP:: Gaussian Process
GPU :: Graphics Processing Unit
HIV:: Human Immunodeficiency Virus
inMODELA:: Integrated Modeling of Epidemiologic and Economic Long-term Outcomes in Africa
KAIS:: Kenya AIDS Indicator Surveys (KAIS)
LMICs:: Low- and Middle-Income Countries
LOOCV:: Leave-One-Out Cross Validation
NCD:: Non-communicable Disease
PLWH :: People Living with HIV
RAM:: Random Access Memory
SSA :: Sub-Saharan Africa
STD:: Sexually Transmitted Disease
STDSIM :: Sexually Transmitted Diseases Simulator

References

Law AM. Simulation Modeling and Analysis. 5th ed. Mcgraw Hill Series in Industrial Engineering and Management; 2014. p. 776.
Google Scholar
Rutter CM, Zaslavsky AM, Feuer EJ. Dynamic microsimulation models for health outcomes: a review. Med Decis Mak. 2011;31:10–8.
Article Google Scholar
Elveback L, Varma A. Simulation of mathematical models for public health problems. Public Health Rep. 1965;80:1067–76.
Article PubMed PubMed Central CAS Google Scholar
Kopec JA, Edwards K, Manuel DG, Rutter CM. Advances in microsimulation modeling of population health determinants, diseases, and outcomes. Epidemiol Res Int. 2012;2012:1–3.
Article Google Scholar
Sugrue DM, Ward T, Rai S, McEwan P, van Haalen HGM. Economic modelling of chronic kidney disease: a systematic literature review to inform conceptual model design. Pharmacoeconomics. 2019;37:1451–68.
Article PubMed PubMed Central Google Scholar
Nugent R, Barnabas RV, Golovaty I, Osetinsky B, Roberts DA, Bisson C, et al. Costs and cost-effectiveness of HIV/NCD integration in Africa: from theory to practice HHS public access. Aids. 2018;32:83–92.
Article Google Scholar
Nobile MS, Cazzaniga P, Tangherloni A, Besozzi D. Graphics processing units in bioinformatics, computational biology and systems biology. Brief Bioinform. 2017;18:870–85.
PubMed Google Scholar
Andrianakis I, Vernon IR, McCreesh N, McKinley TJ, Oakley JE, Nsubuga RN, et al. Bayesian history matching of complex infectious disease models using emulation: a tutorial and a case study on HIV in Uganda. PLoS Comput Biol. 2015;11:e1003968.
Article PubMed PubMed Central Google Scholar
Ellis AG, Iskandar R, Schmid CH, Wong JB, Trikalinos TA. Active learning for efficiently training emulators of computationally expensive mathematical models. Stat Med. 2020;39:3521–48.
Article PubMed Google Scholar
Gulati R, Gore JL, Etzioni R. Comparative Effectiveness of Alternative Prostate-Specific Antigen–Based Prostate Cancer Screening Strategies. Ann Intern Med. 2013;158:145.
Article PubMed PubMed Central Google Scholar
Abraham JM. Using microsimulation models to inform U.S. health policy making. Health Serv Res. 2013;48(2 Pt 2):686–95.
Article PubMed PubMed Central Google Scholar
Lorscheid I, Heine B-O, Meyer M. Opening the ‘black box’ of simulations: increased transparency and effective communication through the systematic design of experiments. Comput Math Organ Theory. 2012;18:22–62.
Article Google Scholar
Blanning RW. The construction and implementation of metamodels. Simulation. 1975;24:177–84.
Article Google Scholar
Conti S, Gosling JP, Oakley JE, O’Hagan A. Gaussian process emulation of dynamic computer codes. Biometrika. 2009;96:663–76.
Article Google Scholar
De Foo C, Shrestha P, Wang L, Du Q, García Basteiro AL, Abdullah AS, et al. Integrating tuberculosis and noncommunicable diseases care in low- and middle-income countries (LMICs): a systematic review. PLoS Med. 2022;19(1):e1003899.
Article PubMed PubMed Central Google Scholar
Rohwer A, Uwimana Nicol J, Toews I, Young T, Bavuma CM, Meerpohl J. E ffects of integrated models of care for diabetes and hypertension in low-income and middle-income countries: a systematic review and meta-analysis. BMJ Open. 2021;11:1–9.
Article Google Scholar
Vorkoper S, Kupfer LE, Anand N, Patel P, Beecroft B, Tierney WM, et al. Building on the HIV chronic care platform to address noncommunicable diseases in sub-Saharan Africa: a research agenda. Aids. 2018;32(Suppl 1):S107–13.
Article PubMed Google Scholar
Kemp CG, Weiner BJ, Sherr KH, Kupfer LE, Cherutich PK, Wilson D, et al. Implementation science for integration of HIV and non-communicable disease services in sub-Saharan Africa: a systematic review. Aids. 2018;32:S93–105.
Article PubMed Google Scholar
Adeyemi O, Lyons M, Njim T, Okebe J, Birungi J, Nana K, et al. Integration of non-communicable disease and HIV/AIDS management: a review of healthcare policies and plans in East Africa. BMJ Glob Heal. 2021;6(5):e004669.
Article Google Scholar
Kasaie P, Weir B, Schnure M, Dun C, Pennington J, Teng Y, et al. Integrated screening and treatment services for HIV, hypertension and diabetes in Kenya: assessing the epidemiological impact and cost-effectiveness from a national and regional perspective. J Int AIDS Soc. 2020;23(Suppl 1):e25499.
Article PubMed PubMed Central Google Scholar
Achwoka D, Mutave R, Oyugi JO, Achia T. Tackling an emerging epidemic: the burden of non-communicable diseases among people living with hiv/aids in sub-Saharan Africa. Pan Afr Med J. 2020;36:1–9.
Article Google Scholar
Wanni Arachchige Dona S, Bohingamu Mudiyanselage S, Watts JJ, Sweeney R, Coghlan B, Majmudar I, et al. Added socioeconomic burden of non-communicable disease on HIV/AIDS affected households in the Asia Pacific region: a systematic review. Lancet Reg Health West Pac. 2021;9:100111.
Article PubMed PubMed Central Google Scholar
Hyle EP, Mayosi BM, Middelkoop K, Mosepele M, Martey EB, Walensky RP, et al. The association between HIV and atherosclerotic cardiovascular disease in sub-Saharan Africa: a systematic review. BMC Public Health. 2017;17:1–15.
Article Google Scholar
Patel P, Rose CE, Collins PY, Nuche-Berenguer B, Sahasrabuddhe VV, Peprah E, et al. Noncommunicable diseases among HIV-infected persons in low-income and middle-income countries: a systematic review and meta-analysis. AIDS. 2018;32:S5–20.
Article PubMed Google Scholar
Rasmussen CE, Williams CKI. Gaussian processes for machine learning. MIT Press; 2018.
Google Scholar
Sacks J, Welch WJ, Mitchell TJ, Wynn HP. Design and analysis of computer experiments. Stat Sci. 1989;4(4):409–23.
Google Scholar
Coetzee L, Bogler L, De Neve JW, Bärnighausen T, Geldsetzer P, Vollmer S. HIV, antiretroviral therapy and non-communicable diseases in sub-Saharan Africa: empirical evidence from 44 countries over the period 2000 to 2016. J Int AIDS Soc. 2019;22(7):e25364.
Article PubMed PubMed Central Google Scholar
Bloomfield GS, Khazanie P, Morris A, Rabadán-Diehl C, Benjamin LA, Murdoch D, et al. HIV and noncommunicable cardiovascular and pulmonary diseases in low-and middle-income countries in the art era: what we know and best directions for future research. J Acquir Immune Defic Syndr. 2014;67(Suppl 1):S40–53.
Article PubMed PubMed Central Google Scholar
Osetinsky B, Hontelez JAC, Lurie MN, McGarvey ST, Bloomfield GS, Pastakia SD, et al. Epidemiological and health systems implications of evolving HIV and hypertension in South Africa and Kenya. Health Aff. 2019;38:1173–81.
Article Google Scholar
Bakker R, Korenromp E, Meester E, Van Der Ploeg C, Voeten H, Van Vliet C, et al. STDSIM: a microsimulation model for decision support in the Control of HIV and other STDs. Am Sex Transm Dis Assoc. 2000;27:652.
Article Google Scholar
Hontelez JAC, Lurie MN, Bärnighausen T, Bakker R, Baltussen R, Tanser F, et al. Elimination of HIV in South Africa through expanded access to antiretroviral therapy: a model comparison study. PLoS Med. 2013;10:e1001534.
Article PubMed PubMed Central Google Scholar
Kenya Ministry of Health. STEPwise Survey for Non CommuniKenyacable Diseases Risk Factors 2015 Report. Kenya STEPwise Surv Non Commun Dis Risk Factors 2015 Rep, vol. 5. Kenya Ministry of Health; 2015. p. 8–210.
Google Scholar
National AIDS and STI Control Programme (NASCOP) K. Kenya AIDS Indicator survey 2012: final report. Nairobi; 2014. https://nsdcc.go.ke/wp-content/uploads/2015/10/KAIS-2012.pdf. Accessed 26 Sep 2023
National AIDS Control Council. Kenya HIV County profiles. 2016. https://nsdcc.go.ke/wp-content/uploads/2016/12/Kenya-HIV-County-Profiles-2016.pdf. Accessed 26 Sep 2023.
Google Scholar
Smit M, Olney J, Ford NP, Vitoria M, Gregson S, Vassall A, et al. The growing burden of noncommunicable disease among persons living with HIV in Zimbabwe. AIDS. 2018;32:773–82.
Article PubMed Google Scholar
Smit M, Perez-Guzman PN, Mutai KK, Cassidy R, Kibachio J, Kilonzo N, et al. Mapping the current and future noncommunicable disease burden in Kenya by human immunodeficiency virus status: a modeling study. Clin Infect Dis. 2020;71:1864–73.
Article PubMed Google Scholar
UNAIDS. On the Fast-Track to end AIDS 2016–2021 Strategy. p. 2015. https://www.unaids.org/sites/default/files/media_asset/20151027_UNAIDS_PCB37_15_18_EN_rev1.pdf. Accessed 26 Sep 2023
Hashiguchi L, Achoki T, Alam U, Fullman N. The Global Burden of Disease: Generating Evidence, Guiding Policy. Nairobi, Kenya. Institute for Health Metrics and Evaluation and the International Centre for Humanitarian Affairs. 2016. https://www.healthdata.org/research-analysis/library/global-burden-disease-generating-evidence-guidingpolicy-kenya. Accessed 26 Sep 2023.
Erickson C. A guide to the GauPro R package. 2023. https://cran.r-project.org/web/packages/GauPro/vignettes/GauPro.html. Accessed 26 Sep 2023.
Google Scholar
Gabry J, Simpson D, Vehtari A, Betancourt M, Gelman A. Visualization in Bayesian workflow. J R Stat Soc Ser A Stat Soc. 2019;182:389–402.
Article Google Scholar
Magnusson M, Andersen MR, Jonasson J, Vehtari A. Leave-one-out cross-validation for Bayesian model comparison in large data. Proc Mach Learn Res. 2020; arXiv:2001.00980
Vehtari A, Gelman A, Gabry J. Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Stat Comput. 2017;27:1413–32.
Article Google Scholar
Vernon I, Goldsteiny M, Bowerz RG. Galaxy formation: a Bayesian uncertainty analysis. Bayesian Anal. 2010;05(04):619–70.
Google Scholar
Swiler LP, Gulian M, Frankel AL, Safta C, Jakeman JD. A survey of constrained Gaussian process regression: approaches and implementation challenges. J Mach Learn Model Comput. 2020;1:119–56.
Article Google Scholar
Al-taweel Y. Diagnostics and simulation-based methods for validating Gaussian process diagnostics and simulation-based methods for validating Gaussian process emulators. PhD Thesis. University of Sheffield; 2018. https://etheses.whiterose.ac.uk/19873/
Google Scholar
Lynch SM. Bayesian Statistics. Encycl Soc Meas. 2005:135–44.
Shao Z, Er MJ, Wang N. An Efficient Leave-One-Out Cross-Validation-Based Extreme Learning Machine (ELOO-ELM) With Minimal User Intervention. IEEE Trans Cybern. 2016;46:1939–51.
Article PubMed Google Scholar
Morrison RE, Bryant CM, Terejanu G, Prudhomme S, Miki K. Data partition methodology for validation of predictive models. Comput Math Appl. 2013;66:2114–25.
Article Google Scholar
Beddows AV, Kitwiroon N, Williams ML, Beevers SD. Emulation and sensitivity analysis of the community multiscale air quality model for a UK ozone pollution episode. Environ Sci Technol. 2017;51:6229–36.
Article PubMed CAS Google Scholar
Iskauskas A. Emulation and History Matching Handbook. https://cran.r-project.org/web/packages/hmer/vignettes/emulationhandbook.html Accessed 03 May 2023
Reiker T, Golumbeanu M, Shattock A, Burgert L, Smith TA, Filippi S, et al. Emulator-based Bayesian optimization for efficient multi-objective calibration of an individual-based model of malaria. Nat Commun. 2021;12:7212.
Article PubMed PubMed Central CAS Google Scholar
Smith T, Killeen GF, Maire N, Ross A, Molineaux L, Tediosi F, et al. Mathematical modeling of the impact of malaria vaccines on the clinical epidemiology and natural history of plasmodium falciparum malaria: overview. Am J Trop Med Hyg. 2006;75(Suppl 2):1–10.
Article PubMed Google Scholar
Oyebamiji OK, Wilkinson DJ, Jayathilake PG, Curtis TP, Rushton SP, Li B, et al. Gaussian process emulation of an individual-based model simulation of microbial communities. J Comput Sci. 2017;22:69–84.
Article Google Scholar
Hamilton DT, Agutu C, Sirengo M, Chege W, Goodreau SM, Elder A, et al. Modeling the impact of different PrEP targeting strategies combined with a clinic-based HIV-1 nucleic acid testing intervention in Kenya. Epidemics. 2023;44:100696.
Article PubMed CAS Google Scholar
Coll P, Jarrín I, Martínez E, Martínez-Sesmero JM, Domínguez-Hernández R, Castro-Gómez A, et al. Achieving the UNAIDS goals by 2030 in people living with HIV: A simulation model to support the prioritization of health care interventions. Enfermedades Infecc y Microbiol Clin. 2023;41(10):589–95.
Article Google Scholar
Gouda HN, Charlson F, Sorsdahl K, Ahmadzada S, Ferrari AJ, Erskine H, et al. Burden of non-communicable diseases in sub-Saharan Africa, 1990–2017: results from the global burden of disease study 2017. Lancet Glob Heal. 2019;7(10):E1375–87.
Article Google Scholar
Schnure M, Dowdy D. In: Bae K-H, Feng B, Kim S, Lazarova-Molnar S, Zheng Z, Roeder T, Thiesing R, editors. Proceedings of the 2020 Winter Simulation Conference. IEEE; 2020. p. 980–91.
Chapter Google Scholar
Haacker M, Bärnighausen T, Atun R. HIV and the growing health burden from noncommunicable diseases in Botswana: modelling study. J Glob Health. 2019;9(1):010428. https://doi.org/10.7189/jogh.09.010428.
Article PubMed PubMed Central Google Scholar
Nianogo RA, Arah OA. Forecasting obesity and type 2 diabetes incidence and burden: the ViLA-obesity simulation model. Front Public Health. 2022;10 https://doi.org/10.3389/fpubh.2022.818816.
Lymer S, Schofield D, Lee CMY, Colagiuri S. NCDMod: a microsimulation model projecting chronic disease and risk factors for Australian adults. Int J Microsimulation. 2015;9:103–39.
Article Google Scholar
Yarnoff B, Honeycutt A, Bradley C, Khavjou O, Bates L, Bass S, et al. Validation of the prevention impacts simulation model (PRISM). Prev Chronic Dis. 2021;18:200225.
Article Google Scholar
Fain BG, Dobrovolny HM. GPU acceleration and data fitting: agent-based models of viral infections can now be parameterized in hours. J Comput Sci. 2022;61:101662.
Article Google Scholar
Rasmussen CE. Gaussian Processes in Machine Learning. Bousquet, O., von Luxburg, U., Rätsch, G. (eds) Advanced Lectures on Machine Learning. ML 2003. Lecture Notes in Computer Science, vol 3176. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-28650-9_4.
O’Hagan A. Bayesian analysis of computer code outputs: a tutorial. Reliab Eng Syst Saf. 2006;91(10–11):1290–300. https://doi.org/10.1016/j.ress.2005.11.025.
Article Google Scholar
Thiagarajan JJ, Venkatesh B, Anirudh R, Bremer P-T, Gaffney J, Anderson G, et al. Designing accurate emulators for scientific processes using calibration-driven deep models. Nat Commun. 2020;11:5622.
Article PubMed PubMed Central CAS Google Scholar
López-Lopera AF, Bachoc F, Durrande N, Rohmer J, Idier D, Roustant O. Approximating Gaussian Process Emulators with Linear Inequality Constraints and Noisy Observations via MC and MCMC. In: Tuffin B, L'Ecuyer P, editors. Monte Carlo and Quasi-Monte Carlo Methods. MCQMC 2018. Springer Proceedings in Mathematics & Statistics, vol 324. Cham: Springer; 2020. https://doi.org/10.1007/978-3-030-43465-6_18.
Mayala BK, Bhatt S, Gething P. Predicting HIV/AIDS at Subnational Levels using DHS Covariates related to HIV. DHS Spatial Analysis Reports No. 18. Rockville, Maryland, USA: ICF; 2020.
Google Scholar
Chen Z, Fan J, Wang K. Remarks on multivariate Gaussian Process. 2020. arXiv:2010.09830.
Google Scholar
Craig PS, Goldstein M, Seheult AH, Smith JA. Pressure matching for hydrocarbon reservoirs: a case study in the use of Bayes linear strategies for large computer experiments; 1997.
Google Scholar
Bastos LS, O’Hagan A. Diagnostics for Gaussian process emulators. Technometrics. 2009;51:425–38.
Article Google Scholar
Revill P, Rangaraj A, Makochekanwa A, Mpofu A, Ciaranello AL, Jahn A, et al. Perspectives on the use of modelling and economic analysis to guide HIV programmes in sub-Saharan Africa. Lancet HIV. 2022;3018:1–4.
Google Scholar
Kupfer LE, Beecroft B, Viboud C, Wang X, Brouwers P. A call to action: strengthening the capacity for data capture and computational modelling of HIV integrated care in low- and middle-income countries. Journal of the international AIDS. Society. 2020;23(Suppl 1):e25475.
Google Scholar
Genberg BL, Wachira J, Steingrimsson JA, Pastakia S, Tran DNT, Said JA, et al. Integrated community-based HIV and non-communicable disease care within microfinance groups in Kenya: study protocol for the Harambee cluster randomised trial. BMJ Open. 2021;11:e042662.
Article PubMed PubMed Central Google Scholar
Kibachio J, Mwenda V, Ombiro O, Kamano JH, Perez-Guzman PN, Mutai KK, et al. Recommendations for the use of mathematical modelling to support decision-making on integration of non-communicable diseases into HIV care. J Int AIDS Soc. 2020;23(Suppl 1):e25505.
Article PubMed PubMed Central Google Scholar
Osetinsky B, Mwangi A, Pastakia S, Wilson-Barthes M, Kimetto J, Rono K, et al. Layering and scaling up chronic non-communicable disease care on existing HIV care systems and acute care settings in Kenya: a cost and budget impact analysis. J Int AIDS Soc. 2020;23:e25496.
Article PubMed PubMed Central Google Scholar

Download references

Acknowledgements

Not applicable.

Funding

Research reported in this publication was supported by the National Institute of Mental Health of the National Institutes of Health under award number R01MH118075, and by the Civilian Research and Development Foundation (CRDF) Global under award number R-202104-67710. Approximately 50 % of this research was financed with Federal money. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health or CRDF Global.

Author information

Authors and Affiliations

African Center of Excellence in Data Science, University of Rwanda, Kigali, Rwanda
Sharon Jepkorir Sawe
Academic Model Providing Access to Healthcare, Eldoret, Kenya
Richard Mugo, Ann Mwangi & Omar Galárraga
Department of Epidemiology, Brown University School of Public Health, Providence, RI, USA
Marta Wilson-Barthes
Department of Epidemiology and Public Health, Swiss Tropical and Public Health Institute, Basel, Switzerland
Brianna Osetinsky
Department of Biostatistics, Brown University School of Public Health, Providence, RI, USA
Stavroula A. Chrysanthopoulou
Department of Health Policy Management & Human Nutrition, Moi University School Public Health, Eldoret, Kenya
Faith Yego
Department of Mathematics, Physics & Computing, School of Science and Aerospace Studies, Moi University, Eldoret, Kenya
Ann Mwangi
Department of Health Services, Policy and Practice, and International Health Institute, Brown University School of Public Health, Providence, RI, USA
Omar Galárraga

Authors

Sharon Jepkorir Sawe
View author publications
You can also search for this author in PubMed Google Scholar
Richard Mugo
View author publications
You can also search for this author in PubMed Google Scholar
Marta Wilson-Barthes
View author publications
You can also search for this author in PubMed Google Scholar
Brianna Osetinsky
View author publications
You can also search for this author in PubMed Google Scholar
Stavroula A. Chrysanthopoulou
View author publications
You can also search for this author in PubMed Google Scholar
Faith Yego
View author publications
You can also search for this author in PubMed Google Scholar
Ann Mwangi
View author publications
You can also search for this author in PubMed Google Scholar
Omar Galárraga
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

OG, SC, AM and FY led the conception and design of the work. SJS led data analysis and interpretation and wrote the first draft of the manuscript. RM, MWB and BO contributed to data analysis and to the conception and design of the work. All authors revised the work critically for important intellectual content, provided final approval of the version to be published, and agree to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

Corresponding author

Correspondence to Omar Galárraga.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article

Sawe, S.J., Mugo, R., Wilson-Barthes, M. et al. Gaussian process emulation to improve efficiency of computationally intensive multidisease models: a practical tutorial with adaptable R code. BMC Med Res Methodol 24, 26 (2024). https://doi.org/10.1186/s12874-024-02149-x

Download citation

Received: 27 January 2023
Accepted: 11 January 2024
Published: 27 January 2024
DOI: https://doi.org/10.1186/s12874-024-02149-x

Gaussian process emulation to improve efficiency of computationally intensive multidisease models: a practical tutorial with adaptable R code

Abstract

Background

Methods

Results

Conclusions

Background

The need for emulation

Methods: emulator development and validation

Simulator description and source data

Overview of the emulation process

Step 1. Install the program

Step 2. Select key design points from simulation model(s)

Step 3. Fit Gaussian processes to simulation data

Step 4. Run graphical posterior predictive checks of emulator fit

Step 5. Validate the emulator’s accuracy

Results: emulator interpretation

Discussion

Observed accuracy and efficiency

Future applications

Limitations and strengths of this tutorial

Conclusion

Availability of data and materials

Notes

Abbreviations

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Ethics approval and consent to participate

Consent for publication

Competing interests

Additional information

Publisher’s Note

Supplementary Information

Additional file 1.

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Medical Research Methodology

Contact us