 Research
 Open access
 Published:
Gaussian process emulation to improve efficiency of computationally intensive multidisease models: a practical tutorial with adaptable R code
BMC Medical Research Methodology volume 24, Article number: 26 (2024)
Abstract
Background
The rapidly growing burden of noncommunicable diseases (NCDs) among people living with HIV in subSaharan Africa (SSA) has expanded the number of multidisease models predicting future care needs and health system priorities. Usefulness of these models depends on their ability to replicate reallife data and be readily understood and applied by public health decisionmakers; yet existing simulation models of HIV comorbidities are computationally expensive and require large numbers of parameters and long run times, which hinders their utility in resourceconstrained settings.
Methods
We present a novel, userfriendly emulator that can efficiently approximate complex simulators of longterm HIV and NCD outcomes in Africa. We describe how to implement the emulator via a tutorial based on publicly available data from Kenya. Emulator parameters relating to incidence and prevalence of HIV, hypertension and depression were derived from our own agentbased simulation model and other published literature. Gaussian processes were used to fit the emulator to simulator estimates, assuming presence of noise for design points. Bayesian posterior predictive checks and leaveoneout cross validation confirmed the emulator’s descriptive accuracy.
Results
In this example, our emulator resulted in a 13fold (95% Confidence Interval (CI): 8–22) improvement in computing time compared to that of more complex chronic disease simulation models. One emulator run took 3.00 seconds (95% CI: 1.65–5.28) on a 64bit operating system laptop with 8.00 gigabytes (GB) of Random Access Memory (RAM), compared to > 11 hours for 1000 simulator runs on a highperformance computing cluster with 1500 GBs of RAM. Pareto k estimates were < 0.70 for all emulations, which demonstrates sufficient predictive accuracy of the emulator.
Conclusions
The emulator presented in this tutorial offers a practical and flexible modelling tool that can help inform health policymaking in countries with a generalized HIV epidemic and growing NCD burden. Future emulator applications could be used to forecast the changing burden of HIV, hypertension and depression over an extended (> 10 year) period, estimate longerterm prevalence of other cooccurring conditions (e.g., postpartum depression among women living with HIV), and project the impact of nationallyprioritized interventions such as national health insurance schemes and differentiated care models.
Background
The need for emulation
In situations where empirical data on disease impact(s) are not universally available or a randomized controlled trial may not be feasible, complex mathematical computer models (referred to as “simulators”) [1,2,3] can characterize disease prevalence, forecast incidence, and help identify costeffective approaches for meeting short and longterm health system needs [4,5,6]. Though simulation models play a critical role in helping public health decisionmakers synthesize data from multiple sources and compare anticipated outcomes over time, the limitations of these models are not trivial. Simulation models based on central processing units (CPUs), such as grid computing and computing clusters, [7] require significant infrastructure that can incur high costs for hardware and oversight. Even with access to highcomputing infrastructure, long run times of several hours for a single simulation and large numbers of input and output parameters can greatly inhibit fitting such models, [8] and in turn restrict analysts to considering only a subset of all possible simulated scenarios [9, 10]. Furthermore, because of their complexity, components of microsimulation models can still be perceived as a black box [11, 12] because their functions and behaviors are often not exhaustively described or immediately accessible at the time of publication, all of which makes it difficult for external users to interpret and adapt model processes for their local context.
Emulators are one tool that can help mitigate these limitations [9]. An emulator, also known as a metalmodel, [13] is an approximation of one or more complex mathematical model(s) that is constructed using a training sample of simulator runs [14] and computationally more efficient. Emulators reduce costs by negating the need for supercomputing infrastructure and, once developed, can substantially shorten the amount of time needed to implement model runs and interpret results.
In the last decade there has been an accelerated demand for integrated responses [15,16,17,18,19,20] to the growing burden of noncommunicable diseases (NCDs) – including cardiovascular disease, cancers, diabetes, and mental illness – among people living with HIV (PLWH) in low and middle income countries (LMICs) [21,22,23,24]. In response to this call, the authors of this paper recently extended an established agentbased model of HIV transmission and treatment impact to include hypertension in two rural settings in SubSaharan Africa [24]. The authors’ simulation model was able to generate robust estimates of changing risks across age groups and predict growing population burdens of HIV and hypertension as comorbidities; however, the simulations were resource and time intensive and it is unlikely that novice modelers would be able to adapt the model’s components without input from an expert biostatistician. Other simulators of HIV and noncommunicable diseases have met the same challenges [25]. To facilitate a greater understanding and usability of these complex models, we therefore share our experience developing an opensource emulator that approximates estimates from two simulators over an input subdomain of parameters related to HIV, hypertension and depression in a SubSaharan African (SSA) country with a generalized HIV epidemic. The tutorial presented in this paper (i) describes the steps involved in emulator development and validation, (ii) illustrates how to interpret the emulator’s predictive accuracy and outputs in relation to those from each simulator using case study data from Kenya, and (iii) provides annotated, opensource and adaptable R code to facilitate the use of the discussed methods in practice. We expect researchers with a basic understanding of Bayesian statistics and some familiarity with R software to be able to implement this protocol independently.
The emulation method described in this paper relies on wellestablished and validated Gaussian processes [14, 25, 26]. To the best of our knowledge, Gaussianprocess emulation has not yet been used to mimic simulation models that predict the burden of HIVcomorbidities over time, nor has it been described in sufficient detail to enable use by nonbiostatisticians. Thus, this tutorial is scientifically significant in that is uses a didactic approach to demystify Gaussian process emulation methods, and offers a new tool that can potentially improve public health decisionmaking with less resources.
Methods: emulator development and validation
Simulator description and source data
Evidence overwhelming indicates that the disproportionate burden of noncommunicable diseases among people living with HIV – compared to individuals not living with HIV – will increase rapidly in the coming decades, [27, 28] and that most health systems in SubSaharan Africa are not currently equipped to treat the more than 15 million patients who require integrated HIV and NCD care [19]. In this tutorial we focus on two published simulation models that have estimated the future burden of HIV and noncommunicable diseases in Kenya, as well as the costs and epidemiological impact of strengthening integrated care systems in the country. The two simulators were selected as examples for this tutorial because of their longerterm (i.e., > = 10 year) forecast periods and available details surrounding their design points. The two simulators were also selected because the rise of noncommunicable diseases among persons living with HIV in Kenya is indicative and representative of the rise of NCDs on the continent, which are estimated to overtake infectious diseases worldwide by 2030 [19, 27, 28].
The first model is the authors’ Integrated Modeling of Epidemiologic and Economic Longterm Outcomes in Africa (inMODELA) microsimulation model, [29] which simulates HIV and hypertension in Kenya and South Africa from 2018 to 2028. The model is an extension of the Sexually Transmitted Diseases Simulator (STDSIM), [30, 31] a stochastic agentbased model that simulates transmission of HIV and other sexually transmitted diseases (STDs) through dynamic sexual networks. inMODELA was partly calibrated using population surveillance data of hypertension and HIV from western Kenya. Nationallevel data on the hypertension prevalence were extracted from the 2015 Kenya STEPwise Approach to NCD Risk Factor Surveillance (STEPS) survey [32] while HIV modelling was calibrated using reports from 2007 and 2012 Kenya AIDS Indicator Surveys (KAIS) and the 2016 Kenya County HIV profiles, [33, 34] the most recent national data available at the time of simulator development. HIV was modelled as having four stages: early infection, asymptomatic, symptomatic and AIDS, and treatment with antiretroviral therapy was operationalized as individual ART demand and health system capacity to meet ART demand. Hypertension was modelled as being normotensive or hypertensive (i.e., blood pressure > 140/90 mmHg), accounting for the potential effects of age, gender, and economic development on hypertension risk. Key outputs of the inMODELA model include total annual mortality, incidence, and prevalence, as well as the health system burden of hypertension, HIV, and comorbid HIV and hypertension. Additional details of the inMODELA simulator are available in a separate publication [29].
The second model in this tutorial is an individualbased simulator initially developed by Smit et al. for Zimbabwe [35] and adapted for Kenya [36]. The model estimates current and future births, deaths, HIV disease and treatment, as well as prevalence/incidence of cardiovascular disease, chronic kidney disease, depression, diabetes, hypertension, and other NCDs and cancers among adults for the period of 2018 through 2035. Simulator calibration relied on data from the Joint United Nations Programme on HIV/AIDS, [37] 2016 Global Burden of Disease estimates [38] and other sources [36].
Relevant input parameters used to calibrate the two simulation models are summarized in Supplementary Table 1.
Overview of the emulation process
Figure 1 summarizes the steps that were used to develop and validate an emulator to approximate epidemiological outputs projected over 10 years by the multidisease simulators. To develop the emulator, we first abstract relevant parameters from the more complex simulation models to serve as the emulator’s design points. In this example, parameters related to the prevalence of HIV, hypertension, comorbid HIV and hypertension, and depression among PLWH were selected. Prevalence parameters were ascertained for 2018 and for 2028. Second, we use Gaussian processes (GP) to approximate the mean and variance of each simulator’s outputs, assuming presence of noise for our design points. Third, we use Bayesian posterior predictive analysis to analyse the credibility of future emulator predictions based on the posterior distribution. Lastly, we use leaveoneout cross validation to confirm the emulator’s predictive accuracy and compare emulator estimates to simulation results.
Step 1. Install the program
This tutorial uses the GauPro package [39], the rstanarm package [40] and the loo package [41, 42] in R version 4.1.2 for emulator construction and application. To replicate results from this tutorial, or to adapt this emulator to new simulation data, the latest version of the free R software environment needs to be installed, and can be downloaded from https://www.rproject.org/.
Step 2. Select key design points from simulation model(s)
Not all parameters of a complex simulation model will be needed for emulation. Only those most informative for your research question should be ascertained. In this example using simulator data from Kenya, parameters relating to the prevalence of HIV, hypertension, and comorbid HIV and hypertension were ascertained from the inMODELA simulation model [29] for the period of 2018 through 2028. Parameters relating to the prevalence of depression among people living with HIV were ascertained from the Smit et al. model [36] for the period of 2018 through 2030. (Table 1) To reconcile the different forecasting periods used by the two simulation models, we assumed the prevalence of depression in 2028 to be the average prevalence for years 2025 and 2030. Supplementary Table 2 specifies the design points used to emulate annual HIV and NCD prevalence for the period of 2018 through 2028.
Update the emulator’s R code to reflect relevant design points and corresponding emulation time period(s) abstracted from simulator data.
Step 3. Fit Gaussian processes to simulation data
A Gaussian process refers to a collection of any finite number of random variables which have a Gaussian (normal) distribution [25]. A GP emulator uses a statistical model to fit a Gaussian process to a dataset, and is defined by (1) a mean function describing the mean at any point of the input space and (2) a covariance function describing the covariance between points [25, 39]. When emulating a stochastic simulator, the unknown function is assumed to be the expectation of the i^{th} simulator output denoted as f_{i}(x).
We construct our GP emulator such that, for each simulator output f_{i}(x), we select active variables (x^{A}) and then emulate using the following process:
The first part of the emulator, \(\sum_{i=1}^q{\beta}_i{g}_i\left({x}^A\right),\) is a polynomial of active inputs x^{A} running from i = 1, …q chosen from the simulator outputs, β_{i} are the regression coefficients, and g_{i}(˙) are the deterministic functions of x^{A} which are known [43]. The second part of the emulator, μ_{i}(x^{A}), represents a collection of any finite number of random variables which have a Gaussian distribution [43]. Supposing that μ_{i}(x^{A}) is a Gaussian process with zero mean and known variance, we can define it as:
where c(x^{A}, x^{A′}) is a covariance function that determines the relationship between μ(x^{A}) and μ(x^{A})^{′} based on the distance between x^{A} and x^{A′}. The Gaussian process used to develop the current emulator has a covariance structure given by:
The parameter σ in eq. 3 can be varied to obtain the desired amount of waves in the emulator, whereby a smaller value of σ results in less extreme waves [44]. θ > 0 are unknown correlation length parameters where large values of θ indicate a smooth output function of the i^{th} input and small values suggest high nonlinearity [45]. The last part of the emulator, δ_{i}(x), models the effects of inactive variables as random noise.
For each output of interest, the emulator provides the expectation E [f_{i}(x)] and variance var(f_{i}(x)) at x for every output given by i = 1, 2, …n where x denotes a vector of the emulator inputs. We evaluate f_{i}(x) as the prevalence value(s) projected from 2018 to 2028 for HIV, hypertension, comorbid HIV and hypertension, and depression. The results are gathered into a vector D: \({D}_i=\Big(f{\left({x}_1^A,\dots, f\left({x}_q^A\right)\right)}^T\) where q represents the number of design points. The emulator is then plotted as an adjusted expectation and variance function of f_{j}(x): E_{D}(f_{j}(x)) and Var_{D}(f_{j}(x)).
Step 4. Run graphical posterior predictive checks of emulator fit
In this step, we use graphical displays to check that the disease burden trends produced by our emulator look similar to the simulator data we observed. Our emulator uses Bayesian posterior predictive analysis to assess the credibility of future observable data based on the posterior distribution [46]. We analyze the Bayesian posterior distributions graphically to check predictive accuracy of the posterior distributions, plotting simulated y values from the posterior distribution against the actual values of y. The posterior predictive distribution is defined as:
where y^{rep} represents future data that could be drawn from the posterior distribution, y is the current simulator data, θ is the model parameter, and p(y θ) is the sampling density for future data conditional on the parameter [46]. The size of the posterior sample was based on 50 draws which showed to be sufficient enough to generate the y^{rep} matrix from the posterior predictive distribution [40].
Step 5. Validate the emulator’s accuracy
We use the Bayesian leaveoneout cross validation (LOOCV) technique to confirm the emulator’s predictive accuracy. The LOOCV method applies the loglikelihood evaluation of posterior parameter values, [42] and is appropriate for smaller data sets and when similar distributions exist in the training and testing data [47]. LOOCV assesses the predictive ability of posterior simulations in which the data is iteratively partitioned into either calibration (training) sets or validation (test) sets. LOOCV is one of the most accurate ways to estimate how well a model will perform on unseen, “outofsample” data [42]. The calibration set is used to train the model and produce output values that are compared with the test set [48]. We validated the emulator’s accuracy using the loo package in R whereby the computation uses Paretosmoothed importance sampling (PSIS) to regularize importance weights [42]. Following prior work [42, 49], a Pareto k estimate less than or equal to a 0.70 threshold was used to confirm reliability of the emulator’s performance. The LOOCV approach is specified in eqs. 5–7.
Given that data y_{1}, …y_{n} are independently modelled given the parameters θ then, \(p\left(y\theta \right)={\prod}_i^np\left({y}_i\theta \right)\). Suppose a prior distribution p(θ) gives a posterior distribution p(θ y) and a posterior predictive distribution p(y˜ y) = ∫ p(y˜_{i}θ)p(θ y)dθ. A predictive accuracy measure for n data points termed as expected log pointwise predictive density (elpd) for a new dataset is given as:
where p_{t}(y˜_{i}) is the distribution of the real data generation for y˜_{i}. The values of p_{t}(y˜_{i}) are unknown and therefore crossvalidation is used to approximate (5). The Bayesian leaveoneout (LOO) estimate of the predictive fit is:
where
is the leaveoneout predictive density given the data without the i^{th} data point.
To train the emulator, we followed the Emulation and History Matching Handbook for R [50] whereby we chose a number of training points equal to ten times the number of parameters. Given that we were estimating only one parameter (prevalence) for each of the four outcomes of interest, this resulted in 10 training points.
Results: emulator interpretation
Using outputs from the two simulation models of HIV, hypertension and depression burden, our emulator shows to be as accurate and computationally more efficient at predicting prevalence of these comorbidities in Kenyan populations over 10years. On average, the inMODELA simulator took 11 hours and 53 minutes to perform 1000 runs on a highperformance computing cluster (HPC) and used 208 central processing units (CPU) cores, 2 graphics processing units (GPUs), and 1500 gigabytes (GBs) of randomaccess memory (RAM). The model by Smit et al. was coded in C++, ran using Xcode (Mac), and took a little over 3 minutes per iteration on an 8core machine. Although run times were not overwhelmingly long, the Smit et al. simulation model faced computational constraints due to the large size of model outputs; this necessitated an additional postprocessing step of aggregating and summarizing each batch of 10 model iterations in MatLab. By comparison, the present emulator was developed and validated on a HewlettPackard Intel Core i5 laptop with a 64bit operating system, 2.50 CPU of 2.50 GHz and 8.00 GBs of RAM, and took only a few seconds for a single run using the same statistical program for development, validation and output interpretation.
Figure 2 shows the emulator’s projected mean prevalence over time for the example data in terms of: (A) HIV, (B) hypertension, (C) comorbid HIV & hypertension, and (D) depression among adults living with HIV. In each subfigure of Fig. 2, black points should be interpreted as the design points (i.e., simulator outputs) used to fit the emulator, dashed red lines show the mean prevalence plotted as a function of the emulator’s predictions, solid blue lines represent the 95% Confidence Intervals (CI) for the emulator’s mean estimates, and the yellow lines are the 95% Confidence Intervals for the simulators’ mean estimates (Confidence Intervals for the annual mean prevalence of depression among PLWH were not publically available). We can then visually understand the emulated trends in disease burden to closely reproduce the trends projected by each simulator. For example, the emulator predicts that the mean prevalence of HIV will be 0.02776 [95% CI: 0.02774–0.02778] in the year 2027 compared to 0.02750 [95% CI: 0.02379–0.03121] predicted by the inMODELA simulator. The pointwise confidence intervals used for statistical uncertainty quantification are wider between each emulation year than at each annual point estimate (e.g., at the midpoint between 2027 and 2028, the mean prevalence of HIV is predicted to be 0.02573 (95% CI: 0.02219, 0.02928). The width of the credibility interval is largest between points and approaches zero near each point estimate due to the Gaussian process regression towards the mean. When considering the prevalence(s) of HIV, hypertension, and comorbid HIV & hypertension in this example, the emulator’s 95% Confidence Intervals for each mean estimate are very similar to those projected by the inMODELA simulator. Given the linearity and lack of noise in the original data, the emulator’s uncertainty ranges are within those of the inMODELA estimates. This indicates that our emulator accurately approximates outputs from the more complex model.
Figure 3 shows the Gaussian posterior predictive distributions for the example Kenyan data for HIV, hypertension, comorbid HIV & hypertension, and depression. The solid black lines can be understood as the distribution of y; the dashed red lines represent the y posterior predictive distribution, and the dotted grey lines represent the distribution of the simulations performed. We observe from panel A, B, C and D in Fig. 3 that the posterior predictive distributions denoted by postpred y are not far off from the current fitted data denoted by y. Density values differ for each panel due to different prevalence values for the respective disease(s) on the xaxis, with greater imprecision for depression estimates (panel D) because of fewer available simulation data at each projection year.
For each output of interest, the LOOCV yielded Pareto k diagnostic values of < 0.70, indicating practical convergence rates and reliable Monte Carlo error estimates. Validating diagnostics for each of the emulator’s output are provided in Supplementary Figures 1–4.
Supplementary Figure 5a shows the emulator’s projected mean prevalence of HIV when the simulation model assumes that Kenya’s ART coverage targets are achieved (i.e., 90% of people with HIV are aware of their status, and, of those, 90% are enrolled in HIV care^{Footnote 1}) due to the implementation of effective interventions along the care continuum. Supplementary Figure 5b shows the emulator’s projected mean prevalence of hypertension when Kenya’s hypertension treatment targets are reached (i.e., when 50% of persons with confirmed hypertension are enrolled in care and receiving drug therapy and counselling). Supplementary Figures 5c and 5d show the projected disease prevalences when one or both of these treatment targets are achieved. Supplementary Figure 6 shows the Gaussian posterior predictive distributions for these emulations. When considering the prevalence(s) of HIV and hypertension in the presence of hypothetical interventions that help achieve treatment targets, the emulator’s 95% confidence intervals for each mean estimate closely approximate those projected by the inMODELA model. Similar to emulator outputs based on disease prevalence simulations in the absence treatment, the emulator’s uncertainty ranges for prevalence estimates in the presence of wider treatment uptake are within those of the inMODELA simulation estimates. This indicates that our emulator accurately approximates outputs from more complex modelling of the diseaseimpact of chronic disease intervention(s). We also see a substantial lack of density values for each panel for the respective disease(s) on the xaxis in Supplementary Figure 6, further indicating high precision of the emulator for estimating treatment impacts on chronic disease burden over time. For all emulations of disease burden in the presence of higher intervention uptake, the Pareto k diagnostic values were < 0.70 (Supplementary Figures 7–10).
Discussion
Observed accuracy and efficiency
Applying widelyused Gaussian processes, this tutorial details the steps used to develop a new, userfriendly emulator that can approximate multiyear estimates of chronic disease burden from two computationally intensive simulation models. We demonstrate that the emulator closely reproduced trends in disease burden projected by the published simulators from which our parameters were sourced [29, 36]. In this example, from 2018 through 2028, prevalence of HIV and depression among PLWH in Kenya is projected to decrease by approximately 2 and < 1 percentage points, respectively, with prevalence of hypertension increasing by 5 percentage points over the same 10year forecast. Successful validation checks (Pareto k estimates < 0.70 for all emulations) confirmed the emulator’s predictive accuracy. Disease prevalence estimates were modelled in only a few seconds on a 64bit operating system laptop with 2 CPU cores of 8 GBs of RAM using the emulator presented in this paper, while simulations [29] took more than 11 hours to perform 1000 runs on a highperformance computing cluster with 208 CPU cores, 2 GPUs and 1500 GBs of RAM. Thus, our emulator was able to more efficiently model disease burden over time without compromising the statistical accuracy of more computationally intensive simulators. Outputs from sensitivity analyses suggest that the emulator is equally efficient and reliable for approximating simulations of disease burden in the presence of effective treatment interventions.
Future applications
The emulator presented in this paper was developed and validated using 10year demographic and epidemiological case study data from Kenya. However, the broader Gaussian processes described in this tutorial (and made available via the open source R code) are widely validated in the fields of public and environmental health as reliable methods for emulating results of complex and resourceintensive models, [9, 43, 49] including for those above and beyond Kenyan populations and adults living with chronic conditions. For example, a tutorial using HIV data from Uganda [8] found that history matching and emulation of an 18 output simulator had a 65% probability of fitting all simulator outputs and was several orders of magnitude faster to evaluate. A Bayesian optimization emulator with Gaussian processes [51] has similarly shown to adequately capture the input–output relationship of the OpenMalaria individualbased model (IBM) [52] of malaria transmission while improving upon the simulator’s overall goodness of fit. And a Gaussian process emulator of an IBM of microbial communities [53] has demonstrated an approximately 220fold increase in computational efficiency, with the percentage of variance explained by the univariate emulator ranging from 83 to 99%. Thus, through the selection of alternative design points, the emulator in this tutorial has the potential to approximate other simulators outside of those for HIV, hypertension and diabetes. As is the goal with any emulation exercise, our emulator offers a statistical model that can be used as a surrogate for chronic disease simulators. Because our simple emulator showed to be valid and more efficient in mimicking 10year prevalence of HIV comorbidities in the absence of intervention, the next step is to evaluate how the emulator will also be able to mimic future simulators that model disease burden in the presence of targeted interventions. For example, Hamilton et al. developed an agentbased simulation of HIV1 transmission in Kenya to estimate the potential populationlevel impact of providing PrEP layered into standard care services over 10 years [54, 55]. These and other longitudinal simulator data offer ideal design points to further test the emulator’s predictive accuracy for modelling intervention effects on HIV treatment and prevention. Several microsimulation models have been developed to characterize and plan for the rapidly growing burden of noncommunicable diseases in SSA [56,57,58] and elsewhere [59,60,61]. Also, though still in early stages, recent advances in computing power are now allowing large and complex ABMs to be simulated in reasonable amounts of time using desktop GPUs [62]. Yet, to the best of our knowledge, none of these NCD models have been made available to local public health decision makers through accessible and understandable platforms, which the present emulator may begin to correct.
Limitations and strengths of this tutorial
There are some limitations to the emulation methods presented in this paper. First, Gaussian process emulation was appropriate given the small number of design points being considered in this tutorial. However, GP emulators do not scale well when including many (e.g., > 50) inputs [9, 63, 64], such that several lowerdimensional emulators might be more appropriate when a greater number of simulator outputs are being considered [9]. Relatedly, because our original simulation model data were essential linear, the emulator increased the efficiency of our predictions (i.e., narrower uncertainty ranges). It is possible that our emulator may lose computational efficiency or yield estimates with greater uncertainty when applied to higher dimensional data or to data with asymmetric noise. Incorporating interval calibration [65], Monte Carlo [66], and other methods in future iterations of the emulator could help address this limitation. Second, we fit separate univariate GPs emulators to model each simulator output individually, which neglects any potential correlation [67] between the outputs [53] (e.g., between changes in the prevalence of hypertension and depression among PLWH over time). Future expansions of the emulator can address this issue by using a multivariate Gaussian process [68]. Lastly, as is common to mathematical modelling work, the simulation models selected for this tutorial suffered from incomplete surveillance data which restricted our ability to perform additional emulation procedures such as history matching [69] to reduce simulator input space or additional diagnostic checks [70] that rely on more robust training data.
Despite these limitations, the emulator presented in this paper offers an accessible tool for health policy makers who may not have a background in disease modelling. This will help to build the modelling capacity of local decision makers who are working to build integrated HIV and chronic disease programmes with limited resources [71]. While transparency surrounding microsimulation model development has increased in recent years as these models become more widespread, it is often impractical to document every detail of their functionality [11, 12]. Emulators do not address all transparency concerns related to black box modelling, but they can help address concerns related to communicating results from complex models for wider audiences. Thus, a key strength of this emulator is its simplicity; the stepbystep annotated code that was programmed using free R software and is available via an opensource repository can encourage future use and adaption at no cost to the user.
Secondly, our emulator uses Bayesian inferences rather than a frequentist approach in the posterior predictive analysis, which maximizes both the prior and currently available data. Third, we used the leaveoneout cross validation technique which offers an unbiased and reliable diagnostic check when similar distributions exist in the training and test data [47, 49, 70]. Fourth, our emulator fills a gap in the health decisionmaking toolbox in that it is one of the first to model the dual burden of HIV and hypertension and of HIV and depression for a country with a generalized HIV epidemic and growing noncommunicable disease burden.
Conclusion
This emulation tutorial responds to calls from international donors and global health researchers [72] to “make modelling tools and analytic packages publicly available to wider audiences” and facilitate “training of decision makers to understand model outputs, particularly uncertainty and confidence intervals”. We envision future expansions of this emulator to be able to estimate changes in HIV and NCD burden with greater coverage of National Health Insurance schemes and in the presence of integrated care interventions [73, 74], and to estimate the costeffectiveness of integrated interventions based on current [6, 75] and emerging data.
Availability of data and materials
All annotated code used to develop and run the emulator presented in this paper is freely available via the following digital repository: https://doi.org/10.7910/DVN/LUBYHQ
Notes
The simulation model was developed and published in 2019, at which time the UNAIDS 90–9090 targets were in effect in Kenya.
Abbreviations
 CPU:

Central Processing Unit
 GB:

Gigabyte
 GP:

Gaussian Process
 GPU :

Graphics Processing Unit
 HIV:

Human Immunodeficiency Virus
 inMODELA:

Integrated Modeling of Epidemiologic and Economic Longterm Outcomes in Africa
 KAIS:

Kenya AIDS Indicator Surveys (KAIS)
 LMICs:

Low and MiddleIncome Countries
 LOOCV:

LeaveOneOut Cross Validation
 NCD:

Noncommunicable Disease
 PLWH :

People Living with HIV
 RAM:

Random Access Memory
 SSA :

SubSaharan Africa
 STD:

Sexually Transmitted Disease
 STDSIM :

Sexually Transmitted Diseases Simulator
References
Law AM. Simulation Modeling and Analysis. 5th ed. Mcgraw Hill Series in Industrial Engineering and Management; 2014. p. 776.
Rutter CM, Zaslavsky AM, Feuer EJ. Dynamic microsimulation models for health outcomes: a review. Med Decis Mak. 2011;31:10–8.
Elveback L, Varma A. Simulation of mathematical models for public health problems. Public Health Rep. 1965;80:1067–76.
Kopec JA, Edwards K, Manuel DG, Rutter CM. Advances in microsimulation modeling of population health determinants, diseases, and outcomes. Epidemiol Res Int. 2012;2012:1–3.
Sugrue DM, Ward T, Rai S, McEwan P, van Haalen HGM. Economic modelling of chronic kidney disease: a systematic literature review to inform conceptual model design. Pharmacoeconomics. 2019;37:1451–68.
Nugent R, Barnabas RV, Golovaty I, Osetinsky B, Roberts DA, Bisson C, et al. Costs and costeffectiveness of HIV/NCD integration in Africa: from theory to practice HHS public access. Aids. 2018;32:83–92.
Nobile MS, Cazzaniga P, Tangherloni A, Besozzi D. Graphics processing units in bioinformatics, computational biology and systems biology. Brief Bioinform. 2017;18:870–85.
Andrianakis I, Vernon IR, McCreesh N, McKinley TJ, Oakley JE, Nsubuga RN, et al. Bayesian history matching of complex infectious disease models using emulation: a tutorial and a case study on HIV in Uganda. PLoS Comput Biol. 2015;11:e1003968.
Ellis AG, Iskandar R, Schmid CH, Wong JB, Trikalinos TA. Active learning for efficiently training emulators of computationally expensive mathematical models. Stat Med. 2020;39:3521–48.
Gulati R, Gore JL, Etzioni R. Comparative Effectiveness of Alternative ProstateSpecific Antigen–Based Prostate Cancer Screening Strategies. Ann Intern Med. 2013;158:145.
Abraham JM. Using microsimulation models to inform U.S. health policy making. Health Serv Res. 2013;48(2 Pt 2):686–95.
Lorscheid I, Heine BO, Meyer M. Opening the ‘black box’ of simulations: increased transparency and effective communication through the systematic design of experiments. Comput Math Organ Theory. 2012;18:22–62.
Blanning RW. The construction and implementation of metamodels. Simulation. 1975;24:177–84.
Conti S, Gosling JP, Oakley JE, O’Hagan A. Gaussian process emulation of dynamic computer codes. Biometrika. 2009;96:663–76.
De Foo C, Shrestha P, Wang L, Du Q, García Basteiro AL, Abdullah AS, et al. Integrating tuberculosis and noncommunicable diseases care in low and middleincome countries (LMICs): a systematic review. PLoS Med. 2022;19(1):e1003899.
Rohwer A, Uwimana Nicol J, Toews I, Young T, Bavuma CM, Meerpohl J. E ffects of integrated models of care for diabetes and hypertension in lowincome and middleincome countries: a systematic review and metaanalysis. BMJ Open. 2021;11:1–9.
Vorkoper S, Kupfer LE, Anand N, Patel P, Beecroft B, Tierney WM, et al. Building on the HIV chronic care platform to address noncommunicable diseases in subSaharan Africa: a research agenda. Aids. 2018;32(Suppl 1):S107–13.
Kemp CG, Weiner BJ, Sherr KH, Kupfer LE, Cherutich PK, Wilson D, et al. Implementation science for integration of HIV and noncommunicable disease services in subSaharan Africa: a systematic review. Aids. 2018;32:S93–105.
Adeyemi O, Lyons M, Njim T, Okebe J, Birungi J, Nana K, et al. Integration of noncommunicable disease and HIV/AIDS management: a review of healthcare policies and plans in East Africa. BMJ Glob Heal. 2021;6(5):e004669.
Kasaie P, Weir B, Schnure M, Dun C, Pennington J, Teng Y, et al. Integrated screening and treatment services for HIV, hypertension and diabetes in Kenya: assessing the epidemiological impact and costeffectiveness from a national and regional perspective. J Int AIDS Soc. 2020;23(Suppl 1):e25499.
Achwoka D, Mutave R, Oyugi JO, Achia T. Tackling an emerging epidemic: the burden of noncommunicable diseases among people living with hiv/aids in subSaharan Africa. Pan Afr Med J. 2020;36:1–9.
Wanni Arachchige Dona S, Bohingamu Mudiyanselage S, Watts JJ, Sweeney R, Coghlan B, Majmudar I, et al. Added socioeconomic burden of noncommunicable disease on HIV/AIDS affected households in the Asia Pacific region: a systematic review. Lancet Reg Health West Pac. 2021;9:100111.
Hyle EP, Mayosi BM, Middelkoop K, Mosepele M, Martey EB, Walensky RP, et al. The association between HIV and atherosclerotic cardiovascular disease in subSaharan Africa: a systematic review. BMC Public Health. 2017;17:1–15.
Patel P, Rose CE, Collins PY, NucheBerenguer B, Sahasrabuddhe VV, Peprah E, et al. Noncommunicable diseases among HIVinfected persons in lowincome and middleincome countries: a systematic review and metaanalysis. AIDS. 2018;32:S5–20.
Rasmussen CE, Williams CKI. Gaussian processes for machine learning. MIT Press; 2018.
Sacks J, Welch WJ, Mitchell TJ, Wynn HP. Design and analysis of computer experiments. Stat Sci. 1989;4(4):409–23.
Coetzee L, Bogler L, De Neve JW, Bärnighausen T, Geldsetzer P, Vollmer S. HIV, antiretroviral therapy and noncommunicable diseases in subSaharan Africa: empirical evidence from 44 countries over the period 2000 to 2016. J Int AIDS Soc. 2019;22(7):e25364.
Bloomfield GS, Khazanie P, Morris A, RabadánDiehl C, Benjamin LA, Murdoch D, et al. HIV and noncommunicable cardiovascular and pulmonary diseases in lowand middleincome countries in the art era: what we know and best directions for future research. J Acquir Immune Defic Syndr. 2014;67(Suppl 1):S40–53.
Osetinsky B, Hontelez JAC, Lurie MN, McGarvey ST, Bloomfield GS, Pastakia SD, et al. Epidemiological and health systems implications of evolving HIV and hypertension in South Africa and Kenya. Health Aff. 2019;38:1173–81.
Bakker R, Korenromp E, Meester E, Van Der Ploeg C, Voeten H, Van Vliet C, et al. STDSIM: a microsimulation model for decision support in the Control of HIV and other STDs. Am Sex Transm Dis Assoc. 2000;27:652.
Hontelez JAC, Lurie MN, Bärnighausen T, Bakker R, Baltussen R, Tanser F, et al. Elimination of HIV in South Africa through expanded access to antiretroviral therapy: a model comparison study. PLoS Med. 2013;10:e1001534.
Kenya Ministry of Health. STEPwise Survey for Non CommuniKenyacable Diseases Risk Factors 2015 Report. Kenya STEPwise Surv Non Commun Dis Risk Factors 2015 Rep, vol. 5. Kenya Ministry of Health; 2015. p. 8–210.
National AIDS and STI Control Programme (NASCOP) K. Kenya AIDS Indicator survey 2012: final report. Nairobi; 2014. https://nsdcc.go.ke/wpcontent/uploads/2015/10/KAIS2012.pdf. Accessed 26 Sep 2023
National AIDS Control Council. Kenya HIV County profiles. 2016. https://nsdcc.go.ke/wpcontent/uploads/2016/12/KenyaHIVCountyProfiles2016.pdf. Accessed 26 Sep 2023.
Smit M, Olney J, Ford NP, Vitoria M, Gregson S, Vassall A, et al. The growing burden of noncommunicable disease among persons living with HIV in Zimbabwe. AIDS. 2018;32:773–82.
Smit M, PerezGuzman PN, Mutai KK, Cassidy R, Kibachio J, Kilonzo N, et al. Mapping the current and future noncommunicable disease burden in Kenya by human immunodeficiency virus status: a modeling study. Clin Infect Dis. 2020;71:1864–73.
UNAIDS. On the FastTrack to end AIDS 2016–2021 Strategy. p. 2015. https://www.unaids.org/sites/default/files/media_asset/20151027_UNAIDS_PCB37_15_18_EN_rev1.pdf. Accessed 26 Sep 2023
Hashiguchi L, Achoki T, Alam U, Fullman N. The Global Burden of Disease: Generating Evidence, Guiding Policy. Nairobi, Kenya. Institute for Health Metrics and Evaluation and the International Centre for Humanitarian Affairs. 2016. https://www.healthdata.org/researchanalysis/library/globalburdendiseasegeneratingevidenceguidingpolicykenya. Accessed 26 Sep 2023.
Erickson C. A guide to the GauPro R package. 2023. https://cran.rproject.org/web/packages/GauPro/vignettes/GauPro.html. Accessed 26 Sep 2023.
Gabry J, Simpson D, Vehtari A, Betancourt M, Gelman A. Visualization in Bayesian workflow. J R Stat Soc Ser A Stat Soc. 2019;182:389–402.
Magnusson M, Andersen MR, Jonasson J, Vehtari A. Leaveoneout crossvalidation for Bayesian model comparison in large data. Proc Mach Learn Res. 2020; arXiv:2001.00980
Vehtari A, Gelman A, Gabry J. Practical Bayesian model evaluation using leaveoneout crossvalidation and WAIC. Stat Comput. 2017;27:1413–32.
Vernon I, Goldsteiny M, Bowerz RG. Galaxy formation: a Bayesian uncertainty analysis. Bayesian Anal. 2010;05(04):619–70.
Swiler LP, Gulian M, Frankel AL, Safta C, Jakeman JD. A survey of constrained Gaussian process regression: approaches and implementation challenges. J Mach Learn Model Comput. 2020;1:119–56.
Altaweel Y. Diagnostics and simulationbased methods for validating Gaussian process diagnostics and simulationbased methods for validating Gaussian process emulators. PhD Thesis. University of Sheffield; 2018. https://etheses.whiterose.ac.uk/19873/
Lynch SM. Bayesian Statistics. Encycl Soc Meas. 2005:135–44.
Shao Z, Er MJ, Wang N. An Efficient LeaveOneOut CrossValidationBased Extreme Learning Machine (ELOOELM) With Minimal User Intervention. IEEE Trans Cybern. 2016;46:1939–51.
Morrison RE, Bryant CM, Terejanu G, Prudhomme S, Miki K. Data partition methodology for validation of predictive models. Comput Math Appl. 2013;66:2114–25.
Beddows AV, Kitwiroon N, Williams ML, Beevers SD. Emulation and sensitivity analysis of the community multiscale air quality model for a UK ozone pollution episode. Environ Sci Technol. 2017;51:6229–36.
Iskauskas A. Emulation and History Matching Handbook. https://cran.rproject.org/web/packages/hmer/vignettes/emulationhandbook.html Accessed 03 May 2023
Reiker T, Golumbeanu M, Shattock A, Burgert L, Smith TA, Filippi S, et al. Emulatorbased Bayesian optimization for efficient multiobjective calibration of an individualbased model of malaria. Nat Commun. 2021;12:7212.
Smith T, Killeen GF, Maire N, Ross A, Molineaux L, Tediosi F, et al. Mathematical modeling of the impact of malaria vaccines on the clinical epidemiology and natural history of plasmodium falciparum malaria: overview. Am J Trop Med Hyg. 2006;75(Suppl 2):1–10.
Oyebamiji OK, Wilkinson DJ, Jayathilake PG, Curtis TP, Rushton SP, Li B, et al. Gaussian process emulation of an individualbased model simulation of microbial communities. J Comput Sci. 2017;22:69–84.
Hamilton DT, Agutu C, Sirengo M, Chege W, Goodreau SM, Elder A, et al. Modeling the impact of different PrEP targeting strategies combined with a clinicbased HIV1 nucleic acid testing intervention in Kenya. Epidemics. 2023;44:100696.
Coll P, Jarrín I, Martínez E, MartínezSesmero JM, DomínguezHernández R, CastroGómez A, et al. Achieving the UNAIDS goals by 2030 in people living with HIV: A simulation model to support the prioritization of health care interventions. Enfermedades Infecc y Microbiol Clin. 2023;41(10):589–95.
Gouda HN, Charlson F, Sorsdahl K, Ahmadzada S, Ferrari AJ, Erskine H, et al. Burden of noncommunicable diseases in subSaharan Africa, 1990–2017: results from the global burden of disease study 2017. Lancet Glob Heal. 2019;7(10):E1375–87.
Schnure M, Dowdy D. In: Bae KH, Feng B, Kim S, LazarovaMolnar S, Zheng Z, Roeder T, Thiesing R, editors. Proceedings of the 2020 Winter Simulation Conference. IEEE; 2020. p. 980–91.
Haacker M, Bärnighausen T, Atun R. HIV and the growing health burden from noncommunicable diseases in Botswana: modelling study. J Glob Health. 2019;9(1):010428. https://doi.org/10.7189/jogh.09.010428.
Nianogo RA, Arah OA. Forecasting obesity and type 2 diabetes incidence and burden: the ViLAobesity simulation model. Front Public Health. 2022;10 https://doi.org/10.3389/fpubh.2022.818816.
Lymer S, Schofield D, Lee CMY, Colagiuri S. NCDMod: a microsimulation model projecting chronic disease and risk factors for Australian adults. Int J Microsimulation. 2015;9:103–39.
Yarnoff B, Honeycutt A, Bradley C, Khavjou O, Bates L, Bass S, et al. Validation of the prevention impacts simulation model (PRISM). Prev Chronic Dis. 2021;18:200225.
Fain BG, Dobrovolny HM. GPU acceleration and data fitting: agentbased models of viral infections can now be parameterized in hours. J Comput Sci. 2022;61:101662.
Rasmussen CE. Gaussian Processes in Machine Learning. Bousquet, O., von Luxburg, U., Rätsch, G. (eds) Advanced Lectures on Machine Learning. ML 2003. Lecture Notes in Computer Science, vol 3176. Springer, Berlin, Heidelberg. https://doi.org/10.1007/9783540286509_4.
O’Hagan A. Bayesian analysis of computer code outputs: a tutorial. Reliab Eng Syst Saf. 2006;91(10–11):1290–300. https://doi.org/10.1016/j.ress.2005.11.025.
Thiagarajan JJ, Venkatesh B, Anirudh R, Bremer PT, Gaffney J, Anderson G, et al. Designing accurate emulators for scientific processes using calibrationdriven deep models. Nat Commun. 2020;11:5622.
LópezLopera AF, Bachoc F, Durrande N, Rohmer J, Idier D, Roustant O. Approximating Gaussian Process Emulators with Linear Inequality Constraints and Noisy Observations via MC and MCMC. In: Tuffin B, L'Ecuyer P, editors. Monte Carlo and QuasiMonte Carlo Methods. MCQMC 2018. Springer Proceedings in Mathematics & Statistics, vol 324. Cham: Springer; 2020. https://doi.org/10.1007/9783030434656_18.
Mayala BK, Bhatt S, Gething P. Predicting HIV/AIDS at Subnational Levels using DHS Covariates related to HIV. DHS Spatial Analysis Reports No. 18. Rockville, Maryland, USA: ICF; 2020.
Chen Z, Fan J, Wang K. Remarks on multivariate Gaussian Process. 2020. arXiv:2010.09830.
Craig PS, Goldstein M, Seheult AH, Smith JA. Pressure matching for hydrocarbon reservoirs: a case study in the use of Bayes linear strategies for large computer experiments; 1997.
Bastos LS, O’Hagan A. Diagnostics for Gaussian process emulators. Technometrics. 2009;51:425–38.
Revill P, Rangaraj A, Makochekanwa A, Mpofu A, Ciaranello AL, Jahn A, et al. Perspectives on the use of modelling and economic analysis to guide HIV programmes in subSaharan Africa. Lancet HIV. 2022;3018:1–4.
Kupfer LE, Beecroft B, Viboud C, Wang X, Brouwers P. A call to action: strengthening the capacity for data capture and computational modelling of HIV integrated care in low and middleincome countries. Journal of the international AIDS. Society. 2020;23(Suppl 1):e25475.
Genberg BL, Wachira J, Steingrimsson JA, Pastakia S, Tran DNT, Said JA, et al. Integrated communitybased HIV and noncommunicable disease care within microfinance groups in Kenya: study protocol for the Harambee cluster randomised trial. BMJ Open. 2021;11:e042662.
Kibachio J, Mwenda V, Ombiro O, Kamano JH, PerezGuzman PN, Mutai KK, et al. Recommendations for the use of mathematical modelling to support decisionmaking on integration of noncommunicable diseases into HIV care. J Int AIDS Soc. 2020;23(Suppl 1):e25505.
Osetinsky B, Mwangi A, Pastakia S, WilsonBarthes M, Kimetto J, Rono K, et al. Layering and scaling up chronic noncommunicable disease care on existing HIV care systems and acute care settings in Kenya: a cost and budget impact analysis. J Int AIDS Soc. 2020;23:e25496.
Acknowledgements
Not applicable.
Funding
Research reported in this publication was supported by the National Institute of Mental Health of the National Institutes of Health under award number R01MH118075, and by the Civilian Research and Development Foundation (CRDF) Global under award number R20210467710. Approximately 50 % of this research was financed with Federal money. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health or CRDF Global.
Author information
Authors and Affiliations
Contributions
OG, SC, AM and FY led the conception and design of the work. SJS led data analysis and interpretation and wrote the first draft of the manuscript. RM, MWB and BO contributed to data analysis and to the conception and design of the work. All authors revised the work critically for important intellectual content, provided final approval of the version to be published, and agree to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Sawe, S.J., Mugo, R., WilsonBarthes, M. et al. Gaussian process emulation to improve efficiency of computationally intensive multidisease models: a practical tutorial with adaptable R code. BMC Med Res Methodol 24, 26 (2024). https://doi.org/10.1186/s1287402402149x
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1287402402149x