Applications of Bayesian approach in modelling risk of malaria-related hospital mortality

Background Malaria is a major public health problem in Malawi, however, quantifying its burden in a population is a challenge. Routine hospital data provide a proxy for measuring the incidence of severe malaria and for crudely estimating morbidity rates. Using such data, this paper proposes a method to describe trends, patterns and factors associated with in-hospital mortality attributed to the disease. Methods We develop semiparametric regression models which allow joint analysis of nonlinear effects of calendar time and continuous covariates, spatially structured variation, unstructured heterogeneity, and other fixed covariates. Modelling and inference use the fully Bayesian approach via Markov Chain Monte Carlo (MCMC) simulation techniques. The methodology is applied to analyse data arising from paediatric wards in Zomba district, Malawi, between 2002 and 2003. Results and Conclusion We observe that the risk of dying in hospital is lower in the dry season, and for children who travel a distance of less than 5 kms to the hospital, but increases for those who are referred to the hospital. The results also indicate significant differences in both structured and unstructured spatial effects, and the health facility effects reveal considerable differences by type of facility or practice. More importantly, our approach shows non-linearities in the effect of metrical covariates on the probability of dying in hospital. The study emphasizes that the methodological framework used provides a useful tool for analysing the data at hand and of similar structure.


Background
Plasmodium falciparum malaria is a major public health problem in most tropical countries in the world. Between 300 and 500 million cases of clinical episodes occur each year, and 1-3 million people die of the disease [1,2]. The sub-Saharan African region has the greatest burden with over 90% cases and 80% malaria-attributable deaths [3]. Measuring malaria burden in a population is a challenge in most developing countries [1,2], because most disease incidences and deaths occur outside of the formal health care, particularly at home [4,5]. Instead, routine hospital data provide a proxy for measuring the incidence of severe malaria and for crudely estimating morbidity rates or equivalent clinical indicators [6].
Analysis of these data may allow to assess, compare and ultimately improve the care provided at all levels of health care. It may assist in monitoring and planning resource needs in a health system and designing appropriate interventions, tailored towards communities at high risk or lead to further investigations to identify important risk factors [7]. Variability in these indicators is a well known issue, and is a function of various covariates, at both patient or group level, some observed and others unobserved, and maybe spatially correlated or time-varying [7][8][9][10]. Geographical differences are driven by socio-economic determinants, availability and access to health care or health seeking behaviour [7,10]. Temporal variation may again be a factor of access to care and malaria transmission [9], for example, there can be increased access in dry season and yet fewer cases in the same season. Adequate statistical modelling and analysis is, therefore, of epidemiological interest. This paper is motivated by the analysis of malaria-related hospital mortality data collected at patient's level, covering a period of two years among children admitted to a referral district hospital in Malawi. The response variable is binary (whether died of malaria in hospital or not) and is linked to several covariates which are categorical or continuous, spatial and temporal. Unobserved heterogeneity due to, for example, differences in practice style or type of hospital, inequalities in utilisation or access, may exist and should be explored. Hierarchical regression modelling provides a general framework to investigate the effect of these cofactors.
We apply a geoadditive logistic model as proposed by Fahrmeir and Lang [11]. Applications of such models are many and literature is growing. These models can be estimated through a fully or empirical Bayesian approach, and are implemented in BayesX [12]. For example, Augustin et al. [13] employed the model to study the relationship between needle losses of pine-trees and various covariates. Inference was performed with a full Bayes (FB) approach making use of Markov Chain Monte Carlo (MCMC) simulation techniques. Tutz [14] developed a class of generalised semiparametric mixed models and proposed penalized marginal likelihood approach for the estimation of parameters. Fahrmeir et al. [15] considered a penalised geoadditive model for space-time data with inference performed using an empirical Bayesian (EB) approach.
In this paper, we use the fully Bayesian approach via MCMC simulation techniques. The advantages of FB inference is that the functionals of the posterior can be computed without relying on large sample Gaussian justifications, and the approach is computationally feasible for large datasets. Moreover, the uncertainty in the parameters is easily quantified [15]. Furthermore, Bayesian methods are more flexible in that empirical information, when available, can be incorporated with the data through an informative prior distribution. When this information is not available, a non-informative prior can be chosen. The methodology is of substantive interest since the effects of other covariates are jointly estimated with the random effects, e.g., spatially structured and unstructured heterogeneity effects [16]. This is extended to incorporate nonparametric terms for nonlinear continuous covariates and time-varying coefficients, for example, time trend and seasonal variation of calendar time. In addition, space-time interactions are assessed within the varying-coefficient models framework [17].
The rest of this paper is organised as follows. We first describe the data. Next, we specify the model and outline the Bayesian approach used for model estimation. This is followed by the application of the model to the data, and then results are presented. Discussion on the results and limitations of the study conclude the article.

Data
Data were obtained from discharge records of all paediatric hospital admissions at Zomba district hospital, Malawi, between 1 January 2002 to 31 December 2003. Each case was confirmed as malaria on admission through microscopic identification of parasites in blood samples. Zomba district hospital, with over 500 beds is the largest health facility in the district and serves both as the first consultation point for patients within its catchment, and as a referral centre for other 23 primary health centres. These facilities are managed by the Ministry of Health and the Christian health association of Malawi, and variations in health care management is expected.
The discharge registers included patients' age, sex, date of admission and discharge, whether referred to the hospital or not, the discharge outcome (i.e. death, discharged home, home-based care or absconded), village or location of residence, and treatment given. Based on the name of the village, each case was matched to one of 21 residential wards in the district. Approximately 86% of cases were successfully linked to wards, the other 14% having either missing or insufficient residential information. Only georeferenced cases are included in this analysis. Table 1 gives a description of the variables used in this analysis.
A total of 302 deaths were registered among 3,969 children hospitalised for malaria, between January 2002 to December 2003, resulting in an overall case fatality ratio (CFR) of 7.6%. Table 1 shows the proportion who died in different covariate levels. The proportion varies with age, referral status, season, distance from the hospital and length of hospital stay (LOS). The CFR drops from 8.5% in the age of <1 year to 6.2% at age of between 1-4 years and increases in the 5-14 years groups to 10.5. This suggests a curvature in the association of age and the probability of in-hospital mortality. The number of cases are relatively more in the wet season (October-March) compared to the dry season (April-September), with a similar pattern of CFR. Boys are more frequently hospitalised than girls (58%), but the CFR is not different. The hospital receives relatively more patients from a distance of more than 5 kms (52%), with distant patient likely to die in hospital. As for LOS, CFR is very high on day 1, drops and then increases as the stay is prolonged. Again there is an indication of curvature in the relationship between LOS and the risk of inpatient mortality. Children referred to the hospital are most likely to die in hospital (CFR = 8.8%). Further detailed descriptive and exploratory analyses presented elsewhere clearly show spatial and temporal variations [10].

The Model
Given a set of observations (y i , w i ), i = 1, ʜ, n, where y i is a binary response such that y i = 1 if a child died in hospital and y i = 0 a child is discharged, and w i = (w i1 , ʜ, w ip )' are covariates, we consider a logistic model to estimate the probability of dying in hospital, y i = 1 versus the probability of being discharged from hospital, y i = 0. The response is distributed as a Bernoulli random variable such that: where p i = P(y i = 1), and η i = logit(p i ) is a canonical parameter linked to the linear predictor Here γ is a p-dimensional vector of unknown regression coefficients.
Since the observations are associated with location of residence, it is desirable to account for geographical differences. We introduce areal level effects to allow expected spatial correlation and any unstructured areal heterogeneity of morbidity, using a convolution prior [16]. We also specify health facility effects, which permit variations that occur by type of facility. These supply effects may impact on the referral patterns, admission patterns and case management. Furthermore, we assume additional flexibility in Total number of observations 3969 § n = number hospitalised and percent died in that category ‡ SD = standard deviation the predictor to allow for nonlinear or time-varying covariate effects. We, therefore, extend the predictor (2) to a more general semiparametric predictor [11], , ʜ, H} model unstructured heterogeneity at area and health facility levels respectively, f i are unknown functions for nonlinear effects of continuous covariate x i (e.g., age of the child), or calendar time effect t i . Note that the spatially structured effects and unobserved heterogeneity tries to capture all sources of unmeasured influential factors, some that occur locally or at large scale, or those that may vary with time.
Several extensions to the additive predictor (3) In addition, the model can be extended to include interaction surfaces within the varying coefficient framework proposed by Hastie and Tibshirani [17]. Here the effect of some covariate z is assumed to vary smoothly over the range of a second covariate x, giving the predictor of which the term f(x)z = g(x, z) is interpreted as an interaction term between z and x. In our case study, this can be time-space interactions, leading to a predictor of the form The function f 4 quantifies the deviations from the effect at some specified reference or baseline time period. This will be discussed in detail in a separate analysis.

Estimation: fully Bayesian approach
Prior distributions for covariate effects Modelling and inference uses the fully Bayesian approach. In the Bayesian formulation, the speci-fication of the proposed model (Equation 4) is complete by assigning priors to all unknown parameters. For the fixed regression parameters, a suitable choice is the diffuse prior, i.e., p(γ) ∝ const, but a weakly informative Gaussian prior is also possible. For the time and continuous covariates we estimate them nonparametrically through smoothness priors. We use the second-order Gaussian random walk prior to allow enough flexibility, while penalising abrupt changes in the function, as suggested by Lang and Brezger [18]. The prior can be expressed in the pairwise difference form as is the variance, with diffuse For the time-varying seasonal effect, we also assign a smoothness prior whose joint distribution, s, is given by again assuming diffuse priors for initial values, s 1 , ʜ, s 11 , and is a variance that controls the degree of smoothness. The unstructured spatial heterogeneity term, u i is assumed to follow an exchangeable Gaussian prior with zero mean and variance, . A similar prior is assigned to the heterogeneity term for the health facility, i.e., .
Finally, for the spatial components v i , we assign a Markov random field (MRF) prior [16]. This is analogous to random walk models. The conditional distribution of v i , given adjacent areas v j , is a univariate normal distribution with mean equal the average v j values of v i 's neighbouring areas and variance equal to divided by the number of adjacent areas. This leads to a joint density of the form where i ~ j denotes that area i is adjacent to j, and assumes that parameter values v i and v j in adjacent areas are similar.
The degree of similarity is determined by the unknown precision parameter .
well defined design matrix Z and a (possibly high-dimensional) vector of regression parameters β, all different pri- τ v 2 ors (Equations 5-7) can be expressed in a general Gaussian form with an appropriate penalty matrix K j . Its structure depends on the covariate and smoothness of the function. In most cases, K j is rank deficient and hence the prior for β j is improper. For the variances we assume inverse Gamma priors IG(a j , b j ), with hyperparameters a j , b j chosen such that this prior is weakly informative.

Posterior distribution
Fully Bayesian inference is based on the analysis of posterior distribution of the model parameters. In general the posterior is highly dimensional and analytically intractable, which makes direct inference almost impossible. This problem is circumvented by using MCMC simulation techniques, whereby samples are drawn from the full conditional of parameters given the rest of the data. Under conditional independence assumptions the posterior distribution for the Bernoulli model is given by Bayes Theorem where the quantity p(β, γ, τ 2 ) is the prior density function, and L(data|β, γ, τ 2 ) denotes the likelihood of the Bernoulli model. More specifically, the posterior is given by For updating the full conditionals of parameters, we use a hybrid MCMC sampling scheme of the iteratively weighted least squares (IWLS) proposals, developed for generalised linear mixed models by Gamerman [19], and Metropolis-Hastings algorithm. Full details are presented elsewhere [11,14,15,18].

Applications
We analyse the following logistic models, We implement the models in BayesX ver 1.4 -a public domain software for computing complex Bayesian techniques [12]. For the four models, 40,000 iterations are carried out after a burn-in sample of 10,000. We thin every 20th iteration, yielding 2,000 samples for parameter estimation. Convergence is monitored by plotting trace and autocorrelation plots of the samples. Quantiles, median, mean and standard deviation for all parameters, estimated from the posterior distributions, are used to assess model fit. In particular, credible intervals are used to assess the significance of parameters.
We also monitored the posterior deviance, and compared the set of plausible models using the Deviance Information Criterion (DIC) [20]. Specifically, we compare the structured additive models (i.e., M1, M2 and M3) with the simpler parametric alternative (M0). The DIC is given by

Model assessment
Comparing the goodness of fit of models M0, M1, M2, and M3 we note that M3 is a preferred model ( ). Evidently, modelling the impact of known factors alone is not sufficient to produce a satisfactory fit to the observations, and random effects at area and health care level are needed to improve fit and account for heterogeneity. In our analysis, we also observe that the inclusion of random effects reduce the effect size of some variable (results not shown). In what follow, we only report results based on model M3. Table 3 gives posterior means and odds ratios (OR), and the corresponding 95% credible interval (CI) for categorical covariates. The risk of dying in hospital is related to season, distance to the hospital and referral status of a child. No association is observed between probability of dying in hospital and sex, nor between probability of dying and day of the week. The likelihood of dying in hospital is lower in the dry season relative to the wet season (OR: 0.63, 95% CI: 0.49 to 0.86). For children who travel less than 5 kms to the hospital compare to those who travel more than 5 kms, the risk of dying in hospital is lower (OR: 0.005, 95% interval: 0.0006 to 0.28). Children referred to the hospital are at increased risk of dying in the hospital relative to those who do not (OR: 98.49, 95% interval: 21.33 to 383.75). Figure 1 displays the nonlinear effects of age of child and LOS on the probability of dying in hospital. The effect of age is estimated to be almost linear, with the posterior means increasing with increasing age (Figure 1a). In other words the risk is lower for infants, but increases for much older children. For LOS, the posterior means show slight deviation from linearity (Figure 1b). The risk decreases from day 1, remains almost constant from day 2 to 6, and then increases from day 7 to 20. Figure 2 displays the temporal effect as measured by the calendar effect. Again the time trend is estimated to be nonlinear ( Figure 2a). From week 1 to week 15, the risk decreases, then starts to increase up to week 55. From week 56 the risk is almost constant. The trend closely mirrored wet and dry seasons in the area, with high risk in the rain season and low risk in the dry season. This should be explained by the large number of children hospitalised during the wet season. The seasonal effect is given in Figure 2b. There is a clear seasonal variation for the entire study period. It is evident that the risk pattern displays both within month and between month variability. Figure 3 shows the spatial effects with the corresponding posterior probabilities map at 80% nominal level ( Figure  4). Areas shade black show strictly negative credible intervals, while white areas depict strictly positive credible intervals, and grey areas indicate nonsignificant credible intervals. There is evidence of spatial variation in risk of dying in hospital. It is clear that areas at the center of Zomba district, which is urban, report reduced risk, while those in the peripheral have increased risk. The uncorrelated spatial heterogeneity is given by caterpillar plot in Figure 5a. There are no clear differences in area specific effects, and most of them have a near zero effect on the probability of dying in hospital. It is clear that the spatially correlated effects are dominant, based on the ratio of variance components, ( Table   4). The other plot (Figure 5b) displays the heterogeneity effects at health facility level. We observe strong evidence of variation in risk possibly due to differences in health care management at various facilities.   Residual spatial effect of 'residential ward' in Zomba district Figure 3 Residual spatial effect of 'residential ward' in Zomba district. Shown are the posterior means. Red colour denotes regions with negative risk, green denotes regions with positive risk. Lake Chilwa is in diagonal solid lines.

Spatial effects
-2.4 5.7 Table 4 reports on the results investigating the influence of hyperpriors since the performance of the model can be sensitive to the choice of the variance components priors [21]. We therefore consider alternative specifications, and carry out sensitivity of our model assuming an IG with scale and shape parameters a and b respectively. We assume four alternatives a = 0.5, b = 0.0005, a = 1, b = 0.005, a = 0.001, b = 0.001 and a = 0.01, b = 0.01. The first specification was suggested by Kelsall and Wakefield [22], for modelling the precision of the spatial effects in an MRF model. The second alternative was proposed in Besag and Kooperberg [23]. The remaining two priors with equal scale and shape parameters, especially a = b = 0.001, have often been used as standard choice on the variances of random effects [24]. Re-running MCMC simulations based on these specifications, using model M3 for simplicity, yield relatively similar inference on risks of dying Nonlinear effect of (a) age of the child (in months); (b) length of hospital stay (in days) Figure 1 Nonlinear effect of (a) age of the child (in months); (b) length of hospital stay (in days). Shown are the posterior means (solid line) together with 95% pointwise credible intervals (dotted line).

Discussion
This study apply Bayesian techniques to analyse patterns and risk factors of malaria attributable case fatality data. We develop and use logistic regression models to have an in-depth understanding of factors associated with the probability of dying of malaria in hospital, building on the existing methodological contributions by Fahrmeir and Lang [11], Fahrmeir et al. [15].
A number of variables are used to explain the variation in the response and include spatial, continuous, categorical, and heterogeneity terms. The spatially structured variation and unstructured heterogeneity are modelled using MRF prior and zero mean Gaussian heterogeneity priors as proposed by Besag et al. [16]. The continuous variables are estimated non-parametrically by applying second order Gaussian random walk prior, which permits enough flexibility while avoiding over-fitting the data [18]. The proposed methodology allows all these factors to be estimated in a single framework. Because the models are highly parameterised and analytically intractable, the maximum likelihood approach is not be feasible. Thus, the Bayesian inference, making use of MCMC simulation techniques, offers a viable alternative.
In this paper we find evidence that the risk of dying in hospital due to malaria is lower in the dry season, and for children travelling less than 5 kms to the hospital. However, for those referred to the hospital the risk increases. These results seem to suggest that when health care is accessible or available lives can be saved. Malaria is a preventable disease, but delayed treatment or lack of effective treatment can lead to fatal malaria within days [1]. Children are particularly vulnerable because of lack of immunity against the disease [2]. The risk decreases with age, again infants being the most vulnerable, but overall chil-dren under five years are the most at risk. The increase in risk for those aged 6-14 years, although these are supposed to be protected through acquired immunity, may reflect some aspects of health seeking behaviour, and emphasize the need for prompt and effective management of malaria for all children including those aged over five years even if such cases may not frequently occur in the general population [7,9,10].
The lower risk in the dry season should be interpreted with care. While the risk of infection is reduced during this period, this effect is directly linked to few cases being hospitalised, hence fewer deaths, and if anything such death should reflect disease management other than severity of the disease. Another possible explanation is that during the dry season access to the hospital is easier than during rainy season, leading to early treatment, and therefore fewer avoidable deaths. Referral children are at increased risk because probably these are already worse-off when they arrive at the hospital. Disease management differ-Posterior probabilities, at nominal level of 80%, for the spatial effects in Figure 3

Figure 4
Posterior probabilities, at nominal level of 80%, for the spatial effects in Figure 3. Black denotes regions with strictly negative credible intervals, white denotes regions with strictly positive credible intervals, while grey shows areas of no significant difference. Lake Chilwa is in diagonal solid lines.

(b)
ences or inaccessibility of care may contribute towards this finding.
The spatial effects are often a surrogate of underlying unobserved information, and may give leads for further epidemiological research or assist in designing malaria interventions. For example, the increased risk in rural areas may be an influence of different factors, such as una-vailability or inaccessibility of health facilities resulting in increased risk for such children. These effects may also reflect health seeking behaviour, which plays a critical role in accessing prompt and effective care. Since most antimalarial remedies at first taken at home, effective care may be delayed, leading to increased risk for rural children. Scaling-up of interventions such as insecticide-treated nets or health promotions on appropriate and effective treatment Residual unstructured heterogeneity effects of (a) residential wards, and (b) primary health care facilities Figure 5 Residual unstructured heterogeneity effects of (a) residential wards, and (b) primary health care facilities. Shown are the caterpillar plots of posterior means (circles), with 95% error bars. The significance of health facility effects further suggests that management of health care differs in the 23 referring facilities in Zomba district. Indeed, as these are public or private operated, resources such as drugs or ambulatory support may be lacking mainly in government-run health centers. Moreover, some facilities, for example, dispensaries and clinics have limited capacity to treat severe malaria, and may not refer severely sick in time because of lack of communication. There is need to ascertain actual factors contributing to such discrepancy, e.g. using health facility surveys on malaria case management. If indeed, these are the underlying factors, resources need to be committed to improve primary health care. The seasonal variations indicate that malaria transmission processes may explain the variation in the probability of dying in hospital. This is because malaria transmission is highly seasonal and may change within the same area as the year progresses. Essentially, interventions or health promotion campaigns should be tailored in recognition of these varying risk patterns.
The data-driven approach we have taken in this analysis has a greater advantage in that the nonlinear effects of continuous variables are estimated, and avoids ad hoc categorizations although the effect of age can as well be estimated as linear ( Figure 1). Indeed, the methodological framework we have applied provide useful tools for handling this type of data, and in similar conditions. Our application demonstrates that spatial and temporal analysis may reveal some salient features of the data, which may be overstepped by the classical regression models (Model M0) or the purely spatial models (Model M2    appropriate for decision making on which model to choose have already been employed, for example see Gosoniu et al [26]. Nevertheless, some simulation results [27] suggest that the DIC gives reasonable results even in complex nonparametric regression models.
A major limitation of our analysis is that data used comes from hospital registers. In most African countries, most malaria cases occur at home, and the pattern may be biased towards urban areas that are well covered by health facilities. Moreover, one may argue that much of this data represent severe forms of malaria, because studies on health seeking behaviour for malaria report that biomedical care is sought when the disease is nearly fatal [5].
Health facility data can best be described as providing proxies for prevalence or morbidity and hence health need. A more representative data is through cross-sectional household surveys, e.g. the demographic and health surveys (DHS), however, these are often carried out every four years, thus the periodicity is not frequent enough for surveillance and to inform immediate decision making [4].

Conclusion
In many resource-poor African countries, collection of population-based health data is a challenge and hospital data provide a critical source of information for decision making. In this paper, we set out to analyse risk factors of malaria mortality, using hospital register data. Our model, using the Bayesian approach, shows that malaria mortality is associated with both individual and group level factors, as well as observed and unobserved risk factors, some of which exhibit spatial and seasonal variation. From a public health perspective, with a goal of prevention and control, our results highlight that reducing malaria burden may require integrated strategies encompassing improved availability and access to effective care at primary facilities; reinforcing home and community case management where prompt care is inaccessible, and encouraging early referral, as well as inducting health promotion interventions aimed at interrupting malaria transmission. Methodologically, this model can easily be adapted to analyse other health indicator of similar structure and in like settings.