- Research article
- Open Access
- Published:

# Comparison of Bayesian and frequentist approaches in modelling risk of preterm birth near the Sydney Tar Ponds, Nova Scotia, Canada

*BMC Medical Research Methodology*
**volumeÂ 7**, ArticleÂ number:Â 39 (2007)

## Abstract

### Background

This study compares the Bayesian and frequentist (non-Bayesian) approaches in the modelling of the association between the risk of preterm birth and maternal proximity to hazardous waste and pollution from the Sydney Tar Pond site in Nova Scotia, Canada.

### Methods

The data includes 1604 observed cases of preterm birth out of a total population of 17559 at risk of preterm birth from 144 enumeration districts in the Cape Breton Regional Municipality. Other covariates include the distance from the Tar Pond; the rate of unemployment to population; the proportion of persons who are separated, divorced or widowed; the proportion of persons who have no high school diploma; the proportion of persons living alone; the proportion of single parent families and average income. Bayesian hierarchical Poisson regression, quasi-likelihood Poisson regression and weighted linear regression models were fitted to the data.

### Results

The results of the analyses were compared together with their limitations.

### Conclusion

The results of the weighted linear regression and the quasi-likelihood Poisson regression agrees with the result from the Bayesian hierarchical modelling which incorporates the spatial effects.

## Background

Public awareness about potential environmental hazards has continued to grow in recent years. This concern has led to an increased demand for public health authorities and researchers to investigate potential clustering of diseases around putative sources of hazards [1â€“10]. Evidence of significant association between maternal proximity to hazardous waste sites and risk of low birth-weight and congenital anomalies has been reported in some studies [4â€“9, 9â€“12], but other studies have reported otherwise [1, 3, 6, 13â€“15]. Some studies have also shown that women exposed to PCB are at increased risk of giving birth to infants with low birth weight [16, 17].

An assessment of the effect of human exposure to particular substances can be complex because multiple chemicals are usually involved so it may be difficult to discern the specific agent responsible for a particular health concern [16â€“19]. Furthermore, extraneous factors, like cultural and socioeconomic, may confound the effect of direct exposure to a waste site [16â€“24]. Within the boundaries of these limitations, the theory of spatial modelling and its applications to waste landfills and risk of adverse health have been developed and extensively discussed [25â€“29]. Regression analysis is one of the most widely used methods in the modelling of disease risk associated with proximity to a point source [25]. The parameters of the regression model can be estimated using the Bayesian or the frequentist approaches with spatial data assumed to be available at the individual case level or as spatially aggregated counts in enumeration districts (ED) [25â€“27].

In this paper, we focus on the comparison of the Bayesian and frequentist regression methods for aggregated counts. Specifically, we compare the Bayesian hierarchical Poisson regression, quasi-likelihood Poisson regression and weighted linear regression modelling approaches in answering the following two questions: 1) Is maternal proximity to hazardous waste and pollution from the Sydney Tar Pond sites associated with increased risk of preterm birth? 2) How much of the variation in preterm birth can be explained by socioeconomic inequalities across the study region?

## Methods

In the following subsections we provide a description of the study site, the data used for analyses and the theoretical framework of methods used to analyse the data.

### Tar Pond site in Sydney

The history of the Tar Pond site in Sydney, Nova Scotia, and the health consequences are well documented [2, 30]. The Tar Pond is a tidal estuary of 33 hectares in the Cape Breton regional municipality of Nova Scotia, Canada. This site, considered to be the most toxic site in Canada, is a result of over 100 years of steel manufacturing and other allied industries in the area. The byproducts from these industries include BTEX (benzene, toluene, ethylbenzene, and xylene), PAH (polycyclic aromatic hydrocarbons), PCB (polychlorinated biphenyl) and particulate laden with toxic metals, such as arsenic, lead and other heavy metals. This has led to the contamination of soil and other sources of natural water in the surrounding areas.

### Data description

Cape Breton regional municipality is made up of 158 enumeration districts but aggregated counts of preterm birth were available from only 144 enumeration districts in the municipality. There were 1604 observed cases of preterm birth out of a total population of 17559 at risk of preterm birth. Other variables include the distance from the Tar Pond (*d*) and the following area-specific covariates; the proportion of persons who are separated, divorced or widowed (*x*
_{2}); the proportion of persons who have no high school diploma (*x*
_{3}); the proportion of people living alone (*x*
_{4}); the proportion of single parent families (*x*
_{5}) and average income (*x*
_{6}). The covariates were selected based on the Pampalon and Raymond index [21] for health and welfare planning in Quebec. All area-specific covariates were extracted from the 1996 Canadian census data.

### Some theoretical background and context

Let *Y*
_{
i
}denote the number of observed cases of preterm birth, and *N*
_{
i
}the population at risk in each enumeration district (ED). The expected counts (*E*
_{
i
}) for each ED was calculated by multiplying *N*
_{
i
}by the the Canada preterm birth rate of 7.1 per 100 live births in 1996 (source: Population and Public Health Branch, Health Canada). This rate is assumed fixed for 1996 and may have been calculated by including data from the Cape Breton regional municipality, but we will assume that the effect of this can be ignored. Hence, *E*
_{
i
}is the expected number of preterm birth from all other sources of risk other than pollution from the Sydney Tar Pond. Preterm births only occur in females within the child-bearing age and the condition is not infectious. Hence, it is reasonable to assume that each case occurred independently. We also assumed that the risk is constant in each ED, so that

*Y*
_{
i
}|*Î»*
_{
i
}~ Poisson (*E*
_{
i
}
*Î»*
_{
i
}) *i* = 1,..., *n*,

where *Î»*
_{
i
}denotes the relative risk of preterm birth for each ED compared to the whole country [31]. The maximum likelihood estimator of *Î»*
_{
i
}is the unadjusted standardized incidence ratio (SIR), the ratio of observed to expected within each ED [27, 32]. We use a regression approach to adjust the crude SIR to improve its stability where the population at risk may be small [27, 29, 32, 33].

Based on the work of Morris and Wakefield [27], we define the null hypothesis that proximity to source does not influence risk by

*H*
_{0} : *Î»*
_{
i
}= *Î·* for *i* = 1,..., *n*.

Now suppose (*x*
_{0}, *y*
_{0}) denotes the centroid of the Tar pond, (*x*
_{
i
}, *y*
_{
i
}) the centroid of each ED and *d*
_{
i
}the distance between the two centroid. In the absence of an exposure measure that may be attached to each ED, Morris and Wakefield [27] define a natural additive distance/risk model by

*Î»*
_{
i
}= *Î·* {1 + *f*(*d*
_{
i
}; *Î¸*)}

where *Î·* is the background relative risk and *f*(*d*
_{
i
}; *Î¸*) is a function of distance, such that *f* (*d*
_{
i
}; *Î¸*) â†’ 0 as *d*
_{
i
}â†’ âˆž. We will use a reparameterization of the form

*Î»*
_{
i
}= *Î· g*(*d*
_{
i
}; *Î¸*)

so that this model will be consistent with Bithell [34]. With this reparameterization, *g*(*d*
_{
i
}; *Î¸*) â†’ 1 as *d*
_{
i
}â†’ âˆž. Bithell [34] proposed the following distance functions as suitable forms for *g*(*d*
_{
i
}).

*g*
_{1}(*d*
_{
i
}) = exp(*Î±*/*d*
_{
i
})

*g*
_{2}(*d*
_{
i
}) = 1 + *Î¾* exp(-*d*
_{
i
}/*Î²*)

*g*
_{3}(*d*
_{
i
}) = 1 + *Î¾* exp(-(*d*
_{
i
}/*Î³*)^{2})

*g*
_{4}(*d*
_{
i
}) = 1 + *Î¾*/(1 + *d*
_{
i
}/*Î´*)

where *Î±*, *Î²*, *Î³*, and *Î´* represent decay rates. For *g*
_{2}(*d*
_{
i
}), *g*
_{3}(*d*
_{
i
}) and *g*
_{4}(*d*
_{
i
}), 1 + *Î¾* is a measure of the ratio of relative risk at source to that at infinity. Other variants of the Bithell functions have also been proposed [35]. For simplicity, and following Datta *et al.* [32] and Bithell [34], we have chosen

*g*(*d*
_{
i
}; *Î¸*) = *g*
_{1} (*d*
_{
i
}; *Î¸*) = exp(*Î±*/*d*
_{
i
}).

We incorporated the area-level covariates (*z*
_{
i
}) and a measure of the spread of the risk from the Tar pond through a generalized linear model of the form

where *Î±*
_{
o
}= log *Î·*. Hence, *Î·* = exp(*Î±*
_{
o
}) is a measure of the overall inflation of risk in the region under study, *Î±* represents the decay rate and *Ï•* is a vector of parameters of the area-specific covariates. One of the problems associated with the use of equation (7) is overdispersion (heterogeneity or spatial dependency) [36]. In the frequentist framework, we have assessed spatial autocorrelation by using any of the Moran's I statistics [37]. Other alternatives include Geary's C statistic [38] and non-parametric rank-based method [39]. The Bayesian approach is discussed in the next section.

### Bayesian hierarchical modelling

To model the data while accommodating the expected heterogeneity and also including the spatial components (location or relative position of data values) of the data, Bayesian hierarchical modelling [33, 40, 41] was used. The implementation of this modelling was done with WINBUGS and GeoBugs software [42] for modelling aggregated data with plots and convergence diagnostic tests done using the **coda** package in R [43]. The mean or median of the posterior distribution is used as a point estimate of disease risk for each area. The modelling is explained in the following three stages:

#### First-stage: model

We incorporated two measures of overdispersion, so that equation (7) becomes

where *V*
_{
i
}are unstructured random effects included in the model to capture the effects of unknown or unmeasured area level covariates. Hence, exp(*V*
_{
i
}) will be equal to the residual or unexplained relative risk in each ED after adjusting for known area-specific covariates. We have included *U*
_{
i
}in the model to capture our belief that the unstructured random effects (*V*
_{
i
}) may exhibit some spatial structure.

#### Second-stage: overdispersion modelling

We assume that the unstructured random effects which is a measure of heterogeneity is of the form

where {\mathrm{\xcf\u0192}}_{v}^{2} is a measure of the between-area variability of the *V*
_{
i
}. Next, we specify the spatial random effect to model the anticipated spatial dependence of the log of relative risk. For a detailed review on the modelling of the spatial variability see Wakefield *et al.* [26, 41].

We specified the Markov random field (MRF) model using the intrinsic conditional autoregressive (CAR) proposed by Besag *et al.* [40]. We define ED *i* and *j* as neighbours if they share a common boundary [31, 40, 41]. We also define the spatial weights {*W*
_{
ij
}: *i* = 1,..., *n*} as a binary contiguity matrix in which *W*
_{
ij
}= 1 for neighbours and *W*
_{
ij
}= 0 otherwise. Furthermore, *W*
_{
ii
}= 0 and the constraint {\displaystyle {\xe2\u02c6\u2018}_{i=1}^{n}{U}_{i}=0} is imposed for identifiability.

#### Third-stage: prior distributions

At this stage all the parameters (*Î±*
_{
o
}, *Î±*, *Ï•*, {\mathrm{\xcf\u0192}}_{v}^{\xe2\u02c6\u20192} and {\mathrm{\xcf\u0192}}_{u}^{\xe2\u02c6\u20192}) of the model are assigned a prior distribution. *Î±*
_{
o
}was assigned a flat prior which corresponds to a uniform distribution over the whole real line. *Î±*, and *Ï•*
_{
i
}were assigned a normal (0, 10^{5}). The choice of prior for {\mathrm{\xcf\u0192}}_{v}^{\xe2\u02c6\u20192} and {\mathrm{\xcf\u0192}}_{u}^{\xe2\u02c6\u20192} is a very challenging one and it has to be done carefully. Many authors have favoured the use of gamma (a, b) for both {\mathrm{\xcf\u0192}}_{v}^{\xe2\u02c6\u20192} and {\mathrm{\xcf\u0192}}_{u}^{\xe2\u02c6\u20192} because it is a conjugate prior in the normal model but the choice of *a* and *b* is what they have not agreed on [31â€“33, 36, 40, 41]. In our case, we have assigned gamma (0.1, 0.1) to both {\mathrm{\xcf\u0192}}_{v}^{\xe2\u02c6\u20192} and {\mathrm{\xcf\u0192}}_{u}^{\xe2\u02c6\u20192} and carry out sensitivity analysis with all the priors given in [31â€“33, 36, 40, 41].

The models were fitted using Markov Chain Monte Carlo (MCMC) simulation method [44]. Five separate chains starting from different initial values were run for each model. Convergence was assessed by visual examination of time series plots for each parameter and by carrying out the Gelman and Rubin diagnostic test [45] based on the ratio of between to within chain variances for each model. The time series plots with all the five chains superimposed were examined to see whether the chains were mixing well. Goodness of fit was examined using the Deviance Information Criterion (DIC) [46] which consists of two terms, one is a measure of goodness of fit and the other is a penalty for increasing model complexity so that smaller values of DIC indicate a better-fitting model.

We defined a quantity *Ïˆ* = *Ïƒ*
_{
u
}/(*Ïƒ*
_{
u
}+ *Ïƒ*
_{
v
}) as a measure of the relative contribution of *U*
_{
i
}and *V*
_{
i
}to the total overdispersion [33]. So that as *Ïˆ* â†’ 1, spatial variation dominates, while as *Ïˆ* â†’ 0, spatial variation becomes negligible.

### Poisson regression

For *Y*
_{
i
}~ Poisson(*Î¼*
_{
i
}), where *Î¼*
_{
i
}= *Î»*
_{
i
}E_{
i
}(*i* = 1,..., *n*), we assume the generalized linear model [47]. Four models were fitted for the log relative risk (log *Î»*
_{
i
}= log *Î¼*
_{
i
}- log *E*
_{
i
}) in terms of a constant, area-level covariates and the reciprocal of distance. The fitted models are:

log *Î»*
_{
i
}= *Î±*
_{
o
}

log *Î»*
_{
i
}= *Î±*
_{
o
}+ *Î±*/*d*
_{
i
}

log *Î»*
_{
i
}= *Î±*
_{
o
}+ *Ï•*
_{1}
*x*
_{1} + *Ï•*
_{2}
*x*
_{2} + *Ï•*
_{3}
*x*
_{3} + *Ï•*
_{4}
*x*
_{4} + *Ï•*
_{5}
*x*
_{5}

log *Î»*
_{
i
}= *Î±*
_{
o
}+ *Î±*/*d*
_{
i
}+ *Ï•*
_{1}
*x*
_{1} + *Ï•*
_{2}
*x*
_{2} + *Ï•*
_{3}
*x*
_{3} + *Ï•*
_{4}
*x*
_{4} + *Ï•*
_{5}
*x*
_{5}

No random effects or spatial effects was included. In each of the fitted models, log *E*
_{
i
}is used as an offset to account for variations in *Î»*
_{
i
}over the study region. The models were fitted using the quasi-likelihood approach to account for the overdispersion that might occur in the data set. The dispersion parameter, *Îº*, was estimated by the mean of the Pearson *Ï‡*
^{2} statistic.

### Weighted linear regression

A weighted regression approach was carried out to account for the dispersion that might result from the violation of the constant variance assumption in the least squares approach. The last three models (equations 10, 11 and 12) were fitted using weighted least square regression by replacing *Î»*
_{
i
}with the SIR ({\widehat{\mathrm{\xce\xbb}}}_{i} = *Y*
_{
i
}
*/E*
_{
i
}). The weights (*w*
_{
i
}) were set equal to {E}_{i}/{\displaystyle {\xe2\u02c6\u2018}_{i=1}^{n}{E}_{i}} so that the error sum of squares (*Q*) of the weighted linear regression can be written as

Here, we have not included the spatial component of the model because we have seen that the SIR does not exhibit spatial dependency during our exploratory data analysis.

## Results

In the following subsections, we explain the results of the exploratory data analysis and modelling.

### Exploratory data analysis

Plot of unadjusted standardized incidence ratios (SIR) against distance in km from the Tar Pond is shown in Figure 1. From the plots, areas with SIR less than 1 indicate no risk or absolute risk reduction while SIR greater than 1 indicate high risk of preterm birth compared to the rest of Canada. All the high values of SIR occurred within the 20 km distance from the Tar Pond. There is some evidence of decrease in risk from source as we move further away but this will be tested statistically in the next sections. However, as explained earlier, care has to be taken when interpreting the crude SIR. To illustrate this, we plotted the SIR against the population at risk (see Figure 2). This graph clearly shows that areas with low population at risk tend to show high variability in SIR. We accounted for this by using the Poisson model regression for aggregated data.

### Area-specific risk

Following Pampalon and Raymond [21], the following area-specific variables were considered for the analysis: the proportion of persons who have no high school diploma, the rate of unemployment, average income, the proportion of persons who are separated, divorced or widowed, the proportion of single parent families and the proportion of people living alone.

Only five of the variables are available at all the 144 EDs with average income available only in 130 EDs. Hence, we could not compute an adequate measure of deprivation based on the method proposed by Pampalon and Raymond. We decided to assess the effect of each of the variables separately leaving out average income. Distance from the Tar Pond site and all the area-specific variables were plotted against SIR to assess the effect of each. The plots are given in Figure 2.

As explained earlier, points below the dotted line indicate no risk or absolute risk reduction and vice versa. The plot of SIR and the rate of unemployment shows an upward trend with high unemployment rates associated with high SIR. A similar pattern is displayed by the plot of SIR against proportion of persons with no high school diploma. In the plot of the SIR against proportion of separated, divorced and widowed areas with low proportion of separated, divorced and widowed tend to have high SIR. A similar pattern is seen in the plot of SIR and proportion of people living alone. There is no obvious pattern in the plot of SIR against proportion of single parent families.

### Test for spatial dependency

One of the objectives of this study is to check for any clustering of events around the Tar Pond that may be significant in explaining the variation in preterm birth rates. This was done by plotting the maps of all the variables (Figure 3, 4, 5, 6, 7, 8) and visually assessing whether there is any clustering, and by performing a formal test of clustering using the Moran I statistic [37].

The map of SIR (Figure 3) was examined to see whether there is a cluster of high SIR around the Tar Pond or a decrease in the SIR as we move further away from the Tar Pond but neither of the two is obvious from the map. The maps of all the area-covariates (Figure 4, 5, 6, 7, 8) were examined to assess whether there is any spatial pattern. The plot of percentage of people living alone shows a pattern with the highest proportion of people living alone occurring within the 20 km radius of the Tar Pond site. The plot of the rate of unemployment to population also shows that the unemployment to population ratio decreases as we move further away from the Tar Pond. Furthermore, the proportion of persons who have no high school diploma also displays some spatial pattern with some of the areas close to the Tar Pond having high proportion. Existence of spatial autocorrelation was also tested formally using the Moran I test. Results of the spatial autocorrelation analysis given in Table 1, show the correlation, standard error, corresponding normal statistic and associated *p*-value. The results indicate that there is significant autocorrelation for all the covariates with the exception of SIR (*p* = 0.5387). However, a more confirmatory test is required. The result of the autocorrelation not being statistically significant could be due to different reasons (1) there is none; or (2) small populations which may give rise to high SIR [39].

### Bayesian analysis results

The following four models were fitted using the five area covariates available at all the 144 EDs and a measure of proximity (*d*
_{
i
}): Model 1 which contains no covariates and corresponds to the null model; Model 2 which contains the distance measure alone; Model 3 with deprivation covariates alone; and finally, Model 4 with distance and deprivation covariates.

The Gelman Rubin Plots shows that the "shrinkage factor" for each parameter approaches 1. Hence, all chains have escaped the influence of their starting points. The autocorrelation plots shows that autocorrelation decrease rapidly from lag 1. On this basis, the first 2000 samples of each chain were discarded as 'burn-in'; each chain was run for a further 10,000 iterations, and posterior estimates were based on pooling the 5 Ã— 10, 000 samples for each model. This gave Monte Carlo standard errors that are less than 1% of the posterior standard deviation for each parameter. All the plots including the posterior density of each parameter after convergence are provided as additional file 1 (Bayesian diagnostic plots). All the plots were produced with the **coda** package for R [43].

Table 2 gives the summaries of the posterior distribution under each model. From Table 2, we can see that estimates of *Î±* in both Models 2 and 4 are negative, and the 95% credible intervals contain zero which shows that there is no significant association between distance from the Tar Ponds and risk of preterm birth. The 95% credible interval for *Ï•*
_{
i
}(*i* = 1,..., 5) in Models 3 and 4 also contain zero which shows that the change in risk cannot be explained by any of the socio-economic covariates. For each of the models, *Î·*, a measure of the overall risk, was found to be greater than 1 which is evidence that there is an increased risk of preterm birth in the entire Cape Breton region compared to the rest of Canada.

The parameters, *Ïƒ*
_{
u
}and *Ïƒ*
_{
v
}change only slightly over the four models. From Table 2, the 95% credible intervals for *Ïˆ* for each model contain 0.5. Hence, there is no clear evidence that the spatial structure dominates the random effect in any of the models. From the results of Table 2, the DIC increases as more variables are added into the model. Hence, Model 1 is better than all three other models. Finally, the posterior median of the relative risk of preterm birth were plotted against distance in km from the Tar pond. The plot is shown in Figure 9.

A comparison of the plots with that of Figure 1 shows there is a high relative risk (greater than 1) of preterm birth in almost all the enumeration districts. However, the risk is not as high in Figure 9 as in Figure 1. The plots also show that there is no clear relationship between distance and risk. This result is consistent with the results of two of the studies conducted in this area using primary data [1, 3]. They both concluded that a causal association between preterm births and maternal/residential proximity to the Tar Ponds could not be inferred from the statistical analysis. A map showing the posterior median of the relative risk of preterm births for Model 4 is shown in Figure 10.

### Poisson regression analysis results

The results of all four models are displayed in Table 3. For each of the fitted models, *Îº* was estimated to be approximately equal to 1, a condition that shows that there is no evidence of overdispersion. The Wald confidence intervals shown in Table 3 are based on the asymptotic normality of the parameter estimators. From the table, we can see that the estimated *Î±* in both Models 2 and 4 is negative, and the 95% Wald confidence intervals contain zero which is evidence that there is no decrease in risk as distance from Tar pond decreases.

The 95% confidence intervals for *Ï•*
_{
i
}(*i* = 1,..., 5) in Models 3 and 4 also contain zero which shows that the covariates are not significant factors in risk of preterm birth. This result shows that none of the variables make significant contributions to the explanation of the variation in risk. Recall that *Î·* = exp(*Î±*
_{
o
}) is a measure of the overall mean of the relative risk in the region under study. For each of the models, Table 3 gives the estimates of the overall risk together with its 95% confidence intervals. The overall mean of the relative risk is greater than 1 for each model which indicates that there is elevated risk of preterm birth across the whole of the Cape Breton municipality.

### Weighted regression results

The result of the fit is given in Table 4. None of the variables is significant in explaining the increased risk of preterm birth. The residual standard error for Model 4 was estimated to be 0.02347 on 137 degrees of freedom. Multiple *R*-square is 0.09795 which shows that the variables in the model account for about 10% of the total variation in the risk. The *F*-statistic for the regression relationship was estimated to be 2.479 on 6 and 137 degrees of freedom and the associated *p*-value is 0.0262. This shows that at least one of the parameters (*Î±*, and *Ï•*
_{
i
}) does not equal zero. Hence, there is evidence of a regression relationship between the dependent variable (*Y*
_{
i
}) and the area-specific variables (*z*
_{
i
}).

#### Test for autocorrelation

Next, Moran's I test was also carried out to examine whether there is spatial autocorrelation in the residuals. The result gave a correlation of -0.01628, variance of 0.002541 and standard error of 0.05041. In addition, the normal test statistic was -0.1843 with associated 2-sided *p*-value equal to 0.8538. These results are sufficient to conclude that there is no spatial autocorrelation in the residuals. Hence, there was no need to use spatial regression modelling.

## Discussion and conclusion

In practice, a typical spatial regression modelling will start with the examination of the dependent variable for spatial dependency. This can be done with Moran's I statistic or Geary C statistic. If there is no spatial pattern, then ordinary least squares or weighted least squares is sufficient to model the data. On the other hand if the dependent variable shows a spatial patterns, the first order spatial pattern can be incorporated at the beginning of the modelling using an adjacency matrix. However, great care has to be taken when using spatial modelling. First, some of the available parametric tests for measuring spatial autocorrelation, including Moran's I [37] and Geary's C [38] methods, are not robust when the data is sparse. The non-parametric rank-based method [39] is not available in most standard statistical software. Second, the structure of the adjacency matrix may affect the result. Hence, it must be chosen carefully. This research is part of a project done to assess the effect of maternal proximity to the hazardous waste from the Sydney Tar Pond, Nova Scotia. Two question have been addressed in this project: first, is maternal proximity to hazardous waste and pollution from the Sydney Tar Pond sites associated with increased risk of preterm birth? Second, how much of the variation in risk of preterm birth can be explained by socioeconomic inequalities across the study region?

In addressing these questions frequentist and Bayesian methods were employed. In the frequentist approach, Poisson regression for aggregated data and weighted least squares were fitted using distance from the Tar Pond and the following area specific-covariates: the proportion of persons who have no high school diploma; the rate of unemployment to population; the proportion of persons who are separated, divorced or widowed; the proportion of single parent families; and the proportion of people living alone. The same models were fitted using a Bayesian hierarchical model incorporating both structured and unstructured random effects to account for model overdispersion.

Our intention was to combine all of the area covariates to form the deprivation index, but income data were not available in 14 of the 144 enumeration districts included in the study. So the effect of each variable was assessed independently. The overall estimate of relative risk of preterm birth was found to be greater than 1 for almost all the enumeration districts. Also, none of the area covariates in the model is significant in explaining the risk of preterm births.

There was no evidence of any decrease in risk as we move away from the Tar Pond site. The results of both the weighted least squares and the quasi-likelihood Poisson regression agree with the result from the Bayesian hierarchical modelling which incorporates the spatial effects. The result of the Bayesian modelling shows that there is no significant spatial association of risk in the area studied. There was no obvious clustering of outcomes around the Tar Pond significant enough to find an association between maternal proximity to the Sydney Tar Ponds and risk of preterm birth. Although the three methods lead to similar results, we think the three-stage Bayesian hierarchical modelling is one of the best approaches for handling this problem. First, it allows the modelling of both sources of overdispersion, heterogeneity and spatial dependence or clustering in one model, and second, it allows the estimation of SIR with adjustment of sparse data. The least suggested method is the weighted least square method because it does not lend itself to some of the assumptions of Poisson models.

The following are some of the limitations of this research. First, data were not available for 14 of the Enumeration districts. Hence, they were omitted from our analysis but the effects of this on spatial dependency or our conclusion are not known. Second, we have based our analysis on the 1996 data but we do not have any evidence of whether the exposure from the Tar Pond has decreased since 1996. Third, the use of aggregated data may increase the potential for ecological bias which can occur due to the differences between individual and group-level estimates of disease risk. In particular, factors that affect length of gestation such as parity have not been directly adjusted for in the modelling.

Our experience with this project shows that more work is still needed in this area. None of the models was able to predict more that 10% of what we would like to know. The future plans include aggregating the data for up to ten years and modelling using other forms of *g*(*d*; *Î¸*). We will also consider using individual level data and incorporating other covariates. The study shows that there is an elevated risk of preterm births, which appears to be uniform across the whole of the Cape Breton regional municipality as shown by all the methods used. This shows that the pollution may be occurring on a wider scale and over time may have affected the ability to differentiate the EDs in terms of amount of exposure. A direct comparison of the Cape Breton regional municipality with other nearby municipalities may help answer some of the remaining questions.

## References

Burra TA, Elliott SJ, Eyles JD, Kanaroglou PS, Wainman BC, Muggah H: Effects of residential exposure to steel mills and coking works on birth weight and preterm births among residents of Sydney, Nova Scotia. The Canadian Geographer. 1996, 50 (2): 242-255. 10.1111/j.0008-3658.2006.00138.x.

Nova Scotia Department of Health and the Cape Breton District Health Authority: Lead and arsenic biological testing program in residential areas near the coke ovens site. 2001, [Accessed September 10, 2004]., [http://www.gov.ns.ca/health/downloads/Full_Report.pdf]

Haalboom B, Elliott SJ, Eyles J, Muggah H: The risk society at work in the Sydney 'Tar Ponds'. The Canadian Geographer. 2006, 50 (2): 227-241. 10.1111/j.0008-3658.2006.00137.x.

Dolk H, Vrijheid M, Armstrong B, Abramsky L, Bianchi F, Garne E, Nelen V, Robert E, Scott JES, Stone D, Tenconi R: Risk of Congenital anomalies near hazardous waste landfill sites in Europe: the EUROHAZCON study. The Lancet. 1998, 352: 423-427. 10.1016/S0140-6736(98)01352-X.

Elliot P, Briggs D, Morris S, de Hoogh C, Hurt C, Jensen TK, Maitland I, Richardson S, Wakefield J, Jarup L: Risk of adverse birth outcomes in populations living near landfill site. British Medical Journal. 2001, 323: 363-368. 10.1136/bmj.323.7309.363.

Fielder HMP, Poon-King CM, Palmer SR, Moss N, Coleman G: Assessments of impact on health of residents living near the Nant-y-Gwyddon landfill site: retrospective analysis. British Medical Journal. 2000, 320: 19-23. 10.1136/bmj.320.7226.19.

Geschwind SA, Stolwijk JAJ, Bracken M, Fitzgerald E, Stark A, Olsen C, Melius J: Risk of congenital malformations associated with proximity to hazardous waste sites. American Journal of Epidemiology. 1992, 135: 1197-1207.

Gilbertson M, Brophy J: Community health profile of Windsor, Ontario, Canada: anatomy of a great lakes area of concern. Environmental Health Perspectives. 2001, 109 (suppl 6): 827-843. 10.2307/3454645.

Goldman LR, Paigen B, Magnant MM, Highland JH: Low birth weight, prematurity and birth defects in children living near the hazardous waste site, Love Canal. Hazardous Waste and Hazardous Materials. 1985, 2: 209-223.

Viana NJ, Polan AK: Incidence of low birth weight among Love Canal Residents. Science. 1984, 226: 1217-1219. 10.1126/science.6505690.

Berry M, Bove F: Birth weight reduction associated with residence near a hazardous waste landfill. Environmental Health Perspectives. 1997, 105: 856-861. 10.2307/3433704.

Goldberg MS, Goulet L, Riberdy H, Bonvalot Y: Low birth weight and preterm births among infants born to women living near a municipal solid waste landfill site in Montreal, Quebec. Environmental Research. 1995, 69: 37-50. 10.1006/enrs.1995.1023.

Baker DB, Greenland S, Mendlein J, Harmon P: A health study of two communities near the Stringfellow waste disposal site. Archives of Environmental Health. 1988, 43: 325-334.

Kharrazi M, von Behren J, Smith M, Lomas T, Armstrong M, Broadwin R, Blake E, Mclaughin B, Worstell G, Goldman L: A community-based study of adverse pregnancy outcomes near a large hazardous waste landfill in California. Toxicology and Industrial Health. 1997, 12: 211-224.

Shaw GM, Schulman J, Frisch JD, Cummins SK, Harris JA: Congenital malformations and birth weight in areas with potential environmental contamination. Archives of Environmental Health. 1992, 47: 147-154.

Baibergenova A, Kudyakov R, Zdeb M, Carpenter DO: Low birth weight and residential proximity to PCB-contaminated waste sites. Environmental Health Perspectives. 2003, 111: 1352-1357.

Rylander L, Stromberg U, Hagmar L: Lowered birth weight among infants born to women with a high intake of fish contaminated with persistent organochlorine compounds. Chemosphere. 2000, 40: 1255-1262. 10.1016/S0045-6535(99)00377-X.

Michal F, Grigor KM, Negro-Vilar A, Skakkebaek NE: Impact of the environment on reproductive health: executive summary. Environmental Health Perspectives. 1993, 101 (Suppl 2): 159-167. 10.2307/3431390.

Sullivan FM: Impact of the environment on reproduction from conception to parturition. Environmental Health Perspectives. 1993, 101: 13-18. 10.2307/3431370.

Jolley D, Jarman B, Elliot P: Socio-economic Confounding. Geographical and Environmental Epidemiology: Methods for Small-Area Studies. Edited by: Elliot P, Cuzick J, English D, Stern R. 1992, New York: Oxford press, 115-124.

Pampalon R, Raymond G: A deprivation index for health and welfare planning in Quebec. Chronic Diseases in Canada. 2000, 21 (3): 104-113.

Townsend P: Deprivation. Journal of Social Policy. 1987, 16 (2): 125-146.

Carstairs V, Morris R: Deprivation and Health in Scotland. 1991, UK: Aberdeen University Press

Vrijheid M: Health effects of residence near hazardous waste landfll sites: a review of epidemiologic literature. Environmental Health Perspectives. 2000, 108 (suppl 1): 101-112. 10.2307/3454635.

Diggle PJ, Morris S, Elliot P, Shaddick G: Regression modelling of disease risk in relation to point sources. Journal of the Royal Statistical Society, Series A. 1997, 160: 491-505.

Wakefield JC, Morris SE: The Bayesian modelling of disease risk in relation to a point source. Journal of the American Statistical Association. 2001, 96: 77-91. 10.1198/016214501750332992.

Morris SE, Wakefield JC: Assessment of disease risk in relation to a prespecified source. Spatial Epidemiology: Methods and Application. Edited by: Elliot P, Wakefield JC, Best NG, Briggs DJ. 2000, New York: Oxford University press, 153-184.

Lawson AB: On the analysis of mortality events associated with a prespecified fixed point. Journal of the Royal Statistical Society, Series A. 2001, 56: 363-377.

Lawson AB, Biggeri AB, Boehning D, Lesaffre E, Viel JF, Clark A, Schlattmann P, Divino F: Disease mapping models: an empirical evaluation. Statistics in Medicine. 2000, 19: 2217-2241. 10.1002/1097-0258(20000915/30)19:17/18<2217::AID-SIM565>3.0.CO;2-E.

Tara ARB: Reproductive and psychological health of women living in the vicinity of the Tar Ponds, Sydney, Nova Scotia. Master's thesis. 2002, McMaster University, Geography and Geology Department

Clayton D, Kaldor J: Empirical bayes estimates of age-standardized relative risks for use in disease mapping. Biometrics. 1987, 43: 671-682. 10.2307/2532003.

Datta G, Ghosh M, Waller LA: Hierarchical and empirical bayes methods for environmental risk assessment. Handbook of Statistics. Edited by: Sen PK, Rao CR. 2000, Elsevier Science B.V, 18: 223-245.

Best NG, Arnold RA, Thomas A, Waller LA, Conlon EM: Bayesian models for spatially correlated disease and exposure data. Bayesian Statistics 6. Edited by: Bernardo JM, Berger JO, Dawid AP, Smith AFM. 1999, New York: Oxford University Press, 131-156.

Bithell JF: The choice of test for detecting raised disease risk near a point source. Statistics in Medicine. 1995, 14: 2309-2322. 10.1002/sim.4780142104.

Diggle PJ: A point process modelling approach to raised incidence of a rare phenomena in the vicinity of a prespecified point. Journal of the Royal Statistical Society, Series A. 1990, 153: 340-362.

Wakefield JC, Kelsall JE, Morris SE: Clustering, cluster detection, and spatial variation in risk. Spatial Epidemiology: Methods and Application. Edited by: Elliot P, Wakefield JC, Best NG, Briggs DJ. 2000, New York: Oxford University press, 129-152.

Moran PAP: The interpretation of statistical maps. Journal of the Royal Statistical Society, Series B. 1948, 10: 243-251.

Geary RC: The contiguity ratio and statistical mapping. The Incorporated Statistician. 1954, 5: 115-145. 10.2307/2986645.

Walter SD: Assessing spatial patterns in disease rates. Statistics in Medicine. 1993, 12: 1885-1894. 10.1002/sim.4780121914.

Besag J, York J, Mollie A: Bayesian image restoration with two applications in spatial statistics. Annals of the institute of Statistics and Mathematics. 1991, 43 (1â€“59):

Wakefield JC, Best NG, Waller L: Bayesian approaches to disease mapping. Spatial Epidemiology: Methods and Application. Edited by: Elliot P, Wakefield JC, Best NG, Briggs DJ. 2000, New York: Oxford University press, 105-126.

Spiegelhalter DJ, Thomas A, Best NG: WINBUGS User Manual, version 1.2. 1998, UK: Cambridge, [Accessed October 30, 2004]., [http://www.mrc-bsu.cam.ac.uk/bugs]

Plummer M, Best NG, Cowles MK, Vines SK: Output analysis and diagnostics for Markov Chain Monte Carlo, version 0.7-1. 2004, [Accessed October 30, 2004]., [http://www-fis.iarc.fr/coda/]

Gilks WR, Richardson S, Spiegelhater DJ: Introducing Markov Chain Monte Carlo. Markov Chain Monte Carlo in Practice. Edited by: Gilks WR, Richardson S, Spiegelhalter DJ. 1996, London: Chapman and Hall, 1-17.

Gelman A, Rubin DB: Inference from iterative simulation using multiple sequences. Statistical Science. 1992, 7: 457-511.

Spiegelhalter DJ, Best NG, Carlin BP: Bayesian deviance, the effective number of parameters, and the comparison of arbitrarily complex models. 1998, [Accessed October 15, 2004]., [http://www.med.ic.ac.uk/divisions/60/biostat/dic.ps]

McCullagh P, Nelder JA: Generalized Linear Models. 1989, London: Chapman and Hall

### Pre-publication history

The pre-publication history for this paper can be accessed here:http://www.biomedcentral.com/1471-2288/7/39/prepub

## Acknowledgements

This work was funded through an NSERC Discovery Grant to one of the authors (AC). We thank Prof. Pavlos Kanaroglou of the School of Geography and Earth Sciences for giving us the data used in this project and allowing the use of the Center for Spatial Analysis at McMaster University. Our appreciation also goes to Patrick Deluca, a member of the Center for Spatial Analysis at McMaster University, for his assistance. We thank the reviewers and associate editors for their invaluable suggestions that substantially improved the manuscript.

## Author information

### Authors and Affiliations

### Corresponding author

## Additional information

### Competing interests

The author(s) declare that they have no competing interests.

### Authors' contributions

ASI and AC developed the project. All authors participated in preparing this manuscript.

## Electronic supplementary material

### 12874_2006_222_MOESM1_ESM.pdf

Additional file 1: Bayesian diagnostic plots. Gelman Rubin plots from five parallel chains, kernel density plots of sampled values for parameters of model 4 based on five pooled chains and autocorrelation plot for each chain. (PDF 652 KB)

## Authorsâ€™ original submitted files for images

Below are the links to the authorsâ€™ original submitted files for images.

## Rights and permissions

**Open Access**
This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License (
https://creativecommons.org/licenses/by/2.0
), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## About this article

### Cite this article

Ismaila, A.S., Canty, A. & Thabane, L. Comparison of Bayesian and frequentist approaches in modelling risk of preterm birth near the Sydney Tar Ponds, Nova Scotia, Canada.
*BMC Med Res Methodol* **7**, 39 (2007). https://doi.org/10.1186/1471-2288-7-39

Received:

Accepted:

Published:

DOI: https://doi.org/10.1186/1471-2288-7-39

### Keywords

- Preterm Birth
- Spatial Autocorrelation
- Nova Scotia
- High School Diploma
- Standardize Incidence Ratio