A semi-parametric approach to estimate risk functions associated with multi-dimensional exposure profiles: application to smoking and lung cancer

Background: A common characteristic of environmental epidemiology is the multi-dimensional aspect of exposure patterns, frequently reduced to a cumulative exposure for simplicity of analysis. By adopting a flexible Bayesian clustering approach, we explore the risk function linking exposure history to disease. This approach is applied here to study the relationship between different smoking characteristics and lung cancer in the framework of a population-based case-control study.
Methods: Our study includes 4658 males (1995 cases, 2663 controls) with full smoking history (intensity, duration, time since cessation, pack-years) from the ICARE multi-centre study conducted from 2001 to 2007. We extend Bayesian clustering techniques to explore predictive risk surfaces for covariate profiles of interest.
Results: We were able to partition the population into 12 clusters with different smoking profiles and lung cancer risks. Our results confirm that, compared with intensity, duration is the predominant driver of risk. On the other hand, using pack-years of cigarette smoking as a single summary leads to a considerable loss of information.
Conclusions: Our method estimates the disease risk associated with a specific exposure profile by robustly accounting for the different dimensions of exposure, and should be helpful more generally in giving further insight into the effects of exposures that are accumulated through different time patterns.

where t_m(l, s) denotes the Student's t-distribution with m degrees of freedom, location l and scale s, and R is the number of fixed-effect coefficients. eFigure 1 provides a directed acyclic graph (DAG) of our model.

eAppendix 2
The code for profile regression is freely available in the PReMiuM package e1 for the R statistical software e2. Once the data and the package have been loaded in R, profile regression can be implemented in a few lines of code.
We ran our sampler for 20000 iterations, discarding the first 10000 as burn-in. It is a well-known problem for samplers of such models that, due to the highly multi-modal nature of the clustering space, the sampler can sometimes struggle to move between equally well-supported local modes, so that two separate runs may suggest different clusterings, and the result may be sensitive to the initial allocation of the chain. Guidance for the PReMiuM sampler suggests initially allocating the individuals to a larger number of clusters e3, and we also use the marginal model posterior as a measure of convergence e3.

eAppendix 3
At each sweep r of the MCMC sampler we can define an additional "allocation" variable Z_p^r corresponding to each pseudo-profile p. These variables do not affect the fit of the model, which is determined wholly by the observed data. However, for each pseudo-profile we can compute the posterior probabilities p(Z_p^r = c | x_p, Θ^r, D), where D is the observed data (y, x_1, . . . , x_N).
With these probabilities we can construct a cluster-averaged estimate of the log odds θ for each pseudo-profile at each sweep. Specifically, writing θ_c^r for the log odds associated with cluster c at sweep r, the cluster-averaged log odds for pseudo-profile p is θ_p^r = Σ_c p(Z_p^r = c | x_p, Θ^r, D) θ_c^r. Looking at the density of these log odds (or of the log odds ratio with respect to the non-smoking reference pseudo-profile) over MCMC sweeps gives us an estimate of the effect of a particular pseudo-profile, which can be compared to other pseudo-profiles, allowing us to derive a better understanding of the role of specific covariates.
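To make the averaging step concrete, the following sketch (written in Python rather than R; the cluster log odds and allocation probabilities are made-up toy numbers, not values from the study) computes the cluster-averaged log odds for two pseudo-profiles at a single sweep and the log odds ratio against the non-smoking reference:

```python
# Toy illustration of cluster-averaged log odds for pseudo-profiles.
# theta[c] is the log odds for cluster c at the current MCMC sweep;
# alloc[p][c] plays the role of p(Z_p = c | x_p, Theta, D), the
# allocation probability of pseudo-profile p to cluster c.
# All numbers are illustrative only.

theta = [-2.0, 0.5, 1.5]  # cluster-specific log odds at this sweep

alloc = {
    "non_smoker": [0.90, 0.08, 0.02],    # reference pseudo-profile
    "heavy_smoker": [0.05, 0.25, 0.70],
}

def cluster_averaged_log_odds(probs, theta):
    """Weighted average of cluster log odds: sum_c p(Z = c) * theta_c."""
    return sum(p * t for p, t in zip(probs, theta))

ref = cluster_averaged_log_odds(alloc["non_smoker"], theta)
heavy = cluster_averaged_log_odds(alloc["heavy_smoker"], theta)

# Log odds ratio of the heavy-smoker profile vs the non-smoking reference.
log_or = heavy - ref
print(round(ref, 3), round(heavy, 3), round(log_or, 3))
```

Repeating this computation at every sweep, and with the allocation probabilities of that sweep, yields the MCMC sample of log odds (and log odds ratios) whose density is examined in the text.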

eAppendix 4
Unlike for the observed individuals, where posterior allocation probabilities depend on both the profile and response sub-models, for the predictive pseudo-profiles the allocation probabilities are determined only by the profile sub-model. Furthermore, missing variables in the pseudo-profiles are ignored when the allocation probabilities are computed. The impact of this is subtle: in effect, the missing value reflects the covariate patterns present in the main sample. For example, if high intensity is associated with high duration, a high-intensity profile is likely to be assigned to a cluster that is also characterised by high duration, so that the missing duration is effectively treated as high duration for this profile. For a different pseudo-profile (e.g. low intensity), the missing value might be treated differently (e.g. as low duration). Because of this, the derived marginal effect of intensity has to be interpreted as a population-average effect, over a population with similar characteristics to that under study.

eAppendix 5
We performed a small simulation exercise to illustrate the performance of profile regression, CART and logistic regression for predictive modelling, both when no apparent structure is present in the covariates and in the presence of a strong signal.
Scenario 1. We first generated 10 binary predictors x = (x_1, . . . , x_10) and a case-control disease status for 2500 unrelated subjects. Predictor variables were generated from a Bernoulli distribution with probability 0.5 and were individually unrelated to the outcome y, so that no single predictor was a statistically significant predictor of the outcome. We further considered two interaction terms, which were highly associated with the outcome. We considered all single predictors and two-way interactions to find the best logistic model; the multiple logistic approach with a stepwise selection procedure identified four single predictors and four interactions, which define the best logistic model. With this model, we calculated the probability that a subject has the disease for given values of the predictors.
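A generating process of this kind can be sketched as follows (in Python rather than R; which pairs of predictors interact, the effect sizes and the seed are our own illustrative assumptions, not values from the study):

```python
import random
from math import exp

# Sketch of a Scenario 1-style generating process: 10 Bernoulli(0.5)
# predictors that are individually unrelated to the outcome, with the
# outcome driven only by two interaction terms.

random.seed(1)
N = 2500

def expit(z):
    """Inverse logit: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + exp(-z))

data = []
for _ in range(N):
    x = [int(random.random() < 0.5) for _ in range(10)]
    # Outcome depends only on two interactions (here x1*x2 and x3*x4;
    # the choice of pairs and coefficients is illustrative).
    eta = -0.5 + 1.5 * (x[0] * x[1]) + 1.5 * (x[2] * x[3])
    y = int(random.random() < expit(eta))
    data.append((x, y))

cases = sum(y for _, y in data)
print(cases, N - cases)
```

Because each predictor only matters through an interaction, a main-effects-only analysis of such data finds no significant single predictors, which is the point of this scenario.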
Scenario 2. Next, in order to examine a situation where the signal was stronger, we simulated a data set of sample size N = 2500 in which individuals fell into one of three subgroups with probabilities ψ_1 = 0.24, ψ_2 = 0.51 and ψ_3 = 0.25, and we chose the probabilities that each covariate falls in one of its 5 categories (these probabilities correspond to the φ_{c,j} in our approach) as shown in eFigure 2 below. For each individual i we sampled a cluster c_i ∈ {1, 2, 3} and, given these clusters, the vector of values for the 10 covariates using the φ and ψ described above.
We then generated β_c ∼ N(0, σ²_β) and calculated each individual i's true probability of disease as logit(p_i) = β_{c_i}. To check how well the different models fit the simulated data we used logistic-regression-type residuals and the misclassification error. Thus, we judged the quality of predictions by comparing predicted ("fitted") probabilities p̂_i to the true probabilities p_i using the root mean square error, RMSE_p = sqrt( (1/N) Σ_{i=1}^N (p̂_i − p_i)² ), and the mean absolute error, MAE_p = (1/N) Σ_{i=1}^N |p̂_i − p_i|. We also computed similar quantities with the generating probabilities p_i replaced by the values of the observed outcome y_i; we denote these RMSE_y and MAE_y respectively. The misclassification error criterion represents the proportion of subjects misclassified as cases or controls.
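The Scenario 2 generating process and the error measures can be sketched as follows (in Python rather than R; σ_β, the φ category probabilities and the seed are our own illustrative values, the study's φ values being those of eFigure 2):

```python
import random
import math

random.seed(2)
N = 2500
psi = [0.24, 0.51, 0.25]  # cluster weights, as in the text
sigma_beta = 2.0          # illustrative value, not from the study

def peaked(k):
    """Five category probabilities, peaked at category k (illustrative)."""
    return [0.7 if cat == k else 0.075 for cat in range(5)]

# phi[c][j]: category probabilities for covariate j in cluster c,
# made distinct across clusters to mimic a strong signal.
phi = [[peaked(c + 1) for _ in range(10)] for c in range(3)]

def sample_categorical(probs):
    u, cum = random.random(), 0.0
    for k, p in enumerate(probs):
        cum += p
        if u < cum:
            return k
    return len(probs) - 1

beta = [random.gauss(0.0, sigma_beta) for _ in range(3)]  # beta_c ~ N(0, sigma^2)

clusters, X, p_true, y = [], [], [], []
for _ in range(N):
    c = sample_categorical(psi)
    clusters.append(c)
    X.append([sample_categorical(phi[c][j]) for j in range(10)])
    p = 1.0 / (1.0 + math.exp(-beta[c]))  # logit(p_i) = beta_{c_i}
    p_true.append(p)
    y.append(1 if random.random() < p else 0)

def rmse(fitted, target):
    """Root mean square error between two probability vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(fitted, target)) / len(fitted))

def mae(fitted, target):
    """Mean absolute error between two probability vectors."""
    return sum(abs(a - b) for a, b in zip(fitted, target)) / len(fitted)

# RMSE_y / MAE_y replace the true p_i by the observed outcomes y_i.
print(rmse(p_true, p_true), mae([float(v) for v in y], p_true))
```

Fitted probabilities from any of the compared models would simply be substituted for `p_true` as the first argument of `rmse` and `mae`.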
In eTable 3 we give the measures of fit for profile regression, logistic regression and CART for both simulation studies. For Scenario 1, where the patterns are not clearly distinct, profile regression and logistic regression had similar performance, with profile regression slightly better with respect to MAE_y and the misclassification error. CART seemed to struggle more to detect the interactions in this case, showing the worst performance of the three methods on all the measures of fit considered. For Scenario 2, where the patterns are more distinct than in Scenario 1, the results suggest that profile regression is competitive with CART for predicting the outcome or the true probabilities of disease. Logistic regression showed a loss of predictive performance on this dataset.

eAppendix 6
We report a comparison of profile regression with CART on our real data. Because of the perfect collinearity between smoking status and the smokers' covariates once the latter are discretised, we cannot directly compare with logistic regression. CART is available in the tree e4 package for the R statistical software e2. The analysis using this method consists of three steps. First, a classification tree is built using standard recursive partitioning and a splitting rule; we used the Gini criterion for this step, which minimises the heterogeneity of a group of subjects with respect to the outcome. At this point the maximal tree that has been produced probably overfits the data. The second step is pruning: we used 5-fold cross-validation and chose the tree that minimises the misclassification error. The resulting tree had 55 terminal nodes, a result that is difficult to interpret. Moreover, CART requires datasets with no missing observations, so we based the comparison on a reduced dataset of 4643 subjects. We split our data into two groups: 2322 people sampled from the population to train the models, and the remaining 2321 used to measure the resulting models' prediction errors.
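To make the splitting rule concrete, the following sketch (in Python rather than R; the case/control counts are made up) computes the Gini impurity of a node and the impurity reduction for a candidate binary split, the quantity that recursive partitioning greedily maximises at each node:

```python
def gini(cases, controls):
    """Gini impurity of a node: 2 * p * (1 - p), where p is the case proportion."""
    n = cases + controls
    if n == 0:
        return 0.0
    p = cases / n
    return 2.0 * p * (1.0 - p)

def split_gain(parent, left, right):
    """Reduction in size-weighted Gini impurity when parent is split into left/right.

    Each argument is a (cases, controls) pair of counts.
    """
    n = parent[0] + parent[1]
    n_l = left[0] + left[1]
    n_r = right[0] + right[1]
    return gini(*parent) - (n_l / n) * gini(*left) - (n_r / n) * gini(*right)

# Made-up counts: a split that separates cases from controls yields a
# large gain; a split that leaves both children like the parent yields zero.
print(split_gain((50, 50), (40, 10), (10, 40)))   # informative split
print(split_gain((50, 50), (25, 25), (25, 25)))   # uninformative split
```

Building the maximal tree means repeatedly choosing, at each node, the covariate split with the largest such gain; pruning then trades this in-sample purity against cross-validated misclassification error.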
eFigures
eFigure 1: Directed acyclic graph (DAG) for the profile regression model.