- Technical advance
- Open Access
On the proportional hazards model for occupational and environmental case-control analyses
BMC Medical Research Methodology volume 13, Article number: 18 (2013)
Case-control studies are generally designed to investigate the effect of exposures on the risk of a disease. Detailed information on past exposures is collected at the time of study. However, only the cumulated value of the exposure at the index date is usually used in logistic regression. A weighted Cox (WC) model has been proposed to estimate the effects of time-dependent exposures. The weights depend on the age conditional probabilities to develop the disease in the source population. While the WC model provided more accurate estimates of the effect of time-dependent covariates than standard logistic regression, the robust sandwich variance estimates were lower than the empirical variance, resulting in a low coverage probability of confidence intervals. The objectives of the present study were to investigate through simulations a new variance estimator and to compare the estimates from the WC model and standard logistic regression for estimating the effects of correlated temporal aspects of exposure with detailed information on exposure history.
We proposed a new variance estimator using a superpopulation approach, and compared its accuracy to the robust sandwich variance estimator. The full exposure histories of source populations were generated and case-control studies were simulated within each source population. Different models with selected time-dependent aspects of exposure such as intensity, duration, and time since cessation were considered. The performances of the WC model using the two variance estimators were compared to standard logistic regression. The results of the different models were finally compared for estimating the effects of correlated aspects of occupational exposure to asbestos on the risk of mesothelioma, using population-based case-control data.
The superpopulation variance estimator provided better estimates than the robust sandwich variance estimator and the WC model provided accurate estimates of the effects of correlated aspects of temporal patterns of exposure.
The WC model with the superpopulation variance estimator provides an alternative analytical approach for estimating the effects of time-varying exposures with detailed history exposure information in case-control studies, especially if many subjects have time-varying exposure intensity over lifetime, and if only one control is available for each case.
Population-based case-control studies are widely used in epidemiology to investigate the association between environmental or occupational exposures over lifetime and the risk of cancer or other chronic diseases. Many of the exposures of interest are protracted and a huge amount of information is often retrospectively collected for each subject about his/her potential past exposure over lifetime. For example, for occupational exposures, the whole occupational history is usually investigated for each subject, and different methods exist to estimate the average dose of exposure at each past job [1–3]. However, only the cumulated estimated dose of exposure at the index age (age at diagnosis for cases and at interview for controls) is usually used in standard logistic regression analyses. Such approach does not use the (retrospective) dynamic information available on the exposure at different ages during lifetime.
A time-dependent weighted Cox (WC) model has recently been proposed to incorporate this dynamic information on exposure, in order to more accurately estimate the effect of time-dependent exposures in population-based case-control studies . The WC model consists in using age as the time axis and weighting cases and controls according to their case-control status and the age conditional probabilities of developing the disease in the source population. The weights proposed in the WC model are therefore time-dependent and estimated from data of the source population. A simulation study showed that the WC model improved the accuracy of the regression parameters estimates of time-dependent exposure variables as compared with standard logistic regression with fixed-in-time covariates . However, the average robust sandwich variance estimates based on dfbetas residuals  were systematically lower than the empirical variance of the parameter estimates, which resulted in too narrow confidence intervals (CI) and low coverage probabilities .
There is an extensive statistical literature on the weighted analyses of cohort sampling designs (see among many others [6–10]). A population-based case-control study can be seen as a nested case-control study within the source population of cases and controls, and can therefore fit in this general cohort sampling design framework. Population-based case-control studies can also be seen as a survey with complex selection probabilities [11–14] and this is the general framework that we use in this paper. Specifically, we consider the superpopulation approach developed by Lin  who proposed a variance estimator that accounts for the extra variation due to sampling the finite survey population from an infinite superpopulation. As a result, the Lin variance estimator accounts for the random variation from one survey sample to another and from one survey population to another, as opposed to the robust sandwich variance estimator that accounts only for the random variation from one survey sample to another. In the context of population-based case-control study, the case-control sample could be considered as the survey sample, the source population as the finite survey population, and the population under study as the infinite superpopulation.
The asymptotic properties of the Lin variance estimator have been investigated and a small simulation study has been conducted to investigate these properties in finite samples . The results indicated that the superpopulation variance estimates were closer to the true variance than the robust sandwich variance estimates. However, the simulation study considered only fixed-in-time covariates and simple selection probabilities that did not reflect the more complex sampling scheme of population-based case-control studies. It is therefore unclear how the superpopulation variance estimator would perform for the estimation of the effects of time-dependent covariates using the specific estimated time-dependent weights proposed in the WC model . In addition, for further applications to population-based case-control data, it would be important to clarify the performance of the WC model, as compared with standard logistic regression analyses, for estimating the effects of several correlated temporal patterns of protracted exposures. Indeed, the effects of temporal patterns of exposures such as intensity, duration, age at first exposure, and time since last exposure are often of great interest from an epidemiological point of view , but they need to be carefully adjusted for each other to avoid residual confounding . Such adjustment induces correlation between covariates and it is important to investigate how it affects the proposed estimators.
The first objective of the present study is to investigate through extensive simulations the accuracy of the Lin variance estimator for estimating the effects of time-varying covariates in case-control data, using the weights proposed in the WC model . The second objective is to compare the estimates from the WC model and standard logistic regression for estimating the effects of selected correlated temporal aspects of exposure with detailed information on exposure history. The next section introduces the WC model and the robust and Lin’s variance estimators. The different approaches are then compared through simulations and using data from a large population-based case-control study on occupational exposure to asbestos and pleural mesothelioma (PM).
The regression model and the variance estimators
The WC model
The Cox proportional hazards model specifies the hazard function as
where λ 0 is the baseline hazard, x(t) is the vector of observed covariate values at time t and β is the vector of unknown regression parameters. In the context of a population-based survey with complex sampling design , the estimator of β is the solution of the pseudo-maximum likelihood equation
where n is the sample size, ω i is the sampling weight for subject i, δ i = 1 if subject i is the case diagnosed at age t i and 0 otherwise, and
with Y j (t) = 1 if the subject j is at risk at time t (i.e. t j ≥ t), 0 otherwise.
In the WC model proposed for case-control data , t is age and the sampling weight ω of each subject depends on age and on his case-control status. Specifically, the weight for each subject i at age t is given by
where π(t) is the probability to develop the disease at age t or at a later age in the source population, n cases (t) is the number of cases diagnosed at age t or at a later age in the case-control study, and n controls (t) is the number of controls selected at age t or at a later age in the case-control study as well. If the WC model is used to analyze data from a nested case-control study, the age conditional probabilities π(t) in Equation (2) can directly be estimated from the full enumerated cohort. Left-truncation at age at entry into the cohort should be performed to account for delayed entry . If the WC model is used to analyze population-based case-control data, π(t) can be estimated from health statistics on the population under study, as shown in our application on PM in the section following simulations. The weights equal 1 for cases because all the eligible cases of the source population (or in the cohort) are usually included in the case-control study. If the sampling probabilities of cases do not equal 1, then weights in Equation (2) should be adjusted accordingly.
The weights defined in Equation (2) can be implemented in any statistical software that handles time-dependent weights in the Cox model, such as the coxph function in R or the SAS PROC PHREG function.
The variance estimators
where is the observed information matrix obtained by evaluation of this expression , a ⊗ 2 = aa′, and
The robust variance estimator in Equation (3) can be rewritten as =D’D where D is the dfbetas residuals  vector from the Cox model including the weights ω that can depend on time as those defined in Equation (2), as suggested in Barlow . As indicated by Therneau and Li  and by Barlow et al. , the robust sandwich variance estimate from Equation (3) can directly be obtained with R using the commands
M1 < −coxph(Surv(start,stop,event) ~ x + cluster(id), weights = weight)
V1 < − M1$var
with the vector of weights derived from Equation (2) for the WC model.
The robust variance estimator accounts for the variability due to sampling the case-control sample from the source population. To account for the extra variability due to sampling the source population from the (infinite) superpopulation, we propose to use the Lin variance estimator  that turned out to consist in adding the naïve variance estimator to the robust variance estimator . The Lin variance superpopulation estimator is thus given by
With R, the superpopulation variance estimate from Equation (5) can simply be obtained using the command
V2 <- V1 + M1$naive.var
All along this paper, the WC model using the robust variance estimator in Equation (3) will be denoted by WC1, while the WC model using the Lin’s superpopulation variance estimator in Equation (5) will be denoted by WC2. While WC1 and WC2 models give identical estimated exposure effects, they yield different standard errors and thus CI.
Overview of the simulation design
The main objective of the simulation study was to evaluate the performance of Lin’s superpopulation variance estimator in Equation (5) with the time-dependent weights defined in Equation (2), for the estimation of the effects of time-varying exposures in case-control studies. In particular, we compared the coverage probability of the 95% CI resulting from the WC2 model, as compared to the WC1 model and standard logistic regression. We were specifically interested in the effects of exposure intensity, duration, age at first exposure and time since last exposure. These inter-related aspects of exposure are of interest in many epidemiological applications but induce some statistical analytical issues because of correlation and time-dependency.
We generated 1000 source populations of 1000 or 5000 individuals each, and within each source population, we simulated a case-control study. The age at event for each subject in each source population was generated from a standard Cox model with time-dependent covariates, using a permutation algorithm described elsewhere and assuming Weibull marginal distribution of age at event [4, 22, 23]. Three Cox models of interest from an epidemiological point of view were simulated. Model 1 included intensity and duration of exposure only. Model 2 included age at first exposure in addition to intensity and duration. Model 3 was similar to Model 2 but used time since last exposure instead of age at first exposure.
The distribution of the exposures variables were chosen to be close to the observed distributions of occupational asbestos exposure variables in our case-control data on PM  described in the application section. Specifically, the ages at first and at last exposure were generated for all subjects from lognormal distributions. The exposure intensity at each age was generated from a linear function of age. Parameters for the random intercept and slope were chosen such that either 85% of subjects had a constant intensity, 6% a highly increasing, 6% a moderately decreasing, and 3% a moderately increasing intensity over lifetime (Scenario A); or 50% a highly increasing and 50% a moderately decreasing intensity over lifetime (Scenario B). Scenario A reflects our real case-control data on occupational exposure to asbestos. The exposure intensity at each age was represented in all our models by a variable that equaled the cumulated value of intensity at that age divided by the total duration of exposure at that age. This exposure intensity variable is equivalent to the mean index of exposure (MIE) variable introduced in the application section. The exposure intensity, as well as duration and time since last exposure, were time-dependent in all our true Models 1–3. The true effects β of each exposure variable in Models 1–3 were fixed to values that ranged from weak to strong effects: 0.41 to 1.39 for intensity, 0.01 to 0.05 for duration, −0.01 to −0.11 for age at first exposure, and 0.01 to 0.04 for time since last exposure. These beta correspond to hazard ratios of 1.5 to 4.0 for one standard deviation (i.e. 1.0 fiber/ml) increase in exposure intensity, hazard ratios of 1.2 to 2.0 for one standard deviation increase (i.e. 14 years) in duration of exposure, hazard ratios of 0.9 to 0.4 for one standard deviation (i.e. 8 years) increase in age at first exposure, and hazard ratios of 1.2 to 1.8 for one standard deviation (i.e. 14 years) increase in time since cessation of exposure.
Censoring for age at event in the source population was independently generated from a uniform distribution such that the event rate was about 10% in each source population of 1000 subjects, and 2% in each source population of 5 000 subjects. Each subject of the source population who had the event of interest was selected as a case in the case-control dataset. The event rates in the source population thus implied that we had about 100 cases in each case-control data set. For each case, 1, 2, or 4 controls were randomly selected with replacement among subjects at risk at the case’s event age, which corresponds to 1:1, 1:2, or 1:4 individual matching on age, respectively. On average, each case-control dataset was therefore made of about 100 cases and 100, 200, or 400 controls.
Analytical methods used to analyze the simulated data
Each case-control sample was analyzed using four regression models (WC1 and WC2 models and two standard logistic regression models) that were correctly specified in terms of the exposure variables included. In the WC1 and WC2 models, the exposure variables were time-dependent, and the probability π(t) was the proportion of subjects in the source population who had an event at age t or at a later age among those at risk at age t. We assumed that all subjects of the population source were followed-up since birth, implying that age at event did not have to be left-truncated in WC1 and WC2. For comparison purpose, conditional logistic regression (CLR) was used as the standard analytical method for individually matched case-control studies. Unconditional logistic regression (ULR) including age as a continuous covariate in addition to the exposure variables, was also used as the standard alternative analytical approach. For both ULR and CLR, the time-dependent covariates were fixed at their observed value at the age at event for cases or selection for controls. Because controls were selected among subjects at risk at the ages where each case occurs, all the exponential of the regression parameter estimates can be interpreted as the source population rate ratio estimates .
Statistical criteria used to compare the performance of the different estimators
For each of the four regression models WC1, WC2, CLR, and ULR, we calculated the relative bias of the regression parameter estimator associated with each exposure variable, as compared with the true effect β of that exposure variable, , where is the parameter estimate of the model based on the i th case-control dataset (i = 1, …, 1 000). To evaluate whether the relative bias was not partly due to a bias generated in the population source data, we also derived the relative bias as compared with the estimated effect of the well specified time-dependent Cox model using the full population source data, . We also derived the root mean squared error (RMSE) , where is the mean of the 1 000 parameter estimates . The empirical relative efficiency of each regression parameter estimator was computed as the ratio of the empirical variance of the Cox model using the full source population data, , to the empirical variance of the parameter estimates . The average of the 1000 standard errors (ASE) was compared to the empirical standard deviation of the 1000 estimates (SDE). We also calculated the coverage probability as the proportion of samples for which the 95% CI of β, , included the true value β.
Table 1 shows the results of the four analytical methods (WC1, WC2, CLR, ULR) for strong effects of exposure intensity and duration in Model 1. Table 2 shows the results for strong effects of i) intensity, duration, and age at first exposure in Model 2, and ii) intensity, duration, and time since cessation in Model 3. The results tended to be similar for weaker effects.
As suggested by the ratio ASE/SDE, the superpopulation variance estimator (WC2) tended to give estimates that were closer to the true variance than the robust variance estimator (WC1) that systematically under-estimated the true variance. Despite the superpopulation variance estimator tended to overestimate the true variance for the effect of exposure intensity when the population was made of 1000 subjects only (Tables 1 and 2), the coverage rates from WC2 were systematically much closer to the nominal level of 95% than those from the WC1 model. For each scenario of intensity pattern (Scenario A or B), the ratio ASE/SDE and the coverage rate for the effects of intensity and duration were similar in Models 2–3 as compared with Model 1 (Table 2 versus Table 1), suggesting that additional adjustment for correlated covariates does not affect the performance of the different variance estimators.
While the relative biases from all analytical models (WC, ULR and CLR) tended to be low and of the same magnitude in all scenarios, the relative efficiency as compared to the Cox model estimated on the full population source, as well as the accuracy in terms of RMSE, tended to be different. Indeed, in all scenarios with 1:1 case:control ratio within population source of 1000 subjects, the regression coefficient estimator from the WC models was much more efficient and thus also more accurate than that from CLR and ULR (Tables 1 and 2). As expected, the relative efficiency from all models estimated using 100 cases and 100 controls, as compared to the Cox model estimated on the full population source, decreased when the population size increases. For example, the relative efficiency of the WC for intensity with pattern B decreased from 0.59 to 0.20 when the population size increased from 1000 to 5000 subjects (Table 1). As expected as well, increasing the number of controls from 100 to 200 or 400 for a given population size (5000 in Table 1) strongly increased the relative efficiency of ULR and CLR but only moderately increased the relative efficiency of the WC models. For example, the relative efficiency for intensity with pattern B increased from 0.10 to 0.36 for CLR while only from 0.20 to 0.37 for the WC model (Table 1). Because the WC model used controls at different ages for which they were selected in the 1:1 case-control scenario, using additional controls in the 1:2 or 1:4 case:control ratio scenarios added relatively less information in this model than in ULR and CLR. As a result, ULR and CLR became more accurate in terms of RMSE than the WC models when four controls were selected for each case.
Interestingly, CLR did not perform better in terms of both bias and RMSE than ULR, despite individual matching of cases and controls. ULR was actually systematically more efficient than CLR. This result may be consistent with our previous results where we found that CLR might have difficulty in separating the effects of correlated time-dependent variables . Indeed, the correlation between each pair of the four exposure variables (intensity, duration, age at first exposure and time since last exposure) as well as with age at the index date, ranged between −0.679 and +0.453. The correlation also affected both the WC and ULR parameter estimators as suggested by the slightly higher RMSE in Models 2 and 3 (Table 2) as compared with Model 1 (Table 1) for the effects of intensity and duration, but it affected them less than the CLR estimator.
Application to occupational exposure to asbestos and pleural mesothelioma
Mesothelioma is a rare tumor mostly located in the pleura and usually caused by exposure to asbestos. The role of the different temporal patterns of occupational exposure to this substance has still to be explored using appropriate statistical methods accounting for individual changes over time in the exposure intensity . It is therefore of interest to apply the proposed estimators to estimate the mutually adjusted effects of exposure intensity, duration, age at first exposure, and time since last exposure, and to compare the results to those from standard logistic regression analyses that do not dynamically account for within subjects changes over time of exposure intensity.
The data came from a large French population-based case-control study described in Lacourt et al. . Cases were selected from a French case-control study conducted in 1987–1993 and the French National Mesothelioma Surveillance Program in 1998–2006. Population controls were frequency matched to cases by sex and year of birth within 5 years group. Occupational asbestos exposure was evaluated for each subject with a job-exposure matrix (JEM) which allowed us to derive the mean index of exposure (MIE) that was used in the regression models to represent intensity of exposure, as in Lacourt et al. . The MIE at age t was given by
where L is the total number of jobs exposed to asbestos till age t; d l the duration (in years) of job l, p l the probability of asbestos exposure for job l, f sl and i sl the frequency and intensity of asbestos exposure due to specific task of job l, respectively, f al and i al the frequency and intensity of asbestos exposure due to environment work contamination of job l, respectively. For each job, the probability was derived from the percent of workers exposed in the considered job code, the frequency from the percent of work time, and the intensity from the concentration of asbestos fibers in the air expressed as fibers per milliliter (f/ml). See Lacourt et al.  for more details. An ever exposed subject to asbestos was a subject who had at least one job with a probability p l different from zero.
Because our objective was to accurately investigate the effects of the quantitative time-related aspects of occupational exposure, all our analyses were restricted to subjects ever exposed to asbestos (68.9% in males and 20.9% in females). In addition, because the sample size for females was too small to ensure adequate statistical power and accurate estimates in separate multiple regression analyses of this group , the analyses were restricted to males ever exposed to asbestos, i.e. to 1041 male cases and 1425 male controls. The distribution of age and the asbestos exposure characteristics at the time of diagnosis for cases and interview for controls are shown in Table 3. The distribution of the patterns of intensity over lifetime was similar to the one described in scenario A of the simulation, with 85% of subjects with almost constant asbestos exposure intensity over lifetime.
Analytical methods used to analyze the case-control data on pleural mesothelioma
To derive the weights proposed in the WC models (Equation 2), we first estimated the age-conditional probabilities π(t) of developing PM in the French male general population. These estimated probabilities were derived from published estimated sex- and age-specific incidence rates of PM per 100000 person-years in France in 2005 . We assumed that these estimated incidence rates applied to our source population and that they were appropriate during the whole life of our subjects. The results are shown in Table 4 for males. As in the simulation study, standard errors for the WC model were then derived using the two variance estimators and , resulting in the WC1 and WC2 models, respectively.
For comparison purpose, the data were further analyzed with ULR which is the standard method to analyze frequency matched case-control data, as well as with CLR. Age was the time axis in WC1 and WC2 models, and a continuous covariate in ULR and CLR. We did not perform left-truncation in WC1 and WC2 models thus assuming that all subjects of the population source were passively followed-up for PM since birth. The matching factor, birth year, was a quantitative covariate in WC1, WC2, and ULR, and was the stratification variable (in 5 years groups) in CLR. Using each of the four approaches (WC1, WC2, CLR and, ULR), we estimated the effects of intensity and duration of occupational asbestos exposure, the age at first exposure, and time since last exposure, using the same combination of quantitative exposure variables as in Models 1–3 of the simulation study. All the effects of these variables were therefore assumed to be linear. Despite our recent results that suggested that these effects were not linear on the logit of PM , we used quantitative variables in order to facilitate the comparison of the estimates from the four different analytical approaches. The resulting estimates should therefore be used only for methodological comparison purpose and not as substantive epidemiological results. As in the simulation study, all the exposure variables were time-dependent in WC1 and WC2 models, and fixed at their value at the age at diagnosis or interview for ULR and CLR.
Table 5 shows the estimated effects of the selected quantitative asbestos exposure variables on the risk of PM, using the four analytical approaches (WC1, WC2, CLR, and ULR) and Models 1–3. The estimated effects are shown in terms of , i.e. estimated hazard ratios for WC1 and WC2 and estimated odds ratios for ULR and CLR. These estimated effects were calculated for an increase of about one standard deviation of the exposure variable, i.e. 1 fiber/ml for asbestos exposure intensity, 14 years for duration, 8 years for age at first exposure, and 14 years for time since last exposure.
As expected, the associations between all asbestos exposure variables and PM were significant with each of the four analytical approaches (Table 5). Specifically, increasing intensity or duration increased significantly the risk of PM, when adjusted or not on either age at first exposure or time since last exposure. Because the relative variation in the estimated effects of duration between Model 3 and Model 1 was higher than between Model 2 and Model 1, time since last exposure (in Model 3) seems to be a more important confounder than age at first exposure (in Model 2) in the relation between duration and PM. Estimates from Model 2 suggest that the later a subject is first occupationally exposed to asbestos, the smaller his risk of PM is. All the estimated effects of time since cessation indicate that risk continues to increase after the cessation of exposure, as in many other studies [15, 26, 27].
The 95% CI from WC1 and WC2 were almost identical (Table 5), suggesting that the robust variance estimates from WC1 was very close to the superpopulation variance estimates from WC2. This is likely due to the fact that the disease (PM) was very rare as shown in Table 3, as opposed to our simulation study where the overall event rates were about 10% and 2%.
The strongest contrasts between the estimates from the WC models and ULR or CLR were for the effect of exposure intensity. Indeed, the estimated effect of intensity was systematically weaker with the WC models than with ULR or CLR, with even non overlapping 95% CI. Note that, as for Scenario A in our simulation study, CLR provided the strongest estimates for the strong effect of intensity. By contrast, for the effects of duration, age at initiation, and time since last exposure, the strongest estimates were provided by the WC models, but the discrepancies with ULR and CLR were weaker than for intensity.
There are different potential explanations for the discrepancies between the results from the Cox (WC1 and WC2) and logistic (CLR and ULR) models. First the adjustment for age was largely different in the two series of models. While age was the time axis in the Cox models, and was therefore adequately adjusted for in both WC1 and WC2, it was included as a continuous covariate in both logistic models. This assumed that its effect was linear on the logit, which is actually not true . Thus there may be some residual confounding by age in both CLR and ULR. Second, because controls of the case-control study on PM were selected from members of the general French population at calendar times that can possibly differ from the period of case’s recruitment, the case-control odds ratio estimate from ULR and CLR may estimate a different quantity than the hazard ratio estimate from the Cox model. Indeed, the hazard function in the Cox models provides a dynamic description of how the instantaneous risk of getting PM varies over the age. The exponential of regression parameter can be interpreted as a hazard ratio, which is equivalent to the rate ratio that would be obtained from a cohort design. If the controls of the case-control study on PM were randomly selected from the member of the population who were at risk at each age a case occurs (as in our simulation study), then the estimated odds ratio that would be obtained from ULR and CLR could also be interpreted as a rate ratio that would be obtained from a cohort design. However, this was not the way controls were selected in the case-control study on PM, and it is therefore difficult to directly compare odds ratio estimates obtained from ULR and CLR, and hazard ratio estimates obtained from WC1 and WC2.
Our simulation results suggest that the superpopulation variance estimator  provides adequate coverage probabilities of the CI when using the time-dependent weights proposed in the WC model to estimate the effect of time-varying exposures in case-control studies. Indeed, our simulation results shows much better coverage probabilities of the CI resulting from the superpopulation estimator than those resulting from the robust variance estimator. However, our application to PM suggests that the two variance estimators give similar 95% CI when the disease is very rare. This is consistent with the results of Lin  who showed that the use of finite-population variance estimator (i.e. robust variance) results in reasonable coverage probabilities if the inclusion probabilities are low, but poor coverage probabilities if the inclusion probabilities are high. It should be noted that both robust and superpopulation variance estimators are easy to implement using most statistical softwares.
Our simulation results also confirmed that the WC model is an alternative method for estimating the effects of time-varying exposure variables in case-control studies. In particular, when compared to standard logistic regression that did not dynamically account for the different values of covariates over lifetime, the WC model tended to provide more accurate estimates of the effects of variables for which an important percentage of subjects had time-varying values over lifetime, such as intensity. However, the superiority of the WC did not persist when more than one control were selected from the risk set. Our results also suggest that the estimates from the WC model are not more affected by correlations between time-dependent covariates included in the model than logistic regression with fixed-in-time covariates. Note that the modelling of the exposure in the WC model could further be improved by incorporating some more complex function of the trajectory of the exposure over time that have recently been proposed [28–30].
The application of the WC model requires estimating the age-conditional probabilities in the source population for population-based case-control studies, or in the full cohort for nested case-control studies. In our application to population-based case-control data on PM, these probabilities were estimated from health statistics on the general French male population. Yet, our analyses were restricted to ever exposed males only who have much higher probability to develop PM than the general French male population. Further studies are needed to investigate the impact of biased estimates of the age-conditional probabilities on the WC estimates. Accounting for uncertainty in the weight estimates could further improve the variance estimator . In addition, controls in our case-control data set on PM were frequency matched to cases on birth year. To account for this stratification variable in the design, we included it as a covariate in the WC models. However, it would be interesting to consider accounting for this frequency matching variable in the weights of the WC models , and to investigate the performance of the resulting estimators through simulation of frequency matched case-control data. This would be all the more important that frequency matching is largely used in population-based case-control studies. It should also be mentioned that depending on the controls selection strategy, hazard ratio estimates from the WC model may not measure the same quantity as odds ratio estimates from the logistic regression. While the hazard ratio from the WC model estimates a rate ratio, the odds ratio may estimate another quantity depending on the control selection strategy .
The WC model with time-dependent variables requires also information on the values of the covariates at each event time, so at each age of diagnosis in cases. Such information may be missing, and different approaches could be considered to impute these values. However, further studies are needed to assess the impact of measurement errors of the time-dependent covariate values. Indeed, missmodeling the covariates has already been shown to induce bias in sandwich variance estimator based on dfbetas of unweighted Cox model for nested case-control analysis . A variance estimator based on Schoenfeld residuals provided better variance estimates for severe model misspecification . It may be of interest to further investigate such an estimator for misspecified time-dependent covariates in the WC model. Some further joint modelling between the WC model and the time-dependent covariate process could also be investigated as an alternative, especially for internal time-dependent exposure variables . However, in most case-control studies on occupational exposures, the occupational history is sufficiently well investigated to allow the elaboration of quite accurate time-dependent covariates, as in our application on asbestos and PM.
We believe that the WC model using the superpopulation variance estimator may provide a potential alternative analytical method for case-control analyses with detailed information on the history of the exposure of interest, especially if a large part of the subjects has a time-varying exposure intensity over lifetime, and if only one control is available for each case.
Average standard errors
Conditional logistic regression
Mean index of exposure
Root mean squared error
Standard deviation of the estimates
Unconditional logistic regression
Weighted Cox model.
Teschke K, Olshan AF, Daniels JL, De Roos AJ, Parks CG, Schulz M, Vaughan TL: Occupational exposure assessment in case-control studies: opportunities for improvement. Occup Environ Med. 2002, 59 (9): 575-593. 10.1136/oem.59.9.575. discussion 594
McGuire V, Nelson LM, Koepsell TD, Checkoway H, Longstreth WT: Assessment of occupational exposures in community-based case-control studies. Annu Rev Public Health. 1998, 19: 35-53. 10.1146/annurev.publhealth.19.1.35.
Bouyer J, Hemon D: Retrospective evaluation of occupational exposures in population-based case-control studies: general overview with special attention to job exposure matrices. Int J Epidemiol. 1993, 22 (Suppl 2): S57-S64.
Leffondre K, Wynant W, Cao Z, Abrahamowicz M, Heinze G, Siemiatycki J: A weighted Cox model for modelling time-dependent exposures in the analysis of case-control studies. Stat Med. 2010, 29 (7–8): 839-850.
Binder DA: Fitting Cox's proportional hazards models from survey data. Biometrika. 1992, 79 (1): 139-147. 10.1093/biomet/79.1.139.
Borgan O, Goldstein L, Langholz B: Methods for the analysis of sampled cohort data in the Cox proportional hazards model with risk set sampling. Ann Stat. 1995, 23 (5): 1749-1778. 10.1214/aos/1176324322.
Chen CY: Statistical estimation in the proportional hazards model with risk set sampling. Ann Stat. 2004, 32 (4): 1513-1532. 10.1214/009053604000000517.
Gray RJ: Weighted analyses for cohort sampling designs. Lifetime Data Anal. 2009, 15 (1): 24-40. 10.1007/s10985-008-9095-z.
Langholz B: Use of cohort information in the design and analysis of case-control studies. Scan J Stat. 2007, 34: 120-136. 10.1111/j.1467-9469.2006.00548.x.
Samuelsen SO, Anestad H, Skrondal A: Stratified case-cohort analysis of general cohort sampling designs. Scan J Stat. 2007, 34: 103-119. 10.1111/j.1467-9469.2006.00552.x.
Jiang J, Scott GM, Wild C: Adjusting for non-responses in population-based Case-control studies. Int Stat Rev. 2011, 79 (2): 145-159. 10.1111/j.1751-5823.2011.00132.x.
Li Y, Graubard BI, DiGaetano R: Weighting methods for population-based case-control studies with complex sampling. J R Stat Soc Ser C Appl Stat. 2011, 60 (2): 165-185. 10.1111/j.1467-9876.2010.00731.x.
Lin DY: On fitting Cox's proportional hazards models to survey data. Biometrika. 2000, 87: 37-47. 10.1093/biomet/87.1.37.
Scott A, Wild C: On the robustness of weighted methods for fitting models to case-control data. J R Stat Soc Ser B Stat Methodol. 1997, 64 (2): 207-219.
Lacourt A, Leffondré K, Gramond C, Ducamp S, Rolland P, Gilg Soit Ilg A, Houot M, Imbernon E, Févotte J, Goldberg M, et al: Temporal patterns of occupational asbestos exposure and risk of pleural mesothelioma. Eur Respir J. 2012, 39 (6): 1304-1312. 10.1183/09031936.00005111.
Leffondre K, Abrahamowicz M, Siemiatycki J, Rachet B: Modeling smoking history: A comparison of different approaches. Am J Epidemiol. 2002, 156: 813-823. 10.1093/aje/kwf122.
Thiébaut ACM, Bénichou J: Choice of time-scale in Cox's model analysis of epidemiologic cohort data: a simulation study. Stat Med. 2004, 23 (24): 3803-3820. 10.1002/sim.2098.
Belsey DA, Kuh E, Welsch RE: Regression diagnostics: identifying influential data and sources of collinearity. 1980, New York: John Wiley
Barlow WE: Robust variance estimation for the case-cohort design. Biometrics. 1994, 50: 1064-1072. 10.2307/2533444.
Therneau TM, Li H: Computing the Cox model for case cohort designs. Lifetime Data Anal. 1999, 5: 99-112. 10.1023/A:1009691327335.
Barlow WE, Ichikawa L, Rosner D, Izumi S: Analysis of case-cohort designs. J Clin Epidemiol. 1999, 52: 1165-1172. 10.1016/S0895-4356(99)00102-X.
Sylvestre MP, Abrahamowicz M: Comparison of algorithms to generate event times conditional on time-dependent covariates. Stat Med. 2008, 27 (14): 2618-2634. 10.1002/sim.3092.
Leffondre K, Abrahamowicz M, Siemiatycki J: Evaluation of Cox's model and logistic regression for matched case-control data with time-dependent covariates: A simulation study. Stat Med. 2003, 22 (24): 3781-3794. 10.1002/sim.1674.
Rothman KJ, Greenland S, Lash TL: Modern Epidemiology. 2008, Philadelphia, PA: Lippincott Williams & Wilkins, 3
Gilg Soit Ilg A, Chamming's S, Rolland P, Ducamp S, Brochard P, Galateau-Sallé F, Pairon J, Astoul P, De Quillacq A, Frenay C, et al: Programme national de surveillance du mésothéliome (PSNM): principaux résultats, France, 1998–2004. Bull Epidemiol Hebd. 2007, 41–42: 350-354.
Pesch B, Taeger D, Johnen G, Gross IM, Weber DG, Gube M, Muller-Lux A, Heinze E, Wiethege T, Neumann V, et al: Cancer mortality in a surveillance cohort of German males formerly exposed to asbestos. Int J Hyg Environ Health. 2009, 213 (1): 44-51.
Pira E, Pelucchi C, Piolatto PG, Negri E, Discalzi G, La Vecchia C: First and subsequent asbestos exposures in relation to mesothelioma and lung cancer mortality. Br J Cancer. 2007, 97 (9): 1300-1304. 10.1038/sj.bjc.6603998.
Richardson DB, MacLehose RF, Langholz B, Cole SR: Hierarchical latency models for dose-time-response associations. Am J Epidemiol. 2011, 173 (6): 695-702. 10.1093/aje/kwq387.
Sylvestre MP, Abrahamowicz M: Flexible modeling of the cumulative effects of time-dependent exposures on the hazard. Stat Med. 2009, 28 (27): 3437-3453. 10.1002/sim.3701.
Bhadra D, Daniels MJ, Kim S, Ghosh M, Mukherjee B: A Bayesian Semiparametric Approach for Incorporating Longitudinal Information on Exposure History for Inference in case-control Studies. Biometrics. 2012, 68 (2): 361-370. 10.1111/j.1541-0420.2011.01686.x.
Pan Q, Schaubel DE: Proportional hazards models based on biased samples and estimated selection probabilities. Can J Stat. 2008, 36 (1): 111-127. 10.1002/cjs.5550360111.
Xiang A, Langholz B: Robust variance estimation for rate ratio parameter estimates from individually matched case-control data. Biometrika. 2003, 90 (3): 741-746. 10.1093/biomet/90.3.741.
Rizopoulos D: JM: An R package for the joint modelling of longitudinal and time-to-event data. J Stat Softw. 2010, 35 (9): 1-33.
The pre-publication history for this paper can be accessed here:http://www.biomedcentral.com/1471-2288/13/18/prepub
This research was mostly supported by grants from the CHUM research center awarded to Dr. Karen Leffondre. Dr. Aude Lacourt was supported by a postdoctoral fellowship from the Fondation de France. The collection of data was partly supported by the French Institute for Public Health Surveillance.
The authors declare that they have no competing interests.
HG has drafted the manuscript, programmed and run the simulation study, analyzed the case-control data on mesothelioma, and has contributed to the interpretation of all the results. AL has provided the case-control data on mesothelioma and has revised the manuscript. KL has drafted and revised the manuscript, has designed the simulation study, and supervised HG in all stages. All authors read and approved the final manuscript.
About this article
Cite this article
Gauvin, H., Lacourt, A. & Leffondré, K. On the proportional hazards model for occupational and environmental case-control analyses. BMC Med Res Methodol 13, 18 (2013). https://doi.org/10.1186/1471-2288-13-18
- Case-control study
- Cox model
- Logistic regression
- Time-dependent variables
- Variance estimator
- Occupational exposures
- Environmental exposures