The data sets
Unit record data on esophageal cancer cases and their outcomes were extracted from the Surveillance, Epidemiology, and End Results Program (SEER) cancer registry [11]. The SEER system is administered by the National Cancer Institute. SEER currently compiles data from cancer registries covering about 28% of the US population across 13 States. Most cancers, including esophageal cancers, are recorded. De-identified unit record data made available for research include demographic measures, medical details of the cancer, treatment and outcomes (including survival time). Of the esophageal cancer cases, 95.1% had positive histology and just 0.4% had a clinical diagnosis only; the remainder had unknown (2.4%) or other confirmation methods.
Data on health behavior were extracted from the Behavioral Risk Factor Surveillance System (BRFSS) health survey [12]. The BRFSS is an annual national survey of health. It commenced in 1984 and now collects data from more than 400,000 telephone interviews each year, covering adult residents of all US States and three Territories. The de-identified unit record information made available for research includes demographic and health behavior measures, and State population sampling weights.
Both collections provided access to cleaned, de-identified unit record data at no cost to the researcher. Both data collections are large, but their expected overlap is small: with less than 0.2% of American adults participating in BRFSS and around 4000 esophageal cancer cases being recorded in the SEER data set each year, we could expect only about eight new esophageal cancer cases each year to have participated in the previous BRFSS survey.
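The expected overlap quoted above follows from a simple product. A minimal sketch of the arithmetic, assuming (as a simplification) that BRFSS participation is independent of a later cancer diagnosis; the variable names are illustrative only:

```python
# Rough expected number of new SEER esophageal cancer cases per year who
# also took part in the previous BRFSS survey.
# Assumption: survey participation is independent of later diagnosis.
brfss_participation = 0.002   # upper bound: "less than 0.2%" of adults
seer_cases_per_year = 4000    # approximate annual SEER esophageal cases

expected_overlap = brfss_participation * seer_cases_per_year
print(f"about {expected_overlap:.0f} cases per year")
```

Because 0.2% is an upper bound on participation, the figure of eight cases per year is itself an upper bound under this assumption.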
Inclusions and exclusions
This analysis focuses on the 15-year period from 2001 to 2015. Data prior to 2001 are excluded due to changes in the definitions of some health behavior variables and because earlier data may be less relevant to current behavior and outcomes. The most recent year of SEER cancer registry data was 2015.
As esophageal cancer is rare at young ages, all cancer cases who were less than 35 years of age are excluded as being atypical. This excludes 201 of 57,025 cases (0.4%). For the BRFSS health survey, all data records from respondents 25 or more years of age who lived in one of the 13 US States represented in the SEER cancer registries are included. Including the younger respondents allows information on health behavior up to 10 years prior to cancer diagnosis to be retained.
Outcome variable
The outcome of interest is post-diagnosis survival time in completed months, as recorded in the SEER cancer registry data set. That is, all cases who survived less than 30.4 days after diagnosis (including cancers detected post-mortem) have a survival time of 0 months, those who died between 30.4 and 60.8 days after diagnosis have a survival time of 1 month, and so on. The maximum possible survival time is 179 months. For those who are still alive and those who are lost to follow-up, survival time is censored at the date of last follow-up.
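The month coding described above can be sketched as follows. This is an illustration of the stated rule only, not the SEER implementation; the function name and the treatment of the boundary values (exactly 30.4 days counted as 1 month) are assumptions.

```python
import math

DAYS_PER_MONTH = 30.4  # month length implied by the coding rule in the text


def survival_months(days_survived: float) -> int:
    """Return completed months of survival, rounding down.

    Hypothetical helper: mirrors the rule that survival of less than
    30.4 days is coded as 0 months, 30.4 to 60.8 days as 1 month, etc.
    """
    if days_survived < 0:
        raise ValueError("days_survived must be non-negative")
    return math.floor(days_survived / DAYS_PER_MONTH)


for days in (0, 30, 45, 100):
    print(days, "days ->", survival_months(days), "months")
```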
Health behavior variables
The research focused mainly on measures relating to cigarette smoking, alcohol consumption and excess body weight. The choice of variables was restricted to measures available through the BRFSS health survey. The following variables, all recording self-reported behavior, were included:
Current smoker (yes/no), which includes those who smoke daily or less than daily;
Alcohol - heavy drinking (yes/no), which is defined as more than two standard drinks per day for men and more than one standard drink per day for women in the month prior to survey;
Alcohol - binge drinking (yes/no), which is defined as males reporting having five or more standard drinks, or females reporting having four or more standard drinks, on one occasion in the month prior to survey;
Current smoking and alcohol consumption (yes/no), which is defined as both being a current smoker and having an average consumption of ≥1 standard drink of alcohol per day in the past month;
Obese (yes/no), which is defined as BMI ≥ 30 kg/m²;
Undertook physical activity or exercise in the past 30 days other than regular job (yes/no).
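The definitions listed above can be expressed as simple indicator rules. The following is a minimal sketch under stated assumptions: the `Respondent` fields are hypothetical names chosen for illustration and are not BRFSS variable names, and the thresholds are taken directly from the list.

```python
from dataclasses import dataclass


@dataclass
class Respondent:
    # All field names are hypothetical, not BRFSS variable names.
    sex: str                       # "M" or "F"
    smokes_daily: bool
    smokes_less_than_daily: bool
    avg_drinks_per_day: float      # standard drinks, past month
    max_drinks_one_occasion: int   # standard drinks, past month
    bmi: float                     # kg/m^2
    exercised_past_30_days: bool   # other than regular job


def behavior_flags(r: Respondent) -> dict:
    """Derive the six yes/no health behavior indicators."""
    current_smoker = r.smokes_daily or r.smokes_less_than_daily
    heavy_limit = 2 if r.sex == "M" else 1   # "more than" this many per day
    binge_limit = 5 if r.sex == "M" else 4   # this many "or more" at once
    return {
        "current_smoker": current_smoker,
        "heavy_drinking": r.avg_drinks_per_day > heavy_limit,
        "binge_drinking": r.max_drinks_one_occasion >= binge_limit,
        "smoker_and_drinker": current_smoker and r.avg_drinks_per_day >= 1,
        "obese": r.bmi >= 30,
        "physically_active": r.exercised_past_30_days,
    }


example = Respondent("F", False, True, 1.5, 3, 31.2, True)
print(behavior_flags(example))
```

Note that heavy drinking uses a strict inequality while binge drinking and the combined smoking and alcohol measure use "or more" thresholds, as in the definitions.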
Demographic variables
As the cancer registry data did not include information on pre-diagnosis health behavior we estimated the probability of each pre-diagnosis health behavior for each cancer case using the available demographic variables.
Of the variables in common between the SEER cancer registry and the BRFSS health surveys, we hypothesized that year, age, sex, race, marital status and State of residence could be helpful for predicting health behavior. For example, race is known to be associated with smoking [13] and alcohol dependence [14] in the US. Also, living as married ameliorates social isolation, and social isolation is associated with adverse health behaviors such as smoking, higher BMI, and lower desire for exercise [15].
As age was recorded in 5-year age groups in the SEER cancer registry data, we applied the same categories to the BRFSS health survey data. Race was categorized as White; Black; Asian or Pacific Islander; and American Indian or Alaska Native. Participants in the BRFSS health survey who self-reported as mixed race (n = 44,670, 3.1% of total) were omitted as there was no corresponding code in the SEER cancer registry data set. Marital status was categorized as married or living as married; divorced or separated; widowed; and single.
Other factors considered
Post-diagnosis survival time is sensitive to a range of factors, some of which could potentially confound associations between pre-diagnosis health behavior and survival time. For example, the association between health behaviors and incidence of esophageal cancer is known to differ by histological type [3, 16] and these differences appear to carry over into survival time [17, 18]. Therefore, we have conducted sub-group analyses for squamous cell carcinoma (ESCC) and adenocarcinoma (EAC). In addition, age is associated with survival time [19] and health behavior can change with age. Age, recorded in 5-year age groups but treated as a continuous variable, is included in the final models as a potential confounder.
Somewhat more difficult was how to address cancer stage. Cancer stage at diagnosis is an important predictor of survival time [19] and could perhaps be associated with health behavior, although this association may be an intermediary step between health behavior and survival time rather than a true confounder. For completeness we opted to adjust for cancer stage in our models. Disease stage at diagnosis (clinical assessment) was coded by SEER according to the AJCC Cancer Staging Manual 6th Edition [20].
Recording of cancer stage at diagnosis was incomplete in the SEER cancer registry data: it was unavailable from 2001 to 2003 and was missing for 18% of cases across the other years. We excluded cancer stage prior to 2004 and, from 2004 onwards, categorized it into five categories (stage I, stage II, stage III, stage IV, not specified).
Other potential confounders of the association between behavior and survival were considered to be of lesser impact or potentially on the disease pathway. For example, while the relationship between smoking history and post-diagnosis survival may differ by gender, the effect may be small. In contrast, the choice between curative or palliative treatment is a strong predictor of survival time but may partially lie on the association pathway. (Smoking, for example, may lead to a higher probability of significant co-morbidities and these in turn influence the decision of curative treatment and, hence, survival time.) Adjustment for variables on the association pathway may remove some of the true association between health behavior and survival time.
Eligible data records
A total of 56,824 SEER esophageal cancer cases and 1,450,775 BRFSS health survey respondents met the eligibility criteria. Additional file 1 summarizes the characteristics of the two samples. Among the cancer cases, median time to death was 7 months and median follow-up time of censored observations (18.6%) was 30 months. Of the cases, 52.9% were EAC and 33.7% were ESCC. Of the BRFSS respondents, 16.1% were current smokers and 4.8% were judged to be heavy drinkers of alcohol. The BRFSS respondents included higher proportions of younger people and females than the SEER cases.
Statistical analysis
The characteristics of eligible cancer registry cases and health survey respondents are summarized using counts and percentages, with the exception of survival time which is summarized using medians, quartiles and maximums.
The main analysis involves three discrete steps. Firstly, the probability of engaging in each health behavior was estimated from the BRFSS health survey data using logistic models, with a separate model for each behavior. Each modelled the probability of having the behavior of interest based on year of survey, age, sex, race, marital status and State of residence. We also allowed for differences in the probability of health behaviors between sexes and between marital statuses at different ages by including age by sex, age by marital status and marital status by sex interaction terms in each logistic model.
For example, if we let i represent an eligible individual from the BRFSS data set and \( \hat{p_i(smoker)} \) represent the estimated probability that person i is a smoker, then the logistic model has the form
$$ \operatorname{logit}\left(\hat{p_i(smoker)}\right)={\boldsymbol{x}}_{\boldsymbol{i}}\hat{\boldsymbol{\beta}} \tag{1} $$
where
$$ {\boldsymbol{x}}_{\boldsymbol{i}}\hat{\boldsymbol{\beta}}=\hat{\beta_0}+\hat{\beta_1}\left({year}_i\right)+\hat{\beta_2}\left({age}_i\right)+\hat{\beta_3}\left({sex}_i\right)+\hat{\beta_{4-6}}\left({race}_i\right)+\hat{\beta_7}\left( marital\ {status}_i\right)+\hat{\beta_{8-19}}\left( State\ of\ {residence}_i\right)+\hat{\beta_{20}}\left({age}_i\right)\left({sex}_i\right)+\hat{\beta_{21}}\left({age}_i\right)\left( marital\ {status}_i\right)+\hat{\beta_{22}}\left({sex}_i\right)\left( marital\ {status}_i\right) $$
and the \( \hat{\beta} \) ’s quantify the relationships between the demographic characteristics of the respondents and their likelihood of smoking.
To correct for the complexities of the BRFSS health survey sampling and for non-response, we weighted the logistic models by the sampling weights provided. In 2011, the BRFSS introduced a new method of calculating sampling weights, which improved the weighting of some variables including race and marital status. However, as both systems weight to the State totals, we do not differentiate between the two types of weights in this analysis. We excluded data records with extreme sampling weights, that is, those which fell in either the top or bottom 0.5% of the distribution. To assist the models to converge, we used Firth’s bias-reduced penalized likelihood when fitting the models, using the logistf package (version 1.23) in R software (version 3.5.2). The fitted models are summarized in Additional file 5.
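The exclusion of extreme sampling weights can be illustrated with a short sketch. This is not the authors' code: the text does not state how the 0.5% cut-offs were computed, so the rank-based rule below, the function name and the handling of ties are all assumptions.

```python
def central_weight_indices(weights, tail=0.005):
    """Return indices of records whose weight is not in either extreme tail.

    Assumption: cut-offs are taken from the sorted weights, dropping
    int(n * tail) records from each end; tied weights at a cut-off are kept.
    """
    n = len(weights)
    k = int(n * tail)              # number of records to drop per tail
    if k == 0:
        return list(range(n))      # too few records to trim
    ordered = sorted(weights)
    low, high = ordered[k], ordered[n - k - 1]
    return [i for i, w in enumerate(weights) if low <= w <= high]


# Illustrative weights only: 1000 records with weights 1..1000.
weights = list(range(1, 1001))
kept = central_weight_indices(weights)
print(len(kept), "of", len(weights), "records retained")
```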
Year and age category were fitted as numeric variables, while sex, race, marital status and State of residence were fitted as categorical variables. Preliminary investigations (not reported) confirmed that a linear model was reasonable for both year and age category. For analysis, year was coded from 0 for 2001 through to 14 for 2015.
We confirmed that the chosen risk profiling variables were indeed predictors of each health behavior by visual inspection of odds ratios from logistic regression models. To help gauge the predictive ability of each demographic variable we present areas under the curve (AUC) of the receiver operating characteristic (ROC) curve for each predictor alone and for the full logistic model using the pROC package (version 1.13.0) in R software. The further the AUC lies above 0.5, the greater the ability of the model to predict the health behavior.
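For readers unfamiliar with the AUC, it can be read as the probability that a randomly chosen person with the behavior receives a higher predicted probability than a randomly chosen person without it, with ties counted as one half. The sketch below computes this empirical quantity directly; it is an unweighted illustration with made-up inputs, not the pROC implementation used in the analysis.

```python
def empirical_auc(scores, labels):
    """Empirical AUC via pairwise comparison (Mann-Whitney form).

    scores: predicted probabilities; labels: 1 = has behavior, 0 = does not.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    # Count pairs where the positive case is ranked higher; ties count half.
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Illustrative values only.
scores = [0.10, 0.40, 0.35, 0.80]
labels = [0, 0, 1, 1]
print(empirical_auc(scores, labels))
```

An AUC of 0.5 corresponds to a predictor that ranks the two groups no better than chance.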
In the second step of the analysis, for each esophageal cancer case in the SEER cancer registry, we estimated their probability of participating in each health behavior by substituting their demographic characteristics into the logistic predictive model for that behavior.
For example, if we let j represent an eligible cancer case from the SEER data set, \( {\boldsymbol{x}}_{\boldsymbol{j}} \) the set of observed values of the demographic variables for individual j, and \( \hat{\boldsymbol{\beta}} \) the regression coefficients for the model predicting smoking (eq. 1 above), then we estimated the probability of cancer case j being a smoker as
$$ \hat{p_j(smoker)}=\frac{e^{{\boldsymbol{x}}_{\boldsymbol{j}}\hat{\boldsymbol{\beta}}}}{1+{e}^{{\boldsymbol{x}}_{\boldsymbol{j}}\hat{\boldsymbol{\beta}}}} \tag{2} $$
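The substitution in Eq. (2) amounts to forming the linear predictor for a cancer case and applying the inverse logit. A minimal sketch, assuming a deliberately reduced model: the coefficient values below are hypothetical and the race, marital status, State and interaction terms of the full model are omitted for brevity.

```python
import math


def inverse_logit(eta: float) -> float:
    """Numerically stable form of e^eta / (1 + e^eta)."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)


def predicted_probability(x: dict, beta: dict) -> float:
    """Apply Eq. (2): inverse logit of the linear predictor x * beta."""
    eta = sum(beta[name] * x[name] for name in beta)
    return inverse_logit(eta)


# Hypothetical coefficients, for illustration only (not fitted values).
beta_hat = {"intercept": -1.0, "year": -0.02, "age_group": -0.05, "male": 0.3}
# Hypothetical cancer case: year coded 9 (2010), age group index 8, male.
x_j = {"intercept": 1, "year": 9, "age_group": 8, "male": 1}

print(round(predicted_probability(x_j, beta_hat), 3))
```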
As we were specifically interested in health behavior prior to diagnosis we trialed three pre-diagnosis time points: 1, 5 and 10 years prior to diagnosis. This entailed substituting diagnosis year minus 1, 5 or 10 as the year variable of the logistic model and the 5-year age group minus 0, 1 or 2 groups, respectively, as the age variable. To avoid extrapolating earlier than the observed data, the 5-year lag analysis was restricted to esophageal cancer cases from 2006 to 2015 and the 10-year lag model was restricted to cases from 2011 to 2015.
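The lagged substitution can be sketched as follows. The function and constant names are hypothetical, the age group is represented by an assumed integer index over the 5-year groups, and the year is coded as 0 for 2001 as in the logistic models. The text states restrictions only for the 5- and 10-year lags, so none is applied here for the 1-year lag.

```python
BASE_YEAR = 2001                                # year coded as 0
AGE_GROUP_SHIFT = {1: 0, 5: 1, 10: 2}           # 5-year groups to step back
FIRST_DIAGNOSIS_YEAR = {5: 2006, 10: 2011}      # restrictions stated in text


def lagged_covariates(diagnosis_year: int, age_group_index: int, lag: int):
    """Return (year_code, age_group_index) at `lag` years before diagnosis.

    Returns None when the case falls outside the stated restriction
    for that lag.
    """
    if lag not in AGE_GROUP_SHIFT:
        raise ValueError("lag must be 1, 5 or 10 years")
    if diagnosis_year < FIRST_DIAGNOSIS_YEAR.get(lag, BASE_YEAR):
        return None
    year_code = (diagnosis_year - lag) - BASE_YEAR
    return year_code, age_group_index - AGE_GROUP_SHIFT[lag]


print(lagged_covariates(2010, 8, 5))    # case diagnosed in 2010, 5-year lag
print(lagged_covariates(2005, 8, 5))    # excluded from the 5-year lag analysis
```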
In the third step of the analysis, the relationship between the estimated probability of each behavior and survival was investigated with Cox regression models, fitted using the survival package (version 2.43–3) in R software. Separate models were fitted for each behavior. Results are presented as hazard ratios (HRs) with associated 95% confidence intervals (CIs) and p-values. Models were fitted with and without adjustment for age and cancer stage at diagnosis.
For example, the Cox model of survival time of cancer case j relative to their estimated probability of smoking, adjusting for age and disease stage, could be written
$$ S\left(t,x,\beta \right)={\left[{S}_0(t)\right]}^{\exp \left({\beta}_1^{\ast}\left(\hat{p_j(smoker)}\right)+{\beta}_2^{\ast}\left({age}_j\right)+{\beta}_3^{\ast}\left( cancer\ {stage}_j\right)\right)} \tag{3} $$
where \( \hat{p_j(smoker)} \), a number between 0 and 1, is the estimated probability that the SEER cancer case is a smoker from Eq. (2). The * superscript simply highlights that these β ’s are different from the β ’s listed in Eq. (1). Under this model \( {e}^{\beta_1^{\ast }} \) is the hazard ratio for the estimated probability of smoking, adjusted for age and disease stage.
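Because the covariate is a probability, a one-unit change compares an estimated probability of 0 with an estimated probability of 1; the hazard ratio for a smaller difference d is \( e^{d{\beta}_1^{\ast }} \). The sketch below illustrates this scaling and the survival function in Eq. (3). The coefficient and baseline survival values are hypothetical and are not results from this study.

```python
import math


def hazard_ratio(beta: float, difference: float = 1.0) -> float:
    """Hazard ratio for a given difference in the covariate."""
    return math.exp(beta * difference)


def survival(baseline_survival: float, linear_predictor: float) -> float:
    """Eq. (3): S(t) = S0(t) ** exp(linear predictor)."""
    return baseline_survival ** math.exp(linear_predictor)


beta1_star = math.log(2.0)   # hypothetical coefficient: HR of 2 per unit
print(hazard_ratio(beta1_star))          # probability 0 versus 1
print(hazard_ratio(beta1_star, 0.1))     # difference of 0.1 in probability

# Hypothetical baseline survival of 0.5 at some time t, with the other
# covariates held at their reference values.
print(survival(0.5, beta1_star * 0.3))   # case with estimated probability 0.3
```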
Subgroup analyses were performed for ESCC and EAC histological types. Missing values were excluded from analysis.