Using Cox regression to analyze a binary outcome in a cross-sectional study was suggested by Lee & Chia [10] and assessed by others [4, 15]. Cox regression is usually applied to time-to-event data, that is, the response is the time an individual takes to present the outcome of interest. Individuals who never get ill are assigned the total length of the follow-up and are treated as censored, meaning that it is not known when they will get ill, only that they remained well at least until the end of the follow-up. Individuals lost to follow-up are treated in a similar way. Cox regression estimates the hazard function, which expresses how the hazard rate depends upon a set of covariates. The model formulation is
h(t) = h_0(t) exp(β_1 z_1 + ... + β_k z_k)   (1)
where h_0(t) is the baseline hazard function of time, z_i are the covariates, and β_i the coefficients for the k covariates. The Cox model treats h_0(t) as a nuisance function and does not actually estimate it [16].
When a constant risk period is assigned to everyone in the cohort, the hazard rate ratio estimated by Cox regression equals the cumulative incidence ratio in longitudinal studies, or the prevalence ratio in cross-sectional studies [17, 18]. Although this model can produce correct point estimates, the underlying distribution of the response is Poisson. As prevalence data in a cross-sectional study follow a binomial distribution, the variance of the coefficients tends to be overestimated, resulting in wider confidence intervals than those based on the binomial distribution. This is easily explained by comparing the binomial variance, p(1 − p), which reaches its maximum of 0.25 at p = 0.5, with the Poisson variance, λ, which grows steadily with the intensity of the process. That is, the variance estimated by the Poisson model will be very close to the binomial variance when the outcome is rare, but will be increasingly greater as the outcome becomes more frequent. This is a situation of underdispersion, the opposite of the more commonly observed overdispersion, where the data are more dispersed than the model predicts.
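To make the comparison concrete (an illustrative calculation, not part of the original text), consider a single binary observation with success probability p. The binomial variance is p(1 − p) while the Poisson variance is λ = p, so their ratio is

Var_Poisson / Var_binomial = p / [p(1 − p)] = 1 / (1 − p)

which is about 1.05 when p = 0.05 but reaches 2 when p = 0.5, showing how the Poisson-based confidence intervals widen as the outcome becomes more frequent.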
It is possible to improve the situation by using the robust variance estimate proposed by Lin & Wei [19], similar to other robust sandwich estimators proposed for parametric models, such as Huber's sandwich estimator [20]. In this paper, Cox regression with equal follow-up times was assessed with both standard and robust variance estimates.
Poisson regression is commonly used in epidemiology to analyze longitudinal studies in which the outcome is a count of episodes of an illness occurring over time (e.g. episodes of diarrhea). The model formulation is

log E(n) = log(t) + β_0 + β_1 X_1 + ... + β_k X_k   (2)

where n is the count of events for a given individual, t the length of time the individual was followed up, and X_i the covariates. The model parameters (β_i) are log relative risks. In this context, Poisson regression is equivalent to Cox regression [21], and the parameters estimated are the same.
As described for Cox regression, the prevalence ratio is directly estimated by the model, but the confidence intervals are wider than those provided by a binomial model. A simple remedy is to multiply the estimated Poisson variance by some estimate of underdispersion (or overdispersion). These estimates can be based on the deviance or the chi-square statistic of the model, divided by the residual degrees of freedom [22, 23]. In practice, this ratio is used as a scale parameter, replacing the original Poisson value of 1. A robust variance estimate is also available for the Poisson model, based on the Huber sandwich estimate [20] (which again yields results equal to those of Cox regression with robust variance). This alternative is known to underestimate the true variability with moderately sized samples, while adjusting the scale parameter tends to overestimate it. Other alternatives would be jackknife and bootstrap variance estimates [23]. We decided not to use the latter alternatives, as they are not directly available in standard statistical software. Thus, Poisson regression was used in this paper with unadjusted variances, with scale parameter adjustment based on both the deviance and the chi-square statistic, and with robust variance estimates.
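In symbols (a sketch of the standard quasi-likelihood adjustment; the notation is ours, not the paper's): with X² the Pearson chi-square statistic, D the deviance, and n − p the residual degrees of freedom, the scale parameter is estimated as

φ̂ = X² / (n − p)   or   φ̂ = D / (n − p)

and the variance of each coefficient is rescaled as Var_adj(β̂_i) = φ̂ · Var(β̂_i). Under the underdispersion described above, φ̂ < 1, so the rescaling narrows the Poisson-based confidence intervals.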
The last model assessed was the log-binomial model [15], a generalized linear model in which the link function is the logarithm of the proportion under study and the error distribution is binomial [4, 7, 11, 12, 15]. The measure of effect in this model is also the relative risk.
For k covariates the model is written as
log(π) = β_0 + β_1 X_1 + ... + β_k X_k   (3)
where π is the probability of success (e.g., the proportion of sick persons in a group), and X_i the covariates. The relative risk estimate for a given covariate is e^β.
Since log(π) must lie in the interval (−∞, 0], restrictions have to be imposed in the estimation process to avoid predicted probabilities outside the [0, 1] interval. When the estimates are on the boundary of the valid parameter space, the Newton-Raphson method will not converge to the maximum likelihood estimates [24]. Convergence problems are most likely to happen when the model contains a continuous covariate or several polytomous covariates, or when the outcome prevalence is high [12, 24]. Even when the estimates are not on the boundary of the parameter space, convergence problems may still happen, and starting values better than the software defaults will help. Most log-binomial models fitted in this paper used the default Stata estimation options without convergence problems. In one case, when the model failed to converge, the "search" option, which makes the procedure look for better starting values, was used [25].
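For illustration, a model that fails to converge with the defaults could be refitted as follows (a sketch using the same variable names as the command list below; the search option asks glm to look for better starting values):
*** Log-binomial regression with search for better starting values
glm ill exposed confounder [fweight = freq], family(binomial) link(log) eform search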
The results obtained from the various models were compared to the pooled Mantel-Haenszel-like prevalence ratios (MHPR) and corresponding confidence intervals, used here as the reference results. Mantel-Haenszel estimates are easy to obtain in simple situations such as those dealt with in this work (one exposure and one confounder). For more complex situations, however, their estimation is more complicated, and the use of statistical models is more efficient.
All the analyses were performed with Stata 7.0 [25], and the actual command lines used are listed below. Each outcome-exposure-confounder combination was represented by one row in the dataset and its frequency given by the variable freq.
*** M-H relative risk
cs ill exposed [fweight = freq], by(confounder)
*** Poisson regression unadjusted
poisson ill exposed confounder [fweight = freq], irr
*** Poisson regression adjusted by chi-squared
glm ill exposed confounder [fweight = freq], family(poisson) scale(x2) eform
*** Poisson regression adjusted by deviance
glm ill exposed confounder [fweight = freq], family(poisson) scale(dev) eform irls
*** Poisson regression with robust variance
poisson ill exposed confounder [fweight = freq], irr r
*** Log-binomial regression
glm ill exposed confounder [fweight = freq], family(binomial) link(log) eform
*** Odds ratio from logistic regression
logistic ill exposed confounder [fweight = freq]
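For completeness, the Cox models with equal follow-up times discussed above could be fitted along similar lines (a sketch, not part of the original command list, assuming a constant follow-up time of 1 is assigned to every individual):
*** Cox regression with equal follow-up times (standard and robust variance)
gen time = 1
stset time [fweight = freq], failure(ill)
stcox exposed confounder
stcox exposed confounder, robust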
For the comparison of the above techniques, real data from a population-based survey were used. A birth cohort was started in 1993, including all births occurring in Pelotas, Southern Brazil [26]. These children were seen at birth, and their mothers were interviewed. At 1 and 3 months of age, a sub-sample of 655 children was sought for follow-up information. At 6 and 12 months, a larger sub-sample (including the 655 seen at 1 and 3 months) was sought, comprising all children born with low birth weight and 20% of the remaining children. At these points, 1363 children were sought. The data used in this work came from a further visit, carried out between November 1997 and April 1998, when the children were 4–5 years old. The children sought were the same as those in the 12-month visit, and 1273 (93%) were actually interviewed. Of the 90 children lost to follow-up, 61 (68%) had moved to other towns, 18 (20%) could not be found, 6 (7%) had died, and 5 (6%) refused to participate. The children underwent a nutritional assessment (weight and height), and their mothers answered a standardized pre-coded questionnaire covering socioeconomic, demographic, reproductive, and health characteristics.
Three outcomes with different prevalences were used in the analyses, each in conjunction with a risk factor and a confounding factor, forming three distinct situations. The three sets of variables used respectively as outcome, risk factor, and confounder were: situation 1 – underweight (weight-for-age Z-score < −2), previous hospitalization, and birth weight; situation 2 – asthma (asthma or bronchitis reported by the person responsible for the child), maternal smoking, and social class; situation 3 – maternal employment status (whether or not in a paid job), father living with the family, and social class. All variables were made dichotomous in order to simplify the comparisons and the understanding of the models.
To widen the range of scenarios, each set of variables was manipulated to increase the level of confounding. This was achieved by arbitrarily changing the prevalences of the risk and confounding factors, re-weighting the relevant strata of the data so as to keep the sample size constant. The original and manipulated data are fully presented in the results section.
Confounding was measured by the proportional change from the crude prevalence ratio to the adjusted (Mantel-Haenszel) prevalence ratio, using the expression

confounding = (PR_MH − PR_crude) / PR_crude

so that whenever the adjusted prevalence ratio was smaller than the crude one, confounding was negative.
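For example (an illustrative calculation, not taken from the study data), a crude prevalence ratio of 2.0 and an adjusted prevalence ratio of 1.5 give (1.5 − 2.0)/2.0 = −0.25, that is, 25% of negative confounding.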