### Comparison of sampling schemes under different models

Under the **Full** model, both sample designs and both fitting strategies yield unbiased estimates. For the design-related variables, the variance of the parameter estimates is slightly smaller with simple random sampling than with weighted sampling. In contrast, the variance of the estimates for the interaction of *smoking* with each professional category depends on the sample weighting: it is smaller for the *professional* stratum, larger for the *technicians*, and much larger for the *administrative* group (Figure 2). Note that when the model is completely specified, stratified sampling returns almost exactly the same point estimates whether or not the weights are included.
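This invariance can be illustrated with a toy rate model rather than the Cox fits used in the paper: when the sampling weight is constant within a stratum, any stratum-specific (i.e. fully specified) estimate is unchanged by weighting, because the weight cancels in numerator and denominator. The stratum names and weights below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stratified sample: three strata with different sampling weights.
# Within each stratum the weight is constant, so a stratum-specific event-rate
# estimate (events per unit of person-time) is unchanged by weighting.
strata = {"professional": 2.0, "technician": 5.0, "administrative": 20.0}

for name, w in strata.items():
    n = 200
    time = rng.exponential(1.0, n)      # simulated follow-up times
    event = rng.random(n) < 0.7         # simulated event indicator
    unweighted = event.sum() / time.sum()
    weighted = (w * event).sum() / (w * time).sum()
    # the constant weight cancels exactly
    assert np.isclose(unweighted, weighted)
```

The same cancellation is why the fully specified model is insensitive to the weights: every parameter is identified within a single stratum.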

The **Marginal** model, with a single common effect of *smoking* across all strata, yields similar and unbiased estimates of the design factor under both sample designs (Figure 3), when compared with the **Marginal** population parameters. The hazard ratios for *smoking*, whether estimated under random sampling or with the sample-design correction, were similar and unbiased. However, these estimates were strongly biased when the sampling weights were omitted from the model: the 95% interval of the distribution of the estimates did not include the true parameter value. The argument in favor of omitting the sample weights is that it improves precision [17, 18], but in our example this increased precision came at the cost of excluding the true value of the parameter.
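The mechanism behind this bias can be sketched with a simpler weighted-mean analogue (the population sizes, rates, and sample sizes below are hypothetical, not the paper's data): pooling an over-sampled stratum without weights pulls the marginal estimate toward that stratum, while Horvitz-Thompson-style weighting recovers the population value.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population: three strata with unequal sizes and event rates.
N = np.array([10000, 3000, 500])        # stratum population sizes
rate = np.array([0.10, 0.20, 0.40])     # true stratum event probabilities
n = np.array([200, 200, 200])           # equal sample sizes -> unequal weights
w = N / n                               # sampling weights

pop_rate = (N * rate).sum() / N.sum()   # true marginal rate (~0.133)

samples = [rng.random(k) < p for k, p in zip(n, rate)]
y = np.concatenate(samples)
weights = np.concatenate([np.full(k, wi) for k, wi in zip(n, w)])

unweighted = y.mean()                            # pulled toward the small stratum
weighted = (weights * y).sum() / weights.sum()   # Horvitz-Thompson style estimate
```

Here the unweighted mean sits near the simple average of the stratum rates (about 0.23), far from the population value, while the weighted estimate is close to it.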

The **Smoke-only** model returned very similar results (Figure 4), with smaller variance but strong bias. The average risk for smoking, ignoring the interaction with professional category, is in fact 2.21 (Table 1). Here the misspecification of the model caused an overestimation of the smoking effect, which absorbed the effect of professional category. Most studies include the variable indicating the sampling strata in the model, even while ignoring the sample weights [6, 7], in the belief that this procedure, unfortunately insufficient, corrects for the design effect. The crude estimated effect, commonly used in exploratory analysis and to select the most important variables, is also misleading, as are the Kaplan-Meier estimates and Mantel-Haenszel (or log-rank) tests [19]. Although correcting for the sample weights is simple and feasible, it is rarely done.
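The direction of this bias can be reproduced with hypothetical rates (illustrative numbers only, not those of Table 1): smoking doubles the event rate within each stratum, but because smokers concentrate in the high-rate stratum, the crude rate ratio comes out well above 2.

```python
# Hypothetical numbers: within each stratum smoking doubles the event rate,
# but smokers are concentrated in the high-rate stratum, so the crude
# (pooled) rate ratio overstates the smoking effect.
# stratum: (non-smoker rate, person-time non-smokers, person-time smokers)
strata = {
    "low-rate stratum":  (0.05, 900.0, 100.0),
    "high-rate stratum": (0.20, 200.0, 800.0),
}
RR = 2.0  # true within-stratum rate ratio for smoking

ev_s = ev_ns = pt_s = pt_ns = 0.0
for r0, t_ns, t_s in strata.values():
    ev_ns += r0 * t_ns        # expected events among non-smokers
    ev_s += RR * r0 * t_s     # expected events among smokers
    pt_ns += t_ns
    pt_s += t_s

crude_rr = (ev_s / pt_s) / (ev_ns / pt_ns)
print(round(crude_rr, 2))  # about 4.75, well above the true within-stratum 2.0
```

The crude ratio absorbs the stratum effect exactly as the **Smoke-only** model does: the omitted category variable is confounded with the exposure.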

### Comparison of modeling strategies in terms of loss to follow-up

Random loss is a non-informative censoring mechanism; it therefore affects only precision, with results similar to those presented in the previous section (Figures 5 to 7, upper frames). If the model is well specified, the covariate associated with loss absorbs the effect of the loss to follow-up, as shown in Figure 5. As expected, because this is informative censoring, the larger losses in the *administrative* category decrease its hazard in all models and under all sampling strategies.

The **Marginal** model (Figure 6), with a non-weighted fit, displays a bias for smoking similar to that of the same model without losses (Figure 3). Attrition is a recognized problem in longitudinal studies [20]. Yang and Shoptaw (2005) [21] present a thorough discussion of conceptual and practical issues in analyzing incomplete longitudinal data. However, in our simulations the impact of ignoring the sample weights is larger than the impact of dropout, which in our example is not as large as in some of the studies discussed. The bias for smoking in the **Smoke-only** model (Figure 7) points in two directions. When the sampling weights were not included, the model overestimates the hazard for smoking. On the other hand, as losses were larger in the *administrative* stratum, the estimates decrease under random sampling and in the weighted model. This feature was already present, although less visible, in the **Marginal** model with non-random loss to follow-up. Analysis of the crude effect of smoking, using a Mantel-Haenszel test, should include the non-administratively censored group as a separate category.
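As a reminder of the mechanics, the Mantel-Haenszel summary rate ratio pools stratum-specific comparisons of exposed and unexposed person-time; a separate dropout category would simply enter as an additional stratum level in such a table. The counts below are hypothetical.

```python
# Hypothetical stratified event counts and person-time:
# (exposed events, exposed time, unexposed events, unexposed time) per stratum.
# Within each stratum the true rate ratio is 3.0.
strata = [
    (30, 100.0, 20, 200.0),
    (40, 80.0, 25, 150.0),
]

num = den = 0.0
for a, t1, b, t0 in strata:
    T = t1 + t0
    num += a * t0 / T   # exposed events weighted by unexposed time share
    den += b * t1 / T   # unexposed events weighted by exposed time share

mh_rr = num / den  # Mantel-Haenszel summary rate ratio
print(round(mh_rr, 2))  # 3.0 when the stratum-specific ratios are homogeneous
```

When the stratum-specific ratios are homogeneous, as here, the summary reproduces them exactly; heterogeneity across strata (as with the smoking-by-category interaction) is what the summary hides.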

### Overall comparison

The average variance of the estimates for each covariate (Figure 8) is very similar between the weighted and random-sampling models, both with and without loss to follow-up. As expected, with the smaller number of events due to the losses, the average variance shifted towards higher values. In the non-weighted model, the mean variance for the smoking variable, alone or in interactions, tends to decrease, except for smoking among professionals; in that case the variance is very large because both the total number of observations and the hazard in this category are small.

Mean square error (MSE) is the sum of the variance and the squared bias of the estimates. This statistic is a good summary of the quality of a point estimate, as it combines random and systematic error [22, 23]. The agreement between random sampling and the weighted model in Figure 9 mirrors that described above for the average variance of the estimates. In the non-weighted model, however, the systematic error predominates, making it the worst fit for all variables except the interaction of *smoking* with the *technicians* and the *administrative* staff. The loss-to-follow-up simulations displayed similar patterns, with much larger MSE.
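The decomposition can be checked numerically: for any set of replicate estimates, the mean squared deviation from the true value equals the variance plus the squared bias. The replicate values below are simulated for illustration, with a biased-but-tight estimator mimicking the non-weighted fit.

```python
import numpy as np

rng = np.random.default_rng(2)
true_beta = 0.8

# Hypothetical replicate estimates: an unbiased but noisier estimator
# versus a biased but tighter one (mimicking the non-weighted fit).
unbiased = rng.normal(true_beta, 0.30, 5000)
biased = rng.normal(true_beta + 0.35, 0.10, 5000)

def mse(est, truth):
    # MSE = variance + squared bias
    return est.var() + (est.mean() - truth) ** 2

for est in (unbiased, biased):
    # identical to the mean squared deviation from the truth
    assert np.isclose(mse(est, true_beta), ((est - true_beta) ** 2).mean())

# the bias term dominates: the tighter estimator has the larger MSE
assert mse(biased, true_beta) > mse(unbiased, true_beta)
```

This is the pattern of Figure 9: the non-weighted model gains variance but loses far more through its squared bias.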

The simulation exercise was restricted to Cox regression, with only a few scenarios. We tested many other scenarios, with additional covariates, omitted risk factors, and so on, but chose to present only these simpler models in order to highlight the impact of ignoring the sample weights. Evidently, the large disparity in sample weights favored a clear demonstration of the bias; however, these sample weights reflect our experience. Other modeling approaches, such as repeated-measures analysis, were not implemented, and different results could be obtained.

If non-administrative censoring is considerable, a valuable tool is the sub-distribution hazard approach, which re-weights individuals in the risk set; the sampling weights themselves could be recalculated at each dropout [24].
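A minimal sketch of the re-weighting idea (not a full sub-distribution hazard implementation): estimate the censoring survival function G(t) by Kaplan-Meier, treating dropouts as the events of the censoring process, and up-weight subjects still in the risk set by 1/G at the previous time point. The data and variable names below are hypothetical.

```python
import numpy as np

# Hypothetical follow-up data: time, event (1 = event, 0 = dropout).
time = np.array([2.0, 3.0, 3.0, 5.0, 6.0, 7.0, 9.0, 10.0])
event = np.array([1, 0, 1, 0, 1, 1, 0, 1])

# Kaplan-Meier estimate of the censoring survival G(t):
# dropouts are the "events" of the censoring process.
order = np.argsort(time)
t, c = time[order], 1 - event[order]
G, g = np.ones_like(t), 1.0
for i in range(len(t)):
    at_risk = len(t) - i
    g *= 1.0 - c[i] / at_risk
    G[i] = g

# Inverse-probability-of-censoring weights: subjects remaining in the
# risk set are up-weighted by 1 / G evaluated just before their time.
ipcw = 1.0 / np.concatenate(([1.0], G[:-1]))
```

The weights start at 1 and grow as dropouts accumulate, so the subjects who remain stand in for those who were lost; in a full analysis these weights would multiply the sampling weights at each risk set.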