Data and measures used
For investigation of nonresponse bias
We examined data on 17,370 personnel who had been sampled for the first wave of data collection of the Iraq war cohort study. All personnel had been employed in the military between January 18th and June 28th 2003: 7,621 (labelled Op TELIC 1) were recorded as having been deployed in Iraq during this period and 9,749 (labelled Era) were not recorded as having been deployed on Op TELIC 1. Participants were contacted by post, or were asked to complete a questionnaire during military unit visits made by the research team. Up to 5 further attempts were made to recruit initial non-responders. Reservist personnel were over-sampled by a ratio of 2:1. The study received approval from the Ministry of Defence (Navy) personnel research ethics committee and the King's College Hospital local research ethics committee. Full details of the study design, the participants and the questionnaire are described in [12].
We excluded 129 personnel who appeared never to have received a questionnaire (i.e. all mailings were returned to sender, or they were recorded as absent during a military unit visit), as well as 42 who were recorded as having died during the study and 166 (1%) who refused to take part in the study. Of the remaining 17,162 personnel, 10,256 (60%) were listed as having returned the questionnaire and were labelled 'responders'.
Demographic information, including age, rank, Service and address, for individuals in our sample was provided by the Defence Analytical Services Agency (DASA), who also provided a monthly fitness category for each person, indicating whether or not they were fit for active duty during that month, known in military jargon as "downgrading status". This study is unusual in that we were able to ascertain the health of non-responders for over two years following the start of the study. Fitness data were available for 99% of regulars and for 55% of reservists. For the purpose of this study 'fit' was defined as fit to deploy at all times between May 2003 (end of TELIC 1) and August 2005. Reservists were excluded from all analyses using the fitness data because of the large percentage with missing data. They were, however, included in all other analyses since reservists showed the biggest health differences between TELIC 1 and Era.
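As an illustration, the binary 'fit' indicator could be derived from the monthly fitness categories along the following lines. This is a minimal R sketch only: the data frame monthly_fitness and its columns (person_id, month as a Date, fit_for_duty as a logical) are hypothetical, and persons with missing months would need additional handling.

```r
# Hypothetical input: data frame 'monthly_fitness' with one row per person per
# month and columns person_id, month (Date), fit_for_duty (logical).
window <- subset(monthly_fitness,
                 month >= as.Date("2003-05-01") & month <= as.Date("2005-08-31"))

# 'Fit' = fit to deploy in every month of the window (May 2003 to August 2005).
fit <- aggregate(fit_for_duty ~ person_id, data = window, FUN = all)
names(fit)[2] <- "fit"
```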
For investigation of bias across response waves
For this part of the analysis we used data on the response patterns, fitness indicators and replies to health questions of 10,234 survey participants (labelled 'full responders'), after excluding 18 responders who completed only the first page of the questionnaire. These respondents had been sent (or believed they had been sent) the incorrect questionnaire, i.e. a questionnaire tailored for the TELIC 1 group when they had not been deployed on TELIC 1. A further 57 responders were re-assigned from the TELIC 1 group to the Era group, and 22 from the Era group to the TELIC 1 group, after establishing that they had been wrongly classified [12].
The paper by Stang et al., on which we based the simulations, considers error in the exposure variable, for example alcohol consumption, and assumes that the outcome, for example liver cancer, is known. Since the exposure (deployment on TELIC 1) is known in the Iraq war study, we are concerned with misclassification of the outcome, but the same principles apply [13]. We consider two health outcomes: multiple physical symptoms (18 or more physical symptoms) and post-traumatic stress disorder (PTSD), defined as a score of 50 or more on the PTSD Checklist (PCL), a commonly used measure of PTSD [14]. We have defined outcome misclassification as "errors caused by carelessness in completing the questionnaire." Another possibility would have been to define misclassification as under- or over-reporting of multiple physical symptoms. However, since the purpose of the Iraq war study was to identify people who perceived that they had a health problem, rather than to identify those who had some quantifiable disease, the first definition seemed more apt for this investigation.

We used two measures to assess the extent of misclassification: 1. the percentage of discrepant answers to two health questions that asked a similar question in different ways; and 2. the percentage of missing answers to the PTSD and other health questions. For the first measure, respondents were labelled 'discrepant' if they gave the same (and hence contradictory) answer to the two questions "I'm as healthy as anyone I know" and "I seem to get ill more easily than other people," where the choice of answers was "definitely true", "mostly true", "mostly false" or "definitely false" [15]. Two variables were constructed from this measure: 'discrepant 1' excluded any missing values for the two questions, and 'discrepant 2' labelled those with missing values for both questions as discrepant. For the second measure, having missing health data was defined as falling into at least one of the following categories: 1. having at least 4 missing answers to either the PTSD checklist or the General Health Questionnaire 12 [16]; 2. not answering either of the two questions described above; 3. not answering a question on general health. The questions on multiple physical symptoms were not included in this measure since participants were only required to respond to them if they had at least one symptom. Full details of all the questions on health are provided in [12].
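A sketch in R of how these two misclassification measures could be coded. The data frame q and its column names (the two oppositely worded health items, counts of missing PCL and GHQ-12 answers, and the general health item) are hypothetical; the coding used in the study itself may differ.

```r
# Hypothetical questionnaire data frame 'q' with the two oppositely worded
# health items coded 1 = "definitely true" ... 4 = "definitely false".
same_answer <- q$healthy_as_anyone == q$ill_more_easily

# discrepant1: contradictory (identical) answers; missing values left as NA.
q$discrepant1 <- same_answer

# discrepant2: as above, but those missing both items are also labelled discrepant.
q$discrepant2 <- ifelse(is.na(q$healthy_as_anyone) & is.na(q$ill_more_easily),
                        TRUE, same_answer)

# Missing health data: at least 4 missing PCL or GHQ-12 items, or one (or both)
# of the two items above missing, or the general health item missing.
q$missing_health <- (q$n_missing_pcl >= 4 | q$n_missing_ghq >= 4 |
                     is.na(q$healthy_as_anyone) | is.na(q$ill_more_easily) |
                     is.na(q$general_health))
```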
As in [6], wave was defined as the number of contacts needed before a successful response, after excluding any attempts where the questionnaire was returned to sender or the person was listed as not present at a unit visit (e.g. wave 1 respondents are those who responded at first contact). Two measures were used to assess the prevalence of the outcomes: those obtained from the questionnaires, and the fitness category for each person. Although previous evidence has shown that the correlation between fitness status and perceived health may be quite weak [17], fitness status provides some indication of the likely physical and mental health of respondents at each wave.
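As an illustration, wave could be derived from a contact log as in the following R sketch; the data frame contacts, its columns and the outcome codes are hypothetical.

```r
# Hypothetical contact log 'contacts': one row per contact attempt per person,
# with columns person_id, attempt (1, 2, ...) and outcome
# ("responded", "return_to_sender", "absent", "no_reply").
valid <- subset(contacts, !(outcome %in% c("return_to_sender", "absent")))
valid <- valid[order(valid$person_id, valid$attempt), ]

# Renumber the remaining attempts within each person and take the attempt at
# which the response arrived: wave = number of valid contacts needed.
valid$valid_attempt <- ave(valid$attempt, valid$person_id, FUN = seq_along)
wave <- with(subset(valid, outcome == "responded"),
             data.frame(person_id = person_id, wave = valid_attempt))
```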
Analysis
Statistical analysis
Statistical analyses were carried out in Stata 9 (Stata Corporation, Texas, USA), using the svy commands and sampling weights to adjust for the oversampling of reservists.
The factors that differed between responders and non-responders were identified using the chi-squared test, and a multivariable logistic regression model based on these factors (including any significant interactions) was used to predict the probability of response. These probabilities were used to construct an inverse probability weight for each responder, which was then multiplied by the sampling weight. Relative risks for the main health outcomes were estimated with and without the response weights and compared in order to determine the extent of nonresponse bias.
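A minimal sketch of this weighting step in R (the study itself used Stata's svy commands). The data frame sample_frame, the predictor names and the interaction shown are hypothetical and stand in for whichever factors were found to differ between responders and non-responders.

```r
# Hypothetical data frame 'sample_frame': one row per sampled person, with
# 'responded' (0/1), the predictors of response and the sampling weight 'samp_wt'.
resp_model <- glm(responded ~ age_group + rank + service + reservist +
                    rank:reservist,          # interaction shown for illustration
                  family = binomial, data = sample_frame)

# Predicted probability of response (assumes no missing predictor values).
p_resp <- predict(resp_model, type = "response")

# Inverse probability of response weight, multiplied by the sampling weight;
# relative risks are then estimated with and without 'final_wt' and compared.
sample_frame$resp_wt  <- ifelse(sample_frame$responded == 1, 1 / p_resp, NA)
sample_frame$final_wt <- sample_frame$resp_wt * sample_frame$samp_wt
```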
All relative risks were estimated using Poisson regression [18]. The estimates of relative risks across response waves were adjusted for age, sex, rank, Service and reservist status, but (in contrast to [12]) we excluded any covariates that might themselves be misclassified and hence introduce extra bias [13]. The Rao–Scott second-order correction was used for chi-squared tests, and an extension of the Wilcoxon rank-sum test was used to test for trends. Sampling weights were used for all analyses (and reported percentages) except the tests for trend and the Spearman correlation. All reported p values are two-sided.
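An analogous relative-risk model could be fitted in R with the survey package, as sketched below; the study used Stata, and the data frame responders and its variable names are hypothetical.

```r
library(survey)

# Hypothetical responder data with a binary outcome (e.g. ptsd, 0/1), the
# exposure group (telic1 = 1 for Op TELIC 1, 0 for Era) and the combined weight.
des <- svydesign(ids = ~1, weights = ~final_wt, data = responders)

# Poisson regression of a binary outcome gives relative risks directly;
# the design-based (sandwich) standard errors account for the weights.
fit <- svyglm(ptsd ~ telic1 + age_group + sex + rank + service + reservist,
              design = des, family = quasipoisson())
exp(cbind(RR = coef(fit), confint(fit)))
```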
Simulations
The equation presented on page 206 of [6] was used (1) to simulate the 'true' (unbiased) relative risks that would have been observed at wave 4 (for all responders) if there had been no misclassification, and (2) to simulate the biased 'observed' relative risks for waves 1–3 that would result from these 'true' relative risks over a range of 'true' prevalence rates. We compared the simulated observed relative risks with those estimated from the data. We used the proportions of discrepant answers and missing data as measures of misclassification (unlike [6], which used hypothesised specificity and sensitivity). Full details of the calculations are provided in the additional material (see Additional file 1). The R programming language was used for all the simulations [19].
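To illustrate the general idea, the R sketch below applies the standard non-differential outcome misclassification relationship, p_obs = Se·p_true + (1 − Sp)·(1 − p_true), within each exposure group and recomputes the relative risk. This is a generic sketch only, with assumed sensitivity and specificity values; it is not necessarily the exact equation from page 206 of [6] or the calculation in Additional file 1.

```r
# Biased 'observed' relative risk implied by a 'true' relative risk when the
# binary outcome is misclassified non-differentially in both exposure groups.
observed_rr <- function(p0_true, rr_true, se, sp) {
  p1_true <- rr_true * p0_true                 # 'true' prevalence in the exposed
  p1_obs  <- se * p1_true + (1 - sp) * (1 - p1_true)
  p0_obs  <- se * p0_true + (1 - sp) * (1 - p0_true)
  p1_obs / p0_obs
}

# Example: a 'true' RR of 1.3 over a range of 'true' baseline prevalences,
# with 2% of outcomes assumed misclassified in each direction (illustrative).
sapply(seq(0.05, 0.30, by = 0.05),
       function(p0) observed_rr(p0, rr_true = 1.3, se = 0.98, sp = 0.98))
```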