
Comparison of prevalence and exposure-disease associations using self-report and hospitalization data among enrollees of the World Trade Center Health Registry



Although many studies have investigated agreement between survey and hospitalization data for disease prevalence, it is unknown whether exposure-chronic disease associations vary based on data collection method. We investigated agreement between self-report and administrative data for the following: 1) disease prevalence, 2) the accuracy of self-reported hospitalization in the last 12 months, and 3) the association of seven chronic diseases (rheumatoid arthritis, hypertension, heart attack, stroke, asthma, diabetes, hyperlipidemia) with four measures of 9/11 exposure.


Enrollees of the World Trade Center Health Registry who resided in New York State were included (N = 18,206). Hospitalization data for chronic diseases were obtained from the New York Statewide Planning and Research Cooperative System (SPARCS). Prevalence for each disease and concordance measures (kappa, sensitivity, specificity, positive agreement, and negative agreement) were calculated. In addition, the associations of the seven chronic diseases with the four measures of exposure were evaluated using logistic regression.


Self-report disease prevalence ranged from moderately high (40.5% for hyperlipidemia) to low (3.8% for heart attack). Self-report prevalence was at least twice that obtained from administrative data for all seven chronic diseases. Kappa ranged from 0.35 (stroke) to 0.04 (rheumatoid arthritis). Self-reported hospitalizations within the last 12 months showed little overlap with actual hospitalization data. Agreement for exposure-disease associations was good over the twenty-eight exposure-disease pairs studied.


Agreement was good for exposure-disease associations, modest for disease prevalence, and poor for self-reported hospitalizations. Neither self-report nor administrative data can be treated as the “gold standard.” Which source to use depends on the availability and context of data, and the disease under study.



Effectively monitoring population health for chronic diseases requires reliable and accurate data. The most common type of data for measuring prevalence of chronic disease has been surveys, in which subjects self-report demographic information and chronic disease status. However, surveys require extensive planning and are often expensive to administer. Further, the accuracy of the data obtained can be influenced by sampling and recall bias and other sources of error, themselves influenced by factors such as demographic characteristics and disease status [1,2,3].

In recent decades, hospitalization data have become more commonly used, due to easier access and increased computational power. Hospitalization data are often perceived as more objective and less biased because they are record-based. Nonetheless, such data have limitations. First, the identifying information contained in administrative data is often limited to portions of the name, date of birth, gender, social security number, and address. This contrasts with the rich demographic, social, and behavioral data often obtained from surveys. Further, coding entry errors can lead to misreporting and under-reporting of chronic disease. Finally, hospitalization data are usually limited to specific geographic regions, and so cannot provide data on subjects outside the coverage region.

Since there are concerns with both survey and administrative data, caution is recommended in accepting either as objective truth. A more prudent approach would be to view the two sources of health data as complementary for the purpose of obtaining a more accurate picture of population health. However, to achieve this end it is important to evaluate the nature and degree of agreement between survey and hospitalization data.

Much research has investigated the concordance between survey self-report and hospitalization data for the prevalence of chronic physical disease. For example, in a regionally representative sample of Quebec citizens [4], the agreement between self-report and administrative data was found to be almost perfect for diabetes (kappa = 0.82), moderate for heart disease (kappa = 0.54), and fair for chronic obstructive pulmonary disorder (COPD) (kappa = 0.23). Similar results were obtained from a study in Ontario [5]. Research in the United States comparing self-report-based and Medicare-based prevalence for heart attack and diabetes found that, for both diseases, the time trends were similar across the two data sources, but prevalence was somewhat higher for self-report than for Medicare-based data [6]. A recent study of 202 Iranian participants in the RaNCD cohort, which measured agreement between self-report and official medical records using the kappa statistic, found a wide range of agreement across diseases: kappa = 0.39 for hepatitis B and hypertension, kappa = 0.65 for thyroid disease, kappa = 0.87 for ischemic heart disease, and kappa = 1.00 for cancer [7].

Monitoring the health of populations or other groups can also involve investigating the association between relevant exposures and chronic disease outcomes based on subjective self-reported vs. objective sources. However, less work has been done on this aspect of concordance research. One study [8] treated heart attack, based on both subjective self-report and administrative data, as an exposure and mortality as the outcome, and investigated their association. The authors found that among the 3.1% of respondents with self-reported heart attack, 32.8% had a claims-identified heart attack, and among the 1.4% of respondents with an administrative record of heart attack, 67.8% reported a heart attack. Self-report and administrative data had similar associations with mortality (odds ratio = 2.5 vs. 2.8). However, to the best of our knowledge, no concordance study has investigated the association between an environmental exposure and chronic physical disease as the outcome. Disaster research provides an opportunity to explore such exposure-disease associations, since during a disaster a set of subjects experience an exposure, to varying degrees, and some of these subjects subsequently develop one or more chronic physical diseases.

This study employs self-report data collected by the World Trade Center Health Registry, as well as New York State hospital discharge data matched to World Trade Center Health Registry enrollees, to evaluate the concordance between these two sources for defining physical health outcomes. The aims of this study are: 1) ascertain whether prevalence of chronic physical disease is similar in self-report compared to hospitalization data for the period 2002–2015, 2) quantify accuracy of self-report of a hospitalization within the last 12 months, 3) determine whether the association between exposure to the 9/11 attacks and physical chronic disease is similar when evaluated using self-report and hospitalization data sources.



The World Trade Center Health Registry was established in 2003 to monitor the physical and mental health consequences of the terrorist attacks of September 11th, 2001. Enrollees included rescue/recovery workers, residents, area workers, passersby, and students/staff of local schools.

Data collection

The initial World Trade Center Health Registry survey was conducted in 2003–2004 (wave 1), and subsequently in 2006–2007 (wave 2), 2011–2012 (wave 3), and 2015–2016 (wave 4). Survey data were collected via mail, web, and computer aided telephone interview (CATI). The methods used by the World Trade Center Health Registry have been described in detail in previous publications [9, 10]. The World Trade Center Health Registry was approved by the institutional review boards of the Centers for Disease Control and Prevention and the New York City Department of Health and Mental Hygiene.

Administrative data

Hospital discharge data were obtained from the New York State Department of Health’s Statewide Planning and Research Cooperative System (SPARCS). SPARCS contains data on all inpatient and emergency department discharges in New York State, except for federal and psychiatric hospitals. Each SPARCS discharge record contains, in addition to personal identifier data (e.g. date of birth), an admitting diagnosis, the principal diagnosis and 24 secondary diagnoses. For the period under investigation, 2002–2015, the SPARCS system typically contains 2–2.5 million records per year.

A World Trade Center Health Registry-SPARCS matched dataset was created by matching records from the two sources, based on a deterministic algorithm using parts of the first and last names, as well as date of birth, sex, social security number, and zip code in hierarchical rounds. SPARCS data covering the period 2002–2015 were employed in the match. Both inpatient and emergency department visits were included. This matched dataset included World Trade Center Health Registry enrollees with hospitalizations in the above period, approximately 60% (N = 42,292) of the total World Trade Center Health Registry cohort (N = 71,426).
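The hierarchical deterministic match described above can be illustrated with a minimal sketch. The paper names the identifying fields but not the individual rounds, so the key combinations in `ROUNDS`, the partial-name lengths, and all field names below are hypothetical assumptions, not the authors' algorithm.

```python
from typing import Optional

# Hypothetical match keys, tried from strictest to loosest. The rounds are
# assumptions for illustration; the study's actual rounds are not published here.
ROUNDS = [
    ("ssn", "dob", "sex"),
    ("first3", "last4", "dob", "sex", "zip"),
    ("first3", "last4", "dob", "sex"),
]


def normalize(record: dict) -> dict:
    """Add partial-name keys (first 3 / first 4 letters, upper-cased)."""
    out = dict(record)
    out["first3"] = (record.get("first_name") or "").strip().upper()[:3]
    out["last4"] = (record.get("last_name") or "").strip().upper()[:4]
    return out


def key_for(record: dict, fields: tuple) -> Optional[tuple]:
    """Return the match key, or None if any field is missing or empty."""
    values = tuple(record.get(f) for f in fields)
    return None if any(v in (None, "") for v in values) else values


def link(registry: list, discharges: list) -> dict:
    """Map each discharge id to an enrollee id, one round at a time."""
    registry = [normalize(r) for r in registry]
    discharges = [normalize(d) for d in discharges]
    matches = {}
    for fields in ROUNDS:
        # Index registry enrollees by this round's key.
        index = {}
        for r in registry:
            k = key_for(r, fields)
            if k is not None:
                index.setdefault(k, []).append(r["enrollee_id"])
        for d in discharges:
            if d["discharge_id"] in matches:
                continue  # already linked in a stricter round
            k = key_for(d, fields)
            candidates = index.get(k, []) if k is not None else []
            if len(candidates) == 1:  # accept only unambiguous matches
                matches[d["discharge_id"]] = candidates[0]
    return matches
```

Running the strictest key first means a looser round can only link records that every stricter round left unmatched, which is the usual reason for ordering deterministic rounds hierarchically.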

Analytic sample

The analytic dataset was obtained from the full World Trade Center Health Registry cohort (N = 71,426) using the following inclusion criteria: 1) Enrollees must have completed all four survey waves (N = 28,249); 2) enrollees must have resided in New York State at waves 2 and 3 (N = 46,028; see Fig. 1). The final analytic dataset (n = 18,206) included both self-report and SPARCS data.

Fig. 1 Development of the study analytic sample

Outcome measures

Seven chronic diseases were investigated: rheumatoid arthritis, hypertension, heart attack, stroke, asthma, diabetes, and hyperlipidemia (elevated cholesterol). Two binary (yes/no) outcomes were created for each disease: self-report (SR) and hospitalization (SPARCS).

Self-report binary outcome for the period 2002–2015 was obtained from the following question in the waves 2–4 World Trade Center Health Registry surveys: “Have you ever been told by a doctor or other health professional that you had any of these conditions? If yes, please provide the year you were first told you had that condition”. If the enrollee reported having the disease at any of these waves, with a year of diagnosis between 2002 and 2015, the self-report outcome was defined as yes. If the enrollee reported having the disease at waves 2–4, but the year of diagnosis was never between 2002 and 2015 for any of these waves, the self-report outcome was defined as no. If the enrollee did not report having the disease at any of waves 2–4, the self-report outcome was defined as no.
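As a sketch of the self-report rule above, the function below takes one (reported, year first told) pair per wave 2–4 survey and returns yes only when some wave reports the disease with a diagnosis year in 2002–2015. The input layout is hypothetical, and treating a missing year as no is an assumption; the text does not say how missing years were handled.

```python
STUDY_START, STUDY_END = 2002, 2015


def self_report_outcome(responses: list) -> bool:
    """Binary self-report outcome for one enrollee and one disease.

    responses: list of (reported, year_first_told) tuples, one per wave 2-4
    survey; year_first_told may be None. A reported disease with a missing
    year counts as no here, which is an assumption of this sketch.
    """
    return any(
        bool(reported) and year is not None and STUDY_START <= year <= STUDY_END
        for reported, year in responses
    )
```

For example, `self_report_outcome([(False, None), (True, 2008), (True, 2008)])` is `True`, whereas an enrollee who reports the disease at every wave with a 1995 diagnosis year is classified as no.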

The administrative binary outcome for the period 2002–2015 was set to yes if any of the enrollee’s hospitalization(s) in the SPARCS-World Trade Center Health Registry dataset contained International Classification of Diseases, Ninth Revision (ICD-9) codes for the relevant disease, in either the SPARCS principal diagnosis or the twenty-four secondary diagnoses. Otherwise, the binary outcome was set to no for that disease. ICD-9 codes relevant to the seven diseases studied here are given in Appendix 1.
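The administrative rule can be sketched as a scan of each discharge record's principal and secondary diagnosis codes. The three-digit ICD-9 categories below (250 diabetes mellitus, 493 asthma, 410 acute myocardial infarction) are illustrative only; the study's actual code lists are in its Appendix 1 and may differ. The record layout and field names are hypothetical.

```python
# Illustrative ICD-9-CM category prefixes; NOT the study's Appendix 1 lists.
ICD9_PREFIXES = {
    "diabetes": ("250",),
    "asthma": ("493",),
    "heart_attack": ("410",),
}


def normalize_code(code: str) -> str:
    """Strip the decimal point and whitespace so '250.00' becomes '25000'."""
    return code.replace(".", "").strip().upper()


def sparcs_outcome(discharges: list, disease: str) -> bool:
    """Yes if any discharge carries a matching code in the principal
    diagnosis or any secondary diagnosis (the admitting diagnosis is not
    consulted, following the definition in the text)."""
    prefixes = ICD9_PREFIXES[disease]
    for d in discharges:
        codes = [d.get("principal_dx")] + list(d.get("secondary_dx", []))
        for code in codes:
            if code and normalize_code(code).startswith(prefixes):
                return True
    return False
```

Restricting `codes` to `principal_dx` alone would give the principal-diagnosis-only variant of the outcome.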

In addition to the analysis for the period 2002–2015, we compared self-report and hospitalization data for the shorter period preceding wave 3, using the wave 3 survey question “During the last twelve months, have you been hospitalized overnight for this condition?” Both self-report and hospitalization outcomes were defined as binary. If the enrollee self-reported a hospitalization for the disease within 12 months prior to wave 3, the self-reported binary outcome was defined as yes. If the enrollee had a SPARCS hospitalization within 15 months prior to wave 3, the hospitalization outcome was defined as yes; the longer window accounts for telescoping in self-reported hospitalization. If the enrollee's self-reported or SPARCS hospitalization for the disease fell outside the 12-month or 15-month window, respectively, or if there was no self-reported or SPARCS hospitalization for the disease, the relevant binary outcome was defined as no.
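The look-back window for the SPARCS outcome can be sketched as a date comparison. The text does not specify which date anchors the window or whether admission or discharge dates were used, so anchoring on the wave 3 survey completion date and using admission dates are assumptions here, as are the function names.

```python
from datetime import date


def months_before(d: date, months: int) -> date:
    """Date `months` calendar months before `d` (day clamped to 28 so the
    result is always a valid date)."""
    total = d.year * 12 + (d.month - 1) - months
    return date(total // 12, total % 12 + 1, min(d.day, 28))


def sparcs_recent(admissions: list, survey_date: date,
                  window_months: int = 15) -> bool:
    """Yes if any admission falls within `window_months` before the survey.

    The default of 15 months mirrors the window used for SPARCS in the text;
    passing 12 gives the window implied by the survey question.
    """
    start = months_before(survey_date, window_months)
    return any(start <= a <= survey_date for a in admissions)
```

An admission 14 months before the survey counts as yes under the 15-month window but as no under a strict 12-month window, which is the kind of telescoped report the wider window is meant to capture.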

Exposure variables

Four variables were selected as exposures for analysis of exposure-disease associations: 1) Post-Traumatic Stress Disorder (hereafter PTSD) (yes/no), assessed at wave 1, 2) caught in the dust cloud (yes/no) on 9/11, 3) injured on 9/11 (yes – 1 or more injuries/no – zero injuries), and 4) witnessed three or more traumatic events on 9/11 (yes/no).

PTSD (yes/no) was assessed with the PTSD Checklist-Specific (PCL-S), a 17-item self-reported symptom scale that specifically targets the events of September 11. Probable PTSD was defined as a PCL-S score ≥ 44. The PCL possesses good psychometric properties [11, 12].
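The probable-PTSD definition amounts to summing the 17 item ratings and applying the cutoff. The sketch below assumes the standard PCL item range of 1 to 5 and requires complete responses; how the study handled missing items is not stated, so that requirement is an assumption.

```python
def probable_ptsd(item_scores: list, cutoff: int = 44) -> bool:
    """Return True when the summed 17-item PCL score meets the cutoff.

    Each item is assumed to be rated 1 (not at all) to 5 (extremely), giving
    a total between 17 and 85. Incomplete or out-of-range responses raise
    ValueError rather than being imputed (an assumption of this sketch).
    """
    if len(item_scores) != 17:
        raise ValueError("the PCL has 17 items")
    if any(not 1 <= s <= 5 for s in item_scores):
        raise ValueError("each item must be rated 1-5")
    return sum(item_scores) >= cutoff
```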

The dust cloud exposure was defined as binary (yes/no), based on the question “Was subject outdoors within dust cloud on 9/11/01” from the wave 1 survey. Being injured on 9/11 was derived from a wave 1 survey variable stating the number of injuries (excluding eye injuries) sustained on 9/11 and was treated as binary (yes – 1 or more injuries/no – zero injuries). Witnessing three or more traumatic events (e.g. seeing planes hit buildings, witnessing people jumping from buildings) was also defined as binary (yes – 3 or more events/no – 0, 1, or 2 events), and was obtained by summing the events that enrollees reported having witnessed in the wave 1 survey.


Covariates of interest included gender (male, female), race (White non-Hispanic, Black non-Hispanic, Hispanic, Asian, and Other), age at 9/11, and educational attainment (high school/GED or less, some college, and college grad/post-grad).

Statistical analysis

The main analyses focused on comparison of self-report of disease to hospitalization for the disease, for the period 2002–2015. The wave 3 survey contained, in addition to the disease self-report question, a subsidiary one asking if the enrollee had been hospitalized for that disease in the previous 12 months. We employed both questions to compare the self-reported vs. SPARCS-derived rates of hospitalization within the year preceding wave 3. Since the analytic sample contained emergency department visits in addition to inpatient visits, we performed a sensitivity analysis that repeated the above calculations using only inpatient visits. A second sensitivity analysis repeated the calculations using only the principal diagnosis from inpatient visits.

The disease prevalence, based (separately) on self-report and hospitalization data, was calculated as the ratio of the number of enrollees with the disease to the analytic sample size (n = 18,206). Concordance measures between self-report and hospitalization data were also obtained for each disease: the kappa statistic, sensitivity (of self-report vs. SPARCS), specificity (of self-report vs. SPARCS), positive agreement, and negative agreement.
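All of these measures follow from the 2×2 cross-classification of self-report against SPARCS. The sketch below uses the conventional definitions, with SPARCS as the reference for sensitivity and specificity and positive and negative agreement computed as 2a/(2a + b + c) and 2d/(2d + b + c). The paper does not print its formulas, so treating these as the ones used is an assumption, and the cell counts in the usage note are invented for illustration.

```python
def concordance(a: int, b: int, c: int, d: int) -> dict:
    """Prevalence and agreement measures from a 2x2 table.

    a: self-report yes, SPARCS yes    b: self-report yes, SPARCS no
    c: self-report no,  SPARCS yes    d: self-report no,  SPARCS no
    Raises ZeroDivisionError if a required margin is empty.
    """
    n = a + b + c + d
    observed = (a + d) / n
    # Agreement expected by chance from the marginal totals.
    expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return {
        "prevalence_self_report": (a + b) / n,
        "prevalence_sparcs": (a + c) / n,
        "kappa": (observed - expected) / (1 - expected),
        "sensitivity": a / (a + c),   # self-report yes among SPARCS yes
        "specificity": d / (b + d),   # self-report no among SPARCS no
        "positive_agreement": 2 * a / (2 * a + b + c),
        "negative_agreement": 2 * d / (2 * d + b + c),
    }
```

With hypothetical counts `concordance(20, 30, 10, 140)`, self-report prevalence is 0.25 against 0.15 for SPARCS, kappa is about 0.38, positive agreement is 0.50, and negative agreement is 0.875. This shows how negative agreement can stay high for an uncommon disease even when positive agreement is modest.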

We also examined the association of each disease with the four measures of exposure to the 9/11 attacks using both data sources. This was done using logistic regression between each disease outcome-exposure pair. All regressions controlled for age at 9/11, race/ethnicity, gender, and education. Odds ratios (OR) and their 95% confidence intervals (CI) were taken as the measure of association and its statistical significance. We investigated the difference between the self-report and SPARCS odds ratios on the log scale (i.e. the difference between the β) for all twenty-eight exposure-disease pairs and assessed the statistical significance of these differences.
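One way to compare two odds ratios on the log scale is a Wald-type z-test on the difference between the coefficients, recovering each standard error from its 95% confidence interval. The paper does not state which test was used, so this is a sketch of a common approach rather than the authors' procedure. It also treats the two estimates as independent, which is a simplification here because both come from the same enrollees.

```python
import math

Z95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def log_or_and_se(odds_ratio: float, lower: float, upper: float) -> tuple:
    """Return (beta, standard error) recovered from an OR and its 95% CI."""
    beta = math.log(odds_ratio)
    se = (math.log(upper) - math.log(lower)) / (2 * Z95)
    return beta, se


def compare_odds_ratios(self_report: tuple, sparcs: tuple) -> tuple:
    """Return (difference in beta, z statistic, two-sided p-value).

    Each argument is (OR, lower 95% limit, upper 95% limit). The variance of
    the difference ignores any covariance between the two estimates.
    """
    b1, se1 = log_or_and_se(*self_report)
    b2, se2 = log_or_and_se(*sparcs)
    diff = b1 - b2
    z = diff / math.sqrt(se1 ** 2 + se2 ** 2)
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return diff, z, p
```

Because the outcomes are measured on the same people, the two coefficients are likely positively correlated; ignoring that correlation overstates the variance of the difference and makes this version of the test conservative.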

All calculations were performed using SAS 9.4 (Cary, North Carolina).


The characteristics of the study sample (n = 18,206) are summarized in Table 1. 60.5% were male, 69.8% were White non-Hispanic, 51% were aged 25–44 years on 9/11, 52% had a college degree or post-graduate education, and 14.9% had probable PTSD at wave 1.

Table 1 Demographic characteristics of the analytic sample

The disease prevalence and concordance statistics are presented in Table 2. For all diseases, self-report prevalence was greater than hospitalization-based prevalence, by a factor of two or greater. Self-report- and hospitalization-based prevalence differences were small for some diseases (e.g., 2.4% vs. 1.1% for stroke), but moderately large for other diseases (e.g. 36.5% vs. 17.9% for hypertension). The kappa ranged from fair (0.35 for stroke) to slight (0.04 for rheumatoid arthritis). For all diseases, sensitivity was less than specificity, and positive agreement less than negative agreement. These latter differences were attenuated for diseases where the prevalence was larger (e.g., hypertension, hyperlipidemia).

Table 2 Chronic disease prevalence and agreement statistics

Self-report and SPARCS-derived hospitalizations within 12 months of wave 3 are shown in Table 3. For rheumatoid arthritis, heart attack, and stroke, the number of self-reported hospitalizations exceeded the number of SPARCS-based hospitalizations. The reverse was observed for hypertension, asthma, and diabetes. However, the overlap between the two data sources was extremely low for all six diseases, and the kappa statistics were all in the slight range. The results changed little if emergency department visits were excluded, or if only principal diagnoses were included (see Supplements S1 and S2). These agreement statistics were somewhat lower than those calculated for the 2002–2015 time period in Table 2.

Table 3 Hospitalization Self-Report at Wave 3 and Verification by SPARCS

Disease-exposure associations are shown in Table 4. The difference between odds ratios obtained from self-report and SPARCS was non-significant for twenty-four of twenty-eight exposure-disease pairs. For three exposure-disease pairs the difference in odds ratios was statistically significant, but the odds ratios did not differ substantially in size. For a single exposure-disease pair the difference in odds ratios was statistically significant, and the odds ratios differed substantially in size.

Table 4 Disease-Exposure Associations


The present study evaluated the agreement between self-report and hospitalization data sources for chronic physical diseases, based on disease prevalence and on exposure-disease associations, for subjects exposed to the September 11, 2001 attacks. We found only modest agreement for disease prevalence, but good agreement for the association of the seven chronic diseases with four measures of the 9/11 attacks.

While the level of agreement between self-report and hospitalization data for disease prevalence was modest for the study period 2002–2015, the agreement was poor for hospitalizations within 12 months of wave 3. This could result from limitations of the hospitalization data, or possibly because people were not accurate at reporting hospitalizations within a year.

Although we found modest agreement between survey and hospitalization data for disease prevalence, the associations between exposure to the 9/11 attacks and several disease outcomes were similar using both data sources. This finding is analogous to a study [8] that examined heart disease-mortality associations among Medicare recipients using both claims and self-report data for heart disease. One difference between that study and the present one is that we were able to examine whether associations remained constant when both exposure and outcome were self-reported. The current findings suggest that misreporting of disease status is non-differential for the exposure measures employed, at least among this cohort, as the point estimates obtained via hospitalization data are slightly attenuated but close to those obtained via self-report. This provides evidence that either data source may be sufficient for research on the association of chronic physical disease with environmental (or other) exposures, at least for associations measured on a relative scale. This broader result for exposure-disease associations may not hold for associations measured on an absolute scale (e.g. risk difference).

Strengths and limitations

A major strength of this study is that it employed a large prospective cohort with substantial self-report data on exposure to the 9/11 attacks, as well as on subsequent chronic disease. This allowed us to investigate both disease prevalence and its association with various measures of exposure to a disaster.

An important limitation is that we linked World Trade Center Health Registry data to a single source of hospitalization data, SPARCS. We therefore could not obtain data on chronic diseases for enrollees residing outside of New York State or for enrollees in psychiatric institutions. This likely led to reduced prevalence estimates for chronic physical disease from hospitalization data, and to poorer agreement between self-report and administrative data sources.

A further limitation of our study is the attrition that occurs between surveys. Waves 2–4 had response rates of 68, 63, and 55%, respectively. Attrition can potentially introduce bias into the concordance estimates of the present study. However, such bias was found to be small in a previous World Trade Center Health Registry study [13], at least for measures of association.


This study investigated the concordance between self-report and hospitalization data for the prevalence of chronic physical disease and its association with exposure to the 9/11 attacks. Exposure-chronic disease associations were found to be similar for the two data sources, so either can be used for this purpose. Agreement was less satisfactory for disease prevalence. Choosing a data source therefore should be based on a variety of factors such as availability and research aims.

Availability of data and materials

The datasets generated and analyzed during the current study are available from the corresponding author on reasonable request.



World Trade Center Health Registry

SPARCS: (New York) Statewide Planning and Research Cooperative System

PTSD: Post-Traumatic Stress Disorder

OR: Odds ratio

CI: Confidence interval

ICD: International Classification of Diseases


  1. Mheen VD. Recall Bias in Self-reported Childhood Health: Differences by Age and Educational Level. Soc Health Illness. 1998;20(2):241–54.

  2. Rhodes AE, Fung K. Self-reported use of mental health services versus administrative records: care to recall? Int J Methods Psychiatr Res. 2004;13(3):165–75.

  3. Schmier JK, Halpern MT. Patient recall and recall bias of health state and health status. Exp Rev Pharmacoecon Outcomes Res. 2004;4(2):159–63.

  4. Fortin M, Haggerty J, Sanche S, Almirall J. Self-reported versus health administrative data: implications for assessing chronic illness burden in populations. A cross-sectional study. CMAJ Open. 2017;5(3):E729–33.

  5. Muggah E, Graves E, Bennett C, Manuel DG. Ascertainment of chronic diseases using population health data: a comparison of health administrative data and patient self-report. BMC Public Health. 2013;13(1):16.

  6. St Clair P, et al. Using self-reports or claims to assess disease prevalence: It's complicated. Med Care. 2017;55(8):782–8.

  7. Najafi F, Moradinazar M, Hamzeh B, Rezaeian S. The reliability of self-reporting chronic diseases: how reliable is the result of population-based cohort studies. J Prev Med Hyg. 2019;60(4):E349–53.

  8. Yasaitis LC, Berkman LF, Chandra A. Comparison of self-reported and Medicare claims-identified acute myocardial infarction. Circulation. 2015;131(17):1477–85; discussion 1485.

  9. Brackbill RM, Hadler JL, DiGrande L, Ekenga CC, Farfel MR, Friedman S, et al. Asthma and posttraumatic stress symptoms 5 to 6 years following exposure to the world trade center terrorist attack. JAMA. 2009;302(5):502–16.

  10. Farfel M, DiGrande L, Brackbill R, Prann A, Cone J, Friedman S, et al. An overview of 9/11 experiences and respiratory and mental health conditions among world trade center health registry enrollees. J Urban Health. 2008;85(6):880–909.

  11. Blanchard EB, Jones-Alexander J, Buckley TC, Forneris CA. Psychometric properties of the PTSD checklist (PCL). Behav Res Ther. 1996;34(8):669–73.

  12. Wilkins KC, Lang AJ, Norman SB. Synthesis of the psychometric properties of the PTSD checklist (PCL) military, civilian, and specific versions. Depress Anxiety. 2011;28(7):596–606.

  13. Yu S, Brackbill RM, Stellman SD, Ghuman S, Farfel MR. Evaluation of non-response bias in a cohort study of world trade center terrorist attack survivors. BMC Res Notes. 2015;8(1):42.



The authors thank Ms. Kacie Seil for her thorough proof of the SAS codes for this paper, and Mark Farfel, Sharon Perlman, Charon Gwynn, and Hannah Helmy for their critical review of this paper.


This publication was supported by Cooperative Agreement Numbers 2 U50/OH009739 and 5 U50/OH009739 from the National Institute for Occupational Safety and Health (NIOSH) of the Centers for Disease Control and Prevention (CDC); U50/ATU272750 from the Agency for Toxic Substances and Disease Registry (ATSDR), CDC, which included support from the National Center for Environmental Health, CDC, and by the New York City Department of Health and Mental Hygiene (NYC DOHMH). Its contents are solely the responsibility of the authors and do not necessarily represent the official views of NIOSH, CDC or the Department of Health and Human Services. The funders played no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.

Author information



All authors have read, contributed to, and approved this manuscript. HEA, JB, JEC, and RMB conceived the study. HEA led the manuscript development and writing. HEA performed the coding and statistical analysis of the data. All authors discussed the results and implications and commented on the manuscript at all stages.

Corresponding author

Correspondence to Howard E. Alper.

Ethics declarations

Ethics approval and consent to participate

The analytic data for this study are from the World Trade Center Health Registry surveys. All participants gave verbal informed consent to participate in the WTC Health Registry at the time of enrollment in 2003–04. The US Centers for Disease Control and Prevention and the New York City Department of Health and Mental Hygiene institutional review boards approved the World Trade Center Health Registry protocol, including use of the data.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1: Appendix 1.

ICD-9 Codes for Chronic Diseases.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. A copy of this licence is available from Creative Commons. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article


Cite this article

Alper, H.E., Brite, J., Cone, J.E. et al. Comparison of prevalence and exposure-disease associations using self-report and hospitalization data among enrollees of the world trade center health registry. BMC Med Res Methodol 21, 162 (2021).
