This article has Open Peer Review reports available.
Misclassification of incident conditions using claims data: impact of varying the period used to exclude pre-existing disease
© Griffiths et al.; licensee BioMed Central Ltd. 2013
Received: 13 December 2012
Accepted: 23 February 2013
Published: 6 March 2013
Estimating the incidence of medical conditions using claims data often requires constructing a prevalence period that predates an event of interest, for instance the diagnosis of cancer, to exclude those with pre-existing conditions from the incidence risk set. Those conditions missed during the prevalence period may be misclassified as incident conditions (false positives) after the event of interest.
Using Medicare claims, we examined the impact of selecting shorter versus longer prevalence periods on the incidence and misclassification of 12 relatively common conditions in older persons.
The source of data for this study was the National Cancer Institute’s Surveillance, Epidemiology, and End Results cancer registry linked to Medicare claims. Two cohorts of women were included: 33,731 diagnosed with breast cancer between 2000 and 2002, who had ≥ 36 months of Medicare eligibility prior to cancer, the event of interest; and 101,649 without cancer meeting the same Medicare eligibility criterion. Cancer patients were followed from 36 months before cancer diagnosis (prevalence period) up to 3 months after diagnosis (incidence period). Non-cancer patients were followed for up to 39 months after the beginning of Medicare eligibility. A sham date was inserted after 36 months to separate the prevalence and incidence periods. Using 36 months as the gold standard, the prevalence period was then shortened in 6-month increments to examine the impact on the number of conditions first detected during the incidence period.
In the breast cancer cohort, shortening the prevalence period from 36 to 6 months increased the incidence rates (per 1,000 patients) of all conditions; for example: hypertension 196 to 243; diabetes 34 to 76; chronic obstructive pulmonary disease 29 to 46; osteoarthritis 27 to 36; congestive heart failure 20 to 36; osteoporosis 22 to 29; and cerebrovascular disease 13 to 21. Shortening the prevalence period has less impact on those without cancer.
Selecting a short prevalence period to rule out pre-existing conditions can, through misclassification, substantially inflate estimates of incident conditions. In incidence studies based on Medicare claims, selecting a prevalence period of ≥24 months balances the need to exclude pre-existing conditions with retaining the largest possible cohort.
Medicare administrative and claims data have many uses , including estimating the overall incidence of chronic and acute conditions [2–13], as well as the incidence of adverse events related to medical interventions [14–17] or within certain care settings . Since information on medical history is not generally available in Medicare data - with some exceptions where claims are linked to other data sets [7, 18] – researchers interested in estimating disease incidence, especially chronic disease incidence, often divide the observation period into two discrete intervals: a prevalence period, in which patients already diagnosed with the condition are identified and excluded from the incident condition risk set; and an incidence period, which is then searched for claims indicating the presence of previously undetected conditions [2, 3, 6, 8, 11, 12, 15]. In studies where one is interested in the incidence of conditions that follow a specific clinical event, such as the diagnosis of cancer, the onset of end-stage renal disease, or a medical or surgical intervention, the date of diagnosis or intervention typically is used to define the end of the prevalence period and the beginning of the incidence period [3, 4, 15]. Researchers then look backwards from that date a specific number of months or years to define the prevalence period, and either they require that all patients included in the study have at least that amount of time under observation prior to that date [3, 15], or they use all of the data back to the beginning of the claims period . In other cases, a “sham” index date may be selected to indicate the beginning of the incidence period following a pre-defined period of observation [2, 6, 8, 11, 12], for instance 12 months of Medicare eligibility.
In theory, one could pick a sufficiently long prevalence period such that the “incidence only” group would include only true positives, assuming, of course, perfect sensitivity and specificity for the information used to identify the conditions (e.g., claims data). That is to say, additional lengthening of the prevalence period will not result in any additional reclassifying of “incidence only” patients as “both prevalence and incidence” patients, or in their removal from the incident risk set. Conversely, shortening the prevalence period from this point would add false positives to the true positives in the “incidence only” group. Adding false positives naturally would inflate incidence estimates. Furthermore, it could, in theory, increase type II error in studies of risk factors for the conditions, or studies on the outcomes of the conditions.
The objective of this study was to examine the impact of changing the length of the prevalence period on the observed incidence and misclassification of these conditions.
The source of data for this study was the National Cancer Institute’s (NCI) Surveillance, Epidemiology, and End Results (SEER) cancer registry linked to Medicare claims . SEER contains cancer incidence and survival data from population-based cancer registries throughout the United States, presently covering approximately 28% of the population . In SEER-Medicare, cancer registry data are linked to Medicare enrollment and claims data, which are available for 93% of those aged ≥ 65 years in the SEER registry . Claims available in SEER-Medicare include hospital short- and long-stay, skilled nursing facility, physician/supplier, institutional outpatient, home health agency, and durable medical equipment. Our dataset also included a separately created 5% random sample of non-cancer patients from the Medicare program in the same catchment areas as those in the SEER program.
Two cohorts, cancer and non-cancer, were included in the study. The cancer cohort consisted of women diagnosed with breast cancer between January 1, 2000, and December 31, 2002, who had at least 36 months of Medicare Part A (hospital) and Part B (outpatient) fee-for-service coverage prior to the diagnosis of cancer. Medicare and other sources of administrative and claims data also are used to examine the epidemiology of disease in patients without cancer. Patterns of incidence misclassification could differ between those with cancer and others in the general population, due to shared risk factors and also detection bias associated with diagnostic work-up for cancer and cancer treatment. Thus, for comparison, we also included in this study a cohort of Medicare beneficiaries who had not been diagnosed with cancer. The non-cancer cohort consisted of all women in the sample who reached a total of at least 36 months of Medicare Part A and Part B fee-for-service coverage during the same time period (January 1, 2000, and December 31, 2002). Therefore, the calendar intervals defining the prevalence and incidence periods were identical in the cancer and non-cancer cohorts.
Patients in the cancer cohort were followed from 36 months before to up to three months after the date of diagnosis. Since SEER provides only the month and year of diagnosis, the first day of the month was used as the date of diagnosis. Patients in the non-cancer cohort were assigned a “sham” index date on the first day of the 37th month after the beginning of Medicare Part A and B coverage. The observation period was divided into two periods. The prevalence period consisted of 36 months prior to cancer diagnosis or the sham index date. The incidence period consisted of three months following cancer diagnosis or the sham index date.
Patients and variables
Patients in both cohorts were described according to age, race/ethnicity, SEER region, and NCI Comorbidity Index score [22, 23]. Additional variables from SEER, including cancer stage, grade, estrogen and progesterone receptor status, and year of diagnosis, were used to describe the cancer cohort.
International classification of diseases, 9th edition, clinical modification (ICD-9-CM) codes used to identify conditions
Congestive heart failure
428.xx, 398.91, 402.01, 402.11, 402.91, 404.01, 404.03, 404.11, 404.13, 404.91, 404.93
Chronic obstructive pulmonary disease
296.2x, 296.3x, 298.0, 300.4, 309.0, 309.1, 309.28, 311
250.xx, 357.2, 362.0x, 366.41, 648.0x
571.2, 571.4-571.49, 571.5, 571.6, 571.8, 571.9, 572.2-572.8, 456.0-456.21, V42.7
410.xx, 411.0, 412.xx, 429.7x
285.21, 403.xx-404.xx, 405.01, 405.11, 405.91, 458.21, 582.xx, 583.xx, 585, 586, 588.xx, 593.71-593.73, V42.0
For each condition, each patient in each cohort was assigned to one of four mutually exclusive groups based on when the condition was identified: in both the prevalence and incidence periods; in the prevalence period only; in the incidence period only; and in neither the prevalence nor the incidence period. The 36-month prevalence period was used to exclude previously diagnosed patients to permit the identification of the incident cases (and to calculate the incidence of each condition). This analysis was considered the “gold standard” for comparing other intervals for the prevalence period, with the incident patients considered to be “true positives”.
The length of the prevalence period was then reduced in 6-month increments from 30 months to only 6 months before cancer diagnosis or the sham date. Patients were reassigned to one of the four groups for each of the prevalence periods. Shortening the prevalence period meant that some patients in the prevalence only group might move to the neither group, and those in the prevalence and incidence group might move to the incidence only group. (Figure 1) For each prevalence period (30, 24, 18, 12, and 6 months), we recalculated the incidence and the number of patients added to the incidence only group as a result of shortening the prevalence period. Patients added to the incidence only group, as an artifact of shortening the prevalence period, were classified as false positives, and for each shorter prevalence period we calculated (A) the false positive fraction, which was defined as the number of false positives in the incidence period divided by the number of conditions identified in the “gold standard” 36-month prevalence period, and (B) the percent of all incident cases that were false positives. Since all patients classified as incident cases using a 36-month prevalence period also were classified as incident cases using shorter prevalence periods, the “sensitivity” of using shorter prevalence periods (probability of being classified as an incident condition using a shorter prevalence period given one was classified as an incident condition using the 36-month prevalence period) =1 for all the shorter prevalence periods. However, the specificity (probability of not being classified as an incident condition using a shorter prevalence period given one was not classified as an incident condition using the 36-month prevalence period) declines with shorter prevalence periods, causing the false positive fraction to increase and the positive predictive value to decrease.
The epidemiology of conditions, as well as patterns of misclassification, could differ between the cancer and non-cancer cohorts based on shared risk factors for conditions and cancer, and also that cancer is likely to result in the increased detection (detection bias) of pre-existing, but previously undiagnosed conditions. Therefore, these two cohorts were analyzed separately.
This study was conducted as part of a protocol submitted to Quorum Review Institutional Review Board. On November 10, 2011, Quorum granted a determination of exemption for this protocol, based on the fact that the information in the data files is recorded in such a manner that subjects cannot be identified either directly or through identifiers linked to them. The Quorum Review File # for the exemption determination is 26648.
Demographic characteristics of the breast cancer and non-cancer patient cohorts
N = 33,731 (%)
N = 101,649 (%)
Age (years) a
SEER Region c
Georgia (Atlanta/Rural Georgia)
Washington (Seattle/Puget Sound)
NCI Comorbidity score d
Year of diagnosis
Incidence and prevalence proportions of conditions in breast cancer and non-cancer cohorts
Incidence a proportion
Prevalence a proportion
Chronic obstructive pulmonary disease
Congestive heart failure
Uses of Medicare administrative and claims data include estimating the incidence of acute and chronic medical conditions [2–13]. There is also considerable interest in using claims to identify adverse events related to treatment [14–17], and this interest may well strengthen with the availability of Medicare Part D oral drug data, and as the United States Food and Drug Administration explores new data structures and methodologies for identifying and analyzing adverse drug events . Many such analyses require that the investigator define a prevalence period to exclude patients with pre-existing conditions from the incidence risk set [2, 3, 6, 8, 11, 12, 15]. In this study, we sought to understand how changing the duration of the prevalence period, and specifically shortening it from a “gold standard” of 36 months, would affect the incidence estimates for 12 chronic and acute conditions common among the elderly. We found that the number of false positive incident cases increased considerably as the length of the prevalence period was shortened. Using a 6-month prevalence period, false positives accounted for more than 50% of all incident hypertension and diabetes cases in the cancer cohort, and there were also high false-positive rates in the non-cancer cohort.
Our findings suggest that studies using relatively short prevalence periods to exclude pre-existing disease from the incident risk set may overestimate the incidence of the condition because the risk set includes false positive (undetected) prevalent cases. This may be especially problematic in studies that estimate trends in the annual incidence of conditions where the prevalence also is increasing - namely chronic conditions like diabetes, where we observed the greatest increase in false positive incident cases associated with shortening the prevalence period. To minimize potential misclassification when repeat acute conditions are classified as prevalent (false-negative), it is critical to use disease-specific algorithms to distinguish follow-up care from new events. For example, a six month lag between two hip fracture claims would be likely to represent two separate incident fractures rather than chronic care for a single fracture .
We constructed our cancer cohort from SEER-Medicare, and used the date of cancer diagnosis to divide the observation period into prevalence and incidence periods. Researchers using SEER-Medicare for epidemiologic studies typically require only 12 months of Medicare Part A and Part B coverage prior to cancer diagnosis to calculate an NCI Comorbidity Index [22, 23]. In studies on the incidence of conditions, possibly including adverse-events related to cancer treatment where a comparator group is not included , the look-back period used to define the NCI Comorbidity Index also doubles as the prevalence period for excluding pre-existing conditions. Our findings show that with a 12-month prevalence period, false positives accounted for 20-30% of all incident diabetes, hypertension, congestive heart failure, COPD, cerebrovascular disease, and renal disease cases in cancer patients. Therefore, establishing a prevalence period based on the requirements for calculating the NCI Comorbidity Index could result in overestimating the incidence of conditions at the time of cancer diagnosis by up to 30%. Also, diluting the true positive incident cases with false positives could mask underlying associations between incident conditions and other variables. Consequently, it is advisable for researchers to consider longer prevalence periods if they are using SEER-Medicare to investigate the incidence of conditions, especially chronic conditions, that occur around the time of cancer diagnosis. However, this should be weighed against the loss of sample size that can result from requiring patients to have longer periods of Medicare Part A and B coverage prior to cancer diagnosis.
The differences between the cancer and the non-cancer cohort deserve comment. It is difficult to compare these populations directly because, in the cancer cohort, certain comorbid conditions may also be associated with cancer. Hence, one should not expect the same prevalence or incidence estimates in both cohorts, even with adjustment for age and other common confounders. Despite this, it is instructive to see how much larger the incidence rates are for most conditions in the cancer cohort (i.e., on the order of 3–15 times higher). Most likely, this is due to the diagnostic workup as part of the diagnosis of cancer and/or treatment planning – “detection bias” - suggesting that studies of cohorts undergoing a sentinel event (like breast cancer or myocardial infarction) may find additional comorbid conditions after the index event. Whether these should be considered as “baseline” comorbid conditions is something to be considered, and would depend on the condition and the importance of identifying the condition in the study.
Our study has several limitations. First, we defined a 36-month prevalence period as the gold standard for defining true positive incident cases. It is possible that had we used a longer prevalence period, some of the true positive incident cases would have been excluded from the risk set, having also appeared in the additional prevalence period. Extending the prevalence period can only reduce, not increase, the number of incident cases. Therefore, selecting a shorter rather than longer gold standard can only result in under-estimating the number of false positives. In this regard, our study may present conservative estimates of misclassification.
Second, we selected cohorts that only included patients with 36 months of Medicare Part A and B coverage prior to cancer diagnosis (cancer cohort) or the sham date (non-cancer cohort). This was necessary to identify true positive incident cases in a stable cohort. As discussed above, in reality, cohort sizes would increase as the required prevalence period gets shorter. This, in turn, could affect the incidence of the conditions we studied.
Third, we included only 12 conditions that are relatively common in older patients. Many are included in the NCI Comorbidity Index, and others are of interest as they could result from cancer interventions, or they could directly or indirectly influence cancer outcomes. In order for conditions to be defined as false positives in the incident analysis, they had to have been present in both the 36-month prevalent and incident periods, but then to drop out of the prevalent period as it became shorter. This may best describe chronic conditions with diagnoses that appear only infrequently in claims but then become relevant for medical decision-making around the time of cancer diagnosis. Patterns of misclassification for other conditions may differ from the ones we included in this study. Similarly, different definitions of these chronic conditions, i.e., the use of different ICD-9 codes or different thresholds for the number, timing, or type of claims, may give different results.
Fourth, we elected to divide the observation period into prevalence and incidence periods based on the date of cancer diagnosis. However, to the extent that diagnostic work-up for cancer begins prior to diagnosis, and that this results in increased detection of new conditions, changing the point at which the prevalence period ends and the incidence period begins could affect the incidence rates and the amount of misclassification. As with most studies based on SEER-Medicare, Medicare claims, or other sources of administrative and claims data, we did not include a “true gold standard” with perfect sensitivity and specificity for ascertaining the prevalences and incidences of the conditions. Finally, our cancer cohort included only women with breast cancer. It is possible that patterns of misclassification could differ in other types of cancers.
Estimating the incidence of medical conditions using claims data often requires constructing a prevalence period that predates an event of interest to exclude those with pre-existing conditions from the incidence risk set. Our findings show that selecting a short prevalence period can, through misclassification, substantially inflate incidence estimates. This is especially problematic with chronic conditions like diabetes. Selecting longer prevalence periods will mitigate this problem. Based on our findings, we recommend a prevalence period of ≥24 months, especially if the focus of the incidence analysis is on estimating the incidence of chronic conditions that may appear less frequently in Medicare claims. However, this decision should be weighed against the associated loss of sample size and the need to include patients of younger age.
This study used the linked SEER-Medicare database. The interpretation and reporting of these data are the sole responsibility of the authors. The authors acknowledge the efforts of the Applied Research Program, NCI; the Office of Research, Development and Information, CMS; Information Management Services (IMS), Inc.; and the Surveillance, Epidemiology, and End Results (SEER) Program tumor registries in the creation of the SEER-Medicare database.
Editorial assistance on this manuscript was provided by Kim Merjan, Jen Duryea, and Suzanne Griffiths of Outcomes Insights, Inc.
- Virnig BA, McBean M: Administrative data for public health surveillance and planning. Annu Rev Public Health. 2001, 22: 213-230. 10.1146/annurev.publhealth.22.1.213.View ArticlePubMedGoogle Scholar
- Faught E, Richman J, Martin R: Incidence and prevalence of epilepsy among older US Medicare beneficiaries. Neurology. 2012, 78 (7): 448-453. 10.1212/WNL.0b013e3182477edc.View ArticlePubMedPubMed CentralGoogle Scholar
- Danese MD, O’Malley C, Lindquist K, Gleeson M, Griffiths RI: An Observational Study of the Prevalence and Incidence of Comorbid Conditions in Older Women with Breast Cancer. Ann Oncol. 2011, Epub ahead of printGoogle Scholar
- Salifu MO, Abbott KC, Aytug S: New-onset diabetes after hemodialysis initiation: impact on survival. Am J Nephrol. 2010, 31 (3): 239-246. 10.1159/000276542.View ArticlePubMedGoogle Scholar
- Brauer CA, Coca-Perraillon M, Cutler DM, Rosen AB: Incidence and mortality of hip fractures in the United States. JAMA. 2009, 302 (14): 1573-1579. 10.1001/jama.2009.1462.View ArticlePubMedPubMed CentralGoogle Scholar
- Sloan FA, Bethel MA, Ruiz D, Shea AM, Feinglos MN: The growing burden of diabetes mellitus in the US elderly population. Arch Intern Med. 2008, 168 (2): 192-199. 10.1001/archinternmed.2007.35.View ArticlePubMedGoogle Scholar
- Parekh RS, Zhang L, Fivush BA, Klag MJ: Incidence of atherosclerosis by race in the dialysis morbidity and mortality study: a sample of the US ESRD population. J Am Soc Nephrol. 2005, 16 (5): 1420-1426. 10.1681/ASN.2004080661.View ArticlePubMedGoogle Scholar
- Bertoni AG, Hundley WG, Massing MW, Bonds DE, Burke GL, Goff DC: Heart failure prevalence, incidence, and mortality in the elderly with diabetes. Diabetes Care. 2004, 27 (3): 699-703. 10.2337/diacare.27.3.699.View ArticlePubMedGoogle Scholar
- Sloan FA, Brown DS, Carlisle ES, Ostermann J, Lee PP: Estimates of incidence rates with longitudinal claims data. Arch Ophthalmol. 2003, 121 (10): 1462-1468. 10.1001/archopht.121.10.1462.View ArticlePubMedGoogle Scholar
- Freeman JL, Zhang D, Freeman DH, Goodwin JS: An approach to identifying incident breast cancer cases using Medicare claims data. J Clin Epidemiol. 2000, 53 (6): 605-614. 10.1016/S0895-4356(99)00173-0.View ArticlePubMedGoogle Scholar
- Warren JL, Feuer E, Potosky AL, Riley GF, Lynch CF: Use of Medicare hospital and physician data to assess breast cancer incidence. Med Care. 1999, 37 (5): 445-456. 10.1097/00005650-199905000-00004.View ArticlePubMedGoogle Scholar
- May DS, Kittner SJ: Use of Medicare claims data to estimate national trends in stroke incidence, 1985–1991. Stroke. 1994, 25 (12): 2343-2347. 10.1161/01.STR.25.12.2343.View ArticlePubMedGoogle Scholar
- Fisher ES, Baron JA, Malenka DJ: Hip fracture incidence and mortality in New England. Epidemiology. 1991, 2 (2): 116-122. 10.1097/00001648-199103000-00005.View ArticlePubMedGoogle Scholar
- Keay L, Gower EW, Cassard SD, Tielsch JM, Schein OD: Postcataract surgery endophthalmitis in the United States analysis of the complete 2003 to 2004 Medicare database of cataract surgeries. Ophthalmology. 2012, Epub ahead of printGoogle Scholar
- Stein JD, Zacks DN, Grossman D, Grabe H, Johnson MW, Sloan FA: Adverse events after pars plana vitrectomy among Medicare beneficiaries. Arch Ophthalmol. 2009, 127 (12): 1656-1663. 10.1001/archophthalmol.2009.300.View ArticlePubMedPubMed CentralGoogle Scholar
- Dobbels F, Skeans MA, Snyder JJ, Tuomari AV, Maclean JR, Kasiske BL: Depressive disorder in renal transplantation: an analysis of Medicare claims. Am J Kidney Dis. 2008, 51 (5): 819-828. 10.1053/j.ajkd.2008.01.010.View ArticlePubMedGoogle Scholar
- Baxter NN, Habermann EB, Tepper JE, Durham SB, Virnig BA: Risk of pelvic fractures in older women following pelvic irradiation. JAMA. 2005, 294 (20): 2587-2593. 10.1001/jama.294.20.2587.View ArticlePubMedGoogle Scholar
- Oliveria SA, Liperoti R, L’italien G: Adverse events among nursing home residents with Alzheimer’s disease and psychosis. Pharmacoepidemiol Drug Saf. 2006, 15 (11): 763-774. 10.1002/pds.1274.View ArticlePubMedGoogle Scholar
- Warren JL, Klabunde CN, Schrag D, Bach PB, Riley GF: Overview of the SEER-Medicare data: content, research applications, and generalizability to the United States elderly population. Med Care. 2002, 40: 3-18.Google Scholar
- National Cancer Institute: Overview of the SEER Program [Internet]. 2011, Bethesda: National Cancer Institute, [Accessed 9/29/2011]. Available from: http://seer.cancer.gov/about/overview.html Google Scholar
- National Cancer Institute: Overview of the SEER Program [Internet]. 2011, Bethesda: National Cancer Institute, [Accessed 9/29/2011]. Available from: http://healthservices.cancer.gov/seermedicare/overview/linked.htm Google Scholar
- Klabunde CN, Potosky AL, Legler JM, Warren JL: Development of a comorbidity index using physician claims data. J Clin Epidemiol. 2000, 53: 1258-1267. 10.1016/S0895-4356(00)00256-0.View ArticlePubMedGoogle Scholar
- Klabunde CN, Legler JM, Warren JL, Baldwin L-M, Schrag D: A refined comorbidity measurement algorithm for claims-based studies of breast, prostate, colorectal, and lung cancer patients. Ann Epidemiol. 2007, 17: 584-590. 10.1016/j.annepidem.2007.03.011.View ArticlePubMedGoogle Scholar
- Practice Management Information Corporation: ICD-9-CM. 2005, Los Angeles: Practice Management Information Corporation, 6Google Scholar
- U.S. Department of Health and Human Services. Food and Drug Administration (FDA): The future of drug safety – promoting and protecting the health of the public. FDA’s response to the Institute of Medicine’s 2006 report. January 2007. 2007, [Accessed 3/13/2012]. Available from: http://www.fda.gov/downloads/drugs/drugsafety/postmarketdrugsafetyinformationforpatientsandproviders/ucm171627.pdf Google Scholar
- Jean S, Candas B, Belzile É: Algorithms can be used to identify fragility fracture cases in physician-claims databases. Osteoporosis Int. 2012, 23: 483-501. 10.1007/s00198-011-1559-4.View ArticleGoogle Scholar
- The pre-publication history for this paper can be accessed here:http://www.biomedcentral.com/1471-2288/13/32/prepub
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.