Validation of diagnosis codes to identify side of colon in an electronic health record registry

Background The use of real-world data to generate evidence requires careful assessment and validation of critical variables before drawing clinical conclusions. Prospective clinical trial data suggest that anatomic origin of colon cancer impacts prognosis and treatment effectiveness. As an initial step in validating this observation in routine clinical settings, we explored the feasibility and accuracy of obtaining information on tumor sidedness from electronic health records (EHR) billing codes. Methods Nine thousand four hundred three patients with metastatic colorectal cancer (mCRC) were selected from the Flatiron Health database, which is derived from de-identified EHR data. This study included a random sample of 200 mCRC patients. Tumor site data derived from International Classification of Diseases (ICD) codes were compared with data abstracted from unstructured documents in the EHR (e.g. surgical and pathology notes). Concordance was determined via observed agreement and Cohen’s kappa coefficient (κ). Accuracy of ICD codes for each tumor site (left, right, transverse) was determined by calculating the sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV), and corresponding 95% confidence intervals, using abstracted data as the gold standard. Results Study patients had similar characteristics and side of colon distribution compared with the full mCRC dataset. The observed agreement between the ICD codes and abstracted data for tumor site for all sampled patients was 0.58 (κ = 0.41). When restricting to the 62% of patients with a side-specific ICD code, the observed agreement was 0.84 (κ = 0.79). The specificity (92–98%) of structured data for tumor location was high, with lower sensitivity (49–63%), PPV (64–92%) and NPV (72–97%). Demographic and clinical characteristics were similar between patients with specific and non-specific side of colon ICD codes. 
Conclusions ICD codes are a highly reliable indicator of tumor location when the specific location code is entered in the EHR. However, non-specific side of colon ICD codes are present for a sizable minority of patients, and structured data alone may not be adequate to support testing of some research hypotheses. Careful assessment of key variables is required before determining the need for clinical abstraction to supplement structured data in generating real-world evidence from EHRs. Electronic supplementary material The online version of this article (10.1186/s12874-019-0824-7) contains supplementary material, which is available to authorized users.


Background
Historically, prospective randomized clinical trials have served as the "gold standard" for evidence generation in oncology. Given that only a small percentage of cancer patients take part in clinical research studies [1], there is increasing interest in leveraging the data contained in administrative and clinical databases for patients treated outside of clinical trials, as these data can provide guidance for treatment decisions. Such real-world data have the potential to be more representative of patients in routine practice, given that clinical trials tend to enroll highly selected patients who are younger and have fewer comorbidities. Furthermore, real-world data can supplement the results of prospective clinical trials in settings where accrual is difficult due to uncommon clinical or genomic selection criteria. Recently, the Twenty-first Century Cures Act [2] and United States Food and Drug Administration 2018 Goals [3] both highlighted the imperative to understand how real-world data can be optimally used to improve health.
The data contained in electronic health records (EHRs) afford an important opportunity to test hypotheses regarding patterns of care and outcomes in a broadly representative sample of cancer patients. EHR data are characterized by date and may not require third-party primary data collection. However, the impact of real-world data is dependent upon the reliability of specific data elements, their completeness, and the ability to ensure and trace their provenance [4]. Thus, the promise of EHR data can only be realized if each data point is carefully assessed and validated before clinical conclusions are drawn.
As an example of the research application of EHR data, we sought to validate in a real-world setting recent clinical findings from a group of prospective clinical trials in patients with metastatic colorectal cancer (mCRC). Historically, the clinical development of systemic therapies for mCRC has not distinguished patients based on the location of the tumor within the bowel. However, recent analyses have suggested that the anatomical side of the colon from which a tumor arises is a prognostic and predictive indicator of survival [5][6][7][8][9]. These studies have indicated that CRCs arising from the left or right side of the colon differ significantly in their clinical characteristics and gene expression profiles [10][11][12][13], with right-sided tumors being associated with a worse prognosis [14][15][16]. Therapeutic outcome also may differ by tumor side, with several analyses reporting differences in benefit with epidermal growth factor receptor and vascular endothelial growth factor antibodies in left- vs right-sided mCRC tumors [5,6]. These findings led to a recent international expert panel recommendation that primary tumor location be included as an essential data element in the design and reporting of colon cancer clinical trials [17]. As an initial step in seeking to replicate these findings in a real-world population, we undertook a formal analysis of the ability to obtain information about tumor sidedness from billing codes (International Classification of Diseases [ICD] 9/10) in EHRs. The overall goal of this study was to determine the feasibility of using structured diagnostic codes to determine tumor location for patients with mCRC. The formal validation approach described herein may be broadly applied to other clinical contexts where data points from EHRs are being considered for use in outcomes research.

Data source
This validation study was conducted using the nationwide Flatiron Health database, a longitudinal, demographically and geographically diverse database derived from de-identified EHR data. The Flatiron Health database includes data from over 265 cancer clinics, comprising both community and academic oncology clinics, representing more than 2 million US cancer patients available for analysis. The de-identified patient-level data in the EHRs include structured data (e.g. billing codes, laboratory measurements, visits, and prescribed drugs) and unstructured data curated via technology-enabled chart abstraction from physicians' notes and other unstructured documents (e.g. physician progress notes, pathology reports).

Patient selection
From the broader Flatiron Health EHR-derived database, a cohort of mCRC patients was created. Patients were selected for an ICD-code of colon or rectal cancer (153.x, 154.x, C18x, C19x, C20x, or C21x), at least two clinic visits in the Flatiron network that occurred on or after January 1, 2013, and clinical documentation of mCRC. Patients lacking relevant unstructured documents in the Flatiron Health database for abstraction were excluded. Of 9403 patients with confirmed metastatic colon cancer, a random sample cohort of 200 patients who met the above criteria was included in this study. The random sample was selected using a random number generator with a specified seed so that the list of patients is reproducible. As the current analysis focused on side of colon, patients with a confirmed diagnosis of metastatic rectal cancer were excluded from the validation study.
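The seeded draw described above can be illustrated with a minimal sketch. The source does not state which random number generator or seed was used, so the function name, the seed value, and the patient identifiers below are all hypothetical.

```python
import random


def draw_validation_sample(patient_ids, n=200, seed=20190101):
    """Return a reproducible simple random sample of patient IDs.

    The seed value is an arbitrary placeholder, not the one used in the study.
    """
    rng = random.Random(seed)        # dedicated generator; global random state is untouched
    ordered = sorted(patient_ids)    # fix input order so the seed alone determines the draw
    return rng.sample(ordered, n)    # sampling without replacement


# Hypothetical identifiers standing in for the 9403 eligible patients.
cohort = [f"P{i:05d}" for i in range(9403)]
sample = draw_validation_sample(cohort)
```

Sorting the identifiers before sampling means that the same seed reproduces the same list of patients even if the cohort is extracted in a different order.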

Identification of tumor location
ICD codes were compared with location identified through human abstraction of unstructured data to establish the quality of ICD-defined tumor location. For both ICD-defined and abstracted tumor location variables, tumors were classified as left side (splenic flexure, descending colon, sigmoid colon, rectosigmoid junction), right side (cecum, ascending colon, hepatic flexure), or transverse (transverse colon).
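As an illustration only, the three-way grouping above can be written as a lookup table. The site names come from the text; the normalization step and the "unspecified" fallback label are assumptions made for this sketch, not part of the published abstraction protocol.

```python
# Anatomic sites grouped by side of colon, as listed in the text.
SITE_TO_SIDE = {
    "splenic flexure": "left",
    "descending colon": "left",
    "sigmoid colon": "left",
    "rectosigmoid junction": "left",
    "cecum": "right",
    "ascending colon": "right",
    "hepatic flexure": "right",
    "transverse colon": "transverse",
}


def side_of_colon(site):
    """Map an anatomic site name to left, right, or transverse.

    Names are lower-cased and whitespace-normalized before lookup; any site
    not in the table is labeled "unspecified" (an assumed fallback).
    """
    key = " ".join(site.lower().split())
    return SITE_TO_SIDE.get(key, "unspecified")
```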

Identification of tumor location based upon structured data
Data captured in the Flatiron Health EHR-derived database include ICD, 9th and 10th revisions (ICD9 and ICD10; see Table 5 in Appendix) for diagnoses [18]. Whereas some codes can differentiate CRC tumor origin (i.e. ICD9 153.1/ICD10 C18.4: Malignant neoplasm of transverse colon, ICD9 153.7/ICD10 C18.5: Malignant neoplasm of splenic flexure), there is also an unspecified code (ICD9 153.9/ICD10 C18.9: Malignant neoplasm of the colon, unspecified site) that can be used by physicians.
ICD9/10 codes were available from the diagnosis table in the EHR database and were used to classify patients. The full list of codes and categories used is listed in Table 5 in Appendix: A. The date of the ICD code closest to the initial diagnosis date was used to assign side of colon with the following considerations: if a patient had multiple ICD codes that indicated different sides on the same date, and if this date was closest to the diagnosis date, the patient was categorized as having CRC in multiple sites of the colon. If one of the codes was an unspecified code, it was dropped and the specific code was used to classify the patient (e.g. "Left colon, Unspecified colon" became "Left colon"). For patients with no abstracted initial diagnosis date, the first relevant ICD code was selected.
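The assignment rules in this paragraph can be sketched as follows. This is an interpretation, not the study's code: it assumes each ICD record has already been mapped to a side label, that "first relevant ICD code" means the earliest-dated one, and that ties between two equally distant dates go to the earlier date (the source does not specify tie handling). The function name and the "multiple" label are hypothetical.

```python
from datetime import date


def assign_side(records, diagnosis_date=None):
    """Assign side of colon from (code_date, side) pairs for one patient.

    side is one of "left", "right", "transverse", or "unspecified".
    """
    if not records:
        return "unspecified"
    dates = sorted({d for d, _ in records})  # ascending, so ties resolve to the earlier date
    if diagnosis_date is None:
        target = dates[0]  # no abstracted diagnosis date: use the first code
    else:
        target = min(dates, key=lambda d: abs((d - diagnosis_date).days))
    sides = {s for d, s in records if d == target}
    specific = sides - {"unspecified"}  # an unspecified code is dropped if a specific one exists
    if not specific:
        return "unspecified"
    if len(specific) > 1:
        return "multiple"  # different sides coded on the same closest date
    return specific.pop()


records = [
    (date(2014, 1, 10), "unspecified"),
    (date(2014, 1, 10), "left"),
    (date(2015, 1, 1), "right"),
]
result = assign_side(records, diagnosis_date=date(2014, 1, 5))  # "left"
```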

Identification of tumor location based on chart abstraction
In order to establish the quality of ICD-defined tumor location, ICD codes were compared with location identified through human abstraction of unstructured data. Centrally trained abstractors reviewed all relevant unstructured documents included in the patients' EHR, including pathology reports, physician notes, and surgical notes to identify evidence of the side of colon. To classify a patient, abstractors looked for terms such as "left colon" or "right colon," as well as the specific sites within the colon, as described in Table 5 in Appendix: A.

Statistical methods
Patient characteristics were summarized using counts and percentages for categorical variables, and medians and interquartile ranges for continuous variables, for the full mCRC dataset (9403 patients) and the 200 randomly selected participants in our validation study. Concordance between structured ICD codes and abstracted diagnosis was determined via observed percent agreement and Cohen's kappa coefficient (κ). The concordance analysis assumed no gold standard. Accuracy of ICD codes was determined by calculating the sensitivity, specificity, positive and negative predictive values, and corresponding 95% confidence intervals, using the abstracted data as the gold standard. "Unspecified colon side" in the unstructured data was treated as "No" for all of these analyses.
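For readers who wish to reproduce this type of analysis, the sketch below computes observed agreement, Cohen's kappa, and one-versus-rest accuracy measures for a single tumor site. The source does not state which confidence interval method was used; the Wilson score interval here is an assumption, as are the function names and the toy data. Because each site is compared against all other labels, an "unspecified" label in either source counts as "No" for that site.

```python
from collections import Counter
from math import sqrt


def cohens_kappa(a, b):
    """Return (observed agreement, Cohen's kappa) for two label sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in ca) / n**2  # chance agreement from the marginals
    return observed, (observed - expected) / (1 - expected)


def wilson_ci(successes, total, z=1.96):
    """Wilson score interval for a proportion (assumed method, approximately 95%)."""
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


def accuracy_for_site(icd, abstracted, site):
    """One-vs-rest accuracy of ICD labels against abstracted labels for one site.

    Assumes every denominator is non-zero.
    """
    pairs = list(zip(icd, abstracted))
    tp = sum(i == site and a == site for i, a in pairs)
    fp = sum(i == site and a != site for i, a in pairs)
    fn = sum(i != site and a == site for i, a in pairs)
    tn = sum(i != site and a != site for i, a in pairs)
    counts = {
        "sensitivity": (tp, tp + fn),
        "specificity": (tn, tn + fp),
        "ppv": (tp, tp + fp),
        "npv": (tn, tn + fn),
    }
    return {name: (k / n, wilson_ci(k, n)) for name, (k, n) in counts.items()}


# Toy labels for six hypothetical patients; not study data.
icd = ["left", "left", "right", "unspecified", "transverse", "right"]
abstracted = ["left", "right", "right", "left", "transverse", "right"]
agreement, kappa = cohens_kappa(icd, abstracted)
left_metrics = accuracy_for_site(icd, abstracted, "left")
```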

Results
Baseline characteristics for patients in this study (N = 200) were similar to patients in the full mCRC dataset for all variables examined (Table 1). Half of the validation study patients were male (50%), and more than half were aged 65 and older (59%), and had stage IV mCRC at initial diagnosis (54%). An additional 28% had stage III CRC at initial diagnosis. Site-specific ICD codes were available for 5940 (63%) patients in the parent cohort (Table 2).
When patients with unspecified ICD codes were excluded from the analysis, the distribution of side of colon using (Table 1 and Table 3). Approximately 4% (n = 8) of patients were considered to have rectal cancer based on ICD codes; however, through chart abstraction these patients had a confirmed diagnosis of colon cancer. Thus, this discrepancy represents misclassification of these patients based on ICD codes alone. When all 200 study patients were considered, concordance was moderate between the structured (ICD) data and the unstructured (abstracted) data, with an observed agreement of 0.58 (κ = 0.41). When patients who were classified as unspecified or rectal in the structured data were removed, the observed agreement was 0.84 (κ = 0.79). Seventy-six (38%) patients were classified as "unspecified" using ICD codes, and 63 of these (83%) had the side identified through abstraction. As shown in Table 4, specificity of structured data for tumor location was high, ranging from 92 to 98%. Sensitivity, negative predictive value, and positive predictive value were lower, ranging from 49-63%, 72-97%, and 64-92%, respectively. When patients with non-specific side of colon ICD codes were removed, sensitivity improved to ~80% for all tumor locations. Similar estimates were observed when stratified by stage at initial diagnosis (Stage I-III vs. Stage IV) (Additional file 1: Tables S1-S4).
To identify potential biases in whether a side-specific ICD code was present, we compared the clinical characteristics of patients who had specific diagnosis codes with those who did not. There were no differences in age, stage, sex, or treatment distributions between these two cohorts (Table 2). A gradual increase in the use of specific ICD codes was observed over time, with 57% of patients diagnosed in 2011 having a specific ICD code, increasing to 74% of patients diagnosed in 2016. A higher proportion of non-specific ICD codes was seen in academic centers than in community centers; however, the number of academic sites was small compared with community centers.

Discussion
This study demonstrates that billing codes are a highly reliable indicator of tumor location, when the specific location code is entered in the EHR. For a sizable minority of mCRC patients, non-specific colon cancer ICD codes are captured in the EHR; thus, structured data for these patients do not indicate tumor side of colon. In these cases, chart abstraction can increase the completeness. If studies are restricted to patients with specific ICD codes, there would likely be minimal bias introduced as the patients with and without specific ICD codes were similar with respect to demographic and clinical characteristics.
A few limitations for this study exist. Although chart abstraction was considered the gold standard, it is subject to errors introduced by abstractors potentially mis-reporting information or by inaccurate information being recorded in the unstructured parts of the EHR. However, chart abstraction is the accepted gold standard for validation studies from administrative claims and other databases, such as EHRs. Additionally, billing codes are collected for the purposes of reimbursement, not for research. Thus, a bias may exist if there are reimbursement incentives based on charges for the treatment based on tumor site. Furthermore, there may be variation in how billing codes are assigned and recorded at the centers in the Flatiron network; however, we did not observe any systematic differences based on centers, with the exception of a higher proportion of patients without specific codes being treated at academic centers. Further studies are needed to validate whether these results are representative of a wider range of data sources, including sources from outside of the US where billing coding practices may differ.
Our analysis demonstrates that ICD codes adequately characterize side of colon for use in studying outcomes for left-versus right-sided colon tumors following specific therapies. However, certain other research questions, e.g. characterizing very small populations such as BRAF-mutant mCRC patients by variables including primary tumor site, may require a side of colon variable with greater

Conclusions
Overall, these analyses demonstrate the rigor necessary to characterize an EHR-based variable in terms of reliability and completeness, before engaging in formal testing of clinical hypotheses that could be practice-changing. Such methodological assessments are necessary before conducting large-scale research using variables generated from EHRs.

Additional file
Additional file 1: Table S1. Distribution of side identified by ICD code or abstraction for patients with Stage IV disease at diagnosis. Table S2. Distribution of side identified by ICD code or abstraction for patients with Stage I-III disease at diagnosis. Table S3. Accuracy of ICD codes for patients with Stage IV disease at diagnosis.