Integrating historical clinical and financial data for pharmacological research
© BioMed Central Ltd 2011
Received: 17 March 2011
Accepted: 18 November 2011
Published: 18 November 2011
Retrospective research requires longitudinal data, and repositories derived from electronic health records (EHR) can be sources of such data. With Health Information Technology for Economic and Clinical Health (HITECH) Act meaningful use provisions, many institutions are expected to adopt EHRs, but may be left with large amounts of financial and historical clinical data, which can differ significantly from data obtained from newer systems, due to lack or inconsistent use of controlled medical terminologies (CMT) in older systems. We examined different approaches for semantic enrichment of financial data with CMT, and integration of clinical data from disparate historical and current sources for research.
Snapshots of financial data from 1999, 2004 and 2009 were mapped automatically to the current inpatient pharmacy catalog, and enriched with RxNorm. Administrative metadata from financial and dispensing systems, RxNorm and two commercial pharmacy vocabularies were used to integrate data from current and historical inpatient pharmacy modules, and the outpatient EHR. Data integration approaches were compared using percentages of automated matches, and effects on cohort size of a retrospective study.
During 1999-2009, 71.52%-90.08% of items in use from the financial catalog were enriched using RxNorm; 64.95%-70.37% of items in use from the historical inpatient system were integrated using RxNorm, 85.96%-91.67% using a commercial vocabulary, 87.19%-94.23% using financial metadata, and 77.20%-94.68% using dispensing metadata. During 1999-2009, 48.01%-30.72% of items in use from the outpatient catalog were integrated using RxNorm, and 79.27%-48.60% using a commercial vocabulary. In a cohort of 16304 inpatients obtained from clinical systems, 4172 (25.58%) were found exclusively through integration of historical clinical data, while 15978 (98%) could be identified using semantically enriched financial data.
Data integration approaches using metadata from financial/dispensing systems and those using pharmacy vocabularies performed comparably. Given the current state of EHR adoption, semantic enrichment of financial data and integration of historical clinical data would allow the repurposing of these data for research. With the push for HITECH meaningful use, institutions that are transitioning to newer EHRs will be able to use their older financial and clinical data for research using these methods.
Early Electronic Health Record (EHR) systems were electronic versions of traditional, paper-based medical records. The financial and administrative portions of the medical record were computerized first, and specialized hospital information systems gradually evolved into the first comprehensive, modern EHRs [2, 3]. Rapid developments in information technology and growth in computing power have led to the development of EHR systems that surpass their predecessors in functionality and complexity. These changes coincided with improvements in controlled medical terminologies (CMT), and in standards for data storage, representation and exchange. CMT in modern EHR systems allow for precise semantic definitions that were not possible in many historical systems. The adoption of EHR systems in an inpatient setting has been slower than expected [5, 6], whereas specialized hospital financial systems remain pervasive. With the meaningful use provisions of the Health Information Technology for Economic and Clinical Health (HITECH) Act of 2009, more institutions are expected to adopt modern EHR systems. In spite of these developments, many institutions will continue to have several years' worth of financial or historical clinical data, which can differ significantly from data obtained from modern EHR systems.
There is a growing interest in the secondary use and sharing of EHR data for research; several frameworks are available for this purpose, and CMT are key enablers [9–11]. A majority of EHR systems use commercial pharmacy vocabularies, which have many, but not all, desirable characteristics of a CMT, although most of them map to RxNorm. RxNorm is a CMT of drugs and devices, one of the recommended national standards for pharmacy data, and contains a semantic network of concepts and relationships. Semantic enrichment approaches--in which semantic contextual information is added to data or metadata--have been applied successfully in information retrieval for enhancement of textual documents, in clinical decision support systems like InfoButtons, in natural language processing for annotating unstructured notes, in the development of medical ontologies, etc. [14–17]. Mapping local terminologies to CMT like RxNorm can allow semantic enrichment by the extension of semantic attributes to these terminologies. The integration of pharmacy data from different EHR systems [18–20], as well as enrichment of financial data with CMT, can allow for semantic normalization and consistent use of such data in research, and provide long-term value to institutions that have large repositories of such data.
Financial and billing data can be represented by vocabularies like the International Classification of Diseases Ninth Revision, Clinical Modifications (ICD9-CM) for diagnoses, Current Procedural Terminology (CPT) for procedures, diagnosis related groups (DRG), etc.; however, each of these coding systems by itself lacks the level of detail required for clinical care and research. In the pharmacy domain, the Healthcare Common Procedure Coding System (HCPCS) is also used for coding certain medication data in the financial system, although HCPCS codes may not be available for all medications, and are not granular enough to distinguish between different doses, routes and forms of a given drug, making them also unsuitable for clinical care and research. In spite of these limitations, data from financial systems have been used for epidemiological research, particularly for selection of patient cohorts based on demographics, diagnoses and procedures. In a previous study, medications were the fourth most common type of inclusion criteria among data requested by researchers. Semantic enrichment of financial medication data would allow their use in cohort selection, in addition to demographics, diagnoses and procedures, which can already be obtained from financial systems. However, unlike diagnoses and procedures, where ICD9-CM and CPT have been respectively used in coding for several years, RxNorm--the US national standard for medications--has not yet been as widely adopted in EHR systems. In order to study the feasibility of using historical clinical and semantically enriched financial medication data for research, it is necessary to develop strategies for data integration.
In the present investigation, we develop strategies for semantic enrichment of financial data with RxNorm, and compare two different automated approaches for the integration of medication data from disparate clinical systems. Under the first approach for data integration, we use administrative metadata [26, 27] from the financial system and the medication dispensing system, to create automated crosswalks between systems. Under the second approach, we use RxNorm and two different commercial pharmacy vocabularies for data integration. We also compare the sensitivity of data integration approaches by using the percentage of matches to the current inpatient system as a metric. The purpose of developing automated matching methods is to create an initial population of vocabulary matches across different systems, which can be reviewed by a terminology expert. We begin with a description of various systems, and how clinical and financial data related to medications are captured electronically at our institution. We then describe the relevant metadata and vocabularies available in each system, and methods for mapping between various systems using metadata and vocabularies. Finally, we evaluate the effect of enriched financial data on the cohort size in an IRB-approved clinical study, and discuss the challenges faced during enrichment and integration of historical data in the context of modern EHRs.
Sources of pharmacy data [table; layout not recoverable]
Periods listed: 1993 to 2010; 2010 to date; 1993 to 1999; 1999 to 2003; 2003 to date; 1995 to date
Entries listed: Epic for Business; Cerner Millennium Inpatient; Automated pharmacy charges; Manual pharmacy charges; Pharmacy data in the EDW; Financial (charge codes); Dispensing (dispense codes)
In the heterogeneous EHR infrastructure consisting primarily of commercially developed systems, metadata and vocabulary were managed under a decentralized model, and individual systems were identified as the sources of truth for specific attributes as part of data scrubbing. The EDW served as the source for extracting reference data from individual systems as well as commercial vocabularies and CMT (RxNorm). The current inpatient EHR as well as the historical PM had supported Cerner Multum (Cerner Corporation, Kansas City, MO), although the last snapshot from the historical PM before it was decommissioned did not include these codes, and they were unavailable during the integration process. The current outpatient EHR supported Wolters Kluwer MediSpan (Wolters Kluwer Health, Indianapolis, IN), which was not used in any of the other systems. Orderables in each of the three clinical systems also contained National Drug Code (NDC) as one of the attributes.
Inclusion criteria defined in an existing IRB approved study were used to evaluate the effect of the semantic enrichment process on cohort size for secondary use of EHR data based on the different methods for integration. The inclusion criteria consisted of patients who were prescribed warfarin, had a measurement of their International Normalized Ratio (INR) during the same visit as the medication order, and whose data were available in the EDW. The main outcome measure was cohort size obtained by using one or more integration and enrichment methods. For patients seen in an inpatient setting, warfarin is typically initiated while the patients are at the hospital, with subsequent follow-up being performed at a dedicated anticoagulation clinic in an outpatient setting, leading to their medication data being stored in two different EHR systems at our institution. The EDW contained records from different systems implemented and integrated during different time periods (Table 1), and the study was chosen among several others, since the medication part of the inclusion criteria spanned different systems, and because large sample sizes could be obtained due to warfarin being a highly prescribed drug.
Integration of financial data [table; cell values not recoverable]
Column headers, in order: Financial Catalog (complete reference); Financial Catalog (codes used); (N = 4532); (N = 5482); (N = 5969); (N = 2443); (N = 2638); (N = 3226)
Row labels, in order: Match to charge codes in current inpatient EHR (rate of mismatch); Match to Clinical Systems; Clinical codes in current inpatient EHR by dispense codes; Clinical codes in historical inpatient PM by dispense codes; Clinical codes in outpatient EHR using primary NDC; Clinical codes in outpatient EHR using related NDCs from Multum; Multum drug codes; MediSpan codes using primary NDCs from inpatient EHR; MediSpan codes using related NDCs from Multum; RxNorm (CD) codes using primary NDCs from Inpatient EHR; RxNorm (CD) codes using related NDCs from Multum; RxNorm (CD) codes using Multum MMDCs; RxNorm (GN) codes using Multum drug codes
Integration of clinical data [table; cell values not recoverable]
Column headers, in order: Medication Catalog (codes used); Medication Catalog (reference)
Row labels, in order: Historical Inpatient System; Primary NDC (Multum); Related NDCs (Multum); Primary NDC (MediSpan); Related NDCs (MediSpan); Primary NDC (RxNorm); Related NDCs (RxNorm); Related NDCs (Multum); Related NDCs (MediSpan); Primary NDC (RxNorm); Related NDCs (RxNorm)
The semantic enrichment process for financial data leveraged common metadata attributes between the financial system and the current inpatient EHR, and used commercial vocabularies available in the EHR as links to RxNorm. The process of enrichment with CMT was limited by the availability of common metadata attributes within the current inpatient EHR, the consistent use of financial codes in the EHR, and also by links between the commercial vocabularies used in the EHR and CMT. Matches between the financial system and the EHR were significantly different for snapshots from the different years, with a 51.04% match in 1999, compared to a 62.42% match in 2009, even when the financial catalog itself was larger in 2009 than in 1999. Matching was also performed on a subset of financial codes that had actually been used for billing in the same years as the snapshots of the reference catalog, and these were substantially better, ranging from a 73.41% match in 1999 to 94.33% match in 2009. Although 94.33% of codes matched with items in the EHR in 2009, only 90.08% of the codes used in 2009 could be enhanced with CMT. In the inpatient pharmacy, certain items are mixed and dispensed or compounded in the hospital pharmacy, and such items may not always have single NDCs or MMDCs, since there could potentially be more than one drug involved. In such cases, reliable matches to CMT could not be obtained, and consequently, these were lower than the initial matches to the inpatient EHR.
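The gap between the match rate and the enrichment rate described above can be illustrated with a minimal sketch. It assumes the enrichment chain is a pair of lookups (charge code to MMDC via the inpatient EHR catalog, then MMDC to an RxNorm identifier); the dictionaries, the codes in them, and the function names are all hypothetical and are not taken from the study.

```python
from typing import Iterable, Optional, Tuple

# Hypothetical reference data; every code below is made up for illustration.
# Charge code -> MMDC as copied into the inpatient EHR catalog.
# None stands for an item (e.g. a compounded product) with no single MMDC.
EHR_CHARGE_TO_MMDC = {"CHG001": "M100", "CHG002": "M200", "CHG003": None}
# MMDC -> RxNorm concept identifier (placeholder values, not real RxCUIs).
MMDC_TO_RXNORM = {"M100": "RX-A", "M200": "RX-B"}


def enrich(charge_code: str) -> Optional[str]:
    """Return an RxNorm identifier for a charge code, or None if no reliable link exists."""
    mmdc = EHR_CHARGE_TO_MMDC.get(charge_code)
    if mmdc is None:
        return None
    return MMDC_TO_RXNORM.get(mmdc)


def rates(charge_codes: Iterable[str]) -> Tuple[float, float]:
    """Return (fraction matched to the EHR, fraction enriched with RxNorm)."""
    codes = list(charge_codes)
    matched = [c for c in codes if c in EHR_CHARGE_TO_MMDC]
    enriched = [c for c in matched if enrich(c) is not None]
    return len(matched) / len(codes), len(enriched) / len(codes)


if __name__ == "__main__":
    match_rate, enrich_rate = rates(["CHG001", "CHG002", "CHG003", "CHG004"])
    print(f"matched: {match_rate:.2%}, enriched: {enrich_rate:.2%}")
```

In this toy catalog, CHG003 matches an EHR item but cannot be enriched, which mirrors why the enrichment percentage can fall below the match percentage.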
The financial system is regarded as the source of truth for all financial data, and similarly, the clinical system is regarded as the source of truth for all clinical procedures, diagnoses, medication orders, etc. Automated charge capture processes (Figure 1) required the inclusion of financial charge codes in the current inpatient EHR as well as the historical PM, and this task was performed by expert pharmacists who maintained these systems over the years. The process of enrichment of financial data relied on the assumption that the copy of financial reference data that was maintained in the clinical system was a faithful representation of the original, i.e. the financial charge codes were appropriately assigned to the correct drug in the EHR. This assumption was tested during the matching process, so that a single charge code which matched with more than one MMDC (different drug, dose, route, form) was declared a mismatch. A mismatch rate of 0.69% (single charge code matched with multiple MMDCs) was found in the 2009 reference snapshot, while the rate was as high as 1.15% among financial codes that were used in 2009. While matching the entire financial catalog would have been ideal, the financial catalog itself was larger than the clinical catalog, since it contained multiple codes for the same drug, in order to facilitate processes like distribution of revenue among hospital service lines. In the absence of mappings to a CMT containing formal concept definitions, existing items in the financial catalog were likely duplicated when new ones were added, which contributed to the size, and introduced inconsistencies, which were identified in the form of mismatches. Ultimately, any strategy for enriching financial data would have to balance carefully the sensitivity afforded by the implicit level of trust between copies of reference metadata in different systems, and the specificity afforded by defining rules for estimating the quality of the reference information.
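The mismatch rule described above (a single charge code linked to more than one MMDC) could be checked roughly as follows. This is a sketch under assumptions: the input is taken to be a list of (charge code, MMDC) pairs, the rate is computed over distinct charge codes, and the function name and sample codes are hypothetical.

```python
from collections import defaultdict
from typing import Iterable, Set, Tuple


def find_mismatches(pairs: Iterable[Tuple[str, str]]) -> Tuple[Set[str], float]:
    """Flag charge codes that map to more than one distinct MMDC.

    Returns the set of mismatched charge codes and the mismatch rate,
    computed here over distinct charge codes (an assumption).
    """
    mmdcs_by_charge = defaultdict(set)
    for charge_code, mmdc in pairs:
        mmdcs_by_charge[charge_code].add(mmdc)
    mismatched = {code for code, mmdcs in mmdcs_by_charge.items() if len(mmdcs) > 1}
    rate = len(mismatched) / len(mmdcs_by_charge) if mmdcs_by_charge else 0.0
    return mismatched, rate


if __name__ == "__main__":
    # Made-up pairs: C2 is attached to two different MMDCs, so it is a mismatch.
    sample = [("C1", "M1"), ("C1", "M1"), ("C2", "M2"), ("C2", "M3"), ("C3", "M4"), ("C4", "M5")]
    codes, rate = find_mismatches(sample)
    print(sorted(codes), f"{rate:.2%}")
```

Repeated identical pairs are collapsed by the set, so only genuinely conflicting assignments are flagged for expert review.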
Each of the two other EHR systems considered in the investigation had different metadata and vocabulary elements in common with the current inpatient EHR (Table 1 and Figure 2). The historical PM had supported Multum, and the best way to integrate it with data from the current EHR would have been using Multum itself; however, Multum codes were not included in the final reference snapshot that was taken before the historical system was decommissioned, and these codes were not present in the archived HL7 messages sent to the dispensing system, which contained only the dispense codes. In addition, the purpose of this investigation was to develop more generic methods for integration of historical data, so other metadata and vocabulary elements (e.g. charge code, dispense code, NDCs) were preferred. Although NDCs existed in the historical PM, many of them did not have direct matches with NDCs in the current inpatient system, possibly due to certain codes becoming obsolete. Even after normalizing the historical NDC format to 11 digits, direct matches ranged only from 44.85% in the reference catalog to 46.35% in the 2009 snapshot of codes in use. Multum, MediSpan and RxNorm were also used to find related NDCs for the primary NDCs from the historical system in order to improve matching. Not surprisingly, matching on related NDCs using Multum produced the best results--with an 82.34% match on the reference catalog and a 91.67% match in the 2009 snapshot--since the historical system had used Multum at one point, and many of these NDCs would have existed in older versions of Multum. Relying on metadata attributes like charge and dispense codes from the historical system produced matches as high as 84.44% and 68.93% respectively for the reference snapshot, and 94.23% and 94.68% respectively for codes used in 2009.
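The NDC normalization and related-NDC matching steps can be sketched as below. The sketch assumes hyphenated input in one of the standard 10-digit layouts (4-4-2, 5-3-2, 5-4-1), which are padded to the 11-digit 5-4-2 form; the study does not state the input format of its historical NDCs, and the function names, sample NDCs and the `related` lookup are hypothetical.

```python
from typing import Dict, Set

VALID_LAYOUTS = {(4, 4, 2), (5, 3, 2), (5, 4, 1), (5, 4, 2)}


def normalize_ndc(ndc: str) -> str:
    """Convert a hyphenated NDC to the 11-digit 5-4-2 form by zero-padding each segment."""
    parts = ndc.strip().split("-")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected three numeric segments: {ndc!r}")
    if tuple(len(p) for p in parts) not in VALID_LAYOUTS:
        raise ValueError(f"unrecognized NDC layout: {ndc!r}")
    labeler, product, package = parts
    return labeler.zfill(5) + product.zfill(4) + package.zfill(2)


def match_items(historical: Dict[str, str], current_ndcs: Set[str],
                related: Dict[str, Set[str]]) -> Set[str]:
    """Return historical item ids whose primary or related NDC exists in the current catalog.

    historical maps item id -> normalized primary NDC; related maps an NDC to
    the set of NDCs a vocabulary lists as related (pass {} for direct matching only).
    """
    matched = set()
    for item_id, primary in historical.items():
        candidates = {primary} | related.get(primary, set())
        if candidates & current_ndcs:
            matched.add(item_id)
    return matched


if __name__ == "__main__":
    # All NDCs below are made up for illustration.
    historical = {"H1": normalize_ndc("1234-5678-90"), "H2": "11111222233"}
    current = {"01234567890", "99999888877"}
    related = {"11111222233": {"99999888877"}}
    print(sorted(match_items(historical, current, {})))
    print(sorted(match_items(historical, current, related)))
```

Passing an empty `related` mapping gives direct matching on primary NDCs only, which makes it easy to compare the two strategies on the same catalog.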
The link by dispense codes was noteworthy because these codes uniquely identified the items between the historical system and the dispensing cabinets, and even at 68.93%, they represented 2622 matching items, which was higher than the number obtained by any other method, including matches by charge code (2442), related NDCs using Multum (2574), RxNorm (2024) or MediSpan (1835) as a reference. Since the current EHR as well as historical PM communicated with the dispensing system using HL7 messages that contained dispense codes, the dispensing systems served as the source of truth for dispense codes, whereas both the current and historical inpatient systems contained copies of these codes to facilitate HL7 messaging between systems, in a manner similar to having copies of charge codes from the financial system. Charge codes were also used similarly to create matches between the different systems, and produced matches as high as 94% between the two different inpatient systems. Ideally, matches between different systems should be performed using concepts from a CMT or pharmacy vocabularies. The high percentages of matches obtained using metadata-based methods suggest that if pharmacy vocabularies or CMT were not available in the different systems, then metadata references to a common, external system could be used as a substitute to perform initial matches between different systems, which can then be reviewed by experts.
In both instances of integration described above, it was observed that integration using administrative metadata such as charge and dispense codes was either better than or as good as integration using CMT. These findings can be explained by differences in how the different systems handle each of these data elements and vocabularies. Centralized metadata and vocabulary management solutions have been proposed as solutions for enterprise-wide knowledge management, and several successful implementations exist across the nation. Unlike a centralized metadata management solution which can serve as the single source of truth for all metadata attributes referenced by other systems, metadata elements in the present investigation were obtained from different systems. Under this decentralized model, the financial system served as the source of truth for charge codes, the automated dispensing system served as the source of truth for dispense codes, the current inpatient EHR system served as the source of truth for current inpatient formulary items, and so on. No single system served as a centralized metadata store, although the current inpatient system contained the highest number of metadata and vocabulary elements among the systems considered in this investigation.
Exchange of data between these systems (Figure 1) using messaging standards like HL7 relies on coded data elements such as charge codes, dispense codes and catalog codes used to identify different medication items. Within each system which serves as the source of truth for that particular data element (Figure 2), the discrete unit is an entity, and each entity can have attributes in the form of reference to entities in other systems. For example, a charge code is an entity within the financial system, but an attribute of an entity item within the current inpatient EHR and historical inpatient PM, so that those systems can post transactions in the financial system by using the charge code as a token. Similarly, a dispense code is an entity within the dispensing system, but is an attribute of an entity item in the current inpatient EHR and historical inpatient PM. Even with changes in the EHRs, attributes with references to foreign systems changed, but the entities themselves remained intact in other systems. Consequently, data from EHRs, which referred to common entities in other systems, could be easily integrated using such metadata. Although various components of the financial and clinical infrastructure were replaced over a period of time (Table 1), at least one or more of the systems which served as sources of truth had continued to be in use, so that the changes themselves were staggered across a period of time. In addition, both data and metadata from each of these systems were available in the EDW, which served as a historical, longitudinal reference, as well as the venue for data integration. In order to successfully use metadata from different systems in a heterogeneous environment for data integration, systems that serve as sources of truth would have to meet the above criteria for persistence, and historical data and metadata, such as those typically stored in an EDW would have to be readily available.
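The entity/attribute relationship described above lends itself to a simple crosswalk: items in two catalogs are linked whenever they carry the same reference to an entity in a common external system. The following is a minimal sketch under that assumption; `CatalogItem`, `crosswalk`, `match_percentage` and all codes are hypothetical names and values, not the study's implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class CatalogItem:
    """A medication item in one clinical system, holding references to external entities."""
    item_id: str
    charge_code: Optional[str] = None    # entity owned by the financial system
    dispense_code: Optional[str] = None  # entity owned by the dispensing system


def crosswalk(historical: List[CatalogItem], current: List[CatalogItem],
              attribute: str) -> Dict[str, List[str]]:
    """Map each historical item id to current item ids sharing the given attribute value."""
    index: Dict[str, List[str]] = {}
    for item in current:
        key = getattr(item, attribute)
        if key is not None:
            index.setdefault(key, []).append(item.item_id)
    matches: Dict[str, List[str]] = {}
    for item in historical:
        key = getattr(item, attribute)
        if key is not None and key in index:
            matches[item.item_id] = index[key]
    return matches


def match_percentage(historical: List[CatalogItem], matches: Dict[str, List[str]]) -> float:
    """Percentage of historical items with at least one automated match."""
    return 100.0 * len(matches) / len(historical) if historical else 0.0


if __name__ == "__main__":
    # Made-up catalogs for illustration.
    old = [CatalogItem("H1", "C1", "D1"), CatalogItem("H2", "C2", "D2"),
           CatalogItem("H3", None, "D3"), CatalogItem("H4", "C9", None)]
    new = [CatalogItem("E1", "C1", "D1"), CatalogItem("E2", "C2", "D7"),
           CatalogItem("E3", "C5", "D3")]
    for attr in ("charge_code", "dispense_code"):
        found = crosswalk(old, new, attr)
        print(attr, found, f"{match_percentage(old, found):.1f}%")
```

The output of such a crosswalk is only an initial population of candidate matches; as the text notes, these would still be reviewed by a terminology expert.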
Retrospective research using archived data requires quality data, as well as a sound understanding of how those data were captured in the EHR. The selection of patient cohorts for retrospective research, or even the enrollment of new patients in prospective studies would require reliable identification based on the inclusion or recruitment criteria. While it may be possible to search for cohorts from financial data based on descriptions of various line items in the patient's bills, financial descriptions are often based on institutional needs, and in order to enable sharing and consistent reporting of these data, it would become necessary to adopt common vocabularies. Enriching financial data with CMTs like RxNorm can allow for consistent, normalized and semantically accurate description of financial data, which can also be shared outside a given institution. Inclusion criteria from an IRB approved study, consisting of patients who were prescribed warfarin and had INR measurements performed during the same visit were used to obtain estimates of cohort sizes (Figure 6 and Additional File 1). Cohorts obtained from the financial system were slightly smaller than those obtained through clinical systems, due to the matching methods used, although it was noteworthy that 98% of the patients in the cohort identified from the clinical systems could also be identified using financial data. Since the financial system supported automated charge capture as well as manual charge entry processes from several source clinical systems (Figure 1), it would contain charges for items from multiple systems, which may or may not have corresponding matches to specific items in the inpatient EHR systems. Consequently, one might expect to find more patients in the financial system, than in the clinical system, although this was not observed in our investigation. 
In the absence of consistently used charge codes with meaningful descriptors in the financial system, manually billed items could have been assigned different codes than would otherwise be assigned in an automated charge capture process from the inpatient EHR. The financial system also contained charge codes for miscellaneous pharmacy items, which can potentially create the same types of problems as not elsewhere classified (NEC) or not otherwise specified (NOS) entries in vocabularies which are not true CMTs. Regardless of the source used to obtain the research cohort, a more in-depth manual chart review would need to be performed on cohorts obtained using automated methods.
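The cohort comparison discussed above amounts to set arithmetic over patient identifiers. A minimal sketch follows, assuming each source yields a set of identifiers for patients meeting the inclusion criteria; the function name, the dictionary keys and the toy identifiers are hypothetical, and the numbers it prints are not the study's results.

```python
from typing import Dict, Set


def cohort_summary(current_clinical: Set[int], historical_clinical: Set[int],
                   financial: Set[int]) -> Dict[str, float]:
    """Summarize how clinical and financial sources contribute to a cohort."""
    clinical = current_clinical | historical_clinical
    # Patients who would be missed without integrating the historical system.
    historical_only = historical_clinical - current_clinical
    # Clinical-cohort patients who are also identifiable from financial data.
    in_financial = clinical & financial
    total = len(clinical)
    return {
        "clinical_total": total,
        "historical_only": len(historical_only),
        "pct_historical_only": 100.0 * len(historical_only) / total if total else 0.0,
        "pct_found_in_financial": 100.0 * len(in_financial) / total if total else 0.0,
    }


if __name__ == "__main__":
    # Toy patient identifiers for illustration only.
    summary = cohort_summary({1, 2, 3}, {3, 4}, {1, 2, 4, 9})
    print(summary)
```

Patient 9 in the toy data appears only in the financial source, the situation the text anticipates for manually entered charges from other systems; the summary deliberately measures coverage of the clinical cohort rather than the size of the financial one.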
The transition to and implementation of EHR systems can often be a phased, multistep process, which can span several years. During the transition to EHRs, replacement of existing EHRs with newer systems, or with the gradual addition of functionality to an existing EHR, through functional integration with other systems, the type and richness of captured data can change or evolve. For institutions that have transitioned from older systems to modern EHRs, a large amount of data may still exist in older, semantically poorer formats, or such data may only be available from financial and billing systems, rather than clinical systems. Commercial EHRs are more pervasive than those developed in-house, and among commercial EHRs, it is rare to find support for third-party centralized metadata repositories or vocabulary services that can serve as sources of CMTs. Many commercial EHRs, instead, rely on their own solutions for managing metadata and vocabularies, which may be internally consistent within the systems, but difficult to integrate with analogous components of other, external systems. Given EHR implementations that employ a best of breed approach for different components such as pharmacy, labs, radiology, etc., such components may be derived from different commercial applications, with different approaches to the management of metadata and vocabularies, and centralized metadata and vocabulary services may not be feasible, thus creating data integration challenges that could potentially undermine some of the benefits of having EHRs.
Metadata-based methods developed in this investigation relied on the assumptions that certain systems were reliable, persistent sources of truth for specific metadata elements, that the copies of metadata elements in external systems were faithful representations of the original, and that these copies had been properly assigned as attributes of entities in those systems (Figure 2). Vocabulary-based methods developed in this investigation worked on similar assumptions that external vocabularies had been implemented and used properly in the different systems.
None of the methods under either approach assumed centralized metadata and vocabulary models, but rather relied on identifying and designating external, persistent stores of metadata and vocabularies as sources of truth. Inasmuch as EHR infrastructures in different settings are able to satisfy the above conditions, the above methods can be generalized to those settings. The ultimate implications of these findings in the context of the state of EHR implementations are that financial medication data can be repurposed for research through semantic enrichment techniques. In the absence of consistently used pharmacy vocabularies in historical/legacy data, automated metadata-based methods can be used for data integration, and a combination of both techniques would allow the creation of large, longitudinal datasets which can be used in research.
Due to differences in billing processes between the inpatient and outpatient EHRs, the enrichment of financial data, and its comparison to clinical data, was limited to data collected through the inpatient system. In addition, although the outpatient system supports CPOE, a dedicated outpatient pharmacy module has not been implemented at our institution; therefore, direct formulary to formulary comparisons between the inpatient and outpatient systems were not possible at the time of this investigation. Finally, our choice of using cohort selection in a retrospective study was motivated by the need to investigate differences in the approaches and to examine if enriched financial data may be suitable for this purpose, and before broadly applying these findings, an investigator may need to assess these processes using examples that resemble their potential use cases.
EHRs evolved gradually from more pervasive specialized hospital information systems over a period of time. With HITECH meaningful use requirements, as more hospitals adopt modern EHRs, financial and historical clinical data will remain abundant, and the consolidation, enrichment and integration of such data have the potential for creating large sets of data that can be used for selecting cohorts in retrospective studies, or potentially for recruiting patients for prospective studies. Commercial pharmacy vocabularies used in EHRs, which map to CMTs like RxNorm, compared favorably with RxNorm for integration of data between different clinical systems. Metadata-based methods for data integration performed as well as or at times better than vocabulary-based methods. Using metadata from different systems which served as sources of truth for those metadata elements in a decentralized manner was successful due to the staggered replacements of components in the EHR infrastructure, which allowed for persistence of metadata in different systems. In the absence of common vocabularies in different systems, metadata-based approaches could potentially be used for matching and data integration. For institutions that are transitioning to modern EHRs, the development of strategies for integration and enrichment--such as the ones described in this study--could allow repurposing of historical data for research.
A complete list of abbreviations used in this manuscript has been provided in Additional File 2.
This investigation was supported by the University of Utah Healthcare Information Technology Services (ITS), and the Department of Biomedical Informatics at the University of Utah. The authors are grateful to Linda Tyler, Pharm.D., Director of Pharmacy Services at the University of Utah Hospital for providing pharmacy resources, and to Cary Martin, MS and Ming Tu, MS from ITS at the University of Utah Hospital for their help in obtaining the financial and historical pharmacy data.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.