An administrative data merging solution for dealing with missing data in a clinical registry: adaptation from ICD-9 to ICD-10

Background We have previously described a method for dealing with missing data in a prospective cardiac registry initiative. The method involves merging registry data to corresponding ICD-9-CM administrative data to fill in missing data 'holes'. Here, we describe the process of translating our data merging solution to ICD-10, and then validating its performance. Methods A multi-step translation process was undertaken to produce an ICD-10 algorithm, and merging was then implemented to produce complete datasets for 1995–2001 based on the ICD-9-CM coding algorithm, and for 2002–2005 based on the ICD-10 algorithm. We used cardiac registry data for patients undergoing cardiac catheterization in fiscal years 1995–2005. The corresponding administrative data records were coded in ICD-9-CM for 1995–2001 and in ICD-10 for 2002–2005. The resulting datasets were then evaluated for their ability to predict death at one year. Results The prevalence of the individual clinical risk factors increased gradually across years. There was, however, no evidence of either an abrupt drop or rise in prevalence of any of the risk factors. The performance of the new data merging model was comparable to that of our previously reported methodology: c-statistic = 0.788 (95% CI 0.775, 0.802) for the ICD-10 model versus c-statistic = 0.784 (95% CI 0.780, 0.790) for the ICD-9-CM model. The two models also exhibited similar goodness-of-fit. Conclusion The ICD-10 implementation of our data merging method performs as well as the previously-validated ICD-9-CM method. Such methodological research is an essential prerequisite for research with administrative data now that most health systems are transitioning to ICD-10.


Background
We have previously developed and reported on a method for dealing with missing data in a prospective cardiac registry database [1]. The method involves linking the prospectively-derived cardiac registry data on a patient-bypatient basis to corresponding administrative data, followed by a process of mapping the specific clinical diagnoses present in both the registry data and administrative data to create a single 'final' record of baseline diagnoses present in a given patient. Advantages of the methodology that we developed and validated are that it is conceptually simple (relative, perhaps, to more complex multiple imputation procedures), it can be readily implemented in many jurisdictions, it can be applied to non-random missing data situations such as ours, and it produces a 'complete' dataset that can then be used in subsequent statistical procedures for which data completeness is crucial. Since its initial description in the literature, the method has gone on to be widely used, both for research conducted by our group working with data from the Alberta Provincial Project for Outcome Assessment in Coronary Heart Disease (APPROACH) registry [2][3][4], and by other groups conducting similar research in other settings [5,6].
Our previously-described method was derived using a coding algorithm based on the 9 th revision of the International Classification of Diseases, Clinical Modification (ICD-9-CM) for mapping clinical diagnoses in administrative data [1]. The coding algorithm adopted elements of the Deyo ICD-9-CM adaptation of the Charlson comorbidity index for administrative data. We now, however, face a new methodological challenge as much of the world has since introduced the 10 th revision of the International Classification of Diseases and Related Health Problems (ICD-10).
The Tenth Revision (ICD-10) differs from the Ninth Revision (ICD-9 and ICD-9-CM) in several ways, although the overall content is similar: First, ICD-10 has alpha-numeric categories rather than numeric categories. Second, some chapters have been rearranged, some titles have changed, and conditions have been regrouped. Third, ICD-10 has almost twice as many categories as ICD-9. Fourth, some fairly minor changes have been made in the coding rules for mortality [7]. ICD-10 classifies a broader collection of diseases, injuries and causes of death, as well as external causes of injury and poisoning. Unlike ICD-9, ICD-10 applies beyond acute hospital care, and includes a broader collection of conditions and situations that are not diseases, but rather risk factors to health, such as occupational and environmental factors, lifestyle and psychosocial circumstances. ICD-10 is also more adaptable than previous versions, allowing more readily for the addition of codes as new diseases are discovered [8]. As a result of these enhancements, ICD-10 represents the broadest scope of any previous ICD revision to date. ICD-10 was adopted in our province, Alberta, Canada, on April 1, 2002, and has been adopted by other Canadian provinces in recent years. Many other parts of the world such as Australia and much of continental Europe have also been using ICD-10 for the past several years. Because of this widespread ICD-10 uptake, there is now a pressing need to 'translate' existing ICD-9 methodological tools to corresponding tools for ICD-10. The process of translating ICD-9-CM algorithms into ICD-10 is far from straight forward, as many codes are not directly convertible from one version to the other. Furthermore, it is insufficient to use existing automated cross-walk algorithms, because many existing ICD-9 coding sequences have more than one corresponding coding section in ICD-10. Recently, Quan et al. published ICD-10 coding algorithms to define both Charlson and Elixhauser comorbidities [9], and the methodological process outlined in that work demonstrates the complexity of the ICD-9 to ICD-10 translation process.
Here we describe the derivation and evaluation of an ICD-10 coding algorithm for implementation of the ICD-9-CM data merging solution first proposed by Norris et. al. in our group [1]. Specifically, we present (1) the ICD-10 code selection process, (2) its implementation in the data merging process, and (3) its validation. To validate the methodology, we (a) compared the prevalences of specific clinical risk factors when implemented in ICD-9-CM and later in ICD-10, (b) modelled the 1-year mortality with clinical risk factors as predictors and (c) assessed how well the ICD-10 algorithm predicts mortality.

APPROACH Project
The Alberta Provincial Project for Outcome Assessment in Coronary Heart Disease (APPROACH) is a province-wide inception cohort of all adult Alberta residents undergoing cardiac catheterization for ischemic heart disease [10]. APPROACH was initiated to study provincial outcomes of care and facilitate quality assurance/quality improvement for patients with coronary artery disease in Alberta. The APPROACH database contains detailed clinical information on adult patients with known or suspected coronary artery disease (CAD) who undergo invasive cardiac procedures. Patients in APPROACH are followed longitudinally after cardiac catheterization, thus allowing for assessment of subsequent procedure use (i.e. percutaneous coronary intervention [PCI] or coronary artery bypass graft surgery [CABG]), as well as outcomes such as mortality and quality of life. Data collection is ongoing, and as is typical in prospective data registries, there are occasionally data fields that are not completed in the data collection process.

Clinical variables
For the purposes of this data merging methodology research, clinical data were obtained for 86,649 adults (age ≥ 16 years) undergoing cardiac catheterization at one of the three hospitals in Alberta performing this procedure. Data elements recorded in the registry include patients' age, sex and presence of the following risk factors: cerebrovascular disease, congestive heart failure, chronic pulmonary disease, renal disease, type 1 diabetes, type 2 diabetes, dialysis, hyperlipidemia, hypertension, liver/gastrointestinal disease, malignancy, prior coronary artery bypass graft surgery, prior angioplasty, prior lytic therapy, prior myocardial infarction, and peripheral vascular disease. Clinical indication for catheterization is also collected at time of catheterization. It is these clinical variables that are occasionally missing in the data records produced for individual patients.

Administrative data source
We obtained corresponding administrative data for all patients undergoing cardiac catheterization at the three Alberta hospitals performing the procedure. Administrative data records were selected for use when the admission and discharge dates in the administrative data encompass the date of cardiac catheterization recorded in APPROACH. These data are coded according to the ICD- Hospitals are required to submit discharge abstracts to the provincial Ministry of Health and the Canadian Institute for Health Information for each acute care hospital separation (discharge, transfer, or death) and for major outpatient procedures. Data elements acquired from the administrative data source included the patients' unique provincial personal health care number, the hospital chart number, sex, birth date, admission date, up to 16 diagnostic codes, and up to 10 procedure codes.

APPROACH Methods
Our team has previously described our data merging methodology for dealing with missing data in APPROACH 1 . We developed ICD-9-CM coding definitions for each of the clinical variables identified in APPROACH, based largely on the ICD-9-CM comorbidity coding scheme derived by Deyo et. al. for defining the Charlson comorbidity index in American administrative data [11], and since then widely used by health services researchers in Canada and elsewhere. For variables that could not be matched to the Deyo coding algorithm in our originally described method [1], we scanned the ICD-9-CM for clinically appropriate codes to define specific clinical variables -a process implemented by two APPROACH team members (WAG & CMN), and through consensus. SAS computer code queries each of the diagnosis and procedure code fields in ICD-9-CM administrative data, thus defining presence or absence of each of the comorbidities.

Derivation of the ICD-10 algorithm
Our first step in the translation process to develop an ICD-10 conversion of our ICD-9-CM definitions was to determine which clinical comorbidities in APPROACH needed de novo ICD-10 definitions, as opposed to coding definitions that could be borrowed from published methodological research on ICD-10. Fortunately, important published work has recently been completed by collaborators in Australia, Canada and Switzerland who have developed and validated a translation from ICD-9-CM to ICD-10 of the Charlson comorbidity index (Deyo coding algorithm) and the Elixhauser comorbidity coding method [9]. This recently-published ICD-10 coding algorithm developed through a rigorous multi-step process permitted us to use the derived data definitions for defining presence or absence of some of the clinical diagnoses in the ICD-10 administrative data used in our merging process. For the other variables (see below) that are not included in the above-mentioned ICD-10 adaptation of the Charlson or Elixhauser methods, we proceeded with a multi-step consensus process for code selection.
A committee of several team members determined the de novo ICD-10 coding algorithm for the following clinical variables: diabetes type 1, diabetes type 2, current smoker, indication for catheterization, prior thrombolytic, prior CABG and prior PCI. Procedure codes from the Canadian Classification of Interventions (CCI) were used to define the clinical variables that related to performance of a specific procedure, while ICD-10 was used to define the diagnosis codes. The coding consensus process involved a total of eight individuals originating from both the APPROACH team based in Alberta, and collaborators working in a 'sister' initiative, the British Columbia Cardiac Registries. All members received a package of resource materials that included old ICD-9-CM codes previously used with detailed descriptions of all codes, the ICD-10 manual and printouts of an ICD-10 compact disk search for medical terms. The team then met face-to-face and slowly worked through the manual as a group until all the variables had been defined. Upon completion of the face-to-face consensus process, a list of selected codes was drafted and re-circulated to all participants for final verification and approval.

Implementation of ICD-10 merge
After using SAS code to query the administrative data, we merged administrative and clinical databases by provincial personal health numbers and hospital ID. If a comorbid condition was present in either database, our merged (enhanced) data was considered to have that condition present. This means that the clinical variables defined in administrative data were used in two situations: 1) when data were missing in APPROACH, in which case the administrative data filled in a missing data 'hole', and 2) to recode (i.e., enhance) clinical variables that were coded as absent in APPROACH when they were coded as being present in the administrative data [13] (a justifiable recoding of variables given the uniformly high specificity of variables coded as present in administrative data). The multistep data merging methodology is summarized in Box 1.

Analysis
After completing the conversion of our ICD-9-CM definitions into code for use with ICD-10, we enhanced our clinical data with administrative data. We then visually compared the prevalence of the clinical risk factors for the ICD-9-CM and ICD-10 methods on a plot of prevalence by year. We also compared two logistic regression models predicting mortality at one year based on the presence of clinical diagnoses present at baseline: (1) A model predicting 1-year mortality based on the 1995-2001 merged data, using ICD-9-CM coding, and (2) a model predicting 1-year mortality based on the 2002-2005 merged data using ICD-10 coding. All clinical variables listed in Table  1, in addition to age and sex, were entered into the logistic models with indication for catheterization also modelled (categorized as: MI, unstable angina, stable angina, or other indications [reference group]). To assess model performance of the two data merging methods, we determined the C-statistic for each of the two models. The Cstatistic corresponds to the area under the receiver operating characteristic (ROC) curve and is a measure of model discrimination. It has a maximum possible value of 1.0. A value of 0.5 corresponds to a model that has no ability to discriminate beyond chance. Ninety-five percent confidence intervals were calculated for the c-statistic using bootstrap. Lastly, we visually assessed the models' goodness-of-fit by plotting observed versus expected percentages across analysis-defined deciles of risk. The risk groups used in this latter graphical assessment of model perform- ance were those that arose from the models described above. All statistical analyses were performed using SAS version 8.1 (Cary, NC).

Summary of Methodology
1. Determine frequencies of missing data in clinical registry data; 2. Prepare clinical registry data with coding as 1 = present and 0 = absent. If missing or unknown, code as 0; 3. Obtain corresponding administrative data records with admission and discharge dates that encompass the procedure date;

Coding Algorithm
The new ICD-10 coding algorithm developed by consensus is presented in Table 1, along with the original ICD-9-CM algorithm that we previously reported [1]. The ICD-10 coding algorithm provides several more codes than the ICD-9-CM algorithm with changes perhaps most notable in the coding approaches for stable and unstable angina.

Pattern of missing data
We present the pattern of missing data in the original APPROACH dataset in Table 2. Overall there were 58.9% of cases with at least one variable missing. The proportion of missing data varies over the years, but the proportion of cases with at least some missing data remains fairly constant across years. For a two year period (2001 & 2002) missing values for prior CABG and prior PCI were automatically coded as "0" in the raw APPROACH data (as opposed to being coded as unknown or missing). This issue was recognized and adjusted such that rates of missing data for these two variables increased again in 2003. This issue did not affect any of the other study variables.

Percentage of the clinical conditions
The data consisted of 86,649 patients who underwent cardiac catheterization between April 1, 1995 andMarch 31, 2006. Table 3

Model performance for predicting mortality
The C-statistics for the two models predicting mortality based on the clinical variables were as follows: 0.788 (95% CI 0.775, 0.802) for the ICD-10 model versus c-statistic = 0.784 (95% CI 0.780, 0.790) for the ICD-9-CM model. These parameters suggest comparable performance in predicting mortality for the two models (ICD-10 and ICD-9-CM). Norris et al. previously demonstrated that the merged data performs better than the complete set of clinical variables without use of the administrative data merging methodology [1]. We find this to be true again for this analysis: c = 0.745 for the complete set of clinical variables without use of the administrative data merging versus c = 0.780 for the complete merged data. A subgroup analysis also showed that ICD10 performing slightly better than ICD The observed and expected percentages of death at one year across decile-of-risk categories were plotted for each of the two methods ( Figure 2). The ICD-9-CM coding method clearly predicts mortality and "spreads out" risk estimates, as does the ICD-10 coding method. The ICD-9-CM coding method estimates a spread of risk across decile groupings ranging from 0.7% for the lowest risk decile to 18.7% for the highest risk decile; the ICD-10 coding method produces a comparable spread of risks across deciles ranging from 0.5% to 16.9%.

Discussion
In light of ICD-10 being introduced in 1992 by the World Health Organization, and being adopted by numerous countries internationally, researchers are now faced with the methodological challenge of developing ICD-10 coding algorithms for applied health services and population health research. ICD-10 is a new system with potential enhancement of coded information. For that potential to be fulfilled, however, there is a need for validated methodological tools for ICD-10. This paper contributes to the literature by demonstrating the process of taking a validated ICD-9-CM tool and adapting it (through a process of code translation) to a validated ICD-10 tool. Similar to the recent work of Quan et al. in this area [9], this type of research is an essential prerequisite for effective health research to proceed with administrative data.
With the substantial cost associated with producing clinical registries such as APPROACH, and the commonly associated challenge of missing clinical registry data, our data merging method and translation to ICD-10 can be of value to other researchers. As already mentioned, the methodology that we have described (both ICD-9-CM and ICD-10) is in use for applied research studies performed using APPROACH data, and is also being applied in other jurisdictions. A prerequisite for implementation of this data merging solution is, of course, that researchers need to gain access to available administrative data -a process that is not always simple in the context of health data privacy considerations and locally variable administrative data access procedures. Data access challenges aside, however, our data merging method has numerous advantages over other missing data solutions such as 'imputing to zero'(where a missing value is assumed to mean that the condition is absent), complete case analysis, or multiple imputation. Such alternate approaches bring the disadvantages of, respectively, occasionally lacking validity, losing cases, and introducing statistical complexity. It is such consideration of advantages and disadvantages that has led the APPROACH network to now routinely use the data merging method described here as its missing data solution of choice [12].
A limitation of our study is that our data merging algorithm may need modification for use in other clinical data registry initiatives if the baseline clinical variables collected do not map directly to those that we define in our coding algorithm. In instances where the coding algorithm and method are modified, we caution that validity assessments such as those presented here should be performed. A second caveat is that ICD-10 coding quality (relative to a gold standard assessment of presence/ absence of specific conditions) may vary across jurisdictions, and that there may be a phenomenon of improved validity of ICD-10 coding over time as coders become more familiar with the new coding system. In relation to this latter point, it would have been highly informative to have a year of dually-coded data, with both ICD-9-CM and ICD-10 codes assigned to individual cases; such a data resource would have permitted a truer test of the comparability of the ICD-9-CM vs. ICD-10 coding algorithms. In this regard we emphasize that the logistic regression analysis that we did is not in and of itself proof of seamless transition and validation across algorithms. However, the logistic regression findings, in combination with the relatively smooth prevalence transitions pre-Percentage (%) of comorbidities defined by ICD-9-CM and ICD-10 coding schemes by year Figure 1 Percentage (%) of comorbidities defined by ICD-9-CM and ICD-10 coding schemes by year.  Table 3 suggest a reasonable performance. Interpretation of prevalence trends is challenging and somewhat difficult because of real clinical trends over time and also because of challenges such as evolving clinical definitions (e.g. the MI definition).
Despite these limitations, this paper presents the important methodological development work that health researchers need to undertake in coming years as a number of validated ICD-9-CM methodological tools for administrative data now need to be translated to ICD-10 so that productive and important health research can continue to be performed.

Conclusion
In this instance, we demonstrate the successful translation of an ICD-9-CM coding algorithm to ICD-10 for dealing with missing data in a clinical registry initiative. Our work reveals a seamless transition to ICD-10 for our methodological tool, and provides a template for others to follow as they undertake similar work.