Effects of aggregation of drug and diagnostic codes on the performance of the high-dimensional propensity score algorithm: an empirical example

Background The High-Dimensional Propensity Score (hd-PS) algorithm can select and adjust for baseline confounders of treatment-outcome associations in pharmacoepidemiologic studies that use healthcare claims data. How hd-PS performance is affected by aggregating medications or medical diagnoses has not been assessed. Methods We evaluated the effects of aggregating medications or diagnoses on hd-PS performance in an empirical example using resampled cohorts with small sample size, rare outcome incidence, or low exposure prevalence. In a cohort study comparing the risk of upper gastrointestinal complications among initiators of celecoxib or traditional NSAIDs (diclofenac, ibuprofen) with rheumatoid arthritis or osteoarthritis, we (1) aggregated medications and International Classification of Diseases, 9th Revision (ICD-9) diagnoses into hierarchies of the Anatomical Therapeutic Chemical (ATC) classification and the Clinical Classification Software (CCS), respectively, and (2) sampled the full cohort using techniques validated by simulations to create 9,600 samples to compare 16 aggregation scenarios across 50% and 20% samples with varying outcome incidence and exposure prevalence. We applied hd-PS to estimate relative risks (RR) using 5 dimensions, predefined confounders, ≤ 500 hd-PS covariates, and propensity score deciles. For each scenario, we calculated: (1) the geometric mean RR; (2) the difference between the scenario mean ln(RR) and the ln(RR) from published randomized controlled trials (RCT); and (3) the proportional difference in the degree of estimated confounding between that scenario and the base scenario (no aggregation). Results Compared with the base scenario, aggregation of medications into ATC level 4, alone or in combination with aggregation of diagnoses into CCS level 1, improved the hd-PS confounding adjustment in most scenarios, reducing residual confounding relative to the RCT findings by up to 19%.
Conclusions Aggregation of codes using hierarchical coding systems may improve the performance of the hd-PS to control for confounders. The balance of advantages and disadvantages of aggregation is likely to vary across research settings.


Background
Although early detection and assessment of drug safety signals are important [1][2][3], post-approval drug safety studies often face challenges such as small size, rare incidence of adverse outcomes, and low exposure prevalence after the launch of a new drug. In addition, nonrandomized studies of treatment effects in healthcare data are vulnerable to confounding bias. Propensity Score (PS) methods are increasingly used to control for measured potential confounders, especially in pharmacoepidemiologic studies of rare outcomes in the presence of many covariates from different data dimensions of administrative healthcare databases [4][5][6][7]. Methods of selecting variables for PS models based on substantive knowledge have been proposed [8][9][10][11][12], but substantive knowledge may often be lacking, and the meaning of various medical codes may often be unclear [13]: Seeger et al. proposed that health care claims may serve as proxies in hard-to-predict ways for important unmeasured covariates [14]; Stürmer et al. used PS models with over 70 variables representing medical codes present during a baseline period [5]; Johannes et al. created a PS model that considered as candidate variables the 100 most frequently occurring diagnoses, procedures, and outpatient medications in healthcare claims [15]. A recently developed strategy for selecting variables from a large pool of baseline covariates for PS analyses is the use of computer-applied algorithms [16,17], such as the High-Dimensional Propensity Score (hd-PS) algorithm. The hd-PS automatically defines and selects variables for inclusion in the PS estimating model to adjust treatment effect estimates in studies using automated healthcare data [16,18].
The hd-PS algorithm prioritizes variables within each data dimension (e.g., inpatient diagnoses, inpatient procedures, outpatient diagnoses, outpatient procedures, dispensed prescription drugs) by their potential for confounding control based on their prevalence and on bivariate associations with the treatment and with the study outcome [16,19]. Version 1 of the hd-PS algorithm excludes variables found in fewer than 100 patients (exposed and unexposed combined) and variables with zero/undefined covariate-exposure association or zero/undefined covariate-outcome association. Once variables have been prioritized, a predefined number of variables with the highest potential for confounding per dimension is chosen to be included in the PS.
Combining medications or medical diagnoses into higher-level groupings increases the prevalence of the aggregated covariate, which may increase the chances of a variable being selected by the algorithm. However, aggregation may also weaken covariate-exposure and/or covariate-outcome relations and reduce variable prioritization in the Bross formula [19]. In addition to the selection issue, control for a selected aggregated variable may lead to residual confounding in the adjusted risk ratios if not all of its components have the same confounding effect. No study to date has assessed how hd-PS performance is affected by aggregating medications and/or medical diagnoses, especially in cohorts with relatively few patients, rare outcome incidence, or low exposure prevalence. To investigate the impact of aggregation on hd-PS performance in cohorts with low outcome incidence or exposure prevalence, we created an empirical example based on prior research [16,20] with an observed elevated crude risk ratio, likely due to confounding by indication, in studies of upper gastrointestinal (UGI) complications in rheumatoid arthritis (RA) or osteoarthritis (OA) patients initiating celecoxib compared to traditional non-steroidal anti-inflammatory agents (tNSAIDs). Celecoxib has been shown to decrease the risk of UGI complications in several randomized controlled trials (RCT) by approximately 50% [21][22][23][24][25][26]. We therefore assume that a treatment effect estimate closer to 0.50 is less biased by confounding.

Selection of the study cohort
We constructed an incident user cohort [27] to examine UGI complications in RA and OA patients initiating celecoxib or a tNSAID, specifically ibuprofen or diclofenac. All individuals with a first dispensing between 1 July 2003 and 30 September 2004 of celecoxib, ibuprofen, or diclofenac were drawn from the Truven Health Analytics MarketScan Commercial Claims and Encounters database [28]. MarketScan is a longitudinal healthcare claims database which captures patient demographics, inpatient and outpatient diagnoses and procedures, and medications from a selection of large private employers, health plans, government agencies and other public organizations. We selected patients who were age 18-65 years, belonged to a health insurance plan with full medical and pharmacy benefits, and had at least 6 months of enrollment history as of the date of first dispensing of a study or referent drug (the "index date"). During the 6 months prior to the index date, patients must have had a diagnosis of RA (ICD-9 code 714.x) or OA (ICD-9 codes 715.x, 721.x) but no NSAID dispensing (including aspirin), and no record of gastrointestinal ulcer disorders or hemorrhage; active renal, hepatic or coagulation disorders; allergies; malignancy; or esophageal or gastroduodenal ulceration.
The study outcome, UGI complication, was defined as an inpatient or outpatient diagnosis of a first peptic ulcer disease complication, including perforation or UGI hemorrhage (ICD-9 codes 531.x, 532.x, 533.x, 534.x, 535.x, 578.0), or a physician service code for UGI hemorrhage (Current Procedural Terminology (CPT) code 43255 or ICD-9 procedure code 44.43). The complication must have occurred during the 60 days after initiation of the study drug. These outcome definitions were validated for 1,762 patients in an independent hospital discharge database, with a positive predictive value of 90% against medical chart review [29].

Aggregations of medications and medical diagnoses
Major U.S. administrative databases contain prescription medication information coded with non-hierarchical National Drug Codes (NDC) and generic drug names. The medications can be aggregated using the hierarchical Anatomical Therapeutic Chemical (ATC) drug classification developed by the World Health Organization (WHO) for drug utilization studies [30]. Similarly, medical diagnoses are represented by International Classification of Diseases, 9th Revision, Clinical Modification (ICD-9) codes. ICD-9 has limited hierarchical relationships [31], but the Clinical Classification Software (CCS) developed by the Agency for Healthcare Research and Quality (AHRQ) can be used to aggregate diagnoses into clinically meaningful groupings [32].
We aggregated medications to five levels of the ATC classification [30]. This system classifies active substances into different groups based on their target organ or system and their therapeutic, pharmacological and chemical properties. Drugs are classified into fourteen main groups (1st level) with pharmacological or therapeutic subgroups (2nd level). The 3rd and 4th levels are chemical, pharmacological or therapeutic subgroups, and the 5th level is the chemical substance. Several ATC groups are subdivided into both chemical and pharmacological groups. The pharmacological group is often chosen if a new substance fits both a chemical and a pharmacological 4th level. Substances in the same 4th ATC level are not pharmacotherapeutically equivalent, as they may have different modes of action, therapeutic effects, drug interactions and adverse drug reaction profiles. New 4th levels are commonly established only if at least two approved substances fit in the group. A new substance that does not clearly belong to any existing group of related substances at the ATC 4th level is often placed in an X ("other") group. Each ATC release adds very few new substances within the 4th levels of a given 3rd level, and new substances belonging to different ATC 3rd levels receive different "X"-group codes at the 4th level. Therefore, the mixed bag of "X" groups is not an issue for the hd-PS algorithm.
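Because ATC is a strict prefix hierarchy, aggregating a level 5 substance code to any higher level amounts to truncating the code to a fixed prefix length. A minimal sketch (the function name and lookup table are ours; the level lengths follow the WHO convention of 1, 3, 4, 5 and 7 characters):

```python
# Character lengths of ATC codes at each hierarchy level
# (WHO convention: level 1 = 1 char, level 2 = 3, level 3 = 4,
# level 4 = 5, level 5 = 7).
ATC_LEVEL_LEN = {1: 1, 2: 3, 3: 4, 4: 5, 5: 7}

def atc_truncate(code: str, level: int) -> str:
    """Aggregate a full (level 5) ATC code to a higher level by prefix."""
    return code[:ATC_LEVEL_LEN[level]]

# Clopidogrel (B01AC04) aggregated to level 4 joins the other
# platelet aggregation inhibitors under B01AC.
print(atc_truncate("B01AC04", 4))  # B01AC
print(atc_truncate("B01AC04", 2))  # B01
```

Applying such a truncation to every dispensing record before building the hd-PS covariates is all that the medication-aggregation scenarios require.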
We created several scenarios of code aggregation using the CCS groupings. In the base scenario, we applied the hd-PS with up to 5-digit granularity of ICD-9 for inpatient and outpatient diagnoses. Note that 3-digit ICD-9 codes are kept separate from 4- and 5-digit codes in the hd-PS despite the limited hierarchy between these levels. We transformed ICD-9 diagnoses into four-level CCS groupings via the cross-mapped ICD-9 to CCS multi-level diagnoses table [32]. There are 18, 134, 355 and 207 groupings in CCS levels 1, 2, 3 and 4, respectively. However, not all ICD-9 codes have a corresponding CCS code at all four levels. Therefore, we created a "universal" CCS by using the most granular code available for each ICD-9 diagnosis code, obtaining 355 groupings at CCS level 4. We separately investigated different levels of ICD-9 granularity by using the first 3 or 4 digits of the ICD-9 codes [16].
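The "universal" CCS described above can be sketched as picking the deepest level present in the crosswalk for each ICD-9 code. A minimal illustration, assuming each crosswalk entry is represented as a dict from CCS level number to grouping code (the data structure and example codes are hypothetical, not taken from the AHRQ tables):

```python
def universal_ccs(ccs_levels: dict) -> str:
    """Return the most granular CCS grouping available for one ICD-9 code.
    `ccs_levels` maps CCS level number (1-4) to the grouping code at that
    level; levels with no mapping are simply absent from the dict."""
    return ccs_levels[max(ccs_levels)]

# Hypothetical crosswalk entries: one ICD-9 code is mapped through
# CCS level 4, another only through level 2.
print(universal_ccs({1: "9", 2: "9.4", 3: "9.4.1", 4: "9.4.1.2"}))  # 9.4.1.2
print(universal_ccs({1: "6", 2: "6.2"}))                            # 6.2
```

This fallback rule is what yields 355 rather than 207 groupings at "level 4": codes with no level 3 or level 4 mapping are carried at their deepest available level.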
Sampling techniques to generate conditions of different size, outcome incidence and exposure prevalence

The full cohort consisted of 18,829 patients (7,197 prescribed celecoxib and 11,632 prescribed ibuprofen or diclofenac); 117 patients developed an UGI complication. For each aggregation scenario (including no aggregation), we created six conditions of 100 sampled cohorts. First, we varied the total size of the cohort by drawing simple random samples of 50% (condition 1) and 20% (condition 2), 100 times each, without replacement. Second, we varied the outcome incidence by retaining all noncases, but drawing 50% (condition 3) and 20% (condition 4) simple random samples without replacement, 100 times each, from the 117 cases and recoding the unselected cases as noncases. Finally, we varied the exposure prevalence by retaining all unexposed patients, but drawing 50% (condition 5) and 20% (condition 6) simple random samples, 100 times each, without replacement, from the exposed subjects and replacing the unselected exposed subjects with the same number of randomly selected unexposed patients (Figure 1).
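The outcome-incidence conditions (3 and 4) can be sketched as follows; the cohort representation and field names are ours for illustration, not those of the original analysis:

```python
import random

def subsample_cases(cohort, fraction, rng):
    """Outcome-incidence resampling (conditions 3 and 4): retain all
    noncases, draw a simple random sample of the cases without
    replacement, and recode the unselected cases as noncases.
    `cohort` is a list of dicts with a binary 'case' flag."""
    case_idx = [i for i, p in enumerate(cohort) if p["case"] == 1]
    keep = set(rng.sample(case_idx, round(len(case_idx) * fraction)))
    return [dict(p, case=1 if i in keep else 0)
            for i, p in enumerate(cohort)]

# Toy cohort: 1,000 patients, 100 cases; a 20% case sample leaves
# 20 cases while the cohort size is unchanged.
rng = random.Random(0)
cohort = [{"id": i, "case": 1 if i < 100 else 0} for i in range(1000)]
resampled = subsample_cases(cohort, 0.20, rng)
print(sum(p["case"] for p in resampled))  # 20
```

Conditions 1-2 and 5-6 follow the same pattern, sampling the whole cohort or the exposed group instead of the cases.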

The hd-PS algorithm
We implemented the hd-PS algorithm with five data dimensions commonly available in automated healthcare databases: pharmacy claims; outpatient diagnoses; outpatient procedures; inpatient diagnoses; and inpatient procedures. The algorithm identifies the 200 most prevalent covariates within each data dimension by creating binary variables for each diagnosis, procedure and medication. The prevalence of each variable depends on the granularity of the coding: the more aggregation, the higher the prevalence tends to be, as subcodes are combined into aggregated codes. Each variable is assessed at three levels of within-patient occurrence: (1) once; (2) sporadic (≥ median number of times); or (3) frequent (≥ 75th percentile number of times), with details described elsewhere [16]. With the default setting of 200 variables for each dimension, 3,000 indicator variables (200 × 3 levels × 5 dimensions) are then prioritized according to their potential for confounding control based on their prevalence and bivariate associations with the treatment and with the study outcome according to the Bross formula [19]. By default, the top 500 indicator variables are selected for the PS.
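The prioritization step can be sketched as ranking candidate indicator variables by the absolute log of the multiplicative bias term. The formula below follows the Bross bias multiplier as commonly described for hd-PS, but the data structures and example values are ours:

```python
import math

def bross_bias_multiplier(p_c1, p_c0, rr_cd):
    """Multiplicative confounding bias term for a binary covariate:
    p_c1, p_c0 = covariate prevalence among exposed and unexposed;
    rr_cd = covariate-outcome risk ratio."""
    return (p_c1 * (rr_cd - 1) + 1) / (p_c0 * (rr_cd - 1) + 1)

def prioritize(covariates, k=500):
    """Rank candidate indicator variables by |log(bias multiplier)|
    and keep the top k for the propensity score model."""
    scored = [(abs(math.log(bross_bias_multiplier(
                  c["p_c1"], c["p_c0"], c["rr_cd"]))), c["name"])
              for c in covariates]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]

# Toy example: a covariate more prevalent among the exposed and strongly
# associated with the outcome outranks a balanced, weak covariate.
cands = [
    {"name": "esophagitis_530_11", "p_c1": 0.012, "p_c0": 0.008, "rr_cd": 5.0},
    {"name": "balanced_weak",      "p_c1": 0.100, "p_c0": 0.100, "rr_cd": 1.1},
]
print(prioritize(cands, k=2))  # ['esophagitis_530_11', 'balanced_weak']
```

This also makes the trade-off described earlier concrete: aggregation raises prevalences (p_c1, p_c0) but may pull the covariate-exposure imbalance and the covariate-outcome RR toward the null, so the net effect on a variable's rank can go either way.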

Statistical analysis
The hd-PS algorithm can augment the automatically selected covariates with predefined covariates chosen by the investigator. For each condition-scenario combination, we fit 5 log binomial models: (1) an unadjusted crude model; (2) adjusted for basic covariates (age [continuous], gender, calendar year of drug initiation); (3) adjusted for basic plus extended, pre-selected covariates (hypertension, congestive heart failure, coronary artery disease, inflammatory bowel disease, prior dispensing of gastroprotective drugs, warfarin, antiplatelet drugs, and oral steroids) selected based on biological rationale in the literature [16,18,20,[33][34][35]; (4) adjusted for basic plus hd-PS selected covariates; and (5) adjusted for basic, extended, and hd-PS selected covariates. The hd-PS was adjusted for by including indicator variables for each decile of the PS in the regression model.
In the base scenario, we used generic drugs and up to 5-digit granularity of ICD-9, CPT or Healthcare Common Procedure Coding System (HCPCS). We then re-fitted all models in six scenarios for aggregation of medications, eight scenarios for aggregation of diagnoses, and one scenario that combined the medication and diagnosis aggregations that appeared to perform best across the six conditions of cohort samples.
We applied hd-PS to the full study cohort to estimate the treatment effect and used it as the reference value for comparison with results from the generated cohort conditions. For the 100 samples in each cohort condition, we calculated summary statistics for the estimated risk ratios (geometric mean, 25th and 75th percentiles), as well as the mean percentage of covariates selected by hd-PS in the full cohort that were also selected by hd-PS in the samples, the median number of exposed and unexposed subjects, and the median number of exposed and unexposed outcomes. We evaluated each aggregation scenario by estimating the amount of residual confounding, calculated as the absolute difference between the natural logarithm of the estimated risk ratio and the natural logarithm of 0.50, representing the RCT findings [21][22][23][24][25][26]. To estimate the change in residual confounding resulting from each aggregation scenario, we calculated the proportional difference in the absolute degree of estimated confounding between the scenario of interest and the base (no aggregation) scenario. For example, for the 20% exposure prevalence cohorts (condition 6), the unadjusted (confounded but otherwise presumptively unbiased) estimate is RRu = 0.97, and two confounded (but otherwise presumptively unbiased) estimates are RRc1 = 0.89 (base: basic, extended and hd-PS covariates; no aggregation) and RRc2 = 0.81 (combined diagnostic and medication aggregation). Assuming that the unconfounded (true) value is RRt = 0.50, the estimated confounding in the base estimate is |ln(0.89) − ln(0.50)| = 0.577, and the estimated confounding in the combined aggregation estimate is |ln(0.81) − ln(0.50)| = 0.482. Thus, the proportional difference in the absolute degree of estimated confounding between the two estimates is (0.482 − 0.577)/0.577 = −16.3%. We would conclude that the combined aggregation estimate is 16.3% less confounded than the base estimate.
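The worked calculation above can be reproduced in a few lines (the RR values are those quoted in the text; the function name is ours):

```python
import math

def residual_confounding(rr_est, rr_true=0.50):
    """Absolute log-scale distance between an estimated risk ratio
    and the presumed unconfounded ('true') value from the RCTs."""
    return abs(math.log(rr_est) - math.log(rr_true))

base = residual_confounding(0.89)      # base scenario: no aggregation
combined = residual_confounding(0.81)  # combined ATC level 4 + CCS level 1

# Proportional difference in estimated confounding vs. the base scenario
prop_diff = (combined - base) / base

print(round(base, 3))             # 0.577
print(round(combined, 3))         # 0.482
print(round(prop_diff * 100, 1))  # -16.3
```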
Because of limited data availability, and to mimic as closely as possible the intention-to-treat analyses in the trials, we used a single prescription reimbursement claim as the treatment measure. The current study was declared exempt by the Institutional Review Board of the University of North Carolina at Chapel Hill.

Results
In the full cohort, there were 7,197 (38%) celecoxib and 11,632 (62%) ibuprofen or diclofenac initiators, with 46 and 71 UGI events, respectively. Celecoxib users were older and had more risk factors for UGI complications than did the tNSAID users (Table 1). The RR for UGI complication associated with celecoxib versus tNSAIDs was 1.05 (95% CI: 0.72-1.52) in the crude model, compared to 0.92 (95% CI: 0.62-1.37) in the model that used hd-PS automated variable selection in addition to the basic covariates (Table 2). Consistent with the sampling procedures described above, the median numbers of patients in cohorts in conditions 1 and 2 were about 3,594 and 1,441, respectively; the median outcome incidence proportions in conditions 3 and 4 were about 0.3% and 0.1%, respectively; and the median exposure prevalences in conditions 5 and 6 were about 19% and 8%, respectively.
In all cohort conditions except condition 2, where the total study size was only about 3,790, the geometric means of the hd-PS adjusted risk ratios were similar to the full cohort risk ratios. This similarity held even in cohort conditions 4 and 6, where the number of exposed patients with an outcome event was approximately 10. In all conditions except condition 6, where the exposure prevalence was only 8%, the geometric means of the hd-PS adjusted risk ratios were at least slightly closer to the RCT finding than the geometric means of the risk ratios adjusted for only the basic and extended covariates. A majority of the covariates that hd-PS identified in the full cohort were also selected by hd-PS in the samples in conditions 1, 3, and 5, where the number of exposed outcomes was at least 20, but also in condition 6, where there were only 10 exposed outcomes but a large total number of outcomes.
A scenario with combined aggregation of medications into ATC level 4 and of diagnoses into CCS level 1 consistently performed best, reducing residual confounding by 8.9% to 19.3% relative to the base scenario (Tables 3 and 4). Aggregating medications into chemical, pharmacological or therapeutic subgroups of ATC level 4 slightly improved adjusted estimates in all cohort conditions except condition 4, the 20% outcome incidence samples (data not shown). In contrast, aggregating medications into groupings of the other ATC levels produced nearly the same or even worse adjusted risk ratios in all cohort conditions. When we experimented with different aggregations for diagnoses, without any aggregation for medications, aggregating ICD-9 diagnosis codes into different CCS levels changed the adjusted risk ratios inconsistently. Note that in our empirical setting, not controlling for any measure of diagnoses resulted in the estimate closest to the RCT finding (RRs in column "No Dx" of Table 3). When we aggregated ICD-9 diagnosis codes into CCS level 1 or 2, the adjusted risk ratios in the samples were generally closer to the RCT finding. In contrast, aggregation of ICD-9 codes into the universal CCS, CCS level 3, CCS level 4, or 3- or 4-digit ICD-9 groupings did not improve the adjusted point estimates (data not shown).

Discussion
We hypothesized that aggregating medications and medical diagnoses into certain levels of ATC or CCS would improve the performance of the hd-PS, especially with smaller cohort size, rarer outcome incidence or lower exposure prevalence. To explore these hypotheses, we selected a retrospective cohort where, as has been previously observed, the hd-PS adjustment for confounding yielded an adjusted RR slightly closer to the RCT findings [21][22][23][24][25][26] than did PS adjustment using a limited number of investigator predefined covariates [16,18]. Of the 500 covariates identified by hd-PS in the full cohort, most were also identified by hd-PS in the random samples with fewer observations, rarer outcomes, or lower prevalence of treatments. Aggregation of medications into ATC level 4, alone or in combination with aggregation of diagnoses into CCS level 1, improved the hd-PS adjustment for confounding in the full cohort and most of the samples. The strength of our results on the effect of aggregating diagnoses is limited, however, by the fact that the overall confounding by co-morbidity was attenuated in the presence of 500 hd-PS covariates from medications, outpatient procedures and inpatient procedures in our empirical setting.
In general, aggregation of potential covariates into higher-level groupings increases the number of covariates that are present in at least 100 observations (the default requirement of hd-PS version 1) and increases the prevalence of each covariate in the exposed and unexposed groups, which raises the covariate's prioritization by the Bross formula if it is associated with treatment [19]. But aggregation may simultaneously weaken covariate-exposure and/or covariate-outcome relations, reducing prioritization in the Bross formula [19]. The latter also has the potential to change the impact of controlling for the aggregated covariate on the adjusted risk ratios. In theory, therefore, the hd-PS algorithm may not favor aggregation of confounder information. In particular settings, however (e.g., small samples, rare outcome incidence and low exposure prevalence), aggregation can help the hd-PS reduce residual bias, as it did in this study. Version 2 of the hd-PS algorithm, which removed the restriction of a minimum of 100 occurrences per potential confounder, gives important but infrequent confounders a higher chance of entering the variable selection process and may improve bias reduction for treatment effect estimates in settings with small sample sizes and low exposure prevalence.
Grouping medications into ATC level 4 instead of the original generic drugs helped the hd-PS to function robustly in the samples, except for the 20% outcome incidence samples (condition 4). The use of other ATC levels for aggregating medications did not provide benefit and even resulted in some harm. For example, ATC level 4 code B01AC (platelet aggregation inhibitors excluding heparin) includes the following level 5 codes: B01AC04 (clopidogrel), B01AC05 (ticlopidine), B01AC07 (dipyridamole), B01AC23 (cilostazol), and B01AC30 (combinations) (Table 5). The ATC level 4 with pharmacological subgroups seems the most appropriate level for aggregation of medications in this study. As for diagnostic codes, ICD-9 code 530.1 includes 530.11 (reflux esophagitis) and the additional codes 530.10 (esophagitis unspecified), 530.12 (acute esophagitis) and 530.19 (other esophagitis). In our study, the latter three codes each occurred in fewer than the 100-observation minimum that hd-PS requires by default and so would not be eligible for inclusion in the PS adjustment. With 5-digit granularity for diagnoses, the hd-PS selected ICD-9 code 530.11 (frequency 165, covariate-exposure RR = 1.3, covariate-outcome RR = 5.0; see Additional file 1: Table S6). Using 4-digit granularity for diagnoses, the hd-PS selected ICD-9 code 530.1 (esophagitis), which had a higher frequency (217) but slightly weaker covariate-exposure (RR = 1.2) and covariate-outcome (RR = 4.6) associations. Situations like this could account for the slight worsening of confounding control in the 4-digit ICD-9 aggregation compared with the base case.
Additional examples illustrating the changes in prevalence and in covariate-exposure and covariate-outcome relations when we aggregated potential confounders, ICD-9 codes 530.11 (reflux esophagitis) and 530.81 (esophageal reflux), from 5-digit ICD-9 into 4- and 3-digit ICD-9 and CCS levels 4, 3, 2 and 1, are given in Additional file 1: Table S6. It is worth noting that not all ICD-9 diagnosis codes have equivalent CCS codes in all 4 levels [32]. This issue was more pronounced in CCS levels 3 and 4.
Using the most granular CCS code available for each ICD-9 code in the universal CCS did not improve results in most samples or in the full cohort. We also did not observe any benefit when aggregating ICD-9 codes into first 3- or 4-digit groupings [16,31]. Since CCS has only 18 groupings for level 1 and 134 groupings for level 2, it could be argued that the benefit from aggregation comes about by enabling more variables from the other data dimensions (medications, inpatient and outpatient procedures) to fit within the 500-variable maximum in the hd-PS default. To address this concern, we also experimented with a maximum of k = 3,000 variables and consistently observed the benefit of aggregating ICD-9 into CCS levels 1 or 2. Similarly, ATC level 1 has 14 groups, whereas level 4 has over 800 groupings, yet aggregation of medications into ATC level 4 outperformed aggregation into level 1.

Our study has several limitations. This study empirically compared estimates from different aggregations and assumed any treatment effect estimate closer to the RCT findings to be less biased by confounding. There was relatively little confounding present in the data, and the effect estimates did not change much after adjustment for the baseline covariates. The magnitude of the percentage reductions in confounding depends on the value selected as the unconfounded value; however, the precise value selected from the published RCTs [21][22][23][24][25][26] does not affect the ranking of performance across scenarios. Our comparison relies on the assumption that the codes in the original database are accurate. Also, our study is based on a single cohort in which hd-PS performed reasonably well.
Fully specified simulations with known true risk ratios in diverse scenarios could demonstrate the advantage of aggregation under certain conditions, but they would be unable to answer the important question of the magnitude of the effect in real-world settings. It is nevertheless unclear whether our findings regarding the effects of aggregating medication and diagnostic codes on the performance of the hd-PS algorithm apply to other treatment-outcome pairs that may be subject to confounding by different factors. Studies with few events or small size may suffer from small-sample bias or from overfitting of the PS and outcome models that use PS deciles [36,37]. The small number of UGI complication cases produced imprecise estimates and insufficient power to confirm differences between the methods. The computing time required by the hd-PS algorithm constrained our ability to increase the number of samples beyond 100 for each cohort condition. Results should thus be compared with and without aggregation within each condition, but not across conditions. However, we are interested in bias, which pertains to expected values of point estimates, and statistical significance plays no defensibly useful role in the assessment or measurement of bias. Moreover, each aggregation scenario had six cohort conditions (600 samples), so consistent patterns (such as the combined ATC level 4 plus CCS level 1 scenario) are supported by a large number of samples. Users of the hd-PS methodology should screen for and remove instrumental variables and collider candidates [10][11][12]; this topic is beyond the scope of this study.
Further studies could explore examples with no drug effect on the outcome, an increased drug-outcome risk, or more common outcomes; compare the aggregation approaches with the zero-cell correction or exposure-based covariate selection for the hd-PS [38]; develop appropriate methods to handle missing codes in CCS levels; identify appropriate aggregations for procedures; examine simultaneous aggregation of diagnoses, medications and procedures; and evaluate hd-PS performance in cohorts of different size, outcome incidence and exposure prevalence.

Conclusion
Aggregation of drug and diagnostic codes using hierarchical coding systems may improve the performance of the hd-PS to control for confounders in cohorts with small size, low outcome incidence or low exposure prevalence, but the balance of advantages and disadvantages of aggregation is likely to vary across settings.