This retrospective cohort study was conducted using four fiscal years (April 1, 2015 – March 31, 2019) of de-identified administrative health data from the Manitoba Population Research Data Repository at the Manitoba Centre for Health Policy in the province of Manitoba, Canada. Manitoba has a universal healthcare system, therefore almost all contacts with the healthcare system for the entire population are captured in administrative health data. The provincial population is approximately 1.3 million according to the most recently-available Statistics Canada Census data.
Data sources
Study data sources included the Manitoba Health Services Insurance Plan Registry (Population Registry), the Hospital Abstracts Database, and the Medical Services Database. These data sources were linked using a unique personal health identification number.
The Population Registry captures information on healthcare coverage for all insured Manitobans, and was used to determine eligibility for study inclusion. The Registry also includes demographic information used to characterize the study cohort. The Registry does not provide information about an individual’s gender, which is a sociological construct; it only provided information about biological sex. Individuals may change their biological sex designation.
Chronic condition diagnoses were obtained from the Hospital Abstracts Database, which contains information on discharges from hospitals, and the Medical Services Database, which records information on ambulatory services provided in physician offices. Diagnoses within hospital discharge abstracts are coded using ICD-10-CA (International Statistical Classification of Diseases and Related Health Problems, 10th Revision with Canadian Enhancements), while Medical Services diagnoses are coded with up to five digits using ICD-9-CM (International Statistical Classification of Diseases and Related Health Problems, 9th Revision, Clinical Modification).
Cohort development
The study cohort included all Manitoba residents with complete or partial Manitoba Health insurance coverage during the study observation period. Individuals entered the study on April 1, 2015 or the date that coverage started, and were followed until the end of the study period or until their insurance coverage ceased (e.g., due to death, migration out of province). Based on information about biological sex recorded in the Population Registry, males with female-specific conditions and females with male-specific conditions were excluded (n = 15); the presence of these potential inconsistencies could suggest errors in the recording of diagnostic or demographic information, and it was not possible to verify the reasons for these potential inconsistencies. Since disease networks were formed from disease co-occurrence relationships, the network analysis was limited to individuals with diagnoses recorded for at least two chronic health conditions during the study observation period.
Disease ascertainment
Chronic conditions were ascertained using diagnoses identified from inpatient discharge records in the Hospital Abstracts Database and from physician visit records in the Medical Services Database. Surgeries recorded in both data sources were also included. Prenatal and pregnancy-related records were excluded to minimize overstating disease co-occurrence among females.
A single diagnosis code was used to ascertain whether an individual was considered as having a specified condition in the study observation period. Individual diagnosis codes were grouped into 201 Expanded Diagnostic Clusters (EDC) and 27 higher-level Major Expanded Diagnostic Clusters (MEDC) using the Johns Hopkins Adjusted Clinical Group (ACG) System [30]. Diagnoses were loaded into the Johns Hopkins ACG System as World Health Organization (WHO) ICD-9 or ICD-10 codes. Five-digit ICD-10-CA codes from the Hospital Abstracts Database were truncated to the first four digits to improve compatibility with the Johns Hopkins System, which supports the WHO ICD system but not the Canadian revision. There were 49 unique Canadian-specific ICD-10-CA codes relevant to chronic disease status that were not captured by the Johns Hopkins System. These 49 Canadian-specific diagnosis codes were first translated to WHO ICD-10 codes for inclusion. An additional 17 Canadian-specific ICD-10-CA codes were not captured; however, they were irrelevant to disease status since they indicated location of occurrence or activity engaged in during occurrence. Chronic conditions classified as separate EDC categories based on severity or presence of complications were combined into single disease categories including asthma with or without asthmaticus, hypertension with or without complications, type 1 diabetes with or without complications, and type 2 diabetes with or without complications. As well, 25 EDC categories that were non-descriptive, or referred to non-chronic medical conditions or to the neonatal period were removed. Two categories indicating severity of malignant neoplasms, already classified elsewhere, were also excluded. Since co-occurrences with frequencies less than 15 were excluded from the association analysis to minimize statistical errors, seven EDC categories with low frequencies were excluded: heart murmur, lymphadenopathy, thrombophlebitis, tuberculosis infection, sinusitis, other inflammatory conditions of skin, and other female gynecologic conditions. After 34 EDC categories were excluded, 167 EDC categories remained for the network analysis.
Disease co-occurrence measures
Disease co-occurrence was defined as two or more chronic health conditions recorded at any time during the four-year study observation period (April 1, 2015 – March 31, 2019) for the same individual. The four-year study period was selected because it was the available time frame in which diagnoses in the Medical Services dataset were coded to five digits; this detailed level of diagnostic coding was important for identifying distinct health conditions.
Disease association was measured using seven different co-occurrence measures: lift (Eq. 1), relative risk (Eq. 2), phi (Eq. 3), Jaccard (Eq. 4), cosine (Eq. 5), Kulczynski (Eq. 6), and joint prevalence (Eq. 7) [31,32,33,34,35,36,37]. Phi and relative risk are two of the most commonly used measures in disease network analysis, while lift is commonly used in conjunction with association rule mining. Jaccard, cosine, and Kulczynski are null-invariant measures that have been suggested for use with sparse data such as disease status datasets [26]. Joint prevalence was included due to its ease of interpretation. These measures are defined for two health conditions (i.e., x and y; see Fig. 1) as:
Lift [31] | \(\mathrm{n}\frac{\mathrm{a}}{\left(\mathrm{a}+\mathrm{b}\right)\left(\mathrm{a}+\mathrm{c}\right)}\) | (1) |
Relative risk [32] | \(\frac{\mathrm{a}\left(\mathrm{c}+\mathrm{d}\right)}{\mathrm{c}\left(\mathrm{a}+\mathrm{b}\right)}\) | (2) |
Phi (ϕ) [33] | \(\frac{\mathrm{ad}-\mathrm{bc}}{\sqrt{\left(\mathrm{a}+\mathrm{b}\right)\left(\mathrm{a}+\mathrm{c}\right)\left(\mathrm{b}+\mathrm{d}\right)\left(\mathrm{c}+\mathrm{d}\right)}}\) | (3) |
Jaccard [34] | \(\frac{\mathrm{a}}{\left(\mathrm{a}+\mathrm{b}+\mathrm{c}\right)}\) | (4) |
Cosine [35] | \(\frac{\mathrm{a}}{\sqrt{\left(\mathrm{a}+\mathrm{b}\right)\left(\mathrm{a}+\mathrm{c}\right)}}\) | (5) |
Kulczynski [36] | \(\frac{1}{2}\left[\frac{\mathrm{a}}{\left(\mathrm{a}+\mathrm{b}\right)}+\frac{\mathrm{a}}{\left(\mathrm{a}+\mathrm{c}\right)}\right]\) | (6) |
Joint prevalence [37] | \(\frac{\mathrm{a}}{\mathrm{n}}\) | (7) |
- where a, b, c, d, and n are defined from the elements of a two-way contingency table.
Statistical significance of disease associations was assessed using the chi-square test when expected frequencies were greater than five, while Fisher’s exact test was used when the chi-square assumption did not hold. Associations that were not statistically significant using α = 0.01 were excluded. Since the focus of this study was on co-occurring disease, the analysis was limited to positive associations. Given that RR is an asymmetric measure of association, the maximum of the two RR measures was used.
The association analysis was limited to disease dyads. Disease association was computed using association rule mining by applying the widely-used Apriori algorithm [38]. Minimum joint frequency (called support in association rule mining) was limited to 15 to minimize statistical errors, and the minimum confidence parameter of association rule mining was left unbounded. Data preprocessing was conducted using SAS (v9.4), while R and the arules package (v1.6-7) was used to perform the association analysis [39].
Covariates
The study cohort was characterized using age, biological sex, number of chronic conditions, residence location (urban or rural), and income quintile, an area-level measure of socioeconomic status based on average household income from the most recently-available (i.e., 2016) Statistics Canada Census [40]. Birthdate and biological sex were extracted from the most recent insurance coverage period, while socioeconomic and urban/rural status were based on the latest residence recorded during the study period. Age was calculated at exit date (i.e., the study index date) and categorized as < 20, 20-39, 40-59, 60 +.
Network analysis
Weighted, undirected pairwise disease networks were separately constructed using each of the seven co-occurrence measures. Disease networks were stratified by the number of associations (i.e., edges) included: (a) all associations, (b) strongest 50% of associations, and (c) strongest 200 associations. Networks based on the strongest 200 associations were used to examine differences in networks that have higher visual interpretability, while networks based on the strongest 50% cut-off were used to examine how network similarity changes when a larger number of associations are included. Effect size estimates were used as edge weights and were bounded between 0 and 1 for networks measured using phi, Jaccard, cosine, and Kulczynski association measures, and unbounded for lift, relative risk, and joint prevalence.
Community structure was identified using a weighted and non-overlapping community detection algorithm developed by Blondel et al. [41]. This algorithm was chosen due to its computational efficiency on large networks and because it is widely used in applied network studies. Central nodes were identified using degree centrality and the centrality analysis was limited to the top 20 central nodes. Degree centrality was chosen because it has a clear interpretation in the context of disease co-occurrence networks (i.e., number of co-occurrence relationships) and the appropriateness of other centrality measures for use with disease co-occurrence networks is uncertain. Disease networks were visualized using the Fruchterman-Reingold force-directed layout algorithm constrained to the strongest 200 associations to improve interpretability [42]. Networks were visualized using Gephi (v0.9.2) and analyzed with Java and Gephi Toolkit (v0.9.2).
Evaluating and comparing disease networks
Disease networks constructed using different co-occurrence measures were compared on their community structure, central nodes, common edges, network complexity, and in terms of the joint prevalence and prevalence difference distributions of their network edges.
Community structure similarity was calculated using the adjusted Rand index (ARI) with the R package aricode (v1.0.0) [43]. Community structure was also characterized by the number of communities identified and modularity, where higher modularity values indicate more distinctive communities. Important nodes, identified using degree centrality, were compared across networks by calculating their agreement, as a percent, on the top 20 central nodes. Edge similarity was calculated using the percent of edges in common between network pairs. Across all networks, overall similarity of community structure, and central node and edge agreement, was quantified using the median and the 25th (Q1) and 75th (Q3) percentiles.
Network edges were compared using categorized prevalence of co-occurring disease pairs. Based on the prevalence distribution across all 167 Johns Hopkins disease categories, disease prevalence was categorized as low (< 1%), moderate (1 to < 7%), and high (≥7%). The percent of edges in each category was used to describe the tendencies of the co-occurrence measures in estimating association strength. Joint prevalence and prevalence difference distributions were described using the median and Q1-Q3 range.
Network complexity was characterized using density (i.e., the ratio of the number of edges present in a network to the number of possible edges between all node pairs), and frequencies of nodes and edges.