Development and evaluation of an algorithm to link mothers and infants in two US commercial healthcare claims databases for pharmacoepidemiology research

Background: Administrative healthcare claims databases are used in drug safety research but are limited for investigating the impacts of prenatal exposures on neonatal and pediatric outcomes without mother-infant pair identification. Further, existing algorithms are not transportable across data sources. We developed a transportable mother-infant linkage algorithm and evaluated it in two large US commercially insured populations.
Methods: We used two US commercial health insurance claims databases during the years 2000 to 2021. Mother-infant links were constructed where persons of female sex 12–55 years of age with a pregnancy episode ending in live birth were associated with a person who was 0 years of age at database entry, who shared a common insurance plan ID, had overlapping insurance coverage time, and whose date of birth was within ± 60 days of the mother's pregnancy episode live birth date. We compared the characteristics of linked vs. non-linked mothers and infants to assess similarity.
Results: The algorithm linked 3,477,960 mothers to 4,160,284 infants in the two databases. Linked mothers and linked infants comprised 73.6% of all mothers and 49.1% of all infants, respectively. Of linked infants' dates of birth, 94.9% were within ± 30 days of the associated mother's pregnancy episode end dates. Characteristics were largely similar in linked vs. non-linked mothers and infants. Differences included that linked mothers were older, had longer pregnancy episodes, and had greater post-pregnancy observation time than mothers with live births who were not linked. Linked infants had less observation time and greater healthcare utilization than non-linked infants.
Conclusions: We developed a mother-infant linkage algorithm and applied it to two US commercial healthcare claims databases; it achieved a high linkage proportion, and linked and non-linked mother and infant cohorts were similar. Transparent, reusable algorithms applied to large databases enable large-scale research on exposures during pregnancy and pediatric outcomes with relevance to drug safety. These features suggest studies using this algorithm can produce valid and generalizable evidence to inform clinical, policy, and regulatory decisions.
Supplementary Information: The online version contains supplementary material available at 10.1186/s12874-023-02073-6.


Background
Pregnancy is characterized by distinct periods of embryonic development representing critical exposure windows for children's health [1]. Exposures before or during pregnancy, including pharmaceuticals, can affect conception, fetal development, pregnancy outcomes, and children's health. While up to 90% of women take medication during pregnancy [2,3], drug safety evidence is scarce because clinical trials often exclude pregnant people [4–6]. Mechanisms for generating pregnancy drug safety evidence are available, such as teratology information services [7], pregnancy and birth registries [8–12], case control studies [13], prospective cohort studies [14], and linked registry and prescription data resources [15]. However, these approaches often lack power to adequately assess rare exposures or outcomes, suffer from information biases, are slow to deliver results, may reflect selected populations, and are resource intensive. This research landscape produces an incomplete understanding of the benefits and risks of prenatal medication use and resultant birth outcomes. Timely and robust evidence is urgently needed in this population, as highlighted by the COVID-19 pandemic and the lack of efficacy and safety data for vaccine receipt during pregnancy.
Calls have been made to use real-world data (RWD) to study medication effects in pregnancy, and such data are increasingly accepted by health authorities as part of post-authorization safety commitments [16,17]. Large administrative healthcare databases are advantageous for pregnancy research because they include large samples, multi-therapeutic area drug dispensing and diagnosis reimbursement claims, and longitudinal patient observation, and because they reflect routine-care clinical practice [18].
Assessing the effects of prenatal exposures on infant outcomes in RWD requires implementing algorithms to define pregnancy episodes and to link live births to infant records, which is challenging in the United States where national health record identifiers are absent. Mother-infant linkage has been conducted using US administrative healthcare databases, including among Medicaid, commercially insured, and Military Health System populations [19–24]. Other efforts, such as the Medication Exposure in Pregnancy Risk Evaluation Program (MEPREP) [25,26], have linked administrative and electronic health record data to state birth records. However, details on linkage confidence and evaluation are sparse [27].
Our study builds on past efforts to create mother-infant linked cohorts in RWD. The objective of this work was to link mother and infant data using two large US commercial insurance databases. We also sought to evaluate the algorithm through comprehensive characterization comparisons between linked and non-linked mothers and infants. In contrast to other linkage studies that use proprietary algorithms, our algorithm is publicly available. The algorithm was developed for use against the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) [28,29], so it may be applicable to similar databases that have been standardized. Our linkage algorithm furthers earlier linkage work based on insurance enrollment ID matching only, by applying additional temporal criteria intended to increase linkage confidence.

Data sources
The study used two health insurance claims databases, IBM® Marketscan® Commercial Database (CCAE) and Optum's de-identified Clinformatics® Data Mart Database (Clinformatics®). Both contain de-identified, patient-level, encounter-based, longitudinal, employer-based US administrative health insurance claims records and include inpatient and outpatient diagnoses, procedures, and outpatient prescription dispensing records. Both databases use a unique insurance enrollment ID for identifying beneficiaries and their dependents under a single, primary insurance holder account. Both databases were transformed to the OMOP CDM, which provides a standardized representation of database structure and clinical content [30] to enable consistent analysis across disparate healthcare databases [31,32]. Detailed database descriptions are in Additional file 1.

Linkage algorithm
The linkage algorithm relies on and is distinct from an algorithm for identifying pregnancy episodes and outcomes [33]. The pregnancy episodes algorithm was previously described, implemented, and validated in several administrative healthcare databases, including those utilized in this study [33]. First, the pregnancy episodes algorithm identified pregnancy outcomes (live births, stillbirths, abortions, and ectopic pregnancies) with associated dates among women aged 12–55 years. Second, it estimated pregnancy start dates using a hierarchy of pregnancy markers, such as last menstrual period, amenorrhea, urine tests, and ultrasounds. The algorithm was validated through clinical adjudication of 700 electronic pregnancy episode profiles from Clinformatics® and the Clinical Practice Research Datalink, which demonstrated high agreement between algorithm results and reviewers on 6 operating characteristics. This algorithm is currently being updated to include gestational age indicators in the ICD-10-CM vocabulary [34,35].

Step 1: identify candidate mothers and infants
We first identified candidate mothers as females whose pregnancy episode(s) ended with live birth and occurred during a period of insurance enrollment.
Multiple periods of insurance enrollment were combined into a single observation period provided gaps between an enrollment period end and the subsequent start date were ≤ 30 days. We identified candidate infants as persons whose year of birth was the same as their first observation period start year (i.e., were 0 years of age at observation period start) and who had an insurance enrollment ID shared with a candidate mother. Candidate infants' date of birth (DOB) was set as year, month, and day. Year of birth was available for all persons in both databases. Month and day were unavailable in the data sources we used because of the patient de-identification process, so we inferred these components from observation period start month and day. Most day of birth values were set as 1 because insurance enrollment typically begins on the first day of a month. We refer to this date as the inferred date of birth, rather than the true date of birth, which we assert is the delivery date of the corresponding linked mother, where links were established. The algorithm will use month and day of birth if available but will set these values to month and day of enrollment start otherwise. This supports algorithm transportability if used in other insurance claims databases where birth date information may or may not be redacted.
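The two data preparation rules in Step 1 (combining enrollment periods separated by ≤ 30 days, and inferring DOB from the enrollment start when month and day are redacted) can be illustrated with a minimal Python sketch. This is not the published source code: the function names, the tuple representation of periods, and the way the gap is counted (calendar days between one period's end and the next period's start) are assumptions made for illustration.

```python
from datetime import date

MAX_GAP_DAYS = 30  # gap threshold stated in the text


def merge_enrollment_periods(periods, max_gap_days=MAX_GAP_DAYS):
    """Combine (start, end) enrollment periods into observation periods.

    Assumption: the gap is the number of calendar days between the end of
    one period and the start of the next; periods are merged when that gap
    is at most max_gap_days.
    """
    merged = []
    for start, end in sorted(periods):
        if merged and (start - merged[-1][1]).days <= max_gap_days:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def infer_dob(year_of_birth, first_obs_start, month=None, day=None):
    """Return the recorded DOB if month and day exist, else an inferred DOB.

    Assumption: when month or day is missing, both are taken from the first
    observation period start, as described for candidate infants whose year
    of birth equals their first observation period start year.
    """
    if month is not None and day is not None:
        return date(year_of_birth, month, day)
    return date(year_of_birth, first_obs_start.month, first_obs_start.day)


# Hypothetical enrollment history: a 15-day gap (merged) and a 60-day gap (kept apart).
periods = [
    (date(2010, 1, 1), date(2010, 6, 30)),
    (date(2010, 7, 15), date(2010, 12, 31)),
    (date(2011, 3, 1), date(2011, 12, 31)),
]
observation_periods = merge_enrollment_periods(periods)
inferred = infer_dob(2010, observation_periods[0][0])
```

Because enrollment typically begins on the first day of a month, an inferred DOB produced this way will usually fall on day 1, which is why the later steps compare it to the delivery date within a tolerance window rather than requiring an exact match.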

Step 2: identify candidate mother-infant links
We identified candidate links between mothers and infants where they matched on insurance enrollment ID and the candidate infant's inferred DOB occurred during a candidate mother's observation period.

Step 3: classify probable mother-infant links
We identified probable links between mothers and infants by restricting to those where the candidate infant's DOB occurred within ± 60 days of the candidate mother's pregnancy episode end date. This correspondence window was varied in a sensitivity analysis (Additional file 1).
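The effect of varying the correspondence window can be sketched as the share of candidate links retained at each window width. The snippet below is illustrative only: the function name is hypothetical, and the day offsets (inferred DOB minus pregnancy episode end date) are invented values, not study data.

```python
def share_within_window(day_offsets, window_days):
    """Proportion of offsets whose absolute value is <= window_days."""
    if not day_offsets:
        return 0.0
    hits = sum(1 for offset in day_offsets if abs(offset) <= window_days)
    return hits / len(day_offsets)


# Hypothetical offsets in days (infant inferred DOB minus mother's episode end date).
offsets = [0, 0, 1, 5, 12, 21, 28, 45, 75, -3]

# Share of candidate links that would qualify under each candidate window.
shares = {w: share_within_window(offsets, w) for w in (0, 7, 14, 30, 60, 90)}
```

A curve that flattens well before the chosen window suggests that widening the window further adds few links, which is the kind of evidence a sensitivity analysis of this sort is meant to provide.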

Step 4: exclude ambiguous mother-infant links
In Step 2, we identified rare instances where multiple mothers could be associated with a single infant. These records were excluded from analysis.
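Steps 2 to 4 can be summarized in a short Python sketch. It is a simplified illustration under stated assumptions, not the authors' OMOP CDM implementation: the record layout (dictionaries with `enrollment_id`, `obs_start`, `obs_end`, `episode_end`, and `inferred_dob` keys) and the function name are hypothetical, each mother row represents one live-birth pregnancy episode, and the ambiguity exclusion is applied to probable links.

```python
from collections import defaultdict
from datetime import date


def link_mothers_infants(mothers, infants, window_days=60):
    """Return sorted (mother_id, infant_id) pairs that pass Steps 2-4."""
    mothers_by_plan = defaultdict(list)
    for m in mothers:
        mothers_by_plan[m["enrollment_id"]].append(m)

    probable = defaultdict(set)  # infant_id -> ids of mothers with a probable link
    for infant in infants:
        dob = infant["inferred_dob"]
        for m in mothers_by_plan.get(infant["enrollment_id"], []):
            # Step 2: shared enrollment ID and DOB inside the mother's observation period.
            in_observation = m["obs_start"] <= dob <= m["obs_end"]
            # Step 3: DOB within +/- window_days of the pregnancy episode end date.
            in_window = abs((dob - m["episode_end"]).days) <= window_days
            if in_observation and in_window:
                probable[infant["infant_id"]].add(m["mother_id"])

    # Step 4: drop infants associated with more than one mother.
    return sorted(
        (next(iter(mother_ids)), infant_id)
        for infant_id, mother_ids in probable.items()
        if len(mother_ids) == 1
    )


def mother(mother_id, plan, episode_end):
    """Build a hypothetical mother record with a fixed observation period."""
    return {
        "mother_id": mother_id,
        "enrollment_id": plan,
        "obs_start": date(2014, 1, 1),
        "obs_end": date(2016, 12, 31),
        "episode_end": episode_end,
    }


# Hypothetical records: plan A links cleanly, plan B is ambiguous, plan C is out of window.
mothers = [
    mother("M1", "A", date(2015, 5, 10)),
    mother("M2", "B", date(2015, 3, 1)),
    mother("M3", "B", date(2015, 3, 20)),
    mother("M4", "C", date(2015, 1, 1)),
]
infants = [
    {"infant_id": "I1", "enrollment_id": "A", "inferred_dob": date(2015, 6, 1)},
    {"infant_id": "I2", "enrollment_id": "B", "inferred_dob": date(2015, 3, 1)},
    {"infant_id": "I3", "enrollment_id": "C", "inferred_dob": date(2015, 6, 1)},
]
links = link_mothers_infants(mothers, infants)
```

Excluding ambiguous infants rather than choosing one mother trades some coverage for confidence in each retained link; a mother can still be linked to several infants, which accommodates multiple births.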

Cohorts used in algorithm evaluation
Nine cohorts were constructed to compare characteristics between linked vs. non-linked mothers and infants. The index date refers to the temporal reference against which covariates were constructed.
1) Mothers linked to ≥ 1 infant indexed at pregnancy episode start.
2) Mothers linked to ≥ 1 infant indexed at pregnancy episode end.
3) Infants linked to a mother indexed at inferred DOB.
4) Mothers not linked to an infant indexed at pregnancy episode start.
5) Mothers not linked to an infant indexed at pregnancy episode end.
6) Infants not linked to a mother indexed at inferred DOB.
7) Candidate mothers indexed at pregnancy episode start.
8) Candidate mothers indexed at pregnancy episode end.
9) Candidate infants indexed at inferred DOB.
Note that cohorts 7, 8, and 9 were constructed to create cohorts 4, 5, and 6. For example, cohort 4 equals mothers in cohort 7 with mothers from cohort 1 removed. Cohorts 1-3 and 4-6 were used in characteristic comparisons.

Characterization analyses
We characterized mother cohorts using demographic, clinical, and healthcare utilization covariates relative to each index date: once with covariates that reflect events observed during the year before or on the pregnancy episode start date (reported in Table 1), and again with covariates that reflect events observed during the year before or on the delivery date (reported in Table 2). The intent of Table 1 is to describe pre-pregnancy characteristics, whereas the intent of Table 2 is to describe characteristics that occur mostly during pregnancy (recognizing the limitation that approximately 3 months of the one-year covariate construction window is before pregnancy start). We characterized the infant cohorts with covariates that reflect events observed on or during the year after the inferred DOB. See Additional file 1 for details on how demographic, clinical, and healthcare utilization covariates were measured. For example, if a procedure code for a basic metabolic panel was observed on a patient record 3 months before delivery date, a measurement covariate would be constructed indicating that the test was performed but it would not include any lab results.
Lastly, we compared characteristics between linked vs. non-linked mothers and infants to evaluate differences between populations that did and did not meet linkage algorithm criteria. We made covariate comparisons by calculating the standardized mean difference (SMD) for each covariate in units of the pooled standard deviation, a metric uninfluenced by large sample sizes [36], and interpreted SMD values > 0.1 as meaningfully different [37,38].
Figure 2 depicts the comparative prevalence of demographic, drug exposure, condition, procedure, and measurement occurrence covariates for the linked vs. non-linked mother and infant cohorts.
The plots illustrate that the characteristics of linked and non-linked mothers were generally similar. However, infant characteristics, including conditions, measurements, drugs, and procedures, were more prevalent among linked vs. non-linked infants. Large SMD covariates with greater prevalence among the linked infants included procedural billing records related to infant care, infant screening procedures, immunizations, and some conditions (see web application to review all characteristics). We also observed a greater prevalence of birth-related covariates among linked infants than non-linked infants (e.g., "Single live birth", "Finding related to pregnancy"). Despite these differences, we still observed absolute SMDs of < 0.1 for > 99% of covariates across all algorithm implementations of each linked vs. non-linked comparison in both databases, where the number of covariate comparisons ranged from 58,611 (CCAE infants) to 68,368 (Clinformatics® mothers pregnancy end).
Table 2 reports the same characteristics as Table 1, except for pregnancy episode length, but measured relative to pregnancy episode end date. Age was greater among linked mothers in CCAE (32.0 vs. 30.9 years), which reflects the slightly greater linked pregnancy episode lengths reported above. There was greater post-pregnancy observation time among linked mothers in both databases (CCAE: 1084 vs. 690 days, Clinformatics®: 948 vs. 660 days). Although uncommon, emergency room visits were greater among non-linked mothers in CCAE (0.7 vs. 0.3).
Table 3 reports characteristics and SMDs of linked vs. non-linked infants for several characteristics measured at their inferred birth dates (enrollment start date). Non-linked births were more common in the early study period (2000-2003) in both databases. There was greater average post-birth observation time among linked infants in both databases (CCAE: 1060 vs. 886 days, Clinformatics®: 855 vs. 751 days). Average condition (CCAE: 6.8 vs. 5.7, Clinformatics®: 7.8 vs. 6.4) and procedure (CCAE: 11.6 vs. 9.9, Clinformatics®: 12.3 vs. 10.0) occurrences were greater among linked infants.
The final person and record counts for each of the 9 cohorts constructed by the 3 linkage algorithm implementations in each database are reported in Additional file 1. Result sets for the two algorithm sensitivity implementations are also reported in Additional file 1. We observed similar stepwise attrition proportions across sensitivity implementations. Attrition proportions in the first births sensitivity implementation were greater in Step 3 because this is where first birth restrictions were made. There were no appreciable differences in linked vs. non-linked mother and infant characteristics across algorithm sensitivity implementations.
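The SMD used for the linked vs. non-linked comparisons above can be sketched in Python. The formulas below are one common definition (difference in means or proportions divided by the square root of the average of the two group variances); whether the study used exactly this variant is an assumption, and the function names and example numbers are hypothetical.

```python
import math

SMD_THRESHOLD = 0.1  # values above this were interpreted as meaningfully different


def smd_continuous(mean1, sd1, mean2, sd2):
    """SMD for a continuous covariate, in units of the pooled standard deviation."""
    pooled_sd = math.sqrt((sd1 ** 2 + sd2 ** 2) / 2)
    return (mean1 - mean2) / pooled_sd if pooled_sd > 0 else 0.0


def smd_binary(p1, p2):
    """SMD for a binary covariate given its prevalence in each cohort."""
    pooled_sd = math.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / 2)
    return (p1 - p2) / pooled_sd if pooled_sd > 0 else 0.0


def is_meaningfully_different(smd, threshold=SMD_THRESHOLD):
    """Apply the > 0.1 interpretation rule to the absolute SMD."""
    return abs(smd) > threshold


# Hypothetical inputs, not study results.
example_continuous = smd_continuous(10.0, 2.0, 9.0, 2.0)
example_binary = smd_binary(0.6, 0.4)
```

Because the SMD divides by a standard deviation rather than a standard error, it does not shrink or grow with cohort size, which is why it suits comparisons across millions of records where nearly any difference would be statistically significant.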

Discussion
We developed and implemented an algorithm to infer mother-infant links in two large US commercial healthcare databases that exhibited high linkage coverage and similar characteristics across linked vs. non-linked persons. This signifies generalizability of linked mother-infant pairs to commercially insured source populations, which facilitates large-scale research on prenatal exposures and infant outcomes. This constitutes novel research by virtue of our emphasis on linked vs. non-linked characterization comparisons to support generalizability. Similarity of measured characteristics in linked vs. non-linked mother and infant records is supporting evidence that results produced by analyzing linked cohorts will generalize to the underlying source population, in this case commercially insured pregnant people and their infants. Our assessment of average linked-infant follow-up time (Clinformatics®: 855 days, CCAE: 1060 days) allows their inclusion in perinatal-exposure studies where outcomes of interest are not birth outcomes per se but longer-term infant conditions. Further, our linkage algorithm was implemented in the OMOP CDM, and the source code is publicly available. The utility of using standardized analytic routines against a standard data representation allows for transportable, complex algorithms to be implemented in other claims databases formatted to the OMOP CDM with no loss of fidelity [39].
Our algorithm identified > 3.4 million linked mothers and > 4.1 million linked infants. Access to large, linked populations makes feasible the study of a wide range of prescription drug exposures, maternal and neonatal outcomes, and subgroups that are often unavailable in smaller linked populations [40,41] and registries [18,42,43]. This approach requires fewer study resources compared to studies that require primary data collection [44].
Across databases, linked mothers comprised 73.6% of all mothers with live births. In Clinformatics®, 77.3% of mothers were successfully linked to infants, which is lower but comparable to the 84% reported in a recent study using data from the same source with fewer linkage restrictions [19]. Despite similar methods, other linkage studies have reported mixed linkage coverage, suggesting that differences are due to data accuracy and/or availability variation across sources. Palmsten et al. linked Medicaid-enrolled mothers and infants and reported linkage coverage of 55.6% for inpatient deliveries, although with considerable variation by state (0-96%) [23], which the authors attributed to varying family identifier quality and use. A study in TRICARE enrollees in the Military Health System reported 90% of pregnancies ending in live births were linked with infants [24], which may be attributable to lower insurance coverage churn.
In our study, linked infants comprised 49.1% of all infants defined as persons 0 years of age at their observation period start. Contextualizing our linked infant coverage is difficult because most studies only report the proportion of linked pregnancies [19,23]. However, Garbe et al. conducted a study using the German Pharmacoepidemiological Research Database (GePaRD), a claims database from four statutory health insurance providers, and reported that 77.3% of newborns were linked with mothers [45]. Additionally, a study among Medicaid enrollees in Tennessee reported 97% of infants were linked with a delivery; however, such high coverage is likely explained by the use of vital record data with identifying information [41].
While our primary analysis used a ± 60-day window between infant DOB and mothers' pregnancy episode end to identify candidate links, in sensitivity analyses, we observed high correspondence at 7, 14, and 30 days, including same-day correspondence of 31.3% in CCAE and 67.4% in Clinformatics®. Overall correspondence was greater in Clinformatics®, which may be due to more accurate and specific DOB information. Increasing the correspondence window to 90 days increased the proportion of linked infants by only 1.5% in CCAE and 0.2% in Clinformatics®, which we do not interpret as material because most of the correspondence occurred within ± 30 days.
Characteristic comparisons between linked and non-linked mothers revealed similar demographic, clinical, and healthcare utilization profiles. Our linkage evaluation largely supports the generalizability of the linked mother population, having compared thousands of covariates between linked and non-linked mother cohorts and observing few differences. Of note, two of the differences we found in both CCAE and Clinformatics® were also detected in a recent study using the Sentinel network: non-linked mothers were younger and had shorter gestations than linked mothers [46]. It is possible that these consistent differences are due to unmeasured factors associated with mother and infant not sharing the same insurance policy. Despite this, we found that SMDs were < 0.1 for nearly all observed characteristics, suggesting that in a prenatal exposure study on a small, exposed subset of the linked mother population, systematic differences between the study sample and non-linked mothers to whom the results will apply will be minimal.
Despite substantial similarity between linked and non-linked infants, we observed more differences than when comparing mothers. Of note, linked infants had greater total healthcare utilization and prevalence of individual clinical events, including birth and infant-care related claims. Because our algorithm linked mothers to infants by a shared insurance ID within a defined temporal interval, candidate infants whose inferred DOB fell outside of that interval would be non-linked and less likely to have their clinical events captured in the database. This suggests that some billing records may be attributed to family members on other insurance plans among the non-linked populations. Still, this evaluation supports the generalizability of the linked infant population. In cases of multiple pregnancies, linkages are made between maternal records and all live births from that pregnancy. In the event of a multifetal pregnancy ending in both a live birth and a stillbirth, codes associated with the stillborn infant may be observed on the live-born infant record or on that of the linked mother. The occurrence of this scenario is expected to be very rare, with less than 0.5% of multifetal pregnancies experiencing a stillbirth in one study [47].
Despite CCAE and Clinformatics® both consisting of administrative claims data from large US commercial insurance plans, data content heterogeneity between them still exists, which could contribute to differences in results between them. In Clinformatics®, we observed more situations where multiple women of child-bearing age were associated with one candidate infant. This suggests that more extended family members may be included on the same insurance plan in Clinformatics® than in CCAE, which would increase the situations where one infant is associated with > 1 candidate mother on the same health plan. Regarding selection bias, by excluding multiple women of child-bearing age on the same insurance plan, we may be selectively decreasing representation of large, varied families covered in the Clinformatics® database.
We note that several recent studies have established mother-infant linkage algorithms in claims databases with similar methods to the one described in this paper [23,24,34,45,48]. Specifically, linkage algorithms in Clinformatics® [34] and in CCAE [48] used infant dates of birth, maternal delivery dates, and family insurance IDs to link delivering mothers with infants. The algorithm used in CCAE captured slightly more links because it did not restrict to live births initially, had a wider correspondence window allowance, and, when multiple mothers or pregnancies were associated with a single infant, selected the earliest, whereas our approach excluded those ambiguous links. While other studies have used related methods successfully [23,24,34,45,48], we show that our standardized approach works across multiple databases. The algorithm presented in this study offers a reproducible framework that can be implemented across different databases, particularly those transformed to the OMOP CDM. Further, we have characterized the populations of linked and unlinked mothers and children, which aids in contextualizing the output of these linkages and implications for their use in future research.
A strength of our study is the rigorous linkage approach utilizing insurance ID in addition to delivery and birth procedure dates in large US claims databases representing the commercially insured population. Further, we provide open-source code and a web application to interactively review characterization results, which provides valuable context for the external validity of future studies among linked populations. Lastly, developing a reproducible mother-child linkage algorithm in large administrative databases facilitates evidence generation in pregnant populations with improved rigor by avoiding recall, referral, and self-selection biases inherent to registry or other primary data collection studies of prenatal medication use [18].
Using administrative healthcare claims databases in pharmacoepidemiologic research has limitations. Erroneously coded or missing diagnostic, procedure, and drug dispensing records result in misclassification, which may under- or over-estimate exposures, covariates, health outcomes, other clinical events, and healthcare utilization [49]. Subsequent information bias that can result from misclassification is underappreciated [50] and could bias findings of future drug safety studies. Further, because the data do not provide exact date of birth information for non-linked infants, estimating event prevalence during the 365 days post-birth is imprecise. This may result in misclassification by failing to capture events specifically related to the birth encounter itself. We observed these birth-related conditions and procedures as imbalanced in Fig. 2 and the infants tab of the web application.
Still, developing reliable mother-infant linkages in large healthcare databases has increased the capacity to examine associations between rare prenatal drug exposures and infant outcomes with sufficient power. For example, prenatal use of antidepressants, stimulants, antihypertensive medications, and sulfonamides has been studied in relation to validated congenital anomalies [51–55]. This has yielded needed real-world evidence on the safety of prenatal exposures.
While the few differences we found between linked and non-linked populations suggest high internal validity with respect to the underlying commercially insured US population, our results do not necessarily ensure external validity to those covered under other types of insurance or lacking coverage. The data in this study are representative of people with US-based, employer-sponsored health insurance, indicative of a higher socioeconomic status population. Given the established association between wealth and health [56,57], care should be taken not to assume that the linked vs. non-linked similarity we observed is consistent across other socioeconomic demographics. Further, administrative healthcare databases include detailed outpatient drug dispensing records but provide fewer details on inpatient dispensing records, prescriptions, or administrations typically available in electronic medical records. Additionally, we note that pregnancy episode length was slightly shorter in non-linked pregnancies (Table 1), but we do not believe this difference could substantially influence observed linked vs. non-linked maternal covariate differences in the year before birth, which were few. Lastly, our study has not been validated. However, validation of a similar algorithm developed in claims data among Medicaid beneficiaries showed high positive predictive value [58].

Conclusions
Our study reinforces the shift towards implementing pharmacoepidemiology studies on prenatal drug exposures utilizing large electronic healthcare data as a supplement to traditional pregnancy registries. Our algorithm and evaluation demonstrate the ability to assemble large mother-infant linked cohorts for investigating prenatal drug exposure effects on infant outcomes.

Fig. 1 Mother-infant linkage algorithm attrition diagram. Panel A: IBM® Marketscan® Commercial Database. Panel B: Optum de-identified Clinformatics® Data Mart Database. Footnote: Candidate mothers: women whose pregnancy episode(s) ended with live birth and occurred during a mother's observation period; Candidate infants: persons who were 0 years of age at observation period start; Candidate links: mother-infant pairs who matched on insurance enrollment ID and where the infant's date of birth occurred during a candidate mother's observation period; Probable links: candidate links where the candidate infant's date of birth occurred within ± 60 days of the candidate mother's pregnancy episode end date; Inferred links: probable links remaining after removal of those where multiple mothers were associated with one infant

Results
All source code and an interactive web application for viewing full results are available at https://data.ohdsi.org/MotherInfantLinkEval/. A reader can navigate to this web-based application to review the full characterization result set for each linked vs. non-linked comparison. By default, the table reports characteristic prevalence results for linked vs. non-linked cohorts sorted from largest to smallest standardized mean difference between characteristic prevalence. Additionally, a reader can search for characteristics of interest using the search bar.
Figure 1 depicts step-by-step attrition of the linkage algorithm. In CCAE, 3,064,263 candidate mothers and 2,942,216 candidate infants were identified in Step 1, of whom 26.8% and 1.4% were dropped, respectively, during Step 2, resulting in 2,915,538 candidate links. Links were reduced by 13.2% and 0.1% in Steps 3 and 4, respectively, which resulted in 2,528,482 links: 2,146,726 linked mothers and 2,528,482 linked infants. Of linked infants' DOBs, 31.3% were on the same day as their linked mother's pregnancy episode end date, and 58.3%, 71.5%, and 92.1% occurred within ± 7 days, ± 14 days, and ± 30 days, respectively. Linked infants' DOB was on average 5.9 days (SD = 15.1, median = 1) after the pregnancy episode end date. Linked mothers comprised 70.1% of all mothers (n = 3,064,263) and linked infants comprised 51.2% of all infants (n = 4,935,376) (Additional file 1).
Table 1 reports characteristics and SMDs of linked vs. non-linked mothers for several characteristics measured during the 365 days before and including pregnancy start.

Table 1
Selected characteristics of linked and non-linked mothers, measured 365 days before and including pregnancy start
CCAE: IBM® Marketscan® Commercial Database; Clinformatics®: Optum's de-identified Clinformatics® Data Mart Database; SMD: Standardized difference of means

Table 2
Selected characteristics of linked and non-linked mothers, measured 365 days before and including pregnancy end
CCAE: IBM® Marketscan® Commercial Database; Clinformatics®: Optum's de-identified Clinformatics® Data Mart Database; SMD: Standardized difference of means

Table 3
Selected characteristics and standardized differences of linked and non-linked infants
Healthcare utilization (i.e., outpatient and inpatient visits) was similarly greater among linked infants.