Investigating linkage rates among probabilistically linked birth and hospitalization records

Background With the increasing use of probabilistically linked administrative data in health research, it is important to understand whether systematic differences occur between the populations with linked and unlinked records. While probabilistic linkage involves combining records for individuals, population perinatal health research requires a combination of information from both the mother and her infant(s). The aims of this study were to (i) describe probabilistic linkage for perinatal records in New South Wales (NSW) Australia, (ii) determine linkage proportions for these perinatal records, and (iii) assess records with linked mother and infant hospital-birth record, and unlinked records for systematic differences. Methods This is a population-based study of probabilistically linked statutory birth and hospital records from New South Wales, Australia, 2001-2008. Linkage groups were created where the birth record had complete linkage with hospital admission records for both the mother and infant(s), partial linkage (the mother only or the infant(s) only) or neither. Unlinked hospital records for mothers and infants were also examined. Rates of linkage as a percentage of birth records and descriptive statistics for maternal and infant characteristics by linkage groups were determined. Results Complete linkage (mother hospital record – birth record – infant hospital record) was available for 95.9% of birth records, partial linkage for 3.6%, and 0.5% with no linked hospital records (unlinked). Among live born singletons (complete linkage = 96.5%) the mothers without linked infant records (1.6%) had slightly higher proportions of young, non-Australian born, socially disadvantaged women with adverse pregnancy outcomes. The unlinked birth records (0.4%) had slightly higher proportions of nulliparous, older, Australian born women giving birth in private hospitals by caesarean section. Stillbirths had the highest rate of unlinked records (3-4%). 
Conclusions This study shows that probabilistic linkage of perinatal records can achieve high, representative levels of complete linkage. Records for mothers that did not link to infant records and unlinked records had slightly different characteristics to fully linked records. However, these groups were small and unlikely to bias results and conclusions in a substantive way. Stillbirths present additional challenges to the linkage process due to lower rates of linkage for lower gestational ages, where most stillbirths occur.


Background
The ability to conduct linkage of perinatal records, obtained as part of routinely collected administrative health data, has increased the scope for population-based studies of mother and infant health [1]. When a unique identifier is available, deterministic linkage is used to identify records for the same person [2,3]. When no unique identifiers are available, however, increasingly large databases are being linked using probabilistic linkage methods. While probabilistic linkage usually involves combining records for individuals, perinatal research typically requires a combination of information from both the mother and her infant(s).
Mismatches are possible with probabilistic linkage. Two different individuals could be linked resulting in incorrectly reported outcomes or risk factors (false positive links), or two records from the same individual may not be linked (false negative links), resulting in missing information. The success of linkage, often described in terms of minimizing mismatches, can depend upon a number of factors, including the quality of the information used in the linkage process and how uniquely identifying reported information is. Recent studies have shown that, unlike deterministic methods, the flexibility of probabilistic record linkage allows for minimization of mismatches under variations in data quality [24]. With the potential for mismatches it is important to consider the possibility of systematic biases that may arise between linked and unlinked populations of records. Researchers are becoming increasingly aware of the potential bias created by excluding unlinked records, and more recently this has prompted a publication of guidelines for reporting studies using linked data [25].
The aims of this study were to (i) describe probabilistic linkage for perinatal records in New South Wales (NSW), Australia, (ii) determine linkage proportions for these perinatal records, and (iii) assess whether records with complete linkage of the mother and infant hospital records to the birth record differ systematically from unlinked records.

Data sources
This study used linked records of the NSW Perinatal Data Collection (PDC) and the NSW Admitted Patient Data Collection (APDC). The PDC (referred to as 'birth records') is a population-based statutory surveillance system that includes all live births and stillbirths of at least 20 weeks gestation or, if gestational age is not known, of at least 400 grams birth weight, and includes information on maternal characteristics, pregnancy, labor and delivery factors and infant outcomes. 'Hospital records' (for mothers and infants) that relate to the birth (birth admission records) were obtained from the APDC, which includes demographic and hospitalization related data for every inpatient admitted to any public or private hospital in NSW. Diagnoses and procedures for each hospital admission are coded according to the 10th revision of the International Classification of Disease, Australian Modification (ICD10-AM) and the Australian Classification of Health Interventions (ACHI).

Study population
The study population included all mothers who gave birth, and their infants, in NSW, Australia, from 1 January 2001 to 31 December 2008. NSW is the largest state in Australia by population, with around 7,287,600 people representing 32% of the Australian population [26]. Homebirths (0.2%) as identified in the birth records were excluded as these would not have a linked hospital birth admission.

Probabilistic record linkage
Birth, and maternal and infant hospital records for 2001 to 2008 were probabilistically linked [27] by the Centre for Health Record Linkage (CHeReL) [23] using a best practice approach in privacy preserving record linkage [28] and the open source probabilistic record linkage software Choice Maker [29]. Best practice involves ensuring separation of personal identifiers and health information. The CHeReL receives personal identifiers only (i.e. no health information) from the data custodians to generate a linkage key, and a linkage key is returned to the data custodians. Finally, researchers receive only health information and a linkage key from the data custodians.
The link between the mother and infant is provided by the common birth record. Probabilistic linkage is used to link records for the same individual, and in this context the outline that follows is in reference to linking the mothers' birth and hospital records, and the infants' birth and hospital records.
The CHeReL used a variety of fields that are common to both datasets for matching records in the linkage process. These include first name, last name, address, sex, date of birth, and country of birth. Additional information used, where available, includes hospital code and medical record number (MRN), admission date, discharge date, hospital discharged from, hospital discharged to, alias names, plurality and birth order for multiple pregnancies (twins, triplets and higher order multiple pregnancies).
Standardization and parsing techniques are used to allow a comparison of common fields and to facilitate matching. As a first stage, blocking is used to quickly search the target database for records that are possible matches. 'Blocking' is an automated algorithm designed to find as many records as possible that potentially match each other without exceeding a given and manageable block size. This increases the efficiency of the more accurate second stage of matching by reducing the number of pairs that must be compared in detail. Records within the same block are scored during the second stage of matching. 'Scoring' generates the probability that two records match based on a series of weighted 'clues'. Clues (known as 'features' in Artificial Intelligence literature) are attributes of records that are suggestive of match or non-match decisions. Examples of clues are that the date of birth does not match, or there is a match on the phonetic code for the first name. Phonetic codes are generated from coding schemes such as Soundex and the New York State Identification and Intelligence System (NYSIIS). These reduce the effect of minor typographical errors or spelling variations by assigning the same codes to words or syllables with similar pronunciation, e.g. Robert and Rupert. The weight for each clue has been derived using previously matched data and a machine learning process called Maximum Entropy Modeling. During the scoring process these weights are combined using a formula based on maximum entropy theory to create a probability between 0 and 1 that two records match. Upper and lower probability cut-offs (thresholds) determine whether records are classified as matches, non-matches, or possible matches requiring clerical review (Figure 1). The CHeReL initially uses upper and lower probability cut-offs of 0.75 and 0.25 and adjusts these manually for each individual linkage to minimize false and missed links.
Groups of records with indeterminate probabilities are reviewed manually to determine whether they should be classified as a match or not.
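The phonetic coding, scoring and threshold steps described above can be illustrated with a minimal Python sketch. It is not the linkage software's implementation: the clue names, the weights and the logistic combination of weights are hypothetical stand-ins for a trained maximum entropy model, and the treatment of probabilities exactly equal to a cut-off is an assumption. Only the Soundex scheme and the initial cut-offs of 0.75 and 0.25 come from the text.

```python
import math

# Standard Soundex consonant groups; vowels, H, W and Y carry no digit.
SOUNDEX_CODES = {}
for letters, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")]:
    for ch in letters:
        SOUNDEX_CODES[ch] = digit


def soundex(name):
    """Return the four-character Soundex code for a name."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    prev = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        # H and W do not separate repeated codes; vowels do.
        if ch not in "HW":
            prev = digit
    return (code + "000")[:4]


def match_probability(clues, weights, bias=0.0):
    """Combine the weights of the clues that fired into a probability.

    For a two-class model the maximum entropy form reduces to a logistic
    function of the summed weights (hypothetical weights, for illustration).
    """
    z = bias + sum(weights[name] for name, fired in clues.items() if fired)
    return 1.0 / (1.0 + math.exp(-z))


def classify(p, lower=0.25, upper=0.75):
    """Apply the initial cut-offs; boundary handling is an assumption."""
    if p >= upper:
        return "match"
    if p <= lower:
        return "non-match"
    return "clerical review"


# Hypothetical clue weights, not values used by the CHeReL.
weights = {"first_name_phonetic_match": 1.5,
           "last_name_exact_match": 2.0,
           "date_of_birth_mismatch": -3.0}
clues = {"first_name_phonetic_match": soundex("Robert") == soundex("Rupert"),
         "last_name_exact_match": True,
         "date_of_birth_mismatch": False}
decision = classify(match_probability(clues, weights))
```

In this sketch Robert and Rupert share a phonetic code, so the first-name clue fires even though the spellings differ; a date of birth mismatch would pull the same pair back towards clerical review or a non-match.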
The CHeReL undertakes quality assurance for any data linkage and assesses the linkage quality by manually reviewing personal identifiers for a sample of the records obtained for linkage. For this project, the CHeReL reported the linkage quality as < 1/1,000 missed links and < 2/1,000 false positive links.

Linkage groups
For this study we defined six different groups of records based on the linkage configuration. The 'linked mothers and infants' group includes birth records with a linked hospital admission for both the mother and the infant(s), representing the 'complete' set of perinatal records. The 'mothers only' group includes birth records with a linked hospital birth admission record for the mother but without one for the infant, while the 'infants only' group includes birth records with a linked hospital birth admission record for the infant but without one for the mother. These two groups represent the 'partial linkage' groups. Finally, there are three different groups of unlinked records. The first is 'unlinked birth records' which includes birth records without a linked birth admission record for either the mother or the infant. The second is the 'unlinked maternal hospital records' which includes hospital birth admission records identified for a pregnancy that did not link to the birth records. The third is the 'unlinked infant hospital records' which includes hospital birth admission records identified for infants that did not link to the birth record.
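The six linkage groups defined above amount to a small decision rule on which links exist for a record. A minimal Python sketch follows; the function names and boolean flags are hypothetical and are not taken from the study datasets.

```python
from collections import Counter


def birth_record_group(mother_linked, infant_linked):
    """Assign a birth record to one of the four birth-record groups."""
    if mother_linked and infant_linked:
        return "linked mothers and infants"   # complete linkage
    if mother_linked:
        return "mothers only"                 # partial linkage
    if infant_linked:
        return "infants only"                 # partial linkage
    return "unlinked birth records"


def hospital_record_group(record_type, linked_to_birth):
    """Name the group for a hospital birth admission with no birth record.

    record_type is 'mother' or 'infant'. Returns None when the admission did
    link to a birth record, because it is then counted via the birth record.
    """
    if record_type not in ("mother", "infant"):
        raise ValueError("record_type must be 'mother' or 'infant'")
    if linked_to_birth:
        return None
    if record_type == "mother":
        return "unlinked maternal hospital records"
    return "unlinked infant hospital records"


# Illustrative flags (mother_linked, infant_linked), not study data.
flags = [(True, True), (True, True), (True, False), (False, True), (False, False)]
counts = Counter(birth_record_group(m, i) for m, i in flags)
percentages = {group: 100 * n / len(flags) for group, n in counts.items()}
```

Tabulating the groups this way gives the linkage rates as a percentage of birth records, which is the denominator used for the rates reported in this study.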

Stillbirths and plurality
Stillbirths are reported on the mother's hospital birth admission record and do not usually generate a hospital admission record for the infant. Therefore most will not have completely linked mother and infant records. Further, there may be misclassification of stillbirths and miscarriages, and it has been indicated previously that linkage for stillbirths is problematic [30]. Linking is conducted separately for singleton and multiple pregnancies, as multiple pregnancies generate infant records with identical information such as mother's name, date of birth, hospital of birth and even sex, so extra care is required [31,32].

Identification of hospital birth admission records
ICD10-AM [33] diagnosis and ACHI procedure codes, and administrative information, were used to identify hospital birth admission records for mothers and infants independently of the birth record.
Infant birth admissions were initially selected where records indicated an age of 0-1 days and either a live birth (ICD10-AM = Z38), born in hospital, or a birth weight and an ICD10-AM code for a condition of the perinatal period. For those records that linked to the birth record, we required the admission date to be within ±1 day of the date of birth and the hospital of birth reported on the hospital record to match that reported on the birth record (Table 1).
Maternal hospital records for the birth admission were initially selected where there were any ICD10-AM diagnosis or procedure codes reported for delivery. We also required the same hospital of birth to be reported by the hospital and birth record, and the date of birth to have occurred during the period between the admission and separation dates for the selected birth admission record (Table 2).
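The consistency requirements applied to linked records (admission within ±1 day of the date of birth for infants, date of birth between admission and separation for mothers, and the same hospital of birth on both records) can be sketched in Python as follows. The function and parameter names are hypothetical, and treating the admission and separation dates as inclusive bounds is an assumption.

```python
from datetime import date


def infant_admission_consistent(admission_date, birth_date,
                                hospital_on_admission, hospital_on_birth_record):
    """Infant rule: admitted within one day of birth, at the same hospital."""
    within_one_day = abs((admission_date - birth_date).days) <= 1
    return within_one_day and hospital_on_admission == hospital_on_birth_record


def maternal_admission_consistent(admission_date, separation_date, birth_date,
                                  hospital_on_admission, hospital_on_birth_record):
    """Maternal rule: birth falls within the admission, at the same hospital."""
    during_stay = admission_date <= birth_date <= separation_date  # inclusive (assumed)
    return during_stay and hospital_on_admission == hospital_on_birth_record


# Illustrative values only; 'H001' and 'H002' are made-up hospital codes.
infant_ok = infant_admission_consistent(date(2005, 3, 2), date(2005, 3, 1), "H001", "H001")
mother_ok = maternal_admission_consistent(date(2005, 2, 28), date(2005, 3, 4),
                                          date(2005, 3, 1), "H001", "H001")
```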

Variables
Maternal variables compared between linkage groups were gestation at which antenatal care commenced, marital status, country of birth (Australia/other), birth in a private hospital, delivery by caesarean section, diabetes, hypertension, induction of labor, maternal age, parity (number of previous births), smoking during pregnancy, placenta praevia, placental abruption, duration of pregnancy less than 26 weeks gestation and socio-economic status (Australian Bureau of Statistics Socio-Economic Index For Areas, Index of Relative Socio-economic Disadvantage) [34]. Infant variables compared across linkage groups were admission to a special care nursery (SCN) or neonatal intensive care unit (NICU), Apgar score at one minute less than 4, sex, birth weight, death in hospital, and gestational age. All variables, except for marital status, placental abruption and placenta praevia, were available from the birth record and, where possible, were also obtained from the hospital birth admission records using diagnosis and procedure codes (Table 3).

Analysis
Reported for all births are (i) rates of linkage for the birth-hospital record linkage groups by plurality and live born/stillborn as a percentage of all birth records and (ii) rates of identification for deliveries and births as ascertained from the hospital birth admissions as a percentage of the number of deliveries/births reported in the birth records. Note that delivery is used to refer to a mother giving birth, and birth to refer to a baby being born. Thereafter, we limited the analysis to live born singleton deliveries/births. Descriptive statistics of both maternal and infant characteristics by linkage groups were reported using either information from the birth or hospital birth record. For those variables reported on both, information from the birth record was used unless the hospital birth admission record was indicated as being more reliable according to validation studies of birth and hospital data [35-37]. Descriptive analysis was performed in SAS 9.2 [38]. Ethical approval was obtained from the NSW Population and Health Services Research Ethics Committee.

Linkage rates for all births
In the period January 2001 to December 2008, there were 706,685 deliveries resulting in 713,522 live births and 4,460 stillbirths recorded in the birth records (PDC). The rate of complete linkage (birth record linked to both mother and infant hospital birth admission records) dropped from around 96% at 37 weeks gestation to <90% at 30 weeks and <70% at 25 weeks (Figure 2). For birth weight, complete linkage was around 95% for weights above 2500 grams, but below this dropped to <80% by 1000 grams (Figure 3). Probabilistic linkage resulted in 688,802 birth records with complete linkage to both mother and infant hospital admission birth records (95.9%) (Table 4). Partial linkage was available for a further 3.6% of birth records, including 2.2% where the birth record linked only to the mother's hospital record ('mothers only') and 1.4% where it linked only to the infant's hospital record ('infants only'). Less than one per cent (0.5%) of birth records did not link to any hospital record (Table 4).
Figure 2 Linkage rate for complete group by gestational age (weeks). Complete linkage rate (number of birth records linked to both a mother and infant hospital admission birth record as a percentage of all birth records) by gestational age for all births (blue line) and liveborn singletons (dotted black line).
From the hospital records, 713,190 infant birth admission records were identified, equivalent to >99.9% of the live births reported in the birth records (N = 713,522). From the hospital records, 704,009 delivery records (mothers) were identified, representing 99.6% of those reported in the birth records (N = 706,906).
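The headline percentages can be reproduced from the counts reported in this section. The short Python check below assumes that the denominator for complete linkage is the sum of the live births and stillbirths recorded in the birth records; the variable names are illustrative.

```python
def pct(numerator, denominator):
    """Percentage of the denominator represented by the numerator."""
    return 100 * numerator / denominator


live_births = 713_522
stillbirths = 4_460
birth_records = live_births + stillbirths        # assumed denominator: 717,982

complete_linkage = pct(688_802, birth_records)   # reported as 95.9%
infants_identified = pct(713_190, live_births)   # reported as >99.9%
deliveries_identified = pct(704_009, 706_906)    # reported as 99.6%

print(f"complete linkage: {complete_linkage:.1f}%")
print(f"infant birth admissions identified: {infants_identified:.2f}%")
print(f"deliveries identified: {deliveries_identified:.1f}%")
```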
For the largest group of birth records, live born singletons, 96.5% of records had complete linkage to both a mother and an infant birth admission record compared to 96.0% of live born multiple births. For stillbirths, the largest linkage group was the 'mothers only' at around 94% for both singletons and multiple births. Unlinked birth records were more common for stillbirths (3-4%) than live births (0.3-0.4%).
Given the incomplete linkage of stillbirths (recorded as a maternal outcome) and the difficulty of presenting results for multiple births (requiring duplication of maternal information), comparisons of maternal and infant linkage groups are presented for singleton live births. Coding of stillbirth/live birth and plurality could not be identified for 1,505 of the 704,009 deliveries identified in the hospital records (0.2%), and pregnancies with duration <26 weeks were over-represented in this group (6.4%).

Singleton live births
Among singleton live births, the rate of complete linkage dropped from around 96% at 25 weeks gestation to only 72% at 20 weeks gestation (Figure 2). For birth weight, complete linkage was around 96% for weights above 1000 grams, but below this dropped to around 80% by 400 grams (Figure 3). Maternal characteristics differed across the groups of linked and unlinked records (Table 5). The two groups that appeared most different were the unlinked birth records and the mothers only group. The unlinked birth records had higher proportions of nulliparous, Australian-born women, aged 35 and over, births in private hospitals, by caesarean section and the lowest levels of social disadvantage (quintile 1). Missing health information was more common in the unlinked groups.
The 'mothers only' group (no associated infant hospital record) had higher proportions of women with the highest level of social disadvantage (quintile 5), women aged less than 25, non-Australian-born mothers, unmarried women, smoking during pregnancy, commencement of antenatal care after 14 weeks gestation, caesarean section, placental abruption, and duration of pregnancy less than 26 weeks.
Infant characteristics also varied across linkage groups (Table 6). The 'mothers only' group appeared most different, with higher proportions of admission to a SCN or NICU, Apgar score at 1 minute less than 4, birth weight less than 1000 grams, birth at less than 37 weeks gestation, and infant deaths in hospital.

Discussion
To our knowledge, this is the first study that has assessed the linkage of mother and infant birth and hospital records rather than mothers and infants separately. As maternal and pregnancy factors are important predictors of infant outcomes, assessment of the complete linkage is important. In this study the level of complete linkage (95.9%) was high for all births and highest for live singleton births (96.5%). Partially linked mother records (no infant hospital record) had slightly higher rates of adverse events and common risk factors while the partially linked infant records (no mother hospital record) were very similar to those with complete linkage. This study has shown that stratifying linkage by plurality to overcome the recognized difficulty of linking multiple births [31,32] has generated comparable linkage rates for singleton and multiple live births. Stillbirths represent a very different group in terms of linkage. As infant hospital admission records are not generated, stillbirths should not be present in the complete linkage group. While this explains the majority of stillbirth records being in the 'mothers only' group, the proportion of unlinked birth records for stillbirths was also much greater than that for live births (4% vs. 0.4%), reflecting that stillbirths remain a problem for linkage. The lower rate of linkage for stillbirths and the issue of lower rates of complete linkage for live born singletons ≤24 weeks gestation are probably related. Infants born close to the border of viability (misclassification of stillbirths and live births, and births and miscarriages) have been previously identified as a problematic domain for perinatal record linkage [30]. For these reasons, unless infants ≤24 weeks are of particular interest, studies using probabilistically linked records may benefit from restriction to the population of at least 24 weeks gestation. 
For stillbirth studies, specialist linkages may be needed to improve linkage rates to the levels needed for robust research.
Among singleton live births, the proportions of birth records with partial (1.4-1.6%) or no linkage (0.4%) to hospital records were small. However, there was some evidence of systematic differences for the partially linked records that had no infant hospitalization record ('mothers only'). This group had slightly higher rates of adverse infant outcomes and associated risk factors, consistent with observations in other studies [10,39-41]. Reduced matching of infant records may be related to the association between missing information, social disadvantage and adverse outcomes, or to the possibility that severely ill infants with prolonged hospitalization are not coded as a birth admission. Restriction to later gestational ages would further reduce the already small size of this group of records. It is important to quantify the number and characteristics of unlinked or partially linked records to assess the potential for bias in estimation of the burden of disease and association between risk factors and outcomes. In our study, inclusion of additional records would not change, for example, the estimated preterm birth rate, nor is it likely to change risk estimates. However, in other settings with higher proportions of unlinked or partially linked records, exclusion of such records could introduce bias.
Our finding that the unlinked birth records represent a relatively low risk group of mothers and babies is likely to be a local phenomenon. The over-representation of births in private hospitals in the unlinked birth records is likely a result of missing name information. It is at the discretion of private hospitals as to whether name information is collected, and so private hospital records generally have a large amount of missing name information for both mothers and infants, thus affecting linkage rates for both. Changes to the data provided from private hospitals for linkage could potentially reduce the number of unlinked birth records.
The results highlight the importance of comparing the characteristics of linked and unlinked records when probabilistic record linkage is used for perinatal research on mothers and infants, given the potential bias introduced into analysis by incomplete record linkage. It is recommended that, for the chosen study population, linked and unlinked records be requested for analysis and a comparison of linked and unlinked records be undertaken as part of any research using probabilistically linked data. This is of even greater importance when newly-established datasets and linkages are used, in contrast to the well-established datasets and linkage protocols used by the CHeReL, which generated the linked data for this study. Further, in order to properly discuss the potential impacts, it is necessary for researchers to have a reasonable understanding of how the probabilistic linkage process works and the matching processes involved. The hospital birth admission records for mothers and infants that did not link to a birth record were small in number and of comparable size to the number of unlinked birth records, and inevitably include some missed links. However, particularly for mothers, there is difficulty in establishing birth admission records as more than one hospitalization may be identified as a birth admission. Although used in the past [42,43], we found selecting maternal hospital records on a single outcome of delivery code (ICD10: Z37, ICD9: V27) to be inadequate, and a much more comprehensive list was required (Table 2). This agrees with a US study that showed that identifying maternal hospital records using outcome of delivery missed complicated pregnancies [44]. Furthermore, due to the nature of ICD coding there was difficulty in classifying the plurality and whether the birth(s) were live born or stillborn. In general, a good understanding of coding practices can help to improve identification of these records.

Conclusions
Probabilistic methods can achieve high, representative levels of complete linkage for mothers and infants. Although some systematic differences occur for the mothers' records that do not link to a corresponding infant record, and to a lesser degree for unlinked birth records with respect to private hospitals, these groups are very small and unlikely to bias estimates of effect or conclusions in a substantive way, particularly if the study population is live born singletons.