A 44 % increase was observed in admissions to neonatal intensive care of babies born ≤26 weeks completed gestational age in England between 1995 and 2006. Hospital Episode Statistics (HES) may provide supplementary information to investigate this. The methods and results of a probabilistic data linkage exercise are reported.
Two data sets were linked for each year (1995 and 2006) using 3 different algorithms (Fellegi and Sunter, Contiero and estimation-maximisation).
In 1995, linkage was performed between 668 EPICure and 486,705 HES records; 1,820 linked pairs were identified of which 422 (63.17 %) were confirmed. In 2006, from 2,750 EPICure and 631,401 HES records, 8,913 linked pairs were identified with 1,662 (60.40 %) confirmed as true. Reported births in HES at <26 weeks gestation increased 37.0 % from 867 to 1188.
Results support the EPICure findings that there was an increase in the birth rate for extremely premature babies between 1995 and 2006. There were insufficient data available for detailed investigation. Routine data sources may not be suitable for investigations at the margins of viability.
Survival of extremely premature babies has improved in recent decades , particularly following the introduction of antenatal steroids and postnatal surfactant [2–4]. Previous work suggests that routine data sources collected in England such as Hospital Episode Statistics (HES) or the Births and Deaths Registry provide insufficient data to permit detailed investigation [5, 6].
Between March and December 1995, the EPICure study collected information on all babies born in Great Britain and the Republic of Ireland at less than 26 weeks completed gestation who were admitted into neonatal intensive care . The EPICure 2 study collected data on all births occurring below 27 weeks gestation in England during the whole of 2006 . Comparison of the two data sets demonstrated a disproportionate 44.0 % increase in the number of babies admitted into intensive care at less than 26 weeks , whereas the birth rate overall increased by only 0.7 %, from 613,257 to 635,748 per year .
It is unclear whether the increase seen represents a true rise in the numbers of live babies being born extremely prematurely or if, instead, it reflects changes in management in the delivery room. To investigate this question, we required specific extra demographic data for 1995 to explain the rise in admission rates. We therefore sought to supplement the EPICure data sets with additional information from Hospital Episode Statistics by performing probabilistic data linkage between each of the two data sets (EPICure and HES) available for 1995 and 2006 as there were insufficient patient identifying variables to permit deterministic linkage.
Available data sets
In 1995, EPICure collected a brief delivery room log comprising date of birth, gestational age, birth weight and infant sex; a more complete data set was only collected for babies admitted into neonatal intensive care. For the EPICure 2 study, delivery data were collected using an expanded form which comprised extensive delivery and resuscitation data with the help of the Confidential Enquiry into Maternal and Child Health (CEMACH). HES data for 1995 and 2006 were obtained from the NHS Health and Social Care Information Centre (HSCIC); the full list of the variables requested and the available data that were returned is shown in Table 1 of Additional file 1.
Each data set was cleaned, then restricted to variables required for matching. A detailed explanation of how the HES and EPICure data sets potentially match is included in Additional file 2. All analyses were conducted using R .
Choice of variables
Variables chosen for inclusion in the matching exercise were: baby’s date of birth, sex, gestational age and weight at birth, birth order, total number of babies in the pregnancy, the mother’s number of previous pregnancies, discharge date, maternal age at delivery, date of death, ethnicity and postcode. Maternal age at delivery was included in preference to maternal date of birth to minimise errors from data entry; date of death was derived for HES using “date of discharge” and “discharge method”. “Ethnic category” was recoded to match the EPICure categorisation and was included even though supplementary information on ethnicity was one of the desired results. Derived variables and ethnicity were included in the matching for 1995 to improve subject discrimination as postcode was unavailable.
Linkage was performed for both study epochs in the same way. Each of three algorithms available in the “RecordLinkage” package  of R  were used. These are based on the estimation-maximisation algorithm  and on the methods of Fellegi & Sunter (stochastic linkage) , and Contiero (EpiLink algorithm) . For the Fellegi & Sunter analysis, weights (w) are calculated stochasticly, based on M (i.e. that both records of a pair are from the same subject) and U (where records in a pair belong to different subjects) probabilities . We performed one round of matching using “best guess” values, and a second round using estimates from Dattani et al. , supplemented where no prior information was available with the “best guess” estimates. Values are shown in Table 1. A more detailed explanation of the linkage methods is provided in Additional file 3.
Direct comparison (i.e. using different parameters) was made between the two versions of the stochastic linkage using the “best guess” and Dattani probabilities.
Data were tabulated to identify appropriate cut off points for clerical review. Initial thresholds attempted only to obtain a “reasonable” number of matches for review; subsequent revision was not possible due to time limitations.
Following linkage, a master data set was created for each epoch by combining data for retained ID pairs. Rows corresponding to duplicate entries of a single EPICure ID were manually reviewed. For true matches, that specific row and all other potential matches with those IDs were removed from further consideration. The review process was repeated using both the EPICure and HES subjects as the base for comparison until no further true matches could be identified.
For each method of linkage, we assessed matching accuracy by merging the true matches with the saved unique pairs. This permitted the number of “true matches” to be identified and enabled sensitivity, specificity, positive predictive value and negative predictive value to be calculated (see Fig. 1).
For the 1995 EPICure cohort, data from 668 babies who were admitted into neonatal intensive care were available. The EPICure 2 data set contained 4,144 rows which, after removal of data not collected in HES (terminations of pregnancy: 768; still births: 626), resulted in 2,750 individual subject records being available.
Hospital Episode Statistics (HES) data were supplied by the NHS HSCIC for each year of analysis. There were 575,509 records for 1995 and 631,499 for 2006. To match the time period of the EPICure study, births occurring in January or February were excluded from the 1995 data; 8,807 records with a missing date of birth were retained, meaning 486,705 records were used for linkage. There were no duplicates in 1995; in 2006, 98 duplicate rows were removed, leaving 631,401 records for analysis.
Postcode and, consequently, Socioeconomic data were completely absent from HES data for 1995, and fewer than 20.0 % of the subjects had information on ethnicity. For 2006, socioeconomic information was available for over 50.0 % of subjects, and ethnicity unavailable for only 157,781 (25.0 %) records. Levels of missingness for matching variables are shown in Table 2: HES data were less complete for each time period. Expanded distributions for gestational age and birth month are shown in Additional file 4: Table S1.
In 1995, 2,184 (82.9 %) of 2,634 subjects in HES with a recorded birth weight of less than 500 grams were described as having a gestational age of 35-45 weeks (Table 3). For those recorded as being of a low gestational age, birth weight was missing in 14.3 %, 11.4 %, and 6.2-7.5 % at 20, 21 and 22 to 23 weeks gestational age respectively, and in only 0.2 % of those born between 35 and 39 weeks. In 2006, problems were even greater with the entire data set (Additional file 4: Table S2). No issues were identified with either of the EPICure data sets.
Stochastic analysis - baseline estimated values
In 1995, the maximum weight of a linked pair was 42.3. 2,093 unique pairs were identified above a threshold of 15, representing 537 EPICure IDs and 1,846 HES IDs. There was a marked drop above 17, to 792 unique pairs (365 unique EPICure IDs and 692 unique HES IDs, Table 4). This is seen in the density graph of weights (coded “fs.D” in Fig. 2a), and in the number of unique records linked from each data set (Fig. 3a). Above 30, the number of linked pairs equalled the number of IDs from each data set – i.e. there were 86 uniquely matched pairs.
The maximum weight in 2006 was 54.51; a cut-off of 10 was chosen. Graphs for 2006 are shown in Figs. 2b (density graph) and 3b (unique IDs). Above the cut-off value, there were 44,719 unique record pairs identified, representing 2,729 unique EPICure 2 IDs and 36,025 HES IDs. A large decrease was seen above a cut-off of 12, to 2,459 pairs overall with 1,569 and 1,811 unique EPICure 2 and HES IDs, respectively (Additional file 4: Table S3).
Stochastic analysis - Dattani estimates
With the Dattani et al.  probabilities, the maximum weight in 1995 was 65.7 and, in 2006, 71.57; thresholds of 35 and 15 were chosen, respectively. In 2006, there were 53,413 potential links with the number of HES IDs dropping from 32,051 to 3,129 between weights of 18 and 19. In 1995, there was a relatively constant decrease in the number of EPICure IDs, whereas the number of potentially linked HES IDs dropped from 16,385 to 3,540 at a weight of 36. Full details are presented in Additional file 4: Table S4 (for 1995) and Additional file 4: Table S5 (for 2006), with graphs shown in Additional file 5: Figure S1 (density graphs) and Additional file 5: Figure S2 (unique IDs linked).
For both years, a cut-off value of 0.35 was chosen. 45,349 pairs were retained in 1995 compared to 6,323 in 2006. There was a much better spread of weights in 2006 (see density graphs, Additional file 5: Figure S3), reflected by the maximum weight obtained in each of the analyses: 0.9494 in 2006 but only 0.8678 in 1995.
Convergence in the numbers of matched IDs occurred around a weight of 0.45 in both epochs. Unique matches were only identified in 1995 above a threshold of 0.75 (20 pairs – Table 5), with none identified in 2006 (Additional file 4: Table S6). Graphs of unique IDs from each data set are shown in Additional file 5: Figure S4.
Estimation-maximisation likelihood algorithm
For the estimation-maximisation algorithm, the maximum weight in 1995 was 65.7, and 71.57 in 2006 with a threshold of 10 used for both analyses. There was a steadier attenuation in the number of linked pairs in 1995 than 2006 (Additional file 5: Figure S5a and S5b show the unique IDs; density graphs are shown in Additional file 5: Figure S6). In 1995, only above a weight of 43 were pairs uniquely matched (Additional file 4: Table S7), and in 2006, only two unique pairs were identified – above a weight of 70 (Additional file 4: Table S8).
Manual review of linked pairs
1,820 linked pairs from the different analyses in 1995, and 8,913 in 2006, were concatenated together to create data sets of unique pairs – 1,070 in 1995 and 4,378 in 2006. 1995 data were manually reviewed four times, confirming 422 matches between the EPICure and HES data (63.2 % of the 668 maximum potential matches). For 2006, three rounds of manual review were performed, reducing the data set to 1670 rows which included 1,666 unique EPICure 2 and 1,670 unique HES IDs. Insufficient data were available to discriminate among the four remaining EPICure 2 IDs, each of which were paired with two HES IDs. Discarding these unconfirmed links meant that overall there were a total of 1,662 confirmed of a maximum 2,750 possible matches – 60.4 %.
Assessment of error
We calculated sensitivity, specificity, positive and negative predictive values. In all analyses, specificity and negative predictive value were 1.0. The Fellegi-Sunter analysis using baseline best guesses provided the most accurate results in both epochs, correctly identifying 402 pairs in 1995, and 1740 in 2006. It also had the highest sensitivity in each time period – although it only identified 63.3 % of subjects in 2006, and 60.2 % in 1995. Results are presented in Tables 6 and 7 for 1995 and 2006, respectively.
Saved HES data
During the 10 months of the EPICure study in 1995, from 1st March to 31st December, there were 867 births recorded in HES with a gestational age of 25 weeks or lower. These were merged with the 422 “true” matches identified in the probabilistic linkage; there were 300 matches, leaving 567 subjects for whom no further investigation was possible. In 2006, there were 2,535 HES records identified of births at less than 27 completed weeks gestational age. These were combined with the 1,662 records from the probabilistic matching: there were 932 matching rows, leaving 1603 for whom further review was not possible.
Answering the original question
In advance of record linkage, it was realised the original question could not be answered due to lack of additional data. However, a cautious investigation of demographic shifts between 1995 and 2006 using the HES data alone was possible. Identical populations to EPICure could not be identified: HES records data from live births of less than 24 weeks and for all births at 24 weeks gestation and above, and does not distinguish live births who died in the delivery room from those who were admitted into neonatal intensive care but died on the same calendar date.
Table 8 shows how the data changed between the two study epochs. There were 867 births reported in HES in 1995 of <26 completed weeks gestational age; 213 were still births. Examining this in relation to the corresponding 2006 data (i.e. also of less than 26 completed weeks gestational age and from a similar time period) shows a 37.0 % increase in reported births. For reported live births, there was a 42.8 % increase. The table also contains data about three other populations from HES:
The “true” population: this contains data for HES subjects identified by the linkage exercise following clerical review.
The “confirmed” population: represents those reported in HES as below 26 (and, for 2006, also below 27) weeks who were confirmed by the linkage exercise.
The final group contains those from the reported group who were not identified during linkage.
We were unable to confirm the hypothesis that HES data are a suitable data source with which to investigate the apparent 44 % increase neonatal admissions between 22 and 25 completed weeks gestational age that was seen between 1995 and 2006. Overall, only approximately 60 % of available EPICure records were successfully linked in each study epoch using a combination of probabilistic methods. Of three linkage methods utilised, the Fellegi and Sunter technique using “best-guess” estimates of matching probabilities was the most successful in 1995, with no clear “best technique” in 2006.
Examination of the HES data demonstrated an increase of 37.0 % in the number of reported births between 1995 and 2006, and 42.8 % in live births in a population similar to that of the EPICure studies (less than 26 weeks gestational age and born between March 1st and December 31st). This suggests the 44.0 % increase in admissions to neonatal intensive care seen in the EPICure data might be real. However, there were insufficient other data (ethnicity, socioeconomic status) to permit detailed investigation.
Hospital Episode Statistics is a routine data set collected since 1989 from secondary care sources with primarily non-clinical motives . Birth data in HES are incomplete. Births in non-NHS locations (private hospitals or birthing centres, or at residential locations) may not be collected, and there is marked variation in reporting by different health care providers [5,15,17]. Data may be entered by midwives immediately after delivery via point-of-care systems or separately by clinical coders; reporting practices have changed over time . In contrast, the EPICure data were specific cohort studies run in collaboration with national confidential enquiries (Confidential Enquiry into Stillbirths and Deaths in Infancy (CESDI) and CEMACH) [7,8]. Data were only collected about specific births by those directly involved in care under the responsibility of a delegated EPICure contact (usually a doctor) at each perinatal centre in England [7,8].
These differences were apparent. The EPICure data are more likely to be accurate with respect to gestational age as these were rechecked against source data and recalculated if necessary [7,8]. In the HES data, there were inconsistencies between gestational age and birth weight category as well as deficiencies in data quantity. High levels of missing data were seen in variables used for linkage; many others contained a complete absence of data. This severely limited the capacity for accurate data linkage and prevented further meaningful investigations.
HES data problems may have biased the results. Population coverage may have led to selection bias if data were less well reported in some regions or for some hospitals than in the EPICure studies, thus resulting in matches not being identified when they could have been. Information bias is likely as a consequence of HES data consistency issues. Similar work linking HES with maternity data for England and Wales has shown low rates of discordance between sources [15,18–22]; however, data quality issues are more likely to be an issue for those born in unusual circumstances like those who are extremely premature. Such errors are likely to apply across the gestational age ranges included in this study, thus causing non-differential misclassification and biasing linkage towards non-identification of true matches.
Confounding cannot be excluded, but is unlikely. It might occur if birth weight were closely correlated with gestational age – but this was not the case. No other matching variables in this study would be expected to show a strong correlation with extreme prematurity and successful linkage.
Random error may also affect analyses, but given that the purpose of probabilistic linkage is only to assign a weight, it is unlikely to be of great importance. This is because manual intervention is required, if not for review purposes, at least for selection of a threshold. This consequently provides a counterbalance to random error: an acceptable level of error is determined by the number of records to review.
A different problem arises from combining the results from the different analyses prior to manual review. This shortened analysis time, but introduced contamination between the different linkage methods. This is because remaining pairs with HES or EPICure IDs corresponding to those in an identified “true” match were removed, meaning identification of a match from one analysis potentially influenced the choice of match arising from another.
Fellegi and Sunter analysis
For the best guess analyses, M and U probabilities but not the resultant weights were considered in advance. However, where the M and U probabilities were identical, weights for matched and non-matched pairs equalled zero, meaning no distinction was made between matched and non-matched pairs – and thus that the variable was not considered during matching. The impact of this was minimised by inclusion of sufficient other variables in each matching exercise. However, it may have been possible to increase discrimination between linked pairs.
The second set of analyses used estimates obtained with data from a previous matching exercise , and better utilised the matching variables. The weakness here was that probability estimates were based on 2006 data,  which may not have been appropriate for the earlier time period. However, it was fortunate that there were estimates, and only through linkage can the veracity of probability estimates be confirmed.
The EPICure data set for each epoch was examined in relation to a single day’s worth of HES data at a time. This resulted in different starting points each day for the estimation-maximisation algorithm, potentially causing errors to be introduced. It is unknown what effect this differential misclassification may have had, but it is likely this produced an underestimate (i.e. a nullification of effect), as several dates were noted when no convergence of the algorithm was achieved – indicating weights could not be calculated and hence resulting in no matches.
EpiLink (Contiero) approach
There was potential for a similar error in the EpiLink analyses  because the Contiero algorithm bases estimates of weights on the frequency of responses and estimated error rates [11,14]. This was avoided by specifying these factors in advance.
Although it was not possible to investigate changes in socioeconomic factors or ethnicity over time using the HES data, it is interesting to note similar increases of around 40 % of all births and of live births in those reported in HES to the 44 % rise in admissions seen in EPICure. This provides some confidence that the EPICure findings are true.
Previous studies have focused on linkage for the entire gestational age range, thus errors at extremes are dissipated. This study demonstrated important data quality concerns in a specific sub-population.
The findings in relation to the primary objective – to confirm whether there was an increase in births between 1995 and 2006 – are important as they suggest that extremely premature birth is becoming more frequent and build on the observations of the EPICure study .
In conclusion, this study found that HES data are a poor source for information about those born extremely prematurely, with no improvements in data quality seen between 1995 and 2006. However, increases in the absolute numbers of babies born at extremely premature gestations were seen that were in the same direction and of a similar size to those seen in the EPICure studies.
Ethics approval and consent to participate
The EPICure 2 study was approved by the City and East London REC (05/Q0605/107), with additional approval obtained from the Patient Information Advisory Group (PIAG 3-07(f)/2005). Permission for this study was granted by the Ethics and Confidentiality Committee of the National Information Governance Board (NIBG) for Health and Social Care (ECC 1-02(FT3)/2012). Due to delays in obtaining data and during the analysis, a 6-month extension was granted in March 2013 to permit study resolution.
Consent for publication
Availability of data and materials
The EPICure studies are subject to a data sharing policy that may be downloaded from http://www.epicure.ac.uk. Hospital Episode Statistics are all rights reserved, copyright 2012, and re-used with the permission of The Health and Social Care Information Centre. Statistical code is available from the corresponding author.
Costeloe KL, Hennessy EM, Haider S, Stacey F, Marlow N, Draper ES. Short term outcomes after extreme preterm birth in England: comparison of two birth cohorts in 1995 and 2006 (the EPICure studies). BMJ (Clin Res ed). 2012; 345:7976.
Winkler WE. Using the estimation-maximisation algorithm for weight computation in the Fellegi-Sunter model of record linkage In: American Statistical Association, editor. Proceedings of the Section on Survey Research Methods. USA: American Statistical Association: 2000. p. 667–71, doi:10.1.1.17.175.
Fellegi IP, Sunter AB. A theory for record linkage. J Am Stat Assoc. 1969; 64(328):1183. [doi:10.2307/2286061].
Contiero P, Tittarelli A, Tagliabue G, Maghini A, Fabiano S, Crosignani P, Tessandori R. The EpiLink record linkage software: presentation and results of linkage test on cancer registry files. Methods Inf Med. 2005; 44(1):66–71. [doi:10.1267/METH05010066].
Oakley L, Maconochie N, Doyle P, Dattani N, Moser K. Multivariate analysis of infant death in England and Wales in 2005-06, with focus on socio-economic status and deprivation. Health Stat Q/Off Natl Stat. 2009; 42(Summer):22–39.
The authors wish to thank Dr. Katie Harron, (Institute of Child Health, UCL) for her guidance on data linkage, without whom this study would not have been possible. We also acknowledge the additional support of the NHS Health and Social Care Information Centre who act as data guardians for Hospital Episode Statistics and supplied us with the data.
Authors and Affiliations
Institute for Womens’ Health, UCL, 74 Huntley Street, London, UK
The authors declare that they have no competing interests.
ASM, NM and ESD designed the study. KC and NM were responsible for collection of the EPICure data sets; ASM, NM and ESD prepared the application and were guarantors for the HES data. ASM performed all statistical analysis, provided primary interpretation of the results and wrote the first draft of the paper. All authors were involved in revisions. All authors read and approved the final manuscript.
The EPICure studies are funded by the Medical Research Council (G0401525). Neil Marlow receives part funding from the Department of Health’s NIHR Biomedical Research Centre’s funding scheme at UCLH/UCL. The funders had no role in study design, data collection, data analysis, data interpretation, or writing of the report.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Morgan, A.S., Marlow, N., Costeloe, K. et al. Investigating increased admissions to neonatal intensive care in England between 1995 and 2006: data linkage study using Hospital Episode Statistics.
BMC Med Res Methodol16, 57 (2016). https://doi.org/10.1186/s12874-016-0152-0