Investigating increased admissions to neonatal intensive care in England between 1995 and 2006: data linkage study using Hospital Episode Statistics

Background
A 44 % increase was observed in admissions to neonatal intensive care of babies born ≤26 weeks completed gestational age in England between 1995 and 2006. Hospital Episode Statistics (HES) may provide supplementary information to investigate this. The methods and results of a probabilistic data linkage exercise are reported.

Methods
Two data sets were linked for each year (1995 and 2006) using 3 different algorithms (Fellegi and Sunter, Contiero and estimation-maximisation).

Results
In 1995, linkage was performed between 668 EPICure and 486,705 HES records; 1,820 linked pairs were identified of which 422 (63.17 %) were confirmed. In 2006, from 2,750 EPICure and 631,401 HES records, 8,913 linked pairs were identified with 1,662 (60.40 %) confirmed as true. Reported births in HES at <26 weeks gestation increased 37.0 % from 867 to 1,188.

Conclusions
Results support the EPICure findings that there was an increase in the birth rate for extremely premature babies between 1995 and 2006. There were insufficient data available for detailed investigation. Routine data sources may not be suitable for investigations at the margins of viability.

Electronic supplementary material
The online version of this article (doi:10.1186/s12874-016-0152-0) contains supplementary material, which is available to authorized users.


Fellegi & Sunter analysis
Matching was performed in the same way for both study epochs. Each of the three matching algorithms available in the "RecordLinkage" package was used. [4] The most straightforward of these calculates weights (w) stochastically, based on Fellegi and Sunter's work, whereby both the M-probability (the probability that a variable agrees when both records of a pair are from the same subject) and the U-probability (the probability that it agrees when the records in a pair belong to different subjects) are specified in advance. [4] The calculations are performed as follows:

Values chosen for the M- and U-probabilities may have an important impact on the results and should therefore be chosen carefully. Dattani et al [1] provide some data on which the 2006 estimates of these values may be based. However, as not all of the variables to be used for matching had prior estimates, it was decided to perform one round of matching using best-guess values, and a second round of matching using the Dattani et al estimates.

The best-guess values were derived using the following rules:
M-probability: based on the estimated accuracy of record completion.
U-probability: based on chance agreement, i.e. the likelihood that two subjects would match if the subjects were chosen randomly.

For the M-probabilities, date of birth, mother's age at delivery, baby sex and number of babies were considered to have a high probability (≥ 90%) of having been entered correctly; for other variables, the estimated probabilities ranged as low as 20%. Best-guess U-probabilities for date of birth and date of death were set at 1/365 = 0.00274, and for discharge date at 1/500, as HES is likely to be discrepant from EPICure data in this respect; for birth order, number of babies and number of previous pregnancies at 90%, as pregnancies of lower birth order are more common, as are women of lower parity; and for sex at 0.49, so as to account for those of indeterminate sex. U-probabilities for gestational age at birth and maternal age were based on the approximate number of categories, with a slight adjustment for unequal distributions. Birth weight was assigned a U-probability of 1/1000, i.e. 0.001. The full set of values, along with the corresponding weights, is shown in table 1.

Table 1, note c: Date of death and delivery method were both modified using an adjusted best guess for the second linkage analysis performed using estimates from Dattani et al.
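The weight calculation described in this section can be illustrated with a short Python sketch. It assumes the usual Fellegi and Sunter log2 form, in which agreement on a variable contributes log2(M/U) and disagreement contributes log2((1 − M)/(1 − U)). The function names and the M value of 0.90 are illustrative assumptions; only the U values of 1/365 and 0.49 are taken from the text.

```python
import math

def fs_weight(m: float, u: float, agree: bool) -> float:
    """Log2 weight for one variable under the assumed Fellegi-Sunter form."""
    if not (0.0 < m < 1.0 and 0.0 < u < 1.0):
        raise ValueError("m and u must lie strictly between 0 and 1")
    if agree:
        return math.log2(m / u)               # agreement weight
    return math.log2((1.0 - m) / (1.0 - u))   # disagreement weight

# (M, U) per variable. U values follow the text; M = 0.90 is an assumed
# stand-in for "high probability (>= 90%)".
PROBS = {
    "date_of_birth": (0.90, 1 / 365),
    "sex": (0.90, 0.49),
}

def pair_weight(agreement: dict) -> float:
    """Sum the per-variable weights for one candidate record pair."""
    return sum(fs_weight(*PROBS[name], agree) for name, agree in agreement.items())

if __name__ == "__main__":
    for name, (m, u) in PROBS.items():
        print(name, round(fs_weight(m, u, True), 2), round(fs_weight(m, u, False), 2))
```

Under these assumptions, agreement on a rarely coinciding variable such as date of birth adds far more to the pair weight than agreement on sex, which is why the choice of U-probabilities matters.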

Table 1. Probability estimates for linkage analyses between Hospital Episode Statistics and EPICure data based on best guesses and prior knowledge (adapted from data linkage performed by Dattani et al between Hospital Episode Statistics (HES) and NHS Numbers 4 Babies data sets). [1]
In the comparison round of matching, using the Dattani et al estimates, data were available for date of birth, postcode, number of babies in the pregnancy, sex, birth weight, gestational age and ethnicity. Of these, absolute numbers of concordant and discordant pairs were provided for number of births per pregnancy and sex, and percentages of concordant pairs for the remaining variables. It was therefore possible to calculate probabilities for these variables using equations 2 and 3 (C = concordance rate, D = discordance rate, and P_nm = percentage not missing):

Where no prior information was available from the Dattani et al estimates for variables to be used in the matching, the best-guess values were used to supplement them.

Contiero analysis
The second method of matching uses the algorithm designed by Contiero, on which the EpiLink software is based. [4] For this method, the overall weight (w_o) for each subject-pair can be calculated as:

where s_i is the value of the comparison between the ith records from each of the data sets x and y, and w_i is the weight attached to that particular (variable) comparison. Weights are assigned in the range 0 ≤ w ≤ 1. [4,6] Both the error rates and the frequencies used to derive the variable weights were explicitly set according to the default values for the overall data sets.
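As an illustration of the overall weight described above, the following Python sketch assumes the EpiLink form given in the RecordLinkage documentation: a weighted mean of the comparison values, w_o = Σ w_i s_i / Σ w_i, with each variable weight derived from an error rate e_i and an average value frequency f_i as w_i = log2((1 − e_i)/f_i). The function names and example numbers are hypothetical.

```python
import math

def variable_weight(error_rate: float, frequency: float) -> float:
    """Assumed EpiLink variable weight: log2((1 - e) / f)."""
    return math.log2((1.0 - error_rate) / frequency)

def overall_weight(similarities, weights) -> float:
    """Weighted mean of per-variable comparison values s_i in [0, 1]."""
    if len(similarities) != len(weights):
        raise ValueError("one weight is needed per comparison value")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * s for w, s in zip(weights, similarities)) / total

if __name__ == "__main__":
    # Hypothetical error rates and frequencies for two variables.
    w = [variable_weight(0.01, 1 / 365), variable_weight(0.01, 0.5)]
    print(overall_weight([1.0, 0.0], w))  # agrees on the first variable only
```

Because the result is a weighted mean of values between 0 and 1 with positive weights, it also lies between 0 and 1, which is consistent with the range stated in the text.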

Estimation-maximisation analysis
The final method of matching uses an automated procedure to assign weights based on maximum likelihood, and is known as the estimation-maximisation algorithm. [4] This did not require any parameters other than the names of the data sets to be passed to it.

[...] to save all this information as the vast majority were false matches. Therefore, each linkage method required a preliminary review of the calculated weights in order to select appropriate cut-offs above which to retain linked or potentially linked data pairs (one each from the HES and EPICure data sets). Cut-off points were selected according to where a "reasonable" number of linked pairs was obtained.
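The preliminary review of weights can be sketched as a count of how many candidate pairs would be retained at each trial cut-off. This is a minimal Python illustration; the function names, the record identifiers and the weights are hypothetical, and retaining pairs at or above the cut-off (rather than strictly above it) is an assumption.

```python
def pairs_retained_by_cutoff(weights, cutoffs):
    """Count the pairs whose weight is at or above each trial cut-off."""
    return {c: sum(1 for w in weights if w >= c) for c in cutoffs}

def retain_pairs(pairs, cutoff):
    """Keep (epicure_id, hes_id, weight) tuples at or above the cut-off."""
    return [p for p in pairs if p[2] >= cutoff]

if __name__ == "__main__":
    # Hypothetical candidate pairs: (EPICure id, HES id, calculated weight).
    candidates = [
        ("E1", "H7", 0.20),
        ("E1", "H9", 0.50),
        ("E2", "H3", 0.80),
        ("E3", "H4", 0.90),
    ]
    weights = [w for _, _, w in candidates]
    print(pairs_retained_by_cutoff(weights, [0.5, 0.85]))
    print(retain_pairs(candidates, 0.85))
```

Tabulating the counts across several cut-offs makes it easier to judge where the number of retained pairs becomes manageable for manual confirmation.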