Data
Hospital Episode Statistics (HES) is an administrative data source containing information on all admissions to NHS hospitals in England. Linkage of HES is coordinated through NHS Digital (previously known as the Health and Social Care Information Centre) [17]. The HES extract used for this study had previously been linked with a reference (gold-standard) dataset of records extracted from the Personal Demographics Service (PDS), which is also coordinated by NHS Digital (http://systems.digital.nhs.uk/demographics/pds). PDS contains the latest demographic details corresponding to a given NHS number. It also contains historical information such as previous addresses, and is used for the NHS number tracing service (known as the Demographics Batch Service) and to provide identifiers for the NHS Patient Spine. Linkage with PDS reference data allowed us to quantify identifier errors. In this study, we define identifier error as any discrepancy between PDS and HES, for example where an identifier had been recorded incorrectly, had legitimately changed over time (e.g. postcode) or was missing in HES.
For the purposes of this study, we defined our true (reference) match status by agreement or disagreement of NHS number between HES and PDS. We used a random sample of 10,000 record pairs from HES inpatient data linked with PDS, for the financial year 1st April 2011 to 31st March 2012, for each of three cohorts defined by date of birth: i) infants aged <1 year; ii) children aged 5–6 years; and iii) young adults aged 18–19 years. For each age cohort, the set of matches was created by identifying the PDS record associated with the NHS number on each HES record (n = 10,000 matches). The set of non-matches was created by identifying all PDS records with different NHS numbers to each HES record. This resulted in (10,000 × 10,000) − 10,000 = 99,990,000 non-matches for each age cohort. However, the majority of these non-matches did not agree on any identifier, or only agreed on sex, and so were excluded from consideration. This left around 30,000 non-matches for each age cohort.
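The pair counts above can be checked with a few lines of Python. This is only an illustrative sketch: the helper name `is_informative` and the tuple encoding of an agreement pattern are assumptions made for the example, not part of the study's software.

```python
# Candidate pairs per age cohort: every HES record compared with every PDS record.
n_hes = 10_000
n_pds = 10_000
n_matches = 10_000  # one PDS record shares the NHS number of each HES record

n_non_matches = n_hes * n_pds - n_matches


def is_informative(pattern):
    """Keep a non-match only if it agrees on date of birth or postcode.

    `pattern` is (sex, dob, postcode) with 1 = agree and 0 = disagree, so
    pairs agreeing on nothing, or on sex alone, are dropped.
    """
    _sex, dob, postcode = pattern
    return bool(dob or postcode)


print(n_non_matches)  # 99990000
```

Applying a filter of this kind to the full set of non-matching pairs is what reduces it to the much smaller set retained for analysis.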
The data used for this study comprised patterns of agreement/disagreement between date of birth, sex and postcode in HES-PDS linked pairs, but contained no actual identifiers. Agreement patterns were aggregated by age cohort, sex and ethnic group.
Identifier error rates
We estimated identifier error rates for sex, date of birth and postcode, based on the number of times these identifiers disagreed in matched HES-PDS records. We modelled the risk of identifier error using logistic regression with a set of attribute predictors recorded in HES (ethnicity, age and sex). We used a multi-level model with hospital as a random effect to explore organisational-level variation. Dependence between pairs of identifiers was also tested using multi-level logistic regression models. All models were fitted in Stata [18].
Probabilistic match weights
1. Traditional probabilistic match weights (assuming independence between identifiers)
We derived conditional probabilities for sex, date of birth and postcode based on the observed error rates for each identifier. Probabilities were derived from the number of times an identifier agreed or disagreed in pairs of matched HES-PDS records, e.g. for sex:
$$ \begin{array}{l} m\text{-probability} = m_{sex} = P\left(\text{agree on sex} \mid M\right) \\ u\text{-probability} = u_{sex} = P\left(\text{agree on sex} \mid U\right) \end{array} $$
where M represents a match and U represents a non-match. Missing values were treated as disagreement.
Match weights were then derived by summing the log-ratio of m- and u-probabilities over all k identifiers, i.e.
$$ W = \sum_k \log_2\left(\frac{m_k}{u_k}\right) = \log_2\left(\frac{m_{sex}}{u_{sex}}\right) + \log_2\left(\frac{m_{dob}}{u_{dob}}\right) + \log_2\left(\frac{m_{postcode}}{u_{postcode}}\right) $$
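As a rough illustration of the calculation, the sketch below sums log2 ratios over the three identifiers. The m- and u-probabilities are hypothetical values chosen for the example, not estimates from the HES-PDS extract. The formula above shows the agreement case only; the disagreement weight used here, log2((1 − m)/(1 − u)), is the usual Fellegi-Sunter convention and is an assumption about how disagreements were scored.

```python
import math


def match_weight(agreement, m, u):
    """Sum log2 likelihood ratios over identifiers, assuming independence.

    `agreement` maps identifier name to True/False. A missing value (None)
    is falsy and is therefore scored as a disagreement.
    """
    total = 0.0
    for identifier, agrees in agreement.items():
        if agrees:
            total += math.log2(m[identifier] / u[identifier])
        else:
            # Assumed disagreement weight (standard Fellegi-Sunter form).
            total += math.log2((1 - m[identifier]) / (1 - u[identifier]))
    return total


# Hypothetical probabilities, for illustration only.
m = {"sex": 0.99, "dob": 0.98, "postcode": 0.90}
u = {"sex": 0.50, "dob": 0.01, "postcode": 0.001}

w_all_agree = match_weight({"sex": True, "dob": True, "postcode": True}, m, u)
w_postcode_differs = match_weight({"sex": True, "dob": True, "postcode": False}, m, u)
w_postcode_missing = match_weight({"sex": True, "dob": True, "postcode": None}, m, u)
```

With these inputs, a pair that disagrees on postcode receives a lower weight than one agreeing on all three identifiers, and a missing postcode is scored identically to a disagreeing one.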
2. Match weights incorporating dependence between identifiers
Each HES-PDS record pair was associated with an agreement pattern φ representing agreement or disagreement on the joint set of three identifiers {sex, date of birth, postcode}. For binary agreement (agree = 1; disagree = 0), there are 2³ = 8 possible agreement patterns for sex, date of birth and postcode: {1,1,1}, {1,1,0}, …, {0,0,0}. Conditional probabilities were derived jointly over all identifiers for each observed agreement pattern, e.g. for agreement on sex and date of birth and disagreement on postcode, represented as {1,1,0}:
$$ \begin{array}{l} m\text{-probability} = m_{\varphi} = P\left(\text{agree on sex and date of birth, disagree on postcode} \mid M\right) = P\left(\varphi = \{1,1,0\} \mid M\right) \\ u\text{-probability} = u_{\varphi} = P\left(\text{agree on sex and date of birth, disagree on postcode} \mid U\right) = P\left(\varphi = \{1,1,0\} \mid U\right) \end{array} $$
Match weights were then derived as:
$$ W={ \log}_2\left(\frac{m_{\varphi}}{u_{\varphi}}\right) $$
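A minimal sketch of the pattern-based calculation follows, using toy counts rather than the study data. Patterns are encoded as strings such as "110" (sex, date of birth, postcode). Skipping patterns that are observed in only one of the two sets is an assumption of this example, since their weight is undefined without some form of smoothing.

```python
import math
from collections import Counter


def pattern_weights(match_patterns, non_match_patterns):
    """Return log2(m_phi / u_phi) for each agreement pattern seen in both sets."""
    m_counts = Counter(match_patterns)
    u_counts = Counter(non_match_patterns)
    n_matches = len(match_patterns)
    n_non_matches = len(non_match_patterns)

    weights = {}
    # Only patterns present among both matches and non-matches have a finite weight.
    for pattern in m_counts.keys() & u_counts.keys():
        m_phi = m_counts[pattern] / n_matches
        u_phi = u_counts[pattern] / n_non_matches
        weights[pattern] = math.log2(m_phi / u_phi)
    return weights


# Toy data, for illustration only.
matches = ["111"] * 8 + ["110"] * 2
non_matches = ["111"] * 1 + ["110"] * 1 + ["010"] * 8

weights = pattern_weights(matches, non_matches)
```

Because each pattern has its own pair of probabilities, no assumption is made about how agreement on one identifier relates to agreement on another.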
3. Attribute-specific and organisational-specific match weights
We derived attribute-specific match weights using the procedures described above, but now for each combination of characteristics as recorded in PDS (age cohort, sex, ethnic group; N combinations = 36). This process is distinct from blocking, in that agreement on any of these attributes is not required for linkage (and attribute-specific weights can be calculated for variables not used within the linkage, e.g. ethnic group). Organisational-specific match weights were derived by calculating m- and u-probabilities separately for each hospital (N hospitals = 388). Attribute-specific and organisational-specific match weights were calculated in the traditional manner (i.e. assuming independence between identifiers), as it was not possible to stratify each agreement pattern by age, sex and ethnicity due to low numbers.
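The stratification can be sketched as computing agreement rates separately within each stratum. The function name, the record layout and the toy strata below are assumptions made for illustration. Calling the function on matched pairs gives stratum-specific m-probabilities, and calling it on non-matched pairs gives the corresponding u-probabilities.

```python
from collections import defaultdict


def stratified_agreement_rates(pairs):
    """Proportion of pairs agreeing on each identifier, within each stratum.

    `pairs` is an iterable of (stratum, agreement) tuples, where `stratum`
    is any hashable label (e.g. an (age cohort, sex, ethnic group) tuple or
    a hospital code) and `agreement` maps identifier name to True/False.
    """
    agree_counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for stratum, agreement in pairs:
        totals[stratum] += 1
        for identifier, agrees in agreement.items():
            agree_counts[stratum][identifier] += int(bool(agrees))
    return {
        stratum: {k: count / totals[stratum] for k, count in agree_counts[stratum].items()}
        for stratum in totals
    }


# Toy matched pairs, for illustration only.
matched_pairs = [
    (("<1", "F", "A"), {"sex": True, "dob": True, "postcode": False}),
    (("<1", "F", "A"), {"sex": True, "dob": True, "postcode": True}),
    (("18-19", "M", "B"), {"sex": True, "dob": False, "postcode": True}),
]
m_by_stratum = stratified_agreement_rates(matched_pairs)
```

The stratum only selects which probabilities to use; nothing in the sketch requires two records to agree on the stratum variables, which is the difference from blocking.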
Simulation study
Aim
We performed a simulation study to determine the effect of the identifier-independence assumption and the value of incorporating attribute information into match weight calculation. Our scenario was linkage of hospital admissions records containing sex, date of birth, postcode, and NHS number. The aim was to estimate readmission rates by linking multiple hospital records for the same individual over time. Where there was a match between hospital records, this indicated that an individual had been admitted multiple times within the study period. Individuals with only a single hospital record and no matches were admitted only once during the study year.
Data generating mechanism
For each simulation, we created our ‘matches’ by randomly sampling agreement patterns (with replacement) from matched pairs in the HES-PDS extract, retaining distributions of age, sex and ethnicity from the original data. We created our ‘non-matches’ by sampling agreement patterns from non-matches in the HES-PDS extract. Sampling of matches and non-matches was stratified by age, sex and ethnicity, in order to reflect differences in readmission rates observed in the literature [19]. This approach avoided any distributional assumptions about identifier error rates for date of birth, sex or postcode, and also preserved associations between identifiers and individual characteristics.
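Stratified resampling of this kind can be sketched as below. The strata labels, pattern encoding and function name are hypothetical, and the stratum sizes are simply set equal to the observed sizes so that the original distribution of characteristics is retained.

```python
import random


def sample_patterns(patterns_by_stratum, n_by_stratum, rng):
    """Resample agreement patterns with replacement within each stratum."""
    return {
        stratum: rng.choices(patterns_by_stratum[stratum], k=n)
        for stratum, n in n_by_stratum.items()
    }


# Toy observed patterns keyed by (age cohort, sex, ethnic group).
observed = {
    ("<1", "F", "A"): ["111", "111", "110", "011"],
    ("18-19", "M", "B"): ["111", "101", "110"],
}
sizes = {stratum: len(patterns) for stratum, patterns in observed.items()}

simulated = sample_patterns(observed, sizes, random.Random(1))
```

Sampling whole agreement patterns, rather than each identifier separately, is what preserves the observed associations between identifier errors within a stratum.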
Since by design, the original HES-PDS extract only included records that agreed on NHS number, we introduced NHS number identifier error rates representative of those observed in the literature [20, 21]. We used several scenarios to determine the effect of different NHS number error rates on results:
1. NHS number was randomly missing or incorrect in 30% of records.
2. NHS number was randomly missing or incorrect in 0.5% of records.
3. NHS number was missing or incorrect in 30% of records overall, but was twice as likely to contain errors if there were errors in any of the other identifiers (sex, date of birth or postcode).
4. NHS number was missing or incorrect in 30% of records overall, but errors were distributed with the same pattern as errors in ethnicity (as observed in the HES-PDS extract).
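One way to generate errors for the first three scenarios is sketched below. It reads "twice as likely" as a relative risk of 2 and rescales the per-record probabilities so that the expected overall error rate stays at the target; this interpretation, the function names and the toy input are all assumptions of the example. Setting `relative_risk=1.0` gives the completely random scenarios.

```python
import random


def nhs_error_probabilities(has_other_error, overall_rate=0.30, relative_risk=2.0):
    """Per-record probability of an NHS number error.

    Records with an error in sex, date of birth or postcode get
    `relative_risk` times the probability of clean records, scaled so the
    expected overall error rate equals `overall_rate`.
    """
    q = sum(has_other_error) / len(has_other_error)  # share with other errors
    p_clean = overall_rate / (1 + (relative_risk - 1) * q)
    return [p_clean * relative_risk if err else p_clean for err in has_other_error]


def inject_nhs_errors(has_other_error, rng, **kwargs):
    """Draw one True/False NHS number error flag per record."""
    probs = nhs_error_probabilities(has_other_error, **kwargs)
    return [rng.random() < p for p in probs]


# Toy input: 20% of records have an error in another identifier.
flags = [True] * 200 + [False] * 800
probs = nhs_error_probabilities(flags)
errors = inject_nhs_errors(flags, random.Random(42))
```

In this toy input, records with another identifier error receive a probability of 0.5 and the remainder 0.25, which averages to the 30% target.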
For each simulation, records were rank ordered by match weight, and a cut-off threshold for classifying records as matches was chosen by determining the maximum weight or probability that would not exceed a false-match rate of 1% (or 99% specificity). It was possible to fix this threshold since the true match status was known in the simulated data, although this would not be possible in real data.
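A minimal sketch of this cut-off rule is given below. It reads the rule as choosing the lowest threshold whose false-match rate stays within 1%, and takes the false-match rate to be the proportion of true non-matches classified as matches (the reading consistent with 99% specificity). Both readings, and the function name, are assumptions of the example.

```python
def choose_threshold(weights, is_match, max_false_match_rate=0.01):
    """Lowest cut-off keeping the false-match rate at or below the limit.

    Pairs with weight >= cut-off are classified as matches. Assumes at
    least one true non-match. Returns None if no cut-off satisfies the limit.
    """
    n_non_matches = sum(1 for m in is_match if not m)
    best = None
    # Walk down the ranked weights, stopping once the limit is exceeded.
    for cutoff in sorted(set(weights), reverse=True):
        false_matches = sum(
            1 for w, m in zip(weights, is_match) if w >= cutoff and not m
        )
        if false_matches / n_non_matches > max_false_match_rate:
            break
        best = cutoff
    return best


# Toy data: three true matches and 200 true non-matches.
weights = [10.0, 6.0, 4.0] + [5.0] + [0.0] * 199
is_match = [True, True, True] + [False] * 200

cutoff = choose_threshold(weights, is_match)
```

In this toy example the cut-off falls at 4.0: one non-match in 200 is accepted (0.5%), whereas lowering the cut-off to 0.0 would accept every non-match.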
Comparisons
Results from three approaches were averaged over 500 simulated datasets and compared with those from traditional match weights: i) match weights incorporating dependence between identifiers (based on agreement patterns), ii) attribute-specific match weights (based on 36 different combinations of characteristics) and iii) organisational-specific match weights (based on 388 hospitals). We compared sensitivity (i.e. the proportion of true matches that were identified) between methods and compared estimated readmission rates from each method with the ‘true’ readmission rate within 12 months (8.8%) in the simulated data. We assessed the performance of each method by measuring bias, i.e. the percentage difference between estimated and true readmission rates.
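The two performance measures can be written out as below. Bias is computed here as a relative percentage difference, which is one reading of "percentage difference" and is an assumption of the example, as are the function names and the illustrative estimated rate of 8.0%.

```python
def sensitivity(predicted_match, true_match):
    """Proportion of true matches that were classified as matches."""
    true_positives = sum(1 for p, t in zip(predicted_match, true_match) if p and t)
    n_true_matches = sum(1 for t in true_match if t)
    return true_positives / n_true_matches


def percentage_bias(estimated_rate, true_rate):
    """Relative difference between estimated and true rates, in percent."""
    return 100 * (estimated_rate - true_rate) / true_rate


# Toy classification of four record pairs.
sens = sensitivity([True, False, True, False], [True, True, True, False])

# Hypothetical estimate compared with the true readmission rate of 8.8%.
bias = percentage_bias(0.080, 0.088)
```

Missed matches make individuals appear to have fewer admissions than they did, so lower sensitivity would be expected to pull the estimated readmission rate below the true value and give a negative bias.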