 Research article
 Open Access
Estimating parameters for probabilistic linkage of privacy-preserved datasets
BMC Medical Research Methodology volume 17, Article number: 95 (2017)
Abstract
Background
Probabilistic record linkage is a process used to bring together person-based records from within the same dataset (deduplication) or from disparate datasets using pairwise comparisons and matching probabilities. The linkage strategy and associated match probabilities are often estimated through investigations into data quality and manual inspection. However, as privacy-preserved datasets comprise encrypted data, such methods are not possible. In this paper, we present a method for estimating the probabilities and threshold values for probabilistic privacy-preserved record linkage using Bloom filters.
Methods
Our method was tested through a simulation study using synthetic data, followed by an application using real-world administrative data. Synthetic datasets were generated with error rates from zero to 20%. Our method was used to estimate parameters (probabilities and thresholds) for deduplication linkages. Linkage quality was determined by F-measure. Each dataset was privacy-preserved using separate Bloom filters for each field. Match probabilities were estimated using the expectation-maximisation (EM) algorithm on the privacy-preserved data. Threshold cut-off values were determined by an extension to the EM algorithm allowing linkage quality to be estimated for each possible threshold. Deduplication linkages of each privacy-preserved dataset were performed using both estimated and calculated probabilities. Linkage quality using the F-measure at the estimated threshold values was also compared to the highest F-measure. Three large administrative datasets were used to demonstrate the applicability of the probability and threshold estimation technique on real-world data.
Results
Linkage of the synthetic datasets using the estimated probabilities produced an F-measure that was comparable to the F-measure using calculated probabilities, even with up to 20% error. Linkage of the administrative datasets using estimated probabilities produced an F-measure that was higher than the F-measure using calculated probabilities. Further, the threshold estimation yielded results for F-measure that were only slightly below the highest possible for those probabilities.
Conclusions
The method appears highly accurate across a spectrum of datasets with varying degrees of error. As there are few alternatives for parameter estimation, the approach is a major step towards providing a complete operational approach for probabilistic linkage of privacy-preserved datasets.
Background
Record linkage is a process that allows us to gather together person-based records that belong to the same individual. In situations where unique identifiers are not available, personally identifying information such as name, date of birth and address is used to link records from one or more data collections. As administrative collections typically capture information for large portions of the population, the linked data allows researchers to answer numerous health questions for the whole population at relatively low cost.
Privacy-preserving record linkage
Legal, administrative and technical issues can prevent the release of name-identified data for record linkage. New methods have emerged that do not require the release of personally identifying information by data custodians; rather, data custodians use specific encoding processes to transform personally identifying information into a permanently non-identifiable state (an irreversible ‘privacy-preserved’ state). These methods are collectively referred to as privacy-preserving record linkage (PPRL). Under a trusted third party linkage model [1], this operation occurs before the release of any data to record linkage units. Thus, personally identifying information is not disclosed by the data custodian. These PPRL methods can be used within existing record linkage frameworks, and are subject to some of the same challenges [2].
One of the most promising PPRL techniques to emerge is a method which uses Bloom filters in record linkage [3]. A Bloom filter is a probabilistic data structure originally developed to check set membership that can also be used to approximate the similarity of two sets. The ability to provide similarity comparisons on two sets of data is highly desirable for accurate record linkage.
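As a minimal sketch of the idea, a field value can be split into character bigrams, each bigram hashed to several bit positions, and two resulting filters compared with the Sørensen-Dice coefficient. The filter length (1000 bits), the number of hash functions (10) and the use of a split SHA-256 digest for double hashing are illustrative assumptions, not the settings used in this study; the function names are hypothetical.

```python
import hashlib


def bigrams(value):
    """Split a string into padded character bigrams."""
    padded = f" {value.lower()} "
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_filter(value, size=1000, num_hashes=10):
    """Return the set of bit positions switched on for `value`.

    A set of positions is used instead of a bit array to keep the sketch short.
    `size` and `num_hashes` are assumed values.
    """
    bits = set()
    for gram in bigrams(value):
        digest = hashlib.sha256(gram.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:16], "big")
        h2 = int.from_bytes(digest[16:], "big")
        for i in range(num_hashes):
            # Double hashing: derive the i-th position from two base hashes.
            bits.add((h1 + i * h2) % size)
    return bits


def dice(a, b):
    """Sørensen-Dice coefficient of two filters given as sets of bit positions."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


smith = bloom_filter("smith")
smyth = bloom_filter("smyth")
print(dice(smith, smyth))  # similar names share most bigrams, so most bits
```

Because similar strings share bigrams, their filters share bit positions, which is what allows approximate comparison without access to the original value.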
An evaluation of Bloom filters in large-scale probabilistic record linkage has shown high linkage quality (equal to that achieved with unencrypted linkage) with relatively good efficiency [4]. This evaluation utilised single-field Bloom filters as opposed to record-level Bloom filters, where all identifiers are added into a single Bloom filter [5]. One of the outstanding challenges for a practical probabilistic PPRL approach is to accurately estimate parameter settings [4]. Typical methods to estimate parameters involve manually examining small samples of data. In the privacy-preserving case, this data is not available to examine, so alternative parameter estimation methods are required.
Probabilistic record linkage
In probabilistic record linkage, individual records are compared on a pairwise basis. This process makes the number of possible comparisons extremely large for all but small data files. To reduce computation overhead, records are usually only compared if they have information in common i.e. they have the same value in a particular field or set of fields. Known as blocking, this method reduces the computational comparison space. Pairs of records in each block are compared and assessed through comparison of the values in each matching field (e.g. first name, surname, address, etc.). As shown in Fig. 1, each field comparison is assigned a field score, the value of which depends on whether the field value agrees or disagrees. These agreement and disagreement scores (weights) are computed separately for each field. All field scores are then summed to produce a final score. If this score is greater than a set threshold value, the record pair is designated a match. The set of fields used in the linkage are chosen based on characteristics such as completeness, consistency and discriminating power within each dataset. The discriminating power is a measure of entropy, indicating how useful an identifier might be in the record linkage process [6, 7].
In the Fellegi-Sunter model of record linkage [8], the agreement and disagreement scores used in field comparisons are based on the calculation of two specific probabilities, called the m-probability and u-probability [8]. The m-probability is the likelihood of two fields matching if the records belong to the same individual. The u-probability is the likelihood of two fields matching if the records do not belong to the same individual. These two probabilities are converted into agreement and disagreement weights for each field as follows:
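In the standard Fellegi-Sunter formulation, the weights for a field with m-probability m and u-probability u take the form below. Base-2 logarithms are the usual convention and are assumed here.

```latex
\[
w_{\text{agree}} = \log_2\!\left(\frac{m}{u}\right),
\qquad
w_{\text{disagree}} = \log_2\!\left(\frac{1-m}{1-u}\right)
\]
```

For example, with m = 0.9 and u = 0.1, agreement contributes log2(9), about 3.17, and disagreement contributes log2(1/9), about -3.17.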
The Fellegi-Sunter model incorporates a simplifying assumption that the chances of agreement or disagreement for one field are independent of the chances of agreement or disagreement for another field [8]. This independence assumption allows us to calculate agreement and disagreement weights for each field separately. Extensions to the Fellegi-Sunter model have been developed for approximate comparisons, allowing the assignment of a partial weight for partial agreement that lies somewhere between agreement and disagreement [9]. While there are many types of approximate comparisons for various types of data, most deal with the distance between two strings [10,11,12]. To fit these approximate comparisons into a probabilistic model, the distance is converted into a partial weight [13].
Missing values can be problematic in probabilistic record linkage. A comparison involving a missing value is typically treated in one of three ways: it is assigned the disagreement weight, a zero weight, or a separate weight accounted for explicitly. The last option extends the independence assumption to include probabilities for missing values, altering the calculations for weights. Other approaches involve removing the field from matching or even removing the entire record [10, 14].
Parameter estimation
Several methods have been developed to estimate m- and u-probabilities [15, 16]; in practice, most methods are based on investigations around data quality and prior knowledge, such as the iterative refinement procedure [17].
Automated methods for deriving m-probabilities, such as EM (expectation-maximisation) estimation, have been devised [16, 18, 19]. The EM algorithm has the potential to provide accurate estimates for m-probabilities, in some cases outperforming the probabilities obtained via the iterative refinement procedure [13]. Other estimation methods do exist, such as an algebraic solution by Fellegi and Sunter [8] and the IMSL routine ZXSSQ (an implementation of the Levenberg-Marquardt algorithm) [20]; however, these are more sensitive to initial parameters and require adjustment functions to keep estimates within bounds [21]. An extensive analysis of parameter estimation techniques for the Fellegi-Sunter model of linkage has been detailed by Herzog et al. [15].
Determination of the appropriate threshold setting above which to accept record pairs as valid matches typically occurs through manual inspection of record pairs within a range of weight scores [22]. The use of PPRL methods within a probabilistic linkage framework, where only encrypted identifiers are used for linkage, precludes any manual, clerical review and so must rely on alternative, computerised methods to determine the best cut-off values. This ability to correctly estimate parameters is of paramount importance if PPRL techniques are to be practical [4].
In this paper, we present a method for accurately estimating probabilities and an optimal threshold cut-off value that can be applied when using Bloom filters within the Fellegi-Sunter model for record linkage. The work builds on a previous privacy-preserving study, which utilised a probabilistic record linkage framework [4]. In this paper, we evaluate our parameter estimation method in two ways: firstly, in a simulation study using synthetic datasets with varying degrees of error; and secondly, on three large-scale administrative datasets, comparing the resultant linkage quality against the quality achieved using calculated m- and u-probabilities.
Methods
Simulation study using synthetic datasets
A series of synthetic datasets was created for our simulation study. Firstly, a single ‘master’ dataset was created, containing 1 million records, with multiple records belonging to the same individual. This dataset did not contain any missing values or the errors typically seen in administrative data. Then, a series of new datasets was created by taking the error-free master dataset and removing or degrading the quality of particular fields.
The synthetic data was generated using an amended version of the FEBRL data generator [23]. The distribution of duplicate records (how many records pertain to each individual) was based on the distribution found in the Western Australian hospital morbidity data collection. The values found in the master dataset were based on frequency distributions found in the Western Australian population. Each record in the dataset contained first name, middle name, surname, sex, date of birth, address, suburb, and postcode information. Address information was randomly selected from the National Address File, a public dataset containing all valid Western Australian addresses.
Additional ‘corrupted’ datasets were created by modifying the master dataset with a set level of error. In the 1% error file, 1% of field values to be used for linkage were randomly selected to have their values set to missing; a further 1% were randomly selected to have their values corrupted, through the use of typographical errors, misspellings, truncation and replacement of values. In this way, each record could potentially have multiple fields set to missing or corrupted. The same procedure was used to generate a 5% error file, 10% error file and 20% error file. A privacy-preserved version of each dataset was created, using single-field Bloom filters.
Testing using administrative datasets
Three datasets comprising real administrative data (hospital admissions records from New South Wales (NSW), Western Australia (WA) and South Australia (SA)) were used to demonstrate the applicability of the method to real-world data. These datasets have previously been deduplicated to a very high standard using full identifiers. The results of those deduplication linkages are used in this study and act as our ‘truth set’. The information in this ‘truth set’ was not used during the linkage process or the estimation of parameters, but was used only as a standard by which to evaluate our results. This data was made available as part of the Population Health Research Network Proof of Concept 1 project [24].
Privacy-preserved versions of each administrative dataset were created, using single-field Bloom filters, in the same way as the synthetic datasets. Due to the size of these administrative datasets, five samples (each a random 10%) of each privacy-preserved dataset were created, and probabilities were estimated for each sample. A deduplication linkage was performed on each sample and also against the full dataset. The resulting quality was calculated using the ‘truth set’.
Application of Bloom filters
The privacy-preserved versions of the synthetic and administrative datasets were created using Bloom filters. Bloom filters were constructed in line with previous work [3]. An empty (or missing) field in the original datasets was left as empty in the privacy-preserved versions.
Matching strategies used for the datasets were based on the strategies used in a published evaluation of linkage software [25]. Two blocking strategies were used: last name Soundex with first name initial, and date of birth with sex. The matching identifiers included Bloom filters for names, address and suburb, using the Sørensen-Dice coefficient comparison for similarity [3]. Sørensen-Dice coefficient values were converted to partial agreement values using a piecewise linear curve, created using Winkler’s [13] method. All other fields, including blocking variables, which were created at the same time as the Bloom filters, used exact matches on cryptographically hashed values. Missing value comparisons were assigned a zero weight.
Measuring linkage quality
In line with earlier work [3, 26], we used precision, recall and F-measure as our linkage quality metrics. Precision (also known as positive predictive value) measures the proportion of true positive pairs (correct matches) found from all classified matches. Recall (also known as sensitivity) measures the proportion of true positive pairs found from all true matches. Both precision and recall return a score between 0 and 1, with higher scores indicating fewer false positives and fewer false negatives (missed matches) respectively. The F-measure is the harmonic mean of precision and recall, providing a single figure with which we can compare results. Typically, a middle ground is sought between precision and recall, as there is a trade-off between these values. As the probabilistic linkage threshold is increased, the number of false positives decreases (and so precision increases); however, the number of correct matches missed will also increase, leading to a decrease in recall.
The calculations for these metrics are provided below.
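The standard definitions of the three metrics can be sketched as follows, where tp, fp and fn are the counts of true positive, false positive and false negative pairs. The function names are illustrative.

```python
def precision(tp, fp):
    """Proportion of classified matches that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Proportion of true matches that were found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


p = precision(tp=90, fp=10)
r = recall(tp=90, fn=30)
print(p, r, f_measure(p, r))
```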
Estimating m and u probabilities
The EM algorithm has been used to calculate the m-probabilities (m), u-probabilities (u) and the proportion (p) of record pairs that match in probabilistic linkage [21]. It is an iterative algorithm that uses the output values of one iteration as the input to the next. We added two additional variables to the EM algorithm as described by Jaro [21], the missing m-probability and missing u-probability values (denoted by m_m and u_m respectively), to more accurately estimate a single threshold cut-off value (discussed later).
Jaro [21] suggests the algorithm is not particularly sensitive to the starting values (m, u, m_m, u_m, p). However, the starting values for m should be higher than those for u. We thus set an initial value of 0.1 for m_m and u_m, 0.8 for m and 0.1 for u.
Given two files, A and B, we began by iterating through all possible combinations of field comparisons between A and B. The count of each field state combination was tabulated (an example is shown in Table 1). There are, at most, 3^n possible field state combinations for n fields, assuming each field either agrees, disagrees or is missing. The ‘missing’ state occurs when a pairwise comparison involves a missing or empty value.
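The tabulation step might look like the sketch below. It assumes exact comparison of every field (a Bloom filter field would instead need a similarity cut-off to decide agreement), and the state labels, function names and record layout are hypothetical.

```python
from collections import Counter

AGREE, DISAGREE, MISSING = "A", "D", "M"


def field_state(a, b):
    """Classify one field comparison as agree, disagree or missing."""
    if a in (None, "") or b in (None, ""):
        return MISSING
    return AGREE if a == b else DISAGREE


def tabulate(pairs, fields):
    """Count how often each field state combination occurs among record pairs."""
    counts = Counter()
    for rec_a, rec_b in pairs:
        combo = tuple(field_state(rec_a.get(f), rec_b.get(f)) for f in fields)
        counts[combo] += 1
    return counts


pairs = [
    ({"first": "ann", "sex": "F"}, {"first": "ann", "sex": "F"}),
    ({"first": "ann", "sex": "F"}, {"first": "anne", "sex": ""}),
]
print(tabulate(pairs, ["first", "sex"]))
```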
The first part of the EM algorithm is the expectation step. For each field state combination, we calculate recall and false positive rate (fpr). For recall, each agreement in the table is replaced with m, each disagreement with (1 – m_m – m), and each missing with m_m. The product of these is the recall for that field state combination. Similarly, for the fpr, each agreement in the table is replaced with u, each disagreement with (1 – u_m – u) and each missing with u_m. The product of these provides the fpr.
The recall and fpr allow us to calculate the proportion of true matches for each field state combination j:
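In the usual form of the EM expectation step, as described by Jaro [21], this proportion is the posterior probability that a pair showing combination j is a true match. The symbols g_j, r_j and f_j (for the proportion, recall and fpr of combination j) are notation assumed here.

```latex
\[
g_j = \frac{p \, r_j}{p \, r_j + (1 - p) \, f_j}
\]
```

Here p is the current estimate of the proportion of record pairs that match, so the numerator weights the recall by the share of true matches and the denominator adds the corresponding term for false matches.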
The maximisation step involves the calculation of m, u, m_m, u_m and p. The m value for each field is calculated as the ratio of true matches that ‘agree’ for that field to the total true matches. Likewise, the u value for each field is calculated as the ratio of false matches that ‘agree’ for that field to the total false matches. The m_m and u_m values use the ratio of matches that are ‘missing’.
The output values of (m, u, m_m, u_m, p) are then used as the input into the next iteration. Iterations are run until values converge. Convergence will occur when the output values differ only minimally from the input values.
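The two steps can be sketched together as below. This is an illustrative reading of the procedure, not the implementation used in the study: the starting value for p (0.1), the iteration cap, the convergence tolerance and all names are assumptions, while the starting values for m, u, m_m and u_m follow the text.

```python
def em_estimate(counts, n_fields, p=0.1, max_iter=500, tol=1e-9):
    """Estimate m, u, m_m, u_m (per field) and p from field state counts.

    `counts` maps a tuple of states ("A" agree, "D" disagree, "M" missing),
    one per field, to the number of record pairs with that combination.
    """
    m, u = [0.8] * n_fields, [0.1] * n_fields
    mm, um = [0.1] * n_fields, [0.1] * n_fields
    total = sum(counts.values())

    def product(combo, agree, missing):
        prob = 1.0
        for state, a, x in zip(combo, agree, missing):
            prob *= a if state == "A" else x if state == "M" else 1 - x - a
        return prob

    for _ in range(max_iter):
        # Expectation: share of each combination attributed to true matches.
        g = {}
        for combo in counts:
            rec = p * product(combo, m, mm)
            fpr = (1 - p) * product(combo, u, um)
            g[combo] = rec / (rec + fpr) if rec + fpr > 0 else 0.0

        # Maximisation: re-estimate each parameter from the weighted counts.
        true = sum(g[c] * n for c, n in counts.items())
        false = total - true
        if true <= 0 or false <= 0:
            break

        def ratio(state, i, matched):
            hits = sum((g[c] if matched else 1 - g[c]) * n
                       for c, n in counts.items() if c[i] == state)
            return hits / (true if matched else false)

        old = m + u + mm + um + [p]
        fields = range(n_fields)
        m = [ratio("A", i, True) for i in fields]
        u = [ratio("A", i, False) for i in fields]
        mm = [ratio("M", i, True) for i in fields]
        um = [ratio("M", i, False) for i in fields]
        p = true / total
        # Stop once no parameter moves by more than the tolerance.
        if max(abs(a - b) for a, b in zip(old, m + u + mm + um + [p])) < tol:
            break
    return {"m": m, "u": u, "m_m": mm, "u_m": um, "p": p}
```

Note that all four per-field estimates in an iteration are computed from the same expectation-step proportions, so the order in which they are updated does not matter.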
Determining a threshold/cutoff setting
In addition to estimating probabilities for a probabilistic linkage, it is important to specify a threshold value that provides optimal resultant linkage quality.
Using the information generated during the EM step, we can estimate the quality of linkage for every combination of weights between a range of possible threshold values (i.e. using precision, recall and F-measure). However, the table of field state combinations used for the EM step only contains field state combinations that were present in the datasets A and B. The full set of possible combinations is required to calculate a suitable threshold setting. Field state combinations that are not present in the field state combination table were added with a count of zero, and recall and fpr were calculated.
Using the full field state combination set, we calculated the weight for each field state combination. Each agreement entry in the table was replaced with the corresponding agreement weight for that field using m and u calculated by the EM algorithm. Likewise, each disagreement entry was replaced with the disagreement weight for that field using the same m and u. Each ‘missing’ entry was replaced with a weight of zero.
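A sketch of this weight calculation is given below. It assumes the standard base-2 Fellegi-Sunter weights, log2(m/u) for agreement and log2((1 - m)/(1 - u)) for disagreement; whether the missing-value probabilities also enter the disagreement weight is not stated here, so the basic form is used. State labels and names are hypothetical.

```python
import math


def combination_weight(combo, m, u):
    """Sum the field weights for one field state combination.

    `combo` holds one state per field: "A" (agree), "D" (disagree) or
    "M" (missing). `m` and `u` are per-field probabilities.
    """
    total = 0.0
    for state, mi, ui in zip(combo, m, u):
        if state == "A":
            total += math.log2(mi / ui)
        elif state == "D":
            total += math.log2((1 - mi) / (1 - ui))
        # A missing comparison contributes a weight of zero.
    return total


print(combination_weight(("A", "A", "M"), [0.9] * 3, [0.1] * 3))
```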
To estimate precision, recall and F-measure, we calculated the True Positives and False Positives for every field state combination. For these estimations, we required the total True Matches (true positives and false negatives) and False Matches (true negatives and false positives). The total True Matches was estimated as part of the EM algorithm, and thus we used the value calculated in the final iteration of the maximisation step. The total False Matches was re-estimated as the total comparison space less the True Matches.
For a single file deduplication, the total comparison space is:
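For a file of n records in which every record is compared with every other record exactly once, the number of pairwise comparisons is the number of unordered pairs:

```latex
\[
\text{comparison space} = \binom{n}{2} = \frac{n(n-1)}{2}
\]
```

As a worked example, the 1 million record master dataset gives 1,000,000 × 999,999 / 2 = 499,999,500,000 possible pairs.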
To calculate the True Positives and False Positives, we multiplied the recall and false positive rate for each field state combination by the total True Matches and False Matches respectively.
We calculated the True Positives and False Positives for each field state combination so that precision could be estimated. To calculate the precision for a particular threshold, the True Positives and False Positives of every field state combination with a weight above that threshold were summed before precision was estimated.
We did not calculate False Negatives, as these can be derived from the total True Matches (True Positives plus False Negatives) value calculated earlier to estimate recall. To calculate recall for a particular threshold, the True Positives were summed over every field state combination with a weight above that threshold.
As the computation requirements for calculating precision, recall and F-measure are relatively low, we calculated these for all possible weight combinations. With a list of threshold values and corresponding precision, recall and F-measure values, we were able to determine an optimal threshold value for each linkage (i.e. the single threshold score with the highest estimated F-measure).
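The threshold search can be sketched as follows. Each field state combination is represented by its weight, recall and fpr; every distinct weight is tried as a candidate threshold. Treating a combination as accepted when its weight is greater than or equal to the candidate is an assumption of this sketch, as are the names and the toy figures.

```python
def best_threshold(combos, true_matches, false_matches):
    """Return (threshold, estimated F-measure) with the highest F-measure.

    `combos` is a list of (weight, recall, fpr) tuples, one per field state
    combination; `true_matches` and `false_matches` are the estimated totals.
    """
    best = (None, -1.0)
    for threshold in sorted({w for w, _, _ in combos}):
        accepted = [(r, f) for w, r, f in combos if w >= threshold]
        tp = sum(r * true_matches for r, _ in accepted)
        fp = sum(f * false_matches for _, f in accepted)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / true_matches if true_matches else 0.0
        denom = precision + recall
        f_measure = 2 * precision * recall / denom if denom else 0.0
        if f_measure > best[1]:
            best = (threshold, f_measure)
    return best


combos = [(10, 0.80, 0.0001), (2, 0.15, 0.01), (-5, 0.05, 0.9899)]
print(best_threshold(combos, true_matches=1000, false_matches=100000))
```

In the toy data, accepting only the highest-weight combination gives the best balance, because lowering the threshold admits far more false matches than true ones.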
Evaluation of parameter and threshold estimation
For each version of the synthetic datasets, and additionally, for the administrative datasets, probabilities for m and u were estimated together with a threshold cut-off value. The EM algorithm was used to estimate m only for each deduplication linkage. The frequencies used for our EM algorithm were calculated on blocks, and as such, the number of non-matches observed was greatly reduced, thereby introducing an undesirable bias into the EM algorithm’s u estimates [21]. Consequently, we elected to use Jaro’s estimate of the u-probability (calculated on unblocked data), together with the m value estimated by the EM algorithm.
As part of our simulation study, a deduplication linkage was run on each synthetic dataset using this combination of values, and a linkage was also run using calculated m and u probabilities. Optimal threshold values were estimated for both sets of probabilities. The highest F-measure and estimated threshold F-measure were recorded and compared for all synthetic dataset deduplication linkages. Similarly, in our test using real data, deduplication linkages were run on the administrative data; calculated m and u probabilities were obtained using the administrative data ‘truth sets’. The accuracy of the probability estimates on the administrative dataset samples was measured using the root-mean-square error (RMSE), comparing the F-measure obtained from the EM algorithm probabilities with that obtained from calculated probabilities. RMSE was also used to compare the F-measure obtained at the estimated threshold with that which would be obtained at the correctly chosen threshold. The formula used was as follows:
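The standard RMSE is the square root of the mean squared difference between paired values. A short sketch, with illustrative names and made-up F-measure values for five samples:

```python
import math


def rmse(estimated, reference):
    """Root-mean-square error between two equal-length sequences."""
    if len(estimated) != len(reference) or not estimated:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [(e - r) ** 2 for e, r in zip(estimated, reference)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical F-measures: estimated threshold vs. best threshold.
print(rmse([0.991, 0.990, 0.992, 0.989, 0.991],
           [0.993, 0.992, 0.993, 0.992, 0.993]))
```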
Results
Synthetic data
The characteristics of the synthetic datasets are shown in Table 2. As the dataset error rates increase, the number of unique values for each field increases significantly because of the corruption introduced during dataset creation. The discriminating power for each field also increases with the simulated data corruption.
The results from deduplication linkages of the synthetic datasets using calculated probabilities and EM probabilities are shown in Table 3. These results show that the use of EM for probability estimation, combined with our threshold estimation technique, provided linkage quality comparable to the best achievable using calculated probabilities, on data with up to 20% error.
As one would expect, deduplication of the master dataset (without error) produced a perfect result with an F-measure of 1.0 at a threshold of 49 (the sum of all agreement weights for each field). The use of EM estimated m-probabilities produced the same result. However, the estimated threshold value for the master dataset was significantly lower, with a value of 8 for both calculated and estimated probabilities. Note, however, that although this threshold estimate is much lower, it results in just 60 false positives from the entire comparison space, giving an F-measure of 0.9999995.
While it is possible for the threshold to be estimated to one or two decimal places, the use of a whole number here was made for simplicity. It is possible that a better estimate could be made with a finer precision but the differences between thresholds shown here using whole numbers are already negligible.
As Table 3 shows, using our estimation technique, there is a slight decrease in linkage quality as error rates in the data increase (i.e. for 1% error, an F-measure of 0.9979 vs. 0.9979, compared to 20% error with an F-measure of 0.5217 vs. 0.4917). However, even at 10% error, the difference is very small with an F-measure of 0.8443 vs. 0.8436.
Administrative data
The characteristics of the fields in each administrative dataset, such as the number of unique values, missing percentage, and discriminating power, were recorded and are shown in Table 4. The random samples generated for each administrative dataset were highly representative of the full dataset.
Linkage quality from EM estimates
The estimated m- and u-probabilities of the samples reflect the characteristics described above, with negligible differences observed between the samples for each dataset. The estimated probabilities for each dataset are shown in Table 5.
Comparisons of linkages using the calculated probabilities and the EM m-probabilities with estimated u-probabilities are shown in Table 6. The highest F-measure obtained using the estimated probabilities was slightly higher than that achieved using calculated probabilities in all cases.
Accuracy of threshold estimation
The quality of linkage using the F-measure at the estimated threshold is compared to the highest F-measure for each sample, as shown in Table 7. The RMSE values for each dataset were 0.0019 for NSW, 0.0001 for SA and 0.0046 for WA. The estimated threshold value was slightly below the best threshold for each dataset.
Discussion
In our simulation study, the use of the EM algorithm to estimate probabilities for a deduplication linkage produced results comparable to those produced by calculated probabilities, even with synthetic datasets that contained 20% introduced error. Similarly, in our tests using administrative datasets, the probability and threshold estimation technique produced very high-quality linkage results. In comparison to the quality of linkage using calculated probabilities, the probabilities used from the EM algorithm produced linkage quality of the simulation datasets that was comparable to the best possible. However, we found better quality results using estimated probabilities on the real administrative datasets, at least with regard to F-measure. This is a somewhat surprising result, and why this occurred for all three administrative datasets is not entirely clear. A recent analysis of the popular F-measure metric suggests that it may not provide a fair comparison between linkage methods if the selected thresholds produce a different number of predicted matches [27]. This behaviour is one possible explanation for our results, and future work will consider additional metrics for measuring linkage quality. It should be noted that the differences between the linkage quality results were relatively small, and we would not expect this to be the case for datasets of all sizes and quality.
The original unencrypted versions of these datasets had previously been linked by Boyd et al. using probabilities estimated with knowledge of previous linkages and refinement through pilot linkages [24]. The probabilities derived from the EM algorithm produced a higher F-measure for both the NSW (0.996 vs. 0.995) and WA (0.992 vs. 0.990) Bloom filter datasets; data for the unencrypted SA dataset was unavailable. On face value, at least, these results indicate that use of the EM algorithm for probability estimation is a viable option, especially where sampling techniques for estimation are not available due to the privacy-preserved nature of the data.
Our study found that the m-probabilities estimated via the EM algorithm did not necessarily match the calculated m-probabilities for each field; however, there was a general consistency of the m-probabilities across all fields. Both our synthetic datasets and the administrative datasets contained many matches and were thus good candidates for probabilities estimated through the EM algorithm. The EM algorithm is known to perform poorly with datasets that have too few matches [15]. Being able to identify and address this issue for privacy-preserved data will require further research.
Our threshold estimation technique also returned very good linkage quality, with a resulting F-measure that consistently approached the best F-measure achievable given the probabilities used. To our knowledge, no alternative method of estimating thresholds exists for use with privacy-preserved data. Without the ability to provide any manual review post-linkage, it is important to be able to estimate a single accurate threshold cut-off value. As such, this technique should be considered for use with Bloom filters for probabilistic linkage.
The threshold values estimated in our study were consistently higher than the optimum threshold when using the calculated probabilities, with fewer false positives and more false negatives returned in each of the linkages (with the exception of the ‘perfect’ synthetic dataset). Interestingly, we found the opposite to be true when using the estimated probabilities, with a consistently lower threshold. Additional simulation studies may help to understand this effect and improve the estimation accuracy. This effect may be a result of the blocking technique used to gather field state combinations and the similarities in the estimation methods for both probabilities and threshold. Although it may be possible to adjust for this underestimation, an advantage of using a lower threshold is that alternative approaches can be implemented which specifically target false positive matches. It may be possible to run automated clerical review procedures on the results, such as graph theory techniques, to find and correct false positive errors [28]. The effectiveness of these techniques on privacy-preserved data is unknown, however.
Future research will examine the use of the EM algorithm on composite Bloom filters. While single-field Bloom filters provide excellent quality with probabilistic linkage, they may not provide a sufficient level of privacy for some stakeholders. As such, the use of composite Bloom filters may be necessary. Row-level Bloom filters would not be viable; at least two fields are required for probabilistic record linkage. However, multiple Bloom filters comprising two or three fields may function sufficiently. The use of the EM algorithm and the threshold estimation technique on Bloom filters comprising two or more fields is untested, and more research into the performance of the EM algorithm on data containing composite fields is warranted.
Finally, it is worth noting that the EM algorithm and threshold estimation technique described in this paper have wider application and could be used for any probabilistic linkage (encrypted and unencrypted), not just Bloom filters for PPRL. Provided the datasets being linked have sufficient matches, the estimation technique will produce optimal m-probabilities and a suitable threshold cut-off for the linkage. The u-probabilities can be estimated using Jaro’s estimation method. Unencrypted linkages would benefit from this technique as well, providing a strong empirical foundation from which to build a robust linkage strategy.
Conclusions
Previous evaluations have shown that privacy-preserving record linkage can be as accurate as traditional unencoded linkage. An important element in developing a practical probabilistic privacy-preserving approach is to determine how to appropriately set parameters without recourse to manual inspection or prior knowledge of data. As we have shown, use of the EM algorithm and our threshold estimation technique provides a robust method of estimating parameters for probabilistic linkage of Bloom filter datasets. This method appears highly accurate on datasets with varying error levels. Further testing is required on real-world datasets with poorer quality data and on datasets with fewer potential matches. The ability of these techniques to produce consistently accurate results on a variety of data will determine whether they are viable in an operational setting.
Abbreviations
EM: Expectation-maximisation
FPR: False positive rate
NSW: New South Wales
PPRL: Privacy-preserving record linkage
RMSE: Root mean square error
SA: South Australia
WA: Western Australia
References
1. Vatsalan D, Christen P, Verykios VS. A taxonomy of privacy-preserving record linkage techniques. Inf Syst. 2013;38(6):946–69.
2. Brown AP, Ferrante AM, Randall SM, Boyd JH, Semmens JB. Ensuring privacy when integrating patient-based datasets: new methods and developments in record linkage. Front Pub Health. 2017;5:34.
3. Schnell R, Bachteler T, Reiher J. Privacy-preserving record linkage using Bloom filters. BMC Med Inform Decis Making. 2009;9(1):41.
4. Randall SM, Ferrante AM, Boyd JH, Bauer JK, Semmens JB. Privacy-preserving record linkage on large real world datasets. J Biomed Inform. 2014;50:205–12.
5. Schnell R, Bachteler T, Reiher J. A Novel Error-Tolerant Anonymous Linking Code. In: Working Paper Series No WP-GRLC-2011-02. Nürnberg: German Record Linkage Center; 2011.
6. Basharin GP. On a Statistical Estimate for the Entropy of a Sequence of Independent Random Variables. Theory Probab Applic. 1959;4:333–6.
7. Wajda A, Roos LL. Simplifying Record Linkage: Software and Strategy. Comput Biol Med. 1987;17(4):239–48.
8. Fellegi I, Sunter A. A Theory for Record Linkage. J Am Stat Assoc. 1969;64:1183–210.
9. DuVall SL, Kerber RA, Thomas A. Extending the Fellegi-Sunter probabilistic record linkage method for approximate field comparators. J Biomed Inform. 2010;43:24–30.
10. Christen P. Data matching: concepts and techniques for record linkage, entity resolution, and duplicate detection. Berlin/Heidelberg: Springer Science & Business Media; 2012.
11. Winkler WE. Preprocessing of lists and string comparison. Rec Linkage Tech. 1985;985:181–7.
12. Thibaudeau Y. Fitting log-linear models when some dichotomous variables are unobservable. In: Proceedings of the Section on statistical computing: 1989; 1989. p. 283–8.
13. Winkler WE. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. Paper presented at the Annual ASA Meeting in Anaheim. Washington: Statistical Research Division, U.S. Bureau of the Census; 1990.
14. Ong TC, Mannino MV, Schilling LM, Kahn MG. Improving record linkage performance in the presence of missing linkage data. J Biomed Inform. 2014;52:43–54.
15. Herzog TN, Scheuren FJ, Winkler WE. Data quality and record linkage techniques. Springer Science & Business Media; 2007.
16. Winkler WE. Using the EM algorithm for weight computation in the Fellegi-Sunter model of record linkage. In: Proceedings of the Section on Survey Research Methods, American Statistical Association: 1988; 1988. p. 671.
17. Newcombe HB, Kennedy JM, Axford SJ, James AP. Automatic Linkage of Vital Records. Science. 1959:954–9.
18. Grannis SJ, Overhage JM, Hui S, McDonald CJ. Analysis of a probabilistic record linkage technique without human review. Am Med Inform Assoc. 2003:259–63.
19. Bauman G John Jr. Computation of Weights for Probabilistic Record Linkage using the EM Algorithm (Masters Thesis). Available from All Theses and Dissertations (Paper 746): Brigham Young University; August 2006.
20. Inc IMaSL. User's manual: IMSL library: problem solving software system for mathematical and statistical FORTRAN programming, Ed. 9.2, rev edn: IMSL; 1984.
21. Jaro MA. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. J Am Stat Assoc. 1989;84(406):414–20.
22. Gill L. Methods for automatic record matching and linkage and their use in national statistics. In: National Statistics Methodological Series No 25. Office for National Statistics; 2001.
23. Christen P, Pudjijono A. Accurate synthetic generation of realistic personal information. Adv Knowl Discov Data Min. 2009;5476:507–14.
24. Boyd JH, Randall SM, Ferrante AM, Bauer JK, McInneny K, Brown AP, Spilsbury K, Gillies M, Semmens JB. Accuracy and completeness of patient pathways–the benefits of national data linkage in Australia. BMC Health Serv Res. 2015;15(1):312.
25. Ferrante A, Boyd J. A transparent and transportable methodology for evaluating Data Linkage software. J Biomed Inform. 2012;45(1):165–72.
26. Randall S, Ferrante A, Boyd J, Semmens J. The effect of data cleaning on data linkage quality. BMC Med Inform Decis Making. 2013;13(64):e1.
27. Hand D, Christen P. A note on using the F-measure for evaluating record linkage algorithms. Stat Comput. 2017:1–9.
28. Randall SM, Boyd JH, Ferrante AM, Bauer JK, Semmens JB. Use of graph theory measures to identify errors in record linkage. Comput Methods Prog Biomed. 2014;115(2):55–63.
Acknowledgements
The project acknowledges the support of data custodians and data linkage units who provided access to the jurisdictional data.
Funding
Data for the project was provided as part of a Population Health Research Network (PHRN) ‘Proof of Concept’ collaboration which included the development and testing of linkage methodologies. The PHRN is supported by the Australian Government National Collaborative Research Infrastructure Strategy and Super Science Initiatives. AB has also been supported by an Australian Government Research Training Program Scholarship.
Availability of data and materials
The data that support the findings of this study are available from state data linkage units in NSW, SA and WA, but restrictions apply to the availability of these data, which were used under agreement with data custodians, and so are not publicly available.
Author information
Contributions
AB, SR and JB designed the study. AB performed the evaluation and analysed the data. AB and SR wrote the first draft of the manuscript. SR, AF, JS and JB critically reviewed the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Ethical approval for developing and refining linkage methodology, which includes the parameter estimates for probabilistic linkage of privacy-preserved datasets, was obtained from Curtin University Human Research Ethics Committee (Reference: HR 15/2010), with further approval from South Australia Department of Health and Ageing Human Research Ethics Committee (Reference: HREC 511/03/2015), New South Wales Cancer Institute Human Research Ethics Committee (HREC/10/CIPHS/37) and Western Australian Department of Health Human Research Ethics Committee (HREC/2009/54). Ethics approval included a waiver of consent based on the criteria in the National Statement on Ethical Conduct in Human Research.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Brown, A.P., Randall, S.M., Ferrante, A.M. et al. Estimating parameters for probabilistic linkage of privacy-preserved datasets. BMC Med Res Methodol 17, 95 (2017). https://doi.org/10.1186/s12874-017-0370-0
Keywords
 Record linkage
 Probabilistic
 Privacy
 Data quality
 Linkage quality