Data cleaning and management protocols for linked perinatal research data: a good practice example from the Smoking MUMS (Maternal Use of Medications and Safety) Study



Data cleaning is an important quality assurance step in data linkage research. This paper presents the data cleaning and preparation process for a large-scale cross-jurisdictional Australian study (the Smoking MUMS Study) that evaluates the utilisation and safety of smoking cessation pharmacotherapies during pregnancy.


Perinatal records for all deliveries (2003–2012) in the States of New South Wales (NSW) and Western Australia were linked to State-based data collections including hospital separation, emergency department and death data (mothers and babies) and congenital defect notifications (babies in NSW) by State-based data linkage units. A national data linkage unit linked pharmaceutical dispensing data for the mothers. All linkages were probabilistic. Twenty-two steps assessed the uniqueness of records and consistency of items within and across data sources, resolved discrepancies in the linkages between units, and identified women having records in both States.


State-based linkages yielded a cohort of 783,471 mothers and 1,232,440 babies. Likely false positive links relating to 3703 mothers were identified. Corrections of baby’s date of birth and age, and parity were made for 43,578 records while 1996 records were flagged as duplicates. Checks for the uniqueness of the matches between State and national linkages detected 3404 ID clusters, suggestive of missed links in the State linkages, and identified 1986 women who had records in both States.


Analysis of content data can identify inaccurate links that cannot be detected by data linkage units that have access to personal identifiers only. Perinatal researchers are encouraged to adopt the methods presented to ensure quality and consistency among studies using linked administrative data.


The linkage of routinely collected perinatal and other administrative health data has broadened the scope of maternal and child health research as it enables researchers to establish and follow-up large samples or whole populations and ascertain multiple factors for risk adjustment [1]. Linking perinatal to pharmaceutical dispensing data offers a valuable approach for pharmacovigilance and examination of medication safety in pregnancy [2] given ethical concerns about including pregnant women in clinical trials [3] and bias associated with voluntary reporting to post-market pharmaceutical surveillance systems [4].

In many countries, including Australia, unique individual identifiers are not available across all of the administrative data collections relevant to perinatal research. In this situation, probabilistic linkage methods are used to link individuals’ records [5, 6], but probabilistic linkage is not perfect [7]. Previous studies have reported that the sensitivity of probabilistic linkage (i.e. the proportion of true matches that are linked) ranges from 74 to 98%, and the specificity (i.e. the proportion of true non-matches that are left unlinked) ranges between 99 and 100% [8]. False and missed matches can introduce bias and affect the validity of research findings. While data linkage units aim to improve the quality of linkage, there is a growing consensus that data cleaning (i.e. detecting, diagnosing, and editing data anomalies) [9] and proper documentation are essential aspects of quality assurance [9,10,11]. The RECORD Statement recommends that observational studies using routinely collected health data should provide information on the process and quality of linkage and data cleaning [10]. Furthermore, systematic checks have the potential to improve the quality of future linkages by providing feedback to data linkage units.

In studies that involve cross-jurisdictional linkages, additional data cleaning considerations are required. Australia has a federated health care system with delivery and administration of services being the responsibility of either States/Territories (e.g. hospital services) or the Federal government (e.g. subsidised pharmaceuticals). In this setting, cross-jurisdictional linkage brings together diverse and rich data sources, enabling national-level research studies [12]. Cross-jurisdictional linkage performed by different data linkage units, however, is subject to discrepancies resulting from variations in the use of personal identifiers, techniques for constructing linkage keys and quality assurance policies. Consistency checks, therefore, are vital before merging records from different States.

Cleaning linked data is a complex process that requires thorough planning and knowledge of data collection methodologies and the validity of the data items. While there are existing frameworks and checklists for data cleaning [9, 11], there is little literature that describes how to systematically examine the consistency of content data in linked perinatal records [13], or how to identify and resolve disparities arising from cross-jurisdictional linkages. Additionally, researchers rarely provide their coding syntax, making it difficult to replicate their data cleaning procedures. This paper presents a series of steps for assessing data consistency and cleaning in the Smoking MUMS (Maternal Use of Medications and Safety) Study [14], which involves the linkage of perinatal records from two Australian States, New South Wales (NSW) and Western Australia (WA), to national Pharmaceutical Benefits Scheme (PBS) claims data. Exemplar documentation and SAS code presented in the paper can be adopted in similar studies.


Study design and data sources

The Smoking MUMS Study is an observational cohort study including all women who delivered in NSW and WA between 1 January 2003 and 31 December 2012, and their babies. For mothers, perinatal records (i.e. the mother’s deliveries, including pre-2003 records) were linked to hospital separations (i.e. hospital discharge), emergency department (ED) attendances, death, and pharmaceutical claims records. For babies, perinatal records (i.e. the baby’s birth) were linked to hospital, ED and death data. Congenital defect notifications were included in the linkage for babies born in NSW (Fig. 1). New South Wales is Australia’s most populous State with more than 7.5 million residents, while WA has a population of 2.6 million [15]. Table 1 describes the data collections used in the study.

Fig. 1 Data linkage and examples of data set layouts

Table 1 Descriptions of data sets

Data linkage

All the linkages for the Smoking MUMS Study used probabilistic linkage methods and a privacy preserving approach [16,17,18]. Specifically, personal identifiers were separated from health information, with the data linkage units receiving personal identifiers only (i.e. no health information) and encrypted record IDs from the data custodians. The linkage units assigned a project-specific person number to all records that belonged to the same person and returned these person numbers and encrypted record IDs to the respective data custodians who released the approved research variables together with the person numbers (i.e. no personal identifiers) to the researchers [16,17,18].

In NSW, the Centre for Health Record Linkage (CHeReL) has established a Master Linkage Key to routinely link the Perinatal Data Collection with the other NSW data collections (Table 1), except the Register of Congenital Conditions which was specifically linked for NSW babies in this project. Likewise, the WA Data Linkage Branch (WA DLB) regularly links the Midwifery Notification System to the other WA data collections (Table 1). The Master Linkage Keys in NSW and WA are regularly updated and assessed via robust quality assurance procedures. The false positive rates for NSW and WA were estimated to be 0.3 and 0.05% respectively [19, 20]. Once the linkages for mother and baby cohorts were finalised, the CHeReL and WA DLB created a Project Person Number for each mother (mumPPN) and each baby (babyPPN, mapped to mumPPN).

In Australia, records of claims for pharmaceutical dispensing processed by the Federal government PBS are not routinely linked to State-based health records. For this study, the PBS data custodian assigned a project-specific Patient Identification Number (PATID) to each woman who had claim records and provided PATIDs and personal details to the Australian Institute of Health and Welfare (AIHW) Integration Services Centre, while CHeReL and WA DLB provided the list of mumPPNs and identifiers (Fig. 1). The AIHW conducted probabilistic linkages based on personal identifiers and assigned weights (i.e. degree of similarity between the pairs of records, higher weights indicating greater similarity) to matches between PATIDs and PPNs. Based on AIHW clerical reviews, the recommended threshold for accepting matches was 29.0 for NSW mumPPNs (link rate 99.43%, link accuracy 98.62%) and 28.0 for WA mumPPNs (link rate 99.02%, link accuracy 98.65%) [21]. Separate mapping tables for each State, including any PATID-PPN matches with weight ≥ 17, were released to researchers, as were separate files containing claims records relating to PATIDs that were included in the mapping tables (Table 1, Fig. 1). The release of claims records for matches with weights lower than the recommended threshold allows for sensitivity analyses in which different thresholds are used.
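The State-specific thresholds described above can be illustrated with a minimal Python sketch. It is not the SAS code from Additional file 1; the `Link` record layout, the function name and the example rows are hypothetical, and only the threshold values (29.0, 28.0 and the release floor of 17) come from the text.

```python
from dataclasses import dataclass

# Recommended acceptance thresholds reported in the text (AIHW clerical review).
RECOMMENDED_THRESHOLD = {"NSW": 29.0, "WA": 28.0}
# Lowest weight included in the released mapping tables.
RELEASE_FLOOR = 17.0


@dataclass(frozen=True)
class Link:
    """One row of a PATID-PPN mapping table (hypothetical layout)."""
    patid: str     # PBS Patient Identification Number
    mum_ppn: str   # State-based Project Person Number for the mother
    state: str     # "NSW" or "WA"
    weight: float  # probabilistic match weight


def accepted_links(links, thresholds=RECOMMENDED_THRESHOLD):
    """Keep links whose weight meets the threshold for their State."""
    return [link for link in links if link.weight >= thresholds[link.state]]


# Hypothetical rows, for illustration only.
mapping_table = [
    Link("P1", "N001", "NSW", 31.2),
    Link("P2", "N002", "NSW", 28.5),  # below the NSW threshold of 29.0
    Link("P3", "W001", "WA", 28.5),   # meets the WA threshold of 28.0
    Link("P4", "W002", "WA", 17.0),   # released, but below the recommended threshold
]

main_analysis = accepted_links(mapping_table)
# Sensitivity analysis: accept every released link (weight >= 17).
sensitivity = accepted_links(
    mapping_table, {"NSW": RELEASE_FLOOR, "WA": RELEASE_FLOOR}
)
```

Passing the thresholds as a parameter keeps the main analysis and any sensitivity analysis on the same code path, so only the cut-off differs between them.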

Steps to check consistency of State-based data

Prior to the assessment of data consistency, all data sets were examined to make sure that all variables and associated data dictionaries were delivered as expected, and that the numbers of persons and records were in accordance with reports provided by the data linkage units. The mother’s hospital separation record and the child’s hospital separation record that correspond to the delivery of the mother and the birth of the child were carefully identified based on previously reported methods [6]. The range of data values, distribution by year and missing values were explored for all variables. Data items that underwent historical changes (as per data dictionaries or the published midwife notification forms) were examined to determine whether the distribution of the data was consistent with the documented changes (results not shown).

Consistency of State-based data was assessed through a series of steps (Fig. 2 and Table 2).

  • Steps 1 to 3 examined the uniqueness of records.

  • Steps 4 to 8 checked the consistency within and across pregnancies based on perinatal data items, including baby date of birth (DOB), parity, pregnancy plurality, birth order, gestational age, and birthweight. These variables were used because previous validation studies have reported high levels of accuracy in their recording [22, 23]. Parity was defined as the number of previous pregnancies ≥20 weeks and numerically coded (e.g. 0, 1, 2, 3). Plurality assigned pregnancies as single or multiple-fetus (coded as singleton, twins, triplets, quadruplets, etc.) while birth order indicated the order each baby was born (coded as 1st, 2nd, 3rd, etc.). Plural pregnancies generated more than one perinatal record which contained the same maternal information but baby-specific information, including order of birth. Gestational age was defined as number of completed weeks of gestation. Date of conception was calculated (baby DOB – completed weeks of gestation × 7 + 14 days).

  • Steps 9 to 16 assessed the consistency of information across data sources, including consistency between unique events (birth, death) and episodes of health service use. These steps capitalised on the availability of the same information (e.g. baby DOB, interchangeably date of delivery, mother’s month and year of birth) in multiple data sets and validity of these variables [22, 23].
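The date of conception formula given in the list above (baby DOB minus completed weeks of gestation × 7 days, plus 14 days) can be written as a short Python function. This is an illustrative sketch only; the function name is an assumption and the example date is hypothetical.

```python
from datetime import date, timedelta


def conception_date(baby_dob: date, gestation_weeks: int) -> date:
    """Estimate date of conception as described in the text:
    baby DOB - (completed weeks of gestation x 7 days) + 14 days."""
    return baby_dob - timedelta(days=gestation_weeks * 7) + timedelta(days=14)


# Hypothetical example: a baby born on 7 October 2010 at 40 completed weeks.
example = conception_date(date(2010, 10, 7), 40)  # 14 January 2010
```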

Fig. 2 Summary of data cleaning steps and results

Table 2 Steps undertaken to assess consistency of State-based data

On-screen scrutiny of relevant records was undertaken (as indicated in Table 2) when multiple entries of the same death (Step 1) or birth (Step 3) were suspected (i.e. partial duplicates), using additional information (e.g. demographic details, birthweight, Apgar scores, delivery hospital, hospital diagnoses and discharge status). Manual review of these records was time efficient because inconsistencies were found in a small number of cases.
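One way to shortlist suspected partial duplicates for this kind of on-screen review is to group records on a small set of key fields and list any group with more than one record. The sketch below is an assumption about how such a shortlist could be built, not the procedure used in the study; the field names and the choice of key (mother's person number, baby DOB and birth order) are hypothetical.

```python
from collections import defaultdict


def candidate_duplicates(records, key_fields=("mum_ppn", "baby_dob", "birth_order")):
    """Group records on key_fields and return the groups that contain more
    than one record. These are candidates for manual review only; they are
    not deleted or altered here."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[field] for field in key_fields)
        groups[key].append(rec["record_id"])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Hypothetical perinatal records: r1 and r2 describe the same birth,
# r3 is the second twin of the same pregnancy, r4 is another mother.
records = [
    {"record_id": "r1", "mum_ppn": "N001", "baby_dob": "2008-05-02", "birth_order": 1},
    {"record_id": "r2", "mum_ppn": "N001", "baby_dob": "2008-05-02", "birth_order": 1},
    {"record_id": "r3", "mum_ppn": "N001", "baby_dob": "2008-05-02", "birth_order": 2},
    {"record_id": "r4", "mum_ppn": "N002", "baby_dob": "2010-01-15", "birth_order": 1},
]

to_review = candidate_duplicates(records)
```

Additional fields such as birthweight or Apgar scores would then be compared by eye, as described in the text, before any record is flagged as a duplicate.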

Identified inconsistencies were categorised as person-level or record-level. Person-level inconsistencies suggest likely false positive links and the persons were flagged for “exclusion” from future data analyses. Examples include a woman who conceived a second child before delivering her first child (Step 6) or had a baby after a total hysterectomy procedure (Step 13). In some cases, errors were identified for a child (e.g. date of admission later than date of death) while no inconsistencies were identified for the mother. For those cases, the mother and records for all of her children were flagged for “exclusion”.
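The first person-level example (a pregnancy conceived before the previous delivery) and the propagation of the “exclusion” flag from a mother to all of her children can be sketched in Python as follows. The function names and input structures are assumptions, the example data are hypothetical, and the input is taken to hold one entry per pregnancy (i.e. after one record has been selected for each plural pregnancy).

```python
from datetime import date, timedelta


def conception_date(baby_dob, gestation_weeks):
    """Baby DOB - completed weeks of gestation x 7 days + 14 days."""
    return baby_dob - timedelta(days=gestation_weeks * 7) + timedelta(days=14)


def has_overlapping_pregnancies(pregnancies):
    """pregnancies: list of (baby_dob, gestation_weeks), one per pregnancy.
    True if any pregnancy was conceived before the previous delivery."""
    ordered = sorted(pregnancies)
    for (prev_dob, _), (next_dob, next_weeks) in zip(ordered, ordered[1:]):
        if conception_date(next_dob, next_weeks) < prev_dob:
            return True
    return False


def mothers_to_exclude(pregnancies_by_mother):
    """Mothers with a chronologically impossible sequence of pregnancies."""
    return {
        mum for mum, pregnancies in pregnancies_by_mother.items()
        if has_overlapping_pregnancies(pregnancies)
    }


def babies_to_exclude(excluded_mothers, baby_to_mother):
    """The flag applies to every child of an excluded mother."""
    return {baby for baby, mum in baby_to_mother.items() if mum in excluded_mothers}


# Hypothetical data: mother A has two births five months apart.
pregnancies_by_mother = {
    "A": [(date(2005, 3, 1), 40), (date(2005, 8, 1), 38)],
    "B": [(date(2005, 3, 1), 40), (date(2007, 1, 1), 40)],
}
baby_to_mother = {"b1": "A", "b2": "A", "b3": "B", "b4": "B"}

excluded_mums = mothers_to_exclude(pregnancies_by_mother)
excluded_babies = babies_to_exclude(excluded_mums, baby_to_mother)
```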

Findings including duplicates, missing data, invalid data or likely typographical errors, and where date of admission was later than date of discharge were considered random and at record-level. Duplicates were flagged, and missing or typographical errors were corrected if plausible. Hospital separation and ED records found to contain inconsistent dates of birth, admission and discharge (Steps 9, 14 and 16) were flagged for “deletion”. Inconsistencies for which no changes were made were quantified and documented for consideration in specific analyses.
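A record-level date check of this kind can be sketched as below. The rule used here (date of birth ≤ admission ≤ discharge) is an assumption chosen for illustration and is simpler than the checks in Steps 9, 14 and 16; the field names are hypothetical. The function returns flagged copies and leaves the input records untouched.

```python
from datetime import date


def add_date_flags(records):
    """Return copies of the records with a 'date_flag' field added.
    A record is flagged for 'deletion' when admission precedes the date
    of birth or follows the discharge date; otherwise the flag is empty."""
    flagged = []
    for rec in records:
        inconsistent = (
            rec["admission"] < rec["dob"] or rec["admission"] > rec["discharge"]
        )
        flagged.append({**rec, "date_flag": "deletion" if inconsistent else ""})
    return flagged


# Hypothetical hospital separation records for one baby.
hospital_records = [
    {"id": "h1", "dob": date(2009, 6, 1), "admission": date(2009, 6, 1),
     "discharge": date(2009, 6, 4)},
    {"id": "h2", "dob": date(2009, 6, 1), "admission": date(2009, 7, 10),
     "discharge": date(2009, 7, 8)},   # admitted after discharge
    {"id": "h3", "dob": date(2009, 6, 1), "admission": date(2009, 5, 20),
     "discharge": date(2009, 5, 22)},  # admitted before birth
]

flagged_records = add_date_flags(hospital_records)
```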

At the completion of each step, new variables were created and merged into the original data sets rather than deleting records or overwriting data values; this allowed the original data content to remain unmodified. For efficiency, decisions reached through each cleaning step were applied before undertaking the subsequent step (e.g. removal of duplicates and the use of corrected birth order to select one record per pregnancy).

Steps to check cross-jurisdictionally linked data

Table 3 and Fig. 2 present steps (17 to 22) to resolve discrepancies in the linkage performed by different linkage units and assess validity of apparent cross-State links. Specifically, cases where a PBS PATID matched to multiple mumPPNs were detected and sent to the AIHW linkage unit for review, through which clusters of mumPPNs (i.e. records likely to belong to the same woman) were identified (Step 17) and assessed for person-level consistency (Step 19). Step 20 examined consistency among records for women who had records in both States. Following the creation of the variable finalPPNmum (Step 21) to integrate mother’s records, consistency was checked for finalPPNmums that had multiple PATIDs (Step 22).
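The grouping logic behind clusters of mumPPNs and an integrated person number can be illustrated with a small union-find sketch in Python. This is an assumption about one possible implementation, not the study's SAS code: in the study, clusters were identified through AIHW review and checked for person-level consistency (Step 19) before integration. The rule of using the smallest mumPPN in a cluster as the integrated ID, and all example IDs, are hypothetical.

```python
def build_final_ppn(links):
    """links: iterable of (patid, mum_ppn) pairs.
    mumPPNs that share a PATID are placed in the same cluster.
    Returns {mum_ppn: final_ppn_mum}, where the final ID is the smallest
    mumPPN in the cluster (an arbitrary choice made for this sketch)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller ID as the cluster representative.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    first_ppn_for_patid = {}
    for patid, ppn in links:
        find(ppn)  # register the mumPPN even if it stays a singleton
        if patid in first_ppn_for_patid:
            union(first_ppn_for_patid[patid], ppn)
        else:
            first_ppn_for_patid[patid] = ppn

    return {ppn: find(ppn) for ppn in parent}


# Hypothetical mapping rows: PATID P1 matches an NSW and a WA mumPPN,
# and P3 ties W009 to W010, so N001, W009 and W010 form one cluster.
links = [
    ("P1", "N001"), ("P1", "W009"),
    ("P2", "N002"),
    ("P3", "W009"), ("P3", "W010"),
]
final_ppn = build_final_ppn(links)
```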

Table 3 Extract recommended PBS links and steps undertaken to check cross-jurisdictional linkage

All analyses were performed in SAS 9.3. Sample SAS code is provided in Additional file 1.


The checks for consistency of State-based data (Table 2) suggested false links for 703 women in NSW (0.12%) and 90 women in WA (0.05%), and flagged these women for “exclusion”. Corrections were made in 2062 perinatal records for variables including birth order (10 records), parity (1379 records), and baby date of birth (673 records) and in 41,516 hospital separation and ED records for baby’s age.

Assessing cross-jurisdictional links (Table 3), Step 19 flagged an additional 149 mumPPNs for “exclusion” and confirmed 3323 clusters of mumPPNs, while Step 20 identified 1986 women who had records in both States. Records of these mumPPN clusters and cross-State mothers were integrated through the construction of the variable finalPPNmum (Step 21), which was used as the new person number for the mothers. The last step further identified 2763 finalPPNmums for “exclusion”, bringing the total number of women flagged for “exclusion” from future data analyses to 3705. The final cohort included 774,449 women and 1,225,341 babies born between 2003 and 2012. In this cohort, about 4.6% of women had an expected number of pregnancies greater than the number of deliveries recorded in the perinatal data, suggestive of additional births elsewhere, and 4.5% had likely errors in the recording of parity. In 1838 cases, finalPPNmums were matched to two or more PBS PATIDs.

From the original mapping tables (shown in Table 1), 625,972 PBS links with weight ≥ recommended threshold were extracted and among those, 16,138 matches (2.6%) were further disregarded (Table 3). For the remaining 608,834 matches, 14,212,875 claims records were subset for the final mother cohort.


In this perinatal cross-jurisdictional data linkage study, we developed a series of steps to identify, and where appropriate, correct inconsistent data values. The methods were based on standard and reliable content data items [22, 23], and thus can be adopted in other perinatal research. The methods included a stepwise approach to resolving disparities in linkage performed by different linkage units and identifying women who had records in more than one State, for whom integration of records is required for analyses.

Data errors are commonly detected incidentally during statistical analyses or interpretation of results, leading to inefficient checking of data and repeating analyses [9, 11] and, potentially, lack of reproducibility of results if ad-hoc or undocumented data edits are made. We found inconsistencies that were indicative of false positive links and clusters of women’s IDs which suggest missed State-based links. These findings were fed back to the State-based data linkage units for further examination and rectification prior to future linkages, conferring benefits for other data users. Researchers play an important role in contributing to quality assurance, through systematic assessment of data consistency, given that content data have not traditionally been accessible to data linkage units under the “best practice” protocol [16,17,18]. The detection of the probable missed links improved data completeness, matching a further 448 perinatal records to records of maternal hospital admission for the delivery. Assessment of the consistency of the recording of parity identified women who might have additional births elsewhere (4.6%) and who had likely errors in the recording of parity (4.5%). Obstetric history is particularly important for longitudinal analyses or evaluation of interventions or exposures in the period between pregnancies.

In this study, the proportion of NSW women who were flagged for “exclusion” was lower than the false positive rate estimated by the data linkage unit in NSW (0.12% vs. 0.3%), while for WA women these proportions were similar (0.05% vs. 0.05%). This study was unable to examine the characteristics of the unlinked perinatal records, although previous studies have reported that unmatched records might have different maternal and pregnancy characteristics compared to fully linked records [6, 7, 24]. Limitations in the data cleaning methods should also be acknowledged. Assessment of parity was less likely to detect link errors among women with fewer perinatal records, and the cut-off to flag “exclusion” due to inconsistencies in parity and mother YOB was based on a conservative decision. Given the discrepancy in baby date of birth found in 667 perinatal records (0.05% of the babies), birth registrations as an additional data source would potentially be helpful in assessing these discrepancies. Following the checks for clusters of mumPPNs within a PBS PATID (Step 19), an anomaly in the opposite direction (i.e. clusters of PBS PATIDs within a finalPPNmum) was present among 1838 cases (Step 22). For these women, the recording of parity, month and year of birth were consistent but no further checks using dispensing data were performed. Checking the consistency of clinical information against medicines dispensed was deemed inappropriate given that maternal morbidities recorded in the perinatal, hospital and ED data might not require pharmacotherapy. Furthermore, our PBS data extract did not contain records for all medicines, nor did the PBS data contain records for all subsidised medicines dispensed (i.e. prior to April 2012 only subsidised medicines dispensed to social security beneficiaries were captured completely) [25]. The presence of more than one identifier in the PBS data suggests that more pharmaceutical dispensing will be attributed to these women, perhaps inappropriately; hence sensitivity analyses excluding these women should be considered.

The data cleaning process outlined in this manuscript can be summarised into stages that can be adopted in studies based on administrative health data. Moreover, the majority of the specific checks undertaken in this study are generalisable to other studies. As a first step, it is important to gather the information needed to inform the development of a data cleaning plan. This includes descriptions of the data collections, the variables and associated data dictionaries, the reliability of the recording of these variables, and the procedures through which the project’s data were linked. It is advisable that the researcher examines the distribution of the data (e.g. frequency, cross-tabulation); unusual patterns in the data should be discussed with the data custodians and with researchers experienced in working with the same data source. For example, we noticed in this study that hospital records of healthy newborns were included in the NSW data but were typically excluded (84%) from the WA hospital admission data.

Subsequently, it is essential to draft a plan outlining general rules about decisions to be made for identified errors, and the content of specific checks (i.e. objectives and detailed algorithms). Factors to consider when creating general rules include whether there will be data sharing among analysts or use of the data for multiple research objectives, potential causes of errors (e.g. incorrect links, inconsistent patient response, inaccurate recording, typographical errors) and possible implications of decisions. Data in this project are used for several sub-studies. Therefore, no original data values were deleted or overwritten; instead, flag variables and corrected data values were added. Data analysts were provided with detailed documentation, including notes on inconsistencies for which no changes were made, so that informed decisions could be made for specific analyses. The decision regarding how to handle an error was guided by the probable cause of the error. Flags for exclusion were applied to the mother (and thus to all her children) even if a linkage error was found only for a child, because excluding only the problematic pregnancy record could affect analyses that investigate or control for outcomes of the prior pregnancy or health service utilisation (e.g. medication use, hospital procedures) between pregnancies. Where possible, missing, invalid and erroneous data were corrected. Flags for deletion were applied to ED or hospital records that contained inconsistencies in dates. Duplicates were flagged for removal. No changes were made for “grey” unexplainable inconsistencies.

In terms of planning for specific consistency checks, a structured approach should be used to ensure that important aspects are covered and to avoid digressing. Factors that can be used to inform which data items should be checked and the sequence of the checks include the methods of the linkage (i.e. deterministic, probabilistic), the base data set and its variables (i.e. the data sets used to derive the study population), commonalities between data sets, the coherence between different pieces of information that relate to the same event, the uniqueness of an event or expected findings, and likely consequences of unmanaged inconsistencies. It is easier to conduct the checks in the order of increasing complexity, such as commencing the checks of data items within a record, followed by examining consistencies between records of the same data set before linking records across data sets.

Our check for missing death registration records (Step 2), which was applicable only to the NSW death data, demonstrates the application of the “uniqueness” rationale, which can be applied to all studies and data sources. For projects that involve cross-jurisdictionally linked data, the checks for consistency in ID matching (e.g. Step 17) illustrate the effective use of the “uniqueness” rationale to identify potentially incorrect links when the study participants are represented by different sets of IDs. In studies where ID mapping tables are not provided by the cross-jurisdictional data linkage unit (i.e. the IDs are embedded in the data sets), researchers are advised to create the mapping tables by summarising the ID variables to identify inconsistencies. Checking for consistency among people identified as moving between jurisdictions and the integration of IDs (Steps 19–21) are essential for all studies using cross-jurisdictional linkage of person-level unit records. A failure to identify and manage inconsistencies in ID matching would result in a loss (if one-to-many merging) or over-collation (if many-to-many merging) of information.
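Summarising the ID variables into a mapping table, as advised above, amounts to collecting the distinct ID pairs and counting matches in both directions. The Python sketch below shows one way this could be done; the function name, the pair layout and the example IDs are hypothetical and the code is not taken from the study.

```python
from collections import defaultdict


def summarise_id_mapping(pairs):
    """pairs: iterable of (patid, ppn) taken from the linked records.
    Returns two dicts listing the non-unique matches in each direction:
    PATIDs matched to several PPNs, and PPNs matched to several PATIDs."""
    ppns_per_patid = defaultdict(set)
    patids_per_ppn = defaultdict(set)
    for patid, ppn in pairs:
        ppns_per_patid[patid].add(ppn)
        patids_per_ppn[ppn].add(patid)

    patid_with_many_ppns = {
        patid: sorted(ppns) for patid, ppns in ppns_per_patid.items() if len(ppns) > 1
    }
    ppn_with_many_patids = {
        ppn: sorted(patids) for ppn, patids in patids_per_ppn.items() if len(patids) > 1
    }
    return patid_with_many_ppns, ppn_with_many_patids


# Hypothetical ID pairs; repeated pairs (one per claim record) are collapsed.
pairs = [
    ("P1", "N001"), ("P1", "N001"), ("P1", "W009"),
    ("P2", "N002"), ("P4", "N002"),
    ("P5", "N003"),
]
many_ppns, many_patids = summarise_id_mapping(pairs)
```

Any ID that appears in either result would need to be resolved before merging, since it is these non-unique matches that lead to lost or over-collated information.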

During the development of algorithms, it is critical to make sure that the exclusion of study participants is not related to their health status or outcomes (i.e. that the algorithms do not create selection bias). This selection bias can arise because people having multiple contacts with health services would have a higher chance of inconsistencies being identified. The decision to classify inconsistencies as incorrect links, therefore, should be based on biological and chronological plausibility, and coherence between different data items. Inconsistencies that are biologically and/or chronologically impossible (e.g. different women mapped to a single ID of the child, medications dispensed years after date of death) are indicative of incorrect linkage. When linkage errors cannot be ruled out immediately, additional information obtained from related variables or records can help to inform decisions. For example, dates of the maternal hospital separation associated with the delivery were used to verify baby DOB (Steps 6 and 9), and the presence of inconsistencies in more than one data item was taken into account (mother’s sex and month/year of birth, as in Step 12). When decisions about reasonable values or patterns are imposed, it is important to evaluate the implications of chosen cut-offs by quantifying the extent of the exclusion. For instance, a conservative decision was made for inconsistencies in parity (Step 8.3.1), as a less restrictive criterion (i.e. expected number of pregnancies = 1 and a count of pregnancy records ≥ 3 instead of ≥ 4) would result in an additional 156 women being flagged for exclusion (578 instead of 422 women).
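Quantifying the effect of a chosen cut-off, as in the parity example above, can be done by applying the rule at each candidate cut-off and comparing the sets of flagged women. The Python sketch below uses the criterion stated in the text (expected number of pregnancies = 1 and a count of pregnancy records at or above the cut-off); how the expected number of pregnancies is derived is not shown, and the data structure and example values are hypothetical.

```python
def flagged_by_parity(women, min_records):
    """women: {woman_id: (expected_pregnancies, pregnancy_record_count)}.
    Flags women expected to have one pregnancy who nevertheless have at
    least min_records pregnancy records."""
    return {
        woman for woman, (expected, n_records) in women.items()
        if expected == 1 and n_records >= min_records
    }


# Hypothetical summary data, one entry per woman.
women = {
    "W1": (1, 4),
    "W2": (1, 3),
    "W3": (2, 5),
    "W4": (1, 1),
}

conservative = flagged_by_parity(women, min_records=4)  # cut-off used in the text
less_strict = flagged_by_parity(women, min_records=3)   # alternative cut-off
additional = less_strict - conservative                 # extra exclusions to report
```

In the study, the equivalent comparison gave 422 women under the ≥ 4 cut-off and 578 under ≥ 3, a difference of 156.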


In conclusion, comprehensive and well-documented data consistency checks prior to commencing planned statistical analyses will improve the quality and reproducibility of perinatal research using linked administrative data. The data cleaning methods developed for the Smoking MUMS Study are recommended for other perinatal linkage studies, with appropriate modifications made based on knowledge about the data collections and the validity and coherence of data items. Adoption of similar data cleaning methods across studies will assist in making comparisons across jurisdictions and countries, as well as across studies that are using ostensibly the same source datasets.



Abbreviations

ACT: Australian Capital Territory
APDC: Admitted Patient Data Collection
COD URF: Causes Of Death Unit Record File
DOB: Date of birth
ED: Emergency department
EDDC: Emergency Department Data Collection
HMDC: Hospital Morbidity Data Collection
MNS: Midwives Notification System
NSW: New South Wales
PATID: Project-specific Patient Identification Number (PBS linkage)
PBS: Pharmaceutical Benefits Scheme
PDC: Perinatal Data Collection
PPN: Project-specific person number (State-based linkage)
RBDM: Registry of Births, Deaths and Marriages
RCC: Register of Congenital Conditions
WA: Western Australia
YOB: Year of birth


  1. Harron K, Gilbert R, Cromwell D, van der Meulen J. Linking Data for Mothers and Babies in De-Identified Electronic Health Data. PLoS One. 2016;11(10):e0164667.

  2. Colvin L, Slack-Smith L, Stanley FJ, Bower C. Pharmacovigilance in pregnancy using population-based linked datasets. Pharmacoepidemiol Drug Saf. 2009;18(3):211–25.

  3. Blehar MC, Spong C, Grady C, Goldkind SF, Sahin L, Clayton JA. Enrolling Pregnant Women: Issues in Clinical Research. Womens Health Issues. 2013;23(1):e39–45.

  4. Australian Therapeutic Goods Administration. Reporting adverse events [cited 10 Nov 2016]. Available from:

  5. Meray N, Reitsma JB, Ravelli AC, Bonsel GJ. Probabilistic record linkage is a valid and transparent tool to combine databases without a patient identification number. J Clin Epidemiol. 2007;60:883–91.

  6. Bentley JP, Ford JB, Taylor LK, Irvine KA, Roberts CL. Investigating linkage rates among probabilistically linked birth and hospitalization records. BMC Med Res Methodol. 2012;12(1):149.

  7. Bohensky MA, Jolley D, Sundararajan V, Evans S, Pilcher DV, Scott I, et al. Data Linkage: A powerful research tool with potential problems. BMC Health Serv Res. 2010;10(1):346.

  8. Silveira DP, Artmann E. Accuracy of probabilistic record linkage applied to health databases: systematic review. Rev Saude Publica. 2009;43(5):875–82.

  9. Van den Broeck J, Argeseanu Cunningham S, Eeckels R, Herbst K. Data Cleaning: Detecting, Diagnosing, and Editing Data Abnormalities. PLoS Med. 2005;2(10):e267.

  10. Benchimol EI, Smeeth L, Guttmann A, Harron K, Moher D, Petersen I, et al. The REporting of studies Conducted using Observational Routinely-collected health Data (RECORD) Statement. PLoS Med. 2015;12(10):e1001885.

  11. Huebner M, Vach W, le Cessie S. A systematic approach to initial data analysis is good research practice. J Thorac Cardiovasc Surg. 2016;151(1):25–7.

  12. Boyd JH, Ferrante AM, O’Keefe CM, Bass AJ, Randall SM, Semmens JB. Data linkage infrastructure for cross-jurisdictional health-related research in Australia. BMC Health Serv Res. 2012;12(1):480.

  13. Sandall J, Murrells T, Dodwell M, Gibson R, Bewley S, Coxon K, et al. The efficient use of the maternity workforce and the implications for safety and quality in maternity care: a population-based, cross-sectional study. Health Serv Deliv Res. 2014;2(38).

  14. Havard A, Jorm LR, Preen D, Daube M, Kemp A, Einarsdóttir K, et al. The Smoking MUMS (Maternal Use of Medications and Safety) Study: protocol for a population-based cohort study using linked administrative data. BMJ Open. 2013;3(9).

  15. Australian Bureau of Statistics. Australian Demographic Statistics, Mar 2016 Canberra: ABS; 2016 [cited 10 Nov 2016]. Available from:

  16. Kelman CW, Bass AJ, Holman CDJ. Research use of linked health data: A best practice protocol. Aust N Z J Public Health. 2002;26(3):251–5.

  17. Western Australia Data Linkage Branch. Western Australian Data Linkage. Western Australia Data Linkage Branch [Online] 2016 [cited 9 July 2017]. Available from:

  18. Centre for Health Record Linkage. How record linkage works [Online] 2016 [cited 10 Nov 2016]. Available from:

  19. Western Australia Data Linkage Branch. Data linkage – making the right connections. Western Australia Data Linkage Branch [Online] 2016 [cited 9 July 2017]. Available from:

  20. Centre for Health Record Linkage. Master Linkage Key Quality Assurance [Online]. 2012 [cited 10 Nov 2016]. Available from:

  21. Australian Institute of Health and Welfare Integration Services Centre. Sample based clerical review report for the Smoking MUMS Study. 2015.

  22. Downey F. Validation Study of the Western Australian Midwives Notification System, 2005 data. Department of Health, Western Australia: Perth; 2007.

  23. Taylor L, Pym M, Bajuk B, Sutton L, Travis S, Banks C. Validation study NSW Midwives Data Collection 1998. NSW Health Department: Sydney; 2000.

  24. Ford JB, Roberts CL, Taylor LK. Characteristics of unmatched maternal and baby records in linked birth records and hospital discharge data. Paediatr Perinat Epidemiol. 2006;20:329–37.

  25. Department of Health. Pharmaceutical Benefits Scheme collection of under co-payment data. 2014 [cited 15 Jan 2017]. Available from:


Acknowledgements

The authors would like to thank the NSW Ministry of Health, the Department of Health WA, the Australian Government Department of Health and Ageing, the Department of Human Services, the NSW Centre for Health Record Linkage, the Australian Institute of Health and Welfare, the Western Australia Data Linkage Branch, and the data custodians of the NSW Perinatal Data Collection, WA Midwives Notification System, NSW Admitted Patient Data Collection, WA Hospital Morbidity Data Collection, NSW and WA Emergency Department Data Collections, NSW Registry of Births, Deaths and Marriages, NSW and WA Cause of Death Unit Record Files, and NSW Register of Congenital Conditions for allowing access to the data and conducting the linkage of records. The NSW Cause of Death Unit Record Files are held by the NSW Ministry of Health Secure Analytics for Population Health Research and Intelligence and provided by the Australian Coordinating Registry on behalf of the NSW Registry of Births, Deaths and Marriages, NSW Coroner and the National Coronial Information System. The authors acknowledge Sanja Lujic and Mark Hanly (Centre for Big Data Research in Health, UNSW) for their comments on the draft manuscript.


Funding

The Smoking MUMS Study is supported by an Australian National Health and Medical Research Council Project Grant (#1028543). The study investigators are Alys Havard, Louisa R Jorm, David Preen, Michael Daube, Anna Kemp, Kristjana Einarsdóttir, Deborah Randall and Duong Thuy Tran. AH is supported by a National Heart Foundation Future Leader Fellowship (#100411).

Availability of data and materials

The data sets were constructed with the permission of each of the source data custodians and with specific ethical approvals. The authors do not have permission to share patient-level data because of the highly confidential nature of the data. Access to the data is restricted to researchers named and approved by the relevant human research ethics committees.

Author information

Authors and Affiliations



Contributions

DTT conducted the data cleaning and prepared the manuscript with input from AH and LJ. All authors approved the final draft.

Corresponding author

Correspondence to Duong Thuy Tran.

Ethics declarations

Ethics approval and consent to participate

The project was approved by the Australian Institute of Health and Welfare Ethics Committee, the NSW Population and Health Services Research Ethics Committee and the Department of Health WA Human Research Ethics Committee. The study used routinely collected data that have been anonymised. Waiver of consent to participate was obtained.

Consent for publication

Not applicable.

Competing interests

The authors have declared that no competing interests exist.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Additional file

Additional file 1:

Examples of SAS codes used for the Smoking MUMS Study data cleaning. (DOCX 69 kb)

Rights and permissions

Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

Tran, D.T., Havard, A. & Jorm, L.R. Data cleaning and management protocols for linked perinatal research data: a good practice example from the Smoking MUMS (Maternal Use of Medications and Safety) Study. BMC Med Res Methodol 17, 97 (2017).
