Table 2 Steps undertaken to assess consistency of State-based data

From: Data cleaning and management protocols for linked perinatal research data: a good practice example from the Smoking MUMS (Maternal Use of Medications and Safety) Study

Step Data sets

Explanation[a]

Findings

Uniqueness of record

1

Record duplicates

 
 

Perinatal,

Hospital,

ED,

Death

1.1 Identify identical duplicates i.e. all variables contain the same information, except the unique ID of the records;

1.2 Identify partial duplicates for the same death: Remove identical duplicates (identified in Step 1.1). Among the remaining records, identify PPNs that are present in ≥2 records, then review the records of these PPNs, using information from other data sets if necessary[b].

• Marked the records as “duplicate”: 9 perinatal, 1161 hospital admissions, 659 ED visits, and 49 death records.
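
Step 1.1 can be illustrated with a minimal Python sketch. The field names (`record_id`, `ppn`, `admit`) and the choice to keep the first occurrence are assumptions made for illustration; the source does not specify how the duplicates were identified in software.

```python
from collections import defaultdict

def flag_identical_duplicates(records, id_field="record_id"):
    """Return IDs of records that repeat an earlier record on every field except the ID."""
    seen = defaultdict(list)
    for rec in records:
        # Key on all variables except the unique record ID (hypothetical field name).
        key = tuple(sorted((k, v) for k, v in rec.items() if k != id_field))
        seen[key].append(rec[id_field])
    # Assumption: keep the first occurrence and flag the later copies.
    return {rid for ids in seen.values() for rid in ids[1:]}

records = [
    {"record_id": 1, "ppn": "A", "admit": "2010-01-01"},
    {"record_id": 2, "ppn": "A", "admit": "2010-01-01"},
    {"record_id": 3, "ppn": "B", "admit": "2010-02-01"},
]
duplicates = flag_identical_duplicates(records)
```

Partial duplicates (Step 1.2) still require manual review, so the sketch covers only the identical case.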

2

Missing record of death registration (NSW only)

 

Death

2.1 Identify whether a PPN is absent from the death registration but present in the causes of death data set

• No cases found (as expected on the basis of deterministic linkage methodology for death data).

3

Uniqueness of babyPPN

 

Perinatal

Non-unique babyPPN may be due to linkage errors or multiple data entries;

 3.1 Remove records flagged as “duplicate” (Step 1);

 3.2 Identify babyPPNs that are present in ≥2 perinatal records, then identify the related mumPPNs;

 3.3 Examine whether different mumPPNs are mapped to the same babyPPN. If this is the case, flag mothers as “exclusion”.

 3.4 For the remaining mothers, review all of their perinatal records[c]. Flag the mother as “exclusion” if the review suggests linkage error; otherwise mark the record as “duplicate”.

• 31 mothers were flagged as “exclusion”

• 114 records were marked as “duplicate”.
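
Steps 3.2 and 3.3 might be sketched as follows. The dictionary keys `babyPPN` and `mumPPN` follow the names used in the table; the record layout is an assumption, and records flagged as “duplicate” are taken to have been removed already.

```python
from collections import defaultdict

def mothers_sharing_a_baby(perinatal):
    """Return mumPPNs for which one babyPPN is mapped to more than one mumPPN."""
    mums_by_baby = defaultdict(set)
    for rec in perinatal:
        mums_by_baby[rec["babyPPN"]].add(rec["mumPPN"])
    excluded = set()
    for mums in mums_by_baby.values():
        if len(mums) > 1:  # different mothers linked to the same baby
            excluded |= mums
    return excluded

perinatal = [
    {"babyPPN": "b1", "mumPPN": "m1"},
    {"babyPPN": "b1", "mumPPN": "m2"},  # same baby, different mother
    {"babyPPN": "b2", "mumPPN": "m3"},
    {"babyPPN": "b2", "mumPPN": "m3"},  # same baby, same mother: left for review (Step 3.4)
]
excluded_mothers = mothers_sharing_a_baby(perinatal)
```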

Consistency of perinatal information

4

Birthweight and gestational age

 

Perinatal

4.1 Cross-tabulate birthweight (categorised as missing, <400, 400–999, 1000–1999, 2000–4999, and ≥5000 g) with gestational age (categorised as missing, <20, 20–26, 27–36, 37–44, and ≥45 weeks);

4.2 Quantify unexpected records where gestation <20 weeks and birthweight <400 g[d];

4.3 Quantify outlier gestational age (≥45 weeks);

4.4 Quantify extreme birthweight (relative to gestational age);

4.5 No changes were made as outlier values could relate to medical reasons.

• 28 records with both gestation <20 weeks and birthweight <400 g;

• 79 records with gestation ≥45 weeks;

• 57 records with birthweight <1000 g and gestation ≥45 weeks;

• 40 records with birthweight ≥2000 g and gestation ≤26 weeks.
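
The cross-tabulation in Step 4.1 could be sketched as below, using the category boundaries listed in the table. The input layout (pairs of birthweight in grams and gestational age in weeks, with `None` for missing) is an assumption.

```python
from collections import Counter

def bw_cat(grams):
    """Birthweight category as defined in Step 4.1."""
    if grams is None:
        return "missing"
    if grams < 400:
        return "<400"
    if grams < 1000:
        return "400-999"
    if grams < 2000:
        return "1000-1999"
    if grams < 5000:
        return "2000-4999"
    return ">=5000"

def ga_cat(weeks):
    """Gestational age category as defined in Step 4.1."""
    if weeks is None:
        return "missing"
    if weeks < 20:
        return "<20"
    if weeks < 27:
        return "20-26"
    if weeks < 37:
        return "27-36"
    if weeks < 45:
        return "37-44"
    return ">=45"

# Hypothetical (birthweight in g, gestation in weeks) pairs.
births = [(350, 19), (3200, 39), (None, 40), (900, 45)]
crosstab = Counter((bw_cat(bw), ga_cat(ga)) for bw, ga in births)
```

Cells such as `("<400", "<20")` or `("400-999", ">=45")` then give the counts quantified in Steps 4.2 to 4.4.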

5

Birth order and pregnancy plurality

 

Perinatal

It is necessary to select one record per delivery (where birth order = 1st) when pregnancy is the unit of analysis[e].

 5.1 Identify records with implausible order of birth given plurality (e.g. a singleton with birth order = 2nd, twins with birth order = 3rd);

 5.2 Subset to plural pregnancies and sort according to baby DOB and ascending birth order. For each plural pregnancy, create the expected sequence of birth: sequence = 1 for the first birth, increasing by 1 for each subsequent birth, allowing for the possibility that twins or triplets are born on different dates[f];

 5.3 If the expected sequence of birth differs from the recorded birth order, then identify the mothers and review all of their perinatal records. Make a correction if the review suggests a typographical error; otherwise mark the record as “duplicate”.

• No record with implausible birth order (given the plurality);

• Correction made for 10 twin records (birth order changed from 1st to 2nd) and another 6 records (baby DOB);

• 4 records marked as “duplicate”.
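
A minimal sketch of the comparison in Steps 5.2 and 5.3, assuming each plural pregnancy is available as a list of (baby DOB, recorded birth order) pairs. This layout and the function name are hypothetical.

```python
from datetime import date

def birth_order_mismatches(babies):
    """babies: (baby_dob, recorded_birth_order) pairs for one plural pregnancy.

    Returns (baby_dob, recorded, expected) for each record whose recorded
    birth order differs from the expected sequence of birth.
    """
    ordered = sorted(babies)  # by baby DOB, then ascending recorded birth order
    return [
        (dob, recorded, expected)
        for expected, (dob, recorded) in enumerate(ordered, start=1)
        if recorded != expected
    ]

# Twins both recorded as 1st born on the same day.
twins = [(date(2010, 5, 1), 1), (date(2010, 5, 1), 1)]
mismatches = birth_order_mismatches(twins)
```

A mismatch only identifies mothers whose records need review; whether to correct the value or mark a duplicate remains a manual decision.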

6

Interval between two consecutive pregnancies

 

Perinatal

6.1 Select records where birth order = 1st and sort in ascending order of baby DOB;

6.2 Calculate date of conception (baby DOB – gestation in weeks × 7 days + 14 days);

6.3 Calculate the interval between two consecutive pregnancies (date of conception – date of prior delivery – 7 days), allowing for gestation being recorded as completed weeks;

6.4 Flag mother as “exclusion” if pregnancy interval < 0.

• 396 mothers marked as “exclusion”.
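
The arithmetic in Steps 6.2 to 6.4 can be written out as a short sketch. It assumes one (baby DOB, gestation in completed weeks) pair per pregnancy for a single mother; the function names are illustrative.

```python
from datetime import date, timedelta

def conception_date(baby_dob, gestation_weeks):
    """Step 6.2: baby DOB - gestation weeks x 7 days + 14 days."""
    return baby_dob - timedelta(weeks=gestation_weeks) + timedelta(days=14)

def pregnancy_intervals(deliveries):
    """Step 6.3: interval in days between consecutive pregnancies of one mother.

    deliveries: (baby_dob, gestation_weeks) pairs, one per pregnancy.
    """
    ordered = sorted(deliveries)  # ascending baby DOB
    intervals = []
    for (prior_dob, _), (dob, weeks) in zip(ordered, ordered[1:]):
        # Subtract 7 days to allow for gestation recorded as completed weeks.
        intervals.append((conception_date(dob, weeks) - prior_dob).days - 7)
    return intervals

plausible = pregnancy_intervals([(date(2010, 1, 1), 40), (date(2011, 6, 1), 40)])
implausible = pregnancy_intervals([(date(2010, 1, 1), 40), (date(2010, 6, 1), 40)])
# Step 6.4: a negative interval flags the mother as "exclusion".
flag_exclusion = any(days < 0 for days in implausible)
```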

7

Missing value of parity

 

Perinatal

7.1 Subset to mothers who had a missing value of parity.

7.2 If the missing parity is in a record relating to a plural birth, replace the missing value with the value available in the other twin or triplet records.

7.3 After replacing missing parity in plural births, further subset to records where birth order = 1st. Then, for each mother:

 7.3.1 Quantify how many pregnancies are recorded in the data (regardless of parity);

 7.3.2 Quantify how many records have missing parity;

 7.3.3 Examine whether parity in the second record is zero[g];

 7.3.4 Among records with a parity (non-missing), sort in ascending baby DOB, then categorise the sequence of parity as logical or illogical. The sequence is logical if between any two consecutive records, the parity value in the prior record is less than the value in the next record (e.g. parity values as 0-1-2-4); otherwise the sequence is illogical (e.g. parity values as 0-2-1);

7.4 Make no changes for mothers who either had only one pregnancy recorded, ≥2 records with missing data, parity as zero in the second record, or an illogical parity sequence (further examined in Step 8);

7.5 Among the remaining mothers: Replace missing parity in the

 7.5.1 first record (=next parity −1) if next parity equals 1; or next parity > 1 and pregnancy interval < 40 weeks[h]; otherwise make no changes;

 7.5.2 last record (=prior parity +1) if pregnancy interval < 40 weeks[d];

 7.5.3 record other than the first and last (=prior parity +1) if the difference between the two adjacent parity values equals 2; or the difference > 2 and interval < 40 weeks[i]; otherwise make no changes.

• Missing parity was replaced for 1218 out of 1633 records.
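
The logical/illogical rule in Step 7.3.4 amounts to checking that the non-missing parity values are strictly increasing. A minimal sketch, assuming parity values are listed in ascending baby DOB order with `None` for missing:

```python
def parity_sequence_is_logical(parities):
    """Step 7.3.4: True if the non-missing parity values strictly increase."""
    known = [p for p in parities if p is not None]
    return all(prior < nxt for prior, nxt in zip(known, known[1:]))

logical = parity_sequence_is_logical([0, 1, 2, 4])
illogical = parity_sequence_is_logical([0, 2, 1])
# A zero in the second record passes the sequence check, which is why
# Step 7.3.3 tests for it separately.
undetected = parity_sequence_is_logical([None, 0, 1])
```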

8

Consistency in parity

 

Perinatal

8.1 Select records where birth order = 1st, sort in ascending baby DOB, then for each mother:

 8.1.1 Count how many pregnancies are recorded in the data (regardless of parity);

 8.1.2 Calculate the expected number of pregnancies (=highest parity value - lowest parity value +1);

 8.1.3 Categorise the sequence of parity as logical or illogical (as per Step 7.3.4);

8.2 Among mothers who have logical sequence of parity:

 8.2.1 Expected number of pregnancies equal to the count indicates parity consistency;

 8.2.2 Expected number less than the count is due to missing data in parity (not replaced in Step 7). Make no changes;

 8.2.3 Expected number greater than the count (by 1 to 3) suggests mother might have intervening births interstate or overseas (e.g. parity as 1-2-4-5);

 8.2.4 Expected number substantially greater than the count, especially an expected number ≥ 10, may suggest typographical errors (e.g. parity as 0-1-2-13). Examine those cases and correct plausible typographical errors; otherwise make no changes.

8.3 Among mothers who have illogical sequence of parity:

 8.3.1 Expected number considerably less than the count may suggest linkage errors (e.g. parity as 0-1-2-0-1). Flag the mother as “exclusion” if expected = 1 and count ≥4, or expected ≥2 and count - expected ≥2.

 8.3.2 Expected number greater than the count may suggest typographical errors, especially an expected number ≥ 10. Examine those cases and correct plausible errors.

 8.3.3 Make no changes for other inconsistencies (e.g. parity as 1-2-2, 0-1-2-6-4).

• 422 mothers were flagged as “exclusion”.

• 161 records with plausible typo errors corrected

• 36,244 mothers (4.6%) had inconsistent parity information for which no changes were made

• 34,494 mothers (4.4%) might have intervening births interstate or overseas.
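
The decision rules in Step 8 can be sketched as a single classifier. The returned labels are hypothetical, and treating a gap of more than 3 as "substantially greater" (Step 8.2.4) is an assumption, since the table gives no exact threshold. Cases labelled for review still need manual examination and may end with no change.

```python
def classify_parity(parities):
    """parities: one value per pregnancy in ascending baby DOB order; None = missing."""
    known = [p for p in parities if p is not None]
    if not known:
        return "no parity recorded"
    count = len(parities)                       # Step 8.1.1
    expected = max(known) - min(known) + 1      # Step 8.1.2
    logical = all(a < b for a, b in zip(known, known[1:]))  # Step 8.1.3
    if logical:
        if expected == count:
            return "consistent"                 # Step 8.2.1
        if expected < count:
            return "missing parity"             # Step 8.2.2
        if expected - count <= 3:
            return "possible intervening births"  # Step 8.2.3
        return "review for typo"                # Step 8.2.4 (assumed cut-off)
    if (expected == 1 and count >= 4) or (expected >= 2 and count - expected >= 2):
        return "exclusion"                      # Step 8.3.1
    if expected > count:
        return "review for typo"                # Step 8.3.2
    return "no change"                          # Step 8.3.3

results = [classify_parity(p) for p in ([1, 2, 4, 5], [0, 1, 2, 13], [0, 1, 2, 0, 1], [1, 2, 2])]
```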

Consistency across different data sources

 

9

Consistency in baby DOB

 

Perinatal,

Hospital,

ED

Validation studies have reported the accuracy of baby DOB in perinatal data (referred to as perinatal DOB), so this variable can be used to assess the baby’s DOB and age in other linked records (referred to as patient DOB). Conversely, when the baby DOBs differ across data sources, patient DOB can be used to verify the baby DOB recorded in perinatal data.

 9.1 Remove duplicates and mothers flagged as “exclusion” from perinatal data;

 9.2 Combine hospital, ED and death records of the children. Remove records with invalid patient YOB (<1920 or >2014), then generate the list of patient DOBs;

 9.3 Compare perinatal DOB with patient DOBs. If the perinatal DOB does not match with patient DOBs, then:

  9.3.1 Identify the mothers and extract maternal hospital admission records.

  9.3.2 Compare perinatal and patient DOBs with dates of maternal admission and separation. If only patient DOB matches (admission date ≤ patient DOB ≤ separation date) consider patient DOB as an alternative DOB for the baby.

  9.3.3 Prior to accepting the alternative DOB, make sure that the maternal admission indicates birth delivery and the alternative DOB is not equal to the DOB of another child born to the same mother (likely linkage errors among children, except plural births), and does not create inconsistencies in pregnancy interval and parity (outlined in Steps 6 and 8).

 9.4 Update the perinatal baby DOB. Merge the updated DOB into the original hospital and ED data to identify erroneous records for which correction of baby’s age or deletion of record is necessary. Correct baby’s age if the updated and patient DOBs contain the same month and year, same month and day, same day and year, or the two DOBs are less than 20 weeks apart; otherwise flag the record for “deletion”.

• Alternative DOB was created for 667 babies. The original and alternative DOBs were 1 day apart (30%), between 2 and 10 days apart (46%), or shared the same day and year (16%).

• Baby’s age was corrected in 41,516 hospital and ED records;

• 937 hospital and ED records were flagged “deletion”.
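
The rule in Step 9.4 for choosing between correcting the baby’s age and flagging a record for “deletion” can be sketched as below. The function name is illustrative, and the comparison is assumed to use calendar dates.

```python
from datetime import date

def age_is_correctable(updated_dob, patient_dob):
    """Step 9.4: True if the baby's age can be corrected, False if the record is flagged "deletion"."""
    u, p = updated_dob, patient_dob
    same_month_year = (u.month, u.year) == (p.month, p.year)
    same_month_day = (u.month, u.day) == (p.month, p.day)
    same_day_year = (u.day, u.year) == (p.day, p.year)
    under_20_weeks = abs((u - p).days) < 20 * 7
    return same_month_year or same_month_day or same_day_year or under_20_weeks

wrong_year = age_is_correctable(date(2010, 1, 15), date(2011, 1, 15))   # same month and day
wrong_month = age_is_correctable(date(2010, 1, 15), date(2010, 12, 15))  # same day and year
unrelated = age_is_correctable(date(2010, 1, 15), date(2012, 7, 20))
```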

10

Consistency between perinatal and congenital condition data

 

Perinatal,

Congenital (NSW only)

10.1 Identify babies who had a linked birth defect notification;

10.2 Compare baby DOB and birthweight recorded in birth defect data versus those two variables recorded in perinatal data;

10.3 If both pieces of information differ, then identify the mother and review all pregnancy records of the mother and the related birth defect records.

• 1 mother flagged as “exclusion” (review indicated linkage errors among her children).

 11

Consistency in mother’s year of birth (YOB)

 

Perinatal,

Hospital,

ED,

Death

11.1 Examine and define the range of mother YOBs according to perinatal data;

11.2 Examine the distribution of mother YOBs in other data sets;

11.3 Combine perinatal, hospital, ED and death data sets. Remove records with invalid mother YOBs (<1900 or >2014). For each mother:

 11.3.1 Quantify how many records have an out-of-range YOB (i.e. outside the range defined at Step 11.1);

 11.3.2 Quantify how many different YOBs are recorded;

11.4 Flag the mother as “exclusion” if she has ≥2 records with an out-of-range YOB or >3 different YOBs.

• YOBs in perinatal data ranged between 1941 and 1999, with no invalid values.

• Hospital and ED data contain records with invalid YOBs (<1900 or >2014) and those outside the range.

• 27 mothers flagged as “exclusion”;
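
Steps 11.3 and 11.4 might be sketched as follows for a single mother. The default in-range bounds are taken from the reported perinatal range (1941 to 1999); the input, a list of YOBs pooled across her records, is an assumed layout.

```python
def exclude_on_yob(yobs, valid=(1900, 2014), in_range=(1941, 1999)):
    """Return True if the mother should be flagged "exclusion" (Step 11.4)."""
    # Step 11.3: drop records with invalid YOBs.
    kept = [y for y in yobs if valid[0] <= y <= valid[1]]
    # Step 11.3.1: records outside the range defined from perinatal data.
    out_of_range = sum(1 for y in kept if not in_range[0] <= y <= in_range[1])
    # Step 11.3.2: number of different YOBs recorded.
    distinct = len(set(kept))
    return out_of_range >= 2 or distinct > 3

two_out_of_range = exclude_on_yob([1980, 1980, 1930, 1935])
four_distinct = exclude_on_yob([1980, 1981, 1982, 1983])
invalid_only = exclude_on_yob([1980, 1850, 1850])
```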

12

Mother’s sex as male

 

Perinatal,

Hospital,

ED,

Death

12.1 Combine mother’s perinatal, hospital, ED and death records, then remove records with invalid mother YOBs (<1900 or >2014). For each mother:

 12.1.1 Quantify how many records have sex recorded as “male”;

 12.1.2 Quantify how many different YOBs are recorded;

 12.1.3 Quantify how many different months of birth are recorded;

12.2 Flag the mother as “exclusion” if she has ≥2 records with sex as male and more than one YOB and/or month of birth.

• Hospital and ED data of the mothers contain records with sex recorded as male;

• 38 mothers flagged as “exclusion”

 13

Mother having births after total hysterectomy procedures

 

Perinatal

Hospital

13.1 Identify mothers who had hospital admissions for hysterectomy procedure(s)[j];

13.2 Among these mothers, extract the hospital admissions during which hysterectomy procedure(s) were undertaken, and extract their most recent pregnancy record;

13.3 Compare the most recent date of delivery with date of separation following the hysterectomy procedure. Flag “exclusion” if date of separation is earlier than date of delivery.

• 51 mothers flagged as “exclusion”

14

Baby DOB being later than date of discharge from hospital or ED

 

Perinatal,

Hospital,

ED

14.1 Combine babies’ hospital admission and ED data sets.

14.2 Compare the updated baby DOB (Step 9) with date of separation and patient DOB. Identify records where updated baby DOB is later than the date of separation.

14.3 Flag the baby as “exclusion” if the two DOBs are more than 20 weeks apart; otherwise mark the records as “deletion”. When the child is flagged as “exclusion”, further identify and flag the mother as “exclusion”.

• 10 mothers flagged as “exclusion”;

• 42 hospital and ED records marked as “deletion”.

 15

Date of death being earlier than episodes of health service use

 

Death

Perinatal,

Hospital,

ED

15.1 Identify persons (mothers or babies) who have a linked death record;

15.2 Extract and combine hospital, ED and perinatal records of these persons;

15.3 Compare the person’s date of death versus date of discharge from hospital or ED and date of delivery (applicable to mothers);

15.4 Flag the person as “exclusion” if date of death is earlier than date of health service use (allowing for administrative delay of up to 3 days). If it is the case for the child, further identify and flag the mother as “exclusion”.

• 22 mothers flagged as “exclusion”.
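
The comparison in Step 15.4 can be sketched as below. Interpreting the 3-day administrative delay as tolerating service dates up to 3 days after the date of death is an assumption; the function and parameter names are illustrative.

```python
from datetime import date, timedelta

def death_precedes_service(date_of_death, service_dates, delay_days=3):
    """Step 15.4: True if any health service date falls more than delay_days after death."""
    cutoff = date_of_death + timedelta(days=delay_days)
    return any(service > cutoff for service in service_dates)

death = date(2012, 3, 1)
within_delay = death_precedes_service(death, [date(2012, 2, 20), date(2012, 3, 4)])
beyond_delay = death_precedes_service(death, [date(2012, 3, 5)])
```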

16

Date of admission or arrival to ED being later than date of discharge

 

Hospital,

ED

16.1 Identify hospital admission and ED records wherein date of admission to hospital or date of arrival to ED is later than the date of discharge. Mark these records as “deletion”.

• 34 hospital or ED records marked as “deletion”

  [a] In this study, adding a variable to a data set is referred to as “merge” while adding records is referred to as “combine”
  [b] It is useful to examine status of patient at discharge and date of discharge in hospital or ED data relating to these deaths if dates of death differ
  [c] Information useful for review: baby DOB, plurality, birth order, birthweight, gestational age, Apgar scores, discharge status, mother’s age, postcode, country of birth and hospital
  [d] Perinatal data cover births with gestation ≥20 weeks or birthweight ≥400 g
  [e] Birth order indicates the order in which each baby was born (coded as 1st, 2nd, 3rd, etc.). Birth order for singletons is 1st. Multi-fetal pregnancies generated two or more perinatal records which contain baby-specific information including order of birth while maternal information is the same
  [f] In complicated plural pregnancies, babies may be born days apart; the gap between baby DOBs therefore needs to be consistent with the difference in gestational age
  [g] Parity as zero in the second record indicates an error, given that parity is defined as the number of previous pregnancies ≥20 weeks; but this error is not always identified through the check of parity sequence (e.g. parity values as missing-0-1)
  [h] The interval (calculated as in Step 6) between the first and the second record
  [i] The interval between the record with missing parity and the prior record
  [j] Hospital procedures were coded according to the Australian Classification of Health Interventions (ACHI). See Additional file 1 for hysterectomy procedure codes