Site Code Number/Investigator Identifier
• In clinical trial data, place of treatment is usually collected as a site code number/investigator identifier. These site codes should be re-coded to a new random site code (similar to patient code number/identifier). Sites which include few patients may be aggregated to a single site code number/identifier. Countries which include few patients could also be pooled.
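The site re-coding described above can be sketched as follows. This is a minimal illustration, not a prescribed method: the function name `recode_sites`, the pooling threshold `min_patients=5`, the `S` plus four-digit code format and the label `POOLED` are all assumptions made for the example and do not come from the guidance.

```python
import random
from collections import Counter


def recode_sites(patient_sites, min_patients=5, seed=None):
    """Map original site codes to new random codes and pool small sites.

    patient_sites: list of original site codes (strings), one entry per patient.
    min_patients: hypothetical threshold; sites with fewer patients are pooled.
    seed: only for reproducible demonstrations; leave as None in real use.
    """
    rng = random.Random(seed)
    counts = Counter(patient_sites)

    # Sites large enough to keep their own (new) code.
    large = sorted(site for site, n in counts.items() if n >= min_patients)

    # Draw distinct random numbers so the new codes reveal nothing
    # about the original codes or their order.
    numbers = rng.sample(range(1000, 10000), len(large))
    mapping = {site: f"S{num}" for site, num in zip(large, numbers)}

    # All small sites share a single pooled code.
    for site, n in counts.items():
        if n < min_patients:
            mapping[site] = "POOLED"

    recoded = [mapping[site] for site in patient_sites]
    return recoded, mapping
```

The returned mapping links old codes to new ones, so it would need to be stored securely or destroyed. The pooled group may itself still be small and may need further review.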
Demographics and anthropometry measures
• Date of birth is a direct identifier and should be replaced with age. As a general rule, ages above 89 should be set to a single category ‘>89’; however, depending on the disease and the population under consideration, further grouping of age categories should be considered. Consideration should also be given to recoding/grouping other ages at the lower or upper limits. Another option, provided it does not adversely affect data utility, is to group ages (for example into five-year age categories). All other patient-related dates, including date of death, should either be replaced with a derived study day relative to a baseline or reference date, or be offset by a random interval.
• Gender can be kept and it is recommended that race is mapped into categories (e.g. FDA recommendations: American Indian or Alaska Native, Asian, Black or African American, Native Hawaiian or Other Pacific Islander, and White). Ethnicity is usually removed.
• Anthropometry measures (e.g. weight, height) should be kept in de-identified datasets as they are frequently key variables for dosing (e.g. mg/kg), are used in the calculation of body surface area (BSA) or body mass index (BMI), or are used as covariates in data analyses. Consider grouping the values of these variables (for example into ranges).
• Verbatim (free) text may include information that identifies a patient e.g. names, dates or other personal information. Examples of variables containing verbatim text are adverse events, medications, medical history and general comments. Preferably, variables containing free text are either removed from the dataset or set to blank. Alternatively, the data could be reviewed to assess the risk of patient identification, especially if the data add scientific value to the dataset, and any identifiers at the observational level removed.
• Where a variable has been coded according to a standard dictionary (e.g. adverse events to MedDRA, medications to WHO ATC), the dictionary term(s) should be retained in the dataset and the verbatim text dropped.
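The age and date handling described in the first bullet of this list can be sketched as below. This is an illustration under stated assumptions: the function names are hypothetical, the five-year band width is only the example given in the text, and the study-day rule shown (reference date is day 1, the day before is day -1, no day 0) is one common convention rather than something this guidance specifies.

```python
from datetime import date


def age_at(reference, birth):
    """Age in completed years at the reference date."""
    years = reference.year - birth.year
    # Subtract one year if the birthday has not yet occurred in the reference year.
    if (reference.month, reference.day) < (birth.month, birth.day):
        years -= 1
    return years


def age_category(age, top_code=89, band=5):
    """Return '>89' for ages above the top code, else a band such as '35-39'."""
    if age > top_code:
        return f">{top_code}"
    low = (age // band) * band
    high = min(low + band - 1, top_code)
    return f"{low}-{high}"


def study_day(event, reference):
    """Study day relative to the reference date (assumed convention: no day 0)."""
    delta = (event - reference).days
    return delta + 1 if delta >= 0 else delta


# Example: replace a birth date and an event date with derived values.
reference = date(2020, 6, 1)
print(age_category(age_at(reference, date(1980, 6, 2))))  # 35-39
print(study_day(date(2020, 6, 10), reference))            # 10
```

Deriving study days keeps the intervals between events intact, which is usually what analyses need, while removing the calendar dates themselves.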
Small populations, low frequency and rare events, rare diseases, sensitive data
• Hrynaszkiewicz notes that indirect identifiers with small denominators (population size of <100) or very small numerators (event counts of <3) may present a risk if present in combination with other indirect identifiers. However, excluding such data in all cases may limit the ability of a researcher to perform meaningful analyses, particularly in the case of small numerators for adverse event reporting, where exclusion may result in the removal of rare events of interest.
• Studies involving rare diseases or including sensitive data should be reviewed on a case-by-case basis to assess whether sufficient steps can be taken to adequately maintain patient confidentiality.
• Potential indirect identifiers which are important for data utility may be retained and could be recoded/grouped; otherwise, they should be removed.
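One way to find combinations of indirect identifiers with very small counts is sketched below. It is a minimal example, assuming records are held as dictionaries; the function name `flag_small_cells` and the field names in the usage example are hypothetical, and the default threshold of 3 simply mirrors the event-count figure quoted above. Flagged combinations are candidates for review, recoding or grouping, not for automatic removal.

```python
from collections import Counter


def flag_small_cells(records, keys, min_count=3):
    """Return combinations of `keys` values shared by fewer than `min_count` records.

    records: list of dicts, one per patient or event.
    keys: names of the indirect identifiers to combine.
    """
    counts = Counter(tuple(record[k] for k in keys) for record in records)
    return {combo: n for combo, n in counts.items() if n < min_count}


# Example with made-up data: one combination occurs only once.
records = [{"age": "40-44", "sex": "F"}] * 3 + [{"age": ">89", "sex": "M"}]
print(flag_small_cells(records, ["age", "sex"]))  # {('>89', 'M'): 1}
```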