A fundamental component of the successful data analysis and of the collaborative development of novel machine learning methods on this rich dataset has been the construction of an innovative research informatics framework. This framework captures the data at regular intervals from Novartis into a secure IT infrastructure at the BDI, where the data are integrated, quality controlled and compiled into a research-ready relational database that is then made available to analysts. The development of this framework (Fig. 2) at the BDI and the successful capture of the data (described previously in the Methods) into a versioned, research-ready dataset are described below.
A secure and collaborative research infrastructure
A critical part of the research collaboration was to develop an IT infrastructure and a corresponding information security architecture that were technologically feasible, viable for both organisations and, above all, desirable for the project's users. The design of the information security architecture and controls was governed by the need to process a large amount of clinical data shared between two organisations. In order to design and implement proper information security, it was vital to understand the various stages in the process, the states of the data, the information security goals and the overall risks. After collecting and assessing this information, a custom information security architecture was defined and security controls and guidelines were applied.
The principal outcome of this assessment was that confidentiality was identified as the most relevant information security goal within the project. This decision defined the fundamental principles that governed the design and implementation of the security controls:
1. Isolation from the existing Oxford BDI infrastructure whenever possible, and always for activities involving non-anonymised data.
2. Encrypted data transfer via dedicated channels between both organisations to ensure confidentiality and integrity.
3. User access to the environment is proxied via a demilitarized zone (DMZ), which contains a set of jump hosts that serve as portals to the full environment.
4. Identity and access management is realised within the environment, providing authentication and authorisation services.
5. Log management is centralised, together with security information and event management (SIEM).
Two separate environments were created for the anonymisation and analytics work, and both were instantiated within a dedicated OpenStack private cloud as two separate tenants. This ensures network, compute and storage isolation enforced at the hypervisor level. For data processing, virtual clusters were created within each tenant, each including an instance (virtual machine) with direct access to GPUs for accelerated work. The virtual resources within each tenant were defined and provisioned by means of Ansible roles and playbooks, for consistency and repeatability. Encrypted backups are made to an S3 object store. In short, we have produced a unique research computing infrastructure that provides high levels of security within a shared environment where academic and industrial researchers can work jointly.
Clinical data anonymisation
This section describes the anonymisation of the clinical trial data and the specific methods developed to anonymise MRI data to ensure data privacy.
Clinical trial data – basic principles
The process for anonymising the clinical trial data was intended to ensure that the risk of re-identifying participants in the dataset was below a pre-defined critical threshold. There are three key concepts in this risk-based anonymisation approach:
- The risk of re-identification can be measured quantitatively. Various models of adversaries and re-identification attacks have been developed and have demonstrated robustness in practice [2]. Metrics quantifying the probability of a successful re-identification have been developed based on these models. The specific metrics that we used are based on strict average risk models. These capture the average risk while ensuring that there are no population-unique individuals in the anonymised dataset (i.e., in the context of the General Data Protection Regulation (GDPR), the likelihood of individuals being “singled out” is very small [3]).
- The overall risk measurement takes into account the context of data processing, as illustrated in Fig. 3. For example, if the anonymised dataset will be analysed in a secure rather than a less secure environment, then less modification of the data is required to bring the risk of re-identifying a patient below the targeted threshold. Checklists have been developed and validated to capture this context risk [2].
- A specific threshold needs to be defined to determine what constitutes an acceptably low risk. There are many precedents for what is deemed to be an acceptable threshold, including from regulators (see the review in [2]). The choice of a specific threshold from this precedent range takes into account the sensitivity of the data and the potential harm if a re-identification occurs.
Once a threshold is defined and the re-identification risk is computed, taking the context into account, transformations may be required until the risk is below the defined threshold. The transformations can be applied to the data itself (e.g. by modifying variables that may lead to re-identification, such as a patient’s age) or to the context (e.g. by modifying the security of the IT system). After each transformation the overall risk is re-computed, until it falls below the threshold.
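To illustrate the iterative nature of this process, the following minimal Python sketch applies candidate transformations until the measured risk falls below the threshold. The helper `measure_risk` and the list of transformations are hypothetical placeholders; the actual estimators and tooling are those referenced in the text.

```python
# Minimal sketch of the risk-based anonymisation loop described above.
# `measure_risk` and the candidate transformations are hypothetical placeholders,
# not the actual estimators or tools used in the project.

THRESHOLD = 0.09  # maximum acceptable probability of re-identification

def anonymise_until_acceptable(dataset, context, transformations, measure_risk):
    """Apply transformations in order of preference until risk <= THRESHOLD."""
    risk = measure_risk(dataset, context)
    applied = []
    for transform in transformations:          # e.g. generalise age, offset dates, ...
        if risk <= THRESHOLD:
            break
        dataset = transform(dataset)           # transform the data itself ...
        applied.append(transform.__name__)
        risk = measure_risk(dataset, context)  # ... and re-compute the overall risk
    if risk > THRESHOLD:
        raise ValueError(f"Residual risk {risk:.3f} still above threshold")
    return dataset, risk, applied
```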
Justification for threshold
The European Medicines Agency (EMA) has established a policy on the publication of clinical data for medicinal products [4], which requires applicants/sponsors to openly share clinical trial data. The guidelines accompanying the policy recommend a maximum risk threshold of 0.09. Health Canada implemented the same threshold for the sharing of clinical trial data [5]. This is the threshold that was used for the anonymisation of the clinical data.
Calculation of risk
The risk of re-identification is calculated only on the quasi-identifiers. The quasi-identifiers are variables that are knowable by an adversary. There are two general types of quasi-identifiers. The first are those which are in the public domain and can be collected from registries such as voter registration lists [6] and lien registries [7]. Examples of these include date of birth and ZIP/postal codes. The second are acquaintance quasi-identifiers, which are known by adversaries who are also acquaintances, such as neighbours, relatives, and co-workers. Acquaintance quasi-identifiers include the public ones as well as things like medical history and key events and dates. Once the quasi-identifiers are determined in a dataset, the probability of re-identification can be calculated.
The calculation of re-identification risk considers three potential attacks on the data, which we shall call T1, T2, and T3.
The first attack, T1, assumes that an adversary deliberately attempts to re-identify individuals in the dataset [8]. This means that the probability of re-identification is conditional on an attempted attack:
$$p(\text{re-identification}) = p(\text{re-identification} \mid \text{attempt}) \times p(\text{attempt})$$
(1)
The first term captures the risk in the data and the second term captures the risk from the context. There are multiple estimators that can be used to evaluate data risk which vary in accuracy and scalability [2, 9,10,11,12,13,14].
Context risk has three components: security controls, privacy controls, and contractual controls. The strength of these controls, as implemented at the BDI, was assessed using a checklist, which is reproduced elsewhere [2]. The responses to the checklist are converted into a conservative subjective probability. This means that the exact probability value is not known, but the modelled value is deliberately conservative (it overestimates the context risk) while still allowing us to model the controls that are in place and account for the benefit of stronger controls.
The premise of the controls for attack T1 is that stronger security controls (e.g., audit logs that are checked, analyst screening, and limited access), privacy controls (e.g., regular privacy training and a privacy officer), and contractual controls (e.g., all analysts have to sign a confidentiality agreement when working with the data) act as deterrents to an attempted attack and make it more difficult.
The second attack, T2, pertains to inadvertent re-identification. This occurs when an analyst inadvertently or spontaneously recognises someone they know in the dataset while working with it. This type of risk is given by:
$$p(\text{re-identification}) = p(\text{re-identification} \mid \text{acquaintance}) \times p(\text{acquaintance})$$
(2)
An inadvertent re-identification is contingent on an analyst knowing someone in the data. In our case this means that an analyst would know someone who has participated in a trial in this therapeutic area. This is estimated as $p(\text{acquaintance}) = 1 - (1 - v)^{150}$, where $v$ is the proportion of patients in the current studies relative to all studies in this therapeutic area over the same period and geography, which can be computed by gathering target recruitment data from https://clinicaltrials.gov/ . The value 150 is the Dunbar number, which provides an estimate of the average number of individuals that an analyst would know. Dunbar’s number has proven to be robust across multiple studies (for a literature review see [2]).
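As an illustration of this estimate, the short Python sketch below computes $p(\text{acquaintance})$ using the Dunbar number; the recruitment figures in the example are made up, whereas in practice $v$ is derived from target recruitment data on https://clinicaltrials.gov/.

```python
# Sketch of the acquaintance term used in the T2 (inadvertent) attack model.
DUNBAR = 150  # assumed average number of individuals an analyst would know

def p_acquaintance(patients_in_current_studies: int, patients_in_therapeutic_area: int) -> float:
    """Probability that an analyst knows at least one participant: 1 - (1 - v)^150."""
    v = patients_in_current_studies / patients_in_therapeutic_area
    return 1 - (1 - v) ** DUNBAR

# Illustrative (made-up) numbers: 1,000 patients in the current studies out of
# 1,000,000 recruited in the same therapeutic area, period and geography.
print(round(p_acquaintance(1_000, 1_000_000), 3))  # ~0.139
```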
The third attack, T3, occurs when there is a data breach and the dataset is accessed by an adversary. This is modelled as follows:
$$p(\text{re-identification}) = p(\text{re-identification} \mid \text{breach}) \times p(\text{breach})$$
(3)
The probability of a breach is computed from reports on health data breaches and their likelihood, which are published regularly by security companies.
After computing the risk values for the three types of attack, the maximum across them is then taken to reflect the overall risk in the data. If this maximum risk is below the 0.09 threshold, then the dataset is deemed to have an acceptably low risk of re-identification. The same approach is applied to analyse the risk in clinical trial data and in the header information in DICOM files.
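A minimal sketch of how the three attack risks are combined is shown below; the probability values in the example are illustrative placeholders, not measured values from the project.

```python
# Sketch of combining the three attack models (T1-T3) into an overall risk,
# following Eqs. (1)-(3). All probability inputs here are illustrative placeholders.

THRESHOLD = 0.09

def overall_risk(p_data_attempt, p_attempt,        # T1: deliberate attack
                 p_data_acquaintance, p_acquaint,  # T2: inadvertent recognition
                 p_data_breach, p_breach):         # T3: data breach
    t1 = p_data_attempt * p_attempt
    t2 = p_data_acquaintance * p_acquaint
    t3 = p_data_breach * p_breach
    return max(t1, t2, t3)

risk = overall_risk(0.05, 0.3, 0.04, 0.14, 0.05, 0.1)
print(f"overall risk = {risk:.3f}, acceptable = {risk <= THRESHOLD}")
```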
Strict average risk
The risk calculation described above gives the average risk (averaged across all patients). The strict average additionally conditions this on no record in the dataset being unique in the population. The population is defined as all patients who have participated in clinical trials in the same therapeutic area over the same period and geography. There are a number of estimators that can be used to estimate population uniqueness, with a specific one recommended based on a comparative assessment [15].
Application of the basic principles
The basic principles have been operationalised for the anonymisation of clinical trial data as a series of default anonymisation practices, which can then be adjusted to account for study-specific data issues. Patient identifiers (typically consisting of a clinical centre number and the patient’s randomisation number, with which a patient is identified in a clinical trial) are replaced by an anonymised identifier, a new number generated specifically and uniquely for the use of the data in the context of the collaboration. The link file that connects the original patient identifier from the trial with the new anonymised identifier is securely protected and only accessible to a small independent team working exclusively on the anonymisation of the data and not otherwise involved in the collaboration or the subsequent research. The link file is used for the sole purpose of assigning the same anonymised patient ID to both the patient’s clinical data and MRI images, so that the corresponding imaging and clinical data remain linked after the anonymisation is completed, for downstream analyses. By default, event dates in the dataset are offset into relative dates as defined in the PhUSE standard [16]. Variables such as age are typically generalised, for example to five-year ranges, or modified by adding uniform noise. The SiteID is suppressed so that the geographic location of a site cannot be determined by looking up recruitment information in public registries. Other variables that may contain information that could lead to re-identification, such as a patient’s medical history, can be generalised or suppressed. The decision as to which variables are transformed takes the intended research purpose into account, preserving the data as much as possible where it is critical for the research while still bringing the risk of re-identification below the defined threshold. A detailed report is produced documenting the anonymisation methodology, how it was operationalised for each dataset, and a summary of the anonymisation outcomes (e.g., which variables were transformed and how). This report is crucial for the data wrangling and downstream analysis; all data in the final relational database can be linked back to the report describing the steps taken.
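The sketch below illustrates, using pandas, what these default transformations might look like; the column names (USUBJID, SITEID, AGE, the date columns) and the binning rule are assumptions for illustration, not the tooling used by the independent anonymisation team.

```python
# Minimal sketch of the default anonymisation transformations described above.
# Column names and the five-year binning are illustrative assumptions.
import uuid
import pandas as pd

def apply_default_anonymisation(df: pd.DataFrame, date_cols: list) -> pd.DataFrame:
    df = df.copy()

    # Replace the trial patient identifier with a newly generated anonymised ID
    # (the link file mapping old to new IDs would be stored separately and securely).
    link = {pid: uuid.uuid4().hex for pid in df["USUBJID"].unique()}
    df["ANON_ID"] = df["USUBJID"].map(link)
    df = df.drop(columns=["USUBJID"])

    # Offset event dates into relative days from each patient's first recorded date.
    for col in date_cols:
        dates = pd.to_datetime(df[col])
        df[col] = (dates - dates.groupby(df["ANON_ID"]).transform("min")).dt.days

    # Generalise age into five-year bands and suppress the site identifier.
    df["AGE"] = (df["AGE"] // 5) * 5
    df = df.drop(columns=["SITEID"])
    return df
```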
Magnetic resonance imaging (MRI) data
In clinical trials, MRI images are commonly obtained at hospitals in Digital Imaging and Communications in Medicine (DICOM) format and provided to Clinical Research Organisations (CROs) who specialise in imaging analysis. In the MS project the DICOM images were transferred from the CROs to the isolated anonymisation computational environment. Within this environment the image data went through a three-stage process of image conversion, defacing and data curation. Each DICOM file represents one slice of a scan, and one MRI session generates multiple scans. The DICOM slices were re-assembled into single 3D volumes (as per the research needs) using the DICOM conversion software HeuDiConv [17]. The output of this process is a set of files in different formats (JSON – JavaScript Object Notation, and NIfTI – Neuroimaging Informatics Technology Initiative) that exactly preserve the original DICOM data values as well as their associated non-identifying DICOM metadata (i.e. metadata that could contribute to the identification of patients was stripped out during conversion), organised in a research-ready format developed and used extensively within the neuroimaging research community, the Brain Imaging Data Structure (BIDS - https://bids.neuroimaging.io/). In addition to a controlled, standardised file structure, BIDS provides a file naming convention with the same characteristics, including scan-type details for ease of processing. During the initial conversion process, scans that failed to convert, or that converted with errors, were set aside and evaluated to see whether they could be successfully converted. Overall, we were able to convert scans for over 99% of subjects.
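As an illustration of this conversion step, the hedged Python sketch below shells out to HeuDiConv to convert one subject’s DICOMs into BIDS-organised NIfTI and JSON files. The paths, heuristic module and command-line options are assumptions that should be checked against the installed HeuDiConv version; the project’s actual conversion heuristics are not reproduced here.

```python
# Hedged sketch: convert one subject's DICOM slices to NIfTI + JSON in BIDS layout
# by calling HeuDiConv. Paths, heuristic and flags are illustrative assumptions.
import subprocess

def convert_subject(dicom_template: str, subject: str, session: str,
                    heuristic: str, out_dir: str) -> None:
    cmd = [
        "heudiconv",
        "-d", dicom_template,   # e.g. "/data/dicom/{subject}/{session}/*/*.dcm"
        "-s", subject,
        "-ss", session,
        "-f", heuristic,        # heuristic mapping scans to BIDS names
        "-c", "dcm2niix",       # converter producing NIfTI + JSON sidecars
        "-b",                   # organise output as BIDS
        "-o", out_dir,
    ]
    # Failed or erroring conversions surface here and are set aside for review.
    subprocess.run(cmd, check=True)

# convert_subject("/data/dicom/{subject}/{session}/*/*.dcm", "0001", "baseline",
#                 "heuristic.py", "/data/bids")
```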
Once the data had been converted to NIfTI and organised in BIDS format, they were run through a processing pipeline referred to simply as ‘defacing’. This pipeline has several steps that aim to achieve two key objectives:
1. Remove identifiable facial features (nose, mouth, front of the eyes, ears).
2. Remove identifiable metadata from the scan’s associated JSON.
For privacy and security reasons the identifiable facial features were removed (defaced). The identifiable facial elements were selected according to the anonymisation principles used in the UK Biobank project [18] and removed using defacing software from the FSL software library [19]. To ensure successful anonymisation, all defacing results were visually checked via multiple 3D surface renderings, confirming the removal of facial features and the retention of brain and meninges. Scans were QC checked and classified as either ‘passed’ or as one of four subclasses of defacing issues. Because this is a multi-site, longitudinal dataset, MRI scans were of variable quality, and initial defacing failure rates were up to 40% in some studies. Scans that failed QC checks were put through additional rounds of re-defacing and subsequent QC checks, in which custom defacing parameters (derived from the previous QC classifications and the scan modality) were applied, allowing us to achieve a high rate of successful defacing (96%). In total, over 230,000 MRIs were defaced and manually checked before entering the research-ready dataset. The anonymised data are stripped of all metadata except non-identifiable acquisition parameters, and additional checks were undertaken to ensure that identifiable details had not been erroneously inserted (during acquisition) into the retained metadata fields. Once all QC checks are complete, the data are copied via a dedicated, automated mechanism from the anonymisation environment to the analytics environment, with additional safeguards implemented to ensure the confidentiality and integrity of the data.
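The sketch below illustrates the shape of such a defacing-and-QC loop, calling FSL’s fsl_deface on each T1-weighted volume and recording a status for subsequent visual QC. The invocation, file pattern and status labels are assumptions for illustration and do not reproduce the project’s actual pipeline or its four QC subclasses.

```python
# Hedged sketch: deface each T1-weighted NIfTI with FSL's fsl_deface and record
# whether the tool ran, pending manual visual QC of 3D surface renderings.
import subprocess
from pathlib import Path

def deface(in_file: Path, out_file: Path, extra_args=None) -> bool:
    """Run fsl_deface; extra_args allows custom parameters on re-defacing rounds."""
    cmd = ["fsl_deface", str(in_file), str(out_file)] + list(extra_args or [])
    result = subprocess.run(cmd)
    return result.returncode == 0

def deface_dataset(bids_dir: Path, out_dir: Path) -> dict:
    status = {}
    for scan in sorted(bids_dir.rglob("*_T1w.nii.gz")):
        ok = deface(scan, out_dir / scan.name)
        # Visual QC and classification happen downstream; here we only record
        # whether the defacing tool completed at all.
        status[scan.name] = "pending_visual_qc" if ok else "deface_failed"
    return status
```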
Data exploration, quality control and integration – research-ready dataset
Non-imaging clinical datasets are anonymised by a third party (Privacy Analytics, Inc) and transferred to the analytics environment at the BDI to begin the data wrangling process. This process has the ultimate goal of providing all data in a relational database, from where streamlined, research ready datasets for the analytics team can be retrieved. A detailed tracking system was developed that is shared across the collaboration to transparently convey the status of each data set within the pipeline. Due to the large number of steps, transformations, and transactions that each dataset goes through, it is essential to track each data point received.
The initial stage in the extract, transform and load (ETL) of data from Novartis to the BDI was the capture of the clinical trial data as Statistical Analysis Software (SAS) files. For both the MS and IL-17 projects, the BDI team worked closely with the clinical teams at Novartis to ensure a full understanding of the data to be downloaded and of all related documentation. Each study was downloaded separately, comprising on average 30 to 50 different tables. These tables contained the primary raw data and study-specific information as described in the Methods. Each table contained hundreds of thousands of measurements across hundreds of variables.
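A minimal sketch of this first ETL step, assuming the SAS tables for a study are available as .sas7bdat files in one directory per study (the layout and encoding are assumptions), might look as follows:

```python
# Sketch of loading the per-study SAS tables for exploration; directory layout
# and encoding are illustrative assumptions.
from pathlib import Path
import pandas as pd

def load_study_tables(study_dir: str) -> dict:
    """Read every SAS table of a study into a dict keyed by table name."""
    tables = {}
    for path in sorted(Path(study_dir).glob("*.sas7bdat")):
        tables[path.stem] = pd.read_sas(path, encoding="latin-1")
    return tables

# tables = load_study_tables("/data/raw/studyA")
# print({name: df.shape for name, df in tables.items()})  # ~30-50 tables per study
```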
Once the data were received, the next critical collaborative step in the process was data exploration by a dedicated data wrangler to validate that the data received matched the expected data and the protocol documents. This step was extensive and was performed in collaboration with the team at Novartis, who could review identified queries by exploring the primary data, which had not been anonymised. At this stage the tracking of data and related queries between the data generators (Novartis) and the BDI was paramount to ensure that all downstream analysis would be reproducible and linkable to a well-defined set of data.
Once the data were agreed to be correct and valid, the data wrangling team developed codebooks (data specifications) that described the structure of the data in a computationally readable format. These codebooks are then utilised by a bespoke software pipeline to homogenise the data into a generic relational database structure. This task had many challenges owing to varying trial designs, inconsistencies in data capture between trials, changes in technology over time, subjective evaluations, changes in data standards, and anonymisation. Upon successful completion of this part of the pipeline, individual data files were imported into the relational database. In conjunction with the import of clinical trial data, the pipeline imports metadata about the relevant additional datasets provided, e.g. imaging or omics. The key remit for this level of integration is to ensure that relevant data slices can be provided downstream to the analysts and that the data outputs from the analytics can be integrated back into the overall architecture. This work was important because it was set up so that it can be reproduced as new datasets come on board, and it was developed to manage any clinical trial data, not just the current data from this project.
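The sketch below illustrates the idea of a codebook-driven import: a machine-readable specification maps study-specific columns onto a common schema before loading into the relational database. The codebook format, column names, scaling rule and target table are hypothetical; the bespoke pipeline used in the project is not reproduced here.

```python
# Hedged sketch of a codebook-driven import into a generic relational schema.
# Codebook structure, column names and target table are hypothetical examples.
import json
import pandas as pd
from sqlalchemy import create_engine

def import_with_codebook(table: pd.DataFrame, codebook_path: str, engine) -> None:
    with open(codebook_path) as f:
        codebook = json.load(f)   # e.g. {"columns": {"SUBJAGE": {"target": "age"}}, ...}

    out = pd.DataFrame()
    for source_col, spec in codebook["columns"].items():
        series = table[source_col]
        if "scale" in spec:       # unit harmonisation, e.g. weeks -> days
            series = series * spec["scale"]
        out[spec["target"]] = series

    out.to_sql(codebook["target_table"], engine, if_exists="append", index=False)

# engine = create_engine("postgresql+psycopg2://user:pass@dbhost/research")
# import_with_codebook(tables["dm"], "codebooks/studyA_dm.json", engine)
```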
Once the data dictionaries were completed across the projects, the data wrangling team began the overall data quality control process in parallel, with the aim of identifying any data quality issues throughout the data life cycle. The data validation and QC process is innovative in that it has been created in a sustainable manner, ensuring that data are tracked and checked throughout the lifetime of the project and that data provenance is managed at the level of individual data points. The quality of the data delivered from Novartis to the analytical teams was deemed critically important, to ensure that data analysts did not have to perform this level of exploratory work and that the results they identified were reproducible. A robust QC pipeline was therefore developed through intensive collaboration, assessing the data at many levels and performing both validation and verification at different stages: for example, source validation ensured the data received matched what was expected, and global validation checked the merged data. Validation and verification encompass a large list of checks, ranging from structural checks and assessments of missingness and of the effects of anonymisation, to visual checks for potential data anomalies. The anonymisation reports were key to checking whether data were missing because they were not captured in the first place or because they were suppressed during anonymisation.
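To give a flavour of the automated checks involved, the sketch below computes a simple per-table QC report with pandas; the specific checks and their organisation are illustrative rather than the project’s actual validation and verification suite.

```python
# Sketch of basic per-table QC checks; the checks shown are illustrative examples.
import pandas as pd

def basic_qc_report(df: pd.DataFrame, expected_columns: set) -> dict:
    report = {
        # structural check: does the table contain the columns the codebook expects?
        "missing_columns": sorted(expected_columns - set(df.columns)),
        "unexpected_columns": sorted(set(df.columns) - expected_columns),
        # missingness per column, to be cross-checked against the anonymisation
        # report (suppressed vs never captured)
        "fraction_missing": df.reindex(columns=sorted(expected_columns))
                              .isna().mean().round(3).to_dict(),
        "n_rows": len(df),
        "n_duplicate_rows": int(df.duplicated().sum()),
    }
    return report
```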
The final output of the ETL process is the generation of snapshots of the data as tracked and versioned data releases for the analysts. A data release is a snapshot of the merged datasets, available as a relational database or in a format that can be provided as input to analytical tools (e.g. via an API or table structures). The analytical teams can therefore develop methods that are attributed to the correct data version, ensuring transparency and reproducibility. As the data analytical methods are being developed, the ETL pipeline is being expanded to ensure that data output from new methods can be integrated back into data releases. This is a key requirement in a data project of this kind.
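As a final illustration, the sketch below shows one way a versioned data release might be cut from the relational database, exporting tables to a versioned directory with a checksum manifest so that analyses can reference the exact release they used. Table names, paths and the export format are assumptions, not the project’s actual release mechanism.

```python
# Hedged sketch of cutting a versioned, checksummed data release from the database.
# Table names, paths and the parquet export format are illustrative assumptions.
import hashlib
from pathlib import Path
import pandas as pd
from sqlalchemy import create_engine

def cut_release(engine, tables: list, release_dir: str, version: str) -> dict:
    out_dir = Path(release_dir) / version
    out_dir.mkdir(parents=True, exist_ok=True)
    manifest = {}
    for name in tables:
        df = pd.read_sql_table(name, engine)
        path = out_dir / f"{name}.parquet"
        df.to_parquet(path)
        manifest[name] = hashlib.sha256(path.read_bytes()).hexdigest()
    # Record checksums so downstream analyses can cite the exact release used.
    (out_dir / "MANIFEST.txt").write_text(
        "\n".join(f"{h}  {t}.parquet" for t, h in manifest.items())
    )
    return manifest

# engine = create_engine("postgresql+psycopg2://user:pass@dbhost/research")
# cut_release(engine, ["patients", "visits", "mri_metadata"], "/releases", "v2.1")
```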