Due to rapid digitalization, the volume and availability of health data are expanding fast [1]. The availability of these data provides new opportunities for medical research. Traditional methods of data collection (e.g. randomised controlled trials, cohort studies and surveys) are often expensive and time-intensive, and can suffer from low response rates [2]. By using existing registries, time and budget can be saved, as recruitment of new patients and collection of new data are not necessary [2, 3]. In addition, because these registries often include unselected populations, and not only those who are eligible and agree to participate in a clinical trial, there is less selection bias, and populations that are often under-represented in surveys or clinical trials (such as migrants or the elderly) can be studied effectively [2, 3]. Furthermore, the volume of data and the length of follow-up in these registries may make it possible to study patients in subgroups, at different stages, with rare diseases (or rare subtypes of diseases) or rare events, and over a longer period [2, 3].
Linkage of existing registries can further increase the potential of health data [2, 4, 5]. Most registries are designed for a specific goal, and only data related to that goal are gathered. For example, clinical registries may contain detailed disease-specific information, but lack detailed information about other medical diagnoses or health care utilization. Administrative registries may contain data on health care utilization, but often lack detailed disease-specific information [6]. Patient-level linkage of registries leads to more detailed and extended information per patient [3], combining the strengths of the original registries and allowing a wider range of research questions to be studied.
Besides opportunities, linkage of registries also brings challenges. Most health data are spread across organizations, servers and networks, and data are sometimes purposely separated to prevent traceability to individuals [1, 7, 8]. Linkage of registries therefore brings challenges regarding 1) responsibilities, 2) privacy and security, and 3) quality of data linkage. These challenges are discussed in more detail below.
First, when registries from multiple organizations are linked, responsibility and accountability for the data need to be arranged. Negotiating the legal, ethical and governance frameworks and requirements may take considerable time and effort, as they often have to be in line with existing structures [3]. In addition, linkage often requires approval from several institutional bodies, each of which may have its own requirements and perspectives [2].
Second, patient privacy and data security are of utmost importance when handling health data, especially when data from different sources are linked: enriching a registry with more details about a unique person increases the risk that this person can be identified. Appropriate measures should be taken to guarantee privacy protection [5, 8, 9]. Moreover, obtaining individual informed consent to link data may not be feasible for large datasets [8].
Third, linkage of patient records must be correct when two registries are combined. Ideally, every person is recognized by a unique identifier that can be used to link registries [2, 5]. In most cases, however, a unique identifier is absent and linkage is based on a combination of several variables that are not absolutely unique, such as name, sex and date of birth, or that may change over time, such as postal code. These linkage variables should be chosen sensibly, and the validity and accuracy of the linkage should be ascertained [2].
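To illustrate linkage on a combination of non-unique variables, the following is a minimal sketch of deterministic linkage in Python. It is not the procedure used for the registry described in this paper. The field names (`id`, `surname`, `sex`, `date_of_birth`), the helper functions and all records are hypothetical and made up for illustration. Postal code is deliberately left out of the key because it may change over time, and keys that match more than one record are set aside rather than linked, so that they can be reviewed when linkage validity is assessed.

```python
from collections import defaultdict


def link_key(record):
    """Build a composite key from quasi-identifiers (hypothetical field names)."""
    return (
        record["surname"].strip().lower(),  # ignore case and stray whitespace
        record["sex"].strip().upper(),
        record["date_of_birth"],            # ISO string, e.g. "1950-03-14"
    )


def link_registries(registry_a, registry_b):
    """Deterministically link two lists of record dicts on the composite key.

    Returns (links, ambiguous). `links` holds (id_a, id_b) pairs for keys that
    occur exactly once in each registry. `ambiguous` holds keys that are present
    in both registries but occur more than once in at least one of them; these
    are left unlinked.
    """
    index_a = defaultdict(list)
    index_b = defaultdict(list)
    for rec in registry_a:
        index_a[link_key(rec)].append(rec["id"])
    for rec in registry_b:
        index_b[link_key(rec)].append(rec["id"])

    links, ambiguous = [], set()
    for key, ids_a in index_a.items():
        ids_b = index_b.get(key, [])
        if not ids_b:
            continue  # no counterpart in the second registry
        if len(ids_a) == 1 and len(ids_b) == 1:
            links.append((ids_a[0], ids_b[0]))
        else:
            ambiguous.add(key)
    return links, ambiguous


# Fictitious example records; none of these refer to real persons.
cancer = [
    {"id": "C1", "surname": "Jansen", "sex": "F", "date_of_birth": "1950-03-14"},
    {"id": "C2", "surname": "de Vries ", "sex": "f", "date_of_birth": "1962-11-02"},
    {"id": "C3", "surname": "Bakker", "sex": "F", "date_of_birth": "1948-07-30"},
]
primary = [
    {"id": "P1", "surname": "jansen", "sex": "F", "date_of_birth": "1950-03-14"},
    {"id": "P2", "surname": "De Vries", "sex": "F", "date_of_birth": "1962-11-02"},
    {"id": "P3", "surname": "Bakker", "sex": "F", "date_of_birth": "1948-07-30"},
    {"id": "P4", "surname": "Bakker", "sex": "F", "date_of_birth": "1948-07-30"},
]

links, ambiguous = link_registries(cancer, primary)
print(links)      # [('C1', 'P1'), ('C2', 'P2')]
print(ambiguous)  # {('bakker', 'F', '1948-07-30')}
```

In this toy example two records link uniquely, while the third key matches two primary care records and is therefore flagged instead of linked. This shows why the choice of linkage variables and a check of linkage accuracy both matter.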
In this paper, we describe how we dealt with the challenges related to responsibilities, security and linkage validity when creating a large registry, the Primary and Secondary Cancer Care Registry (PSCCR) – Breast Cancer. This registry was created from two existing registries: a cancer registry with data on diagnosis, tumour and treatment, and a primary care registry with data on primary care contacts and diagnoses.