Consent to data linkage in a large online epidemiological survey of 18–23 year old Australian women in 2012–13

Background Consent to link survey data with health-related administrative datasets is increasingly being sought but little is known about the influence of recruiting via online technologies on participants’ consents. The goal of this paper is to examine what factors (sociodemographic, recruitment, incentives, data linkage information, health) are associated with opt-in consent to link online survey data to administrative datasets (referred to as consent to data linkage). Methods The Australian Longitudinal Study on Women’s Health is a prospective study of factors affecting the health and well-being of women. We report on factors associated with opt-in consent to data linkage at the end of an online survey of a new cohort of 18–23 year old Australian women recruited in 2012–13. Classification and Regression Tree analysis with decision trees was used to predict consent. Results In this study 69% consented to data linkage. The provision of residential address by the individual, or not (as a measure of attitudes towards privacy), was the most important factor in classifying the data into similar groups of consenters (76% consenters versus 47% respectively). Thereafter, for those who did not provide their residential address, the incentives and data linkage information that was offered was the next most important factor, with incentive 2: limited-edition designer leggings and additional information about confidentiality showing increases in consent rates over Incentive 1: AUD50 gift voucher: 60% versus 37%. Conclusions In young Australian women, attitudes towards privacy was strongly associated with consenting to data linkage. Providing additional details about data confidentiality was successful in increasing consent and so was cohort appropriate incentives. Ensuring that prospective participants understand the consent and privacy protocols in place to protect their confidential information builds confidence in consenting to data linkage.


Background
Large epidemiological surveys are increasingly seeking consent to link or match survey data with administrative datasets [1]. Linking to these datasets can substantially enhance the utility of the collected data and allow researchers to answer important questions which are not readily answerable through the use of survey data alone [2]. Although some large prospective studies rely on opt-out consent for these linkages, recent practice has seen increasing reliance on opt-in consent to linkage as this greatly reduces the onus on researchers to demonstrate the benefit to public interest in allowing opt-out consent.
A recent systematic review assessing consent proportions to data linkage found that the consent proportion varied from 39 to 97% [3]. However, none of the studies was conducted online. Still there was considerable heterogeneity among the studies reviewed and variations among the methods by which consent was obtained, i.e. ranging from a face-to-face interview to a mailed letter.
Despite this, the method of obtaining consent did significantly affect the consent rate, with both the top (97%) and bottom (39%) scoring studies using face-to-face approaches to elicit consent [3]. A number of previous papers have reported differences between consenters and non-consenters across a range of variables including age, sex, race, area of residence, income, education and health status and, where an interview was conducted, attributes of the interviewer [4][5][6][7][8], although results have not been consistent. Other researchers have determined that likelihood to consent in face-to-face interviews was related to the salience of the linkage request, attitudes to privacy and community mindedness [2]. In a web survey of employment a small increase in consent rates was found when the time-saving benefit of linkage was mentioned [9].
It may be that these conflicting results reflect an underlying lack of understanding about the research process and the secondary use of health data. A number of studies have indicated that participants are more likely to consent if they have been provided with clear, easily understandable information about the importance of data linkage and they understand the issues involved [10,11] . Exploration of any discrepancies between consenters and non-consenters is important to exclude the possibility that systematic differences may exist. The presence of these may compromise the researchers' abilities to draw unbiased inferences from the linked datasets.
The use of the Internet and online technologies (such as Web-enabled phones) for conducting epidemiological surveys has recently been reported [1,12,13] and are both cost-effective and particularly suitable for younger participants [5]. However, there is relatively little information about how this modality affects participants' consents to linking their survey data to administrative datasets. Online surveys in the Netherlands (such as the Longitudinal Internet Studies for the Social Sciences) have included a request for consent to data linkages [1], but these have been based on an opt-out model (implicit consent) rather than opt-in (explicit consent).
Both State and Federal agencies in Australia retain data for administrative purposes. In many instances, these are longitudinal and contain high quality information about large numbers of Australians. The Australian Productivity Commission provides reports to the Australian Government on measures to improve the productivity and economic performance of the country and has recommended that access to administrative data by academics and other researchers should be regarded as a Government priority [14]. Accordingly, when considering survey design for the recruitment of a new cohort of young women born in 1989-95, the Australian Longitudinal Study on Women's Health (ALSWH) included a consent model which incorporated information about data linkage and a request that participants provide consent to directly link survey and individual level administrative records. The ALSWH has well-established privacy protocols covering the linking of participants' data that are in accordance with Australian current best practice [15], and the ability to link survey data to administrative datasets has the potential to deliver substantial benefits while still protecting personal privacy.
Little research exists about differences between consenters and non-consenters to data linkage in online surveys and even less for opt-in consent. This paper tries to fill this gap by evaluating differences between young women who did and who did not provide consent to data linkage via an online survey.

Study design
The ALSWH is a prospective study of factors affecting the health and well-being of women. In 2010, the ALSWH was provided with funding by the Australian Government Department of Health to recruit a new cohort of 18-23 year old women. Women born between 1989 and 1995 (1989-95 cohort) were recruited via online surveys between 2012 and 2013. Open recruiting was conducted using a variety of methods: Facebook (including Facebook advertising), other Web activities (such as Twitter, Instagram, YouTube), referrals (emails, snowballing), traditional media (including flyers, posters, postcards), and a fashion promotion. The recruitment strategies are illustrated diagrammatically in the recruitment paper for this cohort [16].
Two incentives were offered for women to complete the online surveys. Incentive 1: women were offered the chance to win one of a hundred AUD50 gift vouchers. Incentive 2: an intensive advertising campaign offered a chance to win one of 2000 pairs of limited-edition designer leggings with a theme reflecting the respondents' birth period. The leggings were very fashionable and highly desirable at the time of the survey. Implicit consent to the use of survey data was assumed if a woman completed an online survey. However explicit consent was requested to link that data to administrative datasets. All participants were provided with information on the reasons for the data linkage consent request and why the Medicare Australia card number was required. When Incentive 2 was offered, if the respondent did not consent to data linkage additional information popped up giving her a chance to change her mind. This additional information included further reassurances that health records provided via data linkage are confidential, examples of the type of information that the data linkage would provide and a link to an infographic [17] illustrating how data is linked anonymously using keys.
The ethics committees of the University of Newcastle (H-2012-0256) and The University of Queensland (2012000950) approved the research protocol.

Participants
Data for this paper were drawn from women born between 1989 and 1995 who responded to an online survey for the Australian Longitudinal Study on Women's Health. Comparison with the 2011 Australian Census showed that women in the sample were broadly representative of women of the same age nationally (Census 49.0% versus ALSWH 52.6% aged 18-20; Census 74.5% versus ALSWH 75.0% living in major cities excluding missing data) although a higher proportion of women had post-school qualifications (Census 33.8% excluding missing data versus ALSWH 48.5%).

Opt-in consent to data linkage
The outcome of consent examined in this study refers to the consent to data linkage, measured at the end of the online survey. Participants were asked for consent to data linkage with administrative datasets. They were not asked for consent to participate in the online survey, because implicit consent is assumed through the completion of the online survey. A total of 25,541 women completed the online survey, with the consent question at the end of the survey. Of these women, 17,684 (69%) consented and 7857 (31%) refused consent to data linkage.

Recruitment, incentive and information
The method of recruitment was assessed from the question 'How did you hear about the Australian Longitudinal Study on Women's Health survey?' and the responses were classified: 'Facebook', 'other web activities', 'referral', 'traditional media' and 'fashion promotion'. Incentives and information were: AUD50 gift vouchers and basic information about linkage or designer leggings and additional information about linkage.

Sociodemographic factors
The women were asked to provide information on their age, area of residence, highest educational qualification, ability to manage on income, relationship status and if they live with one or both parents, or with other adults. Age was categorized as '18 to 20', '21 to 23'. Area of residence was categorized according to the Australian Statistical Geography Standard (ASGS) Remoteness Areas as 'major cities', 'inner regional', 'outer regional', and 'remote or very remote'. A further category, 'missing' was added as 22% of values were missing for area of residence. Level of education was categorized into four groups: 'less than Year 12', 'Year 12', 'certificate or diploma' and 'university'. Women's ability to manage on their available income was based on responses provided on a five-point scale. Relationship status was categorized as partnered (married or cohabiting) or not partnered, including separated, divorced or widowed.
Health status Assessment of general health was self-reported with the following question "How would you rate your health now?" This question is derived from the SF36 and has been shown to be a valid and reliable indicator of general health status [18]. The Kessler Psychological Distress Scale (K10) [19] is a short screening scale of nonspecific psychological distress in the anxiety-depression spectrum. Consistent with previous usage, [20] K10 scores were categorised as 'low distress' (10 to 15), 'moderate distress' (16 to 21), 'high distress' (22 to 29) and 'very high distress' (30 to 50).
Women were also asked, "Have you ever been diagnosed with or treated for": chronic conditions including diabetes, heart disease, hypertension, asthma, and cancer other than skin cancer. These were categorised as 'no major condition' or 'any major condition'.

Health risk factors
Health risk factors included smoking ('current smoker' or not), alcohol consumption, body mass index and physical activity. Based on usual quantity and frequency of standard drinks consumed, weekly alcohol consumption was categorised as 'never drink alcohol', '1 to 7 drinks', '8 to 14 drinks' or 'more than 14 drinks' [21]. Body mass index was based on self-reported height and weight and categorised as 'underweight' (less than 18.5 kg/m 2 ), 'healthy weight' (18.5-24.9 kg/m 2 ), 'overweight' (25-29.9 kg/m 2 ) or 'obese' (30 kg/m 2 or more) [22]. Level of physical activity was classified as 'inactive', 'low', 'moderate' or 'high' based on how much time was spent walking briskly, and doing moderate and vigorous leisure activities in the last week [23].

Statistical analysis
Percentage of consenters versus non-consenters was compared across recruitment method, incentive and information re consent, socio-demographic, and health status variables using chi-squared tests. The Breiman, Friedman, Olshen and Stone (BFOS) Classification and Regression Tree (CART) [24] method for building decision trees was used, following instructions to approximate this in SAS Enterprise Miner 14.1 [25]. The BFOS method recommends using validation data if the dataset is large enough, hence the data has been partitioned equally into training and validation data. CART starts with the root node containing all individuals in the dataset, with the tree built recursively, then trained and pruned automatically. All variables were included in the analysis. The Gini reduction method was used as the measure of node impurity to determine node splitting. The assessment method was selected to prune the fullygrown tree. This selects the smallest subtree with the best assessment measure value. The misclassification assessment measure, i.e. the lowest proportion of misclassified observations, is used for a categorical target variable.

Results
In this study, 69% of 25,541 women consented to data linkage. Consent differed significantly by method of recruitment, with those women who were recruited via Facebook the least likely to provide consent (67%) while those women who were recruited via the fashion promotion most likely to consent (84%) ( Table 1). Women who were offered leggings and additional information on data linkage were significantly more likely to consent to data linkage than those solely offered a cash incentive (79% versus 61%). Examination of the sociodemographic variables indicated that minor differences existed between consenters and non-consenters. Data were missing for less than 2% of sociodemographic variables with the exception of area of residence (missing for 22.7%). Women who did not provide area of residence were significantly less likely to provide consent (47% versus 76%).
Few differences were observed between health characteristics of consenters and non-consenters (Table 2). Table 3 shows the relative importance of potential splitter variables in the CART. Area of residence was the most important followed by incentive. Other variables held some significance in the unpruned tree construction, i.e. recruitment and managing on income, but were not in the final pruned CART. Figure 1 shows the pruned CART for consent to data linkage. Area of residence was the first splitter followed by incentive for women who did not provide area of residence, resulting in a tree with three terminal nodes. Other variables including recruitment method and managing on income did not lower the misclassification rate (0.27 for validation data) any further and were not included in the pruned tree.

Discussion
More than two-thirds of the women who participated in this online survey provided opt-in consent to data linkage. Women who did not provide their area of residence were less likely to consent to data linkage. This may reflect a more cautious approach to divulging and sharing personal information among these young women. Previous research suggests that attitudes toward privacy and confidentiality are strongly related to non-consent [11]. For these women, it was apparent that consent differed by the incentive offered: women offered leggings and additional information about data linkage were more likely to consent than those offered a cash incentive and basic information about linkage.
In this study the consent question was at the end of the survey. The placement of consent has been identified as influencing response rates [26], although only one study was located which examined the consent placement in an online-administered survey [27]. In that study of German establishments, placement of the consent question at the beginning elicited higher consent rates than when it was placed at either the middle or the end of the survey. In those that utilised a telephone setting (e.g. computer-assisted telephone interviewing) the placement of the consent question at the beginning of the survey elicited higher consent rates [28]. However, the authors go on to aver that most linkage studies place the consent question at the end. A recent study on the placing of consent suggested that when this item is inserted at the beginning of a survey, it may impact on subsequent responses, although these measurement errors were confined to the recall of dates [29].
Some evidence exists that the wording of the linkage consent request influences participants' choices. For example in a web survey of employment, income and expenditure, a small increase in consent rate was found when the time-saving benefit of linkage was mentioned [9]. In addition, a limited body of research suggests that assurances of confidentiality, identifying salient aspects of the linkage to respondents and providing some incentives may make respondents more likely to consent [11,30,31]. Participants in the current survey were provided with either a chance to win an AUD50 gift voucher and basic information about data linkage, or the opportunity to win a pair of leggings and additional information about data linkage if they refused consent. The results clearly indicate that the provision of additional information, together with the leggings, was associated with a higher rate of consent for those with privacy concerns. The leggings were highly desirable; however we were unable to ascertain if the provision of additional information alone or leggings alone would have resulted in a similar increase in consent. We were also unable to compare the use of incentives versus no incentives. This could be usefully explored in future data linkage studies.
Differences in the health characteristics of the consenting and non-consenting women were small, and consistent with earlier findings from a systematic review that reported some differences between consenters and nonconsenters across all outcomes [6]. That same review also noted that there was a lack of consistency in the direction of differences across studies and in the magnitude of the association. The percentage of women who consented to this online survey was consistent with rates reported in a recent systematic review [3]. A potential limitation of the current study is the recruitment strategy, which was not based on a probability sample. However, comparison with the Australian Census suggests the women were broadly representative of Australian women of the same age, although they were more educated.
The relationship between higher education levels and active consent has been highlighted in a number of Missing was less than 2% for consenters and non-consenters for all variables except area of residence. Missing data were no more than 1% of all variables for consenters and non-consenters a defined as any of diabetes, heart disease, hypertension, asthma, cancer other than skin cancer  [32], reported that both higher education and higher socio-economic status were associated with an affinity to consent. This was not consistent with the findings of this study however: while there were no differences on education level between consenters and non-consenters, small but significant differences were evident on the women's ability to manage on available income. One systematic review of participants' attitudes to, and opinion of, linking research data to administrative data suggested that men and older respondents were more likely to provide consent [10]. However, this review also highlighted the general lack of knowledge about the process of data linkage and participants' concerns about misuse and potential commercialisation of their data. This concurs with Australian research, which suggests that people are often not well-versed in the concepts of data linkage or de-identified data [32]. An exploration of reasons to consent or withhold consent found that most participants had a limited understanding of how data linkage worked and why they were being asked to provide consent [1]. In the same study comparison of online or mailed consent requests showed no differences in the percentage of consenters and non-consenters based on the mode of request [1]. A qualitative study of young adults [33] reported some confusion about various types of consent, with assumptions that opt-in consent equated to consent more generally. With opt-in methods, participants are generally provided with information and then asked if their data can be used for a specific purpose, as was the case for the current study. It may be that young people are more likely to consent with this method and this should be considered for future research.

Conclusions
Increasingly, online surveys with data linked to administrative datasets, such as hospital and mortality records, are being utilised for large-scale epidemiological studies because of their cost-effectiveness and acceptability [34]. Despite this, scant research attention has been paid to the way in which consenters and non-consenters may differ and the implication this has for potential bias in survey results. This study contributes to the literature by identifying factors that may increase the rates of consenting to data linkage in young Australian women who participate in an online survey. Consent appears to be related to concerns about privacy and may be tempered by the provision of additional information about the linkage process and a desirable incentive. Ensuring that prospective participants understand what they are consenting to, if they elect to consent to data linkage, and the privacy protocols in place to protect their confidential information, may build confidence in the research process and enable researchers and policy makers to maximise the use of administrative datasets.