Methodology in core outcome set (COS) development: the impact of patient interviews and using a 5-point versus a 9-point Delphi rating scale on core outcome selection in a COS development study

Background: As the development of core outcome sets (COS) increases, guidance for developing and reporting high-quality COS continues to evolve; however, a number of methodological uncertainties remain. The objectives of this study were: (1) to explore the impact of including patient interviews in developing a COS, (2) to examine the impact of using a 5-point versus a 9-point rating scale during Delphi consensus methods on outcome selection and (3) to inform and contribute to COS development methodology by advancing the evidence base on COS development techniques.
Methods: Semi-structured patient interviews and a nested randomised controlled parallel group trial were conducted as part of the Pelvic Girdle Pain Core Outcome Set project (PGP-COS). Patient interviews, as an adjunct to a systematic review of outcomes reported in previous studies, were undertaken to identify preliminary outcomes for inclusion in a Delphi consensus survey. In the Delphi survey, participants were randomised (1:1) to a 5-point or 9-point rating scale for rating the importance of the list of preliminary outcomes.
Results: Four of the eight patient interview-derived outcomes were included in the preliminary COS; however, none of these outcomes were included in the final PGP-COS. The 5-point rating scale resulted in twice as many outcomes reaching consensus after the 3-round Delphi survey compared to the 9-point scale. Consensus on all five outcomes included in the final PGP-COS was achieved by participants allocated the 5-point rating scale, whereas consensus on four of these was achieved by those using the 9-point scale.
Conclusions: Using patient interviews to identify preliminary outcomes as an adjunct to conducting a systematic review of outcomes measured in the literature did not appear to influence outcome selection in developing the COS in this study. The use of different rating scales in a Delphi survey, however, did appear to impact on outcome selection. The 5-point scale demonstrated greater congruency than the 9-point scale with the outcomes included in the final PGP-COS. Future research to substantiate our findings and to explore the impact of other rating scales on outcome selection during COS development is, however, warranted.
Supplementary Information: The online version contains supplementary material available at 10.1186/s12874-020-01197-3.


Background
Recently, there has been an increase in the development of core outcome sets (COS) to overcome the heterogeneity in outcome selection across clinical trials for a broad spectrum of health conditions [1]. A COS is a standardised set of outcomes which should be measured and reported, as a minimum, in all studies for a specific health area/condition [2]. The standardised set of outcomes allows for the results across trials to be combined or compared, reduces the potential for reporting bias and ensures that outcomes are meaningful, relevant and useable. Use of COS in trials and systematic reviews can strengthen the evidence base, resulting in improved quality of care worldwide. The development of a COS is a stepwise process that involves working with relevant stakeholders of a particular health condition/area to prioritise the core set from a larger list of outcomes which have been identified through earlier work [2]. Guidance for developing and reporting high-quality COS is evolving; however, a number of methodological uncertainties remain [2][3][4][5][6].
Involving patients/health service users from an early stage is recommended in COS development; however, the most appropriate way to facilitate inclusion remains largely unknown [2,7]. Participation in Delphi surveys is the most popular method used for patient inclusion by COS developers, but mixed methods techniques are becoming increasingly popular [7,8]. COS developers using mixed method techniques often conduct patient interviews as an adjunct to a systematic review of the literature to identify an initial list of potential outcomes for inclusion in a Delphi consensus survey [9]. This reflects current COS development guidance which recommends that the initial list of outcomes is identified from multiple sources including systematic reviews of published studies, reviews of qualitative work, examination of items collected in national audit data sets and interviews or focus groups with key stakeholders, such as patients [2]. In addition to helping identify potential outcomes for a COS, patient interviews may also assist research teams in understanding why particular outcomes are so important and also in understanding the language used by patients when referring to these outcomes in other phases of COS development [9]. However, conducting these patient interviews increases the workload and adds additional costs, resources and time for the COS development team in the absence of clear evidence of impact on final outcome selection; as such, research on this topic is recommended [2,10].
After the preliminary list of outcomes has been identified, the Delphi technique is the most commonly used method for rating the importance of these outcomes for inclusion in the COS [2]. The Delphi is an iterative survey method whereby relevant stakeholders are sent a series of questionnaires, known as 'rounds', and are asked to rate the importance of each identified outcome for inclusion in the COS, usually using a rating scale. The Delphi technique is advantageous because it allows individuals to respond anonymously and can be circulated to a large number of diverse stakeholders without any geographical restrictions [11]. The COMET (Core Outcome Measures in Effectiveness Trials) Initiative provides guidance for using the Delphi technique to prioritise outcomes in developing a COS, but also recognises that there are a number of methodological uncertainties surrounding this method which need to be further explored [2,4,12]. For example, a variety of rating scales have been used in COS development, yet it remains unclear which rating scale is the most appropriate for use in the Delphi phase of a COS development study. Qualitative interviews reported mixed feedback on user experience of different rating scales [13], and only one study has compared the use of two different rating scales, a 3-point and a 9-point scale, for rating preliminary outcomes [14]. The authors of this study reported that the use of the 9-point rating scale resulted in almost twice as many outcomes being rated as important compared with the 3-point rating scale in the first Delphi round. Retaining too many outcomes after each Delphi round is problematic because the goal of this process is to narrow down a larger list into a minimum set, and a COS with too many outcomes may not be feasible or may not ultimately be adopted in research and clinical practice.
For this reason, we embedded a randomised trial within our Pelvic Girdle Pain Core Outcome Set (PGP-COS) development project to compare the impact of a 5-point versus a 9-point rating scale on preliminary outcome selection and the final agreed COS [10,15].
The objectives of this study were:
1. To determine if including patient interviews as an adjunct to a systematic review for identifying the initial list of outcomes influences the final COS.
2. To evaluate the impact of using a 9-point versus a 5-point rating scale in the Delphi phase of a COS development study on the number of "important" ratings received for each outcome in each round of the Delphi and on the final COS, as well as on attrition rates and on ease and clarity of use.
3. To inform and contribute to COS development methodology by advancing the evidence base on COS development techniques.

Material and methods
This study was embedded within the Pelvic Girdle Pain Core Outcome Set (PGP-COS) study and its protocol was published prospectively [10]. Ethical approval for the study was granted by the University Research Ethics Committee. The PGP-COS was developed by first identifying potential outcomes through a systematic review of previous studies and semi-structured patient interviews, then inviting stakeholders to rate the importance of these outcomes for inclusion in the PGP-COS in a 3-round Delphi survey and, finally, agreeing on the final COS in a face-to-face consensus meeting with key stakeholders. In-depth methodological details about the design and analysis of the PGP-COS project, including the systematic review, semi-structured interviews, the Delphi survey and the consensus meeting, are available in the study protocol [10], the published systematic review [16] and the PGP-COS main report (Remus et al: A Core outcome set for research and clinical practice in women with pelvic girdle pain: PGP-COS, Under review). For flow and clarity, summary details of the initial work leading to the Delphi and of the embedded randomised trial are described below.

Steering committee
An International Steering Committee with members from five countries, including researchers, clinicians, and methodologists, worked on the development of this COS. The day-to-day conduct of the study was managed by a project team of three people (AR, FW, VS) working at the same institution (Trinity College Dublin, Ireland), who designed and addressed key aspects of the study.
The other members of the Committee were involved in conducting interviews, participated in meetings to discuss the progress and monitor the conduct of the study and provided consultation regarding critical decisions.

Interviews
Fifteen women with current or previous experience of PGP were interviewed in three countries, Ireland (n = 5), Sweden (n = 5) and Mexico (n = 5), to seek patients' views on their treatment needs and the PGP outcomes that were important to them. Participants were recruited via physiotherapy and chiropractic clinics and provided written informed consent for taking part in the interviews. This phase of the study used a qualitative descriptive design (Remus et al: A Core outcome set for research and clinical practice in women with pelvic girdle pain: PGP-COS, Under review).

Delphi study
The systematic review searched for and extracted the outcomes reported in all previous intervention studies on PGP and lumbar-pelvic pain. One hundred and seven studies were included in the review, yielding 45 distinct outcomes [16]. These outcomes were then grouped into core domains using the OMERACT filter 2.0 framework [17]. The systematic review and patient interviews collectively generated a list of 53 preliminary outcomes, which were entered into a bespoke Delphi questionnaire created in Google Forms [18]. Two versions of this questionnaire were created: one with a 5-point rating scale and one with a 9-point rating scale (Fig. 1). Five stakeholder groups (PGP patients, clinicians, researchers, those with a dual researcher and clinician role, and policy makers/service providers) were invited to participate via mass invitational emails to patient and professional organisations and through social media (Facebook and Twitter) using snowball sampling methods.

Choosing the rating scales
To meet objective 2, we first had to choose our comparator rating scales, noting that the type of rating scale used can present different data collection and analytical challenges and may give rise to concerns for data quality especially if the scale has comparable validity and reliability issues [19]. For example, a scale offering too few options, such as 'agree' and 'disagree', will not capture 'neutral' or 'unsure' attitudes. Adding further options, including moderate options at the positive and negative ends of the scale (e.g. 'somewhat agree'), will allow for greater differentiation in user judgement, potentially resulting in a more accurate representation of attitudes. Conversely, if too many options are offered, the clarity of the scale might become compromised, as each additional point is one more point that the user must interpret [19].
Earlier empirical research on rating scales has demonstrated various results with respect to the reliability and validity of scales of different lengths. Alwin and Krosnick identified a 3-point scale as having the lowest reliability, 2, 4, 5, and 7-point scales as having equivalent reliability, and 9-point scales as having maximum reliability [20]. Contrastingly, Scherpenzeel found the highest reliability for 5-point scales and lower reliability for 10-point or longer scales [21]. Other studies also note increasing reliability from 2 to 3 to 5-point scales but equivalent, or minimal increases, thereafter for 7, 9, and 11-point scales [22,23]. In a study that asked participants to rate various objects by marking points on lines with no discrete category divisions and to indicate their range around each judgement, the estimated number of scale points naturally employed was 5 [24], although when more scale points were offered (up to 19), the more points people used, up to about 9 [25].
Although this evidence is informative and supportive for the scales chosen in the current study, ultimately the 9-point scale was chosen because it is recommended by the Grading of Recommendations Assessment, Development and Evaluation (GRADE) working group to assess the importance of evidence and is the scale used in DelphiManager, a web-based system designed by the COMET Initiative for facilitating and managing Delphi surveys [2,26,27]. The 5-point rating scale was arbitrarily chosen as the comparator scale because it had been used previously in COS development Delphi studies [28][29][30], and because 5-point rating scales are relatively common across research studies that include a rating scale.

Intervention
All Delphi participants were randomly allocated to one of the two survey versions when they clicked the link to participate in round 1 of the Delphi study. Random survey allocation was achieved using a random redirect tool published online (www.allocate.monster). When a participant accessed the link created by the tool, the background code selected one of the two versions of the questionnaires at random and redirected the participant to that version. The randomisation method was simple randomisation [31]. Participants used the same rating scale they were initially allocated for all three Delphi rounds.
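The allocation step described above can be illustrated with a minimal Python sketch. It is not the code behind the redirect tool; the survey URLs, the seed and the function name `allocate` are hypothetical, and the sketch assumes only that each visit is an independent, equal-probability draw between the two survey versions (simple randomisation).

```python
import random

# Hypothetical placeholder URLs; the actual survey links are not reported.
SURVEY_URLS = {
    "5-point": "https://example.org/delphi-5-point",
    "9-point": "https://example.org/delphi-9-point",
}


def allocate(rng: random.Random) -> str:
    """Simple randomisation: an independent 1:1 draw for every visitor."""
    return rng.choice(sorted(SURVEY_URLS))


# Simulate the 205 Delphi participants clicking the shared link.
rng = random.Random(2020)  # fixed seed only to make the sketch repeatable
allocations = [allocate(rng) for _ in range(205)]
counts = {arm: allocations.count(arm) for arm in SURVEY_URLS}

for arm, n in counts.items():
    print(f"{arm} survey ({SURVEY_URLS[arm]}): {n} participants")
```

Because each draw is independent, simple randomisation does not guarantee equal group sizes, so the two arms can differ in size by chance.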
The Delphi surveys were typical of those previously used in COS development, with participants being asked to rate the importance of each outcome for inclusion in the PGP-COS in each round using the scale to which they were allocated. After each round, all participants were emailed a copy of their responses for reference in subsequent rounds. In rounds 2 and 3, participants were provided with the proportion of participants in each stakeholder group who rated each outcome as "important" [4+ (5-point survey) and 7+ (9-point survey)]. In round 1, participants were also given the opportunity to suggest up to three additional outcomes using a free-text response. The additional outcomes identified from both survey versions were combined so that all participants, regardless of survey version, had the opportunity to rate these outcomes in round 2, in addition to re-rating all outcomes included in round 1. After round 2, only outcomes that reached a priori consensus, defined as ≥70% of participants scoring the outcome as "important" [4+ (5-point scale) or 7+ (9-point scale)] in 3/5 stakeholder groups, inclusive of the patient representative group, were included in the round 3 Delphi surveys. During the Delphi phase, each survey was treated independently. Following round 3, all outcomes that reached a priori consensus on either of the survey versions were included in a preliminary PGP-COS for that version. These two preliminary PGP-COSs were then combined as one list of outcomes and presented at the face-to-face consensus meeting, where key stakeholders (i.e. at least one representative from each of the Delphi survey groups, and 11 stakeholders in total) voted on the outcomes for inclusion in the final PGP-COS. Additionally, at the end of round 3, questions on the ease of use, ease of understanding, and the clarity of the scale were posed to participants, with an opportunity to provide any additional comments about the survey using free text.
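The a priori consensus rule can be made concrete with a short Python sketch. This is one reading of the rule rather than the project's own implementation: it assumes "3/5 stakeholder groups, inclusive of the patient representative group" means that at least three groups, one of which must be the patient group, each have ≥70% of their members rating the outcome as "important". The function names, group labels and ratings are hypothetical.

```python
from typing import Dict, List

# Lowest rating counted as "important" on each scale (4+ of 5, 7+ of 9).
IMPORTANT_FROM = {5: 4, 9: 7}


def group_agrees(ratings: List[int], scale_points: int,
                 threshold: float = 0.70) -> bool:
    """True if at least `threshold` of a group's ratings are 'important'."""
    if not ratings:
        return False
    cutoff = IMPORTANT_FROM[scale_points]
    share = sum(r >= cutoff for r in ratings) / len(ratings)
    return share >= threshold


def reaches_consensus(ratings_by_group: Dict[str, List[int]],
                      scale_points: int,
                      patient_group: str = "patients",
                      min_groups: int = 3) -> bool:
    """Assumed rule: >= min_groups agreeing groups, patients among them."""
    agreeing = {group for group, ratings in ratings_by_group.items()
                if group_agrees(ratings, scale_points)}
    return patient_group in agreeing and len(agreeing) >= min_groups


# Made-up ratings for one outcome on the 5-point survey.
example_5pt = {
    "patients": [5, 4, 4, 3, 5, 4, 4, 5, 4, 2],               # 80% important
    "clinicians": [4, 4, 5, 3, 4, 5, 4, 3, 4, 4],             # 80% important
    "researchers": [5, 4, 4, 4, 3, 5, 4, 4, 2, 4],            # 80% important
    "clinician_researchers": [4, 5, 4, 4, 3, 4, 5, 4, 4, 3],  # 80% important
    "policy_makers": [3, 3, 4, 2],                            # 25% important
}
# Same outcome, but with the patient group no longer agreeing.
without_patients = dict(example_5pt, patients=[3, 3, 4, 2, 3])

print(reaches_consensus(example_5pt, scale_points=5))
print(reaches_consensus(without_patients, scale_points=5))
```

In this made-up example the first call reports consensus, while the second does not: three professional groups still agree, but the patient group does not, which is what "inclusive of the patient representative group" is taken to require here.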

Sample size
The embedded trial was based on an opportunistic sample of all 205 participants involved in the Delphi phase of the main PGP-COS development project. Therefore, no sample size calculations were performed as statistical analysis was intended to be exploratory and formative.

Data analysis
Analysis 1: influence of patient interview-derived outcomes
To investigate objective 1, we analysed the following: the number of new outcomes identified in patient interviews, how the interview-derived outcomes were rated in each Delphi round, the number of interview-derived outcomes included in the final COS and the extent to which additional outcomes provided by patients in round 1 of the Delphi survey overlapped with the interview-derived outcomes. All descriptive statistical analyses were performed using Excel (Microsoft Excel 2016).

Analysis 2: influence of Delphi rating scale on final COS
We used descriptive statistics (counts and %) for demographic and survey response data. To investigate objective 2, we analysed the following: the proportion of outcomes reaching a priori consensus in each survey in each round, the proportion of outcomes included in the preliminary COS for each survey, differences between the scales in whether the outcomes included in the final COS had reached consensus in each survey, and the attrition rate for each scale. Z-scores were calculated to test the differences in the proportion of outcomes reaching a priori consensus and in overall attrition between the surveys, with alpha set to 0.05, using the formula below [32]:

$$Z = \frac{p_1 - p_2}{\sqrt{p\,(1 - p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$

where $p_1$ = proportion from the 5-point survey; $p_2$ = proportion from the 9-point survey; $n_1$ = number of possible outcomes from the 5-point survey; $n_2$ = number of possible outcomes from the 9-point survey; and $p$ = pooled proportion. All statistical analyses were performed using Excel (Microsoft Excel 2016) [33]. Scale "ease of use" and "clarity" responses plus any additional comments were analysed using quantitative content analysis. Figure 2 presents an overview of the PGP-COS phases and the summary results of the embedded studies.

Results
Analysis 1: influence of patient interview-derived outcomes
The fifteen patient interviews identified 23 outcomes for inclusion in the round 1 Delphi questionnaire. Fifteen of these outcomes overlapped with outcomes identified from the systematic review, and eight were new outcomes [16]. The new patient interview-derived outcomes included pain character/type and need for a mobility aid. Table 1 shows how stakeholders rated the interview-derived outcomes in all three rounds of the Delphi study. Four of the outcomes (50%) were included in the preliminary PGP-COS. None of the patient interview-derived outcomes were included in the final PGP-COS (Remus et al: A Core outcome set for research and clinical practice in women with pelvic girdle pain: PGP-COS, Under review). It should be noted, however, that due to travel complications only one participant representing the patient group was able to attend the face-to-face consensus meeting. Nevertheless, five participants who identified primarily with a different stakeholder group also identified themselves as patients, i.e. patient/clinician (n = 1); patient/researcher (n = 2); patient/researcher/clinician (n = 1); patient/clinician/service provider (n = 1). During round 1 of the Delphi study, patients suggested 16 additional outcomes using the free-text option, of which 11 were considered actual outcomes by the PGP-COS study steering committee representatives (AR, FW). Four of these 11 outcomes (36%) were new outcomes, which were subsequently included in round 2 of the Delphi study; six (55%) overlapped with outcomes identified in the systematic review and one (9%) overlapped with a patient interview-derived outcome. None of the additional outcomes suggested by patients were included in the final PGP-COS. Table 2 presents the suggested additional outcomes and decisions pertaining to these.

Analysis 2: influence of Delphi rating scale
Participant demographics for all three rounds of the Delphi study can be viewed in Table 3. An overview of outcome inclusion and exclusion is detailed in Fig. 2. Comparison of outcomes reaching a priori consensus between the two survey versions for each Delphi round are detailed in Table 4.
There was a significant difference in the proportion of outcomes reaching a priori consensus between the two scale versions in round 1 (Z = 2.46, p = 0.01) and round 2 (Z = 2.95, p < 0.01) of the Delphi study. After round 1, consensus was reached on 41 outcomes (77%) on the 5-point survey compared with 29 outcomes (55%) on the 9-point survey. After round 2, consensus was reached on 37 of the 68 round 2 outcomes (54%) on the 5-point survey compared with 20 outcomes (29%) on the 9-point survey. After Delphi round 3, there was a significant difference in the proportion of outcomes included in the preliminary PGP-COS between the two survey versions (Z = 2.55, p = 0.01). The resulting 5-point preliminary PGP-COS included 24 outcomes (35% of all round 3 outcomes); in contrast, the 9-point scale provided 11 outcomes (16%) for the preliminary PGP-COS. Ten of the 11 outcomes (91%) on the 9-point preliminary PGP-COS overlapped with outcomes from the 5-point preliminary PGP-COS. All five outcomes of the final PGP-COS were included in the 5-point preliminary PGP-COS, whereas four outcomes from the final PGP-COS were included in the 9-point preliminary PGP-COS.
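As a check on the arithmetic, the reported Z statistics can be approximately reproduced in Python, assuming they were computed as a pooled two-proportion z-test. The sketch is illustrative and is not the authors' Excel calculation: the function name is hypothetical, the two-sided p-value uses the normal approximation, the round 1 denominator of 53 is the number of preliminary outcomes, and the round 3 denominator of 68 is inferred from the reported percentages (35% and 16%).

```python
import math
from typing import Tuple


def two_proportion_z(x1: int, n1: int, x2: int, n2: int) -> Tuple[float, float]:
    """Pooled two-proportion z statistic and two-sided normal p-value."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


# (outcomes reaching consensus, outcomes rated): 5-point first, then 9-point.
comparisons = {
    "Round 1": (41, 53, 29, 53),
    "Round 2": (37, 68, 20, 68),
    "Round 3 (preliminary PGP-COS)": (24, 68, 11, 68),  # denominators inferred
}

results = {label: two_proportion_z(*counts)
           for label, counts in comparisons.items()}

for label, (z, p) in results.items():
    print(f"{label}: Z = {z:.2f}, two-sided p = {p:.3f}")
```

Under these assumptions the Z values round to 2.46, 2.95 and 2.55, in line with the statistics reported above.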
There was no significant difference in overall attrition between the two surveys (Z = 1.15, p = 0.25). The attrition rate for respondents using the 5-point scale was 25% between round 1 and round 2, 9% between round 2 and round 3 and 32% overall. The attrition rate for respondents using the 9-point scale was 32% between round 1 and round 2, 11% between round 2 and round 3 and 39% overall. Table 5 details attrition rates per stakeholder group.
Only Delphi respondents who completed all three rounds were invited to provide feedback on the ease of use, understanding and clarity of the rating scales; in round 3, they were asked whether their allocated scale was easy to use and clear to understand. Overall, 64% (45/70) of respondents said the 5-point scale was easy/very easy to use, compared with 51% (33/64) for the 9-point scale. However, for the 9-point scale, respondents had to scroll across to see all the options, which 10 (16%) respondents commented was not practical.

Discussion
This is the first embedded methodological study to examine the impact of patient interview-derived outcomes and to compare a 5-point and a 9-point rating scale in the development of a COS. Our embedded study found that none of the outcomes derived solely from the interviews were included in the final PGP-COS. The rating scale used in the Delphi consensus process did influence the proportion of outcomes rated as "important" in the Delphi rounds and could potentially impact on a final COS, as more or fewer outcomes would be available at the final COS consensus meeting.

Patient interview-derived outcomes
It is plausible that conducting patient interviews to identify the initial list of outcomes for inclusion in the Delphi phase of COS development may not be as important as one might have thought, especially if resources and time are limited. At the face-to-face consensus meeting, it was discussed that three of the interview-derived outcomes were not deemed "absolutely critical" to be measured and reported in all trials as unique outcomes, but could be captured by two outcome measures that were included in the final PGP-COS. For example, 'sexual functioning' was considered important but participants agreed that this was covered in the outcome 'functioning/disability/activity limitations' of the final PGP-COS. This is important to consider when identifying the most appropriate instrument to measure these outcomes. It is essential that the instrument used measures all aspects of the outcomes relevant to patients. While patient interview-derived outcomes were not included in the final COS in this study, the interview data could still be valuable in providing insight into why outcomes are important to inform the consensus meeting. Overall, this highlights the importance of patient participation throughout the Delphi study and in face-to-face consensus meetings in the development of COS on "what" to measure, as well as in later stages such as the development of "how" to measure the COS, and suggests that it may not be necessary to conduct patient interviews to identify the initial list of potential outcomes.
Table 1 notes: (a) % of each stakeholder group that rated the outcome as "important" on each scale (5-point and 9-point); (−) indicates that the outcome did not go through to the corresponding Delphi survey round and subsequently was not rated by stakeholder groups; (b) denotes outcomes that were included in the preliminary COS after round 3 of the Delphi.
Additionally, we identified that only one patient interview-derived outcome overlapped with the additional outcomes suggested by patients during round 1 of the Delphi study. However, due to our study design, the true extent of overlap between these outcomes remains unknown, as the initial list of potential outcomes was an amalgamation of the outcomes identified from the systematic review and interviews. We cannot ascertain if the other interview outcomes would have been suggested by patients if they were not included in the initial list that was presented during round 1 of the Delphi. Recommended Delphi methodology includes providing participants with the opportunity to suggest additional outcomes during the first survey round and asking participants to re-rate the initial list of outcomes in the subsequent round. A potential method for future studies to examine the full extent of overlap between these two sets of outcomes would be to exclude any interview-derived outcomes from initial voting in the first round and introduce them in the second round of voting. This may offer more understanding of the extent of overlap between patient interview-derived outcomes and patient-suggested additional outcomes in Delphi studies. It is plausible that patient interviews at the initial outcome identification stage are redundant if patients are included in the Delphi survey and in the consensus meeting.
Table 3 (excerpt). Country, n (%):
Brazil: 1 (1%), 0 (0%), 1 (1%), 0 (0%), 1 (1%), 0 (0%)
Canada: 6 (6%), 11 (11%), 4 (5%), 9 (13%), 4 (6%), 9 (14%)
Colombia: 1 (1%), 0 (0%), 1 (1%), 0 (0%), 1 (1%), 0 (0%)
Cook Islands: 0 (0%), 1 (1%), 0 (0%), 0 (0%), 0 (0%), 0 (0%)
Croatia: 1 (1%), 0 (0%), 0 (0%), 0 (0%), 0 (0%), 0 (0%)
Philippines: 0 (0%), 1 (1%), 0 (0%), 0 (0%), 0 (0%), 0 (0%)

Impact of survey scales
In our study, the 5-point scale resulted in twice as many outcomes in the preliminary PGP-COS after three rounds compared with the 9-point scale. This could potentially be a downside to using a 5-point scale, considering that the intention in using the series of survey rounds is reductionist, based on outcomes being viewed as 'critically important' to the COS. It could also indicate that this scale, as used in this PGP-COS, had less discriminatory power for the outcomes offered for rating than the 9-point scale. This result, interestingly, is in direct contrast to that of De Meyer and colleagues, who found twice as many outcomes selected as "critical" on the 9-point scale compared to the 3-point scale [14]. Although we cannot directly compare our results, as scale differences were only explored in the first round of De Meyer's Delphi and our consensus definitions differed, it is plausible that the sizeable difference in rating options between a 3-point and 9-point scale, compared to a 5-point and 9-point scale, may explain the conflicting results between our studies. In spite of the contrasting results, it is evident that the rating scale utilised in Delphi studies does impact on the outcomes made available to the final COS consensus meeting, which supports previous concerns raised about best practice when creating and using scales in survey research [19]. In particular, Beckstead identified that using a scale with too few response options may not allow a respondent to make full use of his or her capacity to discriminate, while a scale with too many options may exceed the respondent's capacity to discriminate, contributing to measurement error [34].
Collectively, the greater proportion of outcomes rated as important using the 5-point scale, and the absence of one final PGP-COS outcome from the 9-point preliminary PGP-COS, suggest that it may be harder to discriminate outcome importance when there are limited options to choose from and that too many options may result in measurement error. This is also supported by the free-text responses provided by participants after our Delphi study, which indicated that both scales were generally perceived as easy to use, although responses concerning the number of rating options were mixed for the 5-point scale, whereas the 9-point scale was generally reported as having too many options. Additionally, we included combined fully labelled, numeric rating scales in our Delphi questionnaires, as fully labelled scales have been shown to produce more reliable and valid data [34]. In our surveys, the middle rating carried the label "unsure unimportant or important" for option 3 on the 5-point scale and option 5 on the 9-point scale (Fig. 1). Participants sometimes expressed a lack of understanding of the middle rating option on both the 5-point and 9-point scales.
As there is currently no established "gold standard" labelling system for rating scales, it is possible that the labels provided for the middle rating may have impacted clarity and understanding of the scale. It is also possible that a lack of understanding of the middle option may [...]
In addition, our results highlight user experience concerns that research teams should consider when incorporating rating scales in COS protocols. For example, a number of our 9-point Delphi survey participants expressed frustration that they were required to scroll across to see all possible responses on the scale. This highlights the importance of user experience considerations when selecting a rating scale and also the platform a research group intends to use to disseminate their Delphi questionnaire, especially when a group intends to utilise e-Delphi surveys that are completed on mobile phones or computers with varying screen sizes. These considerations can be addressed with pilot testing of the surveys amongst end users before sending out the finalised e-Delphi surveys for data collection. It is important to note, however, that the comments regarding participant experience are only reflective of those who completed all three rounds of the Delphi. Additional insights could be derived from those who dropped out in earlier rounds; however, these were not collected. Future work should consider user experience during all rounds of a Delphi study. Our results also indicate the importance of designing a Delphi questionnaire that is user friendly, with the goal of maintaining a group's involvement to completion if responses from that group are of importance. Although there was no difference in retention rates between rating scales, the patient-representative group had the highest rate of drop-out on both surveys, whilst researchers and clinician/researcher group representatives had the lowest attrition rates on both surveys.
It is possible that researchers and clinician/researcher groups may have a stronger understanding of the implications of COS development and/or research methodology compared to other stakeholder groups, which may explain their high retention. These results, combined with those from our patient interview analysis, suggest that input from patient-group representatives should be considered in the initial questionnaire, which provides the opportunity to include patients at an early stage in COS development. Overall, it is evident that COS development teams should also consider user experience when selecting an appropriate scale that best suits the target populations in COS methodology development.

Limitations
This study had several limitations. This study was embedded within one COS development project only. The inclusion of patient interview-derived outcomes in the initial list used in the first Delphi round did not allow us to evaluate the full extent of overlap between the interview-derived outcomes and the additional outcomes suggested by patients. Only two rating scales were used to study the influence of scale selection on outcomes made available to the final COS. We compared a 5-point and 9-point scale only, both of which included scale anchors for all response options, when several other scales and scale formats are used by COS developers. Additionally, participants were only asked about ease of use, understanding and clarity of scales at the end of round 3 of the Delphi. This decision was taken to avoid overwhelming participants with a lengthy survey at each round. As a result, we do not have data on these metrics from participants who dropped out in round 1 and round 2. Future work is warranted to explore the user experience across rounds with different scales. Furthermore, the proportion of outcomes rated as "important" in each round and the preliminary COS on both scales depends on the chosen definition of consensus. Therefore, the reproducibility of this study in other fields of COS development, focussing on other health related topics, different scales, larger samples and different consensus definitions is warranted. Finally, as previously mentioned, only one patient participated in our face-to-face consensus meeting due to circumstances out of our control (i.e. flight cancellations). We did, however, have members who identified with multiple groups (i.e. a researcher/patient and clinician/patient) so the patient voice was not unheard. We do not believe this influenced the final COS as all participants had the opportunity to discuss the outcomes and the facilitator actively encouraged comments on each outcome from all attendees. 
Equally, all attendees voted independently at the consensus meeting, thus ensuring stakeholder group input into all outcomes that made up the final COS.

Conclusions
Overall, our results identified that outcomes derived from patient interviews did not directly impact the final COS in this study, but the rating scale used to prioritise outcomes did, highlighting further methodological considerations and challenges when developing COS protocols. While this study suggests that a 5-point scale may be preferable in Delphi consensus methods in terms of impact on outcome selection, ease of use and clarity, we acknowledge that this is one study only and, as such, definitive conclusions cannot be drawn. Future research on the impact of patient interview-derived outcomes on the final COS, overall patient involvement throughout the consensus methods and comparisons of other rating scales on the final outcome selection is needed.