In this study, we attempted to identify factors that influence the difference to be detected in clinical trials as determined by experienced trialists. We tested three factors that in our opinion should influence the difference to be detected (severity of outcome, patient age group, and presence of side-effects in the experimental treatment), and three that should not (baseline level of risk, recruitment difficulties, and cost of treatment). Only two observed effects were in conformity with our expectations: we found that neither recruitment difficulties nor treatment cost had any effect on the difference to be detected. The other four results ran against our hypotheses: baseline risk was a strong determinant of the difference to be detected, the effect of age was opposite to our expectations, and neither the presence of side-effects in the experimental treatment nor outcome severity influenced the difference to be detected. Of note, these results appeared substantially different when the participants’ responses were expressed on a multiplicative scale, as odds ratios. The main change was that baseline risk was no longer associated with the difference to be detected (in terms of odds ratios), or only weakly.
We found that participants in this study selected differences to be detected that were smaller than those observed in real life. With only two vignettes, it is unclear whether this disparity is meaningful. However, it is possible that researchers responded in earnest to this survey, whereas in real life they are compelled to choose larger differences that lead to smaller sample sizes. Alternatively, the use of abbreviated hypothetical vignettes may have resulted in bias, and a consideration of full study protocols might have produced different answers.
The most disturbing finding was that the difference to be detected was considerably larger (by 8–10 %) when the baseline risk was high (50 or 60 %) than when the baseline risk was low (10 %). From an ethical standpoint, we find this difficult to justify. An unfavorable event avoided, whether it is death, cancer recurrence, or persistence of uncontrolled pain, should have the same value to patients and to society, regardless of baseline risk. Nonetheless, when we used a multiplicative scale, baseline risk was no longer associated with the choice of the outcome proportion under new treatment. Odds ratios appeared to be independent of baseline risk: the difference was small (2.0 versus 2.3) in the first vignette and nil in the second vignette (0.64 versus 0.63). The most likely hypothesis is that the respondents mentally computed a relative risk or odds ratio, such that the same absolute difference appeared more impressive when the baseline risk was low. For instance, in the low risk group the proportion of controlled pain on standard treatment was 90 %, and the respondents selected on average 95 % for the experimental treatment, an odds ratio of 2.11. This is more impressive than the odds ratio of 1.50 obtained in the high risk group, where the proportions of controlled pain were 50 % and 60 %. Another possible explanation is that the respondents were influenced by the response scales that were proposed, which were more spread out for high baseline risks than for low baseline risks. In other words, the observed difference could be due to ascertainment bias. However, the response scales only reflected reality: risk cannot be reduced by more than 10 points when the baseline risk of unfavorable outcome is 10 %, but can be reduced by much more when baseline risk is 50 %. We believe that the role of baseline risk in choosing the difference to be detected should be addressed by trialists and that an ethically acceptable solution to this issue is needed.
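The odds-ratio arithmetic in the pain example above can be checked with a short sketch. The function names are ours and purely illustrative; the proportions (90 % versus 95 %, and 50 % versus 60 %) are those quoted in the text. The last step, which applies the low-risk odds ratio to a 50 % baseline, is a hypothetical illustration of how a constant odds ratio translates into a larger absolute difference at a higher baseline, not a result from the survey.

```python
def odds(p):
    """Convert a proportion (0 < p < 1) to odds."""
    return p / (1.0 - p)


def odds_ratio(p_experimental, p_control):
    """Odds ratio of the experimental versus the control proportion."""
    return odds(p_experimental) / odds(p_control)


def proportion_from_odds_ratio(p_control, ratio):
    """Experimental proportion implied by an odds ratio at a given control proportion."""
    experimental_odds = ratio * odds(p_control)
    return experimental_odds / (1.0 + experimental_odds)


# Proportions of controlled pain quoted in the text.
or_low_risk = odds_ratio(0.95, 0.90)    # (0.95/0.05) / (0.90/0.10) = 19/9
or_high_risk = odds_ratio(0.60, 0.50)   # (0.60/0.40) / (0.50/0.50) = 1.5

# Hypothetical: the same odds ratio as in the low-risk group, applied at a 50 % baseline.
p_implied = proportion_from_odds_ratio(0.50, or_low_risk)  # 19/28

print(round(or_low_risk, 2), round(or_high_risk, 2), round(p_implied, 3))
```

Under these assumptions, an odds ratio of about 2.11 corresponds to a gain of 5 points at a 90 % baseline but of roughly 18 points at a 50 % baseline, which is consistent with the hypothesis that respondents reasoned on a relative scale.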
Another unexpected finding was that the participants appeared to take the age group into consideration when selecting the difference to be detected, but the effect was opposite to our expectations. The respondents selected larger differences in both vignettes for children and younger adults than for older adults, which suggests that they considered smaller benefits to be less justifiable among younger patients than among the old. This runs against the “fair innings” argument, which would lead to the opposite. One possible explanation is that clinical trials are generally conducted at a very early phase of a new drug's development, when adverse events are not yet well known. In that case, researchers may be more reluctant to include children or younger adults than older adults unless the expected clinical benefit is important.
Outcome severity did not influence the difference to be detected in our survey. This negative result might be due to an insufficient contrast between the outcomes that we tested (mortality vs. cancer recurrence). Indeed, in a study based on published trial reports, the difference to be detected was significantly smaller for mortality than for other outcomes. However, because the latter study was observational, it could not control for other trial characteristics that might cause confounding, unlike this experimental study.
The severity of side effects of the new drug was expected to influence the difference to be detected, but we found only a small difference that was not statistically significant. Balancing potential benefits, harms, and burden of treatment is central to clinicians’ and patients’ decision-making, and estimates of treatment efficacy can only be interpreted contextually, along with potential undesirable outcomes. For example, in life-threatening situations, potential harm is often immaterial, whereas small or uncertain benefit can be outweighed by substantial established harm or burden [20, 21]. A plausible explanation for our findings is that respondents focused on setting a difference to be detected for the primary efficacy outcome of the superiority trial. Although focusing on the primary outcome is necessary for sample size calculation, a more comprehensive determination of harms and benefits could facilitate the translation of research findings into meaningful decisions, as increasingly advocated by the GRADE working group.
Two negative results were in conformity with our hypotheses. The respondents were not influenced by anticipated difficulties in patient recruitment. This is an encouraging result; indeed, methodologists frequently report that some researchers negotiate an achievable sample size for their trial by revising upward the difference to be detected. That this did not occur may reflect either the hypothetical nature of our study, or the fact that no feedback about the required sample size was given during the survey. The other reassuring result was the lack of effect of the cost of the new treatment. Thus researchers appear to have an attitude similar to that of clinicians. Arguably, efficacy trials should not concern themselves with the cost-effectiveness of the new treatment, especially as they are conducted early in the life-cycle of a drug or device, when treatment costs are at their apex.
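To illustrate why revising the difference to be detected upward makes a sample size more achievable, the following minimal sketch uses the common normal-approximation formula for comparing two independent proportions. It is not the calculation used in this study: the two-sided alpha of 0.05, the power of 0.80, the function name, and the 65 % alternative are all assumptions made for illustration; only the 50 % and 60 % proportions are taken from the pain example discussed earlier.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p_control, p_experimental, alpha=0.05, power=0.80):
    """Approximate patients per arm for a two-sided test of two proportions.

    Uses n = (z_{1-alpha/2} + z_{power})^2 * [p1(1-p1) + p2(1-p2)] / (p1 - p2)^2,
    a standard normal approximation (assumed here, not taken from the study).
    """
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)
    variance_sum = (p_control * (1.0 - p_control)
                    + p_experimental * (1.0 - p_experimental))
    difference = p_experimental - p_control
    return math.ceil((z_alpha + z_power) ** 2 * variance_sum / difference ** 2)


# Difference of 10 points (50 % vs. 60 %), as in the high-risk pain example.
n_ten_points = sample_size_per_group(0.50, 0.60)

# Hypothetical upward revision of the difference to 15 points (50 % vs. 65 %).
n_fifteen_points = sample_size_per_group(0.50, 0.65)

print(n_ten_points, n_fifteen_points)
```

Under these assumptions, the required number of patients per arm falls by more than half when the difference to be detected is raised from 10 to 15 points, which shows how strong the incentive to inflate the difference can be when recruitment is difficult.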
Several limitations of our study deserve mention. First, this survey was addressed to researchers who had participated in at least one randomized controlled trial. We supposed that the corresponding authors were involved at least in the planning and conduct of the published trial, but we do not know their actual role in the determination of the difference to be detected when the trial was planned. Nonetheless, their self-perceived expertise in sample size estimation and in the selection of the difference to be detected was fairly high. Second, we obtained a smaller sample size than planned, but the study was sufficiently powered to reveal several relevant associations. The low participation rate also raises a concern about selection bias. However, the comparisons between respondent subgroups should be internally valid, since the allocation to versions of the vignettes was random. Third, as with all vignette-based studies, it is uncertain whether the observed results would apply equally in real life. In particular, the vignettes described two specific clinical areas that did not necessarily correspond to the clinical expertise of the respondents. This may have caused difficulties for some respondents in selecting an appropriate response. Fourth, participants were likely influenced by the proposed response options, which differed for the low risk and high risk versions of the scenarios. However, this reflects reality: a low risk cannot be lowered as much as a high risk. If we had used the same response scale for the two situations, the “high risk” group would have been prevented from considering larger reductions in risk that were plausible in their situation, but that were impossible for the “low risk” group. We acknowledge, however, that our procedure made it impossible to distinguish a true preference for a larger (or smaller) risk difference from ascertainment bias due to the use of a wider (or narrower) response scale.
An open “free-response” format would have avoided this problem. However, in a pre-test, we had compared the open “free-response” format to a list of pre-defined response options, and respondents had more difficulty with the open format. Nonetheless, future studies should explore the influence of the mode of response on the resulting respondent opinions. Finally, we explored only six factors that may influence the choice of a difference to detect in a trial; other factors may be considered, such as the prevalence of the disease (common vs. rare), a range of less severe outcomes (pain relief, quality of life, etc.), or the funding and sponsorship of the study (private vs. public).