In this study of 200 articles published between 2004 and 2009 in 27 leading medical journals, we found that superiority trials were designed to detect larger differences than noninferiority trials were designed to rule out. On average, among trials that used dichotomous outcomes, superiority trials aimed to detect a standardized difference of 0.37, whereas noninferiority trials aimed to rule out a standardized difference of 0.27; among trials that used continuous outcomes, the corresponding values were 0.56 and 0.40. By Cohen's rule of thumb, superiority trials typically attempt to detect a medium effect, while noninferiority trials aim to rule out a small to medium effect. This pattern is consistent with the logical requirement that, for a given clinical issue, a difference between treatments that is important for clinical or public health decisions should be greater than a difference that can be deemed compatible with equality. This suggests that, in general, clinical researchers use reasonable assumptions in determining the sample size of clinical trials.
We observed considerable variability among studies in the differences to be detected or ruled out. This may be justified by the medical context. For example, smaller differences may be relevant in noninferiority trials dealing with cardiovascular diseases than in infectious diseases or oncology, because cardiovascular disease is the main cause of mortality worldwide. But patient availability may also be a factor: the potential for enrolling patients with cardiovascular diseases in clinical trials is greater than for rare diseases, so that researchers can afford to explore smaller clinical differences. The type of outcome may also play a role; for example, mortality is such an important outcome that trials may justifiably aim to demonstrate or exclude smaller differences than would be the case for less crucial events. However, most other factors that we examined were not associated with the differences to be detected or ruled out. Much of the observed variability therefore remains unexplained.
The variability in the difference to be detected or ruled out contrasts with an overwhelming consensus regarding the other parameters that guide sample size determination. Customarily, statistical tests are two-sided, type I error rates are set at 5%, and the desired power is between 80% and 90%. Thus the main reason why sample sizes vary at all is the difference to be detected (or ruled out). Whether a greater standardization of the difference to be detected is desirable is debatable. On the one hand, each research question is unique and deserves specific consideration. For example, a 5% improvement in mortality may not have the same relevance in an elderly population as in children. Any forceful guidelines as to the difference to be detected may promote an unreflective, cookie-cutter approach to study design. On the other hand, the absence of guidelines regarding the difference to be detected opens the door to carelessness or even to manipulation. Instead of reflecting on what would be the smallest important difference (or the largest unimportant difference), investigators may be tempted to engage in the "sample size samba", retrofitting the expected detectable difference to the available number of participants. Future guidelines in this area should perhaps not provide numbers that can be plugged into sample size formulae, but rather list the salient parameters of the decision.
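To illustrate how strongly the chosen difference drives the sample size once the other parameters are fixed by convention, the following minimal sketch uses the standard normal-approximation formula for a two-arm parallel trial with a continuous outcome and equal allocation, n per group = 2(z_{1-α/2} + z_{1-β})² / d², where d is the standardized difference. The function name `n_per_group` and its defaults are assumptions made for illustration; this is not the procedure used by the authors of the reviewed trials, and exact methods (for example, those based on the t distribution) give slightly larger numbers.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80, two_sided=True):
    """Approximate sample size per arm for a two-arm parallel trial.

    d         -- standardized difference to detect (or rule out); must be > 0
    alpha     -- type I error rate
    power     -- desired power (1 - type II error rate)
    two_sided -- if True, alpha is split between both tails

    Hypothetical helper for illustration: normal approximation,
    equal allocation, continuous outcome with common standard deviation.
    """
    if d <= 0:
        raise ValueError("standardized difference must be positive")
    z = NormalDist().inv_cdf  # quantile function of the standard normal
    z_alpha = z(1 - alpha / 2) if two_sided else z(1 - alpha)
    z_beta = z(power)
    return ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)


if __name__ == "__main__":
    # Standardized differences taken from the averages reported above
    # for continuous outcomes (0.56 to detect, 0.40 to rule out).
    for d in (0.56, 0.40):
        for power in (0.80, 0.90):
            n = n_per_group(d, power=power)
            print(f"d={d:.2f}, power={power:.0%}: about {n} per group")
```

Because the required number of participants scales with 1/d², moving from 80% to 90% power changes the sample size far less than a modest change in the difference does; halving d roughly quadruples the sample size. The same arithmetic explains why retrofitting d to the available number of participants is so tempting.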
Several limitations of our study may be noted. Firstly, for about 20% of the trials, we recalculated the expected detectable differences but did not have access to the parameters actually used by the authors. Secondly, we may have lacked power for subgroup analyses. As for the generalizability of our results, we included only two-arm parallel group trials with a single primary outcome. This may not be fully representative of all randomized controlled trials. As we did not perform a systematic review, but rather focused on trials published in high-profile journals, our results may not apply to all trials that are published, and even less to trials that have not been published, or have not been completed. We chose to use the standardized differences together (standardized effect sizes and standardized increments) in order to allow comparability between continuous and dichotomous outcomes. Finally, our study is descriptive, and does not propose a formal procedure to help researchers in the choice of differences to be detected or ruled out.
Reaching consensus on how the difference to be detected or ruled out should be chosen is an important challenge for clinical researchers. Current thinking about how results of clinical trials should be interpreted (which is not the same as deciding what difference should be detected) may help guide this process. The minimal clinically important difference (MCID) is an important starting point. However, this difference typically varies from one patient to another according to baseline risk of event, risk of complications, and individual preference; in contrast, a study planner must settle on a single value. Should he or she select the mean MCID for a given population, or aim for a lower threshold, below the MCID of a large proportion (say, 80%) of the patients? Furthermore, researchers may want to detect smaller differences than would be considered meaningful for individual decision making. If a trial aims to demonstrate the potential of a new class of drugs, even a small effect may be scientifically important; a prevention trial may need to overcome the "prevention paradox", whereby a treatment brings large benefits to the community but offers little to each participating individual. In other trials, researchers may aim to detect a difference that is larger than the MCID for most patients, e.g., when testing a very expensive intervention that would not be deemed cost-effective unless a large clinical benefit was demonstrated. A better consensus regarding the determination of the difference to be detected or ruled out may improve the relevance and utility of clinical trials.