What differences are detected by superiority trials or ruled out by noninferiority trials? A cross-sectional study of a random sample of 200 two-arm parallel group randomized clinical trials

Background The smallest difference to be detected in superiority trials or the largest difference to be ruled out in noninferiority trials is a key determinant of sample size, but little guidance exists to help researchers in their choice. The objectives were to examine the distribution of differences that researchers aim to detect in clinical trials and to verify that those differences are smaller in noninferiority compared to superiority trials. Methods Cross-sectional study based on a random sample of two hundred two-arm, parallel group superiority (100) and noninferiority (100) randomized clinical trials published between 2004 and 2009 in 27 leading medical journals. The main outcome measure was the smallest difference in favor of the new treatment to be detected (superiority trials) or largest unfavorable difference to be ruled out (noninferiority trials) used for sample size computation, expressed as standardized difference in proportions, or standardized difference in means. Student t test and analysis of variance were used. Results The differences to be detected or ruled out varied considerably from one study to the next; e.g., for superiority trials, the standardized difference in means ranged from 0.007 to 0.87, and the standardized difference in proportions from 0.04 to 1.56. On average, superiority trials were designed to detect larger differences than noninferiority trials (standardized difference in proportions: mean 0.37 versus 0.27, P = 0.001; standardized difference in means: 0.56 versus 0.40, P = 0.006). Standardized differences were lower for mortality than for other outcomes, and lower in cardiovascular trials than in other research areas. Conclusions Superiority trials are designed to detect larger differences than noninferiority trials are designed to rule out. The variability between studies is considerable and is partly explained by the type of outcome and the medical context. 
A more explicit and rational approach to choosing the difference to be detected or to be ruled out in clinical trials may be desirable.


Background
A key step in planning a randomized clinical trial is the determination of the smallest difference in the primary outcome that should be detected between the study arms. Together with the type I error, the power and the variance of the primary outcome, this difference determines the sample size of the study. In principle, this determination should be made a priori by the researchers [1] based on scientific and public health arguments; only then should the sample size be calculated. In reality, researchers often start with a small difference to be detected, which they subsequently revise until an achievable sample size is obtained [2,3]. The danger of this practice is that trials may end up sufficiently powered to detect convenient differences, but underpowered to detect clinically meaningful differences. On the other hand, there is currently no objective, scientific method for determining the smallest difference that is important for science and society. Given this uncertainty, a description of current practice can provide a useful framework for judging which differences are large or small.
Stating the smallest difference to be detected is necessary in planning a superiority trial. Planning an equivalence or noninferiority trial (henceforth called a noninferiority trial) requires a different input: the largest unfavorable difference that is still compatible with noninferiority. Logically, a noninferiority margin should be smaller than a superiority margin, other things being equal, since the former is compatible with equality between treatments, while the latter excludes equality. A noninferiority margin that is too wide may lead to the conclusion that a new treatment is equivalent to the standard treatment when in fact it is inferior. Of note, an alternative approach exists for testing noninferiority in the main outcome at the same time as superiority in a secondary outcome, such as safety or convenience [4].
There is no consensus regarding the procedures for choosing a difference to be detected or ruled out. In contrast, the type I error rate is often set to 5% and study power to 80% or 90%. Thus the superiority or noninferiority margin is the main cause of variability in sample size between trials. In turn, sample size strongly influences the feasibility and cost of a trial. Surprisingly little is currently known about the distribution and determinants of differences to be detected or ruled out in trials.
Our main goal was to verify whether the superiority margins used in planning clinical trials are indeed larger on average than the noninferiority margins used in planning noninferiority trials, and to examine the between-study variability of these parameters. Our second goal was to identify study-related factors that may influence the choice of the difference, such as the nature of the study outcome, the clinical field, and the type of treatments compared.
In this article, we report on a survey of the clinically important differences used by researchers to estimate their sample sizes in 100 randomly selected superiority trials and 100 noninferiority trials published in 27 leading medical journals between January 2004 and March 2009. In the "Methods" section, we describe our search strategy. Our data show that superiority trials are designed to detect larger clinical differences than noninferiority trials are designed to rule out; the observed variability is considerable and may be partly explained by the medical context and the type of outcome.

Study design and sample
We conducted a cross-sectional study based on 200 randomized clinical trials published between January 1st, 2004 and March 1st, 2009. We aimed to assess the a priori clinical differences among high-quality, clinically relevant studies in internal medicine, general practice and mental health [4]. We therefore performed a search in Medline (PubMed) using the following terms [4,5]: "randomized controlled trial" OR ("randomized" AND "controlled" AND "trial"), within publication types, subject headings or text words, and restricted our search to trials published in 27 medical journals with high impact factors (Additional file 1). In a second step, in order to retrieve a sufficient number of noninferiority trials, we added the keywords "non-inferiority" OR "equivalence" to the search. We recorded all retrieved citations in an SPSS database, then screened the full texts of the articles for eligibility. Among the retrieved studies, we randomly selected 200 trials.

Eligibility criteria
We included only two-arm, parallel group trials with a single primary outcome that could be either a continuous or a binary variable. We excluded crossover trials and cluster-randomized trials, which yield correlated data and require specific calculations to estimate the sample size. We also excluded nonrandomized trials mislabeled as randomized trials, ancillary analyses of previously published studies, and studies that used time-to-event variables as outcomes. Two readers (AGA, KB) verified the inclusion and exclusion criteria and, for eligible papers, extracted the relevant data. Uncertainties were discussed, and discrepancies in the assessment of relevant articles were resolved by consensus.

Data extraction
From the full published report, we recorded the journal, the year of publication, the medical specialty, the interventions compared (pharmacological, vaccine, surgical, medical devices and strategies, including diagnostic, medical care management, and rehabilitative interventions), the type of primary outcome (dichotomous, continuous) and whether it was related to mortality, whether the trial was multicenter, whether a research methodologist (statistician or epidemiologist) was involved (retrieved from the authors' affiliations and the acknowledgement section), and whether the trial was supported by industry, by another source of funding (institutional or private grant), or lacked any financial support. We also classified trials into four subgroups based on the targeted study population: children below 18 years, mother and child, adults, or elderly. Finally, we classified trials by major medical context (cardiovascular, infectious diseases or oncology versus other medical specialties).
Since all articles were published after the 2001 revision of the CONSORT statement [6][7][8], the details of the a priori sample size computation should always be reported. We collected the parameters used for this calculation: type I error, one- or two-tailed test, type II error, and estimated sample size. For dichotomous outcomes, we retrieved the event proportions in the control and active groups (P1, P2) or the treatment effect of interest (difference in proportions). For continuous outcomes, we retrieved the difference in means (m2 - m1) and the standard deviation, or the effect size.

Outcomes
When outcomes were expressed as proportions, we calculated a standardized difference in proportions as (P2 - P1)/√(P(1 - P)), where P is the weighted mean of P1 and P2. This index is analogous to that used for contrasting two means [9]. When the difference in proportions was available but the proportions in the treatment arms (P1, P2) were lacking, P1 and P2 were recalculated using the formulae for sample size calculation adapted for a χ² test or Fisher's exact test, or those adapted for bioequivalence trials [10,11].
For continuous variables, we calculated the standardized difference in means as the difference in means divided by the pooled standard deviation: (m2 - m1)/SD. When the standard deviation was not given in the methods section, we used the standard deviation reported in the results section and verified the sample size using the formulae for Student's t test or for bioequivalence trials [10,11]. In analyses that pooled the two types of studies, we used the standardized difference in outcomes, regardless of the type of outcome.
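The two indices described above can be sketched in a few lines of code. This is an illustrative implementation of the formulas as stated, not the authors' analysis script; the function names are ours, and equal group sizes are assumed by default when weighting P1 and P2.

```python
from math import sqrt

def standardized_diff_proportions(p1, p2, w1=0.5, w2=0.5):
    """(P2 - P1) / sqrt(P * (1 - P)), where P is the weighted mean
    of P1 and P2 (weighted here by the relative group sizes w1, w2)."""
    p_bar = w1 * p1 + w2 * p2
    return (p2 - p1) / sqrt(p_bar * (1 - p_bar))

def standardized_diff_means(m1, m2, pooled_sd):
    """(m2 - m1) / SD: the difference in means divided by the
    pooled standard deviation (a Cohen's d-style effect size)."""
    return (m2 - m1) / pooled_sd

# For instance, event proportions of 0.20 versus 0.30 in equally
# sized arms give a standardized difference of about 0.23.
```

Expressing both dichotomous and continuous outcomes on this common standardized scale is what allows the pooled comparisons reported in the Results.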

Independent variables
The main predictor was the type of trial: superiority versus noninferiority. Other independent variables were: mortality as the (single or composite) outcome or not; medical context (cardiovascular, infectious diseases, oncology, or other medical specialties); age group of the study population (neonates and children below 18 years of age, adults, and elderly); type of intervention (pharmacological versus other); involvement of a statistician/epidemiologist; year of publication (2004-06 versus 2007-09); funding source (industry, institution, versus no funding or not stated); and single-center versus multicenter recruitment of participants. The target sample size (≤200, 200-400, 400-800 or >800 participants, defined by quartiles) was also studied in relation to the standardized differences, even though sample size is a consequence of this difference, not its determinant.

Sample size estimation
We sought to detect a moderate difference between the mean standardized differences of superiority and noninferiority trials, which we defined as half a standard deviation [12]. Detecting this difference would require 84 trials per group with a power of 90% and a two-tailed type I error of 5%. We rounded the sample size up to 100 superiority and 100 noninferiority trials.
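As a sanity check, the figure of 84 trials per group can be reproduced with the usual normal-approximation formula for comparing two means, n = 2(z(1-α/2) + z(power))²/δ², with δ = 0.5 standard deviations. This is a sketch of that standard formula under those assumptions, not the authors' actual computation, which is not reported.

```python
from statistics import NormalDist

def n_per_group(delta, alpha=0.05, power=0.90):
    """Approximate per-group sample size for a two-sided, two-sample
    comparison of means, where delta is a standardized difference:
    n = 2 * (z_{1-alpha/2} + z_{power})**2 / delta**2."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    z_beta = z.inv_cdf(power)            # ~1.28 for 90% power
    return 2 * (z_alpha + z_beta) ** 2 / delta ** 2

# Half a standard deviation, 90% power, two-tailed 5% type I error:
# n_per_group(0.5) is approximately 84 per group.
```

Smaller values of delta inflate the required sample size quadratically, which is why the margin is the dominant driver of trial size.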

Statistical analysis
We first compared the characteristics of the included superiority and noninferiority trials using the Mann-Whitney test for continuous variables and the chi-square or Fisher's exact test for categorical variables. We then examined the distributions of the standardized differences in superiority and noninferiority trials and compared their means using the Student t test. We performed subgroup analyses, separately for superiority and noninferiority trials, comparing the standardized differences according to various study characteristics using t tests or one-way analysis of variance. We also tested for interaction between the main effect (superiority/noninferiority type of trial) and each factor; when the interaction was not significant, a single P-value for the factor is presented in the last column of the corresponding table. Finally, we used an analysis of variance to model the standardized differences by several predictors of interest identified in these preliminary analyses (noninferiority versus superiority trial, mortality versus nonmortality outcome, and medical context), and we present the estimated mean standardized difference, its standard error and the associated P-value.

Selection of articles
The initial search yielded 6933 citations. We randomly selected a total of 580 articles in order to reach our goal sample of 200 articles containing information on the difference to be detected or ruled out (Figure 1). Of note, for 135 (40.7%) of the 334 eligible trials, the full report gave no value for the difference to be detected. Additional file 2 lists the 100 superiority and 100 noninferiority trials included.

Trial characteristics
Superiority and noninferiority trials did not differ in their journal of publication (Table 1). A justification for the choice of the treatment difference was found significantly more often in superiority than in noninferiority trials. Noninferiority trials dealt with infectious diseases, examined pharmacologic interventions, and used dichotomous outcomes more often than superiority trials. Superiority trials were mainly conducted in medical contexts other than cardiovascular, infectious diseases or oncology (neuro-psychiatry, n = 17; rheumatology, n = 6; internal or general medicine, n = 4; medical education, n = 3; other, n = 47). Noninferiority trials required larger sample sizes, were more often conducted at multiple centers, and were more often funded by industry than superiority trials.

Differences used to estimate sample size
In 161 articles (80.5%), the standardized difference in proportions or in means could be obtained directly from information provided in the methods section of the article. For 39 trials (19.5%), we imputed the standardized difference in proportions (n = 25) or in means (n = 14) using additional information (such as the sample size or observed variance estimates; see Methods).
Overall, the mean standardized difference was 0.45 for superiority trials and 0.29 for noninferiority trials (Table 2). This difference was seen for standardized differences in both means (0.56 versus 0.40) and proportions (0.37 versus 0.27). All these differences were statistically significant. The spread of the standardized differences was wider in superiority trials than in noninferiority trials (Figure 2); this is also apparent from the standard deviations of the standardized differences (Table 2).
Trial characteristics associated with clinical differences used to estimate sample size
Subgroup comparisons of the standardized differences were conducted separately for superiority and noninferiority trials (Table 3). As expected, smaller detectable differences were associated with larger sample sizes and with multicenter recruitment. Mean detectable differences were similar across years of publication, the nature of the intervention (pharmacologic or otherwise), patient age groups, statistician involvement, and funding source, in both superiority and noninferiority trials. However, studies that used mortality as their primary outcome used smaller differences than other studies, again in both types of trials. Trials studying cardiovascular diseases used lower standardized differences than infectious disease or other trials, particularly among noninferiority trials. We observed the same gradient in superiority trials, but owing to smaller subgroup sizes it did not reach statistical significance.
Analysis of variance confirmed that the standardized difference was significantly smaller in noninferiority trials than in superiority trials, smaller for mortality outcomes than for other outcomes, and smaller in cardiovascular trials than in other medical contexts.

Discussion
In this study of 200 articles published between 2004 and 2009 in 27 leading medical journals, we found that superiority trials were designed to detect larger differences than noninferiority trials were designed to rule out. On average, trials that used dichotomous outcomes aimed to detect a mean standardized difference of 0.37 but to rule out a standardized difference of 0.27; trials that used continuous outcomes aimed to detect a standardized difference of 0.56 but to rule out a standardized difference of 0.40. Using Cohen's rule of thumb [12], superiority trials typically attempt to detect a medium effect, while noninferiority trials aim to rule out a small to medium effect. This pattern is consistent with the logical requirement that, for a given clinical issue, a difference between treatments that is important for clinical or public health decisions should be greater than a difference that can be deemed compatible with equality. This suggests that, in general, clinical researchers use reasonable assumptions in determining the sample size of clinical trials.
We observed considerable variability in the differences to be detected or ruled out among studies. This may be justified by the medical context: e.g., smaller differences may be relevant in noninferiority trials dealing with cardiovascular diseases than in infectious diseases or oncology, because cardiovascular disease is the main cause of mortality worldwide [13]. But patient availability may also be a factor: the potential for including patients with cardiovascular diseases in clinical trials is greater than for rare diseases, so researchers can afford to explore smaller clinical differences. The type of outcome may also play a role; e.g., mortality is such an important outcome that trials may justifiably aim to demonstrate or exclude smaller differences than would be the case for less crucial events. However, most of the other factors that we examined were not associated with the differences to be detected or ruled out.
Much of the observed variability remains therefore unexplained [14].
The variability in the difference to be detected or ruled out contrasts with an overwhelming consensus regarding the other parameters that guide sample size determination. Customarily, statistical tests are two-sided, type I error rates are set at 5%, and the desired power is between 80% and 90%. Thus the main reason sample sizes vary at all is the difference to be detected (or ruled out). Whether greater standardization of the difference to be detected is desirable is debatable. On the one hand, each research question is unique and deserves specific consideration; e.g., a 5% improvement in mortality may not have the same relevance in an elderly population as in children. Forceful guidelines on the difference to be detected may promote an unreflective, cookie-cutter approach to study design. On the other hand, the absence of guidelines regarding the difference to be detected opens the door to carelessness or even to manipulation. Instead of reflecting on what would be the smallest important difference (or the largest unimportant difference), investigators may be tempted to engage in the "sample size samba" [2], retrofitting the expected detectable difference to the available number of participants. Future guidelines in this area should perhaps not provide numbers that can be plugged into sample size formulae, but rather list the salient parameters of the decision.
Several limitations of our study may be noted. Firstly, for about 20% of the trials, we recalculated the expected detectable differences without access to the parameters actually used by the authors. Secondly, we may have lacked power for subgroup analyses. As for the generalizability of our results, we included only two-arm parallel group trials with a single primary outcome, which may not be fully representative of all randomized controlled trials.
As we did not perform a systematic review, but rather focused on trials published in high-profile journals, our results may not apply to all published trials, and even less to trials that have not been published or have not been completed. We chose to use standardized differences (standardized effect sizes and standardized increments) in order to allow comparability between continuous and dichotomous outcomes [15]. Finally, our study is descriptive and does not propose a formal procedure to help researchers choose the differences to be detected or ruled out.
[Table 3: Factors explaining the mean (± standard deviation) of the standardized differences in superiority and noninferiority trials, with a single P-value after adjustment for the type of trial. *P-value for each factor after adjustment for the type of trial. **There was a significant interaction between the superiority/noninferiority type of the trial and sample size.]
Reaching consensus on how the difference to be detected or ruled out should be chosen is an important challenge for clinical researchers. Current thinking about how the results of clinical trials should be interpreted
[1] (which is not the same as deciding what difference should be detected) may help guide this process. The minimal clinically important difference (MCID) is an important starting point. However, this difference typically varies from one patient to another according to the baseline risk of the event, the risk of complications, and individual preference [16]; in contrast, a study planner must settle on a single value. Should he or she select the mean MCID for a given population, or aim for a lower threshold, below the MCID of a large proportion (say, 80%) of the patients? Furthermore, researchers may want to detect smaller differences than would be considered meaningful for individual decision making. If a trial aims to demonstrate the potential of a new class of drugs, even a small effect may be scientifically important; a prevention trial may need to overcome the "prevention paradox", whereby a treatment brings large benefits to the community but offers little to each participating individual [17]. In other trials, researchers may aim to detect a difference that is larger than the MCID for most patients, e.g., when testing a very expensive intervention that would not be deemed cost-effective unless a large clinical benefit was demonstrated. A better consensus regarding the determination of the difference to be detected or ruled out may improve the relevance and utility of clinical trials.