BMC Medical Research Methodology

Background: The interpretation of the results of active-control trials regarding the efficacy and safety of a new drug is important for drug registration and subsequent clinical use. It has been suggested that non-inferiority and equivalence studies are not reported with the same quantitative rigor as superiority studies.


Background
Equivalence and non-inferiority randomized controlled trials are the standard research methodology to demonstrate that a new treatment is equivalent or non-inferior to standard therapy (active-control) in terms of efficacy. While an equivalence trial would use the 2-sided 95% confidence interval of the difference between the 2 trial arms, a non-inferiority trial would usually use the 90% confidence interval of the difference, if a 1-sided 5% rather than 2.5% significance test was considered a priori acceptable [1]. Because it is impossible to prove exact equality, the goal in a non-inferiority trial, in situations where the effect compared to placebo is large, is to rule out differences of clinical importance in the primary outcome between the two treatments.
Issues, difficulties and controversies surrounding non-inferiority trials have long been recognized and extensively reported in many medical settings, including human immunodeficiency virus (HIV) infection [2,3]. Highly active antiretroviral therapy (HAART) delays progression to the acquired immunodeficiency syndrome (AIDS) and increases survival among HIV-infected patients. With efficacy rates of 70% [4] and 75% [5] respectively, the room for antiretroviral agents with better efficacy has become very limited. However, long-term toxicities, pill burden and genotypic resistance call for treatment simplification and alternative new agents. As a consequence, the number of non-inferiority trials has been growing in recent years in the AIDS therapy literature. Some authors use the terms "equivalence" and "non-inferiority" interchangeably, regardless of the hypothesis of the study. Given that the question of interest is not symmetric, we think that these studies are better described as "non-inferiority" trials [6].
Because efficacy in viral suppression remains the major outcome, new drugs should first prove non-inferiority with respect to prolonged control of HIV replication, as the primary endpoint. Second, the new drugs should provide other advantages. Inevitably, there may have been some tension between marketing purposes and scientific issues in the published reports of those trials. In this paper, our objective was to verify the validity of recently published non-inferiority AIDS trials regarding the primary endpoint.

Study selection and methodological standards
Our aim was to consider a cohort of equivalence or non-inferiority trials published in the area of HIV/AIDS after HAART became available. We performed a MEDLINE search using the terms (1) equivalence OR non-inferiority AND random* AND HIV and (2) abacavir AND random*. The searches identified 64 (1) and 136 (2) articles, of which 5 (1) and 5 (2) were selected because they fulfilled the following requirements: randomized controlled clinical trial with 48-week minimum follow-up, initially designed as a non-inferiority or equivalence trial with a prespecified non-inferiority margin, virological primary endpoint and publication in New England Journal of Medicine, JAMA, Lancet, AIDS, Clinical Infectious Diseases, Journal of Infectious Diseases or Journal of Acquired Immune Deficiency Syndromes between 2001 and 2006. Eight additional articles were identified by examining cross-references or through the authors' knowledge of their existence.

Statistical analysis
The 95% confidence interval of the treatment difference, in intent-to-treat (ITT) or on-treatment (OT) analysis, was computed using the normal approximation, based on the available data in the flow chart, results section and figures. Two selected studies (ALIZE and SEAL) predefined a 90% confidence interval of the treatment difference, but their conclusions were not affected by the use of the 95% confidence interval (which was used in this paper for homogeneity). Two other selected studies (BMS-045 and CONTEXT) defined the primary endpoint as the log10 reduction in HIV viral load, using a time-averaged difference method. For homogeneity with the other studies, we considered the more pertinent criterion (closer to clinical practice) of the percentage of patients with undetectable viral load (< 50 copies/ml or < 400 copies/ml) at week 48 (reported as a secondary endpoint).
In case of missing data, the corresponding author of the paper was contacted. When only percentages were available, with several possible numerators due to rounding, we chose the numerator on a worst case basis. If the original data were censored, we used the cumulative incidence of the primary endpoint in each arm.
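As an illustration of the confidence interval computation described above, the following is a minimal sketch assuming the usual unpooled (Wald) normal approximation for a difference of two proportions. The function name `difference_ci`, the counts and the 12% margin are hypothetical and are not taken from any of the reviewed trials; the difference is oriented as control minus new drug, so that the upper bound is the one compared with the margin.

```python
from math import sqrt
from statistics import NormalDist


def difference_ci(x_control, n_control, y_new, n_new, level=0.95):
    """Normal-approximation CI for (control success rate - new-drug success rate)."""
    p_control = x_control / n_control
    p_new = y_new / n_new
    delta = p_control - p_new
    # Unpooled standard error of the difference between two proportions.
    se = sqrt(p_control * (1 - p_control) / n_control
              + p_new * (1 - p_new) / n_new)
    # Two-sided critical value, e.g. about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return delta, delta - z * se, delta + z * se


# Hypothetical counts: 150/200 successes on control, 140/200 on the new drug.
delta, lower, upper = difference_ci(150, 200, 140, 200)
margin = 0.12  # hypothetical pre-specified non-inferiority margin
print(f"difference = {delta:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
print("non-inferiority shown" if upper < margin else "non-inferiority not shown")
```

With these hypothetical counts the upper bound lies above the 12% margin, so non-inferiority would not be demonstrated even though the observed difference is only 5%.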
Non-inferiority between the two arms of a study was tested using the continuity-corrected chi-square of Dunnett and Gent [29], in intent-to-treat or on-treatment analysis, also on a worst case basis. Briefly, $\pi_1$ and $\pi_2$ represent the true proportions of patients with treatment success according to the primary outcome in a random sample of the 2 populations of patients receiving the control treatment and the new drug, respectively. In case of non-inferiority, the expected estimates of $\pi_1$ and $\pi_2$ are given by:
$$\hat{\pi}_1 = \frac{x + y + n_2\Delta}{n_1 + n_2}, \qquad \hat{\pi}_2 = \frac{x + y - n_1\Delta}{n_1 + n_2}$$
where $x$ and $y$ are the observed numbers of successes, $n_1$ and $n_2$ are the sample sizes in the control and the experimental study groups, respectively, and $\Delta$ is the pre-specified margin for non-inferiority. The continuity-corrected chi-square of Dunnett and Gent [29] (reproduced with written permission) for non-inferiority is given by:
$$\chi^2 = \left(\left|x - \hat{x}\right| - \tfrac{1}{2}\right)^2 \left[\frac{1}{\hat{x}} + \frac{1}{m - \hat{x}} + \frac{1}{n_1 - \hat{x}} + \frac{1}{n_2 - m + \hat{x}}\right]$$
where $m = x + y$ and $\hat{x} = \hat{\pi}_1 n_1$.
If $\Delta$ is the maximal acceptable difference in success rates between the 2 treatment arms and $\delta$ is the observed difference between the experimental and control arms, the equivalence hypothesis can be formulated as a pair of one-sided hypotheses:
$$H_{01}: \pi_1 - \pi_2 \geq \Delta \quad \text{versus} \quad H_{11}: \pi_1 - \pi_2 < \Delta \qquad (1)$$
and
$$H_{02}: \pi_1 - \pi_2 \leq -\Delta \quad \text{versus} \quad H_{12}: \pi_1 - \pi_2 > -\Delta \qquad (2)$$
The type I error probability $\alpha$ for rejection of $H_0$ corresponds to $H_{01} \cup H_{02}$. Therefore, the P-value for equivalence is the lower chi-square value associated with $\max(\alpha_1, \alpha_2)$. In a non-inferiority hypothesis, only (1) is necessary. More details have been published elsewhere [30].
To avoid confusion between the P-values of superiority tests and the P-values of non-inferiority tests (both are reported in this paper), the latter have been renamed "D-values". When the normal approximation is a valid hypothesis, there is a general consistency between the two-sided 95% confidence interval approach (non-inferiority at α/2 < 2.5%) and the non-inferiority chi-square (D-value < 5%), as shown in Figure 1. D-values and P-values < 0.05 were considered statistically significant.
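The non-inferiority test described above can be sketched in code. This is only an illustration under stated assumptions: the expected success count in the control arm is taken as n1(x + y + n2Δ)/(n1 + n2), which is the usual statement of the Dunnett and Gent statistic, and the D-value is taken as the upper-tail probability of a chi-square with 1 degree of freedom, which is one reading of the correspondence between a D-value below 5% and the two-sided 95% confidence interval. The function name `dunnett_gent` and the counts are hypothetical.

```python
from math import erfc, sqrt


def dunnett_gent(x, n1, y, n2, margin):
    """Continuity-corrected chi-square for H0: pi1 - pi2 >= margin.

    x of n1 successes in the control arm, y of n2 in the experimental arm;
    margin is a proportion (0.12 for 12%). Returns (chi2, d_value).
    """
    m = x + y
    # Expected control successes when the true difference equals the margin.
    x_hat = n1 * (m + n2 * margin) / (n1 + n2)
    cells = (x_hat, m - x_hat, n1 - x_hat, n2 - m + x_hat)
    if min(cells) <= 0:
        raise ValueError("expected cell counts must be positive")
    # Guard (an addition of this sketch): never let the corrected
    # deviation become negative.
    deviation = max(abs(x - x_hat) - 0.5, 0.0)
    chi2 = deviation ** 2 * sum(1.0 / c for c in cells)
    if x >= x_hat:
        # Observed difference is at or beyond the margin: no evidence
        # in favour of non-inferiority.
        return chi2, 1.0
    # Upper tail of a chi-square with 1 df: P(|Z| > sqrt(chi2)).
    return chi2, erfc(sqrt(chi2 / 2.0))


# Hypothetical counts, not taken from any reviewed trial.
for margin in (0.12, 0.15):
    chi2, d = dunnett_gent(150, 200, 140, 200, margin)
    print(f"margin {margin:.2f}: chi2 = {chi2:.2f}, D-value = {d:.3f}")
```

With these hypothetical counts the D-value is above 5% for a 12% margin and below 5% for a 15% margin, which agrees with comparing the upper bound of the two-sided 95% confidence interval (about 13.7%) with each margin.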

Efficacy of the active control and similar outcome
All of the antiretroviral trials outlined in Table 1 were conducted with active controls which had previously shown efficacy. Sixteen studies used a composite endpoint including virologic failure, clinical progression to AIDS or death, in line with other recent AIDS clinical trials, whereas two studies used the log10 reduction in HIV viral load. The latter, however, reported virologic failure as a secondary endpoint.
Figure 1. Correspondence between the 95% confidence interval of the difference in effect, superiority P-value and non-inferiority D-value. *NS indicates non-significance for superiority or non-inferiority. Case A shows significant superiority of the new drug and, necessarily, non-inferiority. Case B shows significant non-inferiority, but superiority of the new drug is uncertain (inconclusive result). Case C shows both significant inferiority of the new drug (or superiority of the active control) and nonetheless significant non-inferiority. Cases D and E failed to demonstrate non-inferiority (inconclusive result), but E demonstrated significant inferiority (or superiority of the active control).
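The five cases of Figure 1 can be expressed as a small decision rule. The sketch below assumes that the difference is oriented as control minus new drug (a positive value favours the active control) and that non-inferiority is declared when the upper bound of the two-sided 95% confidence interval lies below the margin; the function name `figure1_case` and the example intervals are hypothetical.

```python
def figure1_case(lower, upper, margin):
    """Map a 95% CI of (control - new drug) to cases A-E of Figure 1."""
    non_inferior = upper < margin
    if upper < 0:
        return "A"  # new drug superior, hence also non-inferior
    if non_inferior and lower <= 0:
        return "B"  # non-inferior, superiority of either arm not shown
    if non_inferior:
        return "C"  # new drug inferior, yet within the margin
    if lower > 0:
        return "E"  # non-inferiority not shown, new drug inferior
    return "D"      # non-inferiority not shown, inconclusive


# Hypothetical intervals (as proportions) against a 12% margin.
examples = [(-0.10, -0.02), (-0.04, 0.06), (0.01, 0.08),
            (-0.04, 0.14), (0.02, 0.15)]
for lower, upper in examples:
    print((lower, upper), "->", figure1_case(lower, upper, 0.12))
```

Case C is the one most easily misreported: the interval excludes zero in favour of the control, yet the pre-specified criterion for non-inferiority is still met.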

Rationale for the non-inferiority margin
All studies identified a pre-specified non-inferiority margin (criterion for selection). As shown in Figure 2, however, only 4/18 studies reported a justification for their choice. In the CNAAB3005 study, the choice of the non-inferiority margin was based on discussion with clinical investigators and with the Food and Drug Administration. The margin of 12% was considered the largest clinically acceptable difference. In the 903 study, the authors considered that the margin of 10% was a more stringent and conservative non-inferiority criterion. The authors of the CNA30024 commented that it was the appropriate measure for distinguishing the clinical effectiveness of the 2 study treatments. Finally, the CNA30024 authors' choice relied on HIV clinicians' judgement as well as on discussion with independent reviewers. Other studies did not comment on their choice, which ranged from 10% to 15% (median: 12%). CONTEXT and BMS-045 considered a non-inferiority margin of -0.5 log10 reduction in HIV viral load, without justification. Other issues regarding design are reported in Table 1.

Confidence interval and superiority testing
All but two trials reported results using the confidence interval approach. In the BEST study, the authors predefined their non-inferiority margin for sample size calculation, but the confidence interval was neither defined nor reported. In the NEFA study, although the confidence interval approach was clearly defined in the statistical analysis section of the article, none was provided in the results section. The NEFA, BEST, 2NN, FTC-303, ESS40013 and SHAART studies reported non-significant superiority tests for efficacy to reinforce non-inferiority. The ALIZE and 934 studies switched from the non-inferiority to the superiority hypothesis to declare that the experimental treatment had superior efficacy in the ITT analysis set (for secondary and primary endpoints, respectively), as appropriate.
Table 1 (fragment)
[11] 90 No (a) NVP BID (155) EFV QD (156) (b) ABA BID (149)
BEST [12] 90 No IDV/RITO BID (162) IDV TID (161)
2NN [13] 80 No (a) NVP QD (220) EFV (400) (b) NVP BID (387) (c) NVP+EFV (209)
903 [14] 80 Yes TNF (299) Stavudine (301)
SOLO [15] 85
[...] concluded non-inferiority inappropriately on the basis of their pre-specified margin. In accordance, their non-inferiority D-values were above 5%, as shown in Table 2.

Intent-to-treat and on-treatment analysis on the primary endpoint
BMS-2004 concluded that the two drugs were equally efficacious (suggesting equivalence), while the ITT lower bound of the 95% confidence interval (-11.7%) exceeded 10% in favour of the experimental drug. The main BMS-2004 hypothesis (non-inferiority of the experimental drug at 10%) was demonstrated with a D-value = 0.043 (OT analysis). In our analysis of the ESS40013 study (OT), the upper bound of the 95% confidence interval exceeded the pre-specified non-inferiority margin. Finally, the CONTEXT and BMS-045 studies provided a conclusion in accordance with their non-inferiority margin (data not shown).

Discussion
Trials that assess non-inferiority require rigorous methods for their design, analysis and interpretation. Although the design and the sample size were appropriate for AIDS non-inferiority and equivalence trials, there is room for substantial improvement regarding statistical analysis and interpretation of the results.
Patients with HIV infection would be harmed by deferral of therapy. Consequently, the use of placebo would be unethical [2]. Even if placebo-controlled trials of HAART are not available, a conclusion about efficacy can be reached because the great majority of patients (about 70%) will not be controlled without treatment [4,5]. Because significant inferiority to active-control would be a major problem for patients, the non-inferiority margin for a new drug should be smaller than the difference between active-control and placebo. Because this effect size is so large, only the clinically chosen margin is really an issue, but it is also highly subjective. As a result, this margin varied from the conventional 10% up to 15%. Even the same study group chose different margins in studies 903 (10%) and 934 (13%). A small decrease in margin provides greater assurance of satisfactory effect, but the cost of the study will increase because more patients are required. In the 903 study, the authors could not demonstrate non-inferiority at 10%, but they point out in their discussion that this margin was more stringent than the 12% chosen in CNAAB3005. However, if the authors had chosen the less powerful 12% as the maximal limit for non-inferiority, fewer patients would have been enrolled and the 95% confidence interval would have been wider, possibly beyond the 12% limit. Consequently, data-driven discussion about the non-inferiority margin after completion of the study is pointless.
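The trade-off between margin and study size can be made concrete with a rough calculation. The sketch below assumes the standard normal-approximation sample size formula for a non-inferiority comparison of two proportions, with the same true success rate in both arms, a one-sided α of 2.5% and 80% power; the function name `n_per_arm` and the 75% success rate are hypothetical choices for illustration and do not reproduce the calculations of any reviewed trial.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(p_success, margin, alpha=0.025, power=0.80):
    """Patients per arm, assuming equal true success rates in both arms."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # one-sided type I error
    z_beta = NormalDist().inv_cdf(power)
    # Sum of the binomial variances of the two arms (identical rates).
    variance = 2 * p_success * (1 - p_success)
    return ceil((z_alpha + z_beta) ** 2 * variance / margin ** 2)


# Hypothetical 75% success rate in both arms.
for margin in (0.10, 0.12, 0.15):
    print(f"margin {margin:.0%}: {n_per_arm(0.75, margin)} patients per arm")
```

Under these assumptions, tightening the margin from 15% to 10% more than doubles the required number of patients (roughly 131 versus 295 per arm), because the sample size grows with the inverse square of the margin.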

Figure 2
Quality report assessment of non-inferiority trials adapted from Le Henanff et al. [28].
1. Report the margin and the justification for its choice
2. Appropriate sample size calculation
3. Report both on-treatment and intent-to-treat analysis for the primary endpoint
4. Report 1-sided or 2-sided confidence intervals of treatment difference
5. Conclusion
5.1 Conclude non-inferiority or equivalence only if both ITT and OT analyses permit that, or provide separate conclusions
5.2 Restate the prespecified margin in the abstract
5.3 Make interpretation according to the margin of equivalence or non-inferiority regarding the primary endpoint
5.4 Conclude with standard and appropriate vocabulary in accordance with the aim and the results of the trial (ie, "non-inferior to" or "equivalent to")

Table 2 (last row and footnotes)
13 (NI) -3.5 0.5 < 0.001 (OT) < 0.005 Fulfilled criteria for non-inferiority and proved superior
§ Bold UBCI exceeded the pre-specified non-inferiority margin
* A positive δ corresponds to a higher efficacy of the active-control group, as compared with the experimental group. We chose the OT or ITT analysis on a worst case basis, unless the authors reached separate conclusions for OT and ITT. The numbers may differ from original reports because original reports were stratum-adjusted or used a 90% confidence interval.
** If patients who never started treatment were excluded, P = 0.03
*** Based on secondary endpoints
Abbreviations: δ: observed difference between the % of success observed in the control arm minus the % of success observed in the experimental arm; UBCI: upper bound of the 95% confidence interval; E: equivalence; NI: non-inferiority

Blinding has been described as less efficient in non-inferiority than superiority trials, in particular if the primary endpoint is subjective [31]. For example, a blinded investigator could bias the results toward a preconceived belief in equivalence by assigning similar ratings to the treatment responses of all patients, giving a "bias toward the null". Even when the primary outcome is objective (viral failure, clinical progression or death), however, we believe that blinding is important to protect against bias.
Unblinded investigators may provide other effective interventions to patients in the arm that they believe superior or equivalent, such as more regular appointments or adherence support. In addition, patients or physicians may overinterpret subjective endpoints such as side-effects in open-label studies. Finally, the absence of blinding can distort the comparability of the groups regarding study withdrawal or patients' adherence, since patients participating in a non-inferiority trial may prefer to receive the simpler therapy. Among the studies observed, significantly more patients discontinued the ALIZE study medication in the control arm for personal reasons, as compared with the simpler, once-a-day experimental group (11% versus 2%, P < 0.0004). This may influence the outcome, particularly in an ITT analysis, where withdrawals are considered as failures. Another example comes from the results of the 934 study, where adherence to treatment differed significantly between groups. The conclusion about superior efficacy of the experimental arm in the 934 study may be in part the consequence of greater exposure to the experimental drug. On the other hand, blinding can stand in the way of optimal drug dispensation in non-inferiority and equivalence trials, in particular if the aim is to simplify antiretroviral therapy. For example, if the purpose is to offer simpler dosage or fewer pills as compared to standard therapy, blinding may require similar regimens in both arms, so that any advantages of simplification would be eliminated.
Exclusion of patients after they have been randomized sacrifices the validity of "on-treatment" analysis because it may cause major bias regarding group comparability. For this reason, intention-to-treat analysis has been recognized as the most appropriate and conservative strategy to analyse data from double-blinded trials. However, in non-inferiority and equivalence trials, it is well known that this method lacks robustness because it is not conservative. For this reason, the study interpretation should also be complemented by "on-treatment" analysis [1,8,9]. If there are discrepancies in the results regarding equivalence or non-inferiority, this should be reported and acknowledged. The CNAAB3005 study illustrated how apparent equivalence can be the consequence of a dilutional effect of comparing 2 treatments in the ITT population (527 patients) when only 54% of the patients were on treatment. The same could apply to the ESS40013 study. The use of an "overall" log-rank test of superiority across the 3 arms in the NEFA study may also have blurred the lower efficacy of one study arm, as demonstrated by the "head-to-head" comparison between abacavir and efavirenz.
As in superiority trials, the choice of the primary outcome is also critical in non-inferiority trials. The BMS-045 study illustrated how statistical non-inferiority for viral log difference can be compatible with up to 20.4% of additional virologic failure in the experimental arm, a percentage much larger than the non-inferiority margins usually selected for this outcome in this setting.
Finally, the majority of the studies concluded that the effect of at least one experimental arm, based on their prespecified margin, was similar to the control. However, only half of these studies actually demonstrated non-inferiority. Prespecifying the non-inferiority or equivalence margin is necessary but not sufficient to guarantee methodologic quality and an appropriate conclusion. We confirmed that AIDS trialists had low adherence to non-inferiority and equivalence methodological standards, as is the case in other fields [28]. An antiretroviral drug may not prove non-inferiority in terms of efficacy but nonetheless be a good alternative because the observed difference is small and the new drug demonstrates better tolerance. This interpretation should, however, be left to the reader. To allow a risk-benefit assessment to be made, the report has a particular obligation to be as clear as possible, using standard statistical vocabulary for non-inferiority and equivalence trials, in compliance with the CONSORT statement.

Conclusion
Conclusions about non-inferiority should be drawn on the basis of an appropriate confidence interval using a predefined criterion for non-inferiority, shown in both OT and ITT analyses, in compliance with the non-inferiority and equivalence extension of the CONSORT statement [1]. We describe how failure to do so will lead to erroneous conclusions. A claim of non-inferiority with a non-inferiority chi-square D-value above 5% is as incorrect as a claim of superiority with a traditional null hypothesis testing P-value above 5%. Although the 95% confidence interval approach is sufficient to reject the null hypothesis, the non-inferiority chi-square provides additional information about the actual degree of significance. Of note, the revised CONSORT statement for superiority trials, item 12a [32], recommends reporting the actual P-values for statistical significance rather than the imprecise threshold "P < 0.05". The additional use of the continuity-corrected non-inferiority chi-square may help avoid misleading interpretation by non-statisticians, for whom significance testing may have a higher impact than confidence intervals. The clinical relevance of the primary outcome on which non-inferiority relies should also be assessed. Reviewers and Editors need to reinforce their standards for acceptance of non-inferiority and equivalence randomized controlled trials. Finally, the importance of critical appraisal has implications for both curricular planning in