We assume that we have a two-arm trial in which there are *M* primary outcomes. We are interested in testing the null hypotheses *H*_{j} (*j* = 1, … , *M*) that there is no intervention effect on the nominated outcomes. The test statistics *t*_{j} are used to test the null hypotheses *H*_{j}. Further suppose that there is an overall null hypothesis \( H(M)={\bigcap}_{j=1}^M{H}_j. \) Under this overall hypothesis, the joint test statistic (*t*_{1}, … , *t*_{M}) has an *M*-variate distribution. We denote by *p*_{j} the marginal, unadjusted *p*-values obtained from the appropriate statistical test associated with analysing each outcome separately in a univariate framework. For example, an unpaired Student’s t-test may be used when analysing continuous outcomes, or a Chi-squared test when analysing binary outcomes, to investigate the intervention effect. To control the family-wise error rate (FWER), a correction method is then applied to the unadjusted *p*-values (*p*_{j}). We compare the following commonly used adjustment methods in this paper: Šidák, Bonferroni, Holm, Hochberg and Hommel. In addition, we consider the Dubey/Armitage-Parmar (D/AP) adjustment and the Stepdown minP resampling procedure, which take account of the pairwise correlation between the outcomes.

The method proposed by Šidák is defined as \( {p}_j^{\text{Šid}}=1-{\left(1-{p}_j\right)}^M \). Equivalently, the significance level could be adjusted to \( {\alpha}^{\text{Šid}}=1-{\left(1-\alpha \right)}^{1/M} \), where *α* is the unadjusted significance level. Under the assumption that the outcomes are independent, the adjustment can be derived as

$$ \begin{array}{c} P\left(\text{no Type I error on } 1 \text{ test}\right)=1-{\alpha}^{\text{Šid}},\\ \Rightarrow P\left(\text{no Type I error on } M \text{ tests}\right)={\left(1-{\alpha}^{\text{Šid}}\right)}^M,\\ \Rightarrow P\left(\text{at least one Type I error on } M \text{ tests}\right)=1-{\left(1-{\alpha}^{\text{Šid}}\right)}^M=\alpha . \end{array} $$
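As a quick illustration of the adjustment, the Šidák formula can be applied either to the *p*-values or to the significance level. The sketch below is in Python for illustration (the simulations in this paper were run in R, and the function names here are ours):

```python
import numpy as np

def sidak_adjust(p, m=None):
    """Sidak-adjusted p-values: p_adj = 1 - (1 - p)^M."""
    p = np.asarray(p, dtype=float)
    m = len(p) if m is None else m
    return np.minimum(1.0, 1.0 - (1.0 - p) ** m)

def sidak_alpha(alpha, m):
    """Per-test significance level that keeps the FWER at alpha
    for M independent tests."""
    return 1.0 - (1.0 - alpha) ** (1.0 / m)
```

By construction, testing each of *M* independent outcomes at `sidak_alpha(alpha, m)` gives a FWER of exactly *α*.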

The Bonferroni method is the most common approach to account for multiplicity due to its simplicity. In this method, the unadjusted *p*-values *p*_{j} are multiplied by the number of primary outcomes, *M*.

The Dubey/Armitage-Parmar (D/AP) adjustment is an ad-hoc method based on the Šidák method, which takes into account the correlation between the outcomes [11]. The adjusted *p*-value is \( {p}_j^{adj}=1-{\left(1-{p}_j\right)}^{g(j)} \), where *g*(*j*) = *M*^{1 − mean ρ(j)} and *mean ρ*(*j*) is the mean correlation between the *j*^{th} outcome and the remaining *M* − 1 outcomes. When using this method in the analysis of multiple outcomes, the mean correlation may be estimated from the data. There has been little theoretical work to assess the performance of this approach [11]. One of the nice properties of the D/AP procedure, which may have contributed to its development, is that when the average of the correlation coefficients is zero, the D/AP adjustment reduces to the Šidák adjustment, and when the average correlation coefficient is one, the D/AP adjusted and unadjusted *p*-values are the same.

The Holm method [8] is a step-down procedure, whereby the unadjusted *p*-values are ordered from smallest *p*_{(1)} to largest *p*_{(M)} and each is adjusted as \( {p}_{(k)}^{Holm}=\left(M-k+1\right)\ {p}_{(k)} \), where *k* = 1, … , *M* is the rank of the corresponding *p*-value. Then, starting with the most significant (smallest) *p*-value, each adjusted *p*-value is compared to the nominal significance level until an adjusted *p*-value *greater* than the significance level is observed, after which the method stops [21].

The Hochberg step-up method [9] is similar to the Holm step-down method but works in the other direction. The unadjusted *p*-values are again ordered from smallest *p*_{(1)} to largest *p*_{(M)} and adjusted as \( {p}_{(k)}^{Hoch}=\left(M-k+1\right)\ {p}_{(k)} \). Starting with the least significant (largest) *p*-value, each adjusted *p*-value is compared to the pre-specified significance level until an adjusted *p*-value *lower* than the significance level is observed, after which the method stops [21]. Contrary to the Šidák-based approaches, this is a semiparametric method, meaning the FWER is only controlled when the joint distribution of the hypothesis test statistics is known, most commonly multivariate normal [22].

The Hommel method [10] is another data-driven stepwise method. For this method, the unadjusted *p*-values are ordered from smallest *p*_{(1)} to largest *p*_{(M)}. Then let *l* be the largest integer for which \( {p}_{\left(M-l+j\right)}>\frac{j\alpha}{l} \) for all *j* = 1, … , *l*. If no such *l* exists, then all outcomes can be deemed statistically significant; otherwise, all outcomes with \( {p}_i\le \frac{\alpha }{l} \) (*i* = 1, … , *M*) may be deemed statistically significant. To control the FWER, the Hommel method requires that the joint distribution of the overall hypothesis test statistic is known.
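The stepwise adjustments above can be sketched in a few lines of Python (an illustration, not the paper's R code; `holm_adjust`, `hochberg_adjust` and `dap_adjust` are our own names, and the running max/min enforce the monotonicity of the adjusted *p*-values implied by the stopping rules):

```python
import numpy as np

def holm_adjust(p):
    """Holm step-down adjusted p-values: (M - k + 1) * p_(k) with a
    running maximum from the smallest p-value upwards."""
    p = np.asarray(p, dtype=float)
    m = len(p)
    order = np.argsort(p)                       # smallest first
    adj = np.empty(m)
    running_max = 0.0
    for rank, idx in enumerate(order):          # rank 0 -> multiplier M
        running_max = max(running_max, (m - rank) * p[idx])
        adj[idx] = min(1.0, running_max)
    return adj

def hochberg_adjust(p):
    """Hochberg step-up adjusted p-values: same multipliers as Holm,
    but with a running minimum from the largest p-value downwards."""
    p = np.asarray(p, dtype=float)
    order = np.argsort(p)[::-1]                 # largest first
    adj = np.empty(len(p))
    running_min = 1.0
    for step, idx in enumerate(order):          # step 0 -> multiplier 1
        running_min = min(running_min, (step + 1) * p[idx])
        adj[idx] = running_min
    return adj

def dap_adjust(p, mean_rho):
    """Dubey/Armitage-Parmar adjustment: 1 - (1 - p)^g with
    g = M^(1 - mean_rho), the mean pairwise correlation per outcome."""
    p = np.asarray(p, dtype=float)
    g = len(p) ** (1.0 - np.asarray(mean_rho, dtype=float))
    return 1.0 - (1.0 - p) ** g
```

With mean correlation zero, `dap_adjust` coincides with the Šidák adjustment, and with mean correlation one it returns the unadjusted *p*-values, matching the property noted above.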

Another step-down method to adjust *p*-values is the ‘Stepdown minP’ procedure [23, 24]. Unlike the previous methods, it does not make any assumptions regarding the distribution of the joint test statistic. Instead, it approximates the true joint distribution using a resampling approach. This method takes into account the correlation structure between the outcomes and therefore may yield more powerful tests than the other adjustment methods [25]. The Stepdown minP adjusted *p*-values are calculated as follows:

1. calculate the observed test statistics using the observed data set;
2. resample the data with replacement within each intervention group to obtain bootstrap resamples, compute the resampled test statistics for each resampled data set, and construct the reference distribution from the centred and/or scaled resampled test statistics;
3. calculate the critical value of a level *α* test based on the upper *α* percentile of the reference distribution, or obtain the raw *p*-values by computing the proportion of bootstrapped test statistics that are as extreme or more extreme than the observed test statistic [26].

With the unadjusted *p*-values ordered from smallest to largest, the Stepdown minP adjusted *p*-value for the *j*^{th} outcome is defined as [24, 26] \( {p}_j^{minP}={\max}_{k=1,\dots ,j}\Pr \left(\left.{\min}_{l=k,\dots ,M}\ {P}_l\le {p}_k\ \right|\ H(M)\right), \) where *p*_{k} is the unadjusted *p*-value for the *k*^{th} outcome, *P*_{l} is the resampled *p*-value for the *l*^{th} outcome (*l* = *k*, … , *M*), and *H*(*M*) is the overall null hypothesis.
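The resampling steps above can be sketched as follows. This is an illustrative Python implementation of the Westfall-Young-style minP idea under bootstrap resampling, not the exact code used in this paper (which used the R package NPC); it imposes the overall null by centring each outcome within each arm, and the function names are ours:

```python
import numpy as np

def t_stats(y_treat, y_ctrl):
    """Welch t statistics for each of the M outcome columns."""
    n1, n0 = len(y_treat), len(y_ctrl)
    num = y_treat.mean(axis=0) - y_ctrl.mean(axis=0)
    den = np.sqrt(y_treat.var(axis=0, ddof=1) / n1 +
                  y_ctrl.var(axis=0, ddof=1) / n0)
    return num / den

def stepdown_minp(y_treat, y_ctrl, n_boot=2000, seed=1):
    rng = np.random.default_rng(seed)
    m = y_treat.shape[1]
    t_obs = np.abs(t_stats(y_treat, y_ctrl))

    # Step 2: impose the overall null by centring each outcome within
    # each arm, then bootstrap within arms and recompute the statistics.
    c_treat = y_treat - y_treat.mean(axis=0)
    c_ctrl = y_ctrl - y_ctrl.mean(axis=0)
    t_boot = np.empty((n_boot, m))
    for b in range(n_boot):
        bt = c_treat[rng.integers(0, len(c_treat), len(c_treat))]
        bc = c_ctrl[rng.integers(0, len(c_ctrl), len(c_ctrl))]
        t_boot[b] = np.abs(t_stats(bt, bc))

    # Step 3: raw p-values, plus per-replicate p-values from the
    # empirical rank of each bootstrap statistic within its outcome.
    p_raw = (t_boot >= t_obs).mean(axis=0)
    ranks = t_boot.argsort(axis=0).argsort(axis=0)
    p_boot = 1.0 - ranks / n_boot

    # Step-down: for the k-th smallest raw p-value, estimate
    # Pr( min_{l>=k} P_l <= p_(k) | H(M) ), enforcing monotonicity.
    order = np.argsort(p_raw)
    p_adj = np.empty(m)
    prev = 0.0
    for k in range(m):
        q = p_boot[:, order[k:]].min(axis=1)
        prev = max(prev, (q <= p_raw[order[k]]).mean())
        p_adj[order[k]] = prev
    return p_raw, p_adj
```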

Although resampling-based methods have previously been recommended for clinical trials with multiple outcomes, they are not widely used in practice [25]. The Stepdown minP procedure has been shown to perform well compared to other resampling procedures [26] and was therefore investigated in this paper.

We perform a simulation study to evaluate the validity of these methods for handling potentially correlated multiple primary outcomes in the analysis and sample size calculation of RCTs. We focus on two, three and four outcomes because a review of trials with multiple primary outcomes in the psychiatry and neurology fields found that the majority of trials had considered two primary outcomes [4]. Additionally, it has been recommended that a trial should have no more than four primary outcomes [27]. We estimate the family-wise error rate (FWER), the disjunctive power to detect at least one intervention effect, and the marginal power to detect an intervention effect on a nominated outcome in a variety of scenarios.

### Simulation study

We used the following model to simulate values for two continuous outcomes *Y*_{i} = (*Y*_{i, 1}, *Y*_{i, 2}),

$$ {\boldsymbol{Y}}_{\boldsymbol{i}}={\boldsymbol{\beta}}_{\mathbf{0}}+{\boldsymbol{\beta}}_1{x}_i+{\boldsymbol{\epsilon}}_{\boldsymbol{i}} \kern3em (2) $$

where *x*_{i} indicates whether participant *i* received the intervention or control, *β*_{1} = (*β*_{11}, *β*_{12})^{T} is the vector of intervention effects for each outcome, and *ϵ*_{i} are errors which are realisations of a multivariate normal distribution \( {\boldsymbol{\epsilon}}_{\boldsymbol{i}}={\left({\epsilon}_{i,1},{\epsilon}_{i,2}\ \right)}^T\sim N\left(\left(\genfrac{}{}{0pt}{}{0}{0}\right),\left(\begin{array}{cc}1& \rho \\ {}\rho & 1\end{array}\right)\ \right), \) with *ρ* ∈ {0.0, 0.2, 0.4, 0.6, 0.8}. The model was also extended to simulate three and four continuous outcomes. When simulating three and four outcomes we specified compound symmetry, meaning that the correlation between any pair of outcomes is the same. We explored both uniform and varying intervention effect sizes across outcomes. For the uniform intervention effect sizes, we specified an effect size of 0.35 for all outcomes, that is *β*_{1} = (0.35, 0.35)^{T}, *β*_{1} = (0.35, 0.35, 0.35)^{T} or *β*_{1} = (0.35, 0.35, 0.35, 0.35)^{T} for the two-, three- and four-outcome scenarios, respectively. This represents a medium effect size, which reflects the anticipated effect size in many RCTs [28]. For the varying intervention effect sizes, we specified *β*_{1} = (0.2, 0.4)^{T}, *β*_{1} = (0.2, 0.3, 0.4)^{T} or *β*_{1} = (0.1, 0.2, 0.3, 0.4)^{T} for the two-, three- and four-outcome scenarios, respectively. We also explored the effect of skewed data by transforming the outcome data with uniform intervention effect sizes to have a gamma distribution with shape parameter 2 and scale parameter 2. The gamma distribution is often used to model healthcare costs in clinical trials [29, 30] and may also be appropriate for skewed clinical outcomes.
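A single trial under model (2) can be simulated along these lines (a Python sketch paralleling the R simulations; `simulate_outcomes` is our own name, and the intercept *β*_{0} is taken as zero without loss of generality):

```python
import numpy as np

def simulate_outcomes(n_per_arm, effects, rho, seed=None):
    """Simulate one two-arm trial under model (2): M continuous
    outcomes with compound-symmetry correlated standard-normal errors."""
    rng = np.random.default_rng(seed)
    m = len(effects)
    sigma = np.full((m, m), rho)          # compound symmetry:
    np.fill_diagonal(sigma, 1.0)          # same rho for every pair
    x = np.repeat([0, 1], n_per_arm)      # control / intervention arm
    eps = rng.multivariate_normal(np.zeros(m), sigma, size=2 * n_per_arm)
    y = np.asarray(effects) * x[:, None] + eps   # beta_0 assumed 0
    return x, y
```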

We set the sample size to 260 participants, with an equal number assigned to each arm. This provides 80% marginal power to detect a clinically important effect size of 0.35 for each outcome using an unpaired Student’s t-test with an unadjusted significance level of 0.05. We introduced missing data under the assumption that the data were missing completely at random (MCAR). When simulating two outcomes, 15 and 25% of the observations in outcomes 1 and 2 are missing, respectively, so that on average approximately 4% of participants have both outcomes missing. When simulating three outcomes, 15% of the observations are missing in one outcome and 25% are missing in each of the other two. When simulating four outcomes, 15% of the observations are missing in two outcomes and 25% in the other two. These proportions of missingness are often observed in RCTs [31,32,33,34].
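The MCAR mechanism can be sketched as follows (a Python illustration with an assumed helper name; because missingness is drawn independently per outcome, rates of 15% and 25% leave roughly 0.15 × 0.25 ≈ 3.75% of participants missing both outcomes, consistent with the "approximately 4%" above):

```python
import numpy as np

def impose_mcar(y, miss_props, rng):
    """Set each outcome column missing completely at random, with
    an independent missingness draw per outcome."""
    y = y.astype(float).copy()
    for j, prop in enumerate(miss_props):
        mask = rng.random(len(y)) < prop
        y[mask, j] = np.nan
    return y
```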

We estimated the FWER and the disjunctive power by specifying no intervention effect (*β*_{1j} = 0) and an intervention effect (*β*_{1j} ≠ 0), respectively, and calculating the proportion of simulations in which an intervention effect was observed on at least one of the outcomes. The marginal power was estimated similarly, but as the proportion of simulations in which an intervention effect was observed on the nominated outcome. For each scenario we ran 10,000 simulations. The simulations were run using R version 3.4.2. The Stepdown minP procedure was implemented using the NPC package.
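The FWER estimation loop can be sketched as follows (a Python illustration of the Monte Carlo scheme, using Bonferroni as the example adjustment and a reduced number of replications; the paper's simulations used R with 10,000 replications per scenario):

```python
import numpy as np
from scipy import stats

def estimate_fwer(n_sims=2000, n_per_arm=130, m=2, rho=0.4,
                  alpha=0.05, seed=42):
    """Monte Carlo estimate of the FWER for Bonferroni-adjusted t-tests
    under the overall null (no intervention effect on any outcome)."""
    rng = np.random.default_rng(seed)
    sigma = np.full((m, m), rho)
    np.fill_diagonal(sigma, 1.0)
    rejections = 0
    for _ in range(n_sims):
        yc = rng.multivariate_normal(np.zeros(m), sigma, n_per_arm)
        yt = rng.multivariate_normal(np.zeros(m), sigma, n_per_arm)
        p = stats.ttest_ind(yt, yc).pvalue        # one p-value per outcome
        if (p * m).min() <= alpha:                # Bonferroni adjustment
            rejections += 1                       # >= 1 false rejection
    return rejections / n_sims
```

Swapping the adjustment line for any of the other methods, or setting a non-zero mean in one arm, gives the corresponding FWER, disjunctive power, or marginal power estimate.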

We calculated the sample size based on disjunctive power using the R package “mpe” [35], and the sample size based on marginal power using the R package “samplesize” [36]. The statistical methodology used for the sample size calculations in these packages is described in Additional file 1.
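As a rough cross-check of the stated design (not the method of the “mpe” or “samplesize” packages), the standard normal-approximation formula for the marginal power of an unpaired comparison of means can be sketched in Python; for an effect size of 0.35, α = 0.05 and 80% power it gives about 129 participants per arm, consistent with the total of 260 used above:

```python
import math
from scipy.stats import norm

def n_per_arm_marginal(delta, alpha, power):
    """Normal-approximation sample size per arm for an unpaired
    two-sample comparison of means with standardised effect size delta:
    n = 2 * (z_{1-alpha/2} + z_{power})^2 / delta^2."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return math.ceil(2 * (z / delta) ** 2)
```

Dividing *α* by *M* before calling the function gives the Bonferroni-adjusted marginal sample size.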