A P-value model for theoretical power analysis and its applications in multiple testing procedures
BMC Medical Research Methodology volume 16, Article number: 135 (2016)
Abstract
Background
Power analysis is a critical aspect of the design of experiments to detect an effect of a given size. When multiple hypotheses are tested simultaneously, multiplicity adjustments to p-values should be taken into account in power analysis. There are a limited number of studies on power analysis in multiple testing procedures. For some methods, the theoretical analysis is difficult and extensive numerical simulations are often needed, while other methods oversimplify the information under the alternative hypothesis. To this end, this paper aims to develop a new statistical model for power analysis in multiple testing procedures.
Methods
We propose a step-function-based p-value model under the alternative hypothesis, which is simple enough to perform power analysis without simulations, yet not so simple that it loses the information from the alternative hypothesis. The first step is to transform distributions of different test statistics (e.g., t, chi-square or F) to distributions of the corresponding p-values. We then use a step function to approximate each p-value distribution by matching the mean and variance. Lastly, the step-function-based p-value model can be used for theoretical power analysis.
Results
The proposed model is applied to problems in multiple testing procedures. We first show how the most powerful critical constants can be chosen using the step-function-based p-value model. Our model is then applied to the field of multiple testing procedures to explain the assumption of monotonicity of the critical constants. Lastly, we apply our model to a behavioral weight loss and maintenance study to select the optimal critical constants.
Conclusions
The proposed model is easy to implement and preserves the information from the alternative hypothesis.
Background
Power analysis is a key technique in experimental design for detecting an effect of a given size. Traditional power calculation usually assumes a single hypothesis test, but it is quite common for researchers to test several hypotheses simultaneously. Clinical trials often require two or more hypotheses to be tested, and studies that compare treatments using multiple outcome measures are frequent in medical research [8]. The development of high-throughput biology has led to a dramatic increase in the number of hypothesis tests in genomics [20]. However, there are a limited number of studies on power analysis in multiple testing procedures. For scientific studies with multiple hypotheses, multiplicity adjustments to p-values should be taken into account in power analysis in order to correctly control false positives. Consequences of multiplicity adjustments are a loss of power [21] and a change in sample size requirements [16].
The p-value is a tail probability computed under the assumption that the null hypothesis is true. Under the null hypothesis, the p-value is a uniformly distributed random variable between 0 and 1. If the null hypothesis is false, the p-value’s distribution depends on the alternative hypothesis, and it usually satisfies the inequality
\[ \Pr(P \leq u) \geq u \quad \text{for all } u \in [0,1], \]
i.e., the random variable P is less than a standard uniform random variable in the stochastic order, where the random variable P denotes the p-value when the alternative hypothesis is true.
In order to perform a p-value-based power analysis, certain distribution models are needed to describe the behavior of p-values under the alternative hypothesis. In general there are two approaches. The p-value models based on the original test statistics [17] or on copulas [27] usually have complex expressions, and further calculations or evaluations require integration, so the theoretical analysis is difficult and numerical simulations are often needed. The p-value models based on the Dirac function [10, 24] are oversimplified, which limits their areas of application.
In this paper, we propose step-function-based p-value models under the alternative hypothesis, which are simple enough to perform theoretical power analysis, yet not so simple that they lose the information from the alternative hypothesis. Two applications in multiple testing procedures are shown, and one application to a weight-loss treatment study is given.
Methods
A widely used p-value model assumes that the test statistic follows N(0,1^{2}) under the null hypothesis and N(δ,1^{2}) under the alternative hypothesis. The p-values are calculated based on a one-sided test [17].
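The density function of the normal-distribution-based p-value model is
\[ \psi_{\delta}(p) = \frac{\phi(z_{p} - \delta)}{\phi(z_{p})} = \exp\left(\delta z_{p} - \frac{\delta^{2}}{2}\right), \quad 0 < p < 1, \]
where \(z_{p} = \Phi^{-1}(1-p)\), and φ and Φ are the standard normal density and distribution functions, with mean and variance
\[ E(P) = 1 - \Phi\left(\frac{\delta}{\sqrt{2}}\right), \quad \mathrm{Var}(P) = \int_{0}^{1} p^{2}\,\psi_{\delta}(p)\,dp - \left[1 - \Phi\left(\frac{\delta}{\sqrt{2}}\right)\right]^{2}. \]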
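Alternatively, we propose a step-function-based p-value model under the alternative hypothesis, which has a density function
\[ h(p) = \begin{cases} f, & 0 \leq p \leq \frac{1-g}{f-g}, \\ g, & \frac{1-g}{f-g} < p \leq 1, \end{cases} \]
with mean and variance
\[ E(P) = \frac{g}{2} + \frac{(1-g)^{2}}{2(f-g)}, \quad \mathrm{Var}(P) = \frac{g}{3} + \frac{(1-g)^{3}}{3(f-g)^{2}} - \left[\frac{g}{2} + \frac{(1-g)^{2}}{2(f-g)}\right]^{2}, \]
where 0≤g≤1≤f and g<f. The parameters (f,g) indicate the deviation of the random variable P under the step-function-based p-value model from a standard uniform random variable.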
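The moments of a two-level step density can be checked with a few lines of arithmetic. The sketch below assumes the density has height f on [0, t] and height g on (t, 1], with the breakpoint t = (1 − g)/(f − g) fixed by the requirement that the density integrates to one. The function name `step_model_moments` is illustrative and not part of the paper.

```python
def step_model_moments(f, g):
    """Breakpoint, mean and variance of a two-level step density on [0, 1].

    Assumed form: height f on [0, t] and height g on (t, 1], where
    t = (1 - g) / (f - g) makes the total probability equal to one.
    """
    if not (0.0 <= g <= 1.0 <= f) or f == g:
        raise ValueError("need 0 <= g <= 1 <= f and g < f")
    t = (1.0 - g) / (f - g)
    # First and second moments obtained by integrating p and p^2 piecewise.
    mean = f * t**2 / 2.0 + g * (1.0 - t**2) / 2.0
    second_moment = f * t**3 / 3.0 + g * (1.0 - t**3) / 3.0
    return t, mean, second_moment - mean**2


if __name__ == "__main__":
    print(step_model_moments(4.0, 0.5))
```

Setting g = 0 recovers the single-parameter case, where the density is uniform on [0, 1/f].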
For comparison between the normal-distribution-based p-value model and the step-function-based p-value model, the parameters (f,g) of the step-function-based p-value model were chosen to match the mean and the variance of the normal-distribution-based p-value model with parameter δ, as shown in Table 1. For the simplified step-function-based p-value model with parameter f, only the mean is matched with the normal model.
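In addition, a simplified step-function-based p-value model with a single parameter f∈[1,+∞) is obtained by assuming g=0. The corresponding density function is
\[ h(p) = \begin{cases} f, & 0 \leq p \leq 1/f, \\ 0, & 1/f < p \leq 1, \end{cases} \]
with mean and variance
\[ E(P) = \frac{1}{2f}, \quad \mathrm{Var}(P) = \frac{1}{12f^{2}}. \]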
For the simplified step-function-based p-value model, a larger parameter f corresponds to a larger effect size. As a special case, when f=1, the distribution is uniform on [0,1]. The probability density functions and the cumulative distribution functions of the normal-distribution-based and the step-function-based p-value models are compared in Fig. 1. The step-function-based p-value models serve as an approximation to the p-value models based on the original test statistics.
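To make the mean-matching step concrete, the sketch below computes the parameter f of the simplified model from δ and evaluates both cumulative distribution functions. It assumes an upper-tailed one-sided test, p = 1 − Φ(statistic), so that the mean p-value under N(δ, 1) is Φ(−δ/√2) and the simplified model has mean 1/(2f). The function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def matched_f(delta):
    """f whose simplified step model has the same mean as the normal model."""
    mean_p = STD_NORMAL.cdf(-delta / sqrt(2.0))  # E(P) under N(delta, 1)
    return 1.0 / (2.0 * mean_p)                  # solve 1 / (2 f) = E(P)


def normal_model_cdf(x, delta):
    """Pr(P <= x) for 0 < x < 1 when the statistic is N(delta, 1)."""
    return STD_NORMAL.cdf(delta - STD_NORMAL.inv_cdf(1.0 - x))


def step_model_cdf(x, f):
    """Pr(P <= x) for 0 <= x <= 1 under the simplified step model."""
    return min(f * x, 1.0)


if __name__ == "__main__":
    delta = 2.0
    f = matched_f(delta)
    for x in (0.01, 0.05, 0.2):
        print(x, normal_model_cdf(x, delta), step_model_cdf(x, f))
```

Because only the mean is matched, the two distribution functions agree in location but not in shape, which is the trade-off the simplified model accepts.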
Based on the univariate model, the corresponding multivariate p-value model for independent p-values has a density function
\[ h(p_{1}, \ldots, p_{n}) = \prod_{i=1}^{n} h_{i}(p_{i}), \]
where different h_{ i }(·)’s may have different parameters f_{ i }’s and g_{ i }’s. In the following sections, the simplified step-function-based p-value model is applied to two problems in multiple testing procedures.
Results
Application: optimal choices of the critical constants
Assume n hypotheses \(\left \{H_{i}\right \}_{i=1}^{n}\) with p-values \(\left \{p_{i}\right \}_{i=1}^{n}\). Sort the p-values as p_{(1)}≤⋯≤p_{(n)}, and let H_{(1)},⋯,H_{(n)} be the corresponding null hypotheses. Consider testing the global null hypothesis \(\cap _{i=1}^{n} H_{i}\) under the control of type I error [15]
\[ \Pr\left(\text{reject } \cap_{i=1}^{n} H_{i} \,\middle|\, \cap_{i=1}^{n} H_{i} \text{ is true}\right) \leq \alpha. \]
The global test compares p_{(n−i+1)} with its corresponding critical constant c_{ i }α for every i=1,⋯,n. If p_{(n−i+1)}≤c_{ i }α for some i, then the global null hypothesis is rejected. Simes [25] proposed a test with c_{ i }=(n−i+1)/n, and other choices of the c_{ i }’s were proposed by Rom [22], Cai and Sarkar [7], and Gou and Tamhane [11].
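The decision rule described above can be written in a few lines. This is a minimal sketch; the function names `simes_constants` and `global_test_rejects` are illustrative and are not taken from the paper.

```python
def simes_constants(n):
    """Simes critical constants c_i = (n - i + 1) / n for i = 1, ..., n."""
    return [(n - i + 1) / n for i in range(1, n + 1)]


def global_test_rejects(pvalues, constants, alpha):
    """Reject the global null if p_(n-i+1) <= c_i * alpha for some i."""
    n = len(pvalues)
    ordered = sorted(pvalues)  # ordered[j - 1] is the j-th smallest p-value
    # c_1 is paired with the largest p-value and c_n with the smallest one.
    return any(ordered[n - i] <= constants[i - 1] * alpha
               for i in range(1, n + 1))


if __name__ == "__main__":
    c = simes_constants(2)  # [1.0, 0.5]
    print(global_test_rejects([0.03, 0.04], c, 0.05))
    print(global_test_rejects([0.03, 0.50], c, 0.05))
```

With two hypotheses and the Simes constants, the larger p-value is compared with α and the smaller one with α/2.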
To choose among different sets of critical constants c_{ i }, researchers usually run simulations [19] or rely on numerical calculations [14] to compare power. Besides these computationally intensive methods, the step-function-based p-value model offers an alternative way to calculate the powers theoretically and to compare different multiple testing procedures.
In this section we first show how the most powerful critical constants can be chosen using our proposed method for a global test with two hypotheses, and then apply a successive recursion to calculate the power for a global test with n hypotheses. For a global test with two single hypotheses, the control of type I error under independence requires
\[ (c_{1}\alpha)^{2} + 2c_{2}\alpha\left(1 - c_{1}\alpha\right) = \alpha, \]
where c_{1}≥c_{2}. This equality is equivalent to
\[ c_{1}^{2}\alpha + 2c_{2}\left(1 - c_{1}\alpha\right) = 1. \qquad (6) \]
From (6) we get \(c_{2} = \frac{1 - c_{1}^{2}\alpha}{2\left(1 - c_{1}\alpha\right)}\), and the derivative \(\frac{d c_{2}}{d c_{1}} = -\frac{\left(c_{1} - c_{2}\right)\alpha}{1 - c_{1}\alpha}\), which is negative when c_{1}>c_{2}. So c_{2} decreases when c_{1} increases.
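The relation between c_{1} and c_{2} is easy to check numerically. The sketch below assumes two independent uniform p-values under the global null, so the rejection probability is (c_{1}α)^{2} + 2c_{2}α(1 − c_{1}α); the function names are illustrative.

```python
def c2_from_c1(c1, alpha):
    """c2 that makes the two-hypothesis global test exactly level alpha."""
    return (1.0 - c1**2 * alpha) / (2.0 * (1.0 - c1 * alpha))


def type_one_error(c1, c2, alpha):
    """Rejection probability under the global null with independent p-values.

    Either both p-values fall below c1 * alpha, or exactly one falls below
    c2 * alpha while the other exceeds c1 * alpha (requires c1 >= c2).
    """
    a, b = c1 * alpha, c2 * alpha
    return a * a + 2.0 * b * (1.0 - a)


if __name__ == "__main__":
    alpha = 0.05
    for c1 in (0.6, 1.0, 2.0, 4.0):
        c2 = c2_from_c1(c1, alpha)
        print(c1, c2, type_one_error(c1, c2, alpha))
```

The choice c_{1} = 1 gives c_{2} = 1/2, which is the Simes test for two hypotheses.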
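Using the simplified step-function-based p-value model and assuming, without loss of generality, that 1≤f_{1}≤f_{2}, the probability of rejecting the global hypothesis is
\[ \min(f_{1}c_{1}\alpha, 1)\,\min(f_{2}c_{1}\alpha, 1) + \min(f_{1}c_{2}\alpha, 1)\left[1 - \min(f_{2}c_{1}\alpha, 1)\right] + \min(f_{2}c_{2}\alpha, 1)\left[1 - \min(f_{1}c_{1}\alpha, 1)\right]. \]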
When the global hypothesis contains two single hypotheses (n=2), there are two configurations of true and false null hypotheses in which the global null hypothesis is false: (1) one true null hypothesis (n_{0}=1) and one false null hypothesis (m=1), and (2) two false null hypotheses (m=2).
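First, assume that one null hypothesis is true and the other is false, say, f_{1}=1 and f_{2}=f, then the power is
\[ c_{1}\alpha\,\min(fc_{1}\alpha, 1) + c_{2}\alpha\left[1 - \min(fc_{1}\alpha, 1)\right] + \min(fc_{2}\alpha, 1)\left(1 - c_{1}\alpha\right). \]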
When \(f \leq \frac{1}{c_{1}\alpha}\), from (6) it follows that c_{1}(c_{1}−2c_{2})α=1−2c_{2}, therefore
\[ \text{power} = \alpha\left[f - (f-1)c_{2}\right], \]
hence power increases when c_{2} decreases (c_{1} increases).
When \(\frac{1}{c_{1}\alpha} \leq f \leq \frac{1}{c_{2}\alpha}\), from (6) we get \(c_{1} c_{2} \alpha = \frac{1}{2}\left(2c_{2} - 1 + c_{1}^{2}\alpha\right)\), consequently
\[ \text{power} = \alpha\left[c_{1} + \frac{f}{2}\left(1 - c_{1}^{2}\alpha\right)\right]. \]
Note that fαc_{1}≥1, so power increases when c_{1} decreases (c_{2} increases).
We calculate the maximal power for different values of f. When the alternative hypothesis has a small effect size, where \(f \leq 1/\sqrt{\alpha}\), the maximal power is achieved when c_{2}=0 and \(c_{1} = 1/\sqrt{\alpha}\). When the alternative hypothesis has a moderate effect size, where \(1/\sqrt{\alpha} < f \leq \left(1 + \sqrt{1 - \alpha}\right)/\alpha\), the maximal power is achieved when c_{1}=1/(fα) and c_{2}=(f^{2}α−1)/(2αf(f−1)), so as f increases, we decrease c_{1} (increase c_{2}) to achieve the maximal power. When the alternative hypothesis has a large effect size, where \(f > \left(1 + \sqrt{1 - \alpha}\right)/\alpha\), the maximal power is achieved by any test with c_{2}≥1/(fα), so when f is large enough, different tests have similar power.
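The three regimes can be verified against a brute-force grid over c_{1}. The sketch below assumes independent p-values whose distribution function under the simplified step model is min(f·x, 1), and it ties c_{2} to c_{1} through the level-α constraint. The function names are illustrative, and in the large-effect regime the code returns just one of the many tests that reach power 1.

```python
from math import sqrt


def c2_from_c1(c1, alpha):
    """Level-alpha constraint for two hypotheses, clipped at zero."""
    return max(0.0, (1.0 - c1**2 * alpha) / (2.0 * (1.0 - c1 * alpha)))


def power_two(c1, c2, f1, f2, alpha):
    """Rejection probability for two independent step-model p-values.

    Each p-value has CDF min(f_i * x, 1); the formula requires c1 >= c2.
    """
    a, b = c1 * alpha, c2 * alpha
    F1a, F2a = min(f1 * a, 1.0), min(f2 * a, 1.0)
    F1b, F2b = min(f1 * b, 1.0), min(f2 * b, 1.0)
    return F1a * F2a + F1b * (1.0 - F2a) + F2b * (1.0 - F1a)


def best_constants_one_false(f, alpha):
    """Power-maximizing (c1, c2) when one null is true and one is false."""
    small = 1.0 / sqrt(alpha)
    large = (1.0 + sqrt(1.0 - alpha)) / alpha
    if f <= small:
        c1 = small                              # c2 = 0
    elif f <= large:
        c1 = 1.0 / (f * alpha)
    else:
        c1 = (1.0 - sqrt(1.0 - alpha)) / alpha  # c1 = c2, power equals 1
    return c1, c2_from_c1(c1, alpha)


if __name__ == "__main__":
    alpha = 0.05
    for f in (3.0, 10.0, 50.0):
        c1, c2 = best_constants_one_false(f, alpha)
        print(f, c1, c2, power_two(c1, c2, 1.0, f, alpha))
```

A grid search over c_{1} between its smallest admissible value (where c_{1} = c_{2}) and 1/√α should not find a power above the closed-form choice.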
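Second, assume that both null hypotheses are false with the same effect size, say, f_{1}=f_{2}=f, then the power is
\[ \left[\min(fc_{1}\alpha, 1)\right]^{2} + 2\min(fc_{2}\alpha, 1)\left[1 - \min(fc_{1}\alpha, 1)\right]. \]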
When \(f \leq \frac{1}{c_{1}\alpha}\), from (6) we get c_{1}(c_{1}−2c_{2})α=1−2c_{2}, then
\[ \text{power} = f\alpha\left[f - 2(f-1)c_{2}\right], \]
so power increases when c_{2} decreases (c_{1} increases).
We calculate the maximal power for different f values. When we assume both hypotheses are false and with a small effect size, where \(f \leq 1/\sqrt {\alpha }\), the maximal power is achieved when c_{2}=0 and \(c_{1} = 1/\sqrt {\alpha }\). When we assume both false hypotheses have a big effect size, where \(f \geq 1/\sqrt {\alpha }\), the maximal power is achieved when the test satisfies c_{1}≥1/(fα). By taking both \(f\leq 1/\sqrt {\alpha }\) and \(f \geq 1/\sqrt {\alpha }\) into account, it follows that \(c_{1} = 1/\sqrt {\alpha }\) and c_{2}=0 is the uniformly best choice when both null hypotheses are false.
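The claim that \(c_{1} = 1/\sqrt{\alpha}\) and c_{2}=0 is uniformly best when both null hypotheses are false can be checked with a short grid search. The sketch assumes two independent p-values with common distribution function min(f·x, 1); the function names are illustrative.

```python
from math import sqrt


def c2_from_c1(c1, alpha):
    """Level-alpha constraint for two hypotheses, clipped at zero."""
    return max(0.0, (1.0 - c1**2 * alpha) / (2.0 * (1.0 - c1 * alpha)))


def power_both_false(c1, c2, f, alpha):
    """Power when both p-values follow the simplified step model with parameter f."""
    Fa = min(f * c1 * alpha, 1.0)  # Pr(P <= c1 * alpha)
    Fb = min(f * c2 * alpha, 1.0)  # Pr(P <= c2 * alpha)
    # Both below c1*alpha, or exactly one below c2*alpha with the other above c1*alpha.
    return Fa * Fa + 2.0 * Fb * (1.0 - Fa)


def grid_max_power(f, alpha, steps=2000):
    """Largest power over a grid of level-alpha tests indexed by c1."""
    lo = (1.0 - sqrt(1.0 - alpha)) / alpha  # smallest c1, where c1 = c2
    hi = 1.0 / sqrt(alpha)                  # largest c1, where c2 = 0
    return max(
        power_both_false(c1, c2_from_c1(c1, alpha), f, alpha)
        for c1 in (lo + (hi - lo) * k / steps for k in range(steps + 1))
    )


if __name__ == "__main__":
    alpha = 0.05
    for f in (2.0, 3.0, 10.0):
        print(f, power_both_false(1.0 / sqrt(alpha), 0.0, f, alpha),
              grid_max_power(f, alpha))
```

For f below 1/√α the power at this choice equals f²α, and for larger f it equals 1.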
In general, for a global test with n single hypotheses, where m of them are true significances (false null hypotheses), define the probabilities as
where Pr_{n,m} indicates the probability for n hypotheses, where m is the number of the true significances, and n_{0}=n−m is the number of the true nulls.
Since
we have the recurrence relation for i=n
Similarly, for general i, since
we have the general recurrence relation for i
Finner and Roters [9], Cai and Sarkar [7], and Gou and Tamhane [11] defined a special case of the probability B_{n,m,i} for m=0 to calculate the type I error under the global null hypothesis, and proved a recurrence relationship among the B_{n,0,i}’s. We generalize this result to the B_{n,m,i}’s and obtain the recurrence relationship (10) under the simplified step-function-based p-value model for power analysis.
By starting from
and using the recurrence relation (10), the power is calculated by
Note that
and the control of type I error is satisfied if
When f is specified and the set of critical constants \(\left \{c_{i}\right \}_{i=1}^{n}\) is given, the exact power can be calculated by using (11). Since only arithmetic calculations are needed, the power can be computed very fast.
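As an independent cross-check of an exact power calculation, the rejection probability can also be estimated by simulation. The sketch below does not implement the recurrence; it simply draws p-values from the simplified step model, under which P_i is uniform on [0, 1/f_i] and f_i = 1 marks a true null. The function name, the seed and the number of replications are arbitrary choices.

```python
import random


def simulate_power(constants, f_values, alpha, n_sims=100_000, seed=0):
    """Monte Carlo estimate of the global test's rejection probability."""
    rng = random.Random(seed)
    n = len(constants)
    thresholds = [c * alpha for c in constants]
    rejections = 0
    for _ in range(n_sims):
        # Simplified step model: P_i = U / f_i with U uniform on [0, 1).
        ordered = sorted(rng.random() / f for f in f_values)
        # c_i is compared with the (n - i + 1)-th smallest p-value.
        if any(ordered[n - i] <= thresholds[i - 1] for i in range(1, n + 1)):
            rejections += 1
    return rejections / n_sims


if __name__ == "__main__":
    simes = [1.0, 2.0 / 3.0, 1.0 / 3.0]
    print(simulate_power(simes, [1.0, 1.0, 1.0], 0.05))  # type I error estimate
    print(simulate_power(simes, [1.0, 5.0, 5.0], 0.05))  # power with two false nulls
```

The simulation error shrinks only at the rate 1/√n_sims, which is the cost that an arithmetic-only calculation avoids.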
For theoretical analysis, we consider the situation where f is not too small.
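The largest possible c_{n+1−m} is achieved by the set of critical constants that satisfies c_{1}=c_{2}=⋯=c_{n+1−m}=c and c_{n−m+2}=⋯=c_{ n }=0. The control of type I error requires that
\[ \sum_{k=m}^{n} {n \choose k} (c\alpha)^{k} (1 - c\alpha)^{n-k} = \alpha, \qquad (12) \]
and the largest possible c_{n+1−m} can be solved from (12). If we only take the leading term, we have an approximate solution
\[ c \approx \frac{1}{\alpha} \sqrt[m]{\alpha \Big/ {n \choose m}}. \]
So when
\[ f \geq \sqrt[m]{{n \choose m} \Big/ \alpha}, \]
the maximal power is achieved when the test satisfies c_{n+1−m}≥1/(fα).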
Note that when m is relatively large (e.g., more than n/2), the bound \(\sqrt [m]{{{n \choose m}}/{\alpha }}\) is small, and the global tests with large c_{1},⋯,c_{n+1−m} and small c_{n+2−m},⋯,c_{ n } tend to have large power. Similar observations were reported by Gou and Tamhane [11] based on simulations.
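The bound can be explored numerically. The sketch below assumes that, with c_{1}=⋯=c_{n+1−m}=c and the remaining constants equal to 0, the test rejects exactly when at least m of the n p-values fall below cα, so the type I error under independence is a binomial tail probability. The function names are illustrative.

```python
from math import comb


def type_one_error_equal_constants(c, n, m, alpha):
    """Pr(at least m of n independent uniform p-values are <= c * alpha)."""
    q = c * alpha
    return sum(comb(n, k) * q**k * (1.0 - q) ** (n - k) for k in range(m, n + 1))


def largest_c_exact(n, m, alpha, tol=1e-12):
    """Bisection for the largest c whose type I error does not exceed alpha."""
    lo, hi = 0.0, 1.0 / alpha  # c * alpha ranges over [0, 1]
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if type_one_error_equal_constants(mid, n, m, alpha) <= alpha:
            lo = mid
        else:
            hi = mid
    return lo


def largest_c_leading_term(n, m, alpha):
    """Approximation that keeps only the k = m term of the binomial tail."""
    return (alpha / comb(n, m)) ** (1.0 / m) / alpha


if __name__ == "__main__":
    alpha = 0.05
    for n, m in ((2, 1), (2, 2), (10, 3), (10, 8)):
        print(n, m, largest_c_exact(n, m, alpha), largest_c_leading_term(n, m, alpha))
```

The leading-term value is exact when m = n and otherwise slightly underestimates the exact solution, because the dropped tail terms partly offset the factor (1 − cα)^{n−m}.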
Note that in this application power is simply the probability of rejecting \(H_{0} = \cap _{i=1}^{n} H_{i}\) when at least one H_{ i } is false. When testing multiple hypotheses, power can be of different types: individual, average, disjunctive, and conjunctive, and the appropriate power concept is determined on a case-by-case basis [4]. These power definitions can also be used with the step-function-based p-value model.
Power analysis can be complex when multiple hierarchical objectives are involved. Alosh and Huque [1] discussed the power for testing hierarchically ordered endpoints. The step-function-based p-value model can be applied to various clinical trial settings, e.g., group sequential designs [18] and graphical procedures [5, 6].
Application: monotonicity of the critical constants
For multiple test procedures, critical constants are often required to satisfy [7, 12]
\[ c_{1} \geq c_{2} \geq \cdots \geq c_{n}. \qquad (13) \]
This requirement is called the monotonicity assumption of critical constants.
Suppose that we have a set of critical constants \(c_{1}^{*}, \cdots, c_{n}^{*}\) with \(c_{k}^{*} < c_{k+1}^{*}\) for some k, so the monotonicity assumption is not satisfied. Note that
\[ \left\{p_{(n-k+1)} \leq c_{k}^{*}\alpha\right\} \subseteq \left\{p_{(n-k+1)} \leq c_{k+1}^{*}\alpha\right\} \subseteq \left\{p_{(n-k)} \leq c_{k+1}^{*}\alpha\right\}. \]
So if a test with critical constants \(c_{1}^{*}, \cdots, c_{k-1}^{*}, c_{k}^{*}, c_{k+1}^{*}, \cdots, c_{n}^{*}\) controls type I error below α, then another test with critical constants \(c_{1}^{*}, \cdots, c_{k-1}^{*}, c_{k+1}^{*}, c_{k+1}^{*}, \cdots, c_{n}^{*}\), which satisfies the monotonicity assumption, also controls type I error below α, and has the same power as the previous test, which does not satisfy the monotonicity assumption. Hence, only the sets of critical constants which satisfy the monotonicity assumption need to be considered.
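The argument can be illustrated by checking that raising an out-of-order constant never changes the decision. This is a sketch under the rejection rule stated earlier (reject if p_{(n−i+1)}≤c_{ i }α for some i); the helper names, the example constants and the level 0.5 used to obtain frequent rejections are all illustrative choices.

```python
import random


def global_test_rejects(pvalues, constants, alpha):
    """Reject the global null if p_(n-i+1) <= c_i * alpha for some i."""
    n = len(pvalues)
    ordered = sorted(pvalues)
    return any(ordered[n - i] <= constants[i - 1] * alpha
               for i in range(1, n + 1))


def enforce_monotone(constants):
    """Raise c_k to c_{k+1} wherever c_k < c_{k+1}, scanning from the end."""
    fixed = list(constants)
    for k in range(len(fixed) - 2, -1, -1):
        fixed[k] = max(fixed[k], fixed[k + 1])
    return fixed


if __name__ == "__main__":
    rng = random.Random(1)
    original = [0.3, 0.9, 0.2]           # violates monotonicity: c_1 < c_2
    fixed = enforce_monotone(original)   # [0.9, 0.9, 0.2]
    same = all(
        global_test_rejects(p, original, 0.5) == global_test_rejects(p, fixed, 0.5)
        for p in ([rng.random() for _ in range(3)] for _ in range(10_000))
    )
    print(fixed, same)
```

The two sets of constants define the same rejection region, so their type I error and power coincide for any distribution of the p-values.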
Many multiple tests have critical constants which satisfy a strict monotonicity assumption [11, 22, 25]
\[ c_{1} > c_{2} > \cdots > c_{n}. \qquad (14) \]
Some multiple tests satisfy the monotonicity assumption (13) but not the strict monotonicity assumption (14) [3, 26]. In general, these tests are not as powerful as the tests that satisfy the strict monotonicity assumption (14) [11]. Our step-function-based p-value models help explain why this assumption is adopted: the corresponding tests are generally more powerful than tests that do not satisfy it.
For multiple tests with two single hypotheses, the simplified step-function-based p-value model yields two observations: (1) when there is one true null hypothesis and one false null hypothesis, the test that does not satisfy (14) can be as powerful as or more powerful than all the tests that satisfy (14) only if \(f = \left(1 + \sqrt{1-\alpha}\right)/\alpha\); (2) when there are two false null hypotheses, the test that does not satisfy (14) is less powerful than some of the tests that satisfy (14) for all f values. In general, on the parameter space of the effect size (under the simplified step-function-based p-value model, the effect size is a function of the parameter f), the tests that do not satisfy (14) have more power than all tests that satisfy (14) only on a measure-zero subset. This explains why multiple tests that satisfy the strict monotonicity property (14) are usually preferred: they are generally more powerful than the tests that do not satisfy (14).
A worked example
Annesi et al. [2] evaluated behavioral weight-loss treatments. They recruited 110 women whose BMIs were between 30 and 40 kg/m^{2}, and randomly assigned the participants to a comparison treatment with a print manual and telephone follow-ups, or to an experimental treatment of the coach approach exercise-support protocol. Self-efficacy for controlled eating (SE-eating) is one of the psychological predictors of behavioral changes. Annesi et al. [2] reported that during the weight-loss phase (months 0 to 6), the SE-eating increases in the experimental group were significantly greater than the increases in the comparison group with t=2.88, and that there was no significant between-group difference during the weight-loss maintenance phase (months 6 to 24) with t=−0.48.
The increases in self-efficacy for controlled eating were evaluated both during the weight-loss phase and during the weight-loss maintenance phase, so a multiplicity adjustment is advisable. To choose the optimal multiplicity correction based on the estimated effect size from a pilot study, we recommend the proposed step-function-based p-value model because it is easy to implement and preserves the information from the alternative hypothesis. If we take the weight-loss study by Annesi et al. [2] as a pilot study, we have the information that the standardized SE-eating increase during the weight-loss phase is normally distributed with mean δ=2.88 and variance 1, and the increase during the weight-loss maintenance phase is normally distributed with mean δ=0 and variance 1. To match the mean of the normal-distribution-based p-value model with parameter δ=2.88, the parameter f of the simplified step-function-based p-value model is 23.979. From (7) and using the significance level α=0.05, the maximal power is (1+f^{2}α)/(2f)=62 %, and the optimal choice of critical constants is (c_{1},c_{2})=(1/(fα),(f^{2}α−1)/(2αf(f−1)))=(0.8341,0.5036). So the larger p-value is compared with 0.8341α and the smaller p-value is compared with 0.5036α, and if either p-value is no greater than the corresponding critical value, the global null hypothesis is rejected.
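The numbers in this example can be reproduced with a few lines of arithmetic. The sketch below assumes that f is obtained by matching the mean p-value Φ(−δ/√2) of the normal model to 1/(2f), and it applies the moderate-effect-size formulas quoted in the text; the function name `worked_example` is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def worked_example(delta, alpha):
    """Matched f, optimal (c1, c2) and maximal power for one false null."""
    # Mean matching: 1 / (2 f) = Phi(-delta / sqrt(2)).
    f = 1.0 / (2.0 * NormalDist().cdf(-delta / sqrt(2.0)))
    lower = 1.0 / sqrt(alpha)
    upper = (1.0 + sqrt(1.0 - alpha)) / alpha
    if not (lower < f <= upper):
        raise ValueError("formulas below apply to the moderate regime only")
    c1 = 1.0 / (f * alpha)
    c2 = (f**2 * alpha - 1.0) / (2.0 * alpha * f * (f - 1.0))
    power = (1.0 + f**2 * alpha) / (2.0 * f)
    return {"f": f, "c1": c1, "c2": c2, "power": power}


if __name__ == "__main__":
    for name, value in worked_example(2.88, 0.05).items():
        print(name, round(value, 4))
```

The regime check matters: for much smaller or much larger δ the matched f leaves the moderate range and a different pair of constants is optimal.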
The step-function-based p-value models for power analysis simplify a theoretical analysis that is difficult in many situations. At the same time, the loss of information remains at an acceptable level. Finner and Gontscharuk [10] and Sarkar et al. [24] used a tool called the Dirac-uniform configuration for power analysis, where all p-values under the false null hypotheses follow a Dirac distribution with point mass at 0. When the Dirac-uniform configuration is applied to the study of Annesi et al. [2], the information about δ is lost, and any choice of positive critical constants (c_{1},c_{2}) will result in a claim of significance. Under the p-value model based on the Dirac function, all choices of critical constants have the same power, and the optimal choice cannot be located.
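The Dirac-uniform configuration is too coarse to carry the information needed to choose the optimal critical constants. Hung et al. [17] discussed a p-value model based on the normal distribution. Using the research of Annesi et al. [2] as a pilot study, one test statistic is N(δ,1^{2}), and the other test statistic is N(0,1^{2}). The probability of rejecting the global null hypothesis is
\[ c_{1}\alpha\,\Phi\left(\delta - z_{c_{1}\alpha}\right) + c_{2}\alpha\left[1 - \Phi\left(\delta - z_{c_{1}\alpha}\right)\right] + \left(1 - c_{1}\alpha\right)\Phi\left(\delta - z_{c_{2}\alpha}\right), \]
where \(z_{u} = \Phi^{-1}(1-u)\). The optimal choice of critical constants is obtained by solving
\[ \max_{c_{1} \geq c_{2} \geq 0} \left\{ c_{1}\alpha\,\Phi\left(\delta - z_{c_{1}\alpha}\right) + c_{2}\alpha\left[1 - \Phi\left(\delta - z_{c_{1}\alpha}\right)\right] + \left(1 - c_{1}\alpha\right)\Phi\left(\delta - z_{c_{2}\alpha}\right) \right\} \quad \text{subject to} \quad c_{1}^{2}\alpha + 2c_{2}\left(1 - c_{1}\alpha\right) = 1. \]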
This optimization problem has no explicit solution. When δ=2.88 and α=0.05, the optimal solution is (c_{1},c_{2})=(1.0076,0.4998). While the step-function-based p-value model and the normal-distribution-based p-value model produce similar choices of critical constants (c_{1},c_{2}), the model based on the step function has an explicit solution for the critical constants and requires little computational effort.
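A numerical search of this kind might look like the sketch below. It assumes an upper-tailed one-sided test, so the p-value of the N(δ, 1) statistic has distribution function Φ(δ − Φ^{-1}(1 − x)), and it ties c_{2} to c_{1} through the level-α constraint for two independent hypotheses. The grid search and all function names are illustrative choices, not the procedure used to obtain the values reported in the text.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def pvalue_cdf(x, delta):
    """Pr(P <= x) when the statistic is N(delta, 1) and p = 1 - Phi(statistic)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    return STD_NORMAL.cdf(delta - STD_NORMAL.inv_cdf(1.0 - x))


def c2_from_c1(c1, alpha):
    """Level-alpha constraint for two hypotheses, clipped at zero."""
    return max(0.0, (1.0 - c1**2 * alpha) / (2.0 * (1.0 - c1 * alpha)))


def normal_model_power(c1, delta, alpha):
    """Power with one uniform p-value (true null) and one from N(delta, 1)."""
    c2 = c2_from_c1(c1, alpha)
    a, b = c1 * alpha, c2 * alpha
    G_a, G_b = pvalue_cdf(a, delta), pvalue_cdf(b, delta)
    return a * G_a + b * (1.0 - G_a) + G_b * (1.0 - a)


def grid_search_c1(delta, alpha, steps=4000):
    """Grid point c1 with the largest power, and the implied c2."""
    lo = (1.0 - sqrt(1.0 - alpha)) / alpha  # c1 = c2
    hi = 1.0 / sqrt(alpha)                  # c2 = 0
    grid = (lo + (hi - lo) * k / steps for k in range(steps + 1))
    best_c1 = max(grid, key=lambda c1: normal_model_power(c1, delta, alpha))
    return best_c1, c2_from_c1(best_c1, alpha)


if __name__ == "__main__":
    c1, c2 = grid_search_c1(2.88, 0.05)
    print(c1, c2, normal_model_power(c1, 2.88, 0.05))
```

The power surface is very flat near its maximum, so the located c_{1} depends on the grid resolution far more than the achieved power does.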
Discussion and conclusions
We have given a step-function-based p-value model and its simplified version. These p-value models are simple and concise enough to perform theoretical power analysis. In addition, different test statistics, for example, t, chi-square or F, can be transformed to the p-value scale. These models can be applied to the field of multiple testing procedures to explain the assumption of monotonicity of the critical constants. We also use these p-value models to choose suitable sets of critical constants with more power. In this paper, we consider independent p-values for the multivariate cases. Dependence structures can be brought into these p-value models, as in Sarkar et al. [23] or Gou and Tamhane [13], and we will report the dependent multivariate p-value models in a separate paper. Finally, there are many applications of the step-function-based p-value models, and multiple testing procedures are a case in point.
Abbreviations
BMI: Body mass index
SE-eating: Self-efficacy for controlled eating
References
1. Alosh M, Huque MF. A consistency-adjusted alpha-adaptive strategy for sequential testing. Stat Med. 2010; 29:1559–71.
2. Annesi JJ, Johnson PH, Tennant GA, Porter KJ, McEwen KL. Weight loss and the prevention of weight regain: evaluation of a treatment model of exercise self-regulation generalizing to controlled eating. Permanente J. 2016; 20:1–15.
3. Bauer P. Sequential tests of hypotheses in consecutive trials. Biom J. 1989; 31:663–76.
4. Bretz F, Hothorn T, Westfall P. Multiple Comparisons Using R; 2010.
5. Bretz F, Maurer W, Brannath W, Posch M. A graphical approach to sequentially rejective multiple test procedures. Stat Med. 2009; 28:586–604.
6. Burman CF, Sonesson C, Guilbaud O. A recycling approach for the construction of Bonferroni-based multiple tests. Stat Med. 2009; 28:739–61.
7. Cai G, Sarkar SK. Modified Simes’ critical values under independence. Stat Probab Lett. 2008; 78:1362–78.
8. Feise RJ. Do multiple outcome measures require p-value adjustment. BMC Med Res Methodol. 2002; 2:1–4.
9. Finner H, Roters M. On the limit behavior of the joint distribution of order statistics. Ann Inst Stat Math. 1994; 46:343–9.
10. Finner H, Gontscharuk V. Controlling the familywise error rate with plug-in estimator for the proportion of true null hypotheses. J R Stat Soc Ser B. 2009; 71:1031–48.
11. Gou J, Tamhane AC. On generalized Simes critical constants. Biom J. 2014; 56:1035–54.
12. Gou J, Tamhane AC, Xi D, Rom D. A class of improved hybrid Hochberg-Hommel type step-up multiple test procedures. Biometrika. 2014; 101:899–911.
13. Gou J, Tamhane AC. Hochberg procedure under negative dependence. 2015. Technical report, Department of Statistics, Northwestern University, Evanston, Illinois.
14. Hayter AJ, Tamhane AC. Sample size determination for step-down multiple test procedures: orthogonal contrasts and comparisons with a control. J Stat Plan Infer. 1991; 27:271–90.
15. Hochberg Y, Tamhane AC. Multiple Comparison Procedures. New York: John Wiley; 1987.
16. Hsu JC. Sample size computation for designing multiple comparison experiments. J Comput Stat Data Anal. 1988; 7:79–91.
17. Hung HM, O’Neill RT, Bauer P, Köhne K. The behavior of the p-Value when the alternative hypothesis is true. Biometrics. 1997; 53:11–22.
18. Jennison C, Turnbull BW. Group Sequential Methods with Applications to Clinical Trials. New York: Chapman and Hall/CRC; 2000.
19. Jung S, Bang H, Young S. Sample size calculation for multiple testing in microarray data analysis. Biostatistics. 2005; 6:157–69.
20. Lazzeroni LC, Ray A. The cost of large numbers of hypothesis tests on power, effect size and sample size. Mol Psychiatry. 2012; 17:108–14.
21. Maxwell SE, Kelley K, Rausch RJ. Sample size planning for statistical power and accuracy in parameter estimation. Annu Rev Psychol. 2008; 59:537–63.
22. Rom DM. A sequentially rejective test procedure based on a modified Bonferroni inequality. Biometrika. 1990; 77:663–5.
23. Sarkar SK, Fu Y, Guo W. Improving Holm’s procedure using pairwise dependencies. Biometrika. 2016; 103:237–43.
24. Sarkar SK, Guo W, Finner H. On adaptive procedures controlling the familywise error rate. J Stat Plan Infer. 2012; 142:65–78.
25. Simes RJ. An improved Bonferroni procedure for multiple tests of significance. Biometrika. 1986; 73:751–4.
26. Šidák Z. Rectangular confidence regions for the means of multivariate normal distributions. J Am Stat Assoc. 1967; 62:626–33.
27. Stange J, Bodnar T, Dickhaus T. Uncertainty quantification for the familywise error rate in multivariate copula models. AStA Adv Stat Anal. 2015; 99:281–310.
Acknowledgments
We thank the editor and referees for their comments which helped to improve the paper.
Funding
The researchers did not receive external sources of funding.
Availability of supporting data
Not applicable.
Authors’ contributions
FZ and JG conceived and designed the study. FZ conducted the statistical analysis and drafted the manuscript. JG developed methods and revised the manuscript. Both authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Keywords
 Critical constants
 Multiple testing procedures
 Power analysis
 p-value