A systematic approach to designing statistically powerful heteroscedastic 2 × 2 factorial studies while minimizing financial costs

Background The 2 × 2 factorial design is widely used for assessing the existence of interaction and the extent of generalizability of two factors, where each factor has only two levels. Accordingly, research problems associated with the main effects and interaction effects can be analyzed with selected linear contrasts. Methods To accommodate potential heterogeneity of the variance structure, the Welch-Satterthwaite test is commonly used as an alternative to the t test for detecting the substantive significance of a linear combination of mean effects. This study concerns the optimal allocation of group sizes for the Welch-Satterthwaite test in order to minimize the total cost while maintaining adequate power. The existing method suggests that the optimal ratio of sample sizes is proportional to the ratio of the population standard deviations divided by the square root of the ratio of the unit sampling costs. Instead, a systematic approach using an optimization technique and a screening search is presented to find the optimal solution. Results Numerical assessments revealed that the current allocation scheme generally does not give the optimal solution. Alternatively, the suggested approaches to power and sample size calculations give accurate and superior results under various treatment and cost configurations. Conclusions The proposed approach improves upon the current method in both its methodological soundness and overall performance. Supplementary algorithms are also developed to aid the usefulness and implementation of the recommended technique in planning 2 × 2 factorial designs. Electronic supplementary material The online version of this article (doi:10.1186/s12874-016-0214-3) contains supplementary material, which is available to authorized users.


Background
Factorial designs are the most common formulation for assessing the existence of interaction and the extent of generalizability of two or more factors in medical research. Notably, systematic reviews and practical guidelines have been presented in Green, Liu, and O'Sullivan [1], Kahan [2], Kent et al. [3], McAlister et al. [4], and Montgomery, Astin, and Peters [5]. The prominent advantages of factorial designs over a series of single-factor studies can be easily seen with the 2 × 2 factorial design. In particular, a wide range of research problems associated with interactions, main effects, and various mixtures thereof can be examined in terms of a linear combination of mean effects. It is noteworthy that the designated linear comparison represents the substantive hypothesis of interest and reveals essential information that cannot be obtained from single-factor studies. Comprehensive expositions and further information can be found in Kutner et al. [6] and Maxwell and Delaney [7]. In addition, useful flowcharts of two-factor studies were presented in Figure 19.11 of Kutner et al. [6] and Figure 7.2 of Maxwell and Delaney [7] for appropriate approaches to evaluating effects in either the presence or the absence of an interaction.
It follows from the independence, normality, and homogeneity of variance assumptions, that the inference for a linear combination of mean effects can be conducted with a t statistic. However, numerous studies have found that the homogeneous variances assumption is frequently untenable in many research areas. It was explicitly stressed in Golinski and Cribbie [8], Grissom [9], Keselman et al. [10], and Ruscio and Roche [11] that variances are often extremely different across treatment groups. Therefore, it is prudent to adopt proper approaches that are superior to the traditional methods under various circumstances of heterogeneous variances. For testing a hypothesis of a linear combination of group means, the approximation suggested independently by Satterthwaite [12] and Welch [13] is the most widely recommended technique to correct for variance heterogeneity (Kirk [14]; Maxwell & Delaney [7]). The procedure is sometimes referred to as the Welch-Satterthwaite test and provides a simple and robust t-solution with approximate degrees of freedom.
A research study typically requires adequate statistical power and sufficient sample size to detect scientifically credible effects. Within the context of medical trials, the power and sample size implications for subgroup analysis of treatment combinations and interactions in factorial studies were noted in Beck et al. [15], Brookes et al. [16], Gonen [17], Natarajan et al. [18], and Wolbers et al. [19]. To extend the applicability of the Welch-Satterthwaite procedure in planning research designs, Shieh and Jan [20] presented two approaches to power and sample size calculations for the Welch-Satterthwaite test. The approximate method presents a particularly convenient technique for general use, whereas the exact formulation is noticeably more effective in maintaining the power performance in some situations. However, the prescribed approaches for choosing sample sizes to provide adequate power do not consider the cost implications of different sample size choices. Yet the cost of treating a subject often varies across treatment groups, and researchers cannot overlook budget constraints in practice. Bacchetti [21] and Bacchetti, McCulloch, and Segal [22] also emphasized that the conventional sample size procedures do not take cost into account and can produce cost-inefficient sample size choices. Alternatively, Allison et al. [23] advocated designing statistically powerful studies while minimizing costs.
Recently, Luh and Guo [24] studied the problem of efficient sample size allocation to reduce budget for the Welch-Satterthwaite test within the context of 2 × 2 factorial design. The optimal ratio of sample sizes suggested in Luh and Guo [24] is proportional to the ratio of the population standard deviations divided by the square root of the ratio of the unit sampling costs. Despite the presented argument and demonstration, their method is susceptible to three critical issues. First, the two-step procedure of Luh and Guo [24] involved several approximations including the use of quantiles of a standard normal distribution and a t distribution with approximate degrees of freedom. Unlike the two approaches of Shieh and Jan [20], Luh and Guo [24] did not employ the feature of a noncentral t distribution in their computation. More importantly, they did not explicitly define the nonnull distribution and the power function of the Welch-Satterthwaite test. Therefore, the resulting explication is incomplete by its absence of vital technical formulations.
Second, the optimal sample size ratios in Luh and Guo [24] were obtained under the simplified assumption that the test statistic has a normal distribution with known variances. Note that the particular formula has been adopted in Guo and Luh [25] to determine the presumably optimal sample size ratios for designing statistically powerful two-sample studies while minimizing financial costs. However, it was shown in Jan and Shieh [26] that such an allocation scheme generally does not give the optimal solution. Hence, it is arguable that Luh and Guo's [24] claim of optimal allocations requires further clarification. Third, the sample sizes need to be integer values in reality. The final sample sizes in Luh and Guo [24] are determined by rounding up the outcomes of the two-step calculations to the next largest integers. Conceivably, the use of discrete numbers induces some inexactness into the optimality evaluation. It is unlikely that a direct integer rounding process will always give the optimal result under power and cost considerations, even when the optimal sample size ratios have been implemented. Instead, a systematic power calculation and cost assessment needs to be conducted to find the proper result.
In order to make a useful and well-supported recommendation on optimal sample size allocations, this article presents an alternative approach for designing statistically powerful heteroscedastic 2 × 2 factorial studies while minimizing financial costs. A detailed account of the Welch-Satterthwaite test is presented next to document its theoretical characteristics and computational requirements. Moreover, the optimization processes of the proposed procedure with power and cost constraints are described. To provide definitive evidence, extensive empirical investigations were conducted to demonstrate the advantages of the suggested approach over the potentially defective method of Luh and Guo [24] under a variety of model configurations. Essentially, this study contributes to the literature of sample size methodology for the Welch-Satterthwaite test in two aspects. First, the suggested allocation scheme extends the design strategy of Shieh and Jan [20] to accommodate both budgetary constraints and power assessments. Second, the proposed optimization technique offers prominent improvement over Luh and Guo's [24] method to obtain the true optimal sample sizes. For computing the attained power and optimal allocation scheme in planning research designs, the SAS computer algorithms are offered to facilitate the recommended procedure.

Methods
Consider the statistical model of a 2 × 2 heteroscedastic factorial design:

X_ijk = μ_ij + ε_ijk,  ε_ijk ~ N(0, σ_ij²),  (1)

where X_ijk represents the independent and normally distributed response variable with expected value μ_ij and variance σ_ij², μ_ij is the population mean and σ_ij² is the error variance at level i of factor A and level j of factor B for i and j = 1 and 2, and k = 1, …, N_ij. In general, a linear combination of the mean parameters is defined as

ψ = Σ_{i=1}^{2} Σ_{j=1}^{2} L_ij μ_ij,  (2)

where the L_ij are the linear coefficients. A contrast is a special case of a linear combination in which the coefficients of the means add up to 0. Notably, the contrasts associated with the main effect A, the main effect B, and the interaction effect between A and B can be expressed as

ψ_A = μ_11 + μ_12 − μ_21 − μ_22,  ψ_B = μ_11 − μ_12 + μ_21 − μ_22,  and  ψ_I = μ_11 − μ_12 − μ_21 + μ_22,  (3)

respectively. An unbiased estimator ψ̂ of the linear combination ψ is obtained by simply replacing each population mean in Equation 2 by the corresponding sample mean:

ψ̂ = Σ_{i=1}^{2} Σ_{j=1}^{2} L_ij X̄_ij,  where X̄_ij = Σ_{k=1}^{N_ij} X_ijk/N_ij for i and j = 1 and 2.  (4)

The hypothesis testing of H_0: ψ = ψ_0 versus H_1: ψ ≠ ψ_0 can be conducted with the statistic

T* = (ψ̂ − ψ_0)/ω̂,  where ω̂² = Σ_{i=1}^{2} Σ_{j=1}^{2} L_ij² S_ij²/N_ij,  (5)

ψ_0 is a specified constant, and S_ij² = Σ_{k=1}^{N_ij} (X_ijk − X̄_ij)²/(N_ij − 1) is the sample variance estimator of σ_ij² for i and j = 1 and 2. Under the null hypothesis H_0: ψ = ψ_0, it was demonstrated in Satterthwaite [12] and Welch [13] that the statistic T* given in Equation 5 has the convenient approximate distribution

T* ≈ t(ν),  (6)

where t(ν) is a t distribution with degrees of freedom

ν = ω⁴/{Σ_{i=1}^{2} Σ_{j=1}^{2} (L_ij² σ_ij²/N_ij)²/(N_ij − 1)},  with ω² = Σ_{i=1}^{2} Σ_{j=1}^{2} L_ij² σ_ij²/N_ij.  (7)

For inferential purposes, the degrees of freedom ν is replaced by its sample counterpart ν̂, obtained by direct substitution of S_ij² for σ_ij². Hence, the null distribution of T* is modified as

T* ≈ t(ν̂),  (8)

and the Welch-Satterthwaite procedure rejects H_0 at the significance level α if |T*| > t_{ν̂, α/2}, where t_{ν̂, α/2} is the upper 100(α/2)th percentile of the t distribution t(ν̂).
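As a concrete illustration of the computations above, the following sketch evaluates the statistic T*, its approximate degrees of freedom ν̂, and the two-sided p value from raw cell data. The paper's own programs are written in SAS/IML; this Python version is only an illustrative translation, and the function name and interface are our own.

```python
import numpy as np
from scipy.stats import t as t_dist

def welch_satterthwaite_test(groups, L, psi0=0.0):
    """Welch-Satterthwaite test of H0: psi = psi0 for a linear combination
    psi = sum L_ij * mu_ij of the four cell means in a 2x2 factorial design.

    groups : four 1-D arrays of responses for cells (1,1), (1,2), (2,1), (2,2)
    L      : four linear coefficients
    Returns (T_star, df_hat, p_value).
    """
    n = np.array([len(g) for g in groups], dtype=float)
    means = np.array([np.mean(g) for g in groups])
    variances = np.array([np.var(g, ddof=1) for g in groups])
    L = np.asarray(L, dtype=float)

    psi_hat = np.sum(L * means)                    # unbiased estimator of psi
    comp = L**2 * variances / n                    # per-cell variance pieces
    omega2_hat = comp.sum()                        # estimated Var(psi_hat)
    df_hat = omega2_hat**2 / np.sum(comp**2 / (n - 1))  # approximate df
    T_star = (psi_hat - psi0) / np.sqrt(omega2_hat)
    p_value = 2 * t_dist.sf(abs(T_star), df_hat)
    return T_star, df_hat, p_value
```

With the interaction contrast L = (1, −1, −1, 1), the routine reproduces the rejection rule |T*| > t_{ν̂, α/2} for the interaction effect.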
Moreover, it was noted in Shieh and Jan [20] that the statistic T* has the general approximate distribution

T* ≈ t(ν, δ),  (9)

where t(ν, δ) is a noncentral t distribution with degrees of freedom ν and noncentrality parameter δ = (ψ − ψ_0)/ω, with ω² = Σ_{i=1}^{2} Σ_{j=1}^{2} L_ij² σ_ij²/N_ij. It immediately follows from the noncentral t distribution given in Equation 9 that the power function of the Welch-Satterthwaite test can be approximated by

π(δ) = P{|t(ν, δ)| > t_{ν, α/2}}.  (10)

Accordingly, Shieh and Jan [20] noted that the approximate power function π(δ) provides a useful expression because of its theoretical implications and practical applications. The numerical computation of the power level requires the evaluation of the cumulative distribution function of a noncentral t variable with respect to the quantile of a regular t distribution. Since all related functions are embedded in major statistical packages, the actual computations can be readily conducted with current computing capabilities. A SAS/IML (SAS Institute [27]) program is presented in Additional file 1 for computing π(δ) with the designated sample sizes and model configurations. More importantly, the empirical examinations presented later reveal that the approximate power function is sufficiently accurate for general purposes.
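The power function π(δ) can indeed be computed with any package that provides the noncentral t distribution. The sketch below, again in Python rather than the SAS/IML of Additional file 1, evaluates the approximate power for given per-cell variances, sample sizes, and contrast coefficients; the function name is our own.

```python
import numpy as np
from scipy.stats import nct, t as t_dist

def approx_power(sigma2, n, L, psi, psi0=0.0, alpha=0.05):
    """Approximate power pi(delta) of the two-sided Welch-Satterthwaite test,
    using the noncentral t formulation described in the text."""
    sigma2, n, L = (np.asarray(a, dtype=float) for a in (sigma2, n, L))
    comp = L**2 * sigma2 / n
    omega2 = comp.sum()                                  # Var(psi_hat)
    nu = omega2**2 / (comp**2 / (n - 1)).sum()           # Welch df
    delta = (psi - psi0) / np.sqrt(omega2)               # noncentrality
    t_crit = t_dist.ppf(1 - alpha / 2, nu)
    # P(|t(nu, delta)| > t_crit)
    return nct.sf(t_crit, nu, delta) + nct.cdf(-t_crit, nu, delta)
```

When ψ = ψ_0 the noncentrality is zero and the function returns the significance level α, which serves as a quick sanity check.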

Optimal allocation scheme
The determination of an adequate and efficient allocation of sample sizes is a vital aspect in the planning stage of research studies. It is often sensible to consider the sample size issues in the presence of funding constraints. The total cost of a 2 × 2 factorial study can be represented by the overhead cost and sampling costs through the linear cost function

C_T = C_O + Σ_{i=1}^{2} Σ_{j=1}^{2} C_ij N_ij,  (11)

where C_O is the fixed overhead cost associated with the study and C_ij reflects the unit sampling cost of each subject in group (i, j) for i and j = 1 and 2. It is important to note that the total number of subjects can be viewed as a special case of the cost function C_T, with C_O = 0 and C_ij = 1 for i and j = 1 and 2. From the cost perspective, a question arises naturally in choosing the optimal sample sizes: what is the least cost for a research study to maintain its desired power level? To develop a systematic search for the optimal solution that ensures the nominal power while minimizing the total cost C_T defined in Equation 11, the suggested approach proceeds in two steps. With the specifications of the significance level α, the desired power level 1 − β, the null effect size ψ_0, and the model parameters of group means and variance components, the first step computes the preliminary sample sizes {N_P11, N_P12, N_P21, N_P22} via an optimization process that minimizes the objective cost function under the constraint that the attained power is equal to or greater than the designated level. Note that the attained power is computed with the approximate power function π(δ) of the Welch-Satterthwaite test defined in Equation 10. A closed-form solution rarely exists, so the minimization typically requires iterative and extensive computations.
Accordingly, the NLPQN subroutine of the SAS/IML package provides an efficient approach to finding the optimal solution for cost minimization with the power function as a nonlinear constraint. It must be emphasized that the sample sizes are treated as continuous variables in this optimization process, so the preliminary solution {N_P11, N_P12, N_P21, N_P22} is generally fractional; rounding its components yields a set of nearly optimal integer sample sizes {N_M11, N_M12, N_M21, N_M22}. In the second step, the optimal allocation {N_11*, N_12*, N_21*, N_22*} is found through a screening of the integer sample size combinations that attain the desired power while giving the least cost. If more than one combination yields the same least cost, the one producing the largest power is reported. This fine-tuning procedure can be considered a safeguard for checking whether the nearly optimal integer sample sizes {N_M11, N_M12, N_M21, N_M22} are the true optimal solution. Unfortunately, the conducted numerical calculations revealed that the sample sizes {N_M11, N_M12, N_M21, N_M22} are rarely the correct optimal allocation. This finding justifies the suggested screening technique and highlights the deficiency of the rounding process in Luh and Guo [24]. A special-purpose SAS/IML computer program is presented in Additional file 2 for performing the necessary computation.
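The two-step search can be mirrored outside SAS as well. The sketch below is an illustrative Python analogue of the NLPQN-based program in Additional file 2, with scipy's SLSQP solver standing in for NLPQN and all names being our own: it first minimizes the linear cost under the power constraint with continuous sample sizes, then screens integer combinations around that solution for the least-cost allocation attaining the target power.

```python
from itertools import product

import numpy as np
from scipy.optimize import minimize
from scipy.stats import nct, t as t_dist

def approx_power(n, sigma2, L, psi, psi0=0.0, alpha=0.05):
    """Approximate power pi(delta) of the Welch-Satterthwaite test."""
    n, sigma2, L = (np.asarray(a, dtype=float) for a in (n, sigma2, L))
    comp = L**2 * sigma2 / n
    omega2 = comp.sum()
    nu = omega2**2 / (comp**2 / (n - 1)).sum()
    delta = (psi - psi0) / np.sqrt(omega2)
    t_crit = t_dist.ppf(1 - alpha / 2, nu)
    return nct.sf(t_crit, nu, delta) + nct.cdf(-t_crit, nu, delta)

def optimal_allocation(sigma2, L, psi, cost, C_O=0.0, target=0.8, alpha=0.05):
    """Least-cost integer allocation {N11, N12, N21, N22} with power >= target."""
    total_cost = lambda n: C_O + float(np.dot(cost, n))
    # Step 1: continuous cost minimization with the power as a constraint.
    cons = [{"type": "ineq",
             "fun": lambda n: approx_power(n, sigma2, L, psi, alpha=alpha) - target}]
    res = minimize(total_cost, x0=np.full(4, 30.0), method="SLSQP",
                   bounds=[(2.0, 1000.0)] * 4, constraints=cons)
    base = np.maximum(2, np.floor(res.x).astype(int))
    # Step 2: screen integer combinations near the continuous solution,
    # keeping the least-cost feasible one (ties broken by larger power).
    best = None
    for combo in product(*[range(max(2, b - 2), b + 4) for b in base]):
        power = approx_power(combo, sigma2, L, psi, alpha=alpha)
        if power < target:
            continue
        c = total_cost(combo)
        if best is None or c < best[1] or (c == best[1] and power > best[2]):
            best = (combo, c, power)
    return best  # (sizes, total cost, attained power)
```

The tie-breaking rule in the screening loop matches the one stated above: among equally cheap combinations, report the one with the largest attained power.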
On the other hand, Luh and Guo [24] presented a two-step procedure for obtaining the optimal result. First, using the simplified normal assumption, they showed that the optimal sample size ratio is proportional to the ratio of the population standard deviations divided by the square root of the ratio of the unit sampling costs:

r_ij = N_ij/N_11 = (σ_ij/σ_11)·√(C_11/C_ij),  i and j = 1 and 2.  (12)

The initial sample sizes {N_Z11, N_Z12, N_Z21, N_Z22} are obtained with N_Zij = N_Z11·r_ij, where

N_Z11 = (z_{α/2} + z_β)²{Σ_{i=1}^{2} Σ_{j=1}^{2} L_ij² σ_ij²/r_ij}/(ψ − ψ_0)²,

and z_{α/2} and z_β are the upper 100(α/2)th and 100·βth percentiles of the standard normal distribution, respectively. Then, to account for the approximate degrees of freedom of the Welch-Satterthwaite test, they suggested a modified step that uses the sample sizes {N_Z11, N_Z12, N_Z21, N_Z22} to compute the approximate degrees of freedom and yields a second set of sample sizes {N_T11, N_T12, N_T21, N_T22} with N_Tij = N_T11·r_ij for i and j = 1 and 2, where N_T11 is given by the same expression with the standard normal percentiles replaced by the corresponding t percentiles. Because the values of {N_T11, N_T12, N_T21, N_T22} are most likely fractional, a direct rounding process is applied to give the final sample sizes {N_LG11, N_LG12, N_LG21, N_LG22} with N_LGij = [N_Tij] + 1 for i and j = 1 and 2, where [·] denotes the integer part. In contrast to the proposed fine-tuning algorithm, Luh and Guo [24] overlooked the inexactness caused by integer sample sizes in power evaluation and cost minimization. This is one possible reason why their method does not guarantee the optimal sample sizes, as shown in the subsequent numerical comparisons.
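For comparison, the two-step calculation of Luh and Guo [24], as described above, can be sketched as follows. This is our own Python reading of their procedure (names and interface hypothetical), with the normal-based first step followed by a t-based refinement evaluated at the Welch degrees of freedom implied by the initial sizes.

```python
import numpy as np
from scipy.stats import norm, t as t_dist

def luh_guo_allocation(sigma2, L, cost, psi, psi0=0.0, alpha=0.05, beta=0.2):
    """Two-step normal-then-t allocation with ratios r_ij from Equation 12."""
    sigma2, L, cost = (np.asarray(a, dtype=float) for a in (sigma2, L, cost))
    sigma = np.sqrt(sigma2)
    r = (sigma / sigma[0]) * np.sqrt(cost[0] / cost)      # r_11 = 1
    S = np.sum(L**2 * sigma2 / r)
    # Step 1: normal approximation with known variances.
    z_sum = norm.ppf(1 - alpha / 2) + norm.ppf(1 - beta)
    nZ = (z_sum**2 * S / (psi - psi0)**2) * r             # {N_Z11, ..., N_Z22}
    # Step 2: replace normal percentiles with t percentiles at the Welch df
    # implied by the initial sizes.
    comp = L**2 * sigma2 / nZ
    nu = comp.sum()**2 / (comp**2 / (nZ - 1)).sum()
    t_sum = t_dist.ppf(1 - alpha / 2, nu) + t_dist.ppf(1 - beta, nu)
    nT = (t_sum**2 * S / (psi - psi0)**2) * r             # {N_T11, ..., N_T22}
    return np.floor(nT).astype(int) + 1                   # N_LGij = [N_Tij] + 1
```

In a fully symmetric case (equal variances and unit costs) all four cells receive the same size, so only heteroscedastic or cost-uneven settings separate this formula from the proposed optimal search.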

Numerical assessments
To illustrate the advantage of the proposed optimal procedure over the existing method of Luh and Guo [24], numerical appraisals were performed to assess the optimal sample size calculations of the two methods under a wide variety of model configurations. The empirical investigation consists of two studies with real and hypothetical data that correspond to the model settings in Tables 2 and 3 of Luh and Guo [24].

Study I
For purposes of comparison, the illustrative example of a 2 × 2 factorial study of attack context and panic fear in Luh and Guo [24] is reexamined here. The data were obtained from the investigations of frequency and cost of emergency service use in Barnett and Nurmagambetov [28] and Greaves et al. [29]. To exemplify a typical research scenario frequently encountered in the planning stage of a study, the reported findings are employed to provide planning values of the model parameters and design characteristics for a future asthma study.
Specifically, the mean effects {μ_11, μ_12, μ_21, μ_22} and variance components {σ_11², σ_12², σ_21², σ_22²} are designated according to the reported findings. Then the proposed allocation procedure was employed to find the optimal sample sizes needed to achieve the nominal power 1 − β = 0.8 for three contrast effects and two cost structures. Throughout this empirical study, the significance level is set as α = 0.05 and the null value is ψ_0 = 0. Overall, a total of six different sets of sample sizes were obtained.
The optimal allocation, total cost, and total sample size are summarized in Table 1. Unlike the presented procedure, the numerical outcomes reported in Table 2 of Luh and Guo [24] are based on the sample size ratios in Equation 12. For ease of illustration, the corresponding results are also presented in Table 1. In addition, the attained powers for the computed sample sizes were obtained with the suggested approximate power function π(δ) given in Equation 10. Due to the approximate nature of the suggested power calculations, a Monte Carlo simulation of 10,000 independent data sets was also conducted to obtain the simulated powers. Accordingly, the adequacy of the approximate power function can be evaluated by the difference between the simulated power and the approximate power. In addition to the prescribed optimal sample allocation, total cost, and total sample size, the approximate power, simulated power, and their difference are also listed in Table 1.

Study II
To further explicate the optimal behavior and deeper implications of the two sample size procedures, additional numerical assessments were performed with different variability patterns and cost structures. In this study, the simulation design of Luh and Guo [24] is adopted as a benchmark, and the variability patterns were chosen to represent, as much as possible, the range of characteristics that are likely to be obtained in actual applications. Similar to the illustration in Study I, the main settings are assigned as the significance level α = 0.05, nominal power 1 − β = 0.8, and the null value ψ_0 = 0. Also, a Monte Carlo simulation of 10,000 independent data sets was again conducted to obtain the simulated powers. The computed sample sizes, total cost, total sample size, approximate power, simulated power, and associated error of the two competing approaches are presented in Table 2 for six different cost functions.
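The Monte Carlo check used in both studies can be reproduced along the following lines. This Python sketch (with hypothetical function names, standing in for the simulation code actually used) draws normal samples for the four cells, applies the Welch-Satterthwaite rejection rule, and reports the empirical rejection rate.

```python
import numpy as np
from scipy.stats import t as t_dist

def simulate_power(mu, sigma2, n, L, psi0=0.0, alpha=0.05, reps=10000, seed=2016):
    """Empirical power of the Welch-Satterthwaite test over `reps` replicates."""
    rng = np.random.default_rng(seed)
    mu, sigma, nn, L = (np.asarray(a, dtype=float)
                        for a in (mu, np.sqrt(sigma2), n, L))
    sizes = nn.astype(int)
    rejections = 0
    for _ in range(reps):
        means = np.empty(4)
        variances = np.empty(4)
        for j in range(4):
            # Independent normal responses for cell j with its own variance.
            x = rng.normal(mu[j], sigma[j], sizes[j])
            means[j] = x.mean()
            variances[j] = x.var(ddof=1)
        comp = L**2 * variances / nn
        omega2 = comp.sum()
        df = omega2**2 / (comp**2 / (nn - 1)).sum()       # approximate df
        T = (np.sum(L * means) - psi0) / np.sqrt(omega2)
        rejections += abs(T) > t_dist.ppf(1 - alpha / 2, df)
    return rejections / reps
```

Under the null configuration the rejection rate should hover near α, while under the planned alternative it should be close to the approximate power reported by π(δ).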

Results
As was pointed out above, Luh and Guo [24] did not explicitly describe the power function for the Welch-Satterthwaite test, and only the simulated powers were reported in their numerical demonstration. Note that the approximate powers presented in Table 1 for both sample size procedures are computed with the approximate power function π(δ) for the reported sample sizes. The appraisal and implications of the power performance are presented in Table 1.

Table 1 Computed sample size, total cost, total size, simulated power, and error for the proposed approach and Luh and Guo's (2016) method, when α = 0.05 and 1 − β = 0.8

It can be readily seen from the reported sample sizes in Table 1 that the two allocation procedures do not agree on the optimal sample size settings. The computed sample sizes of Luh and Guo's [24] method are larger than or equal to those of the suggested technique. The last scenario, associated with ψ_B and C_E, is the only case where the two sets of sample sizes are identical. Specifically, the sample sizes associated with the interaction effect ψ_I and cost coefficient set C_E are {11, 16, 13, 19} and {12, 17, 14, 19} for the proposed approach and Luh and Guo's [24] method, respectively. In this case, the maximum difference of cell sample sizes between the two procedures is only one. This indicates that a simple rounding process may distort an optimal solution even when the sample size ratios {r_11, r_12, r_21, r_22} provide a nearly optimal result under the normal approximation. Moreover, for the main effect ψ_A and cost coefficient set C_U, the computed sample sizes for the two contending procedures are {10, 13, 12, 16} and {10, 15, 13, 17}, respectively. Hence, the cell sample sizes incur the largest difference of two units. The same situation also occurred with the setup of the main effect ψ_B and cost coefficient set C_U.
In both scenarios, the discrepancy of more than one unit in the sample size determinations reveals that the normality-based sample size ratios r_ij given in Equation 12 are also responsible for the suboptimal behavior of Luh and Guo's [24] method. Consequently, the total cost incurred by the proposed approach is always no larger than that of the Luh and Guo [24] procedure. This numerical evidence shows that their method does not guarantee an optimal sample size allocation for minimizing the total cost.
With the different variability patterns and cost structures in the second empirical study, it is clear from the marginal differences between the approximate power and the simulated power in Table 2 that the approximate power function π(δ) provides an accurate solution for power calculations. Specifically, the absolute errors are less than 0.01 for all 12 cases. Just as in the preceding study, all the approximate powers or attained powers associated with both optimal allocation methods satisfy the desired power level of 0.8 for all six cost structures. However, the computed sample sizes in Table 2 indicate that Luh and Guo's [24] method consistently gives a greater total cost and a larger total sample size than the suggested technique. Notably, the accuracy of their method deteriorates as the variability of the unit costs increases. For the identical unit costs {1, 1, 1, 1}, the computed sample sizes {20, 40, 60, 79} and {20, 40, 60, 80} of the two approaches are nearly the same. However, the corresponding optimal sample sizes and total costs become prominently different when the unit costs are {1, 1, 2, 5}, {5, 2, 1, 1}, and {1, 3, 3, 1}. Accordingly, the performance of the existing method of Luh and Guo [24] is sensitive to model settings and cost schemes. In view of the potentially diverse treatment and cost configurations in factorial designs, their formula does not serve as a robust procedure for general use. Consequently, the allegedly optimal sample size calculations of Luh and Guo [24] are actually suboptimal, and their claim of developing the most efficient allocation for heteroscedastic 2 × 2 factorial designs is incorrect.

Discussion
The 2 × 2 factorial design is widely used in different fields of research for assessing the interaction between two factors. However, violation of the homogeneity of variance assumption has been a target of criticism in applications of standard factorial ANOVA. For testing a hypothesis of a linear combination of group means, the Welch-Satterthwaite procedure emerges as a robust alternative under heteroscedasticity when the underlying distributions are normal. For the ultimate aim of selecting the optimal sample size allocation, the analytical argument and empirical performance of an optimization technique must be well examined before it can be adopted as a general methodology in practice. Large sample theory shows that, in order to ensure the nominal power while minimizing total cost, an optimal ratio of sample sizes is proportional to the ratio of the population standard deviations divided by the square root of the ratio of the unit sampling costs. At first sight, Luh and Guo's [24] sample size procedure is easy to use and seems to give practically useful results. However, it is unlikely that a direct implementation of the simple allocation formula with simple integer rounding will give the optimal solution under cost considerations. Therefore, there is a need for a systematic and detailed process to calculate the final optimal sample sizes. Evidently, the proposed procedure and the current method of Luh and Guo [24] differ prominently in their fundamental principles and demand varying computational efforts. Due to the complexity of the sample size optimization problem under power and cost considerations, a complete analytic examination is not feasible. Instead, numerical assessments were conducted to examine their unique features and underlying discrepancies in order to better understand the selection of an appropriate approach for optimal sample size determination in 2 × 2 factorial studies.
Detailed appraisals showed that Luh and Guo's [24] procedure generally does not give the optimal solution. Alternatively, the described approach provides a superior solution for optimal sample size allocation.

Conclusions
To enhance the usefulness of the Welch-Satterthwaite procedure in planning research designs, this article addresses the corresponding problem of designing statistically powerful heteroscedastic 2 × 2 factorial studies while minimizing financial costs. The suggested approach outperforms the current method in both its methodological soundness and overall performance. The presented sample size optimization methodology can be useful for the advocated practice of planning research design under both power and cost considerations. Computer algorithms are also developed to facilitate the implementation of the recommended power and sample size calculations in planning 2 × 2 factorial designs.