Overview
We conducted an analysis examining how different distributions of a dichotomous prognostic factor affect the power (and the sample size needed to obtain 80% power) to detect an interaction between the prognostic factor and treatment in RCTs. We also studied the impact of misspecifying the distribution of the prognostic factor, in either the positive or negative direction, on power and sample size. We varied the magnitude of the interaction effect, the distribution of the prognostic factor, and the magnitude of the misspecification. Lastly, we compared three methods for maintaining appropriate overall power and type I error when the distribution of the prognostic factor is misspecified: quota sampling, modified quota sampling, and sample size re-estimation using conditional power.
Specification of key parameters used in the paper
Treatment variable
The treatment variable was distributed as a binary variable (active vs. placebo) with probability 0.5. For the purposes of this paper the treatment variable was assumed to always have a balanced distribution (i.e., 50% at level 1 receiving active treatment and 50% at level 2 receiving placebo). For illustration purposes, we assumed that APM was the active treatment and that PT was the placebo.
Prognostic factor
The prognostic factor was defined as a dichotomous variable, with $k_j$ representing the $j$th level of the prognostic factor. When referring to the distribution of the prognostic factor we indicated the percentage in the $k_1$ level of the prognostic factor, defined as $p_1$. We varied $p_1$ from 10% to 50% in 10% increments.
Misspecification of the prognostic factor
The misspecification of the prognostic factor was defined by the parameter q. The misspecification could be positive or negative, with negative misspecification implying less balance (more skew) and positive misspecification implying more balance (less skew). For example, if the planned distribution of the prognostic factor was 20% and the actual distribution was 25%, then the misspecification of the prognostic factor was q = +5%. Possible values of q were −15%, −5%, 0% (i.e., no misspecification), +5%, and +15%.
Outcome variable
We assumed that our outcome variable was continuous and normally distributed. In our example, the outcome can be interpreted as the improvement in function after APM or PT as measured by a score or scale. We specified the mean improvement for all four possible combinations of treatment and the prognostic factor. We considered two different values (25 and 15) for the mean improvement in the active/$k_1$ combination of treatment and prognostic factor (i.e., APM with mild knee OA severity). The mean improvements in the active/$k_2$, placebo/$k_1$, and placebo/$k_2$ groups were held constant at 5, 5, and 0, respectively. We assumed a common standard deviation (σ) of 10 for all four combinations.
Magnitude of the interaction
We defined the magnitude of the interaction between prognostic factor and treatment effect according to the method of Brookes and colleagues [6, 7]. Let $\mu_{ij}$ be the mean improvement in the $i$th treatment and $j$th level of the prognostic factor. We then defined the treatment efficacy in the $j$th level of the prognostic factor as:

$$d_j = \mu_{1j} - \mu_{2j} \qquad (1)$$

We then defined the interaction effect (denoted as θ) as follows:

$$\theta = d_1 - d_2 = (\mu_{11} - \mu_{21}) - (\mu_{12} - \mu_{22}) \qquad (2)$$

Thus θ, which served as the basis of our choice of mean improvement values, varied as $\mu_{11}$ varied. The magnitudes of the interaction effect that we considered were 15 and 5; for example, $\mu_{11} = 25$ gives $\theta = (25 - 5) - (5 - 0) = 15$. The estimate of the interaction effect was defined as follows:

$$\hat{\theta} = (\bar{x}_{11} - \bar{x}_{21}) - (\bar{x}_{12} - \bar{x}_{22}) \qquad (3)$$
Then, the variance of the interaction effect estimate under balanced treatment groups and distribution of the prognostic factor $p_1$ can be derived as follows (note that N equals the total sample size for the trial):

$$\operatorname{Var}(\hat{\theta}) = \frac{4\sigma^2}{N}\left(\frac{1}{p_1} + \frac{1}{1 - p_1}\right) = \frac{4\sigma^2}{N\,p_1(1 - p_1)} \qquad (4)$$

It is clear from this equation that as the prevalence of the prognostic factor ($p_1$) increases toward 50%, the variance decreases, which implies that the power increases for a fixed sample size.
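As a quick numerical check of equation 4, the following Python sketch (function and parameter names are ours, with σ = 10 as in the outcome specification above) shows the variance shrinking as $p_1$ approaches 50%:

```python
# Sketch of equation 4: variance of the interaction effect estimate
# under balanced treatment groups and prognostic factor prevalence p1.
def interaction_variance(N, p1, sigma=10.0):
    """Var(theta_hat) = 4 * sigma^2 / (N * p1 * (1 - p1))."""
    return 4 * sigma**2 / (N * p1 * (1 - p1))

for p1 in (0.1, 0.2, 0.3, 0.4, 0.5):
    print(p1, round(interaction_variance(N=200, p1=p1), 2))
# The variance decreases as p1 moves toward 0.5, so power rises for fixed N.
```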
Initial sample size for interaction effects
The sample size required for the $i$th treatment and $j$th prognostic factor level to detect the interaction effect described above under a balanced design (i.e., $p_1 = 0.5$) with a two-sided significance level of α and power equal to 1 − β has been previously published by Lachenbruch [9]:

$$n_{ij} = \frac{4\sigma^2\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2}{\theta^2} \qquad (5)$$
In this formula $z_{1-\beta}$ represents the z-value at the 1 − β (theoretical power) quantile of the standard normal distribution and $z_{1-\alpha/2}$ represents the z-value at the 1 − α/2 quantile of the standard normal distribution, where α is the probability of a type I error. Under a balanced design with $p_1 = 0.5$ we can simply multiply $n_{ij}$ by four to obtain the total sample size, since there are four combinations of treatment and prognostic factor. A limitation of this formula is that it uses critical values from the standard normal distribution rather than the Student's t-distribution, even though most statistical tests of interaction are performed using a t-distribution. To account for this, we calculated the total sample size required to detect an interaction effect with a two-sided significance level of α and power equal to 1 − β as follows:
1. Use formula 5 (above) to calculate the sample size required for each combination of treatment and prognostic factor under a balanced design.

2. Calculate a new sample size, $n_{ij}^{new}$, for each combination of treatment and prognostic factor under a balanced design using formula 6 below. In this formula the z-critical values have been replaced with t-critical values whose degrees of freedom depend on $n_{ij}$:

$$n_{ij}^{new} = \frac{4\sigma^2\left(t_{1-\alpha/2,\,4n_{ij}-4} + t_{1-\beta,\,4n_{ij}-4}\right)^2}{\theta^2} \qquad (6)$$

3. Set $n_{ij}$ equal to $n_{ij}^{new}$ and repeat step 2.

4. Repeat step 3 until $n_{ij}^{new}$ converges. This will usually occur after 2 or 3 iterations.

5. Lastly, to correct for imbalance in the prognostic factor, multiply the balanced total sample size $4n_{ij}$ by $1/[4p_1(1-p_1)]$ (the variance inflation implied by equation 4) to obtain the final total sample size N. The sketch below walks through the full procedure.
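To make the iteration concrete, here is a minimal Python sketch of steps 1–5. The function name is ours, and the degrees of freedom ($4n_{ij} - 4$) and the imbalance factor follow the reconstruction above:

```python
import math
from scipy import stats

def interaction_sample_size(theta, sigma=10.0, p1=0.5, alpha=0.05, power=0.80):
    """Iterative total sample size for the interaction test (steps 1-5)."""
    # Step 1: initial per-cell size from the normal-based formula 5.
    z = stats.norm.ppf(1 - alpha / 2) + stats.norm.ppf(power)
    n_ij = math.ceil(4 * sigma**2 * z**2 / theta**2)
    # Steps 2-4: swap in t critical values and iterate until convergence.
    while True:
        df = 4 * n_ij - 4
        t = stats.t.ppf(1 - alpha / 2, df) + stats.t.ppf(power, df)
        n_new = math.ceil(4 * sigma**2 * t**2 / theta**2)
        if n_new == n_ij:
            break
        n_ij = n_new
    # Step 5: inflate the balanced total for imbalance in the prognostic factor.
    return math.ceil(4 * n_ij / (4 * p1 * (1 - p1)))

print(interaction_sample_size(theta=15, p1=0.3))  # total N for p1 = 30%
```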
Effect of misspecifying the distribution of the prognostic factor
The effect of misspecifying the distribution of the prognostic factor was evaluated using power curves. The formula used by Lachenbruch was extended to incorporate the Student's t-distribution [9]. Power for the interaction test, by actual prevalence of the prognostic factor ($p_1 + q$) and magnitude of the interaction effect, was calculated using equation 7 below, where Ψ is the cumulative distribution function of the Student's t-distribution with N − 4 degrees of freedom:

$$1 - \beta = \Psi\!\left(\frac{\theta}{2\sigma}\sqrt{N\,(p_1 + q)\,(1 - p_1 - q)} - t_{1-\alpha/2,\,N-4}\right) \qquad (7)$$
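A short sketch of equation 7 as reconstructed above (it plugs the actual prevalence into the variance from equation 4) shows how power erodes as the actual prevalence drifts away from the planned value:

```python
from scipy import stats

def interaction_power(N, theta, p_actual, sigma=10.0, alpha=0.05):
    """Equation 7: power of the interaction t-test at the actual prevalence."""
    df = N - 4
    shift = theta / (2 * sigma) * (N * p_actual * (1 - p_actual)) ** 0.5
    return stats.t.cdf(shift - stats.t.ppf(1 - alpha / 2, df), df)

# Planned p1 = 30% with N = 72 (the value from the sizing sketch above);
# power at each misspecification q considered in the paper.
for q in (-0.15, -0.05, 0.0, 0.05, 0.15):
    print(q, round(interaction_power(N=72, theta=15, p_actual=0.30 + q), 3))
```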
Strategies for accounting for the misspecification of the distribution of the prognostic factor
Quota sampling
The quota sampling approach was performed using the following steps. First, for a given set of parameters, we determined the sample size needed to detect an interaction effect with 80% power. We then fixed the number of participants to be recruited at each level of the prognostic factor. For example, if the final total sample size was 200 and the planned distribution of the prognostic factor was 30% in the $k_1$ group and 70% in the $k_2$ group, then exactly 60 subjects would be recruited into the $k_1$ group and 140 into the $k_2$ group. This method removes the variability in the sampling distribution and ensures that the observed distribution of the prognostic factor always matches the planned distribution, so no misspecification can occur. However, this method may require turning away potential subjects once one level of the prognostic factor is filled, delaying trial completion. It may also reduce the external validity of the overall treatment results, as the trial subjects can become less representative of the unselected population of interest. Because of these limitations we also considered a modified quota sampling approach.
Modified quota sampling
The modified quota sampling approach was performed using the following steps. First, as in the quota sampling approach, the sample size needed to detect an interaction effect with 80% power was determined for the pre-specified parameters. Next, the simulated study enrolled the first N/2 subjects. After the first N/2 subjects were enrolled, we tested whether the sampling distribution of the prognostic factor differed from what was planned using a one-sample test of the proportion (see the sketch below). If this test was statistically significant at the 0.05 level, then a quota sampling approach was undertaken for the second N/2 subjects to ensure that the final sampling distribution of the prognostic factor matched the planned distribution exactly. If the result was not statistically significant, the study continued to enroll normally, allowing for variability in the distribution of the prognostic factor.
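The interim check might look like the following sketch. We use an exact binomial test for illustration; the paper's one-sample test of the proportion could equally be the normal-approximation z-test:

```python
from scipy.stats import binomtest

def switch_to_quotas(n_k1, n_half, p1_planned, alpha=0.05):
    """True if the second half of enrollment should use quota sampling."""
    return binomtest(n_k1, n_half, p1_planned).pvalue < alpha

# Example: 22 of the first 100 subjects fall in k1 when 30% was planned.
print(switch_to_quotas(n_k1=22, n_half=100, p1_planned=0.30))
```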
Sample size re-estimation using conditional power
The last method for accounting for the misspecification of the distribution of the prognostic factor used the conditional power of the interaction test at an interim analysis to re-estimate the sample size. We modified the methods of Denne to carry out this procedure [10]. We assumed that the interim analysis occurred after the first N/2 subjects were enrolled. The critical values at the interim and final analyses ($c_1$ and $c_2$) were determined by the O'Brien-Fleming alpha-spending function [11] using the SEQDESIGN procedure in the SAS statistical software package. We also used the SEQDESIGN procedure to calculate a futility boundary at the interim analysis ($b_1$). Since these critical values are based on a standard normal distribution and not the Student's t-distribution, we converted them to critical values based on the Student's t-distribution. First, we converted the original critical values to the corresponding percentiles of the standard normal distribution. We then converted these percentiles to the corresponding critical values of the Student's t-distribution with N − 4 degrees of freedom.
At the interim analysis, if the absolute value of the interaction test statistic was less than the futility boundary ($|t_1| < b_1$), then we stopped the trial for futility and considered the result of the trial to be not statistically significant. If the absolute value of the test statistic was greater than the interim critical value ($c_1$), then we stopped the trial for efficacy and considered the result of the trial to be statistically significant. If the absolute value of the test statistic was greater than $b_1$ but less than $c_1$, then we evaluated the conditional power and determined whether sample size re-estimation was necessary. The following paragraphs outline this procedure.
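Before turning to the conditional power formula, the interim decision rule can be summarized in a few lines (a sketch; the boundary values would come from the SEQDESIGN output described above):

```python
def interim_decision(t1, b1, c1):
    """Map the interim interaction test statistic to a trial decision."""
    if abs(t1) < b1:
        return "stop for futility (not statistically significant)"
    if abs(t1) > c1:
        return "stop for efficacy (statistically significant)"
    return "continue: evaluate conditional power, re-estimate N if needed"
```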
The following is the conditional power formula proposed by Denne for the two-group comparison of means:

$$CP = 1 - \Phi\!\left(\frac{c_2\sqrt{n_2} - z_1\sqrt{n_1}}{\sqrt{n_2 - n_1}} - \frac{\delta\sqrt{n_2 - n_1}}{\sigma\sqrt{2}}\right) \qquad (8)$$

Here, $c_2$ is the final critical value, $n_2$ is the sample size at the final analysis, $n_t$ is the originally planned total sample size, $z_1$ is the test statistic for the interaction at the interim analysis, $n_1$ is the sample size at the interim analysis, δ is the difference in means, and σ is the common standard deviation for the two groups. We updated the formula by replacing $z_1$ with $t_1$ (because the interaction test uses the Student's t-distribution), δ (the difference in means between groups) with θ (the magnitude of the interaction effect), and Φ (the cumulative distribution function of a standard normal distribution) with Ψ (the cumulative distribution function of a Student's t-distribution). Recall that $p_1$ is the proportion in the $k_1$ group and σ is the common standard deviation:

$$CP = 1 - \Psi\!\left(\frac{c_2\sqrt{n_2} - t_1\sqrt{n_1}}{\sqrt{n_2 - n_1}} - \frac{\theta\sqrt{(n_2 - n_1)\,p_1(1 - p_1)}}{2\sigma}\right) \qquad (9)$$
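A direct transcription of equation 9 (as reconstructed above) into Python; the degrees of freedom for Ψ are an assumption on our part, mirroring the N − 4 used at the final analysis:

```python
from scipy import stats

def conditional_power(t1, c2, n1, n2, theta, sigma, p1):
    """Equation 9: conditional power of the interaction test."""
    df = n2 - 4  # assumed final-analysis degrees of freedom
    drift = theta * ((n2 - n1) * p1 * (1 - p1)) ** 0.5 / (2 * sigma)
    arg = (c2 * n2**0.5 - t1 * n1**0.5) / (n2 - n1) ** 0.5 - drift
    return 1 - stats.t.cdf(arg, df)
```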
Initially $n_2 = n_t$, as conditional power is first calculated as if the sample size were not re-estimated. The values of θ, σ, and $p_1$ for the conditional power formula were estimated at the interim analysis. If the conditional power was less than 80%, then a new $n_2$ was estimated such that the conditional power was 80%, and a new final critical value, $c_2$, was calculated as a function of the original final critical value, $c_2^{orig}$, and the interim test statistic $t_1$ using the following formula:

$$c_2 = \frac{t_1\sqrt{n_1} + \left(c_2^{orig}\sqrt{n_t} - t_1\sqrt{n_1}\right)\sqrt{\dfrac{n_2 - n_1}{n_t - n_1}}}{\sqrt{n_2}} \qquad (10)$$
In equation 10 the new critical value is chosen so that the conditional probability of a type I error given $t_1$ is unchanged, and the final critical value is a function of $n_1$, $n_2$ (the new total sample size), and $n_t$ (the original total sample size). Since all values except $n_2$ are fixed, we can calculate the new critical value $c_2$ for any new final sample size $n_2$. According to Denne, this method for re-estimating the sample size maintains the overall type I error rate at α (equal to 0.05 in our case) [10]. The final sample size $n_2$ and final critical value $c_2$ were chosen so that the conditional power formula shown in equation 9 equaled 80%. If the conditional power was greater than 80% at the interim analysis, then we used the originally calculated $n_t$ as the final sample size ($n_2 = n_t$), so that the final sample size was only altered in order to increase the conditional power to 80%.
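Putting equations 9 and 10 together, the re-estimation step can be sketched as a simple search over $n_2$ (our own illustration, reusing the conditional_power function from the sketch above; a numerical root-finder could be used instead):

```python
def new_final_critical_value(t1, c2_orig, n1, n2, n_t):
    """Equation 10: final critical value after re-estimating n2."""
    w = ((n2 - n1) / (n_t - n1)) ** 0.5
    return (t1 * n1**0.5 + (c2_orig * n_t**0.5 - t1 * n1**0.5) * w) / n2**0.5

def reestimate_sample_size(t1, c2_orig, n1, n_t, theta, sigma, p1,
                           target=0.80, n_max=5000):
    """Smallest n2 (and matching c2) with conditional power >= target."""
    if conditional_power(t1, c2_orig, n1, n_t, theta, sigma, p1) >= target:
        return n_t, c2_orig  # already adequate: keep the original plan
    for n2 in range(n_t + 1, n_max + 1):
        c2 = new_final_critical_value(t1, c2_orig, n1, n2, n_t)
        if conditional_power(t1, c2, n1, n2, theta, sigma, p1) >= target:
            return n2, c2
    return n_max, new_final_critical_value(t1, c2_orig, n1, n_max, n_t)
```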
Validating the conditional power formula
To ensure that the modification of the conditional power formula (formula 9) was appropriate, we performed a validation study using simulations. For each combination of prevalence of the prognostic factor and magnitude of the interaction effect, we ran 10 trials to obtain 10 interim test statistics per combination of parameters. At the interim analysis we calculated the conditional power based on the hypothesized values of θ, σ, and $p_1$. For each trial, the second half of the trial was simulated 5,000 times to obtain the empirical conditional power. Since there were 10 different combinations of prevalence of the prognostic factor and magnitude of the interaction effect, and 10 trials for each combination, the plot contained 100 points. We generated a scatter plot of the empirical conditional power based on the 5,000 replicates against the calculated conditional power (Figure 1). Values that line up along the y = x line demonstrate that the formula provided an accurate estimate of the conditional power.
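A condensed sketch of one such check, simulating the trial's second stage on the score-statistic scale under a normal approximation (our simplification; the paper simulated raw trial data):

```python
import numpy as np

def empirical_conditional_power(t1, c2, n1, n2, theta, sigma, p1,
                                reps=5000, seed=0):
    """Monte Carlo conditional power via the second-stage score increment."""
    rng = np.random.default_rng(seed)
    info1 = n1 * p1 * (1 - p1) / (4 * sigma**2)  # interim information
    info2 = n2 * p1 * (1 - p1) / (4 * sigma**2)  # final information
    score1 = t1 * info1**0.5
    incr = rng.normal(theta * (info2 - info1), (info2 - info1) ** 0.5, reps)
    return np.mean((score1 + incr) / info2**0.5 > c2)
```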
Simulation study details
Five thousand replications were performed for each combination of the interaction effect and the proportion at level $k_1$. We first evaluated the empirical power for detecting the interaction effect without accounting for misspecification of the distribution of the prognostic factor. We varied the misspecification of the prognostic factor over −15%, −5%, 0%, +5%, and +15%. For the quota sampling method we did not vary the misspecification of the distribution of the prognostic factor because, by definition, the method does not allow for misspecification. While we did not expect the quota sampling method to have power or type I error estimates that differ from the traditional one-stage sampling design under no misspecification, we conducted the simulation study for this design to confirm there was no impact on power or type I error. For the modified quota sampling method and sample size re-estimation using conditional power we used the same misspecifications as described above.
We calculated the overall empirical power for the interaction effect for all three methods, defined as the percentage of statistically significant interaction effects across the 5,000 replicates. Empirical type I error was calculated in a similar fashion for these three methods, except that the interaction effect was set to zero while the sample sizes remained those calculated for the planned interaction effects of 15 and 5. For the sample size re-estimation method we also calculated the empirical conditional power, defined as the percentage of statistically significant interaction effects detected at the 0.05 level among trials that re-estimated the sample size. Because the sample size could change, we also calculated the mean and median final sample size for the entire procedure.
The margin of error for empirical power and type I error was calculated as the half-width of the 99% confidence interval based on a binomial distribution with a sample size of 5,000, i.e., $z_{0.995}\sqrt{p(1 - p)/5000}$. Since trials were planned with 1 − β = 0.80 and α = 0.05, this led to margins of error of 0.015 and 0.008 when assessing empirical power and type I error, respectively.
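These margins can be verified directly:

```python
from scipy.stats import norm

z = norm.ppf(0.995)  # two-sided 99% critical value
for p in (0.80, 0.05):
    print(p, round(z * (p * (1 - p) / 5000) ** 0.5, 3))
# -> 0.015 for power (p = 0.80) and 0.008 for type I error (p = 0.05)
```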