 Research
 Open Access
 Published:
Comparison of analysis methods and design choices for treatmentbyperiod interaction in unidirectional switch designs: a simulation study
BMC Medical Research Methodology volume 22, Article number: 294 (2022)
Abstract
Background
Due to identifiability problems, statistical inference about treatmentbyperiod interactions has not been discussed for stepped wedge designs in the literature thus far. Unidirectional switch designs (USDs) generalize the stepped wedge designs and allow for estimation and testing of treatmentbyperiod interaction in its many flexible design forms.
Methods
Under different forms of the USDs, we simulated binary data at both aggregated and individual levels and studied the performances of the generalized linear mixed model (GLMM) and the marginal model with generalized estimation equations (GEE) for estimating and testing treatmentbyperiod interactions.
Results
The parallel group design had the highest power for detecting the treatmentbyperiod interactions. While there was no substantial difference between aggregatedlevel and individuallevel analysis, the GLMM had better point estimates than the marginal model with GEE. Furthermore, the optimal USD for estimating the average treatment effect was not efficient for treatmentbyperiod interaction and the marginal model with GEE required a substantial number of clusters to yield unbiased estimates of the interaction parameters when the correlation structure is autoregressive of order 1 (AR1). On the other hand, marginal model with GEE had better coverages than GLMM under the AR1 correlation structure.
Conclusion
From the designs and methods evaluated, in general, parallel group design with a GLMM is, preferred for estimation and testing of treatmentbyperiod interaction in a clustered randomized controlled trial for a binary outcome.
Background
The family of unidirectional switch designs (USDs) [1, 2], which includes the parallel group design (PGD), stepped wedge design (SWD), and delayed start design (DSD) [3] and allows for unidirectional switches from a control to a new treatment, can be used as a randomized controlled trial for estimation of treatment effect. A “completepattern” USD with 5 time points is visualized in Fig. 1 (see also [3]). In general, different switching moments from control to intervention create different treatment patterns and the family of USD is a collection of designs that randomly allocate (group of) participants to some, but not necessarily all, patterns according to prespecified probabilities. Different variations of USDs utilize different patterns. For instance, a PGD can be considered as a variation of USD which only uses the pure control and treatment pattern.
Efficiencies of different USD variations, in terms of estimation and hypothesis testing for the treatment effect, have been established in literature [2,3,4,5]. These results are all derived under the assumption of a constant treatment effect across the period.
Due to the sequential nature of treatment switching, violation of a homogeneous treatment effect may cause a substantial bias in the estimation when not addressed properly. For instance, if treatment had a higher effect on the outcome at a later period, then not taking into account such interaction may lead to an overestimated overall treatment effect since the treatment is less allocated at earlier periods than later periods. Therefore, it is important to investigate the presence of timedependent treatment effects before making the assumption of a stationary treatment effect. This is usually addressed via hypothesis testing for the treatmentbyperiod interactions. However, so far little work has been published on this subject for the family of USDs (except for PGDs). Thus the main objective of this study is to investigate different analysis approaches for estimating and testing treatmentbyperiod interactions for USDs. Of course, not all variations of USD have identifiable treatmentbyperiod interaction effects (e.g., SWD) and different designs also lead to different efficiencies. Therefore, we study the analysis approaches for each variation of the USD that has estimable treatmentbyperiod interaction and compare them across different designs as well. Nevertheless, there might exist more complex forms of treatmentbyperiod interaction such as the timeontreatment effect [6] where the treatment effect depends on the time since the onset of the intervention. In this study, we focus on the simple twoway interaction with no additional modifiers.
Furthermore, we focus on binary clinical outcomes when clusters of participants are being randomized (clustered randomized trial), and we assume that at each period a new set of participants from the cluster enters the trial (crosssectional study). And we only consider “standard” designs that all clusters are being observed during the same (calender) periods [7]. We study the performances of different test statistics with simulations for different USDs. We compare statistical tests (Type III tests for the treatmentbyperiod interaction effects) from the generalized linear mixed model (GLMM) and from the marginal model with generalized estimating equations (GEE), both at aggregated and individuallevel data.
The definition of the USD will be briefly explained in Section 2 and the estimation problem of the SWD for treatmentbyperiod interaction will be illustrated. We will list the USD variations that we will study with MonteCarlo simulations. Furthermore, our simulation settings and the four analysis methods will be discussed and briefly reviewed in Section 2 as well. In Section 3, simulation results will be presented and discussed followed by a summary and discussion in the final section.
Methods
Based on the idea of the stepped wedge design, switching treatments (from control to the treatment of interest) at different periods leads to different sequences of treatment patterns, or patterns in short (See Fig. 1). USDs refer to a family of designs that consists of all or a few of these patterns. The probabilities \(p_0, p_1, \dots , p_{T}\) for each of the patterns are the probabilities of randomly allocating clusters or groups of participants to the corresponding patterns where the index represents the total number of treated periods in each pattern. By setting some of the probabilities to zero, certain patterns would be excluded from the trial. Note that we assume here (in line with literature on SWDs) that the clinical outcome is collected on each participant at the end of each period. The PGD is a USD with pure intervention and control patterns \(p_0 + p_{T}=1\) (see Fig. 1). Although no switching is present in this design, it may be viewed as a special case of the USD where switching just happened before the trial started and after the trial ended for the pure treatment and control pattern, respectively. The commonly used form of SWDs is a USD with only “middle” patterns as illustrated in Fig. 1 (\(p_0=p_{T}=0\)). The DSD is either \(p_{T} + p_s =1\), or \(p_{T} + p_s + p_0=1\), where \(s \in \{1,2,\dots ,T1\}\). Girling and Hemming’s hybrid design [2] can also be considered as a special case of the complete pattern USD where \(p_1=\cdots =p_{T1}\) and \(p_0=p_T >0\). Optimal choices for these probabilities for the estimation of the treatment effect (when no treatmentbyperiod interaction exists) has been addressed in literature [2, 3].
Identifiability of all treatmentbyperiod interaction effects requires participants with control and with treatment at each period. This is not the case for SWDs. To see why SWD is not identifiable, consider a SWD study with \(T=4\) periods and \(T1=3\) switch moments as shown in Table 1, where the population average response at each cell is expressed in terms of combinations of a general mean \(\mu _0\), the jth period effect \(\beta _j\) for \(j=1,\dots ,T\), an overall treatment effect \(\theta _0\), and a periodspecific treatment effect \(\delta _j\) as a difference with respect to the overall treatment effect at period j. There are \(2T  2\) unique cells (out of \(T^2T\) cells) but \(2T+1\) parameters are specified. Even if a certain period, say period T, is considered as the reference period which is common in most statistical software packages, and the corresponding parameter \(\beta _T\) and \(\delta _T\) is set to 0, there is still not enough degrees of freedom unless further restrictions are being made on the parameters. The way to mitigate the problem is to require designs to have both treatment arms present at all periods. In that setting, the number of unique cells is 2T (from \(T^2 + T\) cells) and there will be \(2T+2\) parameters. The restriction \(\beta _T=\delta _T =0\), which are essential and common parameter constraints, solves the identifiability problem. Indeed, in this extended design, one additional parameter, namely \(\delta _1\), is introduced but two additional unique cells are introduced as well, which compensates for the degrees of freedom deficiency. Note that Table 1 is more general than a specific parametric model with certain distributional assumptions for the outcome variables. For instance, considering an outcome \(Y_{ijk}\) of subject \(k, (k=1,\dots ,n)\) at period \(j (j=1,\dots ,T)\) from switch pattern \(i, (i=1,\dots ,T1)\), any generalized linear (mixed) model with the following specification of the marginal mean \(\mu _{ijk}=\mathbb {E}(Y_{ijk})\):
where g is any monotonic mapping between the mean and the linear predictor, \(\beta _j\) is the effect of the jth period, and \(x_{ij}\) is a treatment indicator that equals 1 when treated and 0 otherwise with corresponding timedependent treatment effect \(\theta _j = \theta _0 + \delta _j\), conforms to the specification of Table 1. Such visualization might become convenient when investigating more complicated identifiability problems such as the case of threeway interactions or treatmentbyperiod interaction with a clusterspecific fixed effect.
Choices of USD
Since it is required to have both treatment arms present at each period, designs that satisfy this condition will be considered in the simulation. Such a condition suggests choosing designs with the pure intervention (or control) pattern combined with any other patterns. We will study the uniform PGD (\(p_0=p_T =1/2\)), uniform (3pattern) DSD (\(p_0 = p_s = p_T = 1/3\)) (in short, DSDu), and the uniform (fullpattern) USD (\(p_0=p_1=\cdots =p_T=1/(T+1)\)) (USDu). We will also study the optimal DSD (DSDo), and the optimal USD optimized for the estimation of the overall treatment effect under the assumption of no treatmentbyperiod interactions for continuous outcomes. The probabilities \(p_i\)s are calculated according to [3]. For the optimal DSD they are \(p_0=(1\rho + n\rho s)/[2(1\rho +n\rho T]\), \(p_s=n\rho T/[2(1\rho + n\rho T)]\), and \(p_T = [1\rho + n\rho (Ts)]/[2(1\rho +n\rho T)]\), where \(\rho\) is the intraclass correlation coefficient, n the number of subjects per cluster per period, T the total number of periods, s the index of the “middle” switching pattern in DSD. For the optimal USD, the probabilities are given by \(p_0 = p_T = (1\rho + n\rho )/[2(1\rho + n\rho T)]\), and \(p_1 = \cdots =p_{T1} = n\rho /(1\rho + n\rho T)\). It should be noted that these optimal designs are developed for continuous outcomes with the assumption of a large number of clusters. For binary outcomes, the optimal designs are unknown and we expect the aforementioned optimal designs to remain approximately optimal for large sample sizes. Furthermore, \(\rho\) for the binary outcome is calculated differently as detailed in Section 2.3. An additional complication, in the simulation study with a finite number of clusters, is the allocation of the clusters according to the exact optimal probabilities will lead to fractional numbers of clusters. We consider 2 different rounding options for the optimal USD. Following the advice of [3], we first consider rounding options for the pure control and treatment pattern, this leads to two options (up or down). Rounding up the optimal proportions yields one unique USD (denoted as USDo1), while several options are possible when we round down. After rounding down patterns 0 and 5, we need to either round up patterns 1 and 4 or round up patterns 2 and 3 (options leading to nonsymmetric allocations are excluded). Since the asymptotic relative efficiencies for estimating the treatment effect is negligible between the two options [3], we decide to choose the former option (denoted as UDSo2), since we’d like to put more clusters to patterns that are closer to the pure treatment and control patterns. The numbers of clusters allocated to each pattern for all design candidates are listed in Table 2 (Section 3).
Brief review of analysis methods
Statistical analysis of USDs can be broken down into two general approaches. Either the data can be analysed at an aggregated level by taking summary measures at clusterperiod combinations, or the data can be analysed at an individual level. The aggregated measure is denoted by \(Y_{ij} \equiv M(Y_{ij1},\cdots ,Y_{ijn})\) with \(M(\cdot )\), for instance, the average or sum, where \(Y_{ijk}\) denotes the kth observation in cluster i of period j. The two candidate methods considered in this study can be applied either on an aggregated measure or on the individual outcomes. In the first method, namely the marginal model with GEE [8], variations between clusters are treated as nuisances and the cluster functions as the analysis unit with repeated observations (individuals) and both period and treatment enter the model as fixed effects. The focus of the analysis is on the fixed effects averaged over the clusters [9]. The second analysis method is a subjectspecific mixed effects model with clusters as random effect and period and treatment again as fixed effects [2, 10]. Generally speaking, aggregation of the outcome does not preserve the withincluster correlation structure for binary/binomial outcomes with a nonlinear link function. This is especially true for the working correlation of GEE. For instance, suppose the working correlation matrix at the individual level follows the Toeplitz correlation structure. This correlation structure is no longer true at the aggregated level. Reassuringly, this is less of a concern because the covariance is considered as nuisance parameters in GEE. Nevertheless, discussions on the relative merits of the two methods for estimation of a common treatment effect (without treatmentbyperiod interactions) are welldiscussed in literature [11,12,13,14]. In general, the marginal model does not rely on the assumption of the correlation structure and is robust against misspecification. However, when the number of clusters is small, the empirical “sandwich” estimator [15,16,17] used in the model underestimates the true (co)variances of the parameters and the Waldtype test is subject to inflated Type I error. On the other hand, the mixed effects model approach is sensitive to the specification of the covariance structure and the coefficients are more difficult to interpret on a population level.
Simulation settings
A crosssectional clustered randomized controlled trial that consists of M clusters (\(M\in \{48, 192\}\)^{Footnote 1}) in total and observed over \(T = 5\) periods with subjects per period per cluster set to \(n=50\). The outcome of interests \(Y_{ijk}\) is assumed to be a Bernoulli random variable for subject \(k\, (k=1,\dots , n)\) at period \(j\, (j=1,\dots , T)\) from cluster \(i\, (i=1,\dots , M)\). The following GLMM is used to generate the outcome \(Y_{ijk}\):
where \(\Phi\) is the cumulative distribution function of the standard normal distribution, \(\mu _0\) is a general intercept, \(b_{0i} {\mathop {\sim }\limits ^{iid}} \mathcal {N}(0, \sigma _c^2)\) a random effect of cluster i, \(\beta _{j}\) is the fixed effect of period j (\(\beta _j=0.2 (j3)\)), \(\theta _j = \theta _0 + \delta _j\) the periodspecific treatment effect at period j and with \(\theta _0 = 0.4\) the average treatment effect over periods. The \(\delta _j\) is the difference with the average treatment effect at period j, two sets of \(\delta _j\) is considered, namely a linear effect (\(\delta _j = 0.1 (j3)\)) and a symmetric effect (\(\delta _j=0.1 j3\)). Finally, \(x_{ij}\) is the treatment indicator for cluster i at period j taking the value of 1 if the cluster is allocated to the intervention, and 0 otherwise. In addition, a set of three different values \(\{0.2, 0.5, 0.8\}\) is chosen for \(\sigma _c^2\) and the ICC \(\rho\) used to calculated each optimal designs is derived as
Besides the exchangeable correlation structure used in the aforementioned data generating process, we also considered a simulation setting with an autoregressive order 1 (AR1) correlation structure between time points at the cluster level. That is, the random effect \(b_{0i}\) of cluster i is now replaced by a random effect \(b_{ij}\). This random term is assumed to be multivariate normal with mean 0 and variance equal to \(\sigma _c^2\) but now with additional correlation between two time points j and k as \({\text {corr}}(b_{ij},b_{ik}) = r^{jk}\). Here, the correlation parameter is set to \(r \in \{0.5,0.8\}\). Only aggregatedlevel analysis will be considered for AR1 since the correlation matrix at the individual level is cumbersome to specify and the true correlation structure is usually unknown in practice. Even at the aggregated level, the AR1 correlation structure specified at the latent variable level is not preserved at the outcome level and an AR1 working correlation structure will still be a misspecification for the true correlation structure. We thus only used exchangeable working correlation for GEE in this setting.
In addition, for all the simulation settings, the parameters are specified according to a subjectspecific model. To translate these regression coefficients or parameters (denoted by \(\varvec{\theta }_S\)) into marginal model parameters (denoted by \(\varvec{\theta }_M\)), the following relationship is used:
Note that this relationship is exact since we used the probit link function \(\Phi\) (i.e., the cumulative distribution function of the standard normal distribution) in combination with a normally distributed random effect for clusters in the simulation process [18].
The same GLMM model (1) was used to fit each simulated data at the individual level. For the marginal model with GEE, the following marginal mean structure was considered:
where \(\mu _M\), \(\beta _{Mj}\), \(\theta _{Mj}\) are the aforementioned marginal model parameters counterparts to the subjectspecific model parameters \(\mu _0\), \(\beta _{0j}\), and \(\theta _j\), respectively. Furthermore, an exchangeable working correlation structure was specified. On the other hand, for the aggregated analysis, the number of events \(Y_{ij} = \sum \nolimits _{k=1}^n Y_{ijk}\) per cluster and period was summarized first, and a similar GLMM model
was used. Similarly, a marginal model with GEE was used to fit the clusterperiod event counts.
Bias and coverage probabilities were summarized based on 1000 simulations. All simulations were conducted in SAS 9.4. For the marginal model with GEE, PROC GENMOD was used and coverage probabilities were based on the Waldtype confidence intervals. For GLMM, PROC GLIMMIX was used with Laplace approximation of the likelihood for estimating the parameters. The coverage probabilities were derived based on the Waldtype confidence intervals (asymptotically equivalent to the confidence interval based on tdistribution with the denominator degrees of freedom assumed to be infinite). Powers of the hypothesis test for the treatmentbyperiod interactions were calculated by the generalized score test based on empirical standard errors for the marginal model with GEE and based on the \(\chi ^2\) test for GLMM. Type I errors were calculated under the simulation setting of \(\delta _j=0\) for \(j=1, 2,\dots , 5\). The default parameterization in SAS is to take the last period as reference, we changed this to the third period, since we have chosen \(\beta _3=0\) and \(\delta _3=0\). This means that the intercept \(\mu _0\) has the interpretation of the event probability at period 3. For other variables, we followed the default parametrization in SAS. We also calculated the MonteCarlo standard error (MCSE) for the simulation results.
Results
To be economical with the presentation, most of the simulation results are deferred to the Additional files (Additional files 1, 2, 3 and 4). In this section, we will show some important findings.
Bias and coverage probability
Simulation results with regard to the bias and coverage probability are presented in Figs. 2, 3, 4, 5, 6, and 7 under the settings of \(M=48\), linear interaction effect \(\delta _j=0.1(j3)\), and \(\sigma _c^2 \in \{0.2, 0.5, 0.8\}\) with an exchangeable correlation structure. All four analysis methods estimate the model parameters without bias across all USD designs. Furthermore, GLMM had closetonominal coverage probabilities at both aggregated and individual levels, while the marginal model with GEE had lower coverage probability among all designs at both levels. The coverages were especially poor for GEE when the estimand was the period effect and the treatmentbyperiod effect at period 5. Such phenomenon was even more severe in designs with extreme unbalanced treatmentcontrol ratios at the first and last period (namely, USDo1 and USDo2). Note that Wald’s confidence interval used with GEE is wellknown to fall under the nominal level in general and it is further lowered by unbalanced samples (unequal proportions of treatment and control within the period) [19].
On the other hand, when the simulated correlation structure is AR1 (\(r=0.8)\) with \(M=48\), taking PGD (Fig. 8) and USDo2 (Fig. 9) with linear interaction effect as an example, aggregatedlevel GEE was not able to estimate certain parameters without bias and consequently had liberal coverages as well. These parameters are period effects at the first and the last period, and treatmentbyperiod effect at the last period under designs that have extremely unbalanced treatmentcontrol ratios at these periods (namely, USDo1 and USDo2). On the contrary, aggregatedlevel GLMM did not have issues with the bias of its estimates but suffered from the problem of less stable coverages. Depending on the parameter, the coverages were either liberal or conservative. Nevertheless, as shown in Fig. 10 for the optimal USDo2, the poor performances of GEE diminished when the number of clusters increased while the suboptimal coverages for the parameters remained for the aggregatedlevel GLMM analysis.
Type I error and power
The proportion of pure treatment patterns plays an important role in both Type I error and the powers of the four methods as shown in Tables 2 and 3 for the simulation setting with \(M=48\), \(r=0.8\) for AR1, and a linear interaction effect. First of all, Type I errors of the four methods became more conservative when the proportion of pure treatment patterns decreased from the highest proportion of pure treatment patterns (PGD) to the lowest proportions (USDo2). In this aspect, GEE was more sensitive to the changes in the pure treatment pattern proportions compared to GLMM as the Type I error of GEE was more conservative compared to GLMM among all designs except for the PGD. Nevertheless, no difference was observed between analysis at an aggregated level and individual level for both GEE and GLMM approaches.
Furthermore, the proportion of pure treatment patterns also had a substantial effect on the powers of the four methods. The highest power was obtained under the PGD which has the highest proportion of pure treatment patterns. The power of the four methods decreased as the proportion of pure treatment patterns decreased. In addition, GLMM always had higher power compared to GEE, especially when the betweencluster variation is large. Increases in \(\sigma _c^2\) also reduced the power of the test within the method and no differences were found between individual and aggregated level analysis for the same method. This phenomenon was less apparent when the simulated correlation structure is AR1. Type I errors and Powers, in general, were worse under AR1 correlation structure for both methods. Considering a MCSE of \(\sqrt{0.05\times 0.95 / 1000} \approx 0.007\), Type I errors for the aggregatedlevel marginal model with GEE were conservative except for PGD while GLMM had a liberal type I error for all designs. The powers of the test were low compared to the power of testing for the overall treatment effect which had more than 50% power in the worstcase scenario. The largest MCSE for the power would be equal to 0.016 when the true power is 50%. Comparing the two analysis methods, GLMM still had higher powers compared to GEE consistently across different designs. However, the power difference between the two did not increase as \(\sigma _c^2\) increased.
Similar results were observed when the interaction effect is symmetric and all simulation results for the symmetric interaction effect can be found in Additional files (Additional files 3 and 4).
Discussion
Simulation results demonstrated that the parallel group design is the best design choice among all studied candidate USDs in terms of the standard error of the estimate for treatmentbyperiod interactions which consequently results in the highest power for hypothesis testing. Furthermore, GLMM has higher power than the marginal approach with GEE across all studied designs and obtains a better coverage of the true values. Nonetheless, no difference between aggregated and individuallevel analysis was found for both GLMM and GEE. Although power comparison between different designs and analysis approaches is not ideal when not all Type I errors are maintained, the differences in the testing powers as demonstrated by the simulation results were substantial and cannot be explained entirely by the difference in Type I errors. Therefore, it is reasonable to conclude that different choices of designs will lead to a systematic difference in the power of hypothesis testing for the treatmentbyperiod interactions.
The consensus in the current literature is that USD (and its less efficient stepped wedge design variants) is more efficient compared to the PGD in terms of estimating treatment effect under the common settings when the intraclass correlation coefficient is substantial [2, 3, 20, 21]. Our simulation results demonstrate that this does not hold for the treatmentbyperiod interactions. Parallel group design is by far the most efficient design for treatmentbyperiod interaction estimations compared to other forms of the USDs. Heuristically, it is not a surprise since none of the candidate designs except for the PGD admits to a balanced treatment allocation within each period. For instance, the uniform USD with an odd number of periods has exactly one period, namely the middle period, that has balanced intervention and control arms. When the number of periods is even, no periods contains balanced treatment allocation. Furthermore, it is known that the power function of the \(\chi ^2\)test (of the treatment effect) for a variance component model with the continuous outcome varies inversely with the quantity \(\frac{1}{K_0} + \frac{1}{K_1}\) for \(K_0\) and \(K_1\) the number of clusters exposed to the control and intervention treatment, respectively [22]. Consequently, a balanced allocation ratio is an optimal solution that maximizes the power function. When it comes to testing for the treatmentbyperiod interaction, it has been shown in the present study that the allocation ratio between the control and intervention arm also plays a similar role on the power such that the parallel group design is the best design choice among all candidates.
In practice, testing for the treatmentbyperiod interactions is usually conducted after the trial completion with a certain design already chosen [23]. Our simulation results have shown that testing for the treatmentbyperiod interactions usually won’t have enough power, especially when the correlation structure is not exchangeable. The safest choice when planning a trial when the treatmentbyperiod interaction is potentially present is to use the PGD. Furthermore, design choices are frequently evaluated for the purpose of estimating the treatment effect alone and does not take into account the treatmentbyperiod interactions. This has been studied in the literature for randomized trials: the inflation factor for the interaction test to have the same power as the one for the overall treatment effect can be as large as 16 when the interaction is half the size of the overall effect [23]. We recommend also taking into account treatmentbyperiod interactions when it comes to evaluating the family of USDs.
As for analysis methods, it is preferred to use GLMM rather than GEE. For GEE, estimating the treatmentbyperiod interactions without bias, when the correlation structure is more complex than exchangeable requires a large number of clusters which may not be feasible in practice. Nevertheless, the difference in power between the GLMM and the marginal model with GEE is mainly due to the differences in the hypothesis testing approaches (i.e., the generalized score test for the marginal model with GEE and Wald’s test for the GLMM). Moreover, the estimator for the covariances of the parameter estimate (i.e., modelbased and empirical sandwich estimator) also plays an important role. In GEE, the robust sandwich estimator was used for both the generalized score test and the Wald test while in GLMM either the robust sandwich estimator or the modelbased estimator can be adopted. Additional simulation results can be found in the additional file (Additional file 5) where we show the powers of the GLMM and the marginal model with GEE at the aggregated level with AR1 correlation structure (\(\sigma _c^2=0.8, r=0.8\)) under USDo2 using different hypothesis testing approaches in combination with both modelbased and robust sandwich standard error estimators. In line with previous study [24], the robust Wald test was more liberal and had higher power than the generalized score test for the marginal model with GEE. It had similar power compared to the power of the \(\chi ^2\) test used in GLMM with the modelbased estimator. Meanwhile, using the robust estimator for GLMM parameter estimates in place of the modelbased estimator brought the power of the test down to the same level as the power of the generalized score test. Some authors have performed comparative work in this area [24, 25], but it is largely a work in progress which falls outside the scope of our presented study.
One of the limitations of the presented study is that we only considered a handful of settings and models. For instance, we did not consider the timeontreatment effects [6] which might be considered as a threeway interaction between treatment, period, and cluster. Therefore, it is important to study the problem of treatmentbyperiod interaction from a more theoretical perspective and consider different models under the family of USDs. Another limitation is that we only consider binary outcomes. For continuous and timetoevent outcomes, the impact of treatmentbyperiod interaction is yet to be investigated.
Conclusions
In the present study, we compared four analysis methods for estimating and testing the treatmentbyperiod interactions under various unidirectional switch design choices. Via Monte Carlo simulation, it has been found that the parallel group design is the most efficient in terms of estimating and testing the treatmentbyperiod interactions. Whilst no substantial difference is observed between analysis at an aggregated level and individual level, GLMM has higher efficiencies and better point estimates compared to the marginal model with GEE under different designs. Its coverages were worse than the marginal model with GEE when the correlation structure is AR1 even when the number of clusters is as high as 192.
Availability of data and materials
The datasets generated and/or analysed during the current study are available from the corresponding author on reasonable request. All SAS and R codes used in this study can be found at https://github.com/jujae/USDinteraction.
Notes
Here we picked \(M=48\) because we choose the number of clusters to be divisible by the number of patterns \(T+1 = 6\) and to make the differences between USDu, USDo1, and USDo2 substantial
Abbreviations
 USD:

Unidirectional switch design
 GLMM:

Generalized linear mixed model
 GEE:

Generalized estimating equations
 PGD:

Parallel group design
 SWD:

Stepped wedge design
 DSD:

Delayed start design
 AR1:

Autoregressive order 1
References
Zhan Z, de Bock GH, van den Heuvel ER. Statistical methods for unidirectional switch designs: Past, present, and future. Stat Methods Med Res. 2018;27(9):2872–82. https://doi.org/10.1177/0962280216689280.
Girling AJ, Hemming K. Statistical efficiency and optimal design for stepped cluster studies under linear mixed effects models. Stat Med. 2016;35(13):2149–66.
Zhan Z, de Bock GH, van den Heuvel ER. Optimal unidirectional switch designs. Statistics in medicine. 2018;37(25):3573–88.
Lawrie J, Carlin JB, Forbes AB. Optimal stepped wedge designs. Stat Probab Lett. 2015;99:210–4.
Van den Heuvel ER, Zwanenburg RJ, Van RavenswaaijArts CM. A stepped wedge design for testing an effect of intranasal insulin on cognitive development of children with PhelanMcDermid syndrome: a comparison of different designs. Stat Methods Med Res. 2017;26(2):766–75.
Hughes JP, Granston TS, Heagerty PJ. Current issues in the design and analysis of stepped wedge trials. Contemp Clin Trials. 2015;45:55–60.
Hemming K, Taljaard M, Forbes A. Analysis of cluster randomised stepped wedge trials with repeated crosssectional samples. Trials. 2017;18(1):1–11.
Zeger SL, Liang KY, Albert PS. Models for longitudinal data: a generalized estimating equation approach. Biometrics. 1988;44(4):1049–60.
Scott JM, deCamp A, Juraska M, Fay MP, Gilbert PB. Finitesample corrected generalized estimating equation of population average treatment effects in stepped wedge cluster randomized trials. Stat Methods Med Res. 2017;26(2):583–97.
Hussey MA, Hughes JP. Design and analysis of stepped wedge cluster randomized trials. Contemp Clin Trials. 2007;28(2):182–91.
Zeger SL, Liang KY. Longitudinal data analysis for discrete and continuous outcomes. Biometrics. 1986;42:121–30.
Zeger SL, Liang KY. An overview of methods for the analysis of longitudinal data. Stat Med. 1992;11(14–15):1825–39.
Molenberghs G, Verbeke G. Models for Discrete Longitudinal Data (Springer Series in Statistics). New York: Springer; 2005.
Li F, Turner EL, Preisser JS. Sample size determination for GEE analyses of stepped wedge cluster randomized trials. Biometrics. 2018;74(4):1450–8. https://doi.org/10.1111/biom.12918.
Huber PJ, et al. The behavior of maximum likelihood estimates under nonstandard conditions. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability. vol. 1. Berkeley: University of California Press; 1967. p. 221–33.
White H. A heteroskedasticityconsistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica J Econ Soc. 1980;48(4):817–38.
Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73(1):13–22.
Griswold, ME, Zeger, SL. On marginalized multilevel models and their computation. Johns Hopkins University, Dept. of Biostatistics Working Papers. 2004. Working Paper 99.
Fagerland MW, Lydersen S, Laake P. Recommended confidence intervals for two independent binomial proportions. Stat Methods Med Res. 2015;24(2):224–54.
Woertman W, de Hoop E, Moerbeek M, Zuidema SU, Gerritsen DL, Teerenstra S. Stepped wedge designs could reduce the required sample size in cluster randomized trials. J Clin Epidemiol. 2013;66(7):752–8.
Hemming K, Girling A. The efficiency of stepped wedge vs. cluster randomized trials: stepped wedge studies do not always require a smaller sample size. J Clin Epidemiol. 2013;66(12):1427–1428.
Liu X. Statistical power and optimum sample allocation ratio for treatment and control having unequal costs per unit of randomization. J Educ Behav Stat. 2003;28(3):231–48.
Brookes ST, Whitely E, Egger M, Smith GD, Mulheran PA, Peters TJ. Subgroup analyses in randomized trials: risks of subgroupspecific analyses; power and sample size for the interaction test. J Clin Epidemiol. 2004;57(3):229–36.
Guo X, Pan W, Connett JE, Hannan PJ, French SA. Smallsample performance of the robust score test and its modifications in generalized estimating equations. Stat Med. 2005;24(22):3479–95.
Thompson J, Hemming K, Forbes A, Fielding K, Hayes R. Comparison of smallsample standarderror corrections for generalised estimating equations in stepped wedge cluster randomised trials with a binary outcome: a simulation study. Stat Methods Med Res. 2021;30(2):425–39.
Acknowledgements
The authors would like to thank the anonymous reviewers for making excellent suggestions to improve our article.
Funding
Not applicable.
Author information
Authors and Affiliations
Contributions
GHdB and ERvdH conceived the study and all authors contributed to its design. ZZ designed and wrote the computer code and ran and analysed the simulations. ZZ wrote the first draft of the manuscript, with contributions from GHdB and ERvdH. The author(s) read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1.
Simulation results (Point estimates, Coverage probabilities, Type I error, and Power) for linear interaction effects with \(M=48\).
Additional file 2.
Simulation results (Point estimates, Coverage probabilities, Type I error, and Power) for linear interaction effects with \(M=192\).
Additional file 3.
Simulation results (Point estimates, Coverage probabilities, Type I error, and Power) for symmetric interaction effects with \(M=48\).
Additional file 4.
Simulation results (Point estimates, Coverage probabilities, Type I error, and Power) for symmetric interaction effects with \(M=192\).
Additional file 5.
Simulation results (Power) for linear interaction effects with \(M=48\) of GLMM and GEE at aggregated level with AR1 correlation structure \(\sigma_{c}^{2}=0.8,\ r=0.8\) under USDo2 with different hypothesis testing methods and with different estimators for the standard error of the parameter estimates.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Zhan, Z., de Bock, G.H. & van den Heuvel, E.R. Comparison of analysis methods and design choices for treatmentbyperiod interaction in unidirectional switch designs: a simulation study. BMC Med Res Methodol 22, 294 (2022). https://doi.org/10.1186/s12874022017659
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12874022017659
Keywords
 Stepped wedge design
 Hybrid design
 Clustered randomized controlled trial
 Power analysis
 Monte Carlo simulation