Statistical analysis of two arm randomized pre-post designs with one post-treatment measurement

Background Randomized pre-post designs, with outcomes measured at baseline and after treatment, have been commonly used to compare the clinical effectiveness of two competing treatments. There is a vast, but often conflicting, amount of information in the current literature about the best analytic methods for pre-post designs, and it is challenging for applied researchers to make an informed choice. Methods We discuss six methods commonly used in the literature: one-way analysis of variance (“ANOVA”), analysis of covariance main effect and interaction models on the post-treatment score (“ANCOVA I” and “ANCOVA II”), ANOVA on the change score between the baseline and post-treatment scores (“ANOVA-Change”), and repeated measures (“RM”) and constrained repeated measures (“cRM”) models on the baseline and post-treatment scores as joint outcomes. We review a number of study endpoints in randomized pre-post designs and identify the mean difference in the post-treatment score as the common treatment effect that all six methods target. We delineate the underlying differences and connections between these competing methods in homogeneous and heterogeneous study populations. Results ANCOVA and cRM outperform the alternative methods because their treatment effect estimators have the smallest variances. cRM has comparable performance to ANCOVA I in the homogeneous scenario and to ANCOVA II in the heterogeneous scenario. Nevertheless, ANCOVA has several advantages over cRM: i) the baseline score is adjusted for as a covariate because it is not an outcome by definition; ii) it is very convenient to incorporate other baseline variables and easy to handle complex heteroscedasticity patterns in a linear regression framework. Conclusions ANCOVA is a simple and the most efficient approach for analyzing pre-post randomized designs. Supplementary Information The online version contains supplementary material available at 10.1186/s12874-021-01323-9.


Introduction
The two-arm parallel randomized design has been widely used to compare the clinical effectiveness of competing treatments in improving patients' health outcomes. In these trials, continuous outcomes of interest are routinely measured at baseline and at one follow-up time point after treatment. Common statistical methods used in analyzing pre-post designs include: the one-way analysis of variance model (ANOVA) [1], the analysis of covariance model adjusting for the baseline measurement (ANCOVA I) [2][3][4][5], ANCOVA including the baseline measurement by treatment interaction (ANCOVA II) on post-treatment measurements [2, 3], ANOVA on the change score from baseline (ANOVA-Change) [6, 7], and repeated measures models (RM) and constrained repeated measures models (cRM) on the baseline and post-treatment measurements as joint outcomes [8][9][10]. Despite the simplicity and wide application of pre-post designs, which method is the best analytic approach has long been debated, and many methodological studies have compared different statistical methods over the past decades. However, it is challenging for applied researchers to evaluate this vast, but often conflicting, amount of information in the current literature and make an informed choice.
The primary purpose of designing a pre-post randomized study is to answer the scientific question of interest: is treatment A more effective than treatment B? To assess the difference in the treatment effectiveness between two treatments, we need to select a study endpoint and quantify a treatment effect.
Common study endpoints include the post-treatment score, the change score from the baseline measurement to the post-treatment measurement, the percent change from baseline, and the rate of change from baseline. The difference between the two arms on the selected study endpoint is called the treatment effect. Few studies examine the links between these different metrics of treatment effect. These underlying connections are critical in understanding the equivalence among some statistical methods that may appear very different at first sight [2]. We need to be certain about the type of treatment effect each method targets and select the one that yields an unbiased and the most efficient estimator of the treatment effect of interest.
In this study we aim to review six commonly used methods (ANOVA, ANCOVA I, ANCOVA II, ANOVA-Change, RM, cRM) from a practical standpoint, and we focus on delineating the differences and underlying connections between these methods. In section (2), we provide notation and assumptions for a typical pre-post design, define homogeneous and heterogeneous study populations, and discuss common study endpoints and the associated treatment effects. In section (3), we first outline these competing models using the same set of population mean, variance, and covariance parameters, and then assess the differences and links between them in homogeneous and heterogeneous scenarios. In section (4), we present three simulated weight loss trial data examples (homogeneous data, heterogeneous data with a balanced design, heterogeneous data with an unbalanced design) to exemplify the differences and links between the statistical methods. In the last section, we discuss the results and give a recommendation on the best analytical approach in pre-post designs.

Notations
In a hypothetical two-arm parallel weight loss trial comparing the effect of a new drug ("treatment") and a placebo ("control") in reducing participants' weights, we use $Y_{ijt}$ to denote the weight of the $i$th patient ($i = 1, 2, 3, \ldots, n_j$) in treatment arm $j$ ($j = 0, 1$) at equally spaced time point $t$ ($t = t_0, t_1$). $n_0$ and $n_1$ are the numbers of subjects in the control and treatment arms.
We denote the mean baseline weights for the treatment and control arms by $\mu_{1t_0}$ and $\mu_{0t_0}$, respectively.
It is reasonable to assume $\mu_{1t_0} = \mu_{0t_0}$ under randomization, and we let $\mu_{t_0}$ denote the common baseline mean weight. Mean weights at the follow-up time point $t_1$ in the treatment and control arms are denoted by $\mu_{1t_1}$ and $\mu_{0t_1}$, respectively (Figure 1). We define homogeneous and heterogeneous study populations as follows: i) The homogeneous scenario: every subject, in either the treatment or control arm, has the same variance and covariance structure for their baseline and follow-up weights. In the heterogeneous scenario, the variability of the post-treatment measurement tends to be larger in the treatment arm than in the control arm because participants may respond to the treatment more differently, i.e., $\rho_0 > \rho_1$ and $\sigma_{11}^2 > \sigma_{01}^2$.

Metrics of treatment effect
We discuss the following three commonly reported metrics of treatment effect in pre-post trials:
i) The primary endpoint is weight at the follow-up time point $t_1$. The difference in mean weights at $t_1$ between two arms is the treatment effect: $\tau = \mu_{1t_1} - \mu_{0t_1}$. For example, if $\tau = -10$, we can interpret the result as "at the end of the trial, the mean weight was 10 pounds lower in the treatment group than in the control group."
ii) The primary endpoint is the change score, calculated by subtracting the baseline weight from the follow-up weight: $\Delta_{ij} = Y_{ijt_1} - Y_{ijt_0}$. The mean difference in change scores between two arms is the treatment effect. Formally, we have $\tilde{\tau} = (\mu_{1t_1} - \mu_{1t_0}) - (\mu_{0t_1} - \mu_{0t_0})$. For example, if $\tilde{\tau} = -10$, this difference is normally interpreted as "weight reductions were 10 pounds greater on treatment than control." Since $\mu_{1t_0} = \mu_{0t_0}$ due to randomization, it follows directly that $\tilde{\tau} = \tau$. Furthermore, when we code "0" for the baseline time point ($t_0 = 0$) and "1" for the follow-up time point ($t_1 = 1$), the mean change score for each arm can also be interpreted as the mean rate of change per unit time for that arm, represented by the slopes in Figure 1. Thus, the difference in slopes, denoted by $\tilde{\tau}_s = s_1 - s_0$, is also equivalent to $\tau$. As we will discuss in detail in section (3), ANOVA and ANCOVA target $\tau$, ANOVA-Change targets $\tilde{\tau}$, and RM targets $\tilde{\tau}_s$. We can compare these seemingly very different statistical methods in a meaningful way because of the equivalence between $\tau$, $\tilde{\tau}$, and $\tilde{\tau}_s$ in randomized pre-post designs.
iii) The primary endpoint is the percent change from baseline weight, denoted by $P_{ij} = 100 \times (Y_{ijt_1} - Y_{ijt_0}) / Y_{ijt_0}$. The treatment effect is the mean difference in percent change between two arms, formally defined as $\bar{P}_1 - \bar{P}_0$, where $\bar{P}_1$ and $\bar{P}_0$ are the mean percent changes in the treatment and control arms. Although percent change is popular among clinical researchers, this metric has several drawbacks [11][12][13].
First, percent change is a function of the ratio of the follow-up weight to the baseline weight.
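The equivalence of the first two metrics and the slope difference under randomization can be checked numerically. The following is a minimal sketch, not part of the original analysis: the function name `simulate_metrics`, the bivariate normal model, and the parameter values (borrowed from the numerical illustration) are assumptions made for illustration only.

```python
import numpy as np


def simulate_metrics(n_per_arm=100_000, seed=0):
    """Estimate tau, tau-tilde and the slope difference from one large simulated trial."""
    rng = np.random.default_rng(seed)
    # Assumed (illustrative) parameters: common baseline mean 88, SDs 14 and 15,
    # pre-post correlation 0.9, follow-up means 86 (control) and 83 (treatment).
    cov = [[14.0**2, 0.9 * 14.0 * 15.0], [0.9 * 14.0 * 15.0, 15.0**2]]
    control = rng.multivariate_normal([88.0, 86.0], cov, size=n_per_arm)
    treated = rng.multivariate_normal([88.0, 83.0], cov, size=n_per_arm)

    # i) difference in mean post-treatment weight
    tau = treated[:, 1].mean() - control[:, 1].mean()

    # ii) difference in mean change score (follow-up minus baseline)
    change_c = control[:, 1] - control[:, 0]
    change_t = treated[:, 1] - treated[:, 0]
    tau_change = change_t.mean() - change_c.mean()

    # slope per unit time when baseline is coded 0 and follow-up is coded 1
    t0, t1 = 0.0, 1.0
    tau_slope = change_t.mean() / (t1 - t0) - change_c.mean() / (t1 - t0)
    return tau, tau_change, tau_slope


tau, tau_change, tau_slope = simulate_metrics()
print(round(tau, 2), round(tau_change, 2), round(tau_slope, 2))
```

With a large sample, all three quantities should be close to the assumed true effect of -3 kg. The change-score and slope versions coincide exactly because the time interval is coded as one unit.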

Statistical methods
In sections (3.1) and (3.2), we focus on six methods aiming to estimate $\tau$. We describe each statistical model using the same set of population mean, variance, and covariance parameters defined in section (2) for the homogeneous and heterogeneous scenarios separately. We present the closed-form expressions of each point estimator of the treatment effect and its variance.
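We fit an ordinary least squares (OLS) regression to estimate the coefficients and standard errors for model (1). The closed-form expressions of the OLS estimator $\hat{\beta}_{1,OLS}^{(1)}$ and its variance, denoted by $Var(\hat{\beta}_{1,OLS}^{(1)})$, are presented in Table 1. $\hat{\beta}_{1,OLS}^{(1)}$ is estimated by the sample group mean difference in post-treatment weight between two arms and is unbiased for $\tau$. The OLS model-based variance of $\hat{\beta}_{1,OLS}^{(1)}$ assuming known $\sigma_1^2$ is $\sigma_1^2 \left( \frac{1}{n_0} + \frac{1}{n_1} \right)$. Since $\sigma_1^2$ is generally unknown, it is estimated by the sample residual variance $\hat{\sigma}_1^2 = \sum_{j} \sum_{i} (Y_{ijt_1} - \hat{Y}_{ijt_1}^{(1)})^2 / (n_0 + n_1 - 2)$, where $\hat{Y}_{ijt_1}^{(1)} = \hat{\beta}_{0,OLS}^{(1)} + \hat{\beta}_{1,OLS}^{(1)} G_{ij}$ is the fitted value from model (1) and $G_{ij}$ is the binary treatment indicator. $\hat{\sigma}_1^2$ is unbiased for $\sigma_1^2$ in a homoscedastic linear model. Thus, the OLS model-based variance estimator $\widehat{Var}_{OLS}(\hat{\beta}_{1,OLS}^{(1)})$ is unbiased for $Var(\hat{\beta}_{1,OLS}^{(1)})$. The usual OLS model-based inference (i.e., the test statistic $\hat{\beta}_{1,OLS}^{(1)} / \sqrt{\widehat{Var}_{OLS}(\hat{\beta}_{1,OLS}^{(1)})}$ and the associated p-value) is valid for testing $H_0: \tau = 0$ unconditionally.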
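We fit an OLS regression to estimate the regression coefficients and standard errors for model (2). The closed-form expression of the OLS estimator $\hat{\beta}_{1,OLS}^{(2)}$ is presented in Table 1. Since $\sigma_{\varepsilon}^{(2)2}$ is generally unknown, it is estimated by the sample residual variance $\hat{\sigma}_{\varepsilon}^{(2)2} = \sum_{j} \sum_{i} (Y_{ijt_1} - \hat{Y}_{ijt_1}^{(2)})^2 / (n_0 + n_1 - 3)$, where $\hat{Y}_{ijt_1}^{(2)} = \hat{\beta}_{0,OLS}^{(2)} + \hat{\beta}_{1,OLS}^{(2)} G_{ij} + \hat{\beta}_{2,OLS}^{(2)} Y_{ijt_0}$ is the predicted value from fitted model (2). We let $\widehat{Var}_{OLS}(\hat{\beta}_{1,OLS}^{(2)} \mid Y_{ijt_0})$ denote the OLS model-based variance estimator with the estimator $\hat{\sigma}_{\varepsilon}^{(2)2}$.

Conditional inference based on this estimator is equivalent to unconditional inference based on $Var(\hat{\beta}_{1,OLS}^{(2)})$, in which $Y_{ijt_0}$ is treated as a random variable, for testing $H_0: \tau = 0$. To establish this equivalence, we need to show that the unconditional variance $Var(\hat{\beta}_{1,OLS}^{(2)})$ is the average of its conditional variance over the distribution of the baseline weight measurement.
Therefore, the usual model-based standard errors and associated p-values are valid for unconditional inference [2, 4].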
We fit a generalized least squares (GLS) model with correlated outcomes to estimate the coefficients and standard errors for model (3). The closed-form expressions of the GLS estimator of the treatment effect and its variance are presented in Table 1.

Formally, we model baseline and follow-up weights ($Y_{ijt_0}$, $Y_{ijt_1}$) jointly using the binary time factor $T_{ij}$ and the time by treatment interaction $G_{ij} \times T_{ij}$ in cRM model (4).

We model the change score $Y_{ijt_1} - Y_{ijt_0}$ using the binary treatment indicator $G_{ij}$ as follows: $Y_{ijt_1} - Y_{ijt_0} = \beta_0^{(5)} + \beta_1^{(5)} G_{ij} + \varepsilon_{ij}^{(5)}$ (5). The OLS estimator $\hat{\beta}_{1,OLS}^{(5)}$ is derived as the sample mean difference in weight change between two arms ("difference in difference") and is unbiased for $\tau$. The residual variance $\sigma_{\varepsilon}^{(5)2}$ is estimated by the sample residual variance computed from the fitted values of OLS model (5). We let $\widehat{Var}_{OLS}(\hat{\beta}_{1,OLS}^{(5)})$ denote the OLS model-based variance estimator with the estimator $\hat{\sigma}_{\varepsilon}^{(5)2}$, which is output by standard statistical software (Table 1).

Methods comparison
All treatment effect estimators, except the ANOVA estimator, are expressed as the mean difference in post-treatment measurements adjusted in some way for the chance imbalance in baseline measurement between two arms. Nonetheless, all estimators are unbiased for $\tau$. To compare these competing methods, we evaluate the efficiency of the point estimators of treatment effect by comparing their "unconditional" variances. Hypothesis testing of no treatment effect is based on dividing the point estimator by its standard error (i.e., the square root of its estimated variance) and rejecting the null hypothesis when this ratio exceeds a given threshold. The method that produces an unbiased point estimate with the smallest unconditional variance is therefore preferred, because the standard error in the denominator of the test statistic determines the statistical power.

When study population is homogeneous
This advantage of ANCOVA over ANOVA can also be observed from the fact that the residual error variance of ANCOVA I is no greater than the residual error variance of ANOVA (i.e., $(1 - \rho^2)\sigma_1^2 \le \sigma_1^2$). As the correlation coefficient $\rho$ becomes larger, the ANCOVA I estimator has smaller variance. Since $Y_{ijt_1}$ and $Y_{ijt_0}$ are highly correlated in general, the inclusion of $Y_{ijt_0}$ in ANCOVA I explains away some variability in $Y_{ijt_1}$ and thus reduces the residual variance and yields a more efficient estimator of treatment effect than ANOVA.
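The variance reduction factor $(1 - \rho^2)$ can be illustrated with a small Monte Carlo experiment. This is a sketch under assumed settings (bivariate normal weights, $\rho = 0.9$, 90 subjects per arm, 2000 replications); the function names `ancova1_estimate` and `compare_variances` are hypothetical and not taken from the paper's SAS programs.

```python
import numpy as np


def ancova1_estimate(y0, y1, g):
    """OLS coefficient of the treatment indicator g in the model y1 ~ 1 + g + y0."""
    X = np.column_stack([np.ones_like(y0), g, y0])
    beta, *_ = np.linalg.lstsq(X, y1, rcond=None)
    return beta[1]


def compare_variances(n0=90, n1=90, rho=0.9, s0=14.0, s1=15.0, reps=2000, seed=1):
    """Empirical unconditional variances of the ANOVA and ANCOVA I estimators."""
    rng = np.random.default_rng(seed)
    cov = [[s0**2, rho * s0 * s1], [rho * s0 * s1, s1**2]]
    g = np.r_[np.zeros(n0), np.ones(n1)]
    anova, ancova = [], []
    for _ in range(reps):
        # A fresh baseline is drawn in every replication, so the spread of the
        # estimates reflects the unconditional (not conditional) variance.
        c = rng.multivariate_normal([88.0, 86.0], cov, size=n0)
        t = rng.multivariate_normal([88.0, 83.0], cov, size=n1)
        y0 = np.r_[c[:, 0], t[:, 0]]
        y1 = np.r_[c[:, 1], t[:, 1]]
        anova.append(t[:, 1].mean() - c[:, 1].mean())
        ancova.append(ancova1_estimate(y0, y1, g))
    return np.var(anova, ddof=1), np.var(ancova, ddof=1)


v_anova, v_ancova = compare_variances()
print("variance ratio ANCOVA I / ANOVA:", v_ancova / v_anova)
```

Under these assumptions the ratio should be near $1 - \rho^2 = 0.19$, up to simulation error and a small finite-sample inflation.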
ANOVA-Change and RM have exactly the same point estimators of $\tau$ and the same variances (Table 1). To compare ANOVA-Change or RM with ANOVA, we can derive the difference between their variances as follows: $\Delta = \left( \frac{1}{n_0} + \frac{1}{n_1} \right) \sigma_0^2 \left( 1 - 2\rho \frac{\sigma_1}{\sigma_0} \right)$. As shown in Table 1, the ANCOVA I and cRM estimators of $\tau$ are equivalent because $\beta_{1,OLS}^{(2)} = \gamma_{3,GLS}^{(4)}$.

When study population is heterogeneous
A heterogeneous study population justifies the inclusion of a treatment by baseline weight interaction term. Thus, ANCOVA II is the correctly specified model, whereas ANCOVA I is a mis-specified model.
In this case, the treatment effect is not assumed to be constant across different values of baseline weight; the effect at a given baseline weight is known as the conditional treatment effect. The ("marginal") treatment effect $\tau$ is simply the average of the conditional treatment effect over the distribution of baseline weight and measures an overall treatment effect. As shown in section (3.2), both ANCOVA models can be used to estimate $\tau$ even though ANCOVA I is mis-specified.

Numerical illustration
We simulated three weight loss trial data sets based on a published study [18] for three scenarios: homogeneous data, heterogeneous data with a balanced design, and heterogeneous data with an unbalanced design, as follows: 1) The baseline weights for the control and treatment arms are generated from a normal distribution with mean 88 kg and standard deviation 14 kg. Weights at 6 months after treatment in the control arm have mean 86 kg and standard deviation 15 kg, a change of about 2.3% from baseline. The mean and standard deviation of weights at 6 months in the treatment arm are 83 kg and 15 kg, respectively, corresponding to a 5.7% change.
2) In homogeneous data, the correlation coefficient between baseline and follow-up weights is 0.9, and 180 participants were assigned equally to the treatment and control arms. In heterogeneous data, the correlation coefficient between the baseline and follow-up weights is 0.9 in the control arm and 0.7 in the treatment arm.
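The homogeneous scenario described above can be sketched in Python as follows. This is an illustrative reimplementation, not the authors' SAS code: the bivariate normal generator, the seed, and the helper names `ols`, `simulate_homogeneous`, and `analyze` are assumptions, and the numbers it prints will not reproduce the published tables.

```python
import numpy as np


def ols(X, y):
    """Return OLS coefficients and model-based standard errors."""
    n, p = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)  # residual variance estimate
    return beta, np.sqrt(sigma2 * np.diag(xtx_inv))


def simulate_homogeneous(n_per_arm=90, rho=0.9, seed=2021):
    """Baseline ~ N(88, 14^2); follow-up mean 86 (control) or 83 (treatment), SD 15."""
    rng = np.random.default_rng(seed)
    cov = [[14.0**2, rho * 14.0 * 15.0], [rho * 14.0 * 15.0, 15.0**2]]
    c = rng.multivariate_normal([88.0, 86.0], cov, size=n_per_arm)
    t = rng.multivariate_normal([88.0, 83.0], cov, size=n_per_arm)
    g = np.r_[np.zeros(n_per_arm), np.ones(n_per_arm)]
    return g, np.r_[c[:, 0], t[:, 0]], np.r_[c[:, 1], t[:, 1]]


def analyze(g, y0, y1):
    """Treatment effect estimate and model-based SE for three of the six methods."""
    ones = np.ones_like(g)
    designs = {
        "ANOVA": (np.column_stack([ones, g]), y1),
        "ANOVA-Change": (np.column_stack([ones, g]), y1 - y0),
        "ANCOVA I": (np.column_stack([ones, g, y0]), y1),
    }
    results = {}
    for name, (X, y) in designs.items():
        beta, se = ols(X, y)
        results[name] = (beta[1], se[1])  # index 1 is the treatment indicator
    return results


g, y0, y1 = simulate_homogeneous()
for name, (est, se) in analyze(g, y0, y1).items():
    print(f"{name}: estimate={est:.3f}, s.e.={se:.3f}")
```

RM and cRM are omitted here because they require a GLS or mixed-model fit with correlated errors, which goes beyond this short sketch.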

Discussion
In this study we compare the efficiency of six unbiased methods for analyzing pre-post designs. Because we can seldom control the values of the baseline measurement, the assumption that the baseline measurement is fixed, as required by OLS, casts doubt on the validity of ANCOVA for hypothesis testing [5, 10]. Crager proved that ANCOVA I is valid for unconditional inference in the homogeneous scenario [5]. This conclusion follows from the fact that the conditional variance of the ANCOVA I estimator is an unbiased estimate of its unconditional variance [2].
A few studies further investigated the heterogeneous scenario [2, 3, 9, 10, 19]. Choosing between ANCOVA I and II then becomes an evaluation of a trade-off between simplicity and some gains in efficiency.
In the homogeneous setting, cRM was suggested as a superior choice to ANCOVA I because the unconditional variance of the cRM estimator is smaller than the conditional variance of the ANCOVA I estimator [21]. Kenward et al. pointed out that such a direct comparison between conditional and unconditional variances is not meaningful. Both estimators are equivalent, and cRM based on REML with the Kenward-Roger adjustment performed almost identically to ANCOVA I in finite samples [15]. In the heterogeneous scenario, cRM is comparable to ANCOVA II [2]. In the presence of missing data, applied researchers often prefer cRM over ANCOVA because it can use all observed data, whereas ANCOVA uses only complete cases. However, imputation methods that exploit the strong pre-post correlation, such as weighting and regression imputation, can improve the statistical power of ANCOVA without biasing estimates, making it comparable to cRM [15].
Furthermore, ANCOVA has several advantages over cRM. First, an outcome should only be a variable that can be influenced by treatment. The baseline measurement is certainly not an outcome by this definition.
It is conceptually more appropriate to adjust for the baseline measurement as a covariate than to model it as an outcome [4]. Second, it is very convenient to include other baseline variables in a regression model for more efficient estimates of the treatment effect. Third, it is easy to adjust for other patterns of heteroscedastic errors in an OLS regression. For example, we may expect larger variability in post-treatment weight associated with larger baseline weight. cRM cannot handle this more complex type of heteroscedasticity easily. Heteroscedasticity-consistent (HC) variance estimators for ANCOVA are simple fixes and are readily implemented in standard statistical software.
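To make the HC correction concrete, the sketch below computes HC2 standard errors alongside the usual model-based ones for an arbitrary OLS design matrix. It is a generic numpy illustration of the standard HC2 sandwich formula, not the SAS implementation; the function names `ols_with_hc2` and `demo` and the heteroscedastic data in `demo` (residual standard deviations of 6 and 11, 60 versus 120 subjects) are hypothetical.

```python
import numpy as np


def ols_with_hc2(X, y):
    """Return OLS coefficients, model-based SEs, and HC2 sandwich SEs."""
    n, p = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    # Leverage of each observation: h_ii = x_i' (X'X)^{-1} x_i
    h = np.einsum("ij,jk,ik->i", X, xtx_inv, X)
    # Usual model-based covariance assumes one common residual variance
    model_cov = (resid @ resid / (n - p)) * xtx_inv
    # HC2 rescales each squared residual by 1 / (1 - h_ii)
    w = resid**2 / (1.0 - h)
    meat = X.T @ (X * w[:, None])
    hc2_cov = xtx_inv @ meat @ xtx_inv
    return beta, np.sqrt(np.diag(model_cov)), np.sqrt(np.diag(hc2_cov))


def demo(seed=3):
    """ANCOVA I on hypothetical unbalanced data with arm-specific residual SDs."""
    rng = np.random.default_rng(seed)
    n0, n1 = 60, 120
    g = np.r_[np.zeros(n0), np.ones(n1)]
    y0 = rng.normal(88.0, 14.0, n0 + n1)
    noise_sd = np.where(g == 1, 11.0, 6.0)  # assumed: noisier treatment arm
    y1 = -2.0 + y0 - 3.0 * g + rng.normal(0.0, 1.0, n0 + n1) * noise_sd
    X = np.column_stack([np.ones_like(g), g, y0])
    beta, se_model, se_hc2 = ols_with_hc2(X, y1)
    return beta[1], se_model[1], se_hc2[1]


print(demo())
```

A useful sanity check on the formula: for the ANOVA design with only an intercept and a treatment indicator, the HC2 standard error of the treatment coefficient coincides with the unequal-variance (Welch) standard error of a difference in means.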

$\sigma_0^2$ and $\sigma_1^2$ are the variances of weight measurements at baseline and follow-up, and $\rho$ is the correlation coefficient between the baseline and follow-up weights. ii) The heterogeneous scenario: variance and covariance structures of the baseline and follow-up weights differ between the treatment and control arms. The variance and covariance matrices for the control and treatment arms are $\Sigma_0 = \begin{pmatrix} \sigma_0^2 & \rho_0 \sigma_0 \sigma_{01} \\ \rho_0 \sigma_0 \sigma_{01} & \sigma_{01}^2 \end{pmatrix}$ and $\Sigma_1 = \begin{pmatrix} \sigma_0^2 & \rho_1 \sigma_0 \sigma_{11} \\ \rho_1 \sigma_0 \sigma_{11} & \sigma_{11}^2 \end{pmatrix}$, where $\sigma_0^2$ is the common variance of weights at baseline for both control and treatment arms. Both arms have the same baseline variance because of randomization at baseline. $\sigma_{01}^2$ and $\sigma_{11}^2$ are the variances of weight measurements at follow-up in the control and treatment arms. $\rho_0$ and $\rho_1$ are the correlation coefficients between weight measurements at baseline and follow-up in the control and treatment arms, respectively. In practice, the correlation between the pre- and post-treatment measurements is usually stronger in the control arm than in the treatment arm.

$\hat{\beta}_{1,OLS}^{(2)}$ is derived as the sample mean difference in post-treatment weight adjusting for the difference in sample mean baseline weight between two arms. The mean difference in baseline weight between two arms can be seen as chance imbalance in a randomized trial. $\hat{\beta}_{1,OLS}^{(2)}$ is unbiased for $\tau$ both conditional on $Y_{ijt_0}$ and unconditionally. The formulas of $\hat{\beta}_{1,OLS}^{(2)}$ and its variance are listed in Table 1. However, OLS assumes that the adjusted baseline weight $Y_{ijt_0}$ is fixed. OLS targets the conditional variance $Var(\hat{\beta}_{1,OLS}^{(2)} \mid Y_{ijt_0})$ with known common residual variance $\sigma_{\varepsilon}^{(2)2}$.

In RM model (3), $\gamma_0^{(3)}$ represents the mean baseline weight for the control arm, $\gamma_1^{(3)}$ represents the difference in mean baseline weights between the treatment and control arms, $\gamma_2^{(3)}$ represents the mean change from the baseline weight in the control arm, and $\gamma_3^{(3)}$ is generally interpreted as the difference in the mean change from baseline weight per unit time interval between the treatment and control arms.

In ANCOVA II model (6), $\varepsilon_{i0}^{(6)}$ and $\varepsilon_{i1}^{(6)}$ are i.i.d. random errors in the control and treatment arms. $\beta_1^{(6)}$ is the parameter associated with the treatment effect. $\beta_2^{(6)}$ is the regression slope of baseline weight in the control arm. $\beta_3^{(6)}$ measures the difference in regression slopes of baseline weight between the treatment and control arms. Model (6) is heteroscedastic because the error terms in the treatment and control arms have different residual variances. The OLS estimator $\hat{\beta}_{1,OLS}^{(6)}$ is the adjusted mean difference in post-treatment weights controlling for a weighted mean difference of baseline weights between two arms, with unequal weights for the treatment and control arms (i.e., $\hat{\beta}_{2,OLS}^{(6)} + \hat{\beta}_{3,OLS}^{(6)}$ for the treatment group, and $\hat{\beta}_{2,OLS}^{(6)}$ for the control group).

In ANCOVA I model (7), $\beta_1^{(7)} = \tau$. $\varepsilon_{i0}^{(7)}$ and $\varepsilon_{i1}^{(7)}$ are random errors in the control and treatment arms. Since $\varepsilon_{i0}^{(7)}$ and $\varepsilon_{i1}^{(7)}$ have different variances in general, model (7) is heteroscedastic, and the severity of heteroscedasticity is determined by the correlation coefficients, the variances of post-treatment measurements, and whether the design is balanced. The OLS estimator $\hat{\beta}_{1,OLS}^{(7)}$ is an adjusted mean difference in post-treatment weights controlling for a weighted mean difference of baseline measurements between two arms, with equal weights for the treatment and control arms (i.e., $\hat{\beta}_{2,OLS}^{(7)}$ for both arms).

When $\rho < \sigma_0 / (2\sigma_1)$, $\Delta > 0$ and ANOVA outperforms ANOVA-Change and RM because the ANOVA estimator has smaller variance. When $\rho > \sigma_0 / (2\sigma_1)$, $\Delta < 0$ and ANOVA underperforms the other two methods. It can be shown that the difference between the ("unconditional") variances of the ANOVA-Change or RM estimators and those of the ANCOVA I or cRM estimators is always nonnegative: $\left( \frac{1}{n_0} + \frac{1}{n_1} \right) (\sigma_0 - \rho\sigma_1)^2 \ge 0$. Thus, ANOVA-Change or RM is less efficient than either ANCOVA I or cRM because their estimators have larger variances. Intuitively, ANCOVA I or cRM assumes that the two arms have the same mean baseline weight in a randomized study, whereas ANOVA-Change or RM assumes there is a baseline difference and needs to estimate one extra parameter.
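As a worked check of these comparisons, the variances can be evaluated at the parameter values used in the homogeneous numerical illustration ($\sigma_0 = 14$, $\sigma_1 = 15$, $\rho = 0.9$, $n_0 = n_1 = 90$, so $1/n_0 + 1/n_1 = 1/45$). The expressions below are the standard variances under the homogeneous covariance structure; the ANCOVA I / cRM line is taken as the large-sample expression, which is an assumption of this sketch.

```latex
\begin{align*}
\mathrm{Var}_{\text{ANOVA}} &= \sigma_1^2\left(\tfrac{1}{n_0}+\tfrac{1}{n_1}\right)
  = \tfrac{225}{45} = 5.000,\\
\mathrm{Var}_{\text{Change/RM}} &= \left(\sigma_0^2+\sigma_1^2-2\rho\sigma_0\sigma_1\right)\left(\tfrac{1}{n_0}+\tfrac{1}{n_1}\right)
  = \tfrac{196+225-378}{45} = \tfrac{43}{45} \approx 0.956,\\
\mathrm{Var}_{\text{ANCOVA I/cRM}} &\approx \left(1-\rho^2\right)\sigma_1^2\left(\tfrac{1}{n_0}+\tfrac{1}{n_1}\right)
  = \tfrac{0.19 \times 225}{45} = \tfrac{42.75}{45} = 0.950,\\
\mathrm{Var}_{\text{Change/RM}} - \mathrm{Var}_{\text{ANCOVA I/cRM}} &= \left(\sigma_0-\rho\sigma_1\right)^2\left(\tfrac{1}{n_0}+\tfrac{1}{n_1}\right)
  = \tfrac{(14-13.5)^2}{45} \approx 0.006 \ge 0.
\end{align*}
```

Here $\rho = 0.9$ exceeds the threshold $\sigma_0 / (2\sigma_1) = 14/30 \approx 0.467$, so the change-score variance is far below the ANOVA variance, while the additional gain from ANCOVA I over the change score is small because $\rho\sigma_1 = 13.5$ is close to $\sigma_0 = 14$.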
ANCOVA I plugs in the OLS estimators to obtain $\hat{\beta}_{1,OLS}^{(2)}$, whereas cRM plugs in the REML estimators of the variance and covariance parameters to obtain $\hat{\gamma}_{3,GLS}^{(4)}$. The numerical difference between $\hat{\beta}_{1,OLS}^{(2)}$ and $\hat{\gamma}_{3,GLS}^{(4)}$ becomes negligible as the sample size increases. Because of this equivalence between $\hat{\beta}_{1,OLS}^{(2)}$ and $\hat{\gamma}_{3,GLS}^{(4)}$, $Var(\hat{\beta}_{1,OLS}^{(2)})$ and $Var(\hat{\gamma}_{3,GLS}^{(4)})$ are equal [2]. As discussed in section (3.1), ANCOVA I is a conditional model assuming fixed baseline covariates. Even though the model-based variance estimates are conditional, the usual model-based conditional inference is still valid for unconditional hypothesis testing, and ANCOVA I performs comparably to cRM.
Sample sizes are ($n_0 = 90$, $n_1 = 90$) for the balanced design and ($n_0 = 60$, $n_1 = 120$) for the unbalanced design. We analyzed the data examples using the methods outlined in section (3). The statistical results are reported in Table 3 (SAS programs are provided in the Appendix).

In the first data example, ANOVA produced the largest standard error and the largest p-value. ANOVA-Change and RM both outperformed ANOVA with much smaller standard errors and p-values. ANCOVA I and cRM outperformed ANOVA-Change and RM with smaller standard errors and p-values. Although ANCOVA I and cRM are equivalent when the sample size is large, there are still small differences between the two in finite samples.

For the second data example with a balanced design, Figure (2.a) shows that there is a strong baseline weight by treatment interaction. Both ANCOVA I and II have heteroscedastic errors by treatment arm (Figure (2.b) and (2.c)). As shown in Table 2, the OLS model-based standard error of ANCOVA I is very similar to its HC and bootstrap standard errors. Thus, heteroscedasticity does not bias the model-based standard error of ANCOVA I. Although ANCOVA II is robust against heteroscedasticity in a balanced design, the OLS model-based standard error of ANCOVA II (s.e. = 1.333) is still not correct because OLS fails to account for the variability in estimating the overall mean baseline weight. The adjusted HC standard error for ANCOVA II is 1.402, which is closer to the model-based and HC standard errors of ANCOVA I. Bootstrap standard errors for ANCOVA I and II are close to their HC or adjusted HC standard errors. The cRM estimate and its standard error are close to those from ANCOVA I and II.

For the third example with an unbalanced design, Figure (3.d) also reveals a baseline weight by treatment interaction. Both ANCOVA models have heteroscedastic errors by treatment arm (Figure (3.e) and (3.f)). The model-based standard errors of both ANCOVA models are not valid: they are larger than the HC standard errors and thus overestimate the true conditional variances. Compared with ANCOVA I, ANCOVA II has a smaller HC standard error (and a smaller p-value) and thus is slightly more efficient. The adjusted HC standard error for ANCOVA II is very close to the model-based standard error for cRM. Bootstrap standard errors for ANCOVA I and II are very close to their HC or adjusted HC standard errors.

ANCOVA and cRM are equally the most efficient methods compared with the alternatives in both homogeneous and heterogeneous scenarios. The majority of previous studies only examined the homogeneous study population. In this setting, ANOVA is one of the least efficient approaches for analyzing pre-post designs because it does not use any baseline information. ANOVA-Change and RM incorporate the baseline measurement as part of the outcome, whereas ANCOVA I uses the baseline measurement for regression adjustment. ANCOVA I outperforms ANOVA-Change and RM because ANCOVA I exploits the assumption that the baseline measurement is balanced between two arms in a randomized study. Thus, the change score is a less efficient way to use baseline information than adjusting for baseline variables as covariates.

Although heterogeneity in randomized trials justifies the inclusion of the baseline measurement by treatment interaction term, ANCOVA I and II are both unbiased. Yang and Tsiatis showed that the ANCOVA II estimator has smaller unconditional variance than the ANCOVA I estimator except in a balanced design [20]. However, the OLS model-based variances of the ANCOVA I and II estimators, output by standard statistical software, are conditional variances, not unconditional variances. The OLS model-based standard errors and associated p-values for ANCOVA I and II are generally questionable for unconditional inference [2, 3, 9, 19], particularly when the design is unbalanced. With corrected HC variance estimators, both models provide valid unconditional inference.

Figure 1. Hypothetical two-arm pre-post weight loss randomized trial

Table 1.
Because we want to generalize our conclusions to study populations where values of $Y_{ijt_0}$ can differ from the values in our current sample, we may wonder whether significance tests based on the conditional variance assuming $Y_{ijt_0}$ is fixed (e.g., the $t$ test) remain valid. The OLS model-based variance estimator is output by standard statistical software (e.g., "proc reg" in SAS). Its formula is presented in Table 1.

Table 2
[…] is the predicted value of $Y_{ijt_1}$.

Second, the common mean baseline weight $\mu_{t_0}$ is generally unknown. We need to estimate $\mu_{t_0}$ in the centered baseline weight $Y^{c}_{ijt_0}$ using the sample mean baseline weight [16] (Table 2). Standard statistical software such as SAS does not output the corresponding variance of $\hat{\beta}_{1,OLS}^{(6)}$ directly. In a balanced design ($n_0 = n_1$), it can be shown that this variance can be obtained from standard software output [16]. HC variance estimators are consistent (i.e., unbiased in large samples). Among all available HC variance estimators, HC2 was shown to have the best performance in finite samples [2, 3] (e.g., "HCCMETHOD=2" in proc reg, SAS).

ANCOVA I is mis-specified. Then, what is the advantage of using the more complex interaction model over a main effect model? […] ANCOVA II plugs in the OLS estimators, whereas cRM plugs in the REML estimators of the variance and covariance parameters. The numerical difference between the ANCOVA II and cRM estimators gets smaller as the sample size increases. As discussed in section (3.2), standard statistical software such as SAS does not output the unconditional variance for ANCOVA II directly, and the usual OLS model-based standard errors and p-values are biased for unconditional inference in the heterogeneous scenario. The adjusted HC variance estimator fixes this bias. The corrected ANCOVA II provides valid unconditional inference and performs comparably to cRM. An alternative approach to estimating the variances of the ANCOVA I and II estimators is the bootstrap method [17].