Nothing wrong about change: The adequate choice of the dependent variable and design in prediction of intervention success

Investigating predictors of intervention success is a common approach in medical research. In light of individualized medicine, it is important not only to investigate the effects of certain pharmacological and nonpharmacological interventions, but also to examine specific individual characteristics of participants who do or do not benefit from these interventions. However, results on specific predictors of intervention success in the overall field are mixed and inconsistent due to different and sometimes inappropriate statistical methods. Therefore, the present paper provides guidance on the appropriate use of multiple regression analyses to identify predictors of pharmacological and nonpharmacological interventions. We simulated data based on a predefined true model and ran a series of different regression analyses to evaluate their performance in retrieving the true model coefficients. The true model consisted of a 2 (between: experimental vs. control group) x 2 (within: pre- vs. post-treatment) design with two continuous predictors, one of which predicted the success in the intervention group and the other did not. In analyzing the data, we considered four commonly used dependent variables (post-test score, absolute change score, relative change score, residual score), five regression models, eight sample sizes, and four levels of reliability.

variables in a multiple regression paradigm to determine which of these variables is best suited to investigate performance changes after interventions (Aim 2). Furthermore, we investigate the appropriate sample size in relation to the number of predictors used in these multiple regression models (Aim 3) and evaluate the influence of the reliability of the instruments used to measure predictors and outcomes (Aim 4). In a final step, we highlight the specific role of participants' performance at study entry as a predictor variable (Aim 5). We used CT as a specific example to illustrate the simulation process. However, our results apply to many fields that employ the simulated and discussed experimental design.

Simulations. We simulated data from a simple model which is often found in experimental designs reported in the literature, e.g. of CT [e.g. 7, see Figure 1]. The model consists of a 2 (group: experimental vs. control) x 2 (time: pre-treatment vs. post-treatment) design, in which group represents a between-subjects factor and time represents a within-subjects factor. Additionally, a continuous predictor was included in the design which predicts the success of the treatment in the experimental group. We also included a continuous predictor in our simulations which was not related to the success of the treatment.

Note. The mean of E2 was computed depending on the level of reliability such that the desired effect size dz = 0.50 emerged given the mean and standard deviation of E1, the standard deviation of E2, and the correlation between E1 and E2. The same applies to the mean of C2. Accordingly, the observed effect size d varied across the levels of reliability. Depicted arrows do not indicate causality or any direction of influence.

We simulated the data in two steps. First, we randomly generated data derived from the true model as described below (see Model Specifications).
Second, we added noise to these data, given that measurements are never exact and measurement instruments always show a measurement error. We assumed that the noise is normally distributed and that its expected value is zero. These assumptions are based on Classical Test Theory [8]. The extent of the noise thus depends on the standard deviation (SD) of the noise distribution, which is directly related to the reliability of the measurement instruments. Therefore, for our basic simulations, we determined the noise SD by setting the reliability for all measures to a fixed value. The first predictor (P-I) was correlated with the change from time 1 to time 2 in the experimental group, but not correlated with the change from time 1 to time 2 in the control group, r(P-I, ΔC1-C2) = .00. The second predictor was not related to any change from time 1 to time 2, r(P-II, ΔC1-C2) = .00 and r(P-II, ΔE1-E2) = .00. We included this predictor in the simulations to examine whether the statistical models we tested (see Analyses) were able to discriminate between predictors that have a true effect and predictors that do not.

Note that the observed effect sizes (dz and r) also depend on the reliability [10]. In general, the higher the reliability is, the larger the observed effect sizes are, given a constant true effect size. To account for this, we kept the true effect size constant. To this end, we computed the true effect sizes in a scenario with medium effect sizes, i.e. r = .30 and dz = 0.50, and a good reliability, i.e. rtt = .80. These true effect sizes were subsequently used as a basis for the true model for which we generated data as described above and on which we imposed different levels of noise reflecting the respective reliability. Accordingly, the observed effect sizes vary as a function of reliability while the true effect sizes remain constant, as can be assumed in a real-world setting.
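Under Classical Test Theory, reliability is the ratio of true-score variance to observed-score variance, so the noise SD follows directly from the target reliability. A minimal sketch of this second simulation step (the score distribution and the values are hypothetical, not the paper's actual simulation code):

```python
import numpy as np

rng = np.random.default_rng(42)

def add_noise(true_scores, reliability):
    """Add CTT noise: observed = true + e, with E[e] = 0 and SD(e)
    chosen so that reliability = var(true) / (var(true) + var(e))."""
    sd_true = np.std(true_scores, ddof=1)
    sd_noise = sd_true * np.sqrt((1 - reliability) / reliability)
    return true_scores + rng.normal(0.0, sd_noise, size=true_scores.shape)

true_e1 = rng.normal(100, 15, size=500)       # hypothetical true pre-test scores
obs_e1 = add_noise(true_e1, reliability=0.80) # observed scores at rtt = .80
```

The same noise step would be applied to every measure (E1, E2, C1, C2, P-I, P-II), once per reliability level.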

Analyses. After generating n = 1,000 data sets for each sample size from the true model and imposing noise reflecting the respective reliability on all measures (E1, E2, C1, C2, P-I, P-II), we ran five different regression analyses on each individual data set (Aim 1, see Table 1). The regression models differed in terms of the predictors included in the model (Aim 1). In Model 1, the dependent variable was predicted by the external predictors which might be associated with the treatment success, i.e. P-I and P-II. In Model 2, the score measured at time 1, i.e. the pre-test score, was added. Model 2 thus consisted of P-I, P-II, and the pre-test score (i.e. E1 and C1) as the predictors of the dependent variable. In Model 3, we additionally included the treatment Group as a binary predictor (dummy-coded: 0 = control group, 1 = experimental group). In Model 4, the dependent variable was predicted by P-I, P-II, the pre-test score, Group, and the interactions between P-I and Group and between P-II and Group. Finally, in Model 5, we removed the pre-test score from the model, such that Model 5 contained the predictors P-I, P-II, Group, and the interactions between P-I and Group and between P-II and Group (see Table 1 for an overview).

In addition to varying the predictors in the regression model, we also varied the dependent variable in order to investigate the consequences of the different measures used in the literature to quantify treatment success (Aim 2). Specifically, we used the following measures as dependent variables: (1) the measure at time 2 (E2, C2), i.e. the post-test score, (2) the absolute change from time 1 to time 2 (E2 minus E1, C2 minus C1), (3) the relative change from time 1 to time 2 (E2 minus E1 divided by E1, C2 minus C1 divided by C1), and (4) the residuals of the post-test score (E2, C2) after controlling for the pre-test score (E1, C1).
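All four dependent variables can be constructed from the same pre- and post-test scores. A sketch with hypothetical score distributions; the residual score is obtained by regressing the post-test score on the pre-test score and keeping the residuals:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
pre = rng.normal(50, 10, n)                  # pre-test score (E1/C1), hypothetical values
post = pre + rng.normal(5, 5, n)             # post-test score (E2/C2), hypothetical values

# The four dependent variables considered in the paper
dv_post = post                               # (1) post-test score
dv_abs = post - pre                          # (2) absolute change score
dv_rel = (post - pre) / pre                  # (3) relative change score
slope, intercept = np.polyfit(pre, post, 1)  # simple regression of post on pre
dv_resid = post - (intercept + slope * pre)  # (4) residual score
```

By construction, the residual score is mean-zero and uncorrelated with the pre-test score, which is why it behaves differently from the other three criteria in the analyses below.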
We ran the regression analyses not only on the observed data, but also on the true data.

This allowed us to compute a bias by subtracting the observed regression coefficients from the true regression coefficients (see below) for each of the 1,000 individual data sets generated for each of the eight sample sizes and each of the four levels of reliability.

We then aggregated the regression coefficients for each predictor by computing the mean of the coefficients for each set of predictors, each dependent variable, each level of reliability, and each sample size. Furthermore, we computed the standard deviation of these coefficients, which is an estimate of the standard error (SE) of the regression coefficient, i.e. the precision with which the regression coefficient was estimated.
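The bias and SE aggregation can be illustrated for a single coefficient. A simplified sketch with one predictor and a hypothetical true coefficient of 0.5, rather than the full five-model design:

```python
import numpy as np

rng = np.random.default_rng(7)
true_beta = 0.5                         # hypothetical true coefficient
n, n_sims = 100, 1000
estimates = np.empty(n_sims)

for i in range(n_sims):
    x = rng.normal(0, 1, n)
    y = true_beta * x + rng.normal(0, 1, n)
    X = np.column_stack([np.ones(n), x])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    estimates[i] = b[1]

bias = true_beta - estimates.mean()     # true minus mean observed coefficient, as in the paper
se_hat = estimates.std(ddof=1)          # empirical SE: SD of the coefficient across data sets
```

With unbiased OLS estimation the bias is close to zero, and the SD of the 1,000 estimates approximates the coefficient's standard error.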

To evaluate the success of each model and each dependent variable in detecting a true effect while simultaneously controlling for the alpha error, and to also highlight the specific role of the performance of participants at study entry as a predictor (Aim 5), we proceeded as follows. Given the smaller SE for large sample sizes, the regression coefficient estimates in our case need to be divided by the product of their SD (i.e. their SE) and the square root of the sample size; this product is the actual SD of the estimates. In total, we ran 1,280,000 regression analyses (five models × four dependent variables × eight sample sizes × four reliability levels × 1,000 data sets, each for both the true and the observed data).

First, we need to include in the regression model the external predictor whose prognostic performance is to be evaluated. Second, we need to account for the treatment that is applied to the experimental group, but not the control group. To this end, we also need to include the binary predictor Group in the regression model.

Importantly, however, the external predictor P-I can only predict the outcome in the experimental group. Its relationship with the outcome variable might be overlooked because it only exists in the experimental group but not in the control group. Jointly, this might lead to a non-significant main effect of P-I.

Conversely, a significant main effect of P-I in a regression model which does not include the interaction term P-I × Group cannot be interpreted as the ability of P-I to predict the intervention success, because such an intervention success can only be observed in the experimental group. In this case, the significant main effect might just reflect a general relationship between P-I and the outcome variable (depending on which criterion is used, see Aim 2) which does not reflect the ability of P-I to predict the intervention success. To examine this, it is crucial to include the interaction term P-I × Group in the regression model. Finally, we recommend also including the pre-test score as a predictor in the regression model. This controls for differences in the variable of interest that were present prior to the intervention, similar to a covariate in an analysis of covariance. Our simulations show that models including the pre-test score as a predictor yield a better power to unveil a significant P-I main effect or P-I × Group interaction effect than models that do not include the pre-test score as a predictor, as Table 2 illustrates for a reliability of rtt = .80.

Yet, all these types of dependent variables have different implications as regards content and interpretation.
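The masking described above can be reproduced directly: if P-I raises the outcome only in the experimental group, its main effect in a model without the interaction term is diluted, while the interaction term recovers the group-specific effect. A sketch with hypothetical effect sizes:

```python
import numpy as np

rng = np.random.default_rng(9)
n = 1000
group = np.repeat([0, 1], n // 2)   # 0 = control, 1 = experimental
p1 = rng.normal(0, 1, n)            # external predictor P-I
pre = rng.normal(0, 1, n)           # pre-test score
# P-I (slope 0.8, hypothetical) acts on the outcome only in the experimental group
post = pre + group * (1.0 + 0.8 * p1) + rng.normal(0, 1, n)

def fit(cols, y):
    """OLS coefficients for an intercept plus the given predictor columns."""
    X = np.column_stack([np.ones(len(y))] + cols)
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b

b_main = fit([p1, pre, group], post)             # no interaction: P-I effect diluted (~0.4)
b_int = fit([p1, pre, group, p1 * group], post)  # with interaction: ~0.8 on P-I x Group
```

In the model without the interaction the P-I coefficient averages the effect over both groups, roughly halving it; in the full model the P-I main effect (control group) is near zero and the interaction carries the group-specific effect.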

In a classical pre-post design, which underlies most non-pharmacological intervention studies, the post-test score seems to be an established dependent variable in multiple regression analyses measuring training success. However, using the post-test score (that is, performance after training/intervention) answers the question "Is x a likely cause of y?" [11], but does not refer to gain. Furthermore, imagine an external predictor such as P-I emerged as a significant predictor of the post-test score in the experimental group. Would that indicate that the external predictor can predict the intervention success? Not necessarily, because the predictor might just be related to the construct captured by the post-test score. In this case, one would also find that P-I is similarly related to the pre-test score in the experimental group.

Furthermore, an external predictor such as P-I could be related to the post-test score in both the experimental group and the control group. Thus, finding a significant effect of P-I on the post-test score in the experimental group is necessary, but insufficient, to draw the conclusion that P-I can predict the intervention success.

Absolute change scores (post minus pre performance) answer the question "Whose score is most likely to increase/decrease over time?", therefore directly referring to intervention gain.

Relative change scores, in turn, avoid some of the assumptions which are inherent in traditional reliability or generalizability coefficients [15]. They can be interpreted in terms of how much progress an individual has made in comparison to others. Therefore, the focus is not on changes in the individual's performance, but on comparisons to others. Yet, our simulations demonstrated that relative change scores are more vulnerable to the methodological artifact (described in [17]) than absolute change scores. The probability of detecting a significant negative regression coefficient for the pre-test score was consistently higher for relative change scores than for absolute change scores, regardless of sample size, regression model used, or level of reliability. Keep in mind that we did not model a relationship between the pre-test score and the intervention success when simulating the data.

The indication of a significant negative regression coefficient is thus an alpha error. Similarly, the power of detecting a significant P-I × Group interaction effect was consistently higher for relative change scores than for absolute change scores (see Figure S1). Obviously, because sample size and power are dependent on each other, the power increases as the sample size increases, regardless of which dependent variable is used in the regression model. Further, as an overall trend, the required sample size also depends on the reliability: as the reliability increases, a smaller sample size is needed to achieve the same level of power.

As depicted in Figure 2, not integrating the pre-test score in the regression model leads to the need for a larger sample size to achieve the same power as regression models which integrate the pre-test score in the calculation. This is the case for all dependent variables except one: when using the residual score as the dependent variable, there is nearly no difference in power/sample size between regression models that include or exclude the pre-test score, as the pre-test score is already included in the dependent variable as a defining characteristic of the residual score.

Overall, Figure 2 shows that, regardless of which dependent variable and which predictors are used, a larger sample size is needed when the dependent measure is not as reliable as you wish it would be. Yet, when using the change score as the dependent variable and the reliability is rather low (.60/.70), a sample size of n = 300 seems even more appropriate to achieve a good power. It is important to always calculate and report reliabilities of the instruments used to ensure good scientific practice and to help other researchers better understand and evaluate your results.
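Power at a given sample size can be estimated the same way the simulations do it: generate many data sets, test the coefficient of interest in each, and take the share of significant results. A simplified sketch for a single slope (hypothetical effect size; a normal approximation is used in place of the t distribution):

```python
import numpy as np
from math import sqrt

rng = np.random.default_rng(5)

def estimate_power(n, beta=0.3, sigma=1.0, n_sims=500):
    """Share of simulated data sets in which the slope's two-sided
    test is significant at alpha = .05 (normal approximation)."""
    hits = 0
    for _ in range(n_sims):
        x = rng.normal(0, 1, n)
        y = beta * x + rng.normal(0, sigma, n)
        X = np.column_stack([np.ones(n), x])
        b, res, *_ = np.linalg.lstsq(X, y, rcond=None)
        s2 = res[0] / (n - 2)                      # residual variance
        se = sqrt(s2 / np.sum((x - x.mean())**2))  # SE of the slope
        if abs(b[1] / se) > 1.96:
            hits += 1
    return hits / n_sims

power_n100 = estimate_power(100)  # roughly .85 for these hypothetical values
```

Running the function for a grid of sample sizes reproduces the familiar monotone power curve.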
Aim 4: The role of reliability of the measurement instruments

The simulations show that, in order to achieve adequate power to detect a true effect, a relatively large sample size is required, which is often difficult to achieve in scientific practice.

However, the simulations also illustrate that an adequate power can be achieved not only by increasing the sample size, but also by selecting more reliable measures. While increasing the sample size mostly decreases the standard error, which in turn leads to increased power, increasing reliability also increases the estimates of the regression coefficient, i.e. the estimate and its entire confidence interval are shifted away from zero, making it more likely that a true effect is detected (see Figure 3 for the regression coefficients of P-I and P-I × Group as a function of reliability).

Note. Red colour indicates a reliability of .60; blue colour indicates a reliability of .70; green colour indicates a reliability of .80.

Aim 5: The special role of the pre-test score as a predictor in a multiple regression

Studying Table 2 (or Tables 1-32 in the Supplementary Material), a striking observation is that whenever the pre-test score is included in a regression model, the regression coefficients for the other predictors yield the exact same results independent of the criterion, apart from the relative change, because relative change is measured on another scale than the other three criteria. This suggests that whenever the pre-test score is a predictor in the regression model, the coefficients for the post-test score and the absolute change score as criteria are systematically related. This also implies that the difference c0 - b0 has to equal zero, giving

0 = 0 (7)
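The shift of the coefficient estimates with reliability follows from classical attenuation: with a standardized true predictor, the observed slope shrinks toward reliability × true slope. A sketch with a hypothetical true slope of 0.5:

```python
import numpy as np

rng = np.random.default_rng(11)
n = 20000
true_x = rng.normal(0, 1, n)
y = 0.5 * true_x + rng.normal(0, 1, n)   # hypothetical true slope of 0.5

slopes = {}
for rtt in (0.6, 0.8, 1.0):
    sd_noise = np.sqrt((1 - rtt) / rtt)  # var(true_x) = 1, so this yields reliability rtt
    obs_x = true_x + rng.normal(0, sd_noise, n)
    slopes[rtt] = np.polyfit(obs_x, y, 1)[0]  # observed slope shrinks toward 0.5 * rtt
```

The observed slopes come out near 0.30, 0.40, and 0.50, i.e. the whole estimate (and its confidence interval) moves away from zero as reliability rises.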

First, these mathematical equations show that when the pre-test score is included as a predictor in the regression model, the regression coefficients for the other predictors and the intercept are identical for the post-test score and the absolute change score as criteria (assuming that formal prerequisites for multiple regression analyses are met).

Second, the regression coefficients of the pre-test score for both criteria are a linear transformation of each other. Considering that the coefficient b1 reflects the relationship between the pre-test score and the post-test score, it can be interpreted as an estimate of the (test-retest) reliability. The coefficient c1, reflecting the relationship between the pre-test score and the absolute change score, is thus always negative, because the reliability can never exceed 1 and because we have shown that c1 = b1 - 1. Furthermore, this relationship paradoxically implies that the relationship between the pre-test score and the change score is stronger when the reliability of the measure is lower.

Such considerations help to inform whether the regression coefficient for the pre-test score is purely a statistical artefact or might reflect a relationship that persists beyond the statistical artefact. In the context of cumulative research evidence, it is also of high importance to report and publish studies with small sample sizes that only or mostly show non-significant prognostic effects.
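The identity c1 = b1 - 1 (and the equality of all other coefficients) holds exactly, because regressing post minus pre on a predictor set that contains the pre-test score simply subtracts 1 from the pre-test slope. A quick check with hypothetical values:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
pre = rng.normal(50, 10, n)
p1 = rng.normal(0, 1, n)
post = 0.7 * pre + 2.0 * p1 + rng.normal(0, 5, n)   # hypothetical coefficients

X = np.column_stack([np.ones(n), pre, p1])
b, *_ = np.linalg.lstsq(X, post, rcond=None)        # criterion: post-test score
c, *_ = np.linalg.lstsq(X, post - pre, rcond=None)  # criterion: absolute change score

# Intercept and P-I coefficients are identical; the pre-test slope shifts by exactly 1
```

The check holds up to floating-point precision for any data, since OLS is linear in the criterion.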

These studies can also contribute to cumulative research findings (e.g. in meta-analyses) and help to clarify whether there is a true pre-test score effect on the change score or not. Apart from reporting the sample size, also report the reliability of the employed measures, as it has a considerable impact on the probability of detecting a true effect and should thus be made accessible to your readers.

Availability of data and materials

The datasets generated and analysed during the current study are available in the Open Science Framework (OSF) repository, https://osf.io/p54j3/?view_only=79663d4a95cb4705b25e2a5f374d5155 [anonymized link for reviews; will be replaced by public DOI once the manuscript is accepted for publication]

Competing interests

The authors declare that they have no competing interests.