 Research article
 Open access
Nothing wrong about change: the adequate choice of the dependent variable and design in prediction of cognitive training success
BMC Medical Research Methodology volume 20, Article number: 296 (2020)
Abstract
Background
Even though investigating predictors of intervention success (e.g., cognitive training, CT) is gaining increasing interest in light of individualized medicine, results on specific predictors of intervention success in the overall field are mixed and inconsistent because of the different and sometimes inappropriate statistical methods used. Therefore, the present paper provides guidance on the appropriate use of multiple regression analyses to identify predictors of the success of CT and similar nonpharmacological interventions.
Methods
We simulated data based on a predefined true model and ran a series of different analyses to evaluate their performance in retrieving the true model coefficients. The true model consisted of a 2 (between: experimental vs. control group) × 2 (within: pre- vs. posttreatment) design with two continuous predictors, one of which predicted the success in the intervention group whereas the other did not. In analyzing the data, we considered four commonly used dependent variables (posttest score, absolute change score, relative change score, residual score), five regression models, eight sample sizes, and four levels of reliability.
Results
Our results indicated that a regression model including the investigated predictor, Group (experimental vs. control), pretest score, and the interaction between the investigated predictor and Group as predictors, with the absolute change score as the dependent variable, seemed most suitable for the given experimental design. Although the pretest score should be included as a predictor in the regression model for reasons of statistical power, its coefficient should not be interpreted because a negative and statistically significant regression coefficient commonly emerges even if there is no true relationship.
Conclusion
Employing simulation methods, theoretical reasoning, and mathematical derivations, we were able to derive recommendations regarding the analysis of data in one of the most prevalent experimental designs in research on CT and external predictors of CT success. These insights can contribute to the application of considered data analyses in future studies and facilitate cumulative knowledge gain.
Background
In medical and psychological research, researchers and clinicians often study the effects of certain pharmacological and nonpharmacological interventions. One focus in the field of neuropsychology so far has been the effects of nonpharmacological interventions, especially cognitive training interventions to delay or even prevent the onset of cognitive decline. Cognitive training (CT) interventions are defined as a standardized set of exercises [1] that involve repeated practice and are designed to reflect particular cognitive functions, such as memory, attention, or executive functions [2, 3]. CT is effective in improving and maintaining cognitive abilities not only in patients with neurological diseases such as Alzheimer’s [4] or Parkinson’s disease [5], but also in healthy older adults as an attempt to prevent cognitive impairment in the aging process [6]. Yet, with the increasing importance of personalized medical approaches, the question “Who benefits most from CT?” is gaining more and more attention. Defining prognostic factors for performance changes after nonpharmacological interventions is of high importance in order to define subgroups of participants who may benefit from a specific treatment [7, 8], and for the design of new and more effective training programs [9, 10]. For example, many studies have investigated the impact of sociodemographic variables such as age [11], sex [12], and education [13] as predictors of CT success. Yet, results on prognostic factors for changes in performance after nonpharmacological trainings are so far highly heterogeneous and in some cases contradictory. For example, a study by Matysiak et al. (2019) investigated prognostic factors for changes in performance after a working memory training for healthy older adults. 
With the help of multilevel analysis, they showed that older adults with initially lower working memory capacity (lower scores at study entry in the investigated domain) improved less and reached lower levels of performance [14]. This was explained with the magnification account, which predicts that cognitively efficient people also show the largest gains from nonpharmacological interventions [15]. In contrast, a study by Zinke et al. (2014), which also investigated predictors of working memory training success but used stepwise regression analyses, found that participants with lower baseline performance showed higher gains after training [16]. To explain this result, a different account was used: the compensation account, which states that interventions yield the largest gains in the least cognitively efficient people [15]. But how is it possible that two studies of a similar topic (predictors of working memory training success) reveal such contradictory results? To answer this question, a systematic review on prognostic factors of memory training success in healthy older adults was conducted. It showed that results vary as a function not only of the type of statistical calculation used to determine prognostic factors, but also of the type of dependent variable used in the calculations [17]: posttest scores, change scores, relative change scores, and residual change scores. Posttest scores are defined as performance after the training or intervention, change scores refer to posttest minus pretest scores, relative change scores are norm-referenced change scores, and residual change scores are defined as change scores adjusted for baseline variance. Moreover, the systematic review showed that different prognostic studies used different independent variables, and variations of these, in their prediction models: e.g. some studies included “group” (Experimental vs. 
Control Group) as a binary predictor in their regression analyses, whereas other studies only examined predictors within the experimental group. In some regression models, interactions between group variables and possible predictors were assessed, whereas in other studies these interactions were missing from the regression models. In addition, some studies calculated regression models that included neuropsychological performance at study entry as a possible predictor. A special role of neuropsychological performance at study entry was identified, leading to the two explanatory accounts already mentioned: magnification vs. compensation. However, a recent paper by Smolén et al. (2018) showed that most evidence for the compensation account of nonpharmacological training interventions is unreliable because of methodological errors in the original studies [18]. Because systematic errors related to the choice of the dependent variable in a prognostic model, and the special role of neuropsychological performance at study entry, can in principle carry over to all research fields that use multiple regression to determine prognostic factors for changes after interventions, the present paper aims to establish a framework for the appropriate use of multiple regression analysis in the context of prognostic research, here with a special focus on CT interventions.
Therefore, in the present paper, we use simulation methods to systematically investigate which multiple regression model is best suited to answer the question of “who benefits?” by calculating different regression models with different independent variables as possible predictors (Aim 1). We also examine the impact of the four dependent variables in a multiple regression paradigm to determine which of them is best suited to investigate performance changes after an intervention (Aim 2). Furthermore, we investigate the best sample size in relation to the number of predictors used in these multiple regression models (Aim 3) and evaluate the influence of the reliability of the instruments used to measure predictors and outcomes (Aim 4). In a final step, we highlight the special role of the pretest score as a predictor in the multiple regression analysis to shed further light on the discussion of the magnification and compensation accounts (Aim 5). We use CT as a specific example to illustrate the simulation process. However, our results apply to many fields that employ the simulated and discussed experimental design.
Method
Simulations
We simulated data from a simple model which is often found in experimental designs reported in the literature, for instance of CT (e.g. [19], see Fig. 1). The model consists of a 2 (group: experimental vs. control) × 2 (time: pretreatment vs. posttreatment) design, in which the group represents a between-subjects factor and the time represents a within-subjects factor. Additionally, a continuous predictor was included in the design which predicts the success of the treatment in the experimental group (e.g. age which has been identified as a predictor of CT success [11]). We also included a continuous predictor in our simulations which was not related to the success of the treatment (e.g. education [13, 20]).
We simulated the data in two steps. First, we randomly generated data derived from the true model as described below (see Model Specifications). Second, we added noise to these data given that measurements are never exact and measurement instruments always show a measurement error. We assumed that the noise is normally distributed and that the expected value of the noise is zero. These assumptions are based on the Classical Test Theory [21]. The extent of the noise thus depends on the standard deviation (SD) of the noise distribution, which is directly related to the reliability of the measurement instruments. Therefore, for our basic simulations, we determined the noise SD by setting the reliability for all measures to .80, reflecting good reliability [22, 23]. In a further step, we systematically varied the reliability of the measures and generated additional data assuming a reliability of .60 (acceptable reliability), .70 (moderate reliability), and .90 (excellent reliability).
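The second step can be sketched as follows. This is a minimal illustration, assuming the usual classical test theory definition of reliability as the ratio of true-score variance to observed-score variance. The function names (noise_sd, add_noise) and the example values are hypothetical and are not taken from the authors' R code.

```python
import math
import random
import statistics


def noise_sd(true_sd: float, reliability: float) -> float:
    """Error SD implied by reliability = var_true / (var_true + var_error)."""
    if not 0.0 < reliability <= 1.0:
        raise ValueError("reliability must be in (0, 1]")
    return true_sd * math.sqrt((1.0 - reliability) / reliability)


def add_noise(true_scores, reliability, rng):
    """Observed score = true score + zero-mean, normally distributed noise."""
    sd_err = noise_sd(statistics.pstdev(true_scores), reliability)
    return [t + rng.gauss(0.0, sd_err) for t in true_scores]


rng = random.Random(1)  # fixed seed so the sketch is reproducible
# Hypothetical true scores on a T-scale-like metric (mean 50, SD 10).
true_scores = [rng.gauss(50.0, 10.0) for _ in range(20000)]
observed = add_noise(true_scores, 0.80, rng)

# Empirical reliability: share of observed variance that is true variance.
empirical_rel = statistics.pvariance(true_scores) / statistics.pvariance(observed)
print(round(noise_sd(10.0, 0.80), 2), round(empirical_rel, 2))
```

With a true-score SD of 10 and a reliability of .80, the implied noise SD is 5; lower reliabilities imply a larger noise SD under this definition.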
Furthermore, we varied the sample size in our simulations: We ran simulations with a sample size of n = 50, 100, 150, 200, 250, 300, 400 and 500 participants, to investigate the impact of sample size on the detection of a desired effect.
For each sample size, we generated n = 1000 data sets as described above. We provide the simulated data and the R code here: www.osf.io/p54j3
Model specifications
We determined a true model that we used to generate sample data. The model was as follows (see Fig. 1 for a summary): At time 1, i.e. before the treatment, both the experimental group (E1) and the control group (C1) had the same mean and standard deviation on the measure that we simulated (e.g. the score on a cognitive test). We used the norms of the T-scale as the values for the pretreatment condition, i.e. M_{E1/C1} = 50 and SD_{E1/C1} = 10. At time 2, i.e. after the treatment, the mean in the experimental group (E2) was higher than at time 1 with a medium effect size of dz_{E1E2} = 0.50, reflecting a successful treatment. Furthermore, we set the SD_{E2} to 13, i.e. a bit higher than at time 1, reflecting the common finding that the variance is larger in groups that were submitted to a treatment compared to groups that were not given an intervention. The SD of the control group (C2), however, was set to the same value as at time 1, i.e. SD_{C2} = 10. To account for the common observation that a given measure also increases in the control group from time 1 to time 2 despite the lack of treatment, we set the effect size dz_{C1C2} to 0.05, reflecting a negligible increase.
Furthermore, we simulated two predictors (e.g. age (PI) and education (PII) in years, as frequently used as predictors in CT studies). Both predictors (PI and PII) had a mean of 50 and a standard deviation of 10. Importantly, PI was correlated with the increase in the experimental group, r(PI, ΔE1E2) = .30, reflecting a medium effect. However, PI was not correlated with the change from time 1 to time 2 in the control group, r(PI, ΔC1C2) = .00. The second predictor was not related to any change from time 1 to time 2, r(PII, ΔC1C2) = .00 and r(PII, ΔE1E2) = .00. We included this predictor in the simulations to examine whether the statistical models we tested (see Analyses) were able to discriminate between predictors that have a true effect and predictors that do not.
Note that the observed effect sizes (dz and r) also depend on the reliability [23]. In general, the higher the reliability, the larger the observed effect sizes for a constant true effect size. To account for this, we kept the true effect size constant. To this end, we computed the true effect sizes in a scenario with medium effect sizes, i.e. r = .30 and dz = 0.50, and a good reliability, i.e. r_{tt} = .80. These true effect sizes were subsequently used as the basis for the true model, for which we generated data as described above and on which we imposed different levels of noise reflecting the respective reliability. Accordingly, the observed effect sizes vary as a function of reliability while the true effect sizes remain constant, as can be assumed in a real-world setting.
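The dependence of an observed correlation on reliability can be illustrated with the standard attenuation formula from classical test theory. The symbols are illustrative notation, and the source does not state which formula was used to derive the true effect sizes, so this is a sketch of the general principle only.

```latex
r_{\text{obs}} = r_{\text{true}} \sqrt{r_{xx}\, r_{yy}}
\qquad\Longleftrightarrow\qquad
r_{\text{true}} = \frac{r_{\text{obs}}}{\sqrt{r_{xx}\, r_{yy}}}
```

Purely as a numerical illustration: for two measures that each have a reliability of .80, an observed correlation of .30 corresponds to a true correlation of .30 / .80 = .375. With a reliability of .60 for both measures, the same true correlation would be observed as .375 × .60 = .225.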
Analyses
After generating n = 1000 data sets for each sample size from the true model and imposing noise reflecting the respective reliability for all measures (E1, E2, C1, C2, PI, PII), we ran five different regression analyses on each individual data set (Aim 1, see Table 1). The regression models differed in terms of the predictors included. In Model 1, the dependent variable was predicted by the external predictors that might be associated with the treatment success, i.e. PI and PII. In Model 2, the score measured at time 1, i.e. the pretest score, was added. Model 2 thus consisted of PI, PII and the pretest score (i.e. E1 and C1) as the predictors of the dependent variable. In Model 3, we additionally included the treatment Group as a binary predictor (dummy-coded: 0 = control group, 1 = experimental group). Even though the treatment Group as a binary predictor is fundamental when calculating training success, we did not include it in Models 1 and 2 because its omission is commonly observed in recent prediction research. We therefore wanted to show how omitting the Group variable from the regression can influence the results and lead to misleading interpretations. In Model 4, the dependent variable was predicted by PI, PII, the pretest score, Group, and the interactions PI × Group and PII × Group. Finally, in Model 5, we removed the pretest score from the model, such that Model 5 contained the predictors PI, PII, Group, and the interactions PI × Group and PII × Group (see Table 1 for an overview). All continuous predictors (i.e. PI, PII, and pretest score) were centered prior to entering them in the regression model to improve interpretability. In Models 1 to 3, the regression coefficients for the predictors PI and PII are usually interpreted to investigate the prediction of CT success. 
In Models 4 and 5, the regression coefficients for the interaction term between the Group and the predictors PI and PII are of interest.
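The five predictor sets described above can be written down compactly. The following Python sketch is only an illustration: the term labels, the "a:b" notation for product terms, and the helper names (center, design_row) are hypothetical conventions, not the authors' implementation.

```python
# Predictor sets of the five regression models; "a:b" denotes the product a * b.
MODELS = {
    1: ["PI", "PII"],
    2: ["PI", "PII", "pretest"],
    3: ["PI", "PII", "pretest", "group"],
    4: ["PI", "PII", "pretest", "group", "PI:group", "PII:group"],
    5: ["PI", "PII", "group", "PI:group", "PII:group"],
}


def center(values):
    """Subtract the mean so that zero corresponds to the sample mean."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def design_row(terms, record):
    """Build one design-matrix row (intercept first) for one participant.

    `record` maps "PI", "PII", "pretest" (all centered) and "group"
    (0 = control, 1 = experimental) to numbers.
    """
    row = [1.0]
    for term in terms:
        value = 1.0
        for name in term.split(":"):
            value *= record[name]
        row.append(value)
    return row


# Hypothetical participant in the experimental group (centered predictors).
participant = {"PI": 2.0, "PII": -1.0, "pretest": 3.0, "group": 1}
row_model4 = design_row(MODELS[4], participant)
print(row_model4)
```

For a control-group participant (group = 0), both interaction columns are zero, which is why the PI and PII coefficients in Models 4 and 5 describe the control group only.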
In addition to varying the predictors in the regression model, we also varied the dependent variable in order to investigate the consequences of the different measures used in the literature to quantify treatment success (Aim 2). Specifically, we used the following measures as dependent variables: (1) the measure at time 2 (E2, C2), i.e. the posttest score, (2) the absolute change from time 1 to time 2 (E2 minus E1, C2 minus C1), (3) the relative change from time 1 to time 2 (E2 minus E1, divided by E1; C2 minus C1, divided by C1), and (4) the residuals of the posttest score (E2, C2) after controlling for the pretest score (E1, C1). We ran the regression analyses not only for the observed data, but also for the true data. This allowed us to compute a bias by subtracting the true regression coefficients from the observed regression coefficients (see below). Furthermore, as described above, we ran simulations with sample sizes of n = 50, 100, 150, 200, 250, 300, 400 and 500 participants to investigate the impact of sample size on the detection of a desired effect (Aim 3). We also varied the reliability of all measures (.60, .70, .80, and .90) to examine how the results are affected by measurement accuracy (Aim 4). Importantly, we fully crossed the sets of predictors and the dependent variables, i.e. we computed each regression model for each dependent variable. This resulted in 20 regression analyses for each of the 1000 individual data sets generated for each of the eight sample sizes and each of the four levels of reliability.
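The four dependent variables can be computed from pretest and posttest scores as in the following sketch. It assumes that the residual score comes from a simple linear regression of posttest on pretest fitted to the pooled sample; the source does not specify this detail. The function name and the toy data are hypothetical.

```python
import statistics


def dependent_variables(pre, post):
    """Return the four criteria for paired pretest/posttest scores."""
    mean_pre = statistics.fmean(pre)
    mean_post = statistics.fmean(post)
    # Simple regression of posttest on pretest (assumed: pooled sample).
    sxx = sum((x - mean_pre) ** 2 for x in pre)
    sxy = sum((x - mean_pre) * (y - mean_post) for x, y in zip(pre, post))
    slope = sxy / sxx
    intercept = mean_post - slope * mean_pre
    return {
        "posttest": list(post),
        "absolute_change": [y - x for x, y in zip(pre, post)],
        "relative_change": [(y - x) / x for x, y in zip(pre, post)],
        "residual": [y - (intercept + slope * x) for x, y in zip(pre, post)],
    }


# Toy data for four hypothetical participants.
pre = [40.0, 50.0, 60.0, 55.0]
post = [46.0, 52.0, 69.0, 58.0]
dvs = dependent_variables(pre, post)
print(dvs["absolute_change"])
```

Note that the relative change divides by the pretest score, so it is only defined for nonzero pretest values, and that residual scores sum to zero by construction.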
We then aggregated the regression coefficients for each predictor by computing the mean of the coefficients for each set of predictors, each dependent variable, each level of reliability, and each sample size. Furthermore, we computed the standard deviation of these coefficients which is an estimate for the standard error (SE) of the regression coefficient, i.e. the precision with which the regression coefficient was estimated.
To evaluate how well each model and each dependent variable detected a true effect while simultaneously controlling the alpha error, and to highlight the specific role of participants' performance at study entry as a predictor (Aim 5), we proceeded as follows: for each set of predictors, each dependent variable, each level of reliability, and each sample size, we counted the number of times a given predictor yielded a significant relationship with the dependent variable (i.e. p < .05) and divided it by the total number of analyses (i.e. 1000). The resulting value P thus represents the proportion of significant effects of the given predictor in all analyses. If there is no true relationship between the given predictor and the dependent variable, P indicates the alpha error, i.e. the probability of finding an effect even though no true effect exists. If, however, there is a true relationship between the given predictor and the dependent variable, P indicates the power, i.e. the probability of detecting an effect when the true effect exists.
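The computation of P is a simple proportion. A minimal sketch, with a hypothetical function name and made-up p values:

```python
def proportion_significant(p_values, alpha=0.05):
    """Share of analyses in which the predictor reached p < alpha."""
    if not p_values:
        raise ValueError("need at least one p value")
    return sum(1 for p in p_values if p < alpha) / len(p_values)


# Four made-up p values for one predictor across four simulated data sets.
p_values = [0.01, 0.20, 0.04, 0.50]
P = proportion_significant(p_values)
print(P)
```

Whether P is read as an alpha error rate or as power depends only on whether the true model contains an effect for that predictor, not on the computation itself.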
Furthermore, we computed the bias, i.e. the difference between the observed regression coefficient and the true regression coefficient. To compare the bias across different regression coefficients and different models with different units of the dependent variable (raw units for the posttest score, the residuals and the absolute change; relative scores for the relative change), we studentized them. The unit of the studentized biases is “standard deviations”. To studentize a variable, its values are divided by its standard deviation. However, in our simulations, the standard deviation of the regression coefficient estimates is in fact the standard error of the estimates. Dividing by this SE would result in a larger studentized bias for large sample sizes, given the smaller SE for large sample sizes. Accordingly, in our case, the regression coefficient estimates need to be divided by the product of their SD (i.e. their SE) and the square root of the sample size. This product is the actual SD of the estimates. In total, we ran 1,280,000 regression analyses (five models × four dependent variables × eight sample sizes × four reliability levels × 1000 data sets, each analyzed for both the true and the observed data).
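The studentization and the total number of analyses can be sketched as follows. This is an assumption-laden illustration: it takes the bias as observed minus true coefficient, averages it across data sets, and uses the sample SD of the observed estimates as their SE. The function name and the example numbers are hypothetical.

```python
import math
import statistics


def studentized_bias(observed, true, n):
    """Mean bias divided by (SD of the estimates * sqrt(sample size))."""
    mean_bias = statistics.fmean([o - t for o, t in zip(observed, true)])
    se = statistics.stdev(observed)  # SD across data sets, i.e. the SE
    return mean_bias / (se * math.sqrt(n))


# Made-up coefficients from three simulated data sets with n = 100 each.
observed = [0.20, 0.30, 0.40]
true = [0.25, 0.25, 0.25]
sb = studentized_bias(observed, true, n=100)

# Size of the full simulation design described in the text.
n_models, n_dvs, n_sizes, n_rel, n_datasets, n_versions = 5, 4, 8, 4, 1000, 2
n_regressions = n_models * n_dvs * n_sizes * n_rel * n_datasets * n_versions
print(round(sb, 3), n_regressions)
```

Multiplying the SE by the square root of n removes the mechanical shrinkage of the SE with growing sample size, so studentized biases stay comparable across the eight sample sizes.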
Results
Aim 1: the choice of an adequate multiple regression model including all relevant predictors
The choice of the adequate regression model, i.e. the answer to the question of which predictors should be included in the model, can be derived theoretically. First, it is obvious that the external predictor PI needs to be included in the regression model since its prognostic performance is to be evaluated. Second, we need to account for the treatment that is applied to the experimental group, but not the control group. To this end, we also need to include the binary predictor Group in the regression model.
Importantly, however, the external predictor PI can only predict the outcome in the experimental group, but not in the control group, given that the control group is unaffected by the treatment and no systematic variations in the outcome variable (Aim 2) should be observed in this group. This relationship has to be modelled explicitly, which is achieved by including the interaction of PI and Group (PI × Group) in the regression model. If this interaction term is not included in the regression model, a true relationship between PI and the outcome variable might be overlooked because it exists only in the experimental group but not in the control group. Pooled across both groups, this might lead to a nonsignificant main effect of PI. Note, for instance, that the power to detect a significant effect of PI in Models 1 to 3 is much lower than the power to detect a significant effect of the PI × Group interaction in Models 4 and 5 (Table 2). Conversely, a significant main effect of PI in a regression model that does not include the interaction term PI × Group cannot be interpreted as the ability of PI to predict the intervention success because such an intervention success can only be observed in the experimental group. In this case, the significant main effect might just reflect a general relationship between PI and the outcome variable (depending on which criterion is used, see Aim 2), which does not reflect the ability of PI to predict the intervention success. To examine this, it is crucial to include the interaction term PI × Group in the regression model.
Finally, we recommend also including the pretest scores as a predictor in the regression model. This controls for differences in the variable of interest that were present prior to the intervention, similar to a covariate in an analysis of covariance. Our simulations show that models including the pretest score as a predictor yield a better power to unveil a significant PI main effect or PI × Group interaction effect than models that do not include the pretest score as a predictor. For example, Table 2 shows that for a reliability of r_{tt} = .80 and a sample size of n = 200, Model 5 without the pretest score as a predictor yields a power of 0.63 to detect a significant PI × Group interaction effect for the absolute change. Model 4, which does include the pretest score as a predictor, yields a much higher power of 0.71. A similar pattern is found for the other criteria. An exception to this observation are models that use the residual score as the criterion because the residual scores are defined as the posttest score after controlling for the pretest score. Consequently, the pretest score can never significantly predict the residual test score and including or excluding the pretest score in the model does not impact the regression coefficients of the other predictors. In the section on Aim 5, we discuss the special role of the pretest scores as a predictor in the regression models in more detail.
Apart from the power to detect the desired effect, the interpretation of the regression coefficients also varies between the different regression models. In Models 1 to 3, the coefficients of the continuous predictors indicate how the outcome variable changes when the corresponding predictor increases by one unit. For example, in Model 2 with the absolute change score as the criterion, an increase of one unit in PI is associated with an increase of 0.15 units in the absolute change (Table 2). Additionally, in Model 3, the coefficient for the binary Group variable indicates the mean difference in the outcome variable between the experimental group and the control group. Since the Group variable was dummy-coded (0 = control group, 1 = experimental group), the coefficient indicates the deviation of the experimental group from the control group in terms of the outcome variable. The intercept indicates the predicted value of the outcome variable at the mean of all continuous predictors (Models 1 to 3) and, in Model 3 only, in the control group. This explains why the intercept is higher in Models 1 and 2 than in Model 3. In the first two models, the Group variable is not accounted for, so the intercept represents the overall mean in the sample. In Model 3, the Group variable is taken into account. Since the control group was modelled to have lower values on the outcome variable than the experimental group, the intercept is lower than in the other two models (see also Fig. 2).
The interpretation is slightly different for the Models 4 and 5 which include interaction terms. Specifically, the interpretation for the continuous predictors is limited to the control group, i.e. the regression coefficients for PI, PII (and the pretest score) indicate the change in the outcome variable for the control group if the predictors increase by one unit. Ideally, these should be zero (except for the pretest score) because the predictor PI is expected to predict the intervention success and there was no intervention in the control group. The regression coefficients for the interaction terms indicate how much more (or less) the outcome variable changes in the experimental group compared to the control group when the continuous predictor increases by one unit. Take Model 4 for the absolute change score as the criterion for example: If PI increases by one unit, the absolute change score does not change at all in the control group (Group = 0) because the regression coefficient for PI is 0.00. In the experimental group (Group = 1), the absolute change score would change by 0.00 + 0.30 = 0.30 units, i.e. the sum of the regression coefficient for PI and regression coefficient for the PI by Group interaction. Finally, the regression coefficient for the binary group variable indicates the mean difference in the outcome variable between the experimental and control group for mean values on the continuous predictors, i.e. if the predictors are zero.
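The interpretation of Model 4 can be written out as an equation. The coefficient labels b_0 to b_6 and the abbreviation Pre for the centered pretest score are illustrative notation and do not appear in the source; the numerical values are those of the example above.

```latex
\begin{aligned}
Y &= b_0 + b_1\,\mathrm{PI} + b_2\,\mathrm{PII} + b_3\,\mathrm{Pre} + b_4\,\mathrm{Group}
     + b_5\,(\mathrm{PI}\times\mathrm{Group}) + b_6\,(\mathrm{PII}\times\mathrm{Group}) + \varepsilon \\[4pt]
\text{slope of PI} &= b_1 + b_5\,\mathrm{Group} =
\begin{cases}
b_1 = 0.00 & \text{control group (Group = 0)} \\
b_1 + b_5 = 0.00 + 0.30 = 0.30 & \text{experimental group (Group = 1)}
\end{cases}
\end{aligned}
```

With centered continuous predictors, b_0 is the predicted outcome for an average control participant and b_4 is the group difference at the mean of the continuous predictors.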
Fig. 2 illustrates how the interpretation of regression coefficients changes depending on whether the regression model comprises only a continuous predictor (Example 1; Models 1 and 2), a continuous predictor and the binary group variable (Example 2; Model 3), or a continuous predictor, the binary group variable and their interaction (Example 3; Models 4 and 5). In Example 1, there is only one regression line for the entire sample, which ignores the assignment to the experimental or control group and thus weakens the power to detect the effect. In Example 2, there are two regression lines, one for the experimental group and one for the control group, that are parallel to each other and have the same slope as the regression line in Example 1 (but different intercepts), which also weakens the power to detect the effect. Finally, in Example 3, the slopes of the regression lines differ between the experimental and the control group. Ideally, the slope of the control group is zero, indicating that the predictor cannot predict the intervention success in the control group (because there was no intervention). The slope of the experimental group should be larger in Example 3 than in Example 2 because the impact of the predictor in the experimental group can now be isolated from the impact in the control group, which is why the power to detect the effect is overall higher than in the other two examples.
To conclude, we strongly favor a regression model with the following predictors: PI, Group, pretest score, and PI × Group. For an overview of all calculated models for the different dependent variables, reliability scores, and sample sizes see Supplementary Material Tables 1–32.
Aim 2: the choice of an adequate criterion variable for the regression model
As a recent systematic review on prognostic factors of performance changes after memory training in healthy older adults could show, the type of dependent variables used for prognostic factor calculations differs across different studies [17]. Posttest scores, change scores, residual scores, and relative change scores were used to measure performance changes. Yet, all these types of dependent variables have different implications as regards content and interpretation.
In a classical pre-post design, which underlies most studies on CT, the posttest score seems to be an established dependent variable in multiple regression analyses measuring training success. However, using the posttest score (that is, performance after training/intervention) answers the question “Is x a likely cause of y?” [24], but does not refer to gain. Furthermore, imagine that an external predictor such as PI emerged as a significant predictor of the posttest score in the experimental group. Would that indicate that the external predictor can predict the intervention success? Not necessarily, because the predictor might just be related to the construct captured by the posttest score. In this case, one would also find that PI is similarly related to the pretest score in the experimental group. Furthermore, an external predictor such as PI could be related to the posttest score in both the experimental group and the control group. Thus, finding a significant effect of PI on the posttest score in the experimental group is necessary, but not sufficient, to conclude that PI can predict the intervention success.
Absolute change scores (posttest minus pretest performance) answer the question “whose score is most likely to increase/decrease over time?”, therefore directly referring to intervention gain [24]. Yet, change scores have been heavily criticized: subtracting pretest scores from posttest scores is said to lead to fallacious conclusions because the resulting differences are systematically related to random measurement error [25] and are sensitive to regression to the mean. However, these criticisms are unfounded under a plausible regression model, which does not integrate the dependent variable as an independent variable [26]. Also, with the advent of structural equation modeling, which permits modeling of error-free constructs, much of the criticism of change scores in the literature has decreased further [27]. Change scores are easy to interpret (changes in the individual’s level of performance [28]) and may help to remove unexplained variance. Change score models are appropriate whenever pretest scores can be assumed to remain stable over time if no treatment occurs, that is, when pretest scores are useful baseline measures [29].
A further type of dependent variable that may be used in studies investigating intervention success is the relative change score. Relative change scores are norm-referenced, a property that is inherent in traditional reliability or generalizability coefficients [28]. They can be interpreted in terms of how much progress an individual has made in comparison to others. Therefore, the focus is not on changes in the individual’s performance, but on comparisons to others. Yet, our simulations demonstrated that relative change scores are more vulnerable to the methodological artifact described in [18] than absolute change scores. The probability of detecting a significant negative regression coefficient for the pretest score was consistently higher for relative change scores than for absolute change scores, regardless of sample size, regression model used, or level of reliability. Keep in mind that we did not model a relationship between the pretest score and the intervention success when simulating the data. A significant negative regression coefficient is thus an alpha error. Similarly, the power to detect a significant PI × Group interaction effect was consistently higher for absolute change scores than for relative change scores, regardless of sample size, regression model used, or level of reliability. Consequently, our simulations have shown that relative change scores are inferior to absolute change scores as criteria in regression models.
Residual scores, which are calculated by regressing dependent variable of a construct onto an assessment measured at baseline, provide a simple change score adjusted for baseline variance [30] and are in literature often referred to as a more appropriate method of measuring change in constructs over time than postpre change scores [31]. Yet, residual score models ask slightly different questions than the change score models: Residual score models assume that posttest scores are a linear function of pretest scores and that this function is not necessarily 1 [29].
Our simulations showed that when including the pretest score as a predictor in the regression model, the regression coefficients for the other predictors are identical for posttest scores, absolute change scores and residual scores. In other words, as long as the pretest score is a predictor in the regression model, it does not matter whether posttest scores, absolute change scores or residual scores serve as the criterion because they yield the same regression coefficients for the other predictors in the model (for a more thorough discussion of this phenomenon, see Aim 5).
Aim 3: the choice of an adequate sample size
We ran simulations with sample sizes of n = 50, 100, 150, 200, 250, 300, 400 and 500 participants to investigate the impact of sample size on the detection of a desired effect of PI or PI × Group (if the interaction term was included in the regression model). The results for each dependent variable, each regression model and each level of reliability are displayed in Fig. 3 (for PII and PII × Group, see Supplementary Material Figure S1). As expected, power increases with sample size, regardless of which dependent variable is used in the regression model. Further, as an overall trend, power also depends on reliability: as the reliability increases, a smaller sample size is needed to achieve the same level of power.
As depicted in Fig. 3, regression models that omit the pretest score require a larger sample size to achieve the same power as regression models that include it. This is the case for all dependent variables except one: when using the residual score as the dependent variable, there is nearly no difference in power between regression models that include or exclude the pretest score, because the pretest score is already incorporated in the dependent variable as a defining characteristic of the residual score.
Overall, Fig. 3 shows that, regardless of which dependent variable and which predictors (of the ones investigated here) are used in the calculation, it is important to use a sample size of at least n = 250 to n = 300, such that a power of at least .50 (independent of the reliability) is achieved. Because the reliability of the measures used is often either unknown or not well established in experimental designs and/or research on new clinical patient groups, a sample size of n = 250 ensures at least moderate power in the worst case that the dependent measure is less reliable than desired (see Note 1). Yet, when the change score is used as the dependent variable and the reliability is rather low (.60/.70), a sample size of n = 300 seems even more appropriate to achieve good power. It is important to always calculate and report the reliabilities of the instruments used to ensure good scientific practice and to help other researchers better understand and evaluate the results.
Aim 4: the role of reliability of the measurement instruments
The simulations show that a relatively large sample size is required to achieve adequate power to detect a true effect, which is often difficult to achieve in scientific practice. However, the simulations also illustrate that adequate power can be achieved not only by increasing the sample size, but also by selecting more reliable measures. While increasing the sample size mostly decreases the standard error, which in turn leads to increased power, increasing reliability also increases the estimates of the regression coefficient, i.e. the estimate and its entire confidence interval are shifted away from zero, making it more likely that a true effect is detected (see Fig. 4 for the regression coefficients of PI or PI × Group as a function of dependent variable, regression model, sample size and reliability, and Supplementary Material Figure S2 for PII or PII × Group).
Although at first sight, this observation might cause confusion, it can easily be explained by the fact that imperfectly reliable measures limit the maximum correlation that can be observed [32]. For example, assuming a true correlation of r = .50 between two variables that were measured with a reliability of r_{tt} = .60, the observed correlation will amount to r = .30, i.e. the true correlation multiplied by the square root of the product of both reliabilities [32]. Increasing the reliability to r_{tt} = .90, the observed correlation will amount to r = .45, approximating the true correlation of r = .50.
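The arithmetic of this example can be reproduced in a few lines. The following is a minimal sketch, assuming the attenuation relationship stated above (observed correlation = true correlation times the square root of the product of both reliabilities); the function name `attenuated_correlation` is a hypothetical helper introduced here for illustration only.

```python
import math


def attenuated_correlation(true_r, rel_x, rel_y):
    """Expected observed correlation given the true correlation and
    the reliabilities of the two measures (attenuation relationship)."""
    return true_r * math.sqrt(rel_x * rel_y)


# Both variables measured with the same reliability, as in the text
for rel in (0.60, 0.90):
    observed = attenuated_correlation(0.50, rel, rel)
    print(f"reliability = {rel:.2f}: observed r = {observed:.2f}")
```

With the values from the text, the sketch returns the same observed correlations of .30 and .45.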
Employing more reliable measurements in research thus not only increases the probability of detecting a true effect, but also reduces the bias, because true effects are estimated more precisely (see also Fig. 5 and Supplementary Material Figure S3). Note that reliability can not only be increased by employing more reliable measures, but also by repeating measures or by assessing a construct of interest by multiple tests instead of only one test [33, 34]. In other words, if a researcher wishes to increase the power of their study, but it is hardly possible to increase the sample size, they could increase the number of measures/measurements instead.
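How much reliability is gained by repeating or combining measurements can be approximated with the Spearman-Brown formula. The sketch below assumes that this is the relationship the cited sources [33, 34] refer to and that the combined measurements are parallel (equal reliability, uncorrelated errors); the function name `spearman_brown` is a hypothetical helper, not part of the original analyses.

```python
def spearman_brown(reliability, k):
    """Predicted reliability of a composite of k parallel measurements,
    each with the given single-measurement reliability."""
    return k * reliability / (1 + (k - 1) * reliability)


# Starting from a single measure with reliability .60
for k in (1, 2, 3, 4):
    print(f"k = {k}: composite reliability = {spearman_brown(0.60, k):.2f}")
```

Under these assumptions, doubling the number of measurements raises a reliability of .60 to .75, which illustrates why adding measurements can partly substitute for a larger sample.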
Aim 5: the special role of the pretest score as a predictor in a multiple regression
Studying Table 2 (or Tables 1–32 in the Supplementary Material), a striking observation is that whenever the pretest score is included in a regression model, the regression coefficients for the other predictors yield the exact same results independent of the criterion, apart from the relative change because relative change is measured on another scale than the other three criteria. This suggests that whenever the pretest score is a predictor in the model, the choice of the criterion (among posttest score, residual score, and absolute change score) is redundant.
Furthermore, the regression coefficients of the pretest score for the posttest score and for the absolute change score are a linear transformation of each other: the coefficient for the posttest score equals the coefficient for the absolute change score plus one. Note that although we did not model a negative relationship between the pretest score and the absolute change, a negative regression coefficient emerges consistently and, for larger sample sizes, has a high probability of reaching statistical significance, inviting the faulty interpretation in favor of a compensation effect.
In the following, we briefly explain both observations mathematically. The regression equation for a model with the posttest score (T_{2}) as the criterion and the centered pretest score (T_{1}) and any other variable V as predictors can be written as follows:
\[ T_2 = b_0 + b_1 T_1 + b_2 V \tag{1} \]
with b_{0} indicating the intercept, b_{1} the regression coefficient for the pretest score, and b_{2} the regression coefficient for the additional predictor. Analogously, the regression equation for a model with the absolute change score (\( {T}_2-\left({T}_1+\overline{T_{1_{nc}}}\right) \)) as the criterion and the pretest score and another variable as predictors can be written as follows:
\[ T_2 - \left(T_1 + \overline{T_{1_{nc}}}\right) = c_0 + c_1 T_1 + c_2 V \tag{2} \]
with c_{0} indicating the intercept, c_{1} the regression coefficient for the pretest score, and c_{2} the regression coefficient for the additional variable. Note that the absolute change score is computed by subtracting the noncentered pretest score (\( {T}_1+\overline{T_{1_{nc}}} \)) from the noncentered posttest score T_{2} and that the noncentered pretest score consists of the centered pretest score T_{1} plus the mean of the noncentered pretest score (\( \overline{T_{1_{nc}}} \); “nc” for “noncentered”). Resolving Eq. (2) for T_{2} and combining Eqs. (1) and (2) results in
\[ b_0 + b_1 T_1 + b_2 V = c_0 + c_1 T_1 + c_2 V + T_1 + \overline{T_{1_{nc}}} \]
which equals
\[ 0 = \left(c_0 - b_0 + \overline{T_{1_{nc}}}\right) + \left(c_1 - b_1 + 1\right) T_1 + \left(c_2 - b_2\right) V \]
For this equation to be true for all values of T_{1} and V, the terms (c_{1} − b_{1} + 1) and (c_{2} − b_{2}) each have to equate to zero (assuming the absence of multicollinearity, a formal prerequisite for a multiple regression analysis), giving
\[ c_1 = b_1 - 1 \]
and
\[ c_2 = b_2 \]
This also implies that the term \( {c}_0-{b}_0+\overline{T_{1_{nc}}} \) also has to equate to zero, giving
\[ b_0 = c_0 + \overline{T_{1_{nc}}} \]
First, these mathematical equations show that when the pretest score is included as a predictor in the regression model, the regression coefficients for the other predictors are identical for the posttest score and the absolute change score as criteria (assuming that formal prerequisites for multiple regression analyses are met).
Second, the intercepts can be linearly transformed into each other. The intercept for the posttest score as the criterion equals the intercept for the absolute change score as the criterion plus the mean of the noncentered pretest score. In case the continuous predictors are not centered prior to entering them into the regression models, the intercepts will be identical for the posttest score and the absolute change score as criteria.
Third and most importantly, the regression coefficients of the pretest score for both criteria are a linear transformation of each other. Considering that the coefficient b_{1} reflects the relationship between the pretest score and the posttest score, it can be interpreted as an estimate of the (test-retest) reliability. The coefficient c_{1}, which reflects the relationship between the pretest score and the absolute change score, is thus always negative, because the reliability can never exceed 1 and because we have shown that c_{1} = b_{1} − 1. Furthermore, this relationship paradoxically implies that the relationship between the pretest score and the change score is stronger (more negative) when the reliability of the measure is lower.
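These three relationships can be checked numerically. The following is a minimal sketch on arbitrary simulated data, not the simulation code used for this article; the helper `ols`, the variable names, and all data-generating values are assumptions made for illustration. It fits the same predictors (intercept, centered pretest score, V) to the posttest score, the absolute change score, and a residual score, and compares the coefficients.

```python
import random


def ols(rows, y):
    """Least-squares coefficients via the normal equations,
    solved with Gauss-Jordan elimination (fine for a few predictors)."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y]
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        for r in range(k):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [x - f * p for x, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


random.seed(1)
n = 200
t1_nc = [random.gauss(50, 10) for _ in range(n)]   # noncentered pretest
v = [random.gauss(0, 1) for _ in range(n)]         # additional predictor V
t2 = [0.7 * p + 2.0 * x + random.gauss(15, 5)      # noncentered posttest
      for p, x in zip(t1_nc, v)]

mean_t1 = sum(t1_nc) / n
t1 = [p - mean_t1 for p in t1_nc]                  # centered pretest
change = [post - pre for post, pre in zip(t2, t1_nc)]  # absolute change score

X = [[1.0, p, x] for p, x in zip(t1, v)]
b0, b1, b2 = ols(X, t2)          # Eq. (1): posttest as criterion
c0, c1, c2 = ols(X, change)      # Eq. (2): change score as criterion

# Residual score: posttest residualized on the pretest score alone
a0, a1 = ols([[1.0, p] for p in t1], t2)
resid = [post - (a0 + a1 * p) for post, p in zip(t2, t1)]
r0, r1, r2 = ols(X, resid)

print("c1 - (b1 - 1):       ", c1 - (b1 - 1))
print("c2 - b2:             ", c2 - b2)
print("b0 - (c0 + mean_t1): ", b0 - (c0 + mean_t1))
print("r2 - b2:             ", r2 - b2)
```

All four printed differences should be zero up to floating-point rounding, whatever data-generating values are chosen, because the identities follow from the algebra rather than from the particular data.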
Smoleń et al. (2018) note that many of the correlations between pretest scores and absolute change scores reported in the literature to support the compensation account are suspiciously high, especially considering the theoretical limit of observable correlations given the imperfect reliability of psychological measures [18]. Here, we have demonstrated that these high correlations might in fact reflect low reliabilities of the measures used in the respective studies, which is in line with Smoleń et al.’s mathematical demonstrations of why negative correlations between pretest scores and absolute change scores emerge naturally [18].
Discussion
As prognostic research, and especially studies on parameters predicting the success of CT (or of pharmacological and nonpharmacological interventions in general), has attracted considerable scientific interest over the past few years, the present paper aimed to systematically present and discuss the different types of regression models and dependent variables used, as well as the influence of the reliability of measures, sample size, and the specific role of baseline measurements (pretest scores) as predictors in multiple regressions to account for changes after interventions. Using simulations and mathematical derivations, we showed that (Aim 1) a regression model including PI, Group, pretest score, and PI × Group as predictors seems most convenient when investigating predictors of changes after interventions such as CT, and that (Aim 2) the absolute change score should be used as the dependent variable. Further, (Aim 3) studies should use a sample size of at least n = 250, and (Aim 4) researchers should attend to the reliability of the measures used and its impact on the calculations. Finally, (Aim 5) although the pretest score should be included as a predictor in the regression model for reasons of statistical power, its coefficient should not be interpreted, because chances are high that a negative and statistically significant regression coefficient emerges even if there is no true relationship.
In clinical research, especially when investigating specific patient populations, it is often difficult to recruit large samples. For some patient populations or areas, a sample size of n = 250 is utterly unrealistic. Yet, one has to be aware that multiple regression analyses conducted to detect possible predictors of intervention success in a relatively small sample lack power. It is therefore even more important to ensure a high reliability of the clinical tests and paradigms used. This implies that already established tests have to be validated regarding their reliability norms when used in “new” clinical populations for which no test norms are available. Further, the reliabilities of the tests used should always be reported, as they may help to judge whether the regression coefficient for the pretest score is purely a statistical artefact or might reflect a relationship that persists beyond the statistical artefact. In the context of cumulative research evidence, it is also highly important to report and publish studies with small sample sizes that only or mostly show nonsignificant prognostic effects. These studies can also contribute to cumulative research findings (e.g. in meta-analyses). This cumulative gain of knowledge is further facilitated if a joint methodological approach such as the one we suggest here is used, as this makes statistical results more comparable across separate studies.
Our simulations (and the subsequent mathematical proof) also showed that unless the measures are perfectly reliable, there will always be a negative regression coefficient for the pretest score predicting the absolute change score, even when there is no true relationship between them. In fact, the regression coefficient becomes more negative the less reliable the measures are. Thus, the negative regression coefficient should never be interpreted in favour of the compensation hypothesis. Our results support the concerns raised by Smoleń et al. (2018) regarding the validity of the evidence reported in the literature in favour of the compensation hypothesis.
In medical research, guidelines for prognostic research exist [35], which focus in detail on the design, conduct, and reporting of prognostic factor research, thereby differentiating between prognostic factor studies (a single prognostic factor that aims to predict a future outcome) and prognostic model studies (defined as a set of multiple prognostic factors to predict a future outcome). Yet, until now, there has been no clear recommendation on the specific statistical methods that should be used when calculating multiple regressions to investigate these predictors in the realm of CT. Our present paper emphasizes the need to choose an adequate dependent variable for prognostic research on different continuous outcomes after specific interventions and gives recommendations regarding the choice of an adequate regression model, as well as adequate sample size, reliability of outcome measures, and integration of baseline measurements. Therefore, when conducting prognostic research, a clear statistical rationale should be provided. Furthermore, the present recommendations as well as the already existing medical guidelines on prognostic research should also be adapted for studies conducted in other fields (e.g. neuropsychology) to ensure good practice and reporting of prognostic studies and results.
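The artefact can be illustrated with a small simulation. The sketch below is not the simulation code used for this article; it assumes a simple classical test theory setup in which every person has a stable true score, pretest and posttest add independent measurement error, and there is no true change at all. The function name `change_on_pretest_slope` and all numeric settings are assumptions made for illustration.

```python
import random


def change_on_pretest_slope(reliability, n=20000, seed=0):
    """Slope from regressing the change score on the pretest score when
    true scores are stable and only measurement error differs."""
    rng = random.Random(seed)
    true_sd = 1.0
    # Error SD chosen so that var(true) / var(observed) equals the reliability
    err_sd = (true_sd ** 2 * (1 - reliability) / reliability) ** 0.5
    pre, change = [], []
    for _ in range(n):
        true = rng.gauss(0, true_sd)          # stable true score
        t1 = true + rng.gauss(0, err_sd)      # observed pretest
        t2 = true + rng.gauss(0, err_sd)      # observed posttest
        pre.append(t1)
        change.append(t2 - t1)
    mean_pre = sum(pre) / n
    mean_change = sum(change) / n
    cov = sum((p - mean_pre) * (c - mean_change) for p, c in zip(pre, change))
    var = sum((p - mean_pre) ** 2 for p in pre)
    return cov / var


for rel in (0.6, 0.7, 0.8, 0.9):
    slope = change_on_pretest_slope(rel)
    print(f"reliability = {rel:.1f}: slope = {slope:.3f} (reliability - 1 = {rel - 1:.1f})")
```

Under these assumptions the expected slope equals the reliability minus one, so the simulated slopes should be close to −.4, −.3, −.2 and −.1: negative in every case and more negative for less reliable measures, although no compensation effect was built into the data.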
Limitations
We are aware that the results of simulations strongly depend on the input to the simulations. In our case, we explicitly modelled an effect of the external predictor on the absolute change score in the experimental group. This decision was based on profound theoretical considerations. While it may not be surprising that the result of simulations favoured the inclusion of the interaction between PI and the Group, and the absolute change score as the criterion, the simulations demonstrated the consequences of applying a range of statistical models (different combinations of predictors and criteria) to data that were generated by a different true model. Furthermore, we hope to have conveyed why we believe the true model we chose was the most reasonable of the models we considered in our simulations.
Conclusion and recommendations
We systematically investigated the impact of different regression models, dependent variables, sample sizes and levels of reliability on the conclusions drawn from the respective analyses. Extensive simulations allowed us to derive well-considered recommendations for future analysis of data in one of the most common experimental designs in research on CT and prediction of CT success. Furthermore, we mathematically showed that the choice of dependent variable is redundant if the pretest score is a predictor in the regression model, but that the corresponding regression coefficient should not be interpreted, preventing unjustified conclusions.
For future prognostic studies on predictors of changes after an intervention, we thus recommend the following analysis pipeline: Prior to data collection, determine the required sample size by considering the effect sizes you expect (e.g. based on previous findings) and the reliability of the measures you employ. Compute the absolute change scores and enter them as the criterion in a regression model. Include the pretest scores, the group variable, the external predictor variables which you want to investigate, and the interactions between the external predictor variables and the group variable as predictors in the regression model. If you find a significant interaction effect, perform a post-hoc analysis. If the external predictor variable is able to predict the intervention success, it should only be related to the outcome variable in the experimental group, but not in the control group. Do not interpret the regression coefficient of the pretest score, since it will always be negative (if your pretest and posttest scores correlate positively). Keep in mind that less reliable pre- and posttest scores will produce a larger (negative) regression coefficient, regardless of whether there is a true pretest score effect on the change score or not. Apart from reporting the sample size, also report the reliability of the employed measures, as it has a considerable impact on the probability of detecting a true effect and should thus be made accessible to your readers.
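The model-building steps of this pipeline can be sketched as follows. This is an illustrative example on simulated data, not the analysis code deposited with this article; the helper `ols`, the variable names, and every data-generating value (effect sizes, error variances, n = 2000) are assumptions. Significance tests are omitted for brevity; in practice, standard errors and p-values would come from a statistics package.

```python
import random


def ols(rows, y):
    """Least-squares coefficients via the normal equations,
    solved with Gauss-Jordan elimination (fine for a few predictors)."""
    k = len(rows[0])
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        for r in range(k):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [x - f * p for x, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


random.seed(7)
n = 2000
data = []
for i in range(n):
    group = i % 2                       # 0 = control, 1 = experimental
    pi = random.gauss(0, 1)             # external predictor under investigation
    true = random.gauss(50, 8)          # stable true score
    pre = true + random.gauss(0, 3)     # pretest with measurement error
    # Assumed truth: the predictor matters only in the experimental group
    post = true + 2.0 * group + 1.5 * group * pi + random.gauss(0, 3)
    data.append((group, pi, pre, post))

mean_pre = sum(d[2] for d in data) / n

# Criterion: absolute change score
change = [post - pre for group, pi, pre, post in data]
# Predictors: intercept, centered pretest, Group, predictor, predictor x Group
X = [[1.0, pre - mean_pre, group, pi, pi * group]
     for group, pi, pre, post in data]

b_0, b_pre, b_group, b_pi, b_inter = ols(X, change)

# Post-hoc view: slope of the predictor within each group
slope_control = b_pi
slope_experimental = b_pi + b_inter

print(f"pretest coefficient (do not interpret): {b_pre:.3f}")
print(f"predictor x Group interaction:          {b_inter:.3f}")
print(f"slope in control group:                 {slope_control:.3f}")
print(f"slope in experimental group:            {slope_experimental:.3f}")
```

In this setup the interaction estimate should lie near the simulated value of 1.5 and the control-group slope near zero, while the pretest coefficient comes out negative even though no pretest effect on change was simulated.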
Availability of data and materials
The datasets generated and analysed during the current study are available in the Open Science Framework (OSF) repository: www.osf.io/p54j3
Notes
1. As a matter of good scientific practice, it is of course important to ensure that all tests and dependent measures used have a moderate to high reliability established for the participant or patient group under investigation. See also “Aim 4: the role of reliability of the measurement instruments”.
Abbreviations
CT: Cognitive Training
SD: Standard Deviation
SE: Standard Error
References
1. Martin M, Clare L, Altgassen AM, Cameron MH, Zehnder F. Cognition-based interventions for healthy older people and people with mild cognitive impairment. Cochrane Database Syst Rev. 2011:CD006220. https://doi.org/10.1002/14651858.CD006220.pub2.
2. Clare L, Woods RT, Moniz Cook ED, Orrell M, Spector A. Cognitive rehabilitation and cognitive training for early-stage Alzheimer's disease and vascular dementia. Cochrane Database Syst Rev. 2003:CD003260. https://doi.org/10.1002/14651858.CD003260.
3. Bamidis PD, Vivas AB, Styliadis C, Frantzidis C, Klados M, Schlee W, et al. A review of physical and cognitive interventions in aging. Neurosci Biobehav Rev. 2014;44:206–20. https://doi.org/10.1016/j.neubiorev.2014.03.019.
4. Kallio EL, Öhman H, Kautiainen H, Hietanen M, Pitkälä K. Cognitive training interventions for patients with Alzheimer's disease: a systematic review. J Alzheimers Dis. 2017;56:1349–72. https://doi.org/10.3233/JAD-160810.
5. Leung IHK, Walton CC, Hallock H, Lewis SJG, Valenzuela M, Lampit A. Cognitive training in Parkinson disease: a systematic review and meta-analysis. Neurology. 2015;85:1843–51. https://doi.org/10.1212/WNL.0000000000002145.
6. Bherer L. Cognitive plasticity in older adults: effects of cognitive training and physical exercise. Ann N Y Acad Sci. 2015;1337:1–6. https://doi.org/10.1111/nyas.12682.
7. Altman DG, Lyman GH. Methodological challenges in the evaluation of prognostic factors in breast cancer. Breast Cancer Res Treat. 1998;52:289–303. https://doi.org/10.1023/A:1006193704132.
8. Lipkovich I, Dmitrienko A, B R. Tutorial in biostatistics: data-driven subgroup identification and analysis in clinical trials. Stat Med. 2017;36:136–96. https://doi.org/10.1002/sim.7064.
9. Sandberg P, Rönnlund M, Derwinger-Hallberg A, Stigsdotter NA. Memory plasticity in older adults: cognitive predictors of training response and maintenance following learning of number-consonant mnemonic. Neuropsychol Rehabil. 2016;26:742–60. https://doi.org/10.1080/09602011.2015.1046459.
10. Langbaum JBS, Rebok GW, Bandeen-Roche K, Carlson MC. Predicting memory training response patterns: results from ACTIVE. J Gerontol B Psychol Sci Soc Sci. 2009;64:14–23. https://doi.org/10.1093/geronb/gbn026.
11. O’Hara R, Brooks JO, Friedman L, Schröder CM, Morgan KS, Kraemer HC. Long-term effects of mnemonic training in community-dwelling older adults. J Psychiatr Res. 2007;41:585–90. https://doi.org/10.1016/j.jpsychires.2006.04.010.
12. Mohs RC, Ashman TA, Jantzen K, Albert M, Brandt J, Gordon B, et al. A study of the efficacy of a comprehensive memory enhancement program in healthy elderly persons. Psychiatry Res. 1998;77:183–95. https://doi.org/10.1016/S0165-1781(98)00003-1.
13. Neely AS, Bäckman L. Effects of multifactorial memory training in old age: generalizability across tasks and individuals. J Gerontol B Psychol Sci Soc Sci. 1995;50:P134–40. https://doi.org/10.1093/geronb/50b.3.p134.
14. Matysiak O, Kroemeke A, Brzezicka A. Working memory capacity as a predictor of cognitive training efficacy in the elderly population. Front Aging Neurosci. 2019;11:126. https://doi.org/10.3389/fnagi.2019.00126.
15. Lövdén M, Brehmer Y, Li SC, Lindenberger U. Training-induced compensation versus magnification of individual differences in memory performance. Front Hum Neurosci. 2012;6:141. https://doi.org/10.3389/fnhum.2012.00141.
16. Zinke K, Zeintl M, Rose NS, Putzmann J, Pydde A, Kliegel M. Working memory training and transfer in older adults: effects of age, baseline performance, and training gains. Dev Psychol. 2014;50:304–15. https://doi.org/10.1037/a0032982.
17. Roheger M, Folkerts AK, Krohm F, Skoetz N, Kalbe E. Prognostic factors for change in memory test performance after memory training in healthy older adults: a systematic review and outline of statistical challenges. Diagn Progn Res. 2020;4:7. https://doi.org/10.1186/s41512-020-0071-8.
18. Smoleń T, Jastrzebski J, Estrada E, Chuderski A. Most evidence for the compensation account of cognitive training is unreliable. Mem Cogn. 2018;46:1315–30. https://doi.org/10.3758/s13421-018-0839-z.
19. Rebok GW, Ball K, Guey LT, Jones RN, Kim HY, King JW, et al. Ten-year effects of the advanced cognitive training for independent and vital elderly cognitive training trial on cognition and everyday functioning in older adults. J Am Geriatr Soc. 2014;62:16–24. https://doi.org/10.1111/jgs.12607.
20. Roheger M, Meyer J, Kessler J, Kalbe E. Predicting short- and long-term cognitive training success in healthy older adults: who benefits? Neuropsychol Dev Cogn B Aging Neuropsychol Cogn. 2020;27:351–69. https://doi.org/10.1080/13825585.2019.1617396.
21. Novick MR. The axioms and principal results of classical test theory. J Math Psychol. 1966;3:1–18. https://doi.org/10.1016/0022-2496(66)90002-2.
22. Hedge C, Powell G, Sumner P. The reliability paradox: why robust cognitive tasks do not produce reliable individual differences. Behav Res Methods. 2018;50:1166–86. https://doi.org/10.3758/s13428-017-0935-1.
23. Nunnally JC Jr. Introduction to psychological measurement; 1970.
24. Lord FM. A paradox in the interpretation of group comparisons. Psychol Bull. 1967;68:304–5. https://doi.org/10.1037/h0025105.
25. Cronbach LJ, Furby L. How we should measure "change": or should we? Psychol Bull. 1970;74:68–80. https://doi.org/10.1037/h0029382.
26. Allison PD. Change scores as dependent variables in regression analysis. Sociol Methodol. 1990;20:93. https://doi.org/10.2307/271083.
27. Castro-Schilo L, Grimm KJ. Using residualized change versus difference scores for longitudinal research. J Soc Pers Relat. 2018. https://doi.org/10.1177/0265407517718387.
28. Miller TB, Kane M. The precision of change scores under absolute and relative interpretations. Appl Meas Educ. 2001;14:307–27. https://doi.org/10.1207/S15324818AME1404_1.
29. Gollwitzer M, Christ O, Lemmer G. Individual differences make a difference: on the use and the psychometric properties of difference scores in social psychology. Eur J Soc Psychol. 2014;44:673–82. https://doi.org/10.1002/ejsp.2042.
30. Prochaska JJ, Velicer WF, Nigg CR, Prochaska JO. Methods of quantifying change in multiple risk factor interventions. Prev Med. 2008;46:260–5. https://doi.org/10.1016/j.ypmed.2007.07.035.
31. Rowan AA, McDermott MS, Allen MS. Intention stability assessed using residual change scores moderates the intention-behaviour association: a prospective cohort study. Psychol Health Med. 2017;22:1256–61. https://doi.org/10.1080/13548506.2017.1327666.
32. Spearman C. The proof and measurement of association between two things. Am J Psychol. 1904;15:72. https://doi.org/10.2307/1412159.
33. Brown W. Some experimental results in the correlation of mental abilities. Br J Psychol, 1904-1920. 1910;3:296–322. https://doi.org/10.1111/j.2044-8295.1910.tb00207.x.
34. Spearman C. Correlation calculated from faulty data. Br J Psychol, 1904-1920. 1910;3:271–95. https://doi.org/10.1111/j.2044-8295.1910.tb00206.x.
35. Riley RD, Hayden JA, Steyerberg EW, Moons KGM, Abrams K, Kyzas PA, et al. Prognosis research strategy (PROGRESS) 2: prognostic factor research. PLoS Med. 2013;10:e1001380. https://doi.org/10.1371/journal.pmed.1001380.
Acknowledgements
Not applicable.
Funding
No funding was received for designing the study or collecting, analyzing or interpreting the data or writing the manuscript. Open Access funding enabled and organized by Projekt DEAL.
Author information
Contributions
AM designed the simulations, generated the data sets, analysed the data sets, interpreted the results, and wrote the manuscript. MR conceptualized the research idea, interpreted the results, and wrote the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Mattes, A., Roheger, M. Nothing wrong about change: the adequate choice of the dependent variable and design in prediction of cognitive training success. BMC Med Res Methodol 20, 296 (2020). https://doi.org/10.1186/s12874-020-01176-8