 Research
 Open Access
 Published:
The impact of moderator by confounder interactions in the assessment of treatment effect modification: a simulation study
BMC Medical Research Methodology volume 22, Article number: 88 (2022)
Abstract
Background
When performed in an observational setting, treatment effect modification analyses should account for all confounding, where possible. Often, such studies only consider confounding between the exposure and outcome. However, there is scope for misspecification of the confounding adjustment when estimating moderation as the effects of the confounders may themselves be influenced by the moderator. The aim of this study was to investigate bias in estimates of treatment effect modification resulting from failure to account for an interaction between a binary moderator and a confounder on either treatment receipt or the outcome, and to assess the performance of different approaches to account for such interactions.
Methods
The theory behind the reason for bias and factors that impact the magnitude of bias is explained. Monte Carlo simulations were used to assess the performance of different propensity scores adjustment methods and regression adjustment where the adjustment 1) did not account for any moderatorconfounder interactions, 2) included moderatorconfounder interactions, and 3) was estimated separately in each moderator subgroup. A realworld observational dataset was used to demonstrate this issue.
Results
Regression adjustment and propensity score covariate adjustment were sensitive to the presence of moderatorconfounder interactions on outcome, whilst propensity score weighting and matching were more sensitive to the presence of moderatorconfounder interactions on treatment receipt. Including the relevant moderatorconfounder interactions in the propensity score (for methods using this) or the outcome model (for regression adjustment) rectified this for all methods except propensity score covariate adjustment. For the latter, subgroupspecific propensity scores were required. Analysis of the realworld dataset showed that accounting for a moderatorconfounder interaction can change the estimate of effect modification.
Conclusions
When estimating treatment effect modification whilst adjusting for confounders, moderatorconfounder interactions on outcome or treatment receipt should be accounted for.
Introduction
Treatment effect modification (TEM) occurs when the effect of treatment on an outcome is influenced by a third variable, termed a moderator. Such an analysis can identify patients who are more likely to benefit or be harmed from treatment. In some cases, a moderator may have a strong scientificrationale and their investigation is prespecified. For example, in their randomised controlled trial protocol, Kyle et al. hypothesised that age may moderate the effect of cognitive behavioural therapy for insomnia on cognitive functioning outcome [1]. In other cases, researchers may investigate several potential moderators in an exploratory manner at the analysis stage. TEM is typically evaluated by including a product term (statistical interaction) between treatment and the moderator in a regression model applied to the full cohort of patients in the study.
Randomised clinical trials provide the best evidence regarding the causal effects of treatments, and thus also the best evidence for the existence of TEM, although if the causal effect of a moderator is of interest, treatment randomisation does not ensure unbiased estimation of this [2]. Observational studies however are more feasible for many research questions, for example when investigating rare treatment sideeffects or assessing realworld effectiveness. Observational studies require appropriate adjustment of confounders in order for valid inference on the causal effect of treatments to be made [3]. Confounders are often accounted for via regression adjustment in a multivariable regression model or, increasingly, by the use of propensity scores [4].
Observational studies attempt to adjust for confounding of the treatment–outcome relationship but often do not consider any additional features required to unbiasedly evaluate TEM [5]. Here, we focus on the situation where the moderator not only influences the relationship between the treatment and outcome, but also the relationship between a confounder and either treatment receipt or the outcome.
Figure 1 illustrates the concept where path A represents the moderator influencing the effect of a confounder on treatment receipt and path B represents the moderator influencing the effect of a confounder on the outcome. If the moderator influences the effect of the confounder on treatment receipt, this implies that the way in which the confounder influences the decision to prescribe treatment varies across the moderator, e.g. obesity (X) may discourage clinicians from prescribing a specific treatment (T) in women more than in men (M). If the moderator influences the effect of the confounder on outcome, this implies that the relationship between the confounder and the outcome varies across the moderator, e.g. obesity (X) may increase the risk of heart disease (Y) by a larger amount in men than in women (M). In many cases, the moderator will itself be a confounder – although not necessarily. For example, a treatment may be more effective in reducing cardiovascular disease in older people than in younger people, (i.e. age moderates the effect of treatment), but age would only also be a confounder if it were associated with both receipt of treatment and the outcome.
If the differential effect of the confounder across the moderator is not accounted for, this may introduce bias into the estimate of treatment effect modification. Suppose M is binary and the effect of the confounder is greater in one subgroup of M than the other. Accounting for the overall effect of the confounder will lead to an underestimation of the effect of the confounder in one subgroup and an overestimation of the effect of the confounder in the other subgroup, which will lead to biased estimates of the subgroup treatment effects, and hence treatment effect modification. The magnitude of the bias will depend on the prevalence of the moderator, the relative sizes of the moderator and confounder effects, and the number of confounders which are influenced by the moderator. A more detailed explanation is given in the Supplementary material.
Failure to account for an interaction that exists between a moderator and confounder is essentially a misspecification problem. Both propensity score methods and regression adjustment are sensitive to model misspecification. Some research has indicated regression adjustment (without PS methods) is more sensitive to model specification than PS methods [6]. Furthermore, there is variation amongst PS methods: Greifer and Stuart say that matching methods are less sensitive to misspecification than weighting methods, as the former does not directly rely on the exact propensity score [7]. In practice, whether or not PS methods or regression adjustment will perform better under the corresponding model misspecification will likely vary by study, depending on how complex or well understood the treatment model or outcome model is and how much information is available to model each.
Consideration of differences in treatment assignment across subgroups of a moderator when applying propensity score (PS) methods to estimate subgroupspecific effects has been discussed previously [8,9,10,11, 12]. Radice et al. [9] and Krief et al. [8] showed that using subgroupspecific propensity scores for PS matching and inverse probability of treatment weighting (IPTW) resulted in better covariate balance and lower bias than when differences in treatment assignment were ignored. Wang et al. [11] and Green and Stuart [10] focussed on different PS matching techniques and similarly found that balance metrics were improved when differences in treatment assignment were accounted for, either by estimating subgroupspecific PS models or including moderatorconfounder interactions in a single PS model.
In this paper, we aim to add to this literature by additionally considering 1) PS covariate adjustment and regression adjustment as confounding adjustment methods, 2) situations where the moderator influences the relationship between the confounder and either/both the treatment (or exposure) and outcome, and 3) bias introduced into estimates of both TEM and subgroupspecific treatment effects. Patterns of bias under these different scenarios are explored in a simulation study. We discuss factors that influence the magnitude of bias and compare reduction in bias and precision of different approaches to accounting for moderatorconfounder interactions.
To investigate the impact of the issues discussed in practice, we compared the estimates of treatment effect modification in a real observational dataset when moderatorconfounder interactions were and were not included in the confounding adjustment. The dataset comprised information from the 201819 National Survey for Wales and the interest was in whether the effect of tinnitus on mental wellbeing was moderated by certain variables.
Methods
Simulation study
We performed a simulation study to confirm and demonstrate that 1) estimates of subgroupspecific treatment effects and TEM can be biased if the moderator also influences the effect of the confounders and this is unaccounted for, 2) the presence of bias depends on the confounding adjustment method and whether or not the moderator influences the effect of the confounder on treatment receipt or the effect of the confounder on the outcome, and 3) the impact of this bias depends on the prevalence of the moderator and the relative effect sizes of the treatment effect moderation. We compared the accuracy and precision of estimates between methods of accounting for the moderatorconfounder interactions.
Data generation
The simulated data comprised the following information on each patient: \(T\) – a binary indicator of treatment assignment (yes/no), \(Y\)  a continuous outcome measure, \(M\) – a binary treatment effect moderator, \({X}_{1}, {X}_{2}, {X}_{3}\)—three continuous confounding variables, and \({X}_{4}, {X}_{5}, {X}_{6}\) – three binary confounding variables. Throughout this paper, we assume no treatmentinduced confounding [13]. Datasets of size 1000 were generated to reflect a moderately large but realistic sample size and to allow patterns in the magnitude of the standard errors to be more easily assessed graphically.
The three continuous confounders \({X}_{1}, {X}_{2}, {X}_{3}\) were defined to follow a \(N(\mathrm{0,1})\) distribution. Binary confounders \({X}_{4}, {X}_{5}, {X}_{6}\) were defined to have prevalence 0.1, 0.25 and 0.5 respectively. \(M\) was simulated first with prevalence 0.5 and then 0.1.
An individual’s true probability of treatment was defined to depend on the main effects of the variables \({X}_{1}, \dots ,{X}_{6}, M\) and a product term between \(M\) and \({X}_{1}\) representing modification of the effect of \({X}_{1}\) on treatment receipt by \(M\):
\(p\) is the probability of treatment allocation (i.e. \(P(T=1{\varvec{X}}, M)\)), and \({e}_{1}\sim N(0, 0.2)\). The chosen values of the \(\alpha\) coefficients are given in Table 1. The binary treatment variable, \(T\), was generated via a Bernoulli distribution: \(T\sim \mathrm{Bernoulli}(p)\).
An individual’s outcome measure was defined to be dependent on the main effects of \(T\), \(M\) and \({X}_{1}, \dots ,{X}_{6}\), as well as a product term between \(T\) and \(M\) (representing the treatment effect modification effect) and a product term between \(M\) and \({X}_{1}\) representing modification of the effect of \({X}_{1}\) on the outcome by \(M\):
The chosen values of the \(\beta\) coefficients are given in Table 1. These were not based on a specific dataset but agreed as realistic values for an observational study.
The simulations considered three different values of the coefficients \({\alpha }_{M{X}_{1}}\) and \({\beta }_{M{X}_{1}}\) corresponding to null, moderate and large effect sizes of the \(M\times {X}_{1}\) term on treatment receipt and the \(M\times {X}_{1}\) term on outcome respectively (Table 1). Two values for \({\beta }_{TM}\) were chosen to represent small and moderate treatment effect modification effect respectively (Table 1). All other coefficient values were kept fixed. This resulted in 18 different combinations of modification effect sizes.
Confounding adjustment methods
Four methods of confounding adjustment were considered: (1) regression adjustment, (2) PS covariate adjustment (where the estimated PS score is added as a linear term to the outcome model), (3) inverse probability of treatment weighting (IPTW) using the propensity score, and (4) PS nearest neighbour onetoone matching, with replacement. For each, four confounding adjustment models were considered, the first three being (a) adjusting for the main effects of\(M\), \({X}_{1},\dots ,{X}_{6}\), but no product terms, (b) adjusting for the main effects of\(M\), \({X}_{1},\dots ,{X}_{6}\) and an \(M\times {X}_{1}\) product term, and (c) adjusting for the main effects of\(M\), \({X}_{1},\dots ,{X}_{6}\) and six product terms\({X}_{1}\times M, {X}_{2}\times M,\dots , {X}_{6}\times M\). The fourth model (d) was the adjustment of confounding separately within each subgroup of\(M\), thus separate linear models were fit in the two subgroups. This involved estimating subgroupspecific PS models. Here, the term ‘adjustment model’ refers to the propensity score for the three propensity score methods and the outcome model for regression adjustment.
The estimated individual propensity scores were obtained by fitting a logistic regression model, regressing treatment receipt on the set of confounders, including \(M\). The individual inverse probability of treatment weights were defined as the inverse of the probability of that individual receiving the treatment allocation they did receive. The nearest neighbour matching was performed with no specified calliper.
For each confounding model and method combination (16 in total), estimates of the subgroupspecific treatment effects, \({\widehat{\beta }}_{TM=1}\) and \({\widehat{\beta }}_{T M=0}\), and the treatmenteffect moderation estimate, \({\widehat{\beta }}_{TM}= {\widehat{\beta }}_{TM=1} {\widehat{\beta }}_{T M=0}\) were obtained via a linear regression model.
Parameter estimation
500 simulations were run per scenario (18 combinations of moderation effect sizes). This number of simulations was determined to be a conservative number required to detect a treatmentmoderator interaction effect size of 0.3 within an accuracy of 10% when the sample size was 1000 [14]. For each scenario, the mean of the 500 estimates of \({\widehat{\beta }}_{TM=1}\), \({\widehat{\beta }}_{TM=0}\) and \({\widehat{\beta }}_{TM}\) from the 16 confounding adjustment method/model combinations were obtained, along with the empirical standard error (calculated as the standard deviation of the estimates over the 500 simulations) and the average model standard error [15].
Applied example
To demonstrate the potential impact of accounting for interactions between a moderator and a confounder in practice, we used a dataset comprising information from the 2018–19 National Survey for Wales, a largescale crosssectional survey run annually by the Office for National Statistics on behalf of the Welsh Government. Participants are randomly selected from the population of Wales and asked a variety of questions regarding their health, lifestyle and interests. The anonymised data is available from the UK Data Service [16].
The specific example relates to the estimation of experiencing tinnitus on mental wellbeing. Tinnitus is a selfreported binary measure (experiences tinnitus or does not experience tinnitus) and mental wellbeing was measured using the WarwickEdinburgh Mental Wellbeing Scale, a numerical scale scored between 14–70 where a higher score indicates a higher level of mental wellbeing. We investigated whether the effect of tinnitus on mental wellbeing was moderated by three binary variables: gender, ethnicity (White/nonWhite) or current smoking status (currently smoke/currently do not smoke). Additional potential confounders accounted for were age (in years) and BMI (as a numerical variable).
A series of linear regression models were fitted including interactions between tinnitus and each of the three potential moderators. Confounding was adjusted for via both regression adjustment and IPTW. In the assessment of each of the three moderators, the other two potential moderators were included as potential confounders. Two confounding adjustment models for each method were specified, the first adjusting only for the main effects of each confounder, the second adjusting for the main effects of each confounder and interactions between the moderator and each confounder.
Software
The simulation study and analysis for the applied example were performed in Stata version 14 [17].
Results
Simulation study
The forest plots in Figs. 2 and 3 show the estimated values of \({\beta }_{TM=1}\), \({\beta }_{TM=0}\) and \({\beta }_{TM}\) for the various scenarios regarding the magnitude of the \(M\times {X}_{1}\) effect size on both treatment receipt and outcome and for each of the confounding adjustment methods and models tested, averaged over the 500 simulations, where \({\beta }_{TM=1}=1.8\), \({\beta }_{TM=0}=1.5\) and \({\beta }_{TM}=0.3\). The 95% confidence intervals (obtained using the estimated standard error assuming a normal distribution) are displayed. The prevalence \(M\) was 0.5 for the results in Fig. 2 and 0.1 for Fig. 3. Tables displaying these results are given the Supplementary material.
We first discuss the patterns of bias observed when no moderatorconfounder interactions were accounted for in the confounding adjustment (i.e. confounding model (a)). In Fig. 2 a nonzero \({\alpha }_{M{X}_{1}}\) did not introduce bias into either the subgroupspecific treatment effect estimates or the interaction effect estimate \({\widehat{\beta }}_{TM}\) when the confounding method was regression adjustment or PS covariate adjustment. However, there was bias in estimates obtained from these two methods when \({\beta }_{M{X}_{1}}>0\). Alternatively, for IPTW and PS matching, increasing \({\alpha }_{M{X}_{1}}\) from 0 did induce bias into the subgroup and the interaction effect. Increasing \({\beta }_{M{X}_{1}}\) also appeared to have a slight impact on the estimates from these latter two confounding adjustment methods. For all confounding adjustment methods, the magnitude of the bias increased as the effect size of the relevant moderatorconfounder interactions increased.
In this simulation study, the direction of bias was always positive for \({\widehat{\beta }}_{TM=1}\) and the interaction effect, \({\widehat{\beta }}_{TM}\), and negative for \({\widehat{\beta }}_{TM=0}\) as the impact of the confounder \({X}_{1}\) on both treatment receipt and the outcome was set to be larger when \(M=1\) (since both \({\alpha }_{M{X}_{1}}>0\) and \({\beta }_{M{X}_{1}}>0\)). When the average effect of \({X}_{1}\) was adjusted for equally in both groups of \(M\), this led to an underadjustment of \({X}_{1}\) when \(M=1\) and an overadjustment of \({X}_{1}\) when \(M=0\). Because the inclusion of \({X}_{1}\) attenuated the estimated treatment effect, this led to an overestimation of the treatment effect when \(M=1\) and an underestimation when \(M=0\).
When an \(M\times {X}_{1}\) interaction term was included in the confounding adjustment model, i.e. for confounding model (b), the bias in each three parameter estimates was substantially reduced for regression adjustment, IPTW and PS matching. For PS covariate adjustment, a similar pattern of bias was seen as for confounding adjustment model (a).
Including all possible moderatorconfounder interactions in the confounding adjustment model, i.e. adjustment model (c), and performing the stratified analysis, i.e. adjustment model (d), also resulted in substantially reduced levels of bias in the subgroupspecific treatment effects and \({\widehat{\beta }}_{TM}\) for regression adjustment, IPTW and PS matching. Confounding models (c) and (d) gave the exact same results for regression adjustment and IPTW due to their nature. Adjustment model (c) resulted in similar biased estimates to (b) for PS covariate adjustment. Only in the stratified analysis did PS covariate adjustment produce accurate estimates.
In Fig. 3, the magnitude of bias in the \({\widehat{\beta }}_{TM=1}\) subgroup treatment effect was larger than in Fig. 2 and the magnitude of bias in the \({\widehat{\beta }}_{TM=0}\) was smaller. The bias in the overall interaction effect was similar but overall the confidence intervals were wider in Fig. 3. Another discrepancy between Figs. 2 and 3 is seen for PS covariate adjustment using IPTW (model (c), method (2)). Even where \({\alpha }_{M{X}_{1}}=0\) and \({\beta }_{M{X}_{1}}=0\), all estimates were biased when the prevalence of the moderator was 0.1. Otherwise, Figs. 2 and 3 showed similar patterns of bias.
When \({\beta }_{TM=1}\), \({\beta }_{TM=0}\) and \({\beta }_{TM}\) were larger, the magnitude of bias was the same but the impact of the bias in the TEM estimate is smaller relative to its larger magnitude (supplementary tables 1, 2, 3, 4, 5, 6, 7, 8).
To more thoroughly compare the accuracy in estimates across all models for each adjustment method, we compared the mean bias in \({\widehat{\beta }}_{TM}\) across the set of \({\alpha }_{M{X}_{1}}\) and \({\beta }_{M{X}_{1}}\) values (Table 2).
As expected, the average bias tended to be highest when no moderatorconfounder interactions were adjusted for (adjustment model (a)). Overall, the average bias was still reasonably small in magnitude for adjustment model (a); however, this is the average over all \({\alpha }_{M{X}_{1}}\) and \({\beta }_{M{X}_{1}}\) values, and the bias was larger (up to 0.15 in magnitude) as \({\alpha }_{M{X}_{1}}\) and \({\beta }_{M{X}_{1}}\) increased.
The average bias in the estimation of \({\widehat{\beta }}_{TM}\) was the same (to 4dp) across adjustment models (b)(d) for regression adjustment. For IPTW, the bias was smaller when all moderatorconfounder interactions were accounted for and when a stratified analysis was performed than when only the one moderatorconfounder was accounted for, and more so when the prevalence of M was 0.5. There was not a clear pattern for PS matching.
The precision of the various estimates of \({\widehat{\beta }}_{T}\) and \({\widehat{\beta }}_{TM}\), i.e. how much confidence we have that sample estimates reflect the population parameter, can be most easily assessed in Figs. 2 and 3 and the results tables in the supplementary material by examining the width of the 95% confidence intervals. Confounding models (b), (c) and (d) gave similar levels of precision for estimates obtained via regression adjustment. For PS covariate adjustment, the precision (as well as the accuracy) of estimates was highest for confounding model (d), i.e. in the stratified analysis. For IPTW, the precision was higher for confounding models (c) and (d) than confounding models (a) and (b). For PS matching, the precision was highest for confounding model (d).
Comparing the empirical and average model standard error is a way of assessing bias in the estimation of the model standard error [15]. For regression adjustment, the average model standard errors are very close to the empirical standard errors across all models and scenarios, suggesting this methods accurately estimate the standard errors of the estimates. However, for the other methods using propensity scores, particularly IPTW and PS matching, the average model standard errors typically overestimated the empirical standard error, sometimes severely. The difference between the two standard errors was largest for confounding model d.
Morris et al. say that the comparison of the empirical standard error and the average model standard error should be interpreted with caution when the methods are known to be biased as the empirical SEs can be small as a result [15]. However, large differences were seen even when the method was not biased in the estimates of \({\widehat{\beta }}_{T}\) and \({\widehat{\beta }}_{TM}\). Other studies have shown that when propensity scores are used, the average model standard error can be larger than the empirical standard error [18, 19]. We suspect this is due to the use of the robust variance estimator in these models as this can overestimate the variance of effect to protect against some element of misspecification [19].
This analysis did not primarily seek to compare the accuracy and precision of the different confounding adjustment methods overall. In general, estimates obtained via regression adjustment had the smallest bias. However, this is likely to reflect the way in which the data was simulated, and will likely not be true in all applications. Estimates obtained via IPTW had noticeably higher precision than PS matching, but it is possible that a more sophisticated version of PS matching would have performed better.
Applied example
Table 3 displays the interaction effect estimates for tinnitus and each of gender, white ethnicity, and current smoking status on mental wellbeing obtained from fitting a series of linear regression models. Confounding was adjusted for (separately) via both regression adjustment and IPTW, and the confounding adjustment models included either no moderatorconfounder interactions or all possible moderatorconfounder interactions. The sample size in all models was 5402 observations.
Adjusting for interactions between the moderator and confounders in regression adjustment had little difference on the estimates of interaction between tinnitus and each of gender and current smoking status. Although not statistically significant in either case, the interaction effect between tinnitus and white ethnicity roughly tripled in magnitude when moderatorconfounder interactions were adjusted for. Further inspection showed that an interaction between White ethnicity and age was present and statistically significant. This implies that the effect of age on mental wellbeing onset was different for people of White and nonWhite ethnicity.
Again, when IPTW was applied, adjusting for interactions between the moderator and confounders in the propensity score model had little difference on the estimates of interaction between tinnitus and each of gender and current smoking status. interaction effect between tinnitus and White ethnicity increased in magnitude when moderatorconfounder interactions were included in the propensity score model. Upon further inspection, none of the interactions between White ethnicity and the confounders were of notable size or were statistically significant, however, the combined effect of their inclusion still had an impact on the overall interaction effect between White ethnicity and tinnitus on mental wellbeing.
We did not aim to provide a robust answer to the clinical question posed as there are limitations regarding unmeasured confounding and the dichotomisation of smoking status and ethnicity. However, this practical application shows that different point estimates of effect modification may be obtained in practice depending on whether interactions between the moderator of interest and the confounders are included in the confounding adjustment. In many cases, the differences may be marginal and the overall conclusions will not change. In some cases however, the differences may lead to different conclusions being made.
Discussion
Summary of findings
Our findings confirm that failure to account for any interactions present between the moderator and a confounder on treatment receipt introduced bias into subgroupspecific and TEM estimates when IPTW and PS matching was applied [8,9,10,11, 12]. Our simulations also indicated that the presence of moderatorconfounder interactions on the outcome induced a small amount of bias into parameter estimates. Both adjusting for the relevant (or all possible) moderatorconfounder interactions in the propensity score creation and estimating subgroupspecific PS models removed this bias.
Whilst it is not surprising, to our knowledge it has not been previously clarified that PS covariate adjustment is instead sensitive to failure to account for interactions between the moderator and a confounder on outcome. Hence, inclusion of confoundermoderator interactions in the PS model does not rectify this problem; only when subgroupspecific PS models were estimated did PS covariate adjustment produce accurate estimates. Similarly, regression adjustment produced biased estimates where there existed a moderatorconfounder interaction on outcome which was not accounted for.
The accuracy and precision of estimates (based on the empirical standard errors) obtained from regression adjustment were similar when only the one moderatorconfounder interaction was accounted for in the confounding model, when all possible moderatorconfounder interactions were accounted for and when the stratified analysis was performed. For IPTW, the accuracy and precision was higher when all possible moderatorconfounder interactions were accounted for and when the stratified analysis was performed, compared to when just the one moderatorconfounder interaction was accounted for. For PS matching, the accuracy and precision was highest in the stratified analysis. However, in the methods using propensity scores, particularly IPTW and PS matching, the average model standard errors tended to overestimate the empirical standard errors which would lead to less precision of the subgroup and moderator effect estimates in practice when these methods were used.
If the moderator itself is a confounder, by not accounting for any moderatorconfounder interactions that exist on either treatment or outcome, one is essentially misspecifying the propensity score or outcome model. This should in theory induce bias into any estimates obtained from the outcome model and it is already recommended that confounderconfounder interactions be considered [20] although this is not always done. However, it seems plausible that when the interest is specifically in treatment effect modification, not accounting for existing moderatorconfounder interactions will have a more serious impact on accuracy than not accounting for other confounderconfounder interactions. Furthermore, the magnitude of treatment effect modification is typically small relative to the magnitude of main effects, thus such estimates may be more sensitive to bias.
The applied example demonstrated the potential impact of accounting for moderatorconfounder interactions. In many cases, the difference in the estimates of treatment effect modification obtained when moderatorconfounder interactions were and were not accounted for was very small. However, a difference was observed for some.
Although we did not include these in our simulation studies, doubly robust estimators are an attractive way of estimating the effect of exposures on outcomes in observational studies [21]. Doubly robust estimators use both the outcome model and propensity score, giving an unbiased effect estimate if at least one is correctly specified. Hence, if interactions exist between the moderator and one or more confounders on either treatment receipt or the outcome, but not both, and these are not accounted for, doubly robust estimation should still provide unbiased estimates. However, it is still advisable to consider the presence of interactions between a potential moderator and the confounders on both treatment receipt and the outcome, to avoid potential misspecification of both the outcome model and propensity score.
Limitations
In this study, we considered four different methods of adjusting for confounding. Other methods which could have been considered include stratification by the PS and other versions of PS matching. The aim of this analysis was not however to compare the different methods in terms of accuracy and precision, but to explore the bias within each method. We suspect that, in general, methods based on the PS will always be prone to bias when there are interactions between the moderator and confounder on treatment assignment (the exception being PS covariate adjustment).
In the simulations, we only considered situations in which there was a linear interaction between the moderator and confounder when either was continuous. In practice, there may be more complex nonlinear interactions between the moderator and confounder which may be insufficiently accounted for with a linear interaction term in the confounding adjustment. If this is the case, this should be incorporated into the confounding adjustment if possible.
Additionally, we only considered scenarios with a continuous outcome and a binary moderator. We expect similar patterns of bias to be seen with other outcome and moderator types, but the relative accuracy and precision of the confounding adjustment models within each method may not be the same. We also only simulated data where there were six confounders and only one moderator of interest, but in practice there may be many more confounders and moderators of interest. It may be that a moderator interacts with multiple confounders and, if the bias introduced by each moderatorconfounder pair were in the same direction, the overall amount of bias on the estimation of treatment effect modification could be substantial. Alternatively, the different biases could cancel each other out if they acted in different directions.
It has been recommended that a propensity score model include not only confounders, but also variables associated with the outcome as this increases precision [22]. It seems intuitive that interactions between the moderator(s) and variables only associated with the outcome do not need to be considered in a propensity score model, as the moderator cannot influence the effect of such variables on treatment receipt if the variable does not have an effect on treatment receipt.
Here, we consider a simplistic, although common, approach to assessing treatment effect modification as only one moderator is considered at a time in a parametric model. More sophisticated and flexible approaches exist which allow researchers to assess treatment heterogeneity more generally. Bayesian additive regression trees (BART), for example, avoid the strong parametric assumptions required for the standard linear and logistic regression models and automates the detection of interactions [23]. Additionally, the Bayesian causal forest model works particularly well when there is strong confounding [24]. Many of these methods are readily available in R. For example, the EffectLiteR package in R enables the estimation of average and conditional effects whilst taking into account any number of continuous and categorical covariates, can estimate multiple interaction effects simultaneously [25].
Conclusion
In conclusion, we recommend that the presence of moderatorconfounder interactions are considered and accounted for when estimating treatment effect medication whilst adjusting for additional variables. Accounting for moderatorconfounder interactions that did not exist did not have a negative impact in our simulation study, hence we suggest that researchers include interactions terms if they are undecided about their presence. However, this approach may be unattractive when using regression adjustment with a smaller sample size. We also recommend that subgroupspecific propensity scores are created and used in a stratified analysis when using propensity score covariate adjustment to assess treatment effect modification by a binary variable.
Availability of data and materials
Stata code and the simulated data will be shared upon request via Antonia Marsden.
Abbreviations
 TEM:

Treatment effect modification
 PS:

Propensity score
 IPTW:

Inverse probability of treatment weighting
References
Kyle SD, Hurry MED, Emsley R, Luik AI, Omlin X, Spiegelhalder K, et al. Effects of digital Cognitive Behavioural Therapy for Insomnia on cognitive function: study protocol for a randomised controlled trial. Trials. 2017;18(1):281.
VanderWeele TJ. On the Distinction Between Interaction and Effect Modification. Epidemiology. 2009;20(6):863–71.
McNamee R. Confounding and confounders. Occup Environ Med. 2003;60(3):227–34 quiz 164, 234.
Austin PC. An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies. Multivariate Behav Res. 2011;46(3):399–424.
Liu AH, Abrahamowicz M, Siemiatycki J. Conditions for confounding of interactions. Pharmacoepidemiol Drug Saf. 2016;25(3):287–96.
Drake C. Effects of misspecification of the propensity score on estimators of treatment effect. Biometrics. 1993;49(4):1231–6.
Greifer N, Stuart EA. Matching Methods for Confounder Adjustment: An Addition to the Epidemiologist's Toolbox. Epidemiol Rev. 2022;43(1):118–29. https://doi.org/10.1093/epirev/mxab003.
Kreif N, Grieve R, Radice R, Sadique Z, Ramsahai R, Sekhon JS. Methods for estimating subgroup effects in costeffectiveness analyses that use observational data. Med Decis Making. 2012;32(6):750–63.
Radice R, Ramsahai R, Grieve R, Kreif N, Sadique Z, Sekhon JS. Evaluating treatment effectiveness in patient subgroups: a comparison of propensity score methods with an automated matching approach. Int J Biostat. 2012;8(1):25.
Green KM, Stuart EA. Examining moderation analyses in propensity score methods: application to depression and substance use. J Consult Clin Psychol. 2014;82(5):773–83.
Wang SV, Jin Y, Fireman B, Gruber S, He M, Wyss R, et al. Relative Performance of Propensity Score Matching Strategies for Subgroup Analyses. Am J Epidemiol. 2018;187(8):1799–807.
Rassen JA, Glynn RJ, Rothman KJ, Setoguchi S, Schneeweiss S. Applying propensity scores estimated in a full cohort to adjust for confounding in subgroup analyses. Pharmacoepidemiol Drug Saf. 2012;21(7):697–709.
Wodtke GT, Zhou X. Effect Decomposition in the Presence of Treatmentinduced Confounding: A Regressionwithresiduals Approach. Epidemiology. 2020;31(3):369–75.
Burton A, Altman DG, Royston P, Holder RL. The design of simulation studies in medical statistics. Stat Med. 2006;25(24):4279–92.
Morris TP, White IR, Crowther MJ. Using simulation studies to evaluate statistical methods. Stat Med. 2019;38(11):2074–102.
Welsh Government, Office for National Statistics. National Survey for Wales, 20182019. [data collection]. UK Data Service. SN: 8591. 2020. https://doi.org/10.5255/UKDASN85911.
StataCorp. Stata statistical software: release 14. College Station, TX: StataCorp LP; 2015. https://www.stata.com/support/faqs/resources/citingsoftwaredocumentationfaqs/.
Austin PC. Type I error rates, coverage of confidence intervals, and variance estimation in propensityscore matched analyses. Int J Biostat. 2009;5(1):Article 13. https://doi.org/10.2202/15574679.1146.
Xu S, Ross C, Raebel MA, Shetterly S, Blanchette C, Smith D. Use of stabilized inverse propensity scores as weights to directly estimate relative risk and its confidence intervals. Value Health. 2010;13(2):273–7.
D’Agostino RB Jr. Propensity score methods for bias reduction in the comparison of a treatment to a nonrandomized control group. Stat Med. 1998;17(19):2265–81.
Funk MJ, Westreich D, Wiesen C, Sturmer T, Brookhart MA, Davidian M. Doubly robust estimation of causal effects. Am J Epidemiol. 2011;173(7):761–7.
Brookhart MA, Schneeweiss S, Rothman KJ, Glynn RJ, Avorn J, Sturmer T. Variable selection for propensity score models. Am J Epidemiol. 2006;163(12):1149–56.
Green DP, Kern HL. Modeling Heterogeneous Treatment Effects in Survey Experiments with Bayesian Additive Regression Trees. Public Opinion Quarterly. 2012;76(3):491–511.
Hahn PR, Murray JS, Carvalho C. Bayesian regression tree models for causal inference: regularization, confounding, and heterogeneous effects. Bayesian Anal. 2020;15(3):965–1056.
Mayer A, Dietzfelbinger L, Rosseel Y, Steyer R. The EffectLiteR Approach for Analyzing Average and Conditional Effects. Multivariate Behav Res. 2016;51(2–3):374–91.
Acknowledgements
Not applicable.
Funding
AM conducted this work as part of a Ph.D. at The University of Manchester funded by a National Institute for Health Research Musculoskeletal Biomedical Research Unit Ph.D. studentship (UK). The views expressed in this publication are those of the authors and not necessarily those of the NHS, the National Institute for Health Research or the Department of Health.
Author information
Authors and Affiliations
Contributions
AM: conceptualisation, methodology, formal analysis, writing of the original draft. WD: methodology, reviewing and editing. GD: methodology, reviewing and editing. RE: methodology, reviewing and editing. The author(s) read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Ethical approval was not sought for the simulation study as this was not deemed necessary. The applied example used secondary data from the CPRD. The study protocol for the original study was approved by the CPRD's Independent Scientific Advisory Committee (approval no. 11_113R).
Consent for publication
Not applicable.
Competing interests
WGD has received consultancy fees from Bayer, Abbvie and Google, unrelated to this work.
AM, GD and RE declare no conflicts of interest.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Graham Dunn Deceased.
Supplementary Information
12874_2022_1519_MOESM1_ESM.docx
Additional file 1.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Marsden, A.M., Dixon, W.G., Dunn, G. et al. The impact of moderator by confounder interactions in the assessment of treatment effect modification: a simulation study. BMC Med Res Methodol 22, 88 (2022). https://doi.org/10.1186/s12874022015197
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12874022015197
Keywords
 Confounding
 Interaction
 Propensity scores
 Treatment effect modification