Use of Missing Indicators as Proxies for Unmeasured Variables: Simulation Study

Background: Within routinely collected health data, missing data for an individual might provide useful information in itself. This occurs, for example, in electronic health records, where the presence or absence of data is informative. While naive use of missing indicators can introduce bias when applied inappropriately, their use in conjunction with other imputation approaches may unlock the potential value of missingness to reduce bias and improve prediction. Methods: We conducted a simulation study to determine when the use of a missing indicator, combined with an imputation approach such as multiple imputation, leads to improved model performance, in terms of minimising bias for causal effect estimation and improving predictive accuracy, under a range of scenarios with unmeasured variables. We use directed acyclic graphs and structural models to elucidate the causal structures of interest. We consider a variety of missingness mechanisms, and handle these using complete case analysis, unconditional mean imputation, regression imputation and multiple imputation. In each case we evaluate supplementing these approaches with missing indicator terms. Results: For estimating causal effects, we find that multiple imputation combined with a missing indicator gives minimal bias in most scenarios. For prediction, we find that regression imputation combined with a missing indicator minimises mean squared error. Conclusion: In the presence of missing data, careful use of missing indicators, combined with appropriate imputation, can improve both causal estimation and prediction accuracy.


Background
Missing data is a common feature in observational studies. Particularly for studies that target causal effects, but also for prediction, careful thought is needed when deciding how to handle missing data. The mechanism for missingness is conventionally divided into three categories: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) [1].
In the case of MCAR and MAR, an unbiased estimator of any causal effect of interest exists, with approaches such as complete case analysis providing unbiased causal estimates under MCAR. One of the most popular means of handling missing data is multiple imputation, which can provide unbiased estimates under both MAR and MCAR. In contrast, under MNAR, unbiased estimators of a given causal effect may or may not exist. In this case, drawing causal diagrams or Directed Acyclic Graphs (DAGs) that include the missingness mechanism, called m-graphs [2], can help to identify whether a given effect can be estimated, and how to do so.
One form of MNAR is where the missingness mechanism depends on entirely unmeasured variables that are causally related to the outcome variable. These unmeasured variables may act as confounders, in which case unbiased estimators of a causal effect do not exist even in the absence of missing data.
An emerging hypothesis is that in some scenarios with such unmeasured confounding affecting a particular estimand, missing data may be a blessing rather than a curse. For example, in electronic health records, presence of a particular laboratory test result indicates that a decision was made to run this test, and the reason for this decision is likely to depend on unrecorded health characteristics of the patient. These unrecorded characteristics may affect both the result of the laboratory test, and the outcome of interest.
In these cases, the missingness mechanism (i.e. presence or absence of the laboratory test) may act as a proxy for unmeasured confounding, allowing for partial adjustment [3].
Although use of missing indicators can introduce bias in MCAR and MAR scenarios [4,5], it may reduce bias where the missingness is informative, particularly when the missing indicator is used in conjunction with multiple imputation (MIMI) [3,6-8].
We hypothesise that use of a missing indicator in conjunction with regression or multiple imputation methods in such cases would also improve predictive accuracy. Indeed, we suggest that the approach to imputation that one utilises should differ depending on the underlying analytical aim. Specifically, how one handles missing data should arguably differ for studies aiming to estimate causal effects (where primary interest is in obtaining unbiased parameter estimates) as opposed to studies aiming to develop risk models for a particular prediction task (where primary interest is in obtaining accurate risk predictions, regardless of the underlying parameter estimates).
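To make the missing-indicator idea concrete, the following minimal sketch uses our own toy data generating process (not the paper's simulation): an unmeasured variable U drives both the exposure, the outcome, and whether the exposure is recorded (MNAR). Adding the missingness indicator to the outcome model yields a clearly non-zero indicator coefficient, the signal that missingness is proxying for something unmeasured.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data generating process (ours, for illustration only):
# an unmeasured U drives the exposure X, the outcome Y, and the
# probability that X is missing (MNAR).
u = rng.binomial(1, 0.5, n)                        # unmeasured confounder
x = u + rng.normal(0.0, 1.0, n)                    # exposure depends on U
y = 1.0 * x + 1.0 * u + rng.normal(0.0, 1.0, n)    # true effect of X on Y is 1
m = rng.binomial(1, np.where(u == 1, 0.7, 0.2))    # missingness indicator for X

def ols(columns, y):
    """Least-squares fit with an intercept; returns the coefficient vector."""
    design = np.column_stack([np.ones(len(y))] + columns)
    return np.linalg.lstsq(design, y, rcond=None)[0]

# Unconditional mean imputation of the missing X values.
x_imp = np.where(m == 1, x[m == 0].mean(), x)

beta_naive = ols([x_imp], y)[1]                    # imputation only
coef_mimi = ols([x_imp, m.astype(float)], y)       # imputation + indicator
beta_mimi, gamma_mimi = coef_mimi[1], coef_mimi[2]

print(f"naive beta: {beta_naive:.2f}")
print(f"indicator model beta: {beta_mimi:.2f}, indicator coef: {gamma_mimi:.2f}")
```

Both exposure coefficients remain biased upward by the unmeasured confounding, but the large indicator coefficient flags that the missingness carries information about U.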
In this paper we study, through simulation supplemented with analytical findings, the potential for using the missingness mechanism to partly adjust for unmeasured confounding, and study the scenarios where this can be beneficial, both for causal and prediction objectives. Our primary aim is to identify missing data strategies that recover causal effects with minimal bias in a variety of MNAR scenarios; as a secondary aim we will examine predictive accuracy of the models, representing the case where one's primary interest is in developing a prediction model, rather than recovering causal effects. The scenarios that we will consider in this paper are given in Figure 1. We consider a partially observed exposure X, a fully observed outcome Y, and an unobserved variable U, with M denoting the indicator that X is missing. In some scenarios, including (iv), complete case analysis is unbiased (see e.g. [9] for scenario (iv)). In scenarios (iii) and (vi), the unobserved variable U confounds the relationship between X and Y, so an unbiased estimate of the causal effect of X on Y would not be available even if there were no missingness.

Scenarios and data generating mechanisms
In Scenarios (ii), (iii), (v), and (vi), we could view the missingness indicator M as a proxy for the unmeasured U. It therefore seems reasonable to include M in the outcome model. This may reduce bias in the estimation of the effect of X on Y, and provide at least partial information about the effect of U on Y.
We now specify the structural models that will be assumed for our further derivations and simulations.
For consideration of causal effects we will use counterfactual notation: Y(X = x) denotes the value of the outcome Y that would be observed if, possibly contrary to fact, we set the exposure X to x, and we will abbreviate this to Y(x) where it does not lead to ambiguity.

Considered approaches
We consider the following imputation and modelling approaches. Throughout, we denote the imputed exposure as X_imp. For each imputation approach, we consider three outcome/analysis models: Model 1 regresses Y on X_imp alone; Model 2 adds the missing indicator M; and Model 3 additionally includes the interaction between X_imp and M. When the X_imp are generated using multiple imputation, Model 1 represents a standard multiple imputation approach, while Models 2 and 3 are variants of the MIMI approach. In Model 3, if we view M as a proxy for U, we would hope that the estimated coefficients for X_imp, M, and their interaction approximate the corresponding effects of X and U on Y. For the other models, by standardisation, we would hope that the estimated exposure coefficient approximates the marginal causal effect of X on Y. For each model, an estimate of each parameter can be produced under each of the imputation approaches we consider, with the exception that Model 3 is not identified under unconditional mean imputation.
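Our reading of the three analysis models can be sketched as design matrices. The sketch below (variable names are ours) also shows why Model 3 is not identified under unconditional mean imputation: the imputed values are a constant whenever the indicator equals 1, so the interaction column is exactly proportional to the indicator column and the design loses a rank.

```python
import numpy as np

def model_design(x_imp, m, model):
    """Design matrix for the three analysis models (our reading of the text):
    Model 1: Y ~ 1 + X_imp
    Model 2: Y ~ 1 + X_imp + M
    Model 3: Y ~ 1 + X_imp + M + X_imp * M
    """
    cols = [np.ones_like(x_imp), x_imp]
    if model >= 2:
        cols.append(m.astype(float))
    if model == 3:
        cols.append(x_imp * m)
    return np.column_stack(cols)

rng = np.random.default_rng(1)
x = rng.normal(size=200)
m = rng.binomial(1, 0.3, 200).astype(bool)
x_mean_imp = np.where(m, x[~m].mean(), x)   # unconditional mean imputation

# Interaction column = (constant) * M, collinear with M: rank deficiency.
rank3 = np.linalg.matrix_rank(model_design(x_mean_imp, m, 3))
print(rank3)  # 3: only 3 independent columns in a 4-column design
```

Under regression or multiple imputation the imputed values vary with the other covariates, so the interaction column is no longer collinear and Model 3 becomes estimable.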

Analytical comments
It is instructive to consider the special case of scenario (ii) (see Figure 1) where U = M.
While this may seem extreme, it could well happen in practice: for example, if a particular blood test is only run when a particular condition is met, and that condition is not recorded. In this case, it is trivial that the causal effect of X on Y is recoverable. In both scenarios we have exchangeability, Y(x) ⫫ X | M, and therefore

E[Y(x)] = Σ_m E[Y(x) | M = m] P(M = m)
        = Σ_m E[Y(x) | X = x, M = m] P(M = m)
        = Σ_m E[Y | X = x, M = m] P(M = m)
        = Σ_u E[Y | X = x, U = u] P(U = u),

where the equalities follow, respectively, from the law of total expectation, then Y(x) ⫫ X | M, then consistency, and finally the fact that U = M. This also holds in Scenario (iii).
Interestingly, if we impute the exposure through multiple imputation, then fit an outcome model including both the imputed exposure X_imp and the indicator M, then when U = M this model produces a biased estimate of the effect of X on Y, even in the simple case of Scenario (ii) with the interaction parameter equal to 0, so that X and M do not interact in the outcome model.
In such a case the estimate produced has E[β̂] ≈ β σ²_X / (σ²_X + σ²_ε), where σ²_X is the variance of the exposure and σ²_ε the residual variance of the imputation model. This is because fitting the imputation model introduces regression dilution in this case [11]. A justification is given in the Appendix.
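The attenuation can be reproduced with a generic regression-dilution example (illustrative values are ours): adding noise with variance σ²_ε to a regressor with variance σ²_X shrinks the fitted slope by the factor σ²_X / (σ²_X + σ²_ε), which is the mechanism at work when imputed values carry imputation noise.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
beta, var_x, var_eps = 1.0, 1.0, 1.0   # illustrative values (ours)

x = rng.normal(0.0, np.sqrt(var_x), n)
y = beta * x + rng.normal(0.0, 1.0, n)

# Replacing X by a noisy version (as when imputations are drawn with
# residual noise) attenuates the fitted slope: regression dilution.
x_noisy = x + rng.normal(0.0, np.sqrt(var_eps), n)
slope = np.cov(x_noisy, y)[0, 1] / np.var(x_noisy)

attenuation = beta * var_x / (var_x + var_eps)
print(f"fitted slope {slope:.3f} vs predicted attenuation {attenuation:.3f}")
```

With equal signal and noise variances the predicted attenuation factor is 1/2, and the fitted slope lands close to 0.5 rather than the true value of 1.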

Simulation set-up
The aims, general structure, and models are described above. We consider the following specific data generating mechanisms, which cover all of the scenarios (i)-(vi) described in Figure 1. We closely follow best practice for the design and reporting of simulation studies as proposed in [12].
For the U ≠ M case:
• We fix the sample size (number of observations within each simulation run) to be n = 10,000, and fix the direct effect of the exposure on the outcome at 0.5.
• The interaction effect parameters are varied over {0, 0.5}.
For the U = M case, we use the same simulation settings with the following exceptions:
• We exclude the parameters relating separately to U, which are redundant given U = M.
• We vary the probability of missingness over the grid {0.25, 0.5, 0.75}, as this is required to vary the proportion of missingness.
All combinations of the parameters are evaluated, resulting in 9504 scenarios, of which 288 cover the case U = M.
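A fully factorial grid of this kind can be enumerated directly. The sketch below uses invented parameter names and a deliberately small toy grid (not the paper's 9504-scenario grid), purely to show the construction:

```python
from itertools import product

# Toy grid with hypothetical parameter names (ours, for illustration):
# every combination of values defines one simulation scenario.
grid = {
    "interaction_1": [0, 0.5],
    "interaction_2": [0, 0.5],
    "p_missing": [0.25, 0.5, 0.75],
}
scenarios = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(scenarios))  # 2 * 2 * 3 = 12 toy scenarios
```

Each element of `scenarios` is then passed to the data generating and model fitting code, repeated for the chosen number of simulation runs.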
For each scenario we fit the models described in the previous section, and report estimates of the outcome model coefficients under each method of imputation.
Each scenario is repeated 200 times and summary statistics over these iterations retained.
For each retained parameter estimate (the exposure coefficients from all models, and the missing indicator and interaction coefficients from Models 2 and 3) we retain the 2.5th, 25th, 50th, 75th and 97.5th percentile parameter estimates. We retain the same percentiles for the mean squared error of each model fit. We also retain the average model-based standard errors and empirical standard errors for each parameter.

Simulation results
Here we present a subset of the simulations that captures the main findings. For ease of interpretation, throughout this section we restrict to the outcome model in which the effect of the unmeasured U on the outcome equals 1. The key varying parameters are the strength of the effect of U on the exposure and the strength of the association between U and the missingness.
Increasing the effect of U on the exposure changes the estimates of the exposure coefficient, with some, but lesser, impact from the U-missingness association. All imputation strategies yield similar results for the exposure coefficient, except that bias is larger for regression imputation in Models 1 and 2. However, the U-missingness association has a larger impact on the missing indicator estimates. As expected, when this association is stronger (i.e. missingness is a stronger proxy for the unmeasured U), the estimated indicator coefficient becomes larger, particularly for the multiple imputation models, although still much smaller than the true effect of U, which is 1. We again see that regression imputation has the smallest MSE, with Model 3 regression imputation the smallest of all.

The results are similar to those for Scenario (v), except that the causal effect is more commonly overestimated.

Discussion
In this paper we have explored, through simulation, the potential merits of supplementing a missing data strategy with a missing indicator, particularly in circumstances where missingness is not at random, and the missingness may moreover act as a proxy for unmeasured confounding or an unmeasured prognostic variable. We divide the main findings into implications for causal estimation, and implications for prediction.

Implications for causal estimation
In the MCAR scenario, without unmeasured confounding, adding a missing indicator was unlikely to introduce bias in estimation of causal effects. In the presence of unmeasured confounding, bias in estimation was sometimes smaller and sometimes larger when including a missing indicator and/or its interaction with the main effect. Specifically, when unmeasured confounding existed, the missing indicator and/or its interaction with the main effect were estimated to be non-zero. Additionally, when missingness was perfectly correlated with the unmeasured confounder, the measured effect was highly biased (see Appendix). Nevertheless, these non-zero effect estimates for the missing indicator act as a signal that it will be difficult or impossible to obtain unbiased causal effects. Alongside whether to incorporate missing indicators, we also explored the relative benefits of mean imputation, regression imputation and multiple imputation. As expected, among these approaches, multiple imputation was found to be the most robust. We found that in some MNAR scenarios where multiple imputation is usually biased, this bias is removed or alleviated by MIMI.
The 'missing indicator' approach has a somewhat negative reputation in the causal inference literature. This is because it is usually coupled with a weak approach to imputing the missing data itself, such as using the unconditional mean [8]. Used in that way, the missing indicator approach is known to lead to biased estimation even under MCAR [4,5]. The idea of combining the missing indicator approach with multiple imputation was first proposed by [6], and has been further explored by [7] and [3]. In those articles, the focus is on handling missing data in covariates used in propensity scores, whereas here we consider missing data in the exposure of interest. Nevertheless, [3] in particular noted that the use of missing indicators can partly adjust for unmeasured confounding, in line with our findings.
Therefore, we recommend the use of MIMI (including interactions between missing indicators and the corresponding variables) as a strategy for handling missing data in causal estimation problems. Non-zero estimates of the missing indicator coefficients then alert the analyst to the possible occurrence of MNAR, and the need for further sensitivity analysis. We caveat that the use of missing indicators should not replace careful consideration of the assumed plausible causal structures, and drawing a causal diagram to depict these assumptions remains the starting point for a well-conducted causal analysis.

Implications for prediction
Regression imputation led to smaller MSE than the corresponding multiple imputation approaches, especially when combined with a missing indicator and associated interaction term. Multiple imputation has long been assumed to be the best choice for handling missing data in prediction, despite the motivation for the approach coming from considerations of bias, which are only relevant for causal inference. Here we demonstrate that regression imputation (imputing the predicted mean rather than simulating from the posterior predictive distribution) leads to reduced MSE. This finding is likely due to the associated reduction in variance with little or no loss of information (since the focus is on prediction of the outcome, not causal estimation of parameters). However, care is needed in estimating the standard error of an associated predictive interval, since standard methods based on a single regression imputation would underestimate it. We also saw that regression imputation led to larger bias in effect estimates than other approaches.
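The variance argument can be illustrated directly with a toy setup of our own (an auxiliary variable Z predicts the partially observed X, and missingness is MCAR for simplicity): imputing the predicted mean gives lower out-of-sample MSE than imputing a stochastic draw, because the draw injects pure noise into the predictor at both training and prediction time.

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate(n):
    z = rng.normal(size=n)                       # fully observed auxiliary
    x = 0.8 * z + rng.normal(0.0, 0.6, n)        # partially observed predictor
    y = x + rng.normal(0.0, 1.0, n)              # outcome
    m = rng.binomial(1, 0.5, n).astype(bool)     # MCAR missingness in X
    return z, x, y, m

z, x, y, m = simulate(50_000)
slope, intercept = np.polyfit(z[~m], x[~m], 1)   # imputation model X ~ Z
resid_sd = np.std(x[~m] - (intercept + slope * z[~m]))

def impute(z, x, m, stochastic):
    pred = intercept + slope * z
    if stochastic:                               # draw, as in multiple imputation
        pred = pred + rng.normal(0.0, resid_sd, len(z))
    return np.where(m, pred, x)

def out_of_sample_mse(stochastic):
    d, c = np.polyfit(impute(z, x, m, stochastic), y, 1)  # outcome model
    z2, x2, y2, m2 = simulate(50_000)                     # fresh data
    x_new = impute(z2, x2, m2, stochastic)
    return np.mean((y2 - (c + d * x_new)) ** 2)

mse_regression = out_of_sample_mse(False)
mse_stochastic = out_of_sample_mse(True)
print(f"regression imputation MSE {mse_regression:.3f}, "
      f"stochastic imputation MSE {mse_stochastic:.3f}")
```

Note that the same imputation model is applied to the fresh data, mirroring the 'prediction time' situation discussed below: the deterministic predicted-mean rule is straightforward to reuse on new observations.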
While such bias is not a direct concern in predictive modelling, causal effects are known to be more stable and robust over time and geography [13], and also allow for counterfactual prediction, which is useful in many decision support contexts [14,15].
For prediction specifically, [8] advocate the use of a pattern submodel, in which separate models are fit for each missingness pattern. Our Model 3 can be thought of as similar to this approach; as noted by [8], there is asymptotic equivalence between the approaches. The pattern submodel is easier to use in prediction, but hard to interpret from a causal effect estimation perspective, so we did not consider it here. [16] also compared techniques to handle missing data in prediction; however, they did not consider the regression imputation technique that we found to be optimal in terms of MSE, and gave limited attention to MNAR mechanisms, which are likely in routine data where prediction models are commonly derived and applied.
Our prediction findings require further investigation, to ascertain whether the regression imputation strategy improves accuracy of a predictive model in real data. This should specifically be explored in the context of model calibration and discrimination.
Nonetheless, it is worth noting that a further advantage of using regression imputation for prediction is that it is more feasible to apply at 'prediction time', i.e. when dealing with missing predictors while making a prediction for a new observation. Applying multiple imputation in this setting is often infeasible, and there are issues with using a different approach to imputation when developing a model from the approach used when the model is applied in practice [8]. Therefore, we recommend that developers of predictive models consider regression imputation as an alternative approach to handling missing data, and base the choice of imputation method on the accuracy of the resulting models and the feasibility of performing the required imputation at 'prediction time'.

Strengths and limitations
The paper has several strengths. We explored a wide range of simulation settings in a fully factorial design. While we can only present a limited range of results in the paper, the simulation code and results are available online for inspection. Nevertheless, simulations are necessarily simpler than scenarios that might be encountered in practice, where missingness may affect many covariates. While the addition of missing indicators, and interactions, seems robust, it may break down in some scenarios with complex multivariate patterns of missingness, and may also lead to unacceptable model complexity.

Conclusions
We recommend adding a missing indicator, and corresponding interaction terms, to supplement, but not replace, the chosen imputation strategy. Where the goal is prediction, regression imputation should be explored as an alternative to multiple imputation, as this may increase both accuracy and practicality.

Ethical approval and consent to participate
Not applicable.
In the absence of missing data, we would of course simply solve using least squares, and if