 Research article
 Open Access
 Open Peer Review
 Published:
Imputation strategies when a continuous outcome is to be dichotomized for responder analysis: a simulation study
BMC Medical Research Methodology volume 19, Article number: 161 (2019)
Abstract
Background
In many clinical trials continuous outcomes are dichotomized to compare proportions of patients who respond. A common and recommended approach to handling missing data in responder analysis is to impute as nonresponders, despite known biases. Multiple imputation is another natural choice but when a continuous outcome is ultimately dichotomized, the specifications of the imputation model come into question. Practitioners can either impute the missing outcome before dichotomizing or dichotomize then impute. In this study we compared multiple imputation of the continuous and dichotomous forms of the outcome, and imputing responder status as nonresponse in responder analysis.
Methods
We simulated four response profiles representing a twoarm randomized controlled trial with a continuous outcome at four time points. We omitted data using six missing at random mechanisms, and imputed missing observations three ways: 1) replacing as nonresponder; 2) multiply imputing before dichotomizing; and 3) multiply imputing the dichotomized response. Imputation models included the continuous response at all timepoints, and additional auxiliary variables for some scenarios. We assessed bias, power, coverage of the 95% confidence interval, and type 1 error. Finally, we applied these methods to a longitudinal trial for patients with major depressive disorder.
Results
Both forms of multiple imputation performed better than nonresponse imputation in terms of bias and type 1 error. When approximately 30% of responses were missing, bias was less than 7.3% for multiple imputation scenarios but when 50% of responses were missing, imputing before dichotomizing generally had lower bias compared to dichotomizing before imputing. Nonresponse imputation resulted in biased estimates, both underestimates and overestimates. In the example trial data, nonresponse imputation estimated a smaller difference in proportions than multiply imputed approaches.
Conclusions
With moderate amounts of missing data, multiply imputing the continuous outcome variable prior to dichotomizing performed similar to multiply imputing the binary responder status. With higher rates of missingness, multiply imputing the continuous variable was less biased and had wellcontrolled coverage probabilities of the 95% confidence interval compared to imputing the dichotomous response. In general, multiple imputation using the longitudinally measured continuous outcome in the imputation model performed better than imputing missing observations as nonresponders.
Background
Clinical trials can be evaluated by differences in rates of successful response. In socalled responder analysis, subjects are classified as responders, often by dichotomizing a continuous outcome, if they improve by a specified threshold. For example, responder definitions could be a 5% change in body mass index or an improvement in symptoms by 10 points on a 100point symptom scale. Responder analysis is commonly used with patientreported outcomes (PROs) because results are easily interpretable to patients and other stakeholders and can lend language to drug labels and promotional claims.
When the outcome is measured for all subjects at baseline and the timepoint of interest, responder status can be calculated, and the analysis is routine. However missing data are ubiquitous in longitudinal trials and responder status cannot be determined for subjects missing the outcome. [1] One approach for handling missing data in responder analysis, recommended in the regulatory setting [2,3,4] is to impute subjects missing the outcome as nonresponders, termed nonresponse imputation (NRI). However, it is a strong assumption to assume unobserved outcomes are uniformly “failures” rather than come from the distribution of subjects who do not improve. NRI can be thought of as a composite outcome of response and a dropout indicator. Methodologists warn that composite endpoints can be misleading, for example, when the components have varying degrees of severity and treatment effects of each component differ between groups. [5, 6] This could be true if dropout depended at least partly on a tolerability. For example, a cancer treatment may offer a favorable toxicity profile relative to a comparator. Using NRI, the response rate of the comparator arm more than in the treatment arm would reflect the effect of tolerability, i.e., have more nonresponders, and could widening the between arm difference. While some may view NRI as a conservative approach (since the proportions of responders can only decrease), treating missing as response failure can result in unpredictable differences in proportions between treatment groups. [7, 8]
In longitudinal trials, missing observations can be intermittent, as in a missed study visit, but dropout accounts for most missing data. We focus this article on monotone missing patterns, implying that observations are observed up until one is missing and all subsequent observations are missing. Little and Rubin [9] provide a framework to describe categories of missing data mechanisms given the relationship with observed and unobserved values. When the probability of missingness is independent of the observed and unobserved data the mechanism is said to be missing completely at random (MCAR). Data are missing at random (MAR) if the probability of missingness is independent of the unobserved data after conditioning on observed data. Finally, data are considered missing not at random (MNAR) if they are neither MCAR or MAR and the missing mechanism depends on the unobserved values, given the observed data.
The MAR assumption is usually reasonable in the context of longitudinal trials and current guidance outlines a framework that includes sensitivity analyses to assess the extent to which analytic approaches are robust to missing data assumptions. [10,11,12] Appropriate analyses that assume MAR include mixed models using maximum likelihood estimation, extensions of generalized estimating equations (GEEs) such as weighted GEE, and multiple imputation (MI). [13, 14] Of these, MI is the only approach that can be used with any analytic model. MI is a threestage process. First, missing values are filled M times by a random draw from a posterior distribution of the imputation model to generate M complete datasets. Secondly, the M datasets are analyzed via any statistical approach and thirdly, results are combined using a set of rules that accounts for the uncertainty of the imputed values. [15] The imputation model must be congenial, i.e., include the same variables, but does not have to be consistent with the substantive model. Thus, the imputation model can include variables predictive of missingness such as the outcome from intermittent timepoints, making MI a natural choice in responder analysis using a test of proportions. For these reasons we focus this paper on MI.
When a continuous outcome is ultimately dichotomized, the specifications of the imputation model come into question. Practitioners can either impute the missing outcome before dichotomizing the response (IBD) or dichotomize the outcome then impute the response (DTI). Demirtas evaluated efficiency and accuracy of the estimated proportions of responders using IBD under the multivariate normal assumption compared to DTI using a saturated binomial model for the dichotomous response indicator, and concluded that DTI was superior across most scenarios. [16] This finding is in contrast to Yoo’s work that concluded MI with GEEs performs better when the underlying continuous outcome is imputed prior to dichotomizing. [17] More generally, Von Hippel’s work supports the use of justanothervariable (JAV), analogous to DTI, to impute a quadratic and interaction term under a linear regression analysis model with a conceptual argument extending to the logistic setting. [18] Others demonstrated poor performance using JAV when data were MAR particularly with logistic regression [19], prompting some researchers to discourage this practice. [14]
In trial settings where the dichotomized response of a continuous outcome is of interest, there is no clear best way to handle missing data. The aim of this paper is to clarify inconsistent results in the performance of multiply imputing the IBD or DTI in responder analysis and compare with the commonly recommended nonresponse imputation.
Methods
Notation and analysis
Let the underlying continuous measure which is to be dichotomized into the response indicator be Y_{ij} for subject i where i = 1, …, n measured at the j timepoint. Measurements are repeated over time such that j = 1, …, t_{i} are the observed measurements for each subject and t_{i} represents the time of dropout or end of the study, T. Without loss of generality, assume that higher values of Y indicated better outcomes. Let Y_{ij > 1} − Y_{i1} = C_{i} represent change from baseline to time j > 1. Subject i is classified as a responder if C_{i} ≥ λ for some threshold λ, defined as R_{i} = I(C_{i} ≥ λ). Consider a randomized controlled trial with treatment and control arm.
The objective of responder analysis is to evaluate the difference in proportion of responders at the endpoint between treatment arms.
Multiple imputation approach
When data have either an intermittent or monotone missing pattern, multiple imputation using the Markov chain Monte Carlo (MCMC) method and fully conditional specification (FCS, also known as imputation by chained equations) method are two popular options. [20] Both are relatively flexible to specify, straightforward to understand, and easy to apply with standard statistical software. The FCS assumes the existence of, but does not rely on, a multivariate distribution. [20] Specifically, the FCS approach assumes conditional densities for each partially observed variable and uses a corresponding regression model to sequentially generate imputations, e.g., linear regression for continuous variables and logistic regression for categorical variables. We used FCS MI for imputing both the unobserved continuous outcomes for IBD MI and the missing responder status for DTI MI, both using the continuous outcomes intermittent timepoints as auxiliary variables, and in some cases, additional covariates related to the outcome, detailed below. Thus, the comparison is not in the MI method but rather the specification of the imputation model.
In general, the FCS procedure can be described in the following steps. [21, 22] Consider a set of variables X = X_{1}, …, X_{q} in the imputation model. First, starting values for unobserved measures are filled in sequentially for each variable in the order specified. Continuous variables are filled in by regressing one variable, say, X_{1}, on the other X_{2}, …, X_{q} covariates and using the resulting set of parameters to fill in the missing values of X_{1}. Binary variables are filled in similarly using logistic regression. The next imputation phase replaces the filled in values with imputed values. For a set of observed values of one variable, X_{1}, the corresponding imputation model is fit using both the observed and filledin values of all other q − 1 variables as the independent variables and X_{1} as the dependent variable. In this study, the binary variable, R, is fit using logistic regression and the continuous variables, Y_{j}, are fit with linear regression. The resulting set of parameters are used to impute the first set missing values. The latter two steps are repeated on the remaining q − 1 variables to comprise a cycle. The algorithm runs through a number of cycles updating the imputed values until convergence, at which point the current values of all X ’s complete the first imputed dataset. The process is repeated for M datasets.
To calculate the estimand θ using IBD MI, we imputed the missing continuous outcomes Y_{j}, calculated the responder status, R_{i}, estimated the difference and combined estimates using Rubin’s rules in the final step. For DTI MI, we calculated responder status prior to imputing and included the partially observed responder status, R_{i}, in the imputation model. Using the imputed R_{i}, we calculated the difference in proportions between treatment arms on the M datasets and combined using Rubin’s rules.
Data generation
We simulated twentyfour scenarios to represent a randomized trial with two treatment arms with N = 200, and a continuous outcome measured at baseline and three subsequent timepoints. The scenarios described two response profiles with the same mean difference at the final assessment, six mechanisms of dropout, and two dropout rates. One response profile was linear where only treatment A was effective. In the other response profile, treatment A is effective after a period of worsening and treatment B demonstrates no effectiveness after a period of improving, hence the mean trajectories of treatment A and B cross. The third and fourth response profiles had no treatment differences at the final timepoint and were used to evaluate type 1 error.
Data for the continuous response were simulated to represent a PRO scale with equal allocation to treatment groups. Let Y_{ij} represent a continuous measure for the i^{th} individual at the j^{th} timepoint where j = 1, …, 4. Specifically, data were simulated according to the underlying model:
where x_{trt} = 1 for treatment arm A and 0 for treatment arm B, β_{j} denotes the effect of the j^{th} timepoint and δ_{j} ∗ x_{trt} is the interaction of treatment group and the timepoint. Here, \( {b}_i\sim N\left(0,{\sigma}_b^2\right) \) represents the random subject effect and the error term, \( {\epsilon}_{ij}\sim N\ \left(0,{\sigma}_{\epsilon}^2\right) \) represents the withinsubject error. The mean vectors for the linear response profile were μ_{A} = (65, 67, 69, 71)′ and μ_{B} = (65, 65, 65, 65)′ . The nonlinear response profile was μ_{A} = (65, 63, 68, 71)′ and μ_{B} = (65, 67, 66, 65)′. The third and fourth response profiles to estimate type 1 error were μ = (65, 65, 65, 65)′ for both arms; and μ_{A} = (65, 67, 69, 71)′ and μ_{B} = (65, 63, 68, 71)′, respectively. Based on typical PRO scale data [23], we set σ_{b} = 12 and σ_{ϵ} = 7. These variance components correspond to a compound symmetric covariance structure with a withinperson correlation of 0.7. Additionally, we created a normally distributed continuous correlated variable (CV) to Y_{4} such that \( {\rho}_{CV,{Y}_4}\cong 0.3 \), and the mean and standard deviation were 38.0 and 62.7 respectively.
Let Y_{i4} − Y_{i1} = C_{i} represent change from baseline to timepoint j = 4. To achieve 80% power to detect the difference of response rates between the two arms, the dichotomized response was defined as R_{i4} = I(C_{i} ≥ 12.4). Using this definition, response rates for the first and second response profiles for treatment A and B were 25.6 and 10.6, respectively. (Exploratory result using thresholds ranging from 10 to 20 produced similar trends.)
Missing data
We used six probability models representing plausible trial scenarios to delete postbaseline observations using a MAR mechanism. Let Z_{ij} = 0 if outcome Y_{ij} is missing and 1 otherwise.
Dropout model 1
For the first model of dropout, the probability of missing response is dependent on the value of the observed outcome at Y_{j − 1} such that \( P\left({Z}_{ij}=0\right)\propto \left(1\Phi \left({Y}_{j1},{\hat{\theta}}_{Y_{j1}},{\hat{\sigma}}_{Y_{j1}}^2\right)\right) \), where j > 1 and Φ is the normal cumulative distribution function with mean \( {\hat{\theta}}_{Y_{j1}} \) and standard deviation \( {\hat{\sigma}}_{Y_{j1}}^2 \) estimated from the data. This model represents the probability of dropout due to lack of efficacy.
Dropout model 2
The mechanism leading to dropout can differ by treatment. [25] To model this, observations in treatment A were more likely to be missing when the outcome, Y_{j − 1}, value was low such that \( P\left({Z}_{ij}=0\right)\propto \left(1\Phi \left({Y}_{j1},{\hat{\theta}}_{Y_{j1}},{\hat{\sigma}}_{Y_{j1}}^2\right)\right) \), j > 1, and observations in treatment B were more likely to be missing when Y_{j − 1} values were high such that \( P\left({Z}_{ij}=0\right)\propto \left(\Phi \left({Y}_{j1},{\hat{\theta}}_{Y_{j1}},{\hat{\sigma}}_{Y_{j1}}^2\right)\right) \), j > 1.
Dropout model 3
Model 3 represents missing mechanisms in the opposite direction of model 2 for the treatment arms. For example, lack of efficacy could drive dropout in a placebo arm while those on treatment may be less motivate to return to follow up when they are feeling better, i.e. improved efficacy. Here, treatment B observations were more likely to be missing when the outcome, Y_{j − 1}, value was low such that \( P\left({Z}_{ij}=0\right)\propto \left(1\Phi \left({Y}_{j1},{\hat{\theta}}_{Y_{j1}},{\hat{\sigma}}_{Y_{j1}}^2\right)\right) \), j > 1, and treatment A observations more likely to be missing when Y_{j − 1} values were high such that \( P\left({Z}_{ij}=0\right)\propto \left(\Phi \left({Y}_{j1},{\hat{\theta}}_{Y_{j1}},{\hat{\sigma}}_{Y_{j1}}^2\right)\right) \), j > 1.
Dropout model 4
Treatment arm dropout rate can be differential. [26, 27] We modeled substantial differential dropout by including a weight term, \( {w}_{x_{trt}} \), specific to treatment arm, such that \( P\left({Z}_{ij}=0\right)\propto {w}_{x_{trt}}\ast \left(1\Phi \left({Y}_{j1},{\hat{\theta}}_{Y_{j1}},{\hat{\sigma}}_{Y_{j1}}^2\right)\right) \), where w_{1} = 0.3 and w_{0} = 1.
Dropout model 5
Here, Y_{i} was set to missing with probability \( P\left({Z}_{ij}=0\right)\propto \left[\frac{1}{1+{e}^{\left({b}_1{Y}_{j1}\right)}}\right] \), where j > 1 and b_{1} = 0.01 modeling drop out due to lack of efficacy using a different mechanism than model 1.
Dropout model 6
We simulated a repeated indicator variable representing occurrence of adverse events (AEs) to represent drug tolerability. The probability of AE depended jointly on treatment arm and occurrence of an AE at the prior visit such that for each assessment for each treatment group
where x_{trt} represents the treatment arm and γ represents AE status at j − 1. Probabilities were estimated from actual trial data and were similar to prior published event rates (Table 1). [24] For simplicity we assumed that no AEs occurred at baseline and the probability of AE at j = 2 was 0.3 for x_{trt} = 1 and 0.5 for x_{trt} = 0. For each subject we generated AE status at each postbaseline visit as \( {AE}_{ij}\sim Bernoulli\left({p}_j^{AE}\right) \).
The response Y_{i} was set to missing with probability \( P\left({Z}_{ij}=0\right)\propto \left[\frac{1}{1+{e}^{\left({b}_1{Y}_{j1}+{b}_2{AE}_j\right)}}\right] \), where j > 1 and b_{1} = 0.01 and b_{2} = − 0.40 to model the probability of dropout due to lack of efficacy and tolerability. If Y_{i} was set to missing, all subsequent AE were also set to missing.
For all dropout models, we multiplied P(Z_{ij} = 0) by a randomly generated uniform variable and determined a cutoff value creating the overall proportion of missing responses at j = 4 to be 30% or 50%. If a patient was missing at any Y_{j = a} then all Y_{j > a} were set to missing.
Analysis and comparison of methods
We determined the required number of simulated datasets per scenario, n_{sim}, by estimating the standard deviation (SD) of \( \hat{\theta} \) to be ≤6.0, based on exploratory simulations and setting the maximum tolerated Monte Carlo standard error (MCSE) of bias to be ≤.15. Given \( MCSE(Bias)=\sqrt{\frac{Var\left(\hat{\theta}\right)}{n_{sim}}} \)’ the required number of datasets was n_{sim} = 1600. [28] For each simulated dataset, we evaluated the proportions of responders in, and the difference between, each arm at j = 4. For IBD MI and DTI MI, all imputation models contained the group indicator, x_{trt}, and the continuous outcomes Y_{j}. In some imputation models, we included CV, a variable representing a correlated covariate to evaluate the utility of including an auxiliary variable. For DTI MI, the imputation model included the binary response variable, R. Scenarios using dropout model 6 also included AE status at j = 2, 3, 4 in the imputation model. The M = 30 or M = 50 estimates [22] of the difference in proportions and respective standard errors when 30% or 50% of responses at j = 4 were missing, respectively, were combined using Rubin’s Rules. [29] Sample SAS code is included in the Appendix.
We compared percent bias, coverage probability of the 95% confidence interval (CI) from multiple imputation, power, and type 1 error rate to assess the relative performance of NRI, IBD MI and DTI MI to the fully observed simulated data. We calculated percent bias of the difference as:
where π represents the true proportion of responders, and \( \overline{p} \) is the average proportion of responders among datasets with missing observations. Positive values represent positive biases of the estimated difference in proportions. We calculated coverage probability as the proportion of MI results where the true value was contained within the 95% CI. Power was calculated as the percentage of statistically significant group differences at the significance level of 0.05. Similarly, the type 1 error rate was calculated as the percentage of statistically significant group differences at the significance level of 0.05 when simulating a scenario with no between group difference. We assess performance of the simulation with the MCSE of bias, mean square error (MSE), standard error of the model (SE_{mod}) and the empirical standard error of the difference in proportions (SE_{emp})_{.} Let \( \hat{\theta}={\hat{p}}_A{\hat{p}}_B \) be the difference in proportions between groups. MSE, calculated as
is a combined measure of variance and bias. SE_{mod} is the average standard error of each \( \hat{\theta_i} \), and SE_{emp}, is the standard error of \( \hat{\theta} \), measuring the efficiency of \( \hat{\theta} \)_{.} Simulation and analyses were conducted using SAS software version 9.4 (SAS Institute Inc., 2013).
Results
When the response profile was linear with 30% of responses missing, bias was less than 7.3% for all MI approaches and ranged from 8.5 to − 36.7% for NRI (Table 2). Similar results were seen in the nonlinear response profile (Appendix A). IBD MI had slightly lower or equal bias relative to DTI MI for all scenarios, and bias was conservative in direction, i.e., negative for 4 out of the 5 dropout models. All MI models included the continuous repeated outcomes as auxiliary variables in the imputation model. When using DTI MI, the addition of the correlated auxiliary variable reduced bias and changed the direction from positive to negative in all scenarios except when there were differential dropout rates. Including the auxiliary variable in the IBD MI model increased the negative bias in all but the scenario with differential dropout.
The probability of dropout in model 6 was related to both treatment arms, through AE status, and outcome score. Including AE status at j = 2, 3, 4 in the imputation model negligibly reduced bias with DTI MI, and maintained a similar level of bias with IBD MI, compared to no auxiliary variables.
NRI suffered from high negative bias and substantial loss of power to detect differences in all but one scenario. The proportion of responders per treatment arm were always underestimated because missing observations were classified as nonresponders. When the dropout mechanism affected the two arms differentially (model 4), NRI produced a positively biased difference estimate.
When 50% of responses were missing with the linear response profile, IBD MI had less bias relative to DTI MI without the use of CV for all scenarios, and bias was negative in direction for 5 of the 6 dropout models (Table 3). Specifically, bias with DTI MI (with no auxiliary variables) ranged from − 21.8 to 11.0. Under the same conditions, the bias of IBD MI ranged from − 6.9 to 0.7. In general, power to detect treatment differences was lower using IBD MI compared to DTI MI.
Coverage probabilities of 95% confidence for all MI approaches ranged from 93.2 to 95.3% when 30% of the responses were missing (Table 2). When 50% of responses were missing, the coverage probabilities when imputing the continuous response were closer to the nominal level of 95% compared to imputing the dichotomized response, ranging from 90.1 to 94.4% and 77.5 to 92.6%, respectively (Table 3). NRI coverage was lower than the MI approaches in all scenarios except for when there was differential dropout. Although IBD MI generally had lower power to detect treatment differences compared to DTI MI, the difference was negligible. NRI was more precise as measured through the SE_{emp} of the difference in proportions between groups, compared to all MI approaches (Table 4). However, as a function of the high levels of bias, NRI performed poorly in terms of MSE compared to the MI approaches. The MCSE of bias was between 0.12–0.14, less than our tolerated level of uncertainty, when 30% of responses were missing. NRI had higher precision estimating the group difference, compared to the other approaches as seen with the lower SE_{emp}. The SE_{mod} was similar to the SE_{emp} suggesting bias of SE_{emp} is not a concern.
Type 1 error rate was controlled at less than 5% for both multiple imputation strategies, reported in Table 5. When dropout rates differed between groups (model 4), NRI had type 1 error rates ranging from 0.16 to 0.31, suggesting false positives are of concern.
The nonlinear response profile demonstrated very similar results overall, as shown in the Appendix.
Application to a clinical trial
We applied the above imputation approaches to data adapted from a Phase III randomized, doubleblind clinical trial in patients with major depressive disorder. The trial evaluated efficacy of duloxetine 40 mg/d and 80 mg/d versus placebo and a comparator, paroxetine 20 mg/d, to treat emotional and physical symptoms in depressed patients. [30] Details of the original trial design are reported in Goldstein et al. [30] For the purpose of this study, we considered a publicly available dataset modified from the original trial data. [31] The trial included four parallel arms; the modified dataset has two arms: the original placebo arm and a “treatment” arm consisting of a random sample of patients from the three active drug arms. At 6 weeks post randomization, 75% of the patients remained in the study. To further illustrate the effect of imputation choice, we used a MAR mechanism (Dropout model 1) to identify observations to omit so that 60% of patients have outcome values at week 6. The outcome was the total score on the 17item Hamilton depression rating scale (HAMD17), measured at baseline and weeks 1, 2, 4, and 6 after randomization. Lower scores indicate less severity; negative change scores indicate improvement. We conducted a responder analysis using a meaningful change threshold of 6 points to assess the proportions of patients who improved at 6 weeks postbaseline, as this threshold coincides with common categories of depression severity, e.g., the difference between mild and moderate depression is 6 points.
Case study results
At baseline N = 172 subjects (n = 84 in the treatment group and n = 88 in the control group) had complete HAMD17 total scores. The difference in proportions of responders at week 6 was 19.1% (p = 0.009), 21.9% (p = 0.009) and 21.1% (p = 0.007) estimated using NRI, IBD MI and DTI MI, respectively (Table 6). When the number of patient dropouts was increased to 40%, the difference in proportions was reduced from 19.1 to 13.1% (p = 0.064), remained similar at 21.9 and 22.6% (p = 0.007), or increased from 21.1 to 24.6% (p = 0.002) when using NRI, IBD MI and DTI MI, respectively, compared to the original data. We repeated the random sampling using dropout model 1 three times and saw similar results. These results show that as missingness increased, IBD estimates remained similar. NRI estimates decreased (and were no longer able to detect statistically significant differences) and DTI MI estimates increased slightly. Using the IBD method, 56.3% of patients in the treatment arm improved at least as much as 6 points in the HAMD17 depression scale compared to 36.3% of those in the placebo arm for a between group difference of responders of 21.9 (CI: [5.3, 36.6], p = 0.009).
Discussion
When continuous data are collected in longitudinal trials with the ultimate interest in differences of a binary response, imputing missing as nonresponse produces positively and negatively biased estimates. Multiply imputing before dichotomization is often slightly less biased than dichotomizing then imputing but both methods perform well when 30% of the responses are missing. When there are higher rates of missing outcomes, dichotomizing before imputing produced estimates with over 10% bias in three scenarios. When applied to real trial data where the true difference in proportions is unknown, the method of imputing prior to dichotomizing produced similar estimates when both 25 and 40% of observations at the endpoint were missing.
Literature addressing IBD and DTI has been contradictory. One reason could be the choice in MI method. For example, Demirtas used a saturated multinomial model to impute the binary outcome. [16] While statistically sound, this MI approach is not readily available in standard statistical software. Another study using the Markov chain Monte Carlo (MCMC) method comparing IBD MI and DTI MI prior to assessing binary outcomes longitudinally via GEEs found an advantage to imputing before dichotomizing, consistent with the work of Yoo. [17] One distinguishing feature of our study was the use of the continuous Y_{j}’s as auxiliary variables in the imputation model making the MAR assumption more likely if they are predictive of missingness, the outcome, or both. [14, 25]
The use of auxiliary variables in addition to the outcomes from interim timepoints in the imputation models provided limited usefulness. It is likely that the correlation between CV and the outcome was not strong enough to systematically increase precision. Further, adverse events were not related to the outcome after conditioning on the treatment group. The use of auxiliary variables are generally useful to reduce the standard error when highly correlated with the outcome or reduce bias when correlated with the outcome and missingness. [22]
It is unclear why NRI is a recommended strategy in light of the highly biased estimates produced in this simulation and others. [7, 8, 32, 33] Practitioners may erroneously believe that NRI always produces conservative results. Indeed, the NRI can only underestimate proportions of responders in treatment groups. However, when the difference in proportions is of interest, which is usually the case, using NRI when there is differential dropout can yield erratic results including positively biased estimates as shown in model 4. [7, 26] Further warnings include those related to composite endpoints [5, 6] and single imputation methods which underestimate the uncertainty of the missing data in the form of overly precise standard errors. [13, 34]
This study aimed to determine the optimal approach to imputing missing observations for responder analysis when a continuous variable is dichotomized. However, it is impossible to simulate all scenarios that could occur in real settings. We simulated outcomes under a normal distribution which may not always happen. For example, the baseline measure will not be normally distributed if the measure is also an inclusion criterion and subjects must meet a cutoff value. Many outcomes, such as PROs, are measured ordinally and imputing a continuous version via a linear regression could produce values not possible on the original scale. Data here were simulated to be MAR yet in real settings missing may be MNAR or a mixture of mechanisms.
Conclusion
We compared imputation methods for missing outcomes in a responder analysis. MI approaches using the longitudinally measured continuous outcome as auxiliary variables performed better than imputing missing observations as failures. Differences in proportions of responders between arms, bias, coverage probabilities of the 95% confidence interval, and other performance measures were similar for both MI approaches with moderate rates of missingness. With high rates of missingness, imputing the continuous outcome prior to dichotomizing was less biased and provided better coverage probability than imputing the already transformed response. Trialists conducting responder analysis by dichotomizing a continuous outcome can benefit from these findings.
Availability of data and materials
The dataset analyzed as the case study during the current study is available at http://www.missingdata.org.uk. The simulated datasets analysed during the current study are available from the corresponding author on reasonable request.
Abbreviations
 AE:

Adverse events
 CI:

Confidence interval
 CV:

Correlated variable
 DTI:

Dichotomize then impute
 FCS:

Full conditional specification
 GEE:

Generalized estimating equation
 HAMD:

Hamilton Depression Rating Scale
 IBD:

Impute before dichotomizing
 JAV:

Just another variable
 MAR:

Missing at random
 MCAR:

Missing completely at random
 MCMC:

Markov chain Monte Carlo
 MI:

Multiple imputation
 MNAR:

Missing not at random
 MSE:

Mean squared error
 NRI:

Nonresponse imputation
 PRO:

Patient reported outcome
 SD:

Standard deviation
 SE_{emp} :

Standard error, empirical
 SE_{mod} :

Standard error, model
References
 1.
Bell ML, Fiero M, Horton NJ, Hsu CH. Handling missing data in RCTs; a review of the top medical journals. BMC Med Res Methodol. 2014;14:1–8.
 2.
LaVange LM, Permutt T. A regulatory perspective on missing data in the aftermath of the NRC report. Stat Med WileyBlackwell. 2016;35:2853–64.
 3.
Brundage M, Osoba D, Bezjak A, Tu D, Palmer M, Pater J. Lessons learned in the assessment of healthrelated quality of life: selected examples from the National Cancer Institute of Canada clinical trials group. J Clin Oncol American Society of Clinical Oncology. 2007:5078–81.
 4.
Moore AR, Straube S, Eccleston C, Derry S, Aldington D, Wiffen P, et al. Estimate at your peril: imputation methods for patient withdrawal can bias efficacy outcomes in chronic pain trials using responder analyses. Pain. 2012;153:265–8.
 5.
Cordoba G, Schwartz L, Woloshin S, Bae H, Gøtzsche PC. Definition, reporting, and interpretation of composite outcomes in clinical trials: systematic review. BMJ British Medical Journal Publishing Group. 2010;341:c3920.
 6.
FerreiraGonzález I, PermanyerMiralda G, Busse JW, Bryant DM, Montori VM, AlonsoCoello P, et al. Methodologic discussions for using and interpreting composite endpoints are limited, but still identify major concerns. J Clin Epidemiol Pergamon. 2007;60:651–7.
 7.
Hall SM, Delucchi KL, Velicer WF, Kahler CW, RangerMoore J, Hedeker D, et al. Statistical analysis of randomized trials in tobacco treatment: longitudinal designs with dichotomous outcome. Nicotine Tob Res. 2001;3:193–202.
 8.
Hedeker D, Mermelstein RJ, Demirtas H. Analysis of binary outcomes with missing data: missing = smoking, last observation carried forward, and a little multiple imputation. Addiction. 2007;102:1564–73.
 9.
Little RJA, Rubin DB. Statistical analysis with missing data. 2002.
 10.
Molenberghs G, Thijs H, Jansen I, Beunckens C, Kenward MG, Mallinckrodt C, et al. Analyzing incomplete longitudinal clinical trial data. Biostatistics. 2004;5:445–64.
 11.
Molenberghs G, Kenward MG. Missing data in clinical studies. Chichester, UK: John Wiley & Sons, Ltd; 2007.
 12.
National Research Council (US) Panel on Handling Missing Data in Clinical Trials. The Prevention and Treatment of Missing Data in Clinical Trials. Washington (DC): National Academies Press (US); 2010. Available from: https://www.ncbi.nlm.nih.gov/books/NBK209899/.
 13.
Bell ML, Fairclough DL. Practical and statistical issues in missing data for longitudinal patient reported outcomes. Stat Methods Med Res. 2013;625:1–20.
 14.
Carpenter JR, Kenward MG. Missing data in randomised controlled trials — a practical guide. 2007;1–206.
 15.
Rubin DB. Multiple imputation for nonresponse in surveys. Vol. 81. Wiley, 2004.
 16.
Demirtas H. Practical advice on how to impute continuous data when the ultimate interest centers on dichotomized outcomes through prespecified thresholds. Commun Stat Simul Comput Taylor & Francis Group. 2007;36:871–89.
 17.
Yoo B. The impact of dichotomization in longitudinal data analysis: a simulation study. Pharm Stat Pharm Stat. 2010;9:298–312.
 18.
Von Hippel PT. How to impute interactions, squares, and other transformed variables. Sociol Methodol. Wiley/Blackwell (10.1111); 2009;39:265–91.
 19.
Seaman SR, Bartlett JW, White IR. Multiple imputation of missing covariates with nonlinear effects and interactions: an evaluation of statistical methods. BMC Med Res Methodol BioMed Central. 2012;12:46.
 20.
Van Buuren S, Brand JPL, GroothuisOudshoorn CGM, Rubin DB. Fully conditional specification in multivariate imputation. J Stat Comput Simul Taylor & Francis. 2006;76:1049–64.
 21.
Azur MJ, Stuart EA, Frangakis C, Leaf PJ. Multiple imputation by chained equations: what is it and how does it work? Int J Methods Psychiatr Res NIH Public Access. 2011;20:40–9.
 22.
White IR, Royston P, Wood AM. Multiple imputation using chained equations: issues and guidance for practice. Stat Med. 2011;30:377–99.
 23.
Bell ML, McKenzie JE. Designing psychooncology randomised trials and cluster randomised trials: variance components and intracluster correlation of commonly used psychosocial measures. Psychooncology. 2013;22:1738–47.
 24.
Lipkovich I, Duan Y, Ahmed S. Multiple imputation compared with restricted pseudolikelihood and generalized estimating equations for analysis of binary repeated measures in clinical studies. Pharm Stat WileyBlackwell. 2005;4:267–85.
 25.
Fairclough D. Design and analysis of quality of life studies in clinical trials; 2010.
 26.
Bell ML, Kenward MG, Fairclough DL, Horton NJ. Differential dropout and bias in randomised controlled trials: when it matters and when it may not. BMJ BMJ Group. 2013;346:e8668.
 27.
Leucht S, Corves C, Arbter D, Engel RR, Li C, Davis JM. Secondgeneration versus firstgeneration antipsychotic drugs for schizophrenia: a metaanalysis. Lancet. 2009;373:31–41.
 28.
Morris TP, White IR, Crowther MJ. Using simulation studies to evaluate statistical methods. Stat Med. 2019 Jan 16.
 29.
Rubin DB. Inference and missing data. Biometrika. 1976;63:581–92.
 30.
Goldstein DJ, Lu Y, Detke MJ, Wiltse C, Mallinckrodt C, Demitrack MA. Duloxetine in the treatment of depression: a doubleblind placebocontrolled comparison with paroxetine. J Clin Psychopharmacol. 2004;24:389–99.
 31.
MissingData.org: London School of Hygeine and Tropical Medicine [internet]. [cited 2018 Oct 7]. Available from: http://www.missingdata.org.uk/.
 32.
Yamaguchi Y, Misumi T, Maruo K. A comparison of multiple imputation methods for incomplete longitudinal binary data. J Biopharm Stat Taylor & Francis. 2017 Sep 8:1–23.
 33.
Nelson DB, Partin MR, Fu SS, Joseph AM, An LC. Why assigning ongoing tobacco use is not necessarily a conservative approach to handling missing tobacco cessation outcomes. Nicotine Tob Res. 2009;11:77–83.
 34.
Mallinckrodt CH. Preventing and treating missing data in longitudinal clinical trials: a practical guide. Cambridge University Press, 2013.
Acknowledgements
N/A
Funding
This work was not supported by grant funding.
Author information
Affiliations
Contributions
LF conceived and designed the study, analyzed and interpreted the data and drafted the manuscript. MB contributed to the conception and design of the study, the interpretation of the results, and editing of the content. All authors read and approved the final version of the manuscript.
Corresponding author
Correspondence to Lysbeth Floden.
Ethics declarations
Ethics approval and consent to participate
N/A
Consent for publication
N/A
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Received
Accepted
Published
DOI
Keywords
 Responder analysis
 Clinical trials
 Multiple imputation
 Missing data
 Missing at random