
Imputation strategies when a continuous outcome is to be dichotomized for responder analysis: a simulation study



In many clinical trials, continuous outcomes are dichotomized to compare the proportions of patients who respond. A common approach to handling missing data in responder analysis, and one recommended in regulatory settings, is to impute missing values as non-response, despite known biases. Multiple imputation is another natural choice, but when a continuous outcome is ultimately dichotomized, the specification of the imputation model comes into question. Practitioners can either impute the missing outcome before dichotomizing, or dichotomize and then impute. In this study we compared multiple imputation of the continuous and dichotomous forms of the outcome with imputation of missing responder status as non-response.


We simulated four response profiles representing a two-arm randomized controlled trial with a continuous outcome at four time points. We deleted data using six missing at random mechanisms and imputed missing observations three ways: 1) imputing as non-responder; 2) multiply imputing the continuous outcome before dichotomizing; and 3) multiply imputing the dichotomized response. Imputation models included the continuous response at all timepoints and, for some scenarios, additional auxiliary variables. We assessed bias, power, coverage of the 95% confidence interval, and type 1 error. Finally, we applied these methods to a longitudinal trial for patients with major depressive disorder.


Both forms of multiple imputation performed better than non-response imputation in terms of bias and type 1 error. When approximately 30% of responses were missing, bias was less than 7.3% for the multiple imputation scenarios, but when 50% of responses were missing, imputing before dichotomizing generally had lower bias than dichotomizing before imputing. Non-response imputation resulted in biased estimates, both underestimates and overestimates. In the example trial data, non-response imputation estimated a smaller difference in proportions than the multiple imputation approaches.


With moderate amounts of missing data, multiply imputing the continuous outcome variable prior to dichotomizing performed similarly to multiply imputing the binary responder status. With higher rates of missingness, multiply imputing the continuous variable was less biased and had better-controlled coverage probabilities of the 95% confidence interval than imputing the dichotomous response. In general, multiple imputation using the longitudinally measured continuous outcome in the imputation model performed better than imputing missing observations as non-responders.



Background

Clinical trials can be evaluated by differences in rates of successful response. In so-called responder analysis, subjects are classified as responders, often by dichotomizing a continuous outcome, if they improve by a specified threshold. For example, responder definitions could be a 5% change in body mass index or an improvement in symptoms by 10 points on a 100-point symptom scale. Responder analysis is commonly used with patient-reported outcomes (PROs) because results are easily interpretable to patients and other stakeholders and can lend language to drug labels and promotional claims.

When the outcome is measured for all subjects at baseline and the timepoint of interest, responder status can be calculated and the analysis is routine. However, missing data are ubiquitous in longitudinal trials, and responder status cannot be determined for subjects missing the outcome. [1] One approach for handling missing data in responder analysis, recommended in the regulatory setting, [2,3,4] is to impute subjects missing the outcome as non-responders, termed non-response imputation (NRI). However, it is a strong assumption that unobserved outcomes are uniformly “failures” rather than draws from the distribution of subjects who do not improve. NRI can be thought of as a composite outcome of response and a dropout indicator. Methodologists warn that composite endpoints can be misleading, for example, when the components have varying degrees of severity and the treatment effect of each component differs between groups. [5, 6] This could be true if dropout depended at least partly on tolerability. For example, a cancer treatment may offer a favorable toxicity profile relative to a comparator. Under NRI, the response rate of the comparator arm would reflect the effect of tolerability more than that of the treatment arm, i.e., it would have more non-responders, which could widen the between-arm difference. While some may view NRI as a conservative approach (since the proportions of responders can only decrease), treating missing as response failure can result in unpredictable differences in proportions between treatment groups. [7, 8]

In longitudinal trials, missing observations can be intermittent, as in a missed study visit, but dropout accounts for most missing data. We focus this article on monotone missing patterns, in which observations are observed up until one is missing and all subsequent observations are missing. Little and Rubin [9] provide a framework for categorizing missing data mechanisms by their relationship with the observed and unobserved values. When the probability of missingness is independent of both the observed and unobserved data, the mechanism is said to be missing completely at random (MCAR). Data are missing at random (MAR) if the probability of missingness is independent of the unobserved data after conditioning on the observed data. Finally, data are considered missing not at random (MNAR) if they are neither MCAR nor MAR, i.e., the missingness mechanism depends on the unobserved values even given the observed data.

The MAR assumption is usually reasonable in the context of longitudinal trials, and current guidance outlines a framework that includes sensitivity analyses to assess the extent to which analytic approaches are robust to missing data assumptions. [10,11,12] Appropriate analyses that assume MAR include mixed models using maximum likelihood estimation, extensions of generalized estimating equations (GEEs) such as weighted GEE, and multiple imputation (MI). [13, 14] Of these, MI is the only approach that can be used with any analytic model. MI is a three-stage process. First, missing values are filled in M times by random draws from the posterior distribution of the imputation model to generate M complete datasets. Second, the M datasets are analyzed with any statistical approach, and third, results are combined using a set of rules that accounts for the uncertainty of the imputed values. [15] The imputation model must be congenial with the substantive model, i.e., include the same variables, but need not be identical to it. Thus, the imputation model can include variables predictive of missingness, such as the outcome from intermediate timepoints, making MI a natural choice in responder analysis using a test of proportions. For these reasons we focus this paper on MI.

When a continuous outcome is ultimately dichotomized, the specification of the imputation model comes into question. Practitioners can either impute the missing outcome before dichotomizing the response (IBD) or dichotomize the outcome and then impute the response (DTI). Demirtas evaluated the efficiency and accuracy of the estimated proportions of responders using IBD under a multivariate normal assumption versus DTI using a saturated binomial model for the dichotomous response indicator, and concluded that DTI was superior across most scenarios. [16] This finding is in contrast to Yoo’s work, which concluded that MI with GEEs performs better when the underlying continuous outcome is imputed prior to dichotomizing. [17] More generally, Von Hippel’s work supports the use of just-another-variable (JAV), analogous to DTI, to impute quadratic and interaction terms under a linear regression analysis model, with a conceptual argument extending to the logistic setting. [18] Others demonstrated poor performance of JAV when data were MAR, particularly with logistic regression [19], prompting some researchers to discourage the practice. [14]

In trial settings where the dichotomized response of a continuous outcome is of interest, there is no clear best way to handle missing data. The aim of this paper is to clarify inconsistent results in the performance of multiply imputing the IBD or DTI in responder analysis and compare with the commonly recommended non-response imputation.


Methods

Notation and analysis

Let the underlying continuous measure to be dichotomized into the response indicator be Yij for subject i, i = 1, …, n, measured at timepoint j. Measurements are repeated over time such that j = 1, …, ti indexes the observed measurements for each subject, and ti represents the time of dropout or the end of the study, T. Without loss of generality, assume that higher values of Y indicate better outcomes. Let Ci = Yij − Yi1, j > 1, represent change from baseline to time j. Subject i is classified as a responder if Ci ≥ λ for some threshold λ, i.e., Ri = I(Ci ≥ λ). Consider a randomized controlled trial with a treatment and a control arm.

The objective of responder analysis is to evaluate the difference in proportion of responders at the endpoint between treatment arms.

Multiple imputation approach

When data have either an intermittent or monotone missing pattern, multiple imputation using the Markov chain Monte Carlo (MCMC) method and multiple imputation by fully conditional specification (FCS, also known as imputation by chained equations) are two popular options. [20] Both are relatively flexible to specify, straightforward to understand, and easy to apply with standard statistical software. FCS assumes the existence of, but does not rely on, a multivariate distribution. [20] Specifically, the FCS approach assumes a conditional density for each partially observed variable and uses a corresponding regression model to sequentially generate imputations, e.g., linear regression for continuous variables and logistic regression for categorical variables. We used FCS MI both for imputing the unobserved continuous outcomes (IBD MI) and for imputing the missing responder status (DTI MI), in both cases using the continuous outcomes at intermediate timepoints as auxiliary variables and, in some scenarios, additional covariates related to the outcome, detailed below. Thus, the comparison is not between MI methods but between specifications of the imputation model.

In general, the FCS procedure can be described in the following steps. [21, 22] Consider a set of variables X = X1, …, Xq in the imputation model. First, starting values for unobserved measures are filled in sequentially for each variable in the order specified. Continuous variables are filled in by regressing one variable, say X1, on the other covariates X2, …, Xq and using the resulting parameters to fill in the missing values of X1; binary variables are filled in similarly using logistic regression. The imputation phase then replaces the filled-in values with imputed values. For one variable, say X1, the imputation model is fit using the observed and filled-in values of the other q − 1 variables as the independent variables and X1 as the dependent variable. In this study, the binary variable, R, is fit using logistic regression and the continuous variables, Yj, are fit with linear regression. The resulting parameters are used to impute the first set of missing values. These steps are repeated for the remaining q − 1 variables to complete a cycle. The algorithm runs through a number of cycles, updating the imputed values until convergence, at which point the current values of all X’s form the first imputed dataset. The process is repeated to produce M datasets.
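The analyses in this paper were run in SAS (sample code in the Appendix). Purely as an illustration of the cycle just described, the sketch below shows a toy chained-equations loop in Python for a single partially observed continuous variable; the function names are ours, and a full FCS implementation would also draw the regression parameters from their posterior before each imputation step.

```python
import random
import statistics

def simple_ols(x, y):
    """Closed-form intercept and slope for a one-predictor regression."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def fcs_cycle(y1, y2, miss2, rng, n_cycles=5):
    """Toy FCS sketch: y2 is missing at the indices in miss2 and is imputed
    by regressing the observed y2 on the fully observed y1."""
    n = len(y2)
    obs = [i for i in range(n) if i not in miss2]
    # Step 1: starting values -- fill missing y2 with the observed mean
    fill = statistics.fmean(y2[i] for i in obs)
    y2 = [y2[i] if i not in miss2 else fill for i in range(n)]
    for _ in range(n_cycles):
        # Step 2: fit the imputation model on the observed cases
        a, b = simple_ols([y1[i] for i in obs], [y2[i] for i in obs])
        resid_sd = statistics.pstdev(y2[i] - (a + b * y1[i]) for i in obs)
        # Step 3: replace filled-in values with draws from the fitted model
        for i in miss2:
            y2[i] = a + b * y1[i] + rng.gauss(0, resid_sd)
    return y2
```

With several incomplete variables, steps 2 and 3 repeat for each variable in turn to complete one cycle, and the whole procedure is rerun M times to produce M imputed datasets.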

To calculate the estimand θ using IBD MI, we imputed the missing continuous outcomes Yj, calculated responder status Ri, estimated the difference in proportions in each of the M datasets, and combined the estimates using Rubin’s rules. For DTI MI, we calculated responder status prior to imputing and included the partially observed responder status, Ri, in the imputation model. Using the imputed Ri, we calculated the difference in proportions between treatment arms in each of the M datasets and combined the estimates using Rubin’s rules.
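In the final pooling step, Rubin’s rules combine the M point estimates and their standard errors. A minimal sketch (function name ours):

```python
import statistics

def rubins_rules(estimates, std_errors):
    """Pool M estimates of the difference in proportions (Rubin's rules)."""
    M = len(estimates)
    q_bar = statistics.fmean(estimates)                 # pooled estimate
    W = statistics.fmean(se ** 2 for se in std_errors)  # within-imputation variance
    B = statistics.variance(estimates)                  # between-imputation variance
    T = W + (1 + 1 / M) * B                             # total variance
    return q_bar, T ** 0.5                              # estimate, pooled SE
```

The (1 + 1/M)B term is what inflates the pooled standard error to reflect the uncertainty introduced by imputing the missing values.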

Data generation

We simulated twenty-four scenarios representing a randomized trial with two treatment arms (N = 200) and a continuous outcome measured at baseline and three subsequent timepoints. The scenarios crossed two response profiles with the same mean difference at the final assessment, six dropout mechanisms, and two dropout rates. One response profile was linear, with only treatment A effective. In the other, treatment A is effective after a period of worsening while treatment B shows no effectiveness after a period of improving, so the mean trajectories of treatments A and B cross. Two additional response profiles with no treatment difference at the final timepoint were used to evaluate type 1 error.

Data for the continuous response were simulated to represent a PRO scale with equal allocation to treatment groups. Let Yij represent a continuous measure for the ith individual at the jth timepoint where  j = 1, …, 4. Specifically, data were simulated according to the underlying model:

$$ {Y}_{ij}=\left({\beta}_0+{b}_i\right)+{\beta}_j+{\delta}_j\ast {x}_{trt}+{\epsilon}_{ij} $$

where xtrt = 1 for treatment arm A and 0 for treatment arm B, βj denotes the effect of the jth timepoint and δjxtrt is the interaction of treatment group and timepoint. Here, \( {b}_i\sim N\left(0,{\sigma}_b^2\right) \) represents the random subject effect and the error term, \( {\epsilon}_{ij}\sim N\ \left(0,{\sigma}_{\epsilon}^2\right) \), represents the within-subject error. The mean vectors for the linear response profile were μA = (65, 67, 69, 71)′ and μB = (65, 65, 65, 65)′. The non-linear response profile was μA = (65, 63, 68, 71)′ and μB = (65, 67, 66, 65)′. The third and fourth response profiles, used to estimate type 1 error, were μ = (65, 65, 65, 65)′ for both arms; and μA = (65, 67, 69, 71)′ and μB = (65, 63, 68, 71)′, respectively. Based on typical PRO scale data [23], we set σb = 12 and σϵ = 7. These variance components correspond to a compound symmetric covariance structure with a within-person correlation of 0.7. Additionally, we created a normally distributed continuous variable (CV) correlated with Y4 such that \( {\rho}_{CV,{Y}_4}\cong 0.3 \); its mean and standard deviation were 38.0 and 62.7, respectively.
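The paper simulated this model in SAS; as an illustrative sketch only (names ours, linear-profile means), the data-generating process can be written as:

```python
import random

MU_A = [65, 67, 69, 71]    # linear profile, treatment A
MU_B = [65, 65, 65, 65]    # treatment B
SIGMA_B, SIGMA_E = 12, 7   # random-intercept SD and within-subject SD

def simulate_trial(n_per_arm, rng):
    """Draw Y_ij = mu_j + b_i + e_ij with a shared random intercept b_i,
    giving the compound-symmetric within-person correlation
    sigma_b^2 / (sigma_b^2 + sigma_e^2)."""
    data = []
    for trt, mu in ((1, MU_A), (0, MU_B)):
        for _ in range(n_per_arm):
            b_i = rng.gauss(0, SIGMA_B)  # subject-level random effect
            y = [m + b_i + rng.gauss(0, SIGMA_E) for m in mu]
            data.append({"trt": trt, "y": y})
    return data
```

Because b_i is shared across a subject’s four measurements, every pair of visits within a subject has the same correlation, i.e., compound symmetry.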

Let Yi4 − Yi1 = Ci represent change from baseline to timepoint j = 4. To achieve 80% power to detect the difference in response rates between the two arms, the dichotomized response was defined as Ri4 = I(Ci ≥ 12.4). Using this definition, response rates under the first and second response profiles were 25.6% for treatment A and 10.6% for treatment B. (Exploratory results using thresholds ranging from 10 to 20 produced similar trends.)

Missing data

We used six probability models representing plausible trial scenarios to delete post-baseline observations using a MAR mechanism. Let Zij = 0 if outcome Yij is missing and 1 otherwise.

Dropout model 1

For the first model of dropout, the probability of a missing response depends on the value of the observed outcome at the previous timepoint, Yj − 1, such that \( P\left({Z}_{ij}=0\right)\propto \left(1-\Phi \left({Y}_{j-1},{\hat{\theta}}_{Y_{j-1}},{\hat{\sigma}}_{Y_{j-1}}^2\right)\right) \), where j > 1 and Φ is the normal cumulative distribution function with mean \( {\hat{\theta}}_{Y_{j-1}} \) and variance \( {\hat{\sigma}}_{Y_{j-1}}^2 \) estimated from the data. This model represents dropout due to lack of efficacy.
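Up to the normalizing cutoff described below, this selection weight is a direct function of the normal CDF. The sketch below (names ours) shows that lower previous-visit scores receive a higher dropout weight:

```python
import math

def dropout_weight_model1(y_prev, mean, sd):
    """Weight proportional to P(Z_ij = 0): one minus the normal CDF of the
    previous outcome, so subjects doing poorly are more likely to drop out."""
    phi = 0.5 * (1 + math.erf((y_prev - mean) / (sd * math.sqrt(2))))
    return 1 - phi
```

Dropout models 2 and 3 reuse the same weight, swapping 1 − Φ for Φ in the arm where high scorers are more likely to drop out.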

Dropout model 2

The mechanism leading to dropout can differ by treatment. [25] To model this, observations in treatment A were more likely to be missing when the outcome, Yj − 1, value was low such that \( P\left({Z}_{ij}=0\right)\propto \left(1-\Phi \left({Y}_{j-1},{\hat{\theta}}_{Y_{j-1}},{\hat{\sigma}}_{Y_{j-1}}^2\right)\right) \), j > 1, and observations in treatment B were more likely to be missing when Yj − 1 values were high such that \( P\left({Z}_{ij}=0\right)\propto \left(\Phi \left({Y}_{j-1},{\hat{\theta}}_{Y_{j-1}},{\hat{\sigma}}_{Y_{j-1}}^2\right)\right) \), j > 1.

Dropout model 3

Model 3 reverses the missingness mechanisms of model 2 across the treatment arms. For example, lack of efficacy could drive dropout in a placebo arm, while those on treatment may be less motivated to return for follow-up when they are feeling better, i.e., improved efficacy. Here, treatment B observations were more likely to be missing when the outcome, Yj − 1, was low such that \( P\left({Z}_{ij}=0\right)\propto \left(1-\Phi \left({Y}_{j-1},{\hat{\theta}}_{Y_{j-1}},{\hat{\sigma}}_{Y_{j-1}}^2\right)\right) \), j > 1, and treatment A observations were more likely to be missing when Yj − 1 was high such that \( P\left({Z}_{ij}=0\right)\propto \left(\Phi \left({Y}_{j-1},{\hat{\theta}}_{Y_{j-1}},{\hat{\sigma}}_{Y_{j-1}}^2\right)\right) \), j > 1.

Dropout model 4

Dropout rates can also differ between treatment arms. [26, 27] We modeled substantial differential dropout by including a treatment-specific weight term, \( {w}_{x_{trt}} \), such that \( P\left({Z}_{ij}=0\right)\propto {w}_{x_{trt}}\ast \left(1-\Phi \left({Y}_{j-1},{\hat{\theta}}_{Y_{j-1}},{\hat{\sigma}}_{Y_{j-1}}^2\right)\right) \), where w1 = 0.3 and w0 = 1.

Dropout model 5

Here, Yij was set to missing with probability \( P\left({Z}_{ij}=0\right)\propto \left[\frac{1}{1+{e}^{\left({b}_1{Y}_{j-1}\right)}}\right] \), where j > 1 and b1 = 0.01, modeling dropout due to lack of efficacy through a different mechanism than model 1.

Dropout model 6

We simulated a repeated indicator variable representing occurrence of adverse events (AEs) to represent drug tolerability. The probability of AE depended jointly on treatment arm and occurrence of an AE at the prior visit such that for each assessment for each treatment group

$$ {p}_j^{AE}\left({x}_{trt},\gamma \right)={P}_X\left({AE}_j=1|{AE}_{j-1}=\gamma \right)\ for\ j>2 $$

where xtrt represents the treatment arm and γ represents AE status at j − 1. Probabilities were estimated from actual trial data and were similar to prior published event rates (Table 1). [24] For simplicity we assumed that no AEs occurred at baseline and the probability of AE at j = 2 was 0.3 for xtrt = 1 and 0.5 for xtrt = 0. For each subject we generated AE status at each post-baseline visit as \( {AE}_{ij}\sim Bernoulli\left({p}_j^{AE}\right) \).
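The AE chain can be sketched as a two-state Markov process. The transition probabilities in Table 1 are not reproduced here, so the values below are hypothetical placeholders; only the j = 2 probabilities (0.3 and 0.5) come from the text, and the names are ours.

```python
import random

# Hypothetical P(AE_j = 1 | AE_{j-1} = g) per arm; the paper estimated
# these from actual trial data (Table 1), not reproduced here.
P_AE = {1: {0: 0.2, 1: 0.5},   # treatment arm, x_trt = 1
        0: {0: 0.3, 1: 0.6}}   # control arm,   x_trt = 0
P_AE_FIRST = {1: 0.3, 0: 0.5}  # P(AE at j = 2), as stated in the text

def simulate_ae_path(trt, n_visits, rng):
    """AE indicators per visit: no AE at baseline, then first-order Markov."""
    path = [0]                                # baseline, j = 1
    ae = int(rng.random() < P_AE_FIRST[trt])  # j = 2
    path.append(ae)
    for _ in range(n_visits - 2):             # j = 3, 4, ...
        ae = int(rng.random() < P_AE[trt][ae])
        path.append(ae)
    return path
```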

Table 1 Conditional probabilities of AEs for j > 2

The response Yij was set to missing with probability \( P\left({Z}_{ij}=0\right)\propto \left[\frac{1}{1+{e}^{\left({b}_1{Y}_{j-1}+{b}_2{AE}_j\right)}}\right] \), where j > 1, b1 = 0.01 and b2 =  − 0.40, to model dropout due to lack of efficacy and tolerability. If Yij was set to missing, all subsequent AEs were also set to missing.

For all dropout models, we multiplied P(Zij = 0) by a randomly generated uniform variable and determined a cutoff value so that the overall proportion of missing responses at j = 4 was 30% or 50%. If a patient’s outcome was missing at timepoint a, then all Yj with j > a were also set to missing.
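A sketch of this thinning-and-cutoff step (names ours): each weight is multiplied by a uniform draw, the cutoff is set at the appropriate quantile of the per-subject maximum scores so that the target proportion is missing at the final visit, and missingness is then made monotone.

```python
import random

def apply_monotone_dropout(weights, target_rate, rng):
    """weights[i][j]: dropout weight for subject i at post-baseline visit j.
    Returns indicators Z with Z = 1 observed, Z = 0 missing (monotone)."""
    n = len(weights)
    scores = [[w * rng.random() for w in row] for row in weights]
    # A subject is missing at the final visit iff any score exceeds the
    # cutoff, so take the (1 - target_rate) quantile of per-subject maxima.
    maxima = sorted(max(row) for row in scores)
    cutoff = maxima[int((1 - target_rate) * n) - 1]
    z = []
    for row in scores:
        dropped, pattern = False, []
        for s in row:
            dropped = dropped or s > cutoff  # once dropped, stays dropped
            pattern.append(0 if dropped else 1)
        z.append(pattern)
    return z
```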

Analysis and comparison of methods

We determined the required number of simulated datasets per scenario, nsim, by estimating the standard deviation (SD) of \( \hat{\theta} \) to be ≤ 6.0 based on exploratory simulations, and setting the maximum tolerated Monte Carlo standard error (MCSE) of bias to be ≤ 0.15. Given \( MCSE(Bias)=\sqrt{\frac{Var\left(\hat{\theta}\right)}{n_{sim}}} \), the required number of datasets was nsim = 1600. [28] For each simulated dataset, we evaluated the proportions of responders in, and the difference between, the arms at j = 4. For IBD MI and DTI MI, all imputation models contained the group indicator, xtrt, and the continuous outcomes Yj. In some imputation models, we included CV, a correlated covariate, to evaluate the utility of an auxiliary variable. For DTI MI, the imputation model also included the binary response variable, R. Scenarios using dropout model 6 additionally included AE status at j = 2, 3, 4 in the imputation model. The M = 30 or M = 50 estimates [22] of the difference in proportions and their standard errors (when 30% or 50% of responses at j = 4 were missing, respectively) were combined using Rubin’s rules. [29] Sample SAS code is included in the Appendix.
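Rearranging the MCSE formula gives nsim = (SD/MCSE)² = (6.0/0.15)² = 40² = 1600, which a quick check confirms (illustrative Python, name ours):

```python
import math

def required_nsim(sd_theta, max_mcse):
    """n_sim solving MCSE(bias) = sd / sqrt(n_sim) = max_mcse; in practice
    the result is rounded up to the next integer when not exact."""
    return (sd_theta / max_mcse) ** 2
```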

We compared percent bias, coverage probability of the 95% confidence interval (CI) from multiple imputation, power, and type 1 error rate to assess the relative performance of NRI, IBD MI and DTI MI to the fully observed simulated data. We calculated percent bias of the difference as:

$$ Percent\ bias\ of\ the\ difference=\frac{\left({\overline{p}}_A-{\overline{p}}_B\right)-\left({\pi}_A-{\pi}_B\right)}{\pi_A-{\pi}_B}\ast 100 $$

where π represents the true proportion of responders and \( \overline{p} \) is the average estimated proportion of responders across datasets with missing observations. Positive values represent positive bias of the estimated difference in proportions. We calculated coverage probability as the proportion of MI results in which the true value was contained within the 95% CI. Power was calculated as the percentage of statistically significant group differences at the 0.05 significance level. Similarly, the type 1 error rate was calculated as the percentage of statistically significant group differences at the 0.05 significance level when simulating a scenario with no between-group difference. We assessed the performance of the simulation with the MCSE of bias, the mean squared error (MSE), the model-based standard error (SEmod) and the empirical standard error of the difference in proportions (SEemp). Let \( \hat{\theta}={\hat{p}}_A-{\hat{p}}_B \) be the difference in proportions between groups. The MSE, calculated as

$$ MSE=\frac{1}{n_{sim}}\sum \limits_{i=1}^{n_{sim}}{\left({\hat{\theta}}_i-\theta \right)}^2 $$

is a combined measure of variance and bias. SEmod is the average standard error of the \( \hat{\theta_i} \), and SEemp is the standard error of \( \hat{\theta} \) across simulations, measuring the efficiency of \( \hat{\theta} \). Simulation and analyses were conducted using SAS software version 9.4 (SAS Institute Inc., 2013).
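Given the per-replicate estimates and confidence intervals, the performance measures above reduce to a few averages (illustrative Python, names ours):

```python
import statistics

def performance_metrics(theta_hats, cis, theta_true):
    """Percent bias, coverage of the 95% CI, and MSE over n_sim replicates.
    cis is a list of (lower, upper) interval endpoints."""
    n = len(theta_hats)
    pct_bias = (statistics.fmean(theta_hats) - theta_true) / theta_true * 100
    coverage = sum(lo <= theta_true <= hi for lo, hi in cis) / n
    mse = statistics.fmean((t - theta_true) ** 2 for t in theta_hats)
    return pct_bias, coverage, mse
```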


Results

When the response profile was linear with 30% of responses missing, bias was less than 7.3% for all MI approaches and ranged from − 36.7 to 8.5% for NRI (Table 2). Similar results were seen in the non-linear response profile (Appendix A). IBD MI had slightly lower or equal bias relative to DTI MI in all scenarios, and the bias was conservative in direction, i.e., negative, for 4 of the 5 dropout models. All MI models included the continuous repeated outcomes as auxiliary variables in the imputation model. When using DTI MI, adding the correlated auxiliary variable reduced bias and changed its direction from positive to negative in all scenarios except when dropout rates were differential. Including the auxiliary variable in the IBD MI model increased the negative bias in all but the differential dropout scenario.

Table 2 Comparison of simulated responder analysis results using non-response imputation, impute-before-dichotomizing and dichotomize-then-impute multiple imputation1

The probability of dropout in model 6 was related to both treatment arms, through AE status, and outcome score. Including AE status at j = 2, 3, 4 in the imputation model negligibly reduced bias with DTI MI, and maintained a similar level of bias with IBD MI, compared to no auxiliary variables.

NRI suffered from high negative bias and substantial loss of power to detect differences in all but one scenario. The proportion of responders in each treatment arm was always underestimated because missing observations were classified as non-responders. When the dropout mechanism affected the two arms differentially (model 4), NRI produced a positively biased difference estimate.

When 50% of responses were missing with the linear response profile, IBD MI had less bias than DTI MI without the use of CV in all scenarios, and bias was negative in direction for 5 of the 6 dropout models (Table 3). Specifically, bias with DTI MI (with no auxiliary variables) ranged from − 21.8 to 11.0%. Under the same conditions, the bias of IBD MI ranged from − 6.9 to 0.7%. In general, power to detect treatment differences was lower using IBD MI compared to DTI MI.

Table 3 Comparison of simulated responder analysis results when 50% responses are missing using non-response imputation, impute-before-dichotomizing and dichotomize-then-impute multiple imputation1

Coverage probabilities of the 95% confidence interval for all MI approaches ranged from 93.2 to 95.3% when 30% of responses were missing (Table 2). When 50% of responses were missing, coverage when imputing the continuous response was closer to the nominal 95% level than when imputing the dichotomized response, ranging from 90.1 to 94.4% and 77.5 to 92.6%, respectively (Table 3). NRI coverage was lower than that of the MI approaches in all scenarios except when there was differential dropout. Although IBD MI generally had lower power to detect treatment differences than DTI MI, the difference was negligible. NRI was more precise than all MI approaches, as measured by the SEemp of the difference in proportions between groups (Table 4). However, as a consequence of its high bias, NRI performed poorly in terms of MSE compared to the MI approaches. The MCSE of bias was between 0.12 and 0.14, below our tolerated level of uncertainty, when 30% of responses were missing. The SEmod was similar to the SEemp, suggesting that bias of SEemp is not a concern.

Table 4 Comparison of Monte Carlo standard error, mean squared error, model and empirical standard error using non-response imputation, impute-before-dichotomizing and dichotomize-then-impute multiple imputation1

Type 1 error rate was controlled at less than 5% for both multiple imputation strategies, reported in Table 5. When dropout rates differed between groups (model 4), NRI had type 1 error rates ranging from 0.16 to 0.31, suggesting false positives are of concern.

Table 5 Type 1 error rate for non-response imputation, dichotomizing before multiply imputing, and multiply imputing before dichotomizing when missing =30%1

The non-linear response profile demonstrated very similar results overall, as shown in the Appendix.

Application to a clinical trial

We applied the above imputation approaches to data adapted from a Phase III randomized, double-blind clinical trial in patients with major depressive disorder. The trial evaluated efficacy of duloxetine 40 mg/d and 80 mg/d versus placebo and a comparator, paroxetine 20 mg/d, to treat emotional and physical symptoms in depressed patients. [30] Details of the original trial design are reported in Goldstein et al. [30] For the purpose of this study, we considered a publicly available dataset modified from the original trial data. [31] The trial included four parallel arms; the modified dataset has two arms: the original placebo arm and a “treatment” arm consisting of a random sample of patients from the three active drug arms. At 6 weeks post randomization, 75% of the patients remained in the study. To further illustrate the effect of imputation choice, we used a MAR mechanism (Dropout model 1) to identify observations to omit so that 60% of patients have outcome values at week 6. The outcome was the total score on the 17-item Hamilton depression rating scale (HAMD-17), measured at baseline and weeks 1, 2, 4, and 6 after randomization. Lower scores indicate less severity; negative change scores indicate improvement. We conducted a responder analysis using a meaningful change threshold of 6 points to assess the proportions of patients who improved at 6 weeks post-baseline, as this threshold coincides with common categories of depression severity, e.g., the difference between mild and moderate depression is 6 points.

Case study results

At baseline, N = 172 subjects (n = 84 in the treatment group and n = 88 in the control group) had complete HAMD-17 total scores. The difference in proportions of responders at week 6 was 19.1% (p = 0.009), 21.9% (p = 0.009) and 21.1% (p = 0.007) estimated using NRI, IBD MI and DTI MI, respectively (Table 6). When the number of patient dropouts was increased to 40%, compared to the original data the difference in proportions decreased from 19.1 to 13.1% (p = 0.064) with NRI, remained similar at 22.6% (p = 0.007) with IBD MI, and increased from 21.1 to 24.6% (p = 0.002) with DTI MI. We repeated the random sampling using dropout model 1 three times and saw similar results. These results show that as missingness increased, IBD MI estimates remained similar, NRI estimates decreased (and no longer detected a statistically significant difference), and DTI MI estimates increased slightly. Using the IBD method, 56.3% of patients in the treatment arm improved by at least 6 points on the HAMD-17 depression scale compared to 36.3% of those in the placebo arm, a between-group difference of 21.9 percentage points (95% CI: [5.3, 36.6], p = 0.009).

Table 6 Comparison of imputation results for a clinical trial example. Treatment arm: n = 84; Placebo arm: n = 88


Discussion

When continuous data are collected in longitudinal trials with the ultimate interest in differences of a binary response, imputing missing observations as non-response can produce both positively and negatively biased estimates. Multiply imputing before dichotomization is often slightly less biased than dichotomizing and then imputing, but both methods perform well when 30% of responses are missing. At higher rates of missing outcomes, dichotomizing before imputing produced estimates with over 10% bias in three scenarios. When applied to real trial data where the true difference in proportions is unknown, imputing prior to dichotomizing produced similar estimates whether 25% or 40% of observations at the endpoint were missing.

Literature addressing IBD and DTI has been contradictory. One reason could be the choice in MI method. For example, Demirtas used a saturated multinomial model to impute the binary outcome. [16] While statistically sound, this MI approach is not readily available in standard statistical software. Another study using the Markov chain Monte Carlo (MCMC) method comparing IBD MI and DTI MI prior to assessing binary outcomes longitudinally via GEEs found an advantage to imputing before dichotomizing, consistent with the work of Yoo. [17] One distinguishing feature of our study was the use of the continuous Yj’s as auxiliary variables in the imputation model making the MAR assumption more likely if they are predictive of missingness, the outcome, or both. [14, 25]

The use of auxiliary variables in addition to the outcomes from interim timepoints in the imputation models provided limited benefit. It is likely that the correlation between CV and the outcome was not strong enough to systematically increase precision. Further, adverse events were not related to the outcome after conditioning on the treatment group. Auxiliary variables are generally useful for reducing the standard error when highly correlated with the outcome, or for reducing bias when correlated with both the outcome and missingness. [22]

It is unclear why NRI is a recommended strategy in light of the highly biased estimates produced in this simulation and others. [7, 8, 32, 33] Practitioners may erroneously believe that NRI always produces conservative results. Indeed, NRI can only underestimate the proportion of responders within each treatment group. However, when the difference in proportions is of interest, which is usually the case, using NRI in the presence of differential dropout can yield erratic results, including positively biased estimates as shown in model 4. [7, 26] Further warnings include those related to composite endpoints [5, 6] and to single imputation methods, which understate the uncertainty of the missing data in the form of overly precise standard errors. [13, 34]

This study aimed to determine the optimal approach to imputing missing observations for responder analysis when a continuous variable is dichotomized. However, it is impossible to simulate all scenarios that could occur in real settings. We simulated outcomes under a normal distribution, which may not hold in practice. For example, the baseline measure will not be normally distributed if it is also an inclusion criterion and subjects must meet a cutoff value. Many outcomes, such as PROs, are measured ordinally, and imputing a continuous version via linear regression could produce values that are not possible on the original scale. Data here were simulated to be MAR, yet in real settings missingness may be MNAR or a mixture of mechanisms.


We compared imputation methods for missing outcomes in a responder analysis. MI approaches using the longitudinally measured continuous outcome as auxiliary variables performed better than imputing missing observations as failures. Differences in proportions of responders between arms, bias, coverage probabilities of the 95% confidence interval, and other performance measures were similar for both MI approaches with moderate rates of missingness. With high rates of missingness, imputing the continuous outcome prior to dichotomizing was less biased and provided better coverage probability than imputing the already transformed response. Trialists conducting responder analysis by dichotomizing a continuous outcome can benefit from these findings.

Availability of data and materials

The dataset analyzed as the case study during the current study is available at The simulated datasets analysed during the current study are available from the corresponding author on reasonable request.



AE :

Adverse events

CI :

Confidence interval

CV :

Correlated variable

DTI :

Dichotomize then impute

FCS :

Full conditional specification

GEE :

Generalized estimating equation

HDRS :

Hamilton Depression Rating Scale

IBD :

Impute before dichotomizing

JAV :

Just another variable

MAR :

Missing at random

MCAR :

Missing completely at random

MCMC :

Markov chain Monte Carlo

MI :

Multiple imputation

MNAR :

Missing not at random

MSE :

Mean squared error

NRI :

Non-response imputation

PRO :

Patient reported outcome

SD :

Standard deviation

SEemp :

Standard error, empirical

SEmod :

Standard error, model


  1. Bell ML, Fiero M, Horton NJ, Hsu C-H. Handling missing data in RCTs; a review of the top medical journals. BMC Med Res Methodol. 2014;14:1–8.

  2. LaVange LM, Permutt T. A regulatory perspective on missing data in the aftermath of the NRC report. Stat Med. 2016;35:2853–64.

  3. Brundage M, Osoba D, Bezjak A, Tu D, Palmer M, Pater J. Lessons learned in the assessment of health-related quality of life: selected examples from the National Cancer Institute of Canada clinical trials group. J Clin Oncol. 2007:5078–81.

  4. Moore AR, Straube S, Eccleston C, Derry S, Aldington D, Wiffen P, et al. Estimate at your peril: imputation methods for patient withdrawal can bias efficacy outcomes in chronic pain trials using responder analyses. Pain. 2012;153:265–8.

  5. Cordoba G, Schwartz L, Woloshin S, Bae H, Gøtzsche PC. Definition, reporting, and interpretation of composite outcomes in clinical trials: systematic review. BMJ. 2010;341:c3920.

  6. Ferreira-González I, Permanyer-Miralda G, Busse JW, Bryant DM, Montori VM, Alonso-Coello P, et al. Methodologic discussions for using and interpreting composite endpoints are limited, but still identify major concerns. J Clin Epidemiol. 2007;60:651–7.

  7. Hall SM, Delucchi KL, Velicer WF, Kahler CW, Ranger-Moore J, Hedeker D, et al. Statistical analysis of randomized trials in tobacco treatment: longitudinal designs with dichotomous outcome. Nicotine Tob Res. 2001;3:193–202.

  8. Hedeker D, Mermelstein RJ, Demirtas H. Analysis of binary outcomes with missing data: missing = smoking, last observation carried forward, and a little multiple imputation. Addiction. 2007;102:1564–73.

  9. Little RJA, Rubin DB. Statistical analysis with missing data. 2002.

  10. Molenberghs G, Thijs H, Jansen I, Beunckens C, Kenward MG, Mallinckrodt C, et al. Analyzing incomplete longitudinal clinical trial data. Biostatistics. 2004;5:445–64.

  11. Molenberghs G, Kenward MG. Missing data in clinical studies. Chichester: John Wiley & Sons; 2007.

  12. National Research Council (US) Panel on Handling Missing Data in Clinical Trials. The prevention and treatment of missing data in clinical trials. Washington (DC): National Academies Press (US); 2010.

  13. Bell ML, Fairclough DL. Practical and statistical issues in missing data for longitudinal patient reported outcomes. Stat Methods Med Res. 2013;625:1–20.

  14. Carpenter JR, Kenward MG. Missing data in randomised controlled trials — a practical guide. 2007. p. 1–206.

  15. Rubin DB. Multiple imputation for nonresponse in surveys. Vol. 81. Wiley; 2004.

  16. Demirtas H. Practical advice on how to impute continuous data when the ultimate interest centers on dichotomized outcomes through pre-specified thresholds. Commun Stat Simul Comput. 2007;36:871–89.

  17. Yoo B. The impact of dichotomization in longitudinal data analysis: a simulation study. Pharm Stat. 2010;9:298–312.

  18. Von Hippel PT. How to impute interactions, squares, and other transformed variables. Sociol Methodol. 2009;39:265–91.

  19. Seaman SR, Bartlett JW, White IR. Multiple imputation of missing covariates with non-linear effects and interactions: an evaluation of statistical methods. BMC Med Res Methodol. 2012;12:46.

  20. Van Buuren S, Brand JPL, Groothuis-Oudshoorn CGM, Rubin DB. Fully conditional specification in multivariate imputation. J Stat Comput Simul. 2006;76:1049–64.

  21. Azur MJ, Stuart EA, Frangakis C, Leaf PJ. Multiple imputation by chained equations: what is it and how does it work? Int J Methods Psychiatr Res. 2011;20:40–9.

  22. White IR, Royston P, Wood AM. Multiple imputation using chained equations: issues and guidance for practice. Stat Med. 2011;30:377–99.

  23. Bell ML, McKenzie JE. Designing psycho-oncology randomised trials and cluster randomised trials: variance components and intra-cluster correlation of commonly used psychosocial measures. Psychooncology. 2013;22:1738–47.

  24. Lipkovich I, Duan Y, Ahmed S. Multiple imputation compared with restricted pseudo-likelihood and generalized estimating equations for analysis of binary repeated measures in clinical studies. Pharm Stat. 2005;4:267–85.

  25. Fairclough D. Design and analysis of quality of life studies in clinical trials; 2010.

  26. Bell ML, Kenward MG, Fairclough DL, Horton NJ. Differential dropout and bias in randomised controlled trials: when it matters and when it may not. BMJ. 2013;346:e8668.

  27. Leucht S, Corves C, Arbter D, Engel RR, Li C, Davis JM. Second-generation versus first-generation antipsychotic drugs for schizophrenia: a meta-analysis. Lancet. 2009;373:31–41.

  28. Morris TP, White IR, Crowther MJ. Using simulation studies to evaluate statistical methods. Stat Med. 2019 Jan 16.

  29. Rubin DB. Inference and missing data. Biometrika. 1976;63:581–92.

  30. Goldstein DJ, Lu Y, Detke MJ, Wiltse C, Mallinckrodt C, Demitrack MA. Duloxetine in the treatment of depression: a double-blind placebo-controlled comparison with paroxetine. J Clin Psychopharmacol. 2004;24:389–99.

  31. London School of Hygiene and Tropical Medicine [internet]. [cited 2018 Oct 7].

  32. Yamaguchi Y, Misumi T, Maruo K. A comparison of multiple imputation methods for incomplete longitudinal binary data. J Biopharm Stat. 2017:1–23.

  33. Nelson DB, Partin MR, Fu SS, Joseph AM, An LC. Why assigning ongoing tobacco use is not necessarily a conservative approach to handling missing tobacco cessation outcomes. Nicotine Tob Res. 2009;11:77–83.

  34. Mallinckrodt CH. Preventing and treating missing data in longitudinal clinical trials: a practical guide. Cambridge University Press; 2013.





This work was not supported by grant funding.

Author information

Authors and Affiliations



LF conceived and designed the study, analyzed and interpreted the data and drafted the manuscript. MB contributed to the conception and design of the study, the interpretation of the results, and editing of the content. All authors read and approved the final version of the manuscript.

Corresponding author

Correspondence to Lysbeth Floden.

Ethics declarations

Ethics approval and consent to participate


Consent for publication


Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated.


About this article


Cite this article

Floden, L., Bell, M.L. Imputation strategies when a continuous outcome is to be dichotomized for responder analysis: a simulation study. BMC Med Res Methodol 19, 161 (2019).
