Imputation strategies when a continuous outcome is to be dichotomized for responder analysis: a simulation study

Background In many clinical trials continuous outcomes are dichotomized to compare proportions of patients who respond. A common and recommended approach to handling missing data in responder analysis is to impute as non-responders, despite known biases. Multiple imputation is another natural choice but when a continuous outcome is ultimately dichotomized, the specifications of the imputation model come into question. Practitioners can either impute the missing outcome before dichotomizing or dichotomize then impute. In this study we compared multiple imputation of the continuous and dichotomous forms of the outcome, and imputing responder status as non-response in responder analysis. Methods We simulated four response profiles representing a two-arm randomized controlled trial with a continuous outcome at four time points. We omitted data using six missing at random mechanisms, and imputed missing observations three ways: 1) replacing as non-responder; 2) multiply imputing before dichotomizing; and 3) multiply imputing the dichotomized response. Imputation models included the continuous response at all timepoints, and additional auxiliary variables for some scenarios. We assessed bias, power, coverage of the 95% confidence interval, and type 1 error. Finally, we applied these methods to a longitudinal trial for patients with major depressive disorder. Results Both forms of multiple imputation performed better than non-response imputation in terms of bias and type 1 error. When approximately 30% of responses were missing, bias was less than 7.3% for multiple imputation scenarios but when 50% of responses were missing, imputing before dichotomizing generally had lower bias compared to dichotomizing before imputing. Non-response imputation resulted in biased estimates, both underestimates and overestimates. In the example trial data, non-response imputation estimated a smaller difference in proportions than multiply imputed approaches. Conclusions With moderate amounts of missing data, multiply imputing the continuous outcome variable prior to dichotomizing performed similar to multiply imputing the binary responder status. With higher rates of missingness, multiply imputing the continuous variable was less biased and had well-controlled coverage probabilities of the 95% confidence interval compared to imputing the dichotomous response. In general, multiple imputation using the longitudinally measured continuous outcome in the imputation model performed better than imputing missing observations as non-responders.


Background
Clinical trials can be evaluated by differences in rates of successful response. In so-called responder analysis, subjects are classified as responders, often by dichotomizing a continuous outcome, if they improve by a specified threshold. For example, responder definitions could be a 5% change in body mass index or an improvement in symptoms by 10 points on a 100-point symptom scale. Responder analysis is commonly used with patient-reported outcomes (PROs) because results are easily interpretable to patients and other stakeholders and can lend language to drug labels and promotional claims.
When the outcome is measured for all subjects at baseline and the timepoint of interest, responder status can be calculated, and the analysis is routine. However missing data are ubiquitous in longitudinal trials and responder status cannot be determined for subjects missing the outcome. [1] One approach for handling missing data in responder analysis, recommended in the regulatory setting [2][3][4] is to impute subjects missing the outcome as nonresponders, termed non-response imputation (NRI). However, it is a strong assumption to assume unobserved outcomes are uniformly "failures" rather than come from the distribution of subjects who do not improve. NRI can be thought of as a composite outcome of response and a dropout indicator. Methodologists warn that composite endpoints can be misleading, for example, when the components have varying degrees of severity and treatment effects of each component differ between groups. [5,6] This could be true if dropout depended at least partly on a tolerability. For example, a cancer treatment may offer a favorable toxicity profile relative to a comparator. Using NRI, the response rate of the comparator arm more than in the treatment arm would reflect the effect of tolerability, i.e., have more non-responders, and could widening the between arm difference. While some may view NRI as a conservative approach (since the proportions of responders can only decrease), treating missing as response failure can result in unpredictable differences in proportions between treatment groups. [7,8] In longitudinal trials, missing observations can be intermittent, as in a missed study visit, but dropout accounts for most missing data. We focus this article on monotone missing patterns, implying that observations are observed up until one is missing and all subsequent observations are missing. Little and Rubin [9] provide a framework to describe categories of missing data mechanisms given the relationship with observed and unobserved values. When the probability of missingness is independent of the observed and unobserved data the mechanism is said to be missing completely at random (MCAR). Data are missing at random (MAR) if the probability of missingness is independent of the unobserved data after conditioning on observed data.
Finally, data are considered missing not at random (MNAR) if they are neither MCAR or MAR and the missing mechanism depends on the unobserved values, given the observed data.
The MAR assumption is usually reasonable in the context of longitudinal trials and current guidance outlines a framework that includes sensitivity analyses to assess the extent to which analytic approaches are robust to missing data assumptions. [10][11][12] Appropriate analyses that assume MAR include mixed models using maximum likelihood estimation, extensions of generalized estimating equations (GEEs) such as weighted GEE, and multiple imputation (MI). [13,14] Of these, MI is the only approach that can be used with any analytic model. MI is a three-stage process. First, missing values are filled M times by a random draw from a posterior distribution of the imputation model to generate M complete datasets. Secondly, the M datasets are analyzed via any statistical approach and thirdly, results are combined using a set of rules that accounts for the uncertainty of the imputed values. [15] The imputation model must be congenial, i.e., include the same variables, but does not have to be consistent with the substantive model. Thus, the imputation model can include variables predictive of missingness such as the outcome from intermittent timepoints, making MI a natural choice in responder analysis using a test of proportions. For these reasons we focus this paper on MI.
When a continuous outcome is ultimately dichotomized, the specifications of the imputation model come into question. Practitioners can either impute the missing outcome before dichotomizing the response (IBD) or dichotomize the outcome then impute the response (DTI). Demirtas evaluated efficiency and accuracy of the estimated proportions of responders using IBD under the multivariate normal assumption compared to DTI using a saturated binomial model for the dichotomous response indicator, and concluded that DTI was superior across most scenarios. [16] This finding is in contrast to Yoo's work that concluded MI with GEEs performs better when the underlying continuous outcome is imputed prior to dichotomizing. [17] More generally, Von Hippel's work supports the use of just-another-variable (JAV), analogous to DTI, to impute a quadratic and interaction term under a linear regression analysis model with a conceptual argument extending to the logistic setting. [18] Others demonstrated poor performance using JAV when data were MAR particularly with logistic regression [19], prompting some researchers to discourage this practice. [14] In trial settings where the dichotomized response of a continuous outcome is of interest, there is no clear best way to handle missing data. The aim of this paper is to clarify inconsistent results in the performance of multiply imputing the IBD or DTI in responder analysis and compare with the commonly recommended non-response imputation.

Notation and analysis
Let the underlying continuous measure which is to be dichotomized into the response indicator be Y ij for subject i where i = 1, …, n measured at the j timepoint. Measurements are repeated over time such that j = 1, …, t i are the observed measurements for each subject and t i represents the time of dropout or end of the study, T. Without loss of generality, assume that higher values of Y indicated better outcomes. Let Y ij > 1 − Y i1 = C i represent change from baseline to time j > 1. Subject i is classified as a responder if C i ≥ λ for some threshold λ, defined as R i = I(C i ≥ λ). Consider a randomized controlled trial with treatment and control arm.
The objective of responder analysis is to evaluate the difference in proportion of responders at the endpoint between treatment arms.

Multiple imputation approach
When data have either an intermittent or monotone missing pattern, multiple imputation using the Markov chain Monte Carlo (MCMC) method and fully conditional specification (FCS, also known as imputation by chained equations) method are two popular options. [20] Both are relatively flexible to specify, straightforward to understand, and easy to apply with standard statistical software. The FCS assumes the existence of, but does not rely on, a multivariate distribution. [20] Specifically, the FCS approach assumes conditional densities for each partially observed variable and uses a corresponding regression model to sequentially generate imputations, e.g., linear regression for continuous variables and logistic regression for categorical variables. We used FCS MI for imputing both the unobserved continuous outcomes for IBD MI and the missing responder status for DTI MI, both using the continuous outcomes intermittent timepoints as auxiliary variables, and in some cases, additional covariates related to the outcome, detailed below. Thus, the comparison is not in the MI method but rather the specification of the imputation model.
In general, the FCS procedure can be described in the following steps. [21,22] Consider a set of variables X = X 1 , …, X q in the imputation model. First, starting values for unobserved measures are filled in sequentially for each variable in the order specified. Continuous variables are filled in by regressing one variable, say, X 1 , on the other X 2 , …, X q covariates and using the resulting set of parameters to fill in the missing values of X 1 . Binary variables are filled in similarly using logistic regression. The next imputation phase replaces the filled in values with imputed values. For a set of observed values of one variable, X 1 , the corresponding imputation model is fit using both the observed and filled-in values of all other q − 1 variables as the independent variables and X 1 as the dependent variable. In this study, the binary variable, R, is fit using logistic regression and the continuous variables, Y j , are fit with linear regression. The resulting set of parameters are used to impute the first set missing values. The latter two steps are repeated on the remaining q − 1 variables to comprise a cycle. The algorithm runs through a number of cycles updating the imputed values until convergence, at which point the current values of all X 's complete the first imputed dataset. The process is repeated for M datasets.
To calculate the estimand θ using IBD MI, we imputed the missing continuous outcomes Y j , calculated the responder status, R i , estimated the difference and combined estimates using Rubin's rules in the final step. For DTI MI, we calculated responder status prior to imputing and included the partially observed responder status, R i , in the imputation model. Using the imputed R i , we calculated the difference in proportions between treatment arms on the M datasets and combined using Rubin's rules.

Data generation
We simulated twenty-four scenarios to represent a randomized trial with two treatment arms with N = 200, and a continuous outcome measured at baseline and three subsequent timepoints. The scenarios described two response profiles with the same mean difference at the final assessment, six mechanisms of dropout, and two dropout rates. One response profile was linear where only treatment A was effective. In the other response profile, treatment A is effective after a period of worsening and treatment B demonstrates no effectiveness after a period of improving, hence the mean trajectories of treatment A and B cross. The third and fourth response profiles had no treatment differences at the final timepoint and were used to evaluate type 1 error.
Data for the continuous response were simulated to represent a PRO scale with equal allocation to treatment groups. Let Y ij represent a continuous measure for the i th individual at the j th timepoint where j = 1, …, 4. Specifically, data were simulated according to the underlying model: where x trt = 1 for treatment arm A and 0 for treatment arm B, β j denotes the effect of the j th timepoint and δ j * x trt is the interaction of treatment group and the timepoint.
Let Y i4 − Y i1 = C i represent change from baseline to timepoint j = 4. To achieve 80% power to detect the difference of response rates between the two arms, the dichotomized response was defined as R i4 = I(C i ≥ 12.4). Using this definition, response rates for the first and second response profiles for treatment A and B were 25.6 and 10.6, respectively. (Exploratory result using thresholds ranging from 10 to 20 produced similar trends.)

Missing data
We used six probability models representing plausible trial scenarios to delete post-baseline observations using a MAR mechanism. Let Z ij = 0 if outcome Y ij is missing and 1 otherwise.

Dropout model 1
For the first model of dropout, the probability of missing response is dependent on the value of the observed outcome at Y j − 1 such that PðZ ij ¼ 0Þ∝ð1−ΦðY j−1 ;θ Y j−1 ;σ 2 Y j−1 ÞÞ , where j > 1 and Φ is the normal cumulative distribution function with meanθ Y j−1 and standard deviationσ 2 Y j−1 estimated from the data. This model represents the probability of dropout due to lack of efficacy.

Dropout model 2
The mechanism leading to dropout can differ by treatment. [25] To model this, observations in treatment A were more likely to be missing when the outcome, Y j − 1 , value was low such that Y j−1 ÞÞ, j > 1, and observations in treatment B were more likely to be missing when Y j − 1 values were high such that PðZ ij ¼ 0Þ∝ðΦðY j−1 ;

Dropout model 3
Model 3 represents missing mechanisms in the opposite direction of model 2 for the treatment arms. For example, lack of efficacy could drive dropout in a placebo arm while those on treatment may be less motivate to return to follow up when they are feeling better, i.e.
improved efficacy. Here, treatment B observations were more likely to be missing when the outcome, Y j − 1 , value was low such that PðZ ij ¼ 0Þ∝ð1−ΦðY j−1 ; θ Y j−1 ;σ 2 Y j−1 ÞÞ , j > 1, and treatment A observations more likely to be missing when Y j − 1 values were high such that

Dropout model 4
Treatment arm dropout rate can be differential. [26,27] We modeled substantial differential dropout by including a weight term, w x trt , specific to treatment arm, such that

Dropout model 5
Here, Y i was set to missing with probability PðZ ij ¼ 0Þ where j > 1 and b 1 = 0.01 modeling drop out due to lack of efficacy using a different mechanism than model 1.

Dropout model 6
We simulated a repeated indicator variable representing occurrence of adverse events (AEs) to represent drug tolerability. The probability of AE depended jointly on treatment arm and occurrence of an AE at the prior visit such that for each assessment for each treatment group where x trt represents the treatment arm and γ represents AE status at j − 1. Probabilities were estimated from actual trial data and were similar to prior published event rates (Table 1). [24] For simplicity we assumed that no AEs occurred at baseline and the probability of AE at j = 2 was 0.3 for x trt = 1 and 0.5 for x trt = 0. For each subject we generated AE status at each post-baseline visit as AE ij $ Bernoulliðp AE j Þ. The response Y i was set to missing with probability where j > 1 and b 1 = 0.01 and b 2 = − 0.40 to model the probability of dropout due to lack of efficacy and tolerability. If Y i was set to missing, all subsequent AE were also set to missing.
For all dropout models, we multiplied P(Z ij = 0) by a randomly generated uniform variable and determined a cutoff value creating the overall proportion of missing Table 1 Conditional probabilities of AEs for j > 2 Timepoint responses at j = 4 to be 30% or 50%. If a patient was missing at any Y j = a then all Y j > a were set to missing.

Analysis and comparison of methods
We determined the required number of simulated datasets per scenario, n sim , by estimating the standard deviation (SD) ofθ to be ≤6.0, based on exploratory simulations and setting the maximum tolerated Monte Carlo standard error (MCSE) of bias to be ≤.15. Given MCSEðBiasÞ ¼

ffiffiffiffiffiffiffiffiffiffiffi ffi
VarðθÞ n sim r ' the required number of datasets was n sim = 1600. [28] For each simulated dataset, we evaluated the proportions of responders in, and the difference between, each arm at j = 4. For IBD MI and DTI MI, all imputation models contained the group indicator, x trt , and the continuous outcomes Y j . In some imputation models, we included CV, a variable representing a correlated covariate to evaluate the utility of including an auxiliary variable. For DTI MI, the imputation model included the binary response variable, R. Scenarios using dropout model 6 also included AE status at j = 2, 3, 4 in the imputation model. The M = 30 or M = 50 estimates [22] of the difference in proportions and respective standard errors when 30% or 50% of responses at j = 4 were missing, respectively, were combined using Rubin's Rules. [29] Sample SAS code is included in the Appendix. We compared percent bias, coverage probability of the 95% confidence interval (CI) from multiple imputation, power, and type 1 error rate to assess the relative performance of NRI, IBD MI and DTI MI to the fully observed simulated data. We calculated percent bias of the difference as: where π represents the true proportion of responders, and p is the average proportion of responders among datasets with missing observations. Positive values represent positive biases of the estimated difference in proportions. We calculated coverage probability as the proportion of MI results where the true value was contained within the 95% CI. Power was calculated as the percentage of statistically significant group differences at the significance level of 0.05. Similarly, the type 1 error rate was calculated as the percentage of statistically significant group differences at the significance level of 0.05 when simulating a scenario with no between group difference. We assess performance of the simulation with the MCSE of bias, mean square error (MSE), standard error of the model (SE mod ) and the empirical standard error of the difference in proportions (SE emp ) . Letθ ¼p A −p B be the difference in proportions between groups. MSE, calculated as is a combined measure of variance and bias. SE mod is the average standard error of each b θ i , and SE emp , is the standard error ofθ , measuring the efficiency ofθ . Simulation and analyses were conducted using SAS software version 9.4 (SAS Institute Inc., 2013).

Results
When the response profile was linear with 30% of responses missing, bias was less than 7.3% for all MI approaches and ranged from 8.5 to − 36.7% for NRI (Table 2). Similar results were seen in the non-linear response profile (Appendix A). IBD MI had slightly lower or equal bias relative to DTI MI for all scenarios, and bias was conservative in direction, i.e., negative for 4 out of the 5 dropout models. All MI models included the continuous repeated outcomes as auxiliary variables in the imputation model. When using DTI MI, the addition of the correlated auxiliary variable reduced bias and changed the direction from positive to negative in all scenarios except when there were differential dropout rates. Including the auxiliary variable in the IBD MI model increased the negative bias in all but the scenario with differential dropout. The probability of dropout in model 6 was related to both treatment arms, through AE status, and outcome score. Including AE status at j = 2, 3, 4 in the imputation model negligibly reduced bias with DTI MI, and maintained a similar level of bias with IBD MI, compared to no auxiliary variables.
NRI suffered from high negative bias and substantial loss of power to detect differences in all but one scenario. The proportion of responders per treatment arm were always underestimated because missing observations were classified as non-responders. When the dropout mechanism affected the two arms differentially (model 4), NRI produced a positively biased difference estimate.
When 50% of responses were missing with the linear response profile, IBD MI had less bias relative to DTI MI without the use of CV for all scenarios, and bias was negative in direction for 5 of the 6 dropout models (Table 3). Specifically, bias with DTI MI (with no auxiliary variables) ranged from − 21.8 to 11.0. Under the same conditions, the bias of IBD MI ranged from − 6.9 to 0.7. In general, power to detect treatment differences was lower using IBD MI compared to DTI MI.
Coverage probabilities of 95% confidence for all MI approaches ranged from 93.2 to 95.3% when 30% of the responses were missing (Table 2). When 50% of responses were missing, the coverage probabilities when imputing the continuous response were closer to the nominal level of 95% compared to imputing the dichotomized response, ranging from 90.1 to 94.4% and 77.5 to 92.6%, respectively ( Table 3). NRI coverage was lower than the MI approaches in all scenarios except for when there was differential dropout. Although IBD MI generally had lower power to detect treatment differences compared to DTI MI, the difference was negligible. NRI was more precise as measured through the SE emp of the difference in proportions between groups, compared to all MI approaches (Table 4).
However, as a function of the high levels of bias, NRI performed poorly in terms of MSE compared to the MI approaches. The MCSE of bias was between 0.12-0.14, less than our tolerated level of uncertainty, when 30% of responses were missing. NRI had higher precision estimating the group difference, compared to the other approaches as seen with the lower SE emp . The SE mod was similar to the SE emp suggesting bias of SE emp is not a concern.
Type 1 error rate was controlled at less than 5% for both multiple imputation strategies, reported in Table 5. When dropout rates differed between groups (model 4), NRI had type 1 error rates ranging from 0.16 to 0.31, suggesting false positives are of concern. The non-linear response profile demonstrated very similar results overall, as shown in the Appendix.

Application to a clinical trial
We applied the above imputation approaches to data adapted from a Phase III randomized, double-blind clinical trial in patients with major depressive disorder. The trial evaluated efficacy of duloxetine 40 mg/d and 80 mg/d versus placebo and a comparator, paroxetine 20 mg/d, to treat emotional and physical symptoms in depressed patients. [30] Details of the original trial design are reported in Goldstein et al. [30] For the purpose of this study, we considered a publicly available dataset modified from the original trial data. [31] The trial included four parallel arms; the modified dataset has two arms: the original placebo Results are from a linear response profile with 50% data missing at random, N = 200. In fully observed data, % responders in Treatment A and B was 25.6 and 10.6, respectively for a difference of 15.0 and power = 0.80 arm and a "treatment" arm consisting of a random sample of patients from the three active drug arms. At 6 weeks post randomization, 75% of the patients remained in the study. To further illustrate the effect of imputation choice, we used a MAR mechanism (Dropout model 1) to identify observations to omit so that 60% of patients have outcome values at week 6. The outcome was the total score on the 17-item Hamilton depression rating scale (HAMD-17), measured at baseline and weeks 1, 2, 4, and 6 after randomization. Lower scores indicate less severity; negative change scores indicate improvement. We conducted a responder analysis using a meaningful change threshold of 6 points to assess the proportions of patients who improved at 6 weeks post-baseline, as this threshold coincides with common categories of depression severity, e.g., the difference between mild and moderate depression is 6 points.

Discussion
When continuous data are collected in longitudinal trials with the ultimate interest in differences of a binary response, imputing missing as non-response produces positively and negatively biased estimates. Multiply imputing before dichotomization is often slightly less biased than dichotomizing then imputing but both methods perform well when 30% of the responses are missing. When there are higher rates of missing outcomes, dichotomizing before imputing produced estimates with over 10% bias in three scenarios. When applied to real trial data where the true difference in proportions is unknown, the method of imputing prior to dichotomizing produced similar estimates when both 25 and 40% of observations at the endpoint were missing. Literature addressing IBD and DTI has been contradictory. One reason could be the choice in MI method. For example, Demirtas used a saturated multinomial model to impute the binary outcome. [16] While statistically sound, this MI approach is not readily available in standard statistical software. Another study using the Markov chain Monte Carlo (MCMC) method comparing IBD MI and DTI MI prior to assessing binary outcomes longitudinally via GEEs found an advantage to imputing before dichotomizing, consistent with the work of Yoo. [17] One distinguishing feature of our study was the use of the continuous Y j 's as auxiliary variables in the imputation model making the MAR assumption more likely if they are predictive of missingness, the outcome, or both. [14,25] The use of auxiliary variables in addition to the outcomes from interim timepoints in the imputation models provided limited usefulness. It is likely that the correlation between CV and the outcome was not strong enough to systematically increase precision. Further, adverse events were not related to the outcome after conditioning on the treatment group. The use of auxiliary variables are generally useful to reduce the standard error when highly  *Response is defined as improvement ≥6 on the HAMD-17 total score from baseline to week 6 correlated with the outcome or reduce bias when correlated with the outcome and missingness. [22] It is unclear why NRI is a recommended strategy in light of the highly biased estimates produced in this simulation and others. [7,8,32,33] Practitioners may erroneously believe that NRI always produces conservative results. Indeed, the NRI can only underestimate proportions of responders in treatment groups. However, when the difference in proportions is of interest, which is usually the case, using NRI when there is differential dropout can yield erratic results including positively biased estimates as shown in model 4. [7,26] Further warnings include those related to composite endpoints [5,6] and single imputation methods which underestimate the uncertainty of the missing data in the form of overly precise standard errors. [13,34] This study aimed to determine the optimal approach to imputing missing observations for responder analysis when a continuous variable is dichotomized. However, it is impossible to simulate all scenarios that could occur in real settings. We simulated outcomes under a normal distribution which may not always happen. For example, the baseline measure will not be normally distributed if the measure is also an inclusion criterion and subjects must meet a cutoff value. Many outcomes, such as PROs, are measured ordinally and imputing a continuous version via a linear regression could produce values not possible on the original scale. Data here were simulated to be MAR yet in real settings missing may be MNAR or a mixture of mechanisms.

Conclusion
We compared imputation methods for missing outcomes in a responder analysis. MI approaches using the longitudinally measured continuous outcome as auxiliary variables performed better than imputing missing observations as failures. Differences in proportions of responders between arms, bias, coverage probabilities of the 95% confidence interval, and other performance measures were similar for both MI approaches with moderate rates of missingness. With high rates of missingness, imputing the continuous outcome prior to dichotomizing was less biased and provided better coverage probability than imputing the already transformed response. Trialists conducting responder analysis by dichotomizing a continuous outcome can benefit from these findings.