A nonparametric multiple imputation approach for missing categorical data

Background Incomplete categorical variables with more than two categories are common in public health data. However, most of the existing missing-data methods do not use the information from nonresponse (missingness) probabilities. Methods We propose a nearest-neighbour multiple imputation approach to impute a missing at random categorical outcome and to estimate the proportion of each category. The donor set for imputation is formed by measuring distances between each missing value with other non-missing values. The distance function is calculated based on a predictive score, which is derived from two working models: one fits a multinomial logistic regression for predicting the missing categorical outcome (the outcome model) and the other fits a logistic regression for predicting missingness probabilities (the missingness model). A weighting scheme is used to accommodate contributions from two working models when generating the predictive score. A missing value is imputed by randomly selecting one of the non-missing values with the smallest distances. We conduct a simulation to evaluate the performance of the proposed method and compare it with several alternative methods. A real-data application is also presented. Results The simulation study suggests that the proposed method performs well when missingness probabilities are not extreme under some misspecifications of the working models. However, the calibration estimator, which is also based on two working models, can be highly unstable when missingness probabilities for some observations are extremely high. In this scenario, the proposed method produces more stable and better estimates. In addition, proper weights need to be chosen to balance the contributions from the two working models and achieve optimal results for the proposed method. Conclusions We conclude that the proposed multiple imputation method is a reasonable approach to dealing with missing categorical outcome data with more than two levels for assessing the distribution of the outcome. In terms of the choices for the working models, we suggest a multinomial logistic regression for predicting the missing outcome and a binary logistic regression for predicting the missingness probability. Electronic supplementary material The online version of this article (doi:10.1186/s12874-017-0360-2) contains supplementary material, which is available to authorized users.


Background
In population studies of public health, the health status of participants is a research outcome of interest and is commonly demonstrated using ordinal categories such as "Excellent", "Good", "Fair", and "Poor". However, these variables are typically subject to missing data. In practical analyses, researchers often create a specific new problems. It consists of iterative expectation and maximization steps for estimating the parameter [4]. However, this approach only relies on the information from a working model predicting the missing values/outcome, and ignores the information embedded in missingness probabilities, that is, the probabilities of being missing (or nonresponse probabilities). Therefore the corresponding estimates might not be robust to certain misspecifications of the outcome model.
A more robust approach is to improve the estimation using additional information from the missingness probabilities, such as the use of calibration estimator (CE) [2], which is an extension of the inverse probability weighting method [7]. The estimator is a result of expressing the target parameter as a sum of two components, the modelbased predictions and inverse probability-weighted prediction errors. These two components have a trade-off effect so that using their sum can achieve a doubly robust property. That is, the estimator remain consistent as long as one of the prediction models is correctly specified. In addition, CE uses covariates to fit working models for predicting missing values and missingness probabilities. Additional details about CE can be found in the "Methods" section.
Multiple imputation (MI) [13] is an attractive approach to missing data problems. It accounts for uncertainty in estimation due to missingness by imputing each missing observation multiple times [13,15]. To the best of our knowledge, only a handful of studies have investigated the possible MI approaches to missing categorical data, such as MI using a loglinear model, MI using a latent class model, and MI using chained equations, [16,20], and concluded that MI using a latent variable model had the best performance among the tested methods. However, we note that all these methods are only built on the working model predicting the missing values. An apparent disadvantage is that they might be sensitive to certain misspecifications of the working model.
To weaken the aforementioned, pure reliance on the working model for predicting the missing outcome, we propose a nearest neighbor-based MI (NNMI) approach to missing categorical data. The approach uses two working models, one for predicting missing outcome values (the outcome model) and the other for predicting the missingness probabilities using covariate information (the missingness model). Each working model is used to generate a predictive score. Their weighted sum is used to measure the "distance" between a missing case and observed cases. For each missing observation, its imputing set consists of observed cases which are "near" in terms of the distance function. The missing value can thus be imputed (replaced) by one of the donors in its imputing set. Because information from both the outcome and missingness models are used to impute the missing data, we surmise that the proposed method might possess some doubly-robust property.
Similar ideas have been proposed in [11] and [8]. In those contexts, the NNMI approach is applied to impute missing at random continuous variables and produces reasonable results under a variety of model misspecifications. However, a notable difference in our paper is that the NNMI approach is applied to MAR categorical variables. More specifically, if the number of categories is M, M − 1 predictive scores can then be derived from the working model for the outcome, and a single predictive score can be derived from the working model for the missingness probabilities. It is of interest to investigate the optimal weighting schemes for the M predictive scores. We also note that there exist alternative ways of specifying models for categorical outcomes, which further complicates the problem. In addition, despite having a theoretical doubly robust property, CE might produce unstable results when missingness probabilities take some extreme values. It is of additional interest to assess the performance of NNMI in this scenario.
This article is organized as follows. In the "Methods" section, we specify notations, briefly introduce the CE method, and present the NNMI approach. In the "Simulation study" section, we investigate the properties of NNMI for finite samples. In the "Data example" section, we analyze a dataset from 2013 Behavioral Risk Factor Surveillance System (BRFSS) survey. Finally, we conclude our study with a discussion and suggest directions for future work.

Notation
Let Y denote the categorical outcome variable of interest with missing values, and suppose that Y has M categories of which the proportions need to be estimated. Let δ denote the missingness indicator, δ = 0 if Y is missing and δ = 1 if Y is observed. Let X = (X 1 , . . . , X p ) denote a set of fully observed covariates that are predictive of Y and δ. Suppose that there are n independent subjects in the study.

Calibration estimator
The calibration estimator is the earliest doubly robust method [2], which is based on two working models: one for the variable with missing values Y, and the other for the missingness indicator δ. For example, one can calculate the estimates of mean for a continuous Y by a sum of prediction and inverse probability-weighted prediction errors, where π(X) = E(δ|X).
For a categorical Y taking values m = 1, . . . , M, the estimator of probability that Y = m can be written aŝ whereP i,Y =m is the predicted probability of Y = m based on, for example, a multinomial logistic regression model for E(Y |X) using complete cases, w i = 1/π(X i ) is the inverse of the estimated probabilities of having case i being observed (computed using, for example, a logistic regression model from E(δ|X)), and I represents the indicator function of Y = m. The first term essentially predicts/imputes Y values using a model for E(Y |X) based on complete cases. The second term is a sum of prediction errors from the model for E(Y |X), adjusted by the inverse-probability weights usingπ(X) based on all cases. The doubly robust property dictates that the estimates for the probability of each category of Y would be consistent if at least one of the two models (E(Y |X) and E(δ|X)) is correctly specified.

Nearest-neighbour multiple imputation
We first present the outcome model for Y. Given its categorical-feature, we consider two alternative modeling specifications as follows.
Second, we present the model for missingness probabilities. It is common to use a logistic regression model for the missingness indicator δ. That is, where X δ ⊆ X is a set of covariates predicting δ and β = (β 1 , . . . , β p ) T is a vector of regression coefficients. The corresponding predictive score is The aforementioned specifications do not encompass all possible modeling choices for Y and δ. For instance, the logit link functions can be replaced by probit or other link functions. In general, appropriate specifications should follow careful exploratory analysis and model diagnostics of the real data. Here we only use these specifications for illustrative purposes.
The general strategy of NNMI works as follows. The aforementioned predictive scores ({Z MLR,m } or {Z CLR,R m } and Z M ) can be calculated for each observation, missing Y or not. They are then standardized so that the effect from each covariate can be summarized on approximately the same scale. A weighted sum of the standardized scores is generated to balance the contribution from the two work-ing models. This (weighted) predictive score is then used to quantify the "distance" between subjects. The scores from cases with observed Y are used to form imputing sets for subjects with missing Y. For each incomplete subject, imputations are randomly drawn from subjects (in the imputing set) that have smaller "distances" to the subject with a missing Y. To ensure the "properness" of multiple imputation [13], the estimation of scores and formation of imputing sets are conducted on bootstrap samples of the original data set to incorporate parameter uncertainty.
More specifically, the procedure consists of the following steps: Step 1: Bootstrap -Bootstrap the original data set (including the missing observations) {Y, X} to obtain the bootstrap sample {Y * , X * }.
Step 2: Calculating the predictive scores -From the bootstrap sample we estimate the regression coefficients for the outcome and missingness models using the maximum likelihood method. For illustration suppose we use specification I for the outcome model and denote the corresponding regression coefficients as α * MLR,m , m = 1, . . . M − 1. Let β * be the regression coefficients for the propensity model. We use these regression coefficients and the original covariate X to form the scores. The predictive score for the propensity model is Each of the M predictive scores {Z * MLR,1 ,. . ., Z * MLR,M−1 , Z * M } is standardized by subtracting its mean and dividing by its standard deviation. The resulting standardized scores are denoted by S ≡ (S 1 , . . . , S M ).
Step 3: Forming the imputing set -We calculate a distance function to define the similarity between subject i with missing Y in the original data set and subject j with observed Y in the bootstrap sample based on the M predictive scores, S 1 , . . . , S M . Specifically, the distance between subjects i and j is defined as where ω 1 , . . . , ω M are non-negative weights for the predictive scores, satisfying M i=1 ω m =1. The way to calculate the similarity between subject i with missing Y in the original data and subject j with observed Y in the bootstrap sample was initiated from predictive mean matching and was described by Heitjan and Little [6], which was an extension of a method by Rubin [14]. Morris and his colleagues also studied on this nonparametric MI method compared with other methods [12]. The imputing set for subject i with a missing Y in the original data is the NN nearest neighborhood (i.e., the number of donors, a positive integer specified prior to imputation), R(i, NN, ω 1 , . . . ; ω M ), consisting of NN subjects with observed Y in the bootstrap sample and having the smallest NN distances d(i, j).
Step 4: Imputation -From the imputing set R(i, NN, ω 1 , . . . ; ω M ), an observation is randomly drawn (with equal probability) to replace the missing Y in subject i. This imputation is conducted for all i's. Once all missing observations of Y are imputed, one fully imputed data set is obtained. Return to Step 1.
Step 5: Analyzing multiply imputed data sets -Steps 1 to 4 are independently repeated K time to obtain K imputed data sets for estimation. For each imputed data set, an estimate of the probability of Y = m is calculated byPŶ =m = n m /n, m = (1, . . . , M), where n m is the number ofŶ = m observed in the imputed data set and n is the sample size. DenotePŶ =m (k) as the estimate for the kth imputed data set. Using Rubin's combining rules [10], the final NNMI estimator is the average across K imputed data sets asPŶ =m = (1/K) K k=1PŶ =m (k). Its variance can be estimated using the sum of a between-imputation and within-imputation component as where sŶ =m (k) is the standard error for the probability ofŶ = m in the k-th data set based on a Bernoulli distribution for the event I(Ŷ = m).
We use R to perform all the simulations and data analysis. Multinomial logistic regression models can be done using the multinom function from the nnet package of R. [17].

Simulation study
Past literature [8,11] has suggested that the use of two working models for predicting missing values and missingness probabilities in NNMI can induce a double robustness property when Y is continuous. We surmise that would hold as well when Y is categorical. This section presents a simulation study for assessing the performance of the NNMI under some model misspecifications for imputing a three-category nomial missing outcome and evaluating the estimation of proportions of all three categories.
For simplicity, we consider an incomplete trichotomous outcome Y (i.e. M=3). The estimand of evaluation is the probability P(Y = m), m = 1, 2, 3. The methods compared include: fully observed (FO) analysis, which is treated as the gold standard because the analysis is applied before some of the Y s are removed; complete-case (CC) analysis, which excludes cases with missing Y ; the calibration estimator (CE); a parametric MI (PMI), which imputes the missing values by taking the predictive values from a multinomial logistic regression model for the missing values; and the proposed NNMI approach. For the latter approach, the method using multinomial logistic regressions for the outcome model is denoted as NNMI MLR (NN, ω 1 , . . . ; ω M ), and that using cumulative logistic regressions is denoted as The previous work on NNMI [8,11] has demonstrated that bias increases while SD and SE decreases when NN increased. It is suggested that NN= 3 or 5 in general result in slightly lower MSEs. In the simulation, NN = 5 is chosen for NNMI. Based on our previous experience, ten-time MI is usually sufficient to control for the uncertainty. In this article, we compared one table using K=10 and K=50 (Table 1 and Additional file 1: Table S12). The results shows no clear difference in bias and slightly lower SD and SE using K=50.
Sample size n=400 and n=200 are considered for simulation. For each tested scenario, the simulation is conducted 500 times. The criteria include the average estimate (EST), the empirical standard deviation (SD), the average standard error (SE), and the coverage rate (CR) of 95% confidence intervals (CI), all of which are calculated from the 500 simulations. To assess the performance of the simulations, Monte Carlo Errors (MCEs) are calculated for each measure using the formulas from White's paper [19]. In FO and CC, the SEs are calculated assuming a Bernoulli distribution forŶ = m. In CE, the SE's are calculated using bootstrap. In PMI and NNMI, Rubin's combining rules are used to calculate SE's.
Five observed covariates X = (X 1 , . . . , X 5 ) are independently generated from U(−1, 1). The trichotomous outcome, Y, is generated from a multinomial distribution with probabilities as Pr( where g is a link function. We consider both the logit and probit link functions for g. We assume that Y is independent of δ given X. That is, the missingness probability of Y is only dependent on X (i.e., MAR). Two models are considered for the missingness indicator: ). Again, we consider both the logit and probit link functions for g. In general, Model M1 generates missingness probabilities that are mostly bounded away from 1 or 0 with a bellshape distribution, whereas Model M2 generates more missingness probabilities that are close to 1 or 0 with a Ushape distribution. That is, the missingness probabilities generated from model M2 are more extreme than those from model M1. In both schemes, the overall missingness probability is approximately 50%.
The primary goal of the simulation is to assess the performance of the NNMI various misspecifications of the working models for Y and δ. In general, two types of model misspecification for working models are tested: 1) including a reduced set of predictors (i.e., X 1 , X 2 , and X 3 ) for working models; and 2) misspecified link functions for working models. More specifically, we consider the following five scenarios: For example, we firstly simulate Y using X 1 −X 5 through logit link and simulate δ using X 1 − X 5 through logit link. The misspecification Scenario 1 fit a working model for Y with same logit link function but using X 1 , X 2 and X 3 only, and fit a working model for δ using same five covariates, X 1 − X 5 , and same logit link function.
We also assess the effects of the extremeness of missingness probabilities on the performance of the methods. This is motivated by the fact that CE, despite being theoretically consistent if one of the working models is correctly specified, might have unstable results when the propensity model generates extreme missingness probabilities.
Another aim of the simulation is to investigate the optimal strategy for specifying weights ω m 's. Previous literature [8,11] has suggested that a small, non-zero Table 1 Simulation results from probability estimation for Y, where Y is generated using a logit link function with five covariates, δ is generated using a logit link function with not extreme missingness probabilities (M1) based on five covariates, N = 400  [8,11]. In addition, positive weight(s) for the scores from the working model(s) for predicting missing values (ω 1 and ω 2 ) are necessary when only the working model for the missingness probability is misspecified. Therefore, the following simulation results present the results from fixing ω 3 = 0.2 and different combinations for ω 1  For brevity, we only present and illustrate the results from n = 400. With smaller sample size of 200 observations (See Additional file 1), SEs from all methods slightly increase as expected, and the major comparative pattern does not vary dramatically. Most of the MCEs are less than 5%. Table 1 summarizes the results when logit link functions are used to generate Y and δ. In addition, the missingness probabilities are generated by Model M1 and do not have many extreme values. As expected, CC estimates have substantial biases and the coverage rates are low. When the working models for Y and δ are both correctly specified, the bias is negligible for CE. PMI yields a good performance as well. NNMI MLR produces slightly larger biases yet smaller SD compared with CE. NNMI CLR produce slightly worse yet comparable results. When only the working model for Y is misspecified with 3 covariates, both NNMI methods are superior to other methods in term of bias and coverage rate, apparently for estimating Pr(Y = 1). In this case, CE estimates have much larger biases and variations, as well as low CRs. PMI breaks down dramatically due to its sole reliance on the working model predicting Y. The methods are more or less similar for estimating Pr(Y = 2). When only the working model for δ is misspecified with 3 covariates, CE performs well. Both NNMI produce mostly comparable results with CE. Table 2 presents the results when logit link functions are used to generate Y and δ. Here more extreme missingness probabilities are generated by Model M2. Compared with Table 1, the performances of estimators degrade in general due to the fact that these extreme missingness probabilities are more difficult to estimate and thus render more instability to the estimates. When both working models are correctly specified, CE produces little bias yet extremely large SD and SE. Both NNMI methods have small biases which lead to lower-than-nominal CRs. However, their variations (SD) and the estimates (SE) are much smaller and more reasonable compared with CE. Between the two approaches, NNMI CLR is somewhat inferior to NNMI MLR . When only the working model for Y is misspecified with 3 covariates, CE estimates become largely biased for estimating Pr(Y = 1). In this case, however, both NNMI methods produce smaller biases and much lower SDs and SEs. When only the working model for δ is misspecified with 3 covariates, interestingly CE works better than NNMI methods. We surmise this is due to the fact that, although misspecified, the working propensity model for CE avoids most of the extreme missingness probabilities. In other words, a correctly specified working propensity model might do more harm (i.e., brings more variation to the estimates) to CE compared with NNMI if the missingness probabilities are more extreme.
Tables 3, 4 and 5 include results when the working models use misspecified link functions, when the missingness probabilities are not extreme. In Table 3, a probit link function is used to generate Y and a logit link is used to generate δ. When the link function for the working outcome model is misspecified as a logit function, NNMI performs slightly better than CE and PMI (more apparently for estimating Pr(Y = 2) with smaller biases and SDs. In Table 4, a logit link function is used to generate Y and a probit link function is used to generate δ. When the link function for the working propensity model is misspecified as a probit function, NNMI methods produce slightly larger biases than CE and PMI, yet with smaller SDs and SEs, compared with CE and PMI. In Table 5, probit link functions are used to generate both Y and δ. When both working models employ logit link functions, CE produces good CRs yet relatively large SDs and SEs. PMI degrades with larger biases and low CRs, more apparently for estimating Pr(Y = 2). NNMI produces more robust results with smaller biases, lower SDs and SEs, and good CRs. For data with more extreme missingness probabilities (See Additional file 1), the comparative pattern remain similar but all methods perform worse.
In summary, the NNMI strategy can well accommodate misspecified working models when missingness probabilities are not extreme. PMI can break down if its working outcome model is misspecified. When missingness probabilities for some observations are extreme, CE estimates tend to have considerably higher SDs and SEs, while those from NNMI tend to be more stable. Between the two NNMI strategies, NNMI using multinomial logistic/probit regression models performs better than NNMI using cumulative logistic/probit regressions. This might be due to the fact that the data-generating models for Y in the simulation follow the multinomial regression scheme. In addition, there exists no apparent effect of specifying different weights for ω 1 and ω 2 , as long as ω 3 > 0.

Data example
We illustrate the proposed approach using an analysis of 2013 Behavioral Risk Factor Surveillance System (BRFSS) survey sample data [3]. Established in 1984, BRFSS is a nation-wide system of health-related telephone surveys that annually collects state data about U.S. residents regarding their health-related risk and preventive behaviors, health conditions, and information about health services. In this analysis, we are interested in the satisfaction level with health care received for the Hispanic Table 2 Simulation results from probability estimation for Y, where Y is generated using a logit link function with five covariates, δ is generated using a logit link function with extreme missingness probabilities (M2) based on five covariates, N = 400 population who were unable to work and had annual household income less than 15000 dollars. From the public-use BRFSS data system, a subset of 1430 participants are selected with fully-observed data of potentially associated covariates. More specifically, this question, the outcome of interest Y, consists of 3 categories: 1, Very satisfied (n=624); 2, Somewhat satisfied (n=357); and 3, Not at all satisfied (n=86). To demonstrated the method, Table 3 Simulation results from probability estimation for Y, where Y is generated using a probit link function with five covariates, δ is generated using a logit link function with not extreme missingness probabilities (M1) based on five covariates, N = 400 we considered to treat those participants who answered "Don't know/not sure", "Not applicable", "Refused", and those not asked or missing as missing data (δ=0, n = 363). In reality those who answered "Don't know/not sure", "Not applicable", "Refused" are not necessary to be missing. In this example, the overall missingness rate is 25.4%. We conduct some exploratory analyses to select predictors for this outcome and the missingness indicator (δ) using a multinomial logistic regression model and binary Table 4 Simulation results from probability estimation for Y, where Y is generated using a logit link function with five covariates, δ is generated using a probit link function with not extreme missingness probabilities (M1) based on five covariates, N = 400  Table 5 Simulation results from probability estimation for Y, where Y is generated using a probit link function with five covariates, δ is generated using a probit link function with not extreme missingness probabilities (M1) based on five covariates, N = 400 logistic regression model, respectively. From these analyses, the satisfaction levels of health care received is shown to be significantly associated with gender, general health, education level, having health care coverage, and having delayed getting medical care. These five covariates are used to fit a multinomial logistic regression model for predicting the satisfaction levels of health care. On the other hand, the missingness indicator is significantly associated with general health, education level, having health care coverage, and having delayed getting medical care. The four variables are included in a logistic regression model for predicting the missingness probability. Table 6 shows the results by applying different missing data methods for estimating the marginal distribution of Y. For simplicity, we only list the estimates for the category of "Very satisfied" and "Somewhat satisfied". Compared with CC, the estimates of CE, PMI and NNMI all give lower probabilities for "Very satisfied" and higher probabilities for "Somewhat satisfied". It is probable that participants who received excellent health care are more likely to respond to this question, and thus CC overestimates the probability for "Very satisfied" and underestimates those for the remaining groups. The estimates from CE, PMI, and NNMI are largely similar, and thus they provide some robustness check against potential model misspecifications.
Note that this simple example is merely used to illustrate the proposed statistical methodology, and the results should not be considered for subject-matter interests. More in-depth analyses targeted for the latter purpose should follow the guidelines provided in [3].

Discussion
In this article, we investigate a nearest-neighbour multiple imputation procedure for missing categorical data. The method applies predictive working models to identify observed cases in the neighbors for each missing observation. The use of information from two working models, one for Y and one for δ, might result in a double robustness property, which induces consistent estimates even if one of the two working models is misspecified. We use two types of modeling strategies for Y with more than two categories, one multinomial logistic/probit regression model and the other is based on m − 1 cumulative logistic/probit regression models. The results show some but not significant differences between the two strategies, indicating the flexibility of NNMI in terms of the modeling choices. The simulation results also suggest that the proposed approach in general yield satisfactory performances. The setup of the weighted sum of predictive scores would facilitate some sensitivity analyses.
In simulation, we observe that the correctness of the working model for predicting missingness probabilities might be more important than that for the outcome model for categorical Y. This implies that a non-zero weight (ω 3 ) should be applied to the predictive scores from the working model for missingness probabilities. Therefore, it might be more important to seek good models for predicting missingness probabilities for categorical Y, compared with continuous Y.
The CE results can be unstable with high SEs when missingness rates are relatively high. This is because when the number of complete cases is small, the working model fitted for Y might not be accurate and so would negatively affect CE. In contrast, the NNMI estimates tend to be more robust, possibly due to less reliance on the working models for Y than CE and instead using the nearestneighbour approach. In addition, CE performs badly with high SDs when the missingness probabilities were close to 0 or 1, while NNMI suffers less from this problem.
Furthermore, whether the working models, especially the one for the outcome, are correctly specified does not substantially impact NNMI if proper weights are specified. However, CE is more sensitive to the specification of the working model for the categorical variable with missing values. Among all tested specifications, the simulation results suggest that a multinomial logistic/probit regression for predicting missing values and a non-zero weight on the missingness probability predictive score, e.g.NNMI MLR (0.4, 0.4; 0.2), are preferred to impute categorical data with three or more levels in the absence of prior knowledge on the working models.
This study does not compare results using different NN (number of donors). Further research can be conducted for selecting the optimal size of the nearest neighborhood. Another extension is to apply NNMI to impute missing continuous and categorical data simultaneously. Also, the next step can assess the estimation of regression coefficients as well as the performance on imputing missing categorical covariates.
We have demonstrated that the proposed NNMI approach can be applied on missing at random categorical outcome variables with more than two levels for estimating the marginal mean, which broadly exist in public health studies, e.g., health status, health disparity and quality of life. For example, in a study of caries-risk [1], 97 out of 577 subjects dropped out during the recruiting procedure. Among the dropouts, 82 did not attend the examination and 15 did not completely fill out the questionnaire. It was doubtful that the missing mechanism for the 82 subjects was Missing Complete at Random, since subjects with caries were less likely to attend the examination. If the missingness was MAR, then CC would be valid for estimating the association with caries risk in their Table 3 but invalid for estimating the proportions of caries risk levels in their Table 2. Instead of using complete cases, the proposed NNMI approach could be applied to impute the caries risk level for the 82 subjects. Therefore, after the imputed data is obtained, the sample size could increase to 562, and it could help estimate the distribution of the caries risk level (low, medium, or high). Considering that the missing rate is not high (14.6%), 10-time MI could be sufficient to perform the imputation-according to a rule of thumb by Rubin [13]-with a multinomial logistic regression for predicting the missing values and a logistic regression model for predicting the missingness probability with non-zero weights, e.g., (0.4,0.4;0.2). The number of donor could be chosen as 5, based on previous studies [8,11]. For studies with a higher missing rate, the number of multiple imputations might need to increase to the missing rate in order to achieve both reliable point esimates and reliable standard errors, according to a rule of thumb described by Hippel [18]. Therefore, the proposed approach is applicable for researchers with an interest in assessing the distribution of a categorical outcome with MAR values.

Conclusions
In conclusion, the proposed multiple imputation method is a reasonable approach to dealing with missing categorical outcome with more than two levels for evaluating the distribution of the outcome. The NNMI approach can work better than PMI when the working model for missing outcome is wrong. When the missing probabilities are extreme, NNMI performs more stably than CE, which results in relatively larger SE.