Methods for significance testing of categorical covariates in logistic regression models after multiple imputation: power and applicability analysis

Background Multiple imputation is a recommended method to handle missing data. For significance testing after multiple imputation, Rubin’s Rules (RR) are easily applied to pool parameter estimates. In a logistic regression model, to consider whether a categorical covariate with more than two levels significantly contributes to the model, different methods are available. For example pooling chi-square tests with multiple degrees of freedom, pooling likelihood ratio test statistics, and pooling based on the covariance matrix of the regression model. These methods are more complex than RR and are not available in all mainstream statistical software packages. In addition, they do not always obtain optimal power levels. We argue that the median of the p-values from the overall significance tests from the analyses on the imputed datasets can be used as an alternative pooling rule for categorical variables. The aim of the current study is to compare different methods to test a categorical variable for significance after multiple imputation on applicability and power. Methods In a large simulation study, we demonstrated the control of the type I error and power levels of different pooling methods for categorical variables. Results This simulation study showed that for non-significant categorical covariates the type I error is controlled and the statistical power of the median pooling rule was at least equal to current multiple parameter tests. An empirical data example showed similar results. Conclusions It can therefore be concluded that using the median of the p-values from the imputed data analyses is an attractive and easy to use alternative method for significance testing of categorical variables. Electronic supplementary material The online version of this article (doi:10.1186/s12874-017-0404-7) contains supplementary material, which is available to authorized users.


Background
Logistic regression modelling is a frequently applied method in epidemiological and medical studies. Although researchers try to avoid it, missing data occurs in all kinds of different study designs, and inevitably, also when logistic regression modelling is used. There are many different methods available to handle incomplete data [1,2]. The most recommended method is multiple imputation (MI).
MI is currently implemented in almost all statistical software packages and therefore within reach of many researchers. Hence, it will likely be applied more often. MI generates multiple imputed datasets, where after complete data analysis can be applied to each imputed dataset. Finally, parameter estimates can be combined using Rubin's Rules (RR) [3].
For logistic regression modelling in combination with MI, the pooled regression coefficients and standard errors can easily be obtained by using RR. The pooled coefficient is derived by averaging the regression coefficient estimates from each complete data analysis result across the imputed datasets. The standard error is obtained by pooling the between imputation variance and the within imputation variance which account for sampling and imputation uncertainty, respectively. The pooled standard error is used to calculate 95% confidence intervals. For dichotomous and continuous covariates in a logistic regression model after MI, RR can easily be applied in combination with a single Wald statistic to obtain a p-value for significance [4].
To consider whether a categorical variable with more than two levels as a whole significantly contributes to the model, the methods to derive a pooled p-value are less straightforward. One method that can be used is to combine multiple chi-square values that result from a multiple parameter Wald or likelihood ratio test in each imputed dataset [5]. Alternatively, the pooled multivariate sampling variances of the regression model can be used to conduct a test that resamples a multivariate Wald statistic [6]. Meng and Rubin proposed another method in which the likelihood ratio test statistics are combined to provide a pooled p-value [7]. Unfortunately, none of these pooling methods are available in mainstream statistical packages. Consequently, the application of these methods may be complex for epidemiologists or other applied researchers, especially the Meng and Rubin pooling method. Furthermore, earlier studies showed that these methods do not always obtain optimal power levels, which is important for significance testing [7]. For that reason, it may be tempting for researchers to fall back on naïve methods, i.e., single imputation procedures, which often result in incorrect parameter estimates for statistical testing [2].
Van de Wiel et al. introduced the median of p-values in a cross-validation setting for inferring differences in prediction accuracies [8]. This setting is comparable to MI, because prediction accuracy is first obtained from separate (but related) generated versions of the data and subsequently inferred from those stochastically dependent data sets. This method showed proven control of the type I error rate and also good power results in different simulated data situations. It may therefore be a potential attractive method for significance testing of categorical variables.
Until now, methods to derive a pooled p-value for significance testing of categorical variables in logistic regression models have never been compared for their control of the type I error rate and power levels in different epidemiological data situations. Therefore, the aim of this study is to compare different pooling methods for significance testing of categorical and also continuous covariates in a logistic regression model after multiple imputation. Specifically type I error control and power after MI in a large simulation study will be evaluated. Moreover the characteristics of the pooling methods are further evaluated in an empirical dataset. In the Multiple imputation section the procedure of multiple imputation is more extensively described. In the Statistical hypothesis testing of a variable after MI section the different pooling methods for statistical testing after MI are discussed and in the Simulation section a simulation study is described that compared the different methods for pooling p-values of categorical variables. The methods are applied to a clinical dataset in the Application section.

Multiple imputation
Multiple imputation is an advanced method to handle missing data, commonly performed in three phases: imputation, complete data analysis and pooling. In the imputation phase the missing values are replaced with m sets of plausible values. These values are estimated from a series of regression models to generate a posterior predictive distribution of the missing values that is used to draw the imputed values from. Each variable can be modeled according to its own distribution, i.e., continuous variables are modeled with linear regression and dichotomous variables with logistic regression. Imputations are generated within several sequential iteration rounds or chains, referred to as Multivariate Imputation by Chained Equations (MICE) [9,10].
In the complete data analysis phase each imputed dataset is analyzed separately. The analysis performed is the same method that would have been applied had the data been complete. Accordingly, the analysis phase results in m sets of complete data results. The complete data analysis results from each imputed dataset will differ, because the imputed datasets differ

Rubin's rules (RR)
After the analyses the results are combined using pooling by RR. For parameter estimates (e.g., regression coefficients), the combined estimate θ À is the average of the estimates from the imputed data analyses: The standard errors of the parameter estimates are combined by using the within-imputation variance and the between-imputation variance [11]. The within imputation variance Var( θ À ) within is the average variance from the imputed data analyses: The between imputation variance Var( θ À ) between is the sum of the squared deviation of the parameter estimate of each imputed data analysis from the pooled parameter estimate weighted by m-1: The variance of the parameter estimates is then calculated by combining the within and between variance:

Statistical hypothesis testing of a variable after MI
For logistic regression analysis, statistical testing of covariates after MI can be performed by different methods. The methods to pool the statistical tests after MI will be elaborated below with the focus on testing whether a categorical variable as a whole significantly contributes to the model.

Univariate testing
For two-sided hypothesis testing of single regression coefficients in a logistic regression model after MI the Wald statistic W can be calculated as follows: where θ À and Var θ À ð Þ are the pooled coefficient and corresponding variance, respectively and θ 0 is the value under the null hypothesis. W single follows a chi-square distribution with 1 degree of freedom. After MI, the Wald statistic can be calculated from the RR pooled statistics, which makes this an easy to apply method for continuous variables.

Multivariate testing
For categorical variables in the logistic regression model only the pooled statistics for each separate level of a categorical variable can be obtained by RR, not the overall statistic. RR requires access to the variance-covariance matrices. Accordingly, each category can be tested, but the categorical variable as a whole cannot be tested without adapting the method. The several different multivariate pooling methods are discussed below. The formulas for these methods can be found in detail in Additional file 1.

Multiple parameter Wald test (CHI pooling)
One possibility is to pool the chi-square values from the multiple parameter Wald or likelihood ratio tests with multiple degrees of freedom (CHI pooling) [5]. The multiple parameter values are obtained after applying the test to each imputed datasets separately.

The pooled sampling variance (VAR pooling) method
Alternatively, a combination of the pooled parameter estimates and the pooled sampling variances can be used to construct a test that resembles a multivariate Wald test (VAR pooling) [12]. This test pools within and between covariance matrices that are obtained in each imputed dataset and finally corrects the total parameter covariance matrix of the multivariate Wald test by including the average relative increase in variance to account for the missing data [13].

Meng and Rubin pooling (MR pooling)
Meng and Rubin proposed a method to test overall categorical variables indirectly based on the likelihood ratio test statistic (MR pooling) [7]. For each regression parameter, two nested models are fitted in each imputed dataset: one restricted model where the parameter is not included in the model and one full model where the parameter is included. The pooled likelihood ratio tests are then compared to obtain pooled p-values for each parameter. The MR pooling method requires fitting multiple models for each variable in the data, hence it is an indirect approach. This can be a very time-consuming process.

The median P rule (MPR)
For the median P rule one simply uses the median pvalue of the significance tests conducted in each imputed dataset (MPR pooling). Hence, it depends on p-values only and not on the parameter estimates. The MPR was developed in a cross-validation setting for comparing predictive performances of two methods [8]. In that setting, multiple splits of the same data set into training and test sets render multiple dependent p-values. Various bounds for the type I error when using thresholds like median P < 0.05 were proven under a variety of assumptions. One of these is the multivariate normal null-distribution (MVN) for p-values that are transformed to a standard normal scale: then, median P < α implies a type I error smaller than α. In the imputation setting, dependence between the multiple p-values is caused by the fact that all imputed datasets share the same observed data. Therefore, the dependency is likely to be strong. In the remainder of the paper we will use the significance rule median P < α, which we refer to as "median P rule" (MPR). For a real dataset the underlying MVN assumption cannot be checked, because we observe only one instance of the p-value vector. In simulations, however, it can be checked using the asymptotic χ 2 (p) distribution for the observed Mahalanobis distance as computed from the p-values transformed to the standard normal scale. Fig. A1 in Additional file 1 shows the empirical distribution of Mahalanobis distance d together with the χ 2 (10) distribution for 10 imputations (which allows reliable estimation of the inverse covariance matrix, required for computing d). These are based on 1000 simulations (see Simulation Section). Indeed, we observe a good match between the two for both the imputed categorical and the imputed continuous variable. Hence, we conclude that at least in this setting the MVN assumption is reasonable.
The MPR was evaluated by using the p-values from the likelihood ratio test for multiple parameters for the categorical variables in the multivariable model. In addition, we performed extensive simulations to support the validity of the MPR (Simulation section) and we supply a bootstrapping scheme that allows anyone to check the appropriateness of this rule (and the aforementioned alternatives) for a given data set (Application section).

Simulation design
To study the performance of the CHI, VAR, MR and MPR pooling methods after multiple imputation we conducted a simulation study. In this study, data was generated for 250 cases. The data contained one categorical variable with four categories (Factor1) and four continuous variables (Covar1 -Covar4). The categorical variable was first created as a continuous variable, and then categorized by the quartiles of the variable. The categorical variable and the four continuous variables were the covariates in a model for a dichotomous outcome. The predictors were related to the outcome by multiplying coefficient loadings with the data matrix, and the resulting predictor matrix was used to estimate the probability of the outcome using a log-normal transformation of the linear predictor. The categorical variable was coded in the matrix by three dummy variables.
To create a variety of settings the data characteristics were varied. The correlation between the variables was varied between 0.2, 0.4, 0.6 and 0.8. Furthermore we varied the relation of the variables with the outcome by adjusting the coefficient values (betas). The betas for the continuous variables were varied from 0 to 1 with steps of 0.1. The betas of the dummy variables were varied by drawing the coefficients from a normal distribution with mean zero and a variance that also varied from 0 to 1 with steps of 0.1. Hence, for each correlation variation, ten different coefficient situations were simulated for 1000 datasets. This resulted in 40 conditions with 1000 datasets in each of these conditions.
We created missing values in the categorical variable (i.e., Factor1) and in the first continuous variable (i.e., Covar1). The percentage of missing values in both variables was set to either 25% or 40%. Accordingly, we created 40 conditions with 25% missing data, and 40 conditions with 40% missing data. The missing data was related to the other continuous variables in the data in order to simulate a missing at random missing data situation [3]. Each dataset with missing data was then imputed by multiple imputation. The number of imputations was set to 100. The data were analyzed using a generalized linear model.

Comparing methods
We compared the performance of five methods to pool the p-values of the variable tests. The first method that was used is RR. For the continuous variables this method is used by default in the MICE algorithm in R. However, for the categorical dummy variables, this method will produce three pooled p-values in our study; one for each dummy. So no overall p-value is obtained. The second method is MR pooling, the third method the chi-square test with multiple degrees of freedom (CHI pooling), the fourth method the multivariate sampling variance method (VAR pooling), and the fifth method the MPR, which pooled the p-values from the overall likelihood ratio test in each imputed dataset by taking the median.
For each of the simulated data conditions, the average type I errors and powers of all pooling methods were compared for the incomplete categorical variable and the incomplete continuous variable. We compared the results of the pooled p-values to the p-values from the complete data. Those 'full data' p-values were obtained by applying the generalized linear model to the simulated data without missing values, followed by computing the average type I error and power over the 1000 simulated data sets per condition.
Note that it has been shown that for the purpose of regression coefficient estimation, inclusion of the outcome variable in the imputation model (i.e., outcomebased imputation) is recommended [14]. However, for hypothesis testing, outcome-based imputation may lead to over-optimistic p-values, rendering the pooled test result as too liberal. In the simulation study we investigated this aspect of the imputation model extensively by comparing the performances of the pooling methods for outcome-based imputation with the results when the outcome was excluded from the imputation model. Table 1 presents the type I error for each pooling method compared to the complete data type I error, which is considered as the full data type I error, in the simulated condition when the beta equaled zero and 25% missing data. For all existing pooling methods outcome-based imputation was used, whereas for the MPR, results are presented both with inclusion (MPR in ) and without (MPR out ). Table 1 shows that the type I errors for all existing pooling methods, and also for the MPR out , were close to the target in all simulation conditions. For the MPR in method the type I error was sometimes too liberal. These findings are confirmed in the situation where 40% of the data is missing (Additional file 2, Table B1).

Description of results
The power for each of the pooling methods was evaluated for the categorical and continuous variable after imputation of the missing data. Note that the standard errors of the estimated type I error/power, denoted byp , equal sdp ð Þ ¼ ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffif , which equals 0.007 for target type I errorp = 0.05 and which has its maximum forp =0.5, rendering 0.015. Given that these standard errors are relatively small we have not displayed these in the Figures. Results are presented separately for outcomebased imputation and imputation where the outcome was excluded from the imputation model. Lastly, we present a summary result where the best strategy for each method is depicted.
Outcome-based imputation Figure 1 depicts the power for each pooling method after outcome-based imputation for the condition of 25% missing data and a correlation of 0.4, for each simulated coefficient value (beta). Figure 1 shows that for the categorical variable, the power for the CHI, MR and VAR pooling methods was smaller than for the full data situation. The power for the MPR was closer to the full data power. This occurs for all beta values in all data conditions, however as was concluded from Table 1, the type I error for MPR was slightly inflated. For the continuous variable, the MPR results in inflated power levels compared with the full data and the other pooling methods up to a beta value of 0.3. For beta values beyond 0.3 the power for MPR was larger compared to the other methods and closer to the full data power. The conditions with 40% missing data resulted in larger differences between the MPR, CHI, MR, VAR and RR pooling methods than in the 25% missing data conditions (see Additional file 2 for a full overview of the results).
Outcome excluded from imputation model  Figure 2 shows that for the categorical variable, the power of the CHI, MR and VAR pooling methods is smaller than that of both the full data analysis as of the MPR method. This is also the case for the continuous variable as well as for the situation with 40% missing data (see Additional file 2 for a full overview of the results). Figure 3 displays the results of the power analysis when for the RR, CHI, MR and the VAR pooling methods outcome-based imputation was applied, as recommended, and for the MPR when the outcome is excluded for imputation (MPR out ). It is clear from these figures that the power for the MPR out method for categorical variables is higher than for all other pooling methods and closer to the full data results at all beta values. Opposite results are found for the continuous variables where RR, CHI, MR and the VAR pooling methods show better power results. The results for 40% missing data and correlations of 0.2, 0.6 and 0.8 confirm these findings with larger differences in power levels between the MPR out and the other pooling methods (see Additional file 2).

Application
To illustrate our methods we used an example dataset adapted from a study about low back pain. The study population consisted of 299 workers that were listed as sick for a period of three weeks due to low back pain. Three treatment groups, high-intensity back school, low-intensity back schools and usual treatment by the occupational physician, were compared in a randomized clinical trial. The results for the short-term effects were published previously [15]. The primary outcome was the difference in pain after three months, measured  Table 2.
Missing values were generated in categorical and continuous variables by the missing at random mechanism, so the probability for missing data was related to other variables in the data, to create a realistic data situation. About 25% of the cases had missing data on BMI, education, heavy lifting, and physical functioning. This incomplete dataset was imputed 100 times without including the outcome variable in the imputation model for MPR pooling of categorical variables (MPR out ) and outcome-based imputation was used for the other variables and the RR, MR, CHI and VAR pooling methods. The variables without missing observations were included in the imputation model and multivariate imputation by chained equations was used to impute missing values. The same multivariable logistic regression model was fitted as for the complete data analysis on each imputed dataset. Subsequently, the p-values for each independent variable were pooled according to the four compared methods: RR, MR pooling, CHI pooling, VAR pooling and the MPR out . Listwise deletion was also applied and presented as comparison, where only the cases with completely observed data are included in the analysis. The resulting p-values are depicted in Table 3, along with the complete reference data p-values ('Full data') in the first column.
In this example we observe that the MPR performed similar to the Meng and Rubin method, the chi-square pooling with multiple degrees of freedom and pooled sampling variance method. The smaller p-values were often closer to the p-values from the complete data analysis for the MPR than for MR, CHI and VAR. Furthermore, multiple imputation improved the results in particular for Group and Sitting when comparing the pooled p-values to the p-values after listwise deletion.
We use this data example to show how one can verify control of type I error and power for the various pooling methods using a data-based simulation. We used the estimated means and covariance matrix from our own data example to generate 1000 bootstrap samples of the covariates from a multivariate normal distribution. The sample size from the original data was used for the bootstrap samples (n = 299). Subsequently, we used the pooled coefficient estimates after MI from the analysis performed above to create the dichotomous outcome variable. Note that to obtain the coefficient estimates we used outcome-based imputation as recommended [14]. For simulating the outcome variable we only used the coefficients that were significantly contributing to the model by using the MPR p-values with a threshold of The numbers between brackets indicate the dummy variables; SE Standard Error P < 0.05. So the outcome was predicted by BMI, sitting, vibration tools, pain at baseline and disability. Then missing data were generated according to the pattern that was used in the example above and MI was applied with the different methods to pool the p-values. Table 4 presents the probability for each variable to obtain a significant result (P < 0.05). From the variables above the dashed line (that are present in the model), we conclude that the power of MPR is larger compared to RR, MR, CHI and VAR pooling. For example, we observe that the power for the detection of the association between the variable 'Disability' and the outcome was about 21% higher for MPR than for MR pooling. From the variables below the dashed line (not present in the model) we observe that the type I error of MPR is generally more on target than that of the other methods, although MPR was for some variables slightly anti-conservative with a type I error slightly over 0.05, which is not very problematic in practice. These findings are in concordance with those of our simulation study.

Discussion
Multiple imputation is a frequently used method when covariate data is missing in prognostic models. In the process of defining the prognostic model one is most interested in the overall test to make a decision on the relevance of the categorical variable as a whole for the model. However, such a test is not easily applied in the case of MI and asks for (complex) adjustments in the pooling process or for switching between different software packages. If these packages are not available by the researcher, the researcher may fall back to simple but erroneously single imputation techniques. We showed in a simulation study that the MPR is an easy-to-use method for statistical testing of categorical variables in a multiple imputation context. The performance of the MPR is tested in many different data conditions and proved to be consistently satisfactory. In   [7], the MPR performs equally well and the resulting pooled p-value for the categorical variable is often more on target than the pooled p-values derived from the other methods.
To obtain a powerful significance test for continuous and dichotomous variables with RR after MI, the MI procedure has to include the outcome variable, as was also indicated by Moons et al. [16]. For overall significance testing of categorical variables by using the multiple parameter Wald test (Chi Pooling), the pooled sampling variance (VAR pooling) method and Meng and Rubin pooling (MR pooling), we suggest following this recommendation. It should be mentioned though that these results assume a correct imputation model, and establishing robustness against misspecification of this model requires further study. The imputation should be outcome-based for continuous variables that are pooled with RR, or when applying the more complex pooling methods (i.e., CHI pooling, VAR pooling or MR pooling), because for these methods the pooling parameters are estimates that are pooled after which the result of the hypothesis test is obtained. However, to obtain correct and powerful pooled pvalues for significance testing of categorical variable as a whole with the MPR, the outcome should be omitted from the imputation model. Note that omitting the outcome for imputation may have a robustness advantage, because such imputation does not assume a specific model for the relation between outcome and covariates. In MPR, the hypothesis test results are directly pooled; the pooling parameters are the p-values. In this case, using the outcome in the imputation model would lead to over-optimistic hypothesis test results, as was shown in the simulation. What are the practical consequences of our results? We suggest the following guideline: If the continuous variable(s) are most important: use one of the available pooling after outcome-based imputation methods. If the categorical variable(s) are of primary interest: use MPR with outcome-excluded imputation. In both cases, both procedures render valid results for the other variables as well (in terms of type I error control), but may lack power. Hence, if p-values for the other type of variables are just above 0.05, we recommend applying the alternative procedure to gain power. This comes at some computational cost, but, the MPR rule is very easy to apply in any software package that can perform MI and therefore time-saving in itself.
The usability of the pooling methods depends largely on their availability in statistical software. Software packages vary in methodology to pool parameters after MI. For example, In Stata and Mplus the multiple parameter pooling method (CHI pooling) can be used [17,18].
There is separate SAS add-on code available for CHI pooling and a translated version was developed for R [19]. In Mplus, SAS and R the Meng and Rubin test (MR pooling) is available [20]. For R this procedure is available in the MICE package [9,20]. The pooled sampling variance method is available in Mplus, SAS and R. The strength of the MPR rule is that this rule can easily be applied posterior to MI in any software package. This is a large advantage for the many researchers that are most familiar with the use of statistical software package SPSS. These researchers do not have to switch to other software programs for the MI procedure in order to pool p-values of categorical variables.
The parameters that are pooled with RR or in MR, CHI, and VAR pooling follow a normal distribution. To pool these parameters, the mean is used. The parameters that are pooled with MPR are p-values, which do not follow a normal distribution. For that reason, it is warranted to pool using the median instead of the mean. As described in the paper by Marshall et al. [4], for other parameters, such as the proportion of variance explained or discrimination indices, the median may be a good summary estimator after MI. In future research, the application of MPR can be explored in many other situations where there is not yet a widely available pooling method at hand. Examples are non-parametric testing of variables after MI such as the pooled p-values for spearman rho correlation coefficients. But also to pool p-values from the F-tests of an analysis of variance (ANOVA). Van Ginkel and Kroonenberg developed a method to pool the F-tests of an ANOVA, but this procedure is rather complicated and not available in all software packages [21]. Also situations where likelihood ratio test statistics have to be pooled, i.e. comparing multilevel models in multiply imputed datasets, may benefit from the application of a pooling procedure such as MPR.

Conclusions
In conclusion, the MPR is an attractive rule for statistical inference of categorical variables with more than two levels because it has at least equal power as the multi parameter tests that are currently used but is much easier to apply in any software package.