Results for each of the pooling methods, as well as the complete case analysis and full data analysis, are shown in Fig. 2 (normal outcome) and Fig. 3 (binary outcome).

In Fig. 2, we find the pooling methods show a fairly consistent ordering in the proportion of tests rejected at *p* < 0.05 across imputation methods and variables: the proportion of rejections is highest for the Cauchy combination method, followed by the MPV rule, the single imputation approach, the mean *p*-value rule, and then the two D2 methods. The exceptions to this ordering occur for the third and sixth covariates, for which single imputation rejects a larger proportion of tests than the MPV rule.
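To make the two best-performing rules concrete, here is a minimal sketch of the Cauchy combination rule and the MPV rule applied to per-imputation *p*-values, assuming MPV denotes the median *p*-value (implementation details such as weighting in the actual analysis may differ):

```python
import numpy as np
from scipy.stats import cauchy

def cauchy_combination(pvals):
    """Cauchy combination rule: transform each p-value to a standard
    Cauchy deviate, average the deviates, and map the mean back to a
    pooled p-value via the Cauchy survival function."""
    t = np.mean(np.tan((0.5 - np.asarray(pvals)) * np.pi))
    return cauchy.sf(t)  # equals 0.5 - arctan(t)/pi

def median_p_value(pvals):
    """MPV rule (assumed here to be the median p-value rule): the
    pooled p-value is the median of the per-imputation p-values."""
    return float(np.median(pvals))

# Illustrative p-values from m = 5 imputed-data analyses
pvals = [0.01, 0.04, 0.03, 0.20, 0.02]
print(cauchy_combination(pvals))
print(median_p_value(pvals))  # 0.03
```

Note that the Cauchy transform gives extreme weight to very small *p*-values, which is consistent with the rule's high rejection rates seen in Fig. 2.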

The Cauchy combination rule appears to be the highest-powered pooling approach; in all situations it has the highest proportion of tests with *p* < 0.05, sometimes, suspiciously, yielding more significant results than the “gold standard” models run on the full data. However, for the x_{6} variable, which has a null effect, the Cauchy combination rule substantially over-rejects compared to the other methods. Curiously, almost all methods over-rejected (i.e., rejected at a rate above 5%) for the x_{6} variable, including the full data and complete case analyses. In terms of power, the MPV rule performs only slightly worse than the full data analysis for variables x_{1} through x_{5} and much better than the complete case analysis in most settings. On the other hand, both D2 methods have comparatively low power in most situations, although they still typically outperform the complete case analysis. The D2 methods also have low Type I error rates (much lower than those of the full data or complete case approaches).
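For contrast, the D2 rule pools the chi-square test statistics rather than the *p*-values. A minimal sketch following the Li–Meng–Raghunathan–Rubin combination rule (as implemented, for example, in `mice::pool.scalar`); the per-imputation statistics below are illustrative, not from our simulation:

```python
import numpy as np
from scipy.stats import f

def d2_pool(chi2_stats, k):
    """D2 pooling: combine m chi-square statistics, each with k df,
    into one F-statistic and a reference p-value."""
    d = np.asarray(chi2_stats, dtype=float)
    m = len(d)
    # Between-imputation variability of the sqrt statistics
    r2 = (1 + 1/m) * np.var(np.sqrt(d), ddof=1)
    D2 = (d.mean()/k - (m + 1)/(m - 1) * r2) / (1 + r2)
    v = k**(-3/m) * (m - 1) * (1 + 1/r2)**2  # denominator df
    return D2, f.sf(D2, k, v)

# Illustrative chi-square statistics from m = 5 imputations, k = 4 df
D2, p = d2_pool([10, 12, 14, 16, 18], k=4)
```

Because the between-imputation variability r2 is subtracted from the average statistic, the D2 rule tends to be conservative when the statistics vary across imputations, consistent with the low power and low Type I error rates observed here.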

Findings for the binary outcome data under a MAR framework are very similar, as shown in Fig. 3. Findings regarding these pooling methods were likewise similar in simulations conducted under MCAR and MNAR frameworks (see Additional file 1: appendix). One peculiarity is that in both the normal and dichotomous outcome simulations, the Type I error rate in the ideal full-data case deviates further than anticipated from 5%. We conjecture that the *p*-value distribution is biased even in the full-data case because the inherent variability in smoothing parameter selection is not accounted for when testing the null hypothesis. However, as mentioned previously, these GAMs were fit with an ML approach, which in preliminary analyses had lower Type I error rates than REML or GCV. So, although the Type I error rate in the full-data case is not as low as one might hope, we have used what we believe to be the most conservative of the available fitting options.

A natural follow-up question for the MPV rule is: how many imputed datasets are necessary to achieve satisfactory performance? To shed light on this, Fig. 4 shows the proportion of tests with *p* < 0.05 using the MPV rule for several choices of the number of imputed datasets. Generally, the performance of the MPV rule improved with more imputations: as the number of imputed datasets increased, power increased for x_{1}, x_{2}, and x_{4}, remained constant for x_{5}, and the Type I error rate decreased toward the full-data level. Improvements were greatest at small numbers of imputations and tapered off after 10 imputations. Interestingly, this was not the case for x_{3}, where the MPV rule’s power decreased with additional imputations. This is because both PMM and RF have difficulty capturing the relationship of x_{3} to Y (see Additional file 1: Appendix Figure A7), particularly PMM, since the relationship is U-shaped and its linear approximation is therefore biased toward the null.

PMM imputation generally yielded results more consistent with the full data approach than RF imputation, with the exceptions of x_{3} and the null variable x_{6}, for which the Type I error rate was higher with PMM-imputed data. As stated above, the better performance of RF over PMM for x_{3} is likely due to RF imputation’s ability to better identify and model non-linear relationships.

In addition to GAMs, the effectiveness of the MPV rule relative to the other methods was examined with B-spline models. B-spline models, like GAMs, fit a smooth curve through data; their flexibility comes from constrained piecewise polynomial terms joined at “knots” placed along the x-axis. In our investigation, we examined cubic B-spline regression models with 10 degrees of freedom and knots placed at seven internal, equally spaced quantiles of the covariate. B-splines were examined only with normal data, and inference was conducted using likelihood ratio tests (LRTs). Figure 5 presents results for B-spline models on normal data with a baseline MAR missing data structure. The alternative D2 method was not used in this analysis, as the number of degrees of freedom is more tractable for B-splines.
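As a sketch of this basis construction (our analysis used different software, and details such as boundary-knot handling may differ), a cubic B-spline design matrix with seven internal knots at equally spaced quantiles can be built with SciPy:

```python
import numpy as np
from scipy.interpolate import BSpline

def cubic_bspline_basis(x, x_train, n_internal=7, k=3):
    """Design matrix for a cubic B-spline whose internal knots sit at
    equally spaced quantiles of the training covariate values."""
    probs = np.linspace(0, 1, n_internal + 2)[1:-1]  # 0.125, ..., 0.875
    internal = np.quantile(x_train, probs)
    # Boundary knots are repeated k + 1 times, as a B-spline basis requires
    t = np.r_[[x_train.min()] * (k + 1), internal, [x_train.max()] * (k + 1)]
    return BSpline.design_matrix(x, t, k).toarray()

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 1, 500)
B = cubic_bspline_basis(np.linspace(0.05, 0.95, 50), x_train)
# 7 internal knots with a cubic basis give 11 columns, i.e., 10 df
# once the intercept is absorbed into the regression model
```

Within the knot range the basis functions sum to one at every evaluation point, which is why the effective degrees of freedom are fixed and known, unlike the penalized fits used by GAMs.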

Again, Fig. 5 shows that the MPV rule performs comparably well in terms of power for covariates x_{1} through x_{5}. We now also see satisfactory performance of the MPV rule for the null case (x_{6}), in which it is closest to the nominal Type I error rate while remaining slightly conservative. Further B-spline results for MCAR and MNAR missing data patterns are shown in Additional file 1: Appendix. Figure 6 shows that the performance of the MPV rule generally improves or holds steady with additional imputations when used with LRTs of B-spline models, with the same exception as for GAMs with the x_{3} variable.

It is clear we observe bias away from the null in both the B-spline and GAM settings. To investigate this, we plot the *p*-values for the null effect of x_{6} from the GAM results in Fig. 7. Not only are the PMM and RF *p*-values for null effects non-uniform (with a heavy lean toward smaller *p*-values), but the complete case and full data distributions are likewise left-heavy. Although the Type I error rate was better controlled for B-splines than for GAMs, similar histograms indicate that the B-splines still show evidence of bias away from the null after imputation (Fig. 8). However, the full data and complete case *p*-values are near-uniform, with rejection rates almost exactly at 5%, in contrast to the higher rates seen with GAMs. In conclusion, we have found that the default imputation methods (PMM and RF) bias results slightly against the null, but also that GAM *p*-values are biased against the null regardless of imputation technique, as demonstrated by the inflated Type I error rates even in the full data analysis.
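The visual impression from such histograms can be supplemented with a formal uniformity check. A sketch using the Kolmogorov–Smirnov test against Uniform(0, 1), on illustrative data rather than the simulation's actual *p*-values:

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(42)

# Under a well-calibrated test, null p-values are Uniform(0, 1)
p_uniform = rng.uniform(0, 1, 2000)
# A Beta(0.5, 1) sample mimics the left-heavy (anti-conservative)
# histograms seen for the null covariate
p_left_heavy = rng.beta(0.5, 1.0, 2000)

ks_ok = kstest(p_uniform, "uniform")    # small KS statistic
ks_bad = kstest(p_left_heavy, "uniform")  # large KS statistic, tiny p
print(ks_ok.statistic, ks_bad.statistic)
```

A large KS statistic for the left-heavy sample would flag the same calibration problem the histograms show, without relying on visual binning choices.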

### Normal outcome application: six-minute walk distance based on elevation in PAH patients

A recent paper [20] examined the effect of home elevation above sea level (based on ZIP code) on the distance walked in six minutes by patients presenting with pulmonary arterial hypertension. The six-minute walk distance (6-MWD) is an important clinical metric for evaluating disease progression. Several variables in this dataset were missing non-negligible amounts of data, and the study authors addressed the missing data with multiple imputation via chained equations with *m* = 25 imputations, using predictive mean matching as the imputation method for continuous variables and logistic or multinomial regression for categorical variables.

A central model of interest from this study was a GAM in which 6-MWD was modeled by various demographic and clinical covariates, as well as a smoothed term for continuous elevation, derived from each patient's home address. In this context we applied the various pooling rules to GAMs fit across the multiple imputations to evaluate how these methods perform on real-world data. Figure 9 shows the relevant covariate curves of the 25 GAMs fit with ML to each of the *m* imputed datasets. Although standard errors of the curves are not plotted for the sake of clarity, it is evident that some curves suggest a stronger relationship between elevation and distance walked than others.

This heterogeneity across imputations is further demonstrated in Fig. 10, which plots the *p*-values from each imputation’s GAM along with the results from all investigated pooling methods. The three panels represent three GAM fitting methods: ML, REML, and GCV, the last of which is the default. The motivation for fitting GAMs with all three methods stemmed from preliminary concerns about inflated Type I error rates with the GCV approach.

For all model fitting options, the MPV approach yielded a *p*-value as anticipated – in the middle of the single imputation models' *p*-values. The Cauchy combination method gave smaller *p*-values than the other combination methods, while the D2 and alternative D2 methods produced *p*-values at the high end of, if not outside, the range of the single imputation *p*-values. In this application, the choice of pooling method could clearly affect the study's conclusions if the authors leaned heavily on the alpha = 0.05 criterion for statistical significance.

### Binary outcome application: Intubation risk by C-reactive protein level

Our second application uses data from Peterson [21] and Windham et al. [22], who modeled the risk of intubation for hospitalized patients infected with COVID-19 based on patient characteristics including C-reactive protein (CRP) level, age, body mass index (BMI), race, sex, lactate dehydrogenase, and additional clinical measures. Limited clinical insight into risk factors for COVID-19 complications early in the pandemic necessitated flexible modeling without restrictive parametric assumptions. Therefore, a GAM with a logit link and a smoothed term for CRP level was chosen to maximize predictive ability for intubation risk.

Missing data were non-negligible and dependent on observed patterns in the data. We therefore assumed a MAR framework and employed multiple imputation via chained equations with *m* = 25 imputations. As before, we examine the behavior of the MPV rule in comparison to the other pooling methods; ML was used as the fitting method. Figure 11 shows the distribution of *p*-values after MI. The rank order of the combined *p*-values was similar between the predictive mean matching and random forest methods. Again, the Cauchy combination rule yields the smallest *p*-value, while the D2 and mean *p*-value rules yield larger *p*-values. Unlike in the prior application, however, all methods lead to the same inferential decision (at a 5% significance level), so the choice of pooling method is less consequential. It is worth noting that a listwise deletion approach yields a *p*-value of 0.098, much higher than all other methods. Even though only 8 of the original 158 participants were missing CRP values, only 51 participants had complete data for all covariates, so listwise deletion cuts the sample to 32% of its original size. Hence, multiple imputation was necessary in this example to fully leverage the available data and maximize power and precision.