Multi-metric comparison of machine learning imputation methods with application to breast cancer survival
BMC Medical Research Methodology volume 24, Article number: 191 (2024)
Abstract
Handling missing data in clinical prognostic studies is an essential yet challenging task. This study aimed to provide a comprehensive assessment of the effectiveness and reliability of different machine learning (ML) imputation methods across various analytical perspectives. Specifically, it focused on three distinct classes of performance metrics used to evaluate ML imputation methods: post-imputation bias of regression estimates, post-imputation predictive accuracy, and substantive model-free metrics. As an illustration, we applied data from a real-world breast cancer survival study. A simulated dataset with 30% Missing At Random (MAR) values was used. A number of single imputation (SI) methods (KNN, missMDA, CART, missForest, missRanger, and missCforest) and multiple imputation (MI) methods (miceCART and miceRF) were evaluated. The performance metrics used were Gower’s distance, estimation bias, empirical standard error, coverage rate, length of confidence interval, predictive accuracy, proportion of falsely classified (PFC), normalized root mean squared error (NRMSE), AUC, and C-index scores. The analysis revealed that in terms of Gower’s distance, CART and missForest were the most accurate overall, while missMDA and CART excelled for binary covariates, and missForest and miceCART were superior for continuous covariates. When assessing bias and accuracy in regression estimates, miceCART and miceRF exhibited the least bias. Overall, the various imputation methods demonstrated greater efficiency than complete-case analysis (CCA), with MICE methods providing optimal confidence interval coverage. In terms of predictive accuracy for Cox models, missMDA and missForest had superior AUC and C-index scores.
Despite offering better predictive accuracy, SI methods were found to introduce more bias into the regression coefficients than MI methods. This study underlines the importance of selecting appropriate imputation methods based on study goals and data types in time-to-event research. The varying effectiveness of the methods across the different performance metrics studied highlights the value of using advanced machine learning algorithms within a multiple imputation framework to enhance research integrity and the robustness of findings.
Background
Models built using clinical, histological, biological, and radiological variables often face the challenge of missing data, as clinical databases may not have complete records for every patient. Understanding the nature of missing data and choosing the most appropriate imputation method is crucial to ensure reliable and accurate results. This is especially true in prognostic studies, where the presence of missing data poses significant challenges. Missing values in certain covariates for a subset of patients can lead to a substantial reduction in the sample size available for analysis. This data insufficiency can produce biases in parameter estimates, impacting the validity of the subsequent analytical conclusions.
Rubin’s taxonomy provides a framework for understanding the nature of missing data, classifying them into three distinct mechanisms: Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR) [1, 2]. In the MCAR mechanism, the probability of missingness does not depend on either observed or unobserved data. In contrast, for MAR, the probability of missingness depends only on the observed data; it is independent of the unobserved data. Finally, in the MNAR mechanism, which is the most complex of the three, the probability of missingness relies on unobserved data, even when conditioning on observed data.
At its core, a missing data imputation method seeks to replace missing values using a specific model or algorithm to provide the most accurate representation of the original data. One frequent approach to handling missing data is complete-case analysis (CCA). This method limits the analysis to individuals who have complete data (i.e., all values for all the included variables). Consequently, the sample size may be dramatically reduced. CCA may not always be the optimal strategy due to the potential for biases and a reduction in statistical power [3].
Two alternative strategies have been suggested to overcome the limitations of CCA: single imputation (SI) and multiple imputation (MI). SI is a straightforward method that replaces a missing value with just one value, such as the mean or mode, resulting in a single dataset; therefore, no post-imputation operations are necessary. In contrast, MI creates multiple copies of the dataset, with each missing value in each copy replaced by independent random draws from the imputation model [4]. The analysis model is then applied to each of the imputed datasets, and Rubin’s rules are employed to pool the results from all these analyses into a single inference, thereby accounting for imputation uncertainty [5].
In order to eliminate potential biases in MI, it is essential to include all variables used in the analysis model as well as any interactions between variables in the imputation model [6]. Moreover, the imputation model can incorporate auxiliary variables, which are not included in the analysis model. These variables serve to make the MAR assumption more plausible and to provide additional information on missing values [7].
Given the increasing prevalence of machine learning (ML) imputation algorithms, a variety of strategies based on ML have emerged to tackle the challenge of missing data. However, the performance of these strategies varies greatly depending on the specific context of the problem at hand; this complicates the identification of a gold standard ML approach for handling missing data.
Numerous studies have sought to statistically compare the performance of these different ML-based imputation algorithms using a variety of performance metrics. These range from metrics related to the quantities estimated by the analysis model (post-imputation), to metrics assessing the predictive accuracy of the analysis model, and those assessing the imputation accuracy regarding the variables imputed. Depending on the purpose of the specific statistical analysis, one method may be preferable to another. However, it is often difficult to determine which method produces the most accurate results [8]. For example, the method with the highest predictive power does not necessarily produce the least biased estimates. Conversely, a method that minimizes bias does not necessarily guarantee the best predictive performance.
ML imputation algorithms are based on constructing a predictive model to predict missing values using the available data. Popular ML algorithms, such as RandomForest and KNN, have been extensively used as imputation methods in various prognostic studies in both SI and MI settings. Results from a wealth of comparative studies on these methods highlight the controversy surrounding their relative performance [9,10,11,12].
For instance, KNN, which imputes missing values by considering subjects with similar outcome patterns, exhibits the lowest imputation error when compared with other imputation approaches [13]. Another example is missForest, which is based on RandomForest, an algorithm praised for consistently low imputation error and superiority when variables have high intercorrelations [14, 15]. While the performance of missForest tends to improve with increasing correlation levels, it does not always perform better than other methods when the missing data mechanism is MCAR [16].
Another interesting approach to use ML imputation algorithms is their integration into the MICE (Multiple Imputation by Chained Equations) framework. The miceCART algorithm, a nonparametric approach for implementing MICE, uses sequential regression trees as the conditional models. It is recognized for its adeptness in capturing complex relationships with minimal tuning by the data analyst, as conditional models do not need to be explicitly specified [10]. However, studies measuring its performance have observed that while its imputations for continuous values generated via recursive partitioning may preserve interactions, the miceCART algorithm may underestimate main effects [17].
Despite the numerous comparative studies on the performance of different ML imputation algorithms, their study designs tend to lack consistency; moreover, they often focus solely on either the precision of imputed data or the bias following imputation (the latter being tied to parameter estimates) [18]. Additionally, in health-related studies, discussions about how one imputation method might influence epidemiological interpretation, especially within the context of survival analysis, are rare.
The present study aimed to investigate whether the use of various metrics to assess the performance of machine learning-based algorithms can lead to significant differences in the interpretation of results. To this end, we compared eight ML algorithms adapted to missing data imputation. Specifically, we performed a simulation analysis to identify the most effective imputation method according to different performance metrics. Subsequently, we applied the chosen method to real-world breast cancer survival data. This comprehensive analysis provides insights into how different performance metrics can influence the understanding and effectiveness of machine learning algorithms in practical health-related applications.
Methods
Motivating example: breast cancer case study
To apply our most effective method to a real-world situation, we used data from a retrospective survival study of women with breast cancer conducted in Morocco, involving 711 incident cancers diagnosed in 2009 and followed until December 2014 [19]. That study aimed to identify prognostic variables, including epidemiological, clinical, pathological, biomarker expression, and treatment characteristics. Additionally, other risk factors such as oral contraceptive use, a family history of breast cancer, and obesity were evaluated. The outcome of interest was event-free survival, calculated from the date of surgery or initiation of chemotherapy to the earliest date of either locoregional recurrence or distant metastasis. Fourteen baseline covariates were included in the cancer study’s dataset: age, body mass index (BMI), radiotherapy, mammographic size, Scarff-Bloom-Richardson (SBR) grade, nulliparity, lymph nodes (N0, N1, N2, N3), oral contraception, PgR, vascular invasion, trastuzumab, hormone therapy, HER2, and ER. Age refers to the patient’s age in years. BMI is a measure of body fat based on height and weight. Radiotherapy indicates whether the patient received radiation treatment. Mammographic size measures the size of the tumor in centimeters. The Scarff-Bloom-Richardson (SBR) grade categorizes tumor aggressiveness into three grades: I, II, and III. Nulliparity refers to whether a woman has never given birth. Lymph nodes are classified into four stages (N0, N1, N2, N3) based on the extent of cancer spread, following the TNM classification system. Oral contraception indicates the use of birth control pills. PgR (Progesterone Receptor) status indicates hormone therapy responsiveness. Vascular invasion refers to the presence of cancer cells in blood vessels. Trastuzumab is a targeted therapy drug. Hormone therapy involves treatments that block hormones to slow down or stop cancer growth.
HER2 (Human Epidermal Growth Factor Receptor 2) identifies a protein that can promote cancer growth. ER (Estrogen Receptor) status helps determine the cancer’s growth sensitivity to estrogen, guiding hormone therapy choices. Table 1 presents the distribution of the variables for the dataset, along with the percentage of missing values for each variable. The proportion of missing values varied between covariates, ranging from 0.6 to 43%. Missing values were observed in 607 women, resulting in an almost 85% reduction in sample size under CCA, leaving a study sample of only 105 cases. This example highlights the need to appropriately handle missing data in order to optimally analyse the study data, while both minimizing the risk of loss of efficiency and reducing potential bias.
Machine learning imputation algorithms
Using various performance metrics, our analysis compared the performances of eight different ML-based imputation methods in both SI and MI settings (Table 2).
A comprehensive presentation follows for each of the eight algorithms investigated. We will explore SI strategies for KNN, missMDA, CART, missForest, missRanger, and missCforest, and MI strategies for miceCART and miceRF.
K-nearest neighbours (KNN)
The KNN imputation method, which is similar to the hot-deck method, uses donor observations. The value imputed is an aggregation of the values of the \(\:k\) closest neighbours. The method of aggregation depends on the type of variable: for continuous variables, the default aggregation is the median, while for categorical variables it is the most frequent category among the \(\:k\) values.
This method computes a distance to determine the nearest neighbours using a version of Gower’s distance that can handle different types of variables, specifically binary, categorical, ordered, continuous, and semicontinuous variables [20]. The distance between two observations is a weighted average of the contributions of each variable, with the weights reflecting the importance of each variable.
For continuous variables, KNN calculates the absolute distance between two observations and then divides it by the total range of that variable. The same approach is used for ordinal variables, after converting them to integer variables. For nominal and binary variables, a binary distance of 0/1 is used [21]. KNN imputation preserves the inherent structure and relationships within the data during the imputation process. As a nonparametric method, it refrains from making assumptions about data distribution, and consequently offers a nonparametric solution ideal for datasets with unknown or nonnormal distributions. However, its effectiveness is closely tied to the critical choice of \(\:k\). Additionally, its reliance on a distance metric that accommodates diverse variables introduces sensitivity to this choice, necessitating careful weighting.
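The distance and aggregation rules above can be sketched in a few lines. This is a minimal illustrative sketch, not the implementation of [21]: it assumes categorical variables are integer-coded in a NumPy matrix, assigns equal weights to all variables, and assumes the row being imputed is observed on the remaining columns; all function names are ours.

```python
import numpy as np

def gower_dist(x, y, is_cat, ranges):
    # Per-variable contributions: 0/1 mismatch for categoricals,
    # range-normalized absolute difference for continuous variables.
    d = np.where(is_cat, (x != y).astype(float), np.abs(x - y) / ranges)
    return d.mean()

def knn_impute_cell(data, i, j, k, is_cat):
    """Impute the missing cell data[i, j] from the k complete rows
    closest to row i under Gower's distance: donor median for a
    continuous target, most frequent donor category otherwise."""
    ranges = np.nanmax(data, axis=0) - np.nanmin(data, axis=0)
    ranges[ranges == 0] = 1.0                      # guard zero ranges
    donors = [r for r in range(data.shape[0])
              if r != i and not np.isnan(data[r]).any()]
    keep = np.arange(data.shape[1]) != j           # drop the target column
    d = [gower_dist(data[i][keep], data[r][keep], is_cat[keep], ranges[keep])
         for r in donors]
    nearest = [donors[t] for t in np.argsort(d)[:k]]
    vals = data[nearest, j]
    if is_cat[j]:
        cats, counts = np.unique(vals, return_counts=True)
        return cats[np.argmax(counts)]             # mode of donor values
    return float(np.median(vals))                  # median of donor values
```

Note that only complete rows act as donors here; production implementations also handle partially observed donors and variable-specific weights.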
Classification and regression trees (CART)
CART, often referred to as decision trees, is a versatile class of ML algorithms. The core distinction between classification and regression trees lies in the criteria used for data splitting and tree pruning; a more detailed discussion on these aspects can be found in [22, 23]. A key feature of the CART algorithm is its inherent ability to preserve interactions between variables in the dataset.
CART identifies predictors and cutoff points within these predictors that can be used to divide the sample. These partitions split the sample into subsamples of greater homogeneity. The splitting procedure is repeated on the subsamples, producing a binary tree structure. The target variable in these models can either be discrete (classification trees) or continuous (regression trees).
When employing CART for imputation of missing data, the algorithm treats the variable with missing values as the dependent variable (target), and the remaining variables as predictors. The procedure involves building a decision tree where the splits are based on the predictors, the aim being to predict the missing values of the target variable. For continuous variables, regression trees are employed, predicting a value that minimizes the variance within nodes. For categorical variables, classification trees are used, where the prediction corresponds to the most frequent class within a node.
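The column-wise procedure just described can be sketched with off-the-shelf decision trees. This is an illustrative sketch under the assumption that scikit-learn is available and that categorical targets are label-encoded; it is not the specific implementation compared in the study.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier

def cart_impute_column(X_obs, y_obs, X_mis, categorical):
    """Fit a tree on cases where the target column is observed, then
    predict its missing entries: a regression tree (variance-minimizing
    splits, leaf means) for continuous targets, a classification tree
    (leaf majority class) for categorical ones."""
    tree = DecisionTreeClassifier() if categorical else DecisionTreeRegressor()
    tree.fit(np.asarray(X_obs), np.asarray(y_obs))
    return tree.predict(np.asarray(X_mis))
```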
Decision trees possess properties that make them particularly appealing for imputation tasks [24]. They are robust against outliers, can handle multicollinearity and skewed distributions, and are flexible enough to accommodate interactions and nonlinear relationships. Moreover, many aspects of model fitting have been automated, meaning minimal tuning by the imputer [10].
Imputation with multivariate data analysis (missMDA)
missMDA is based on the Factor Analysis of Mixed Data (FAMD) method. It handles missing values in mixed data types, including both continuous and categorical variables. This method considers the similarities between individuals and the associations between variables. It performs a Principal Component Analysis (PCA) for continuous variables and a Multiple Correspondence Analysis (MCA) for categorical variables.
The imputation process begins with initial estimates for missing values, using the mean for continuous variables and modes for categorical variables. For mixed data, the Factor Analysis of Mixed Data (FAMD) method, which integrates the principles of PCA and MCA, is utilized to handle both data types cohesively. The algorithm iteratively updates these estimates as follows: in each cycle, FAMD predicts missing values based on observed data; these predictions replace previous estimates, and a new model incorporating these updates is then fitted. The process continues until a convergence criterion is met, typically when changes in imputed values between iterations fall below a set threshold. This ensures that the final imputed values are both statistically robust and contextually aligned with the dataset’s structure [25].
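For purely continuous data, the iterate-until-convergence loop reduces to an iterative PCA, which can be sketched as follows. This is a stripped-down analogue of what missMDA does, under the assumption of continuous columns only; FAMD extends the same scheme to mixed data via an MCA-style coding of the categorical variables, and the function name is ours.

```python
import numpy as np

def iterative_pca_impute(X, ncp=1, max_iter=500, tol=1e-8):
    """Iterative-PCA imputation: mean-fill the missing cells, then
    alternate between a rank-ncp reconstruction of the centered matrix
    and refreshing only the missing cells, until the imputed values
    stabilize."""
    X = np.asarray(X, float).copy()
    miss = np.isnan(X)
    X[miss] = np.take(np.nanmean(X, axis=0), np.where(miss)[1])  # mean fill
    for _ in range(max_iter):
        mu = X.mean(axis=0)
        U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
        s[ncp:] = 0.0                              # keep ncp components
        X_hat = (U * s) @ Vt + mu                  # low-rank reconstruction
        new = np.where(miss, X_hat, X)             # update missing cells only
        if np.max(np.abs(new - X)) < tol:
            return new
        X = new
    return X
```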
Nonparametric imputation using RandomForest (missForest)
The missForest algorithm, a nonparametric imputation technique, utilizes a Random Forest model to iteratively predict missing data values. Initially, it estimates the missing values, typically using the mean or mode, and then employs a Random Forest model to impute missing values for each variable based on observed values from the other variables. The process is iterative, with predictions being updated in each cycle. A key component of this process is the use of Out-Of-Bag (OOB) error estimates to assess the accuracy of imputations after each iteration. The algorithm replaces missing values with new predictions and refits the Random Forest models to these updated data, continuing until the OOB error suggests that further iterations would not result in more accurate imputations. This method is adept at handling mixed-type data and provides an estimate of the imputation error, thereby offering an indication of the reliability of the imputed values.
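The iterative loop can be sketched for a continuous matrix as below. This is a simplified sketch of the missForest idea, not the R package: it assumes scikit-learn is available, handles continuous columns only, and runs a fixed number of sweeps instead of the OOB-based stopping rule; the function name is ours.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def miss_forest_numeric(X, n_iter=5, seed=0):
    """Mean-fill, then cycle through the columns with missing entries
    (fewest missing first), regressing each on all the others with a
    Random Forest and overwriting the missing cells with predictions."""
    X = np.asarray(X, float).copy()
    miss = np.isnan(X)
    X[miss] = np.take(np.nanmean(X, axis=0), np.where(miss)[1])  # initial fill
    order = np.argsort(miss.sum(axis=0))           # fewest missing first
    for _ in range(n_iter):
        for j in order:
            if not miss[:, j].any():
                continue
            rf = RandomForestRegressor(n_estimators=50, random_state=seed)
            others = np.delete(X, j, axis=1)
            rf.fit(others[~miss[:, j]], X[~miss[:, j], j])
            X[miss[:, j], j] = rf.predict(others[miss[:, j]])
    return X
```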
Random Forests differ from the CART methodology by generating multiple trees instead of just one. By averaging numerous trees, the variance of unstable trees is significantly reduced, leading to a more reliable and robust model [23]. The introduction of variability in individual trees yields a more resilient solution, which subsequently enhances the method’s accuracy. This variability can be generated through different processes, including bootstrapping and random input selection [22].
Alternative implementation of missForest algorithm using predictive mean matching (missRanger)
An alternative to missForest is missRanger, which incorporates a Predictive Mean Matching (PMM) option into the imputation process [26]. Missing values for a variable are imputed using predictions made by a Random Forest, which employs all remaining variables as covariates. The algorithm iterates repeatedly across all variables, continuing until no further improvement is observed in the average OOB prediction error of the models; this serves as the stopping criterion. When the PMM option is used [27], a donor pool is created for each missing value, consisting of complete cases whose predicted values of the outcome are closest to the predicted value of the missing entry. From this donor pool, PMM then randomly selects one donor case, and the actual observed value of this selected case is used to replace the missing value. In this way, PMM aims to restore the variance of the resulting conditional distributions to a realistic level and preserves the original distribution of the data.
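The donor-pool step can be sketched in isolation. This illustrative sketch assumes the model predictions for observed and missing cases have already been computed (by a Random Forest or any other model); the function name and donor-pool size are ours.

```python
import numpy as np

def pmm_impute(pred_obs, y_obs, pred_mis, n_donors=5, seed=0):
    """Predictive Mean Matching step: for each missing case, build a
    donor pool of the n_donors observed cases whose predictions are
    closest to the missing case's prediction, then draw one donor at
    random and impute its *observed* value."""
    rng = np.random.default_rng(seed)
    pred_obs, y_obs = np.asarray(pred_obs), np.asarray(y_obs)
    out = np.empty(len(pred_mis))
    for i, p in enumerate(pred_mis):
        pool = np.argsort(np.abs(pred_obs - p))[:n_donors]
        out[i] = y_obs[rng.choice(pool)]   # an actually observed value
    return out
```

Because the imputation is always an observed value, PMM cannot produce implausible entries (e.g., negative tumor sizes), which is the credibility property referred to below.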
The missRanger imputation method provides a robust alternative to the widely used missForest algorithm. In particular, the integrated PMM option ensures the credibility of imputed values [27].
Imputation based on ensemble conditional trees (missCforest)
The missCforest algorithm can be used for the imputation of numerical, categorical, and mixed-type data [28]. Missing values are imputed through ensemble prediction, using Conditional Inference Trees (Ctree) as base learners [29]. Ctree is a nonparametric class of regression and classification trees that combines recursive partitioning with conditional inference theory [30]. missCforest redefines the imputation problem as a prediction problem using a single imputation approach. The missing values are predicted iteratively based on the set of complete cases, which is updated at each iteration. There is no predefined stopping criterion; the imputation procedure stops when all missing data have been imputed. Since the recursive partitioning of Conditional Trees is based on numerous test procedures, this algorithm is resistant to outliers and pays special attention to the statistical relationship between the covariates (i.e., variables used for imputation) and the outcome (i.e., the variable to be imputed).
Multiple imputation based on CART algorithm (miceCART)
The MICE (Multiple Imputation by Chained Equations) algorithm is an iterative method for handling missing data. It operates in a variable-by-variable manner, using the observed data to generate initial random imputations for the missing data [31].
miceCART is a variation of the CART algorithm designed to work within the framework of MI [24]. Initial missing values are filled in by drawing randomly from the observed values of each corresponding variable. A tree is then fitted to the first variable with at least one missing value, using the remaining variables as predictors.
Only individuals with observed values for the outcome are considered. This produces a tree with multiple leaves, each containing a subset of the data. An individual with a missing value for the outcome is placed inside one of these leaves. A random value from this leaf’s subset is then selected and used for imputation. This procedure is carried out for each variable with missing data and is ultimately repeated multiple times, resulting in multiple imputed datasets.
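The leaf-draw step that distinguishes miceCART from plain CART prediction can be sketched as follows. This illustrative sketch assumes scikit-learn and a continuous target; the function name and the leaf size are ours.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def cart_draw_impute(X_obs, y_obs, X_mis, min_leaf=5, seed=0):
    """miceCART-style imputation step: fit a tree on the observed
    cases, drop each incomplete case into its leaf, and draw a *random*
    observed value from that leaf (rather than the leaf mean), which
    injects the between-draw variability that MI needs."""
    rng = np.random.default_rng(seed)
    y_obs = np.asarray(y_obs, float)
    tree = DecisionTreeRegressor(min_samples_leaf=min_leaf)
    tree.fit(np.asarray(X_obs), y_obs)
    obs_leaf = tree.apply(np.asarray(X_obs))       # leaf id per observed case
    out = np.empty(len(X_mis))
    for i, leaf in enumerate(tree.apply(np.asarray(X_mis))):
        out[i] = rng.choice(y_obs[obs_leaf == leaf])
    return out
```

Repeating this step across variables and across m dataset copies, with fresh random draws each time, yields the multiple imputed datasets described above.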
Multiple imputation based on RandomForest algorithm (miceRF)
MICE RandomForest uses the RandomForest algorithm to predict missing values based on observed data, updating imputations until convergence is reached. By combining RandomForest with MI, a degree of uncertainty can be introduced into the imputation model, which makes it more suitable for generating parameter estimates with desirable characteristics [24].
The miceRF approach can yield more accurate and reliable imputations when the data exhibit complex patterns that the standard regression models used in traditional MICE might fail to capture [32].
However, as with any imputation method, the resulting imputations should be validated through methods such as sensitivity analyses or comparisons with CCA to ensure the robustness of the imputation procedure.
Simulation study
In this simulation study, we aimed to evaluate the performance of different ML methods for missing data imputation using different types of performance metrics.
Our simulation design was tailored to incorporate mixedtype covariates (i.e., both continuous and categorical), include nonlinear interdependencies between variables, and address the key issue of missing data. The approach used for this comprehensive assessment was as follows:

(1) Generate survival data for all cases.
(2) Estimate the Cox Proportional Hazards (PH) model using the complete data.
(3) Introduce missing values into covariates.
(4) Impute the missing data using the different methods.
(5) Re-estimate the Cox PH model using the imputed data.
(6) Compare the performance of the imputation methods using several performance metrics.
Data generation
We generated a total of 500 complete ("all-cases") datasets, each containing 700 observations with no missing values. The datasets included three covariates \(\:{X}_{1}\), \(\:{X}_{2}\), and \(\:{X}_{3}\), in addition to the event status and observed survival times, which were simulated as follows:

(i) \(\:{X}_{1}\): a continuous covariate, drawn from a standard normal distribution \(\:\left(mean=0,\,sd=1\right)\);
(ii) \(\:{X}_{2}\): a continuous covariate nonlinearly dependent on \(\:{X}_{1}\), generated as \(\:{X}_{2}={X}_{1}^{2}+{X}_{1}+U\), where \(\:U\sim\text{Uniform}\left(0,1\right)\);
(iii) \(\:{X}_{3}\): a binary covariate, generated from a binomial distribution with \(\:p=0.5\).
To generate individual survival times, we employed the cumulative hazard inversion method, as described in [33]. The method starts with the specification of log-hazard ratios for covariates \(\:{X}_{1}\), \(\:{X}_{2}\), and \(\:{X}_{3}\), which were set at \(\:0.1\), \(\:0.3\), and \(\:0.6\), respectively.
This technique is based on inverting the survival function, formulated as \(\:{S}_{i}^{-1}\left(u\right)={H}_{0}^{-1}\left(-\log\left(u\right)\,exp\left(-{X}_{i}\beta\right)\right)\), where \(\:{S}_{i}^{-1}\left(u\right)\) denotes the inverted survival function for the \(\:i\)th individual, \(\:{H}_{0}^{-1}\left(u\right)\) represents the inverted cumulative baseline hazard function, \(\:{X}_{i}\) is the vector of covariates for the individual, and \(\:\beta\) comprises the corresponding effect parameters on survival.
To generate individual simulated event times \(\:{T}_{i}\), we applied the formula \(\:{T}_{i}={S}_{i}^{-1}\left({U}_{i}\right)\), where \(\:{U}_{i}\) is a random variable drawn from a uniform \(\:U\left(0,1\right)\) distribution.
For our analysis, we adopted a generalized Weibull distribution to model the baseline hazard, chosen for its clinically plausible risk shape, which closely follows the pattern observed for the risk of certain cancers. This parametric framework is particularly suitable for representing the aggressive nature of cancer risks in the early observation stages, with a notable decrease in risk following treatment. The generalized Weibull distribution is therefore an ideal fit for scenarios that require a realistic depiction of baseline hazards, as it effectively captures the dynamic risk pattern over time, in line with the observations highlighted in [34].
The individual censoring times \(\:{C}_{i}\) were drawn from an exponential distribution, whose parameter was calibrated to obtain a censoring rate of approximately \(\:30\text{\%}\). The individual observed survival times were then determined as the minimum of the uncensored and censoring times, \(\:{T}_{i}=min\left({T}_{i}^{\text{*}},{C}_{i}\right)\), with the event status set accordingly. We also applied administrative censoring at seven years.
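The generation scheme above can be sketched end to end. This is an illustrative sketch, not the study's exact code: it substitutes a plain two-parameter Weibull baseline for the generalized Weibull, and the shape, scale, and censoring parameters are stand-ins rather than the calibrated values.

```python
import numpy as np

def simulate_dataset(n=700, beta=(0.1, 0.3, 0.6), shape=1.5, scale=5.0,
                     cens_mean=4.0, admin=7.0, seed=1):
    """X1 standard normal, X2 nonlinear in X1, X3 binary; event times by
    cumulative-hazard inversion; exponential censoring with an
    administrative cut at `admin` years."""
    rng = np.random.default_rng(seed)
    x1 = rng.standard_normal(n)
    x2 = x1 ** 2 + x1 + rng.uniform(0.0, 1.0, n)
    x3 = rng.binomial(1, 0.5, n).astype(float)
    lp = beta[0] * x1 + beta[1] * x2 + beta[2] * x3   # linear predictor
    u = rng.uniform(size=n)
    # Weibull baseline: H0(t) = (t/scale)^shape, so H0^{-1}(v) = scale*v^(1/shape)
    t_event = scale * (-np.log(u) * np.exp(-lp)) ** (1.0 / shape)
    t_cens = np.minimum(rng.exponential(cens_mean, n), admin)
    time = np.minimum(t_event, t_cens)
    status = (t_event <= t_cens).astype(int)
    return np.column_stack([x1, x2, x3]), time, status
```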
Missing values
All complete datasets were subjected to the generation of missingness in variables \(\:{X}_{2}\) and \(\:{X}_{3}\). Specifically, we introduced missing values under the MAR mechanism, which resulted in 30% of the observations in the datasets exhibiting missing values. The probability of missingness in variables \(\:{X}_{2}\) and \(\:{X}_{3}\) was determined by the linear predictor of a logistic model, which comprised the fully observed \(\:{X}_{1}\), the event status indicator, and the observed time [9].
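The MAR mechanism just described can be sketched as a logistic missingness model. The coefficients below are illustrative assumptions, not the values used in the study; in practice the intercept would be calibrated so that roughly 30% of observations end up with missing values in \(\:{X}_{2}\) and \(\:{X}_{3}\).

```python
import numpy as np

def mar_mask(x1, time, status, alpha=(-1.1, 0.6, 0.5, 0.05), seed=2):
    """Return a boolean mask of observations whose (X2, X3) are set
    missing; the probability of missingness follows a logistic model
    in the fully observed X1, the event indicator, and the observed
    time, so the mechanism is MAR (it never depends on X2 or X3)."""
    rng = np.random.default_rng(seed)
    lp = (alpha[0] + alpha[1] * np.asarray(x1)
          + alpha[2] * np.asarray(status) + alpha[3] * np.asarray(time))
    p_miss = 1.0 / (1.0 + np.exp(-lp))     # P(missing | observed data)
    return rng.uniform(size=len(p_miss)) < p_miss
```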
Imputation models
All imputation models included the fully observed covariate \(\:{X}_{1}\), the event status, the survival times, and the marginal Nelson-Aalen cumulative hazard, in accordance with the approach detailed in [35].
Estimation models
Irrespective of the method used for imputation, the substantive model (i.e., the model used in the post-imputation estimation step) was a multivariable Cox Proportional Hazards model [36]. For SI methods, a single model per dataset was estimated; for MI methods, we created ten imputed datasets (\(\:m=10\)) and combined the log-hazard ratios from the Cox models using Rubin’s rules [1].
Performance metrics
To compare the performance of the imputation methods, we considered three distinct classes of performance metrics, each based on different criteria. We describe the metrics in each class in the following sections.
Substantive model-free performance metrics
This set of metrics assesses the performance of imputation methods independently of any estimation model. They provide a direct measure of the quality of the imputed data.
Gower’s distance
Gower’s Distance serves as a measure to quantify the dissimilarity between datasets, as outlined in [20]. Specifically, it is adept at handling datasets composed of mixed variable types (i.e., including both categorical and continuous variables). In our study, Gower’s Distance was employed to evaluate the discrepancy between two distinct datasets: one with complete cases and the other with imputed data. The calculation of Gower’s Distance, denoted as \(\:D\), between any two data points \(\:i\) and \(\:j\) is given as:
\(\:D\left(i,j\right)=\frac{\sum_{k=1}^{p}{\delta}_{ijk}\,{d}_{ijk}}{\sum_{k=1}^{p}{\delta}_{ijk}}\)
where \(\:p\) is the total number of variables within the dataset; \(\:{\delta}_{ijk}\) is a binary indicator that is assigned a value of 1 when both data points \(\:i\) and \(\:j\) possess a non-missing value for variable \(\:k\), and 0 when either data point has a missing value for that variable; finally, \(\:{d}_{ijk}\) represents the normalized distance between data points \(\:i\) and \(\:j\) for variable \(\:k\).
For continuous variables, \(\:{d}_{ijk}\) is often calculated as the absolute difference between the values for \(\:i\) and \(\:j\), normalized by the range of the variable across all data points. For categorical variables, \(\:{d}_{ijk}\) is typically 0 if \(\:i\) and \(\:j\) have the same category for variable \(\:k\), and 1 otherwise.
To obtain the overall Gower’s distance for the dataset, we calculate \(\:D\left(i,j\right)\) for all pairs of data points and average the pairwise dissimilarities across all pairs, which can be expressed as:
\(\:\bar{D}=\frac{2}{n\left(n-1\right)}\sum_{i<j}D\left(i,j\right)\)
where \(\:n\) is the total number of data points in the dataset. The factor \(\:\frac{2}{n\left(n-1\right)}\) ensures that the average is computed correctly by considering all unique pairs without redundancy. This provides a comprehensive measure of the dissimilarity between the complete (original) and imputed datasets.
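The pairwise average can be computed directly. This sketch assumes complete, numerically coded data (categoricals as integers) and equal variable weights; comparing the value obtained on the complete dataset with the value obtained on the imputed one gauges how well imputation preserved the dissimilarity structure.

```python
import numpy as np

def overall_gower(data, is_cat):
    """Average Gower dissimilarity over all unique pairs of rows,
    i.e. (2 / (n(n-1))) * sum_{i<j} D(i, j)."""
    data = np.asarray(data, float)
    ranges = data.max(axis=0) - data.min(axis=0)
    ranges[ranges == 0] = 1.0                      # guard constant columns
    n = len(data)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            # 0/1 mismatch for categoricals, scaled |diff| for continuous
            d = np.where(is_cat, (data[i] != data[j]).astype(float),
                         np.abs(data[i] - data[j]) / ranges)
            total += d.mean()
    return 2.0 * total / (n * (n - 1))
```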
Normalized root mean squared error (NRMSE)
For continuous variables, the NRMSE is utilized to quantify the divergence of imputed values from their actual counterparts [37]. This metric serves as an indicator of imputation accuracy, with a lower NRMSE signifying greater precision in imputation.
\(\:\text{NRMSE}=\sqrt{\frac{mean\left({\left({X}_{\text{true}}-{X}_{\text{imp}}\right)}^{2}\right)}{var\left({X}_{\text{true}}\right)}}\)
Here, \(\:{X}_{\text{true}}\) represents the actual values of the covariate, and \(\:{X}_{\text{imp}}\) denotes the corresponding imputed values. This normalized measure offers a standardized assessment of the deviation between the original and the imputed values.
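The computation is a one-liner over the imputed entries. This sketch uses the variance-based normalization (as in missForest); note that other NRMSE variants divide by the range or the mean of the true values instead.

```python
import numpy as np

def nrmse(x_true, x_imp, miss):
    """NRMSE over the imputed entries only: RMSE of (true - imputed),
    normalized by the variance of the true values."""
    x_true, x_imp = np.asarray(x_true, float), np.asarray(x_imp, float)
    mse = np.mean((x_true[miss] - x_imp[miss]) ** 2)
    return float(np.sqrt(mse / np.var(x_true[miss])))
```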
Proportion of falsely classified (PFC)
For categorical covariates, the PFC is a metric designed to evaluate the accuracy of imputation. It measures the proportion of instances in which the imputed categorical values deviate from their original counterparts.
\(\:\text{PFC}=\frac{1}{n}\sum_{i=1}^{n}I\left({X}_{\text{true},i}\ne{X}_{\text{imp},i}\right)\)
where \(\:n\) represents the total number of observations; \(\:{X}_{\text{true},i}\) and \(\:{X}_{\text{imp},i}\) denote the actual and imputed values for the \(\:i\)th observation, respectively; and \(\:I\left(\cdot\right)\) is an indicator function, equaling 1 when the actual and imputed values differ, and 0 otherwise.
The PFC metric therefore provides a straightforward and intuitive measure of imputation accuracy specifically tailored to categorical data, quantifying the frequency of misclassification introduced through the imputation process.
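The metric is straightforward to compute; a sketch (our own helper name) follows directly from the definition above:

```python
def pfc(x_true, x_imp):
    """Proportion of falsely classified: the fraction of imputed
    categorical values that disagree with the original ones."""
    mismatches = sum(1 for t, p in zip(x_true, x_imp) if t != p)
    return mismatches / len(x_true)
```

For instance, one mismatch among four imputed values yields a PFC of 0.25.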
Post-imputation bias, accuracy, and reliability of regression estimates
This set of metrics evaluates the impact of imputation on the bias, accuracy, and precision of regression model estimates.
Post-imputation bias
Post-imputation bias is a metric used to evaluate the extent to which the imputation process influences the estimation of regression coefficients. It is computed by comparing the estimated coefficients derived from imputed data with those obtained from the all-cases data.
Where \(\:{\beta\:}_{\text{true}}\) is the coefficient estimated from the all-cases data; and \(\hat{\beta}_{\text{imp}}\) is the average estimated coefficient for a given covariate obtained from the imputed data.
This measure quantifies the deviation in the regression coefficients due to the imputation of missing data, thereby assessing the impact of the imputation technique on the regression analysis. A smaller post-imputation bias indicates that the imputation process introduces minimal distortion into the regression estimates, suggesting a more accurate and reliable imputation methodology.
Empirical standard error (empirical SE)
The empirical SE is a statistical metric that quantifies the variability of estimated regression coefficients derived from imputed data. It serves as an indicator of the precision of these estimates, with a smaller SE suggesting greater precision.
Where \(\:m\) represents the number of imputed datasets; \(\beta_{\text{imp},k}\) is the estimated coefficient from the \(\:k\)th imputed dataset; and \(\hat{\beta}_{\text{imp}}\) is the average of the estimated coefficients across all \(\:m\) imputed datasets.
This measure essentially calculates the standard deviation of the regression coefficients across multiple imputed datasets, providing insight into the spread or dispersion of the coefficient estimates. A lower empirical SE implies that the coefficient estimates across different imputed datasets are more consistent, indicating a higher level of reliability and stability in the imputation process.
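A sketch of the calculation follows; we assume the sample (m − 1) denominator, though the m denominator is also seen in the literature:

```python
import math

def empirical_se(beta_estimates):
    """Empirical standard error: the standard deviation of one
    coefficient's estimates across the m imputed datasets
    (sample variance with an m - 1 denominator)."""
    m = len(beta_estimates)
    mean_beta = sum(beta_estimates) / m
    ss = sum((b - mean_beta) ** 2 for b in beta_estimates)
    return math.sqrt(ss / (m - 1))
```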
Empirical coverage rate (ECR)
The ECR is a metric used to evaluate the accuracy of imputed data in terms of statistical inference. It measures how often the 95% CI, calculated from the imputed data, encompasses the true regression coefficients.
Where \(\:p\) is the total number of coefficients being estimated; \(\:{\beta\:}_{\text{true},k}\) represents the true value of the \(\:k\)th coefficient; \(\:{\text{CI}}_{95\text{\%},k}\) is the 95% confidence interval for the \(\:k\)th coefficient, as calculated from the imputed data; finally, \(\:I\left(\cdot\:\right)\) is an indicator function that equals 1 if the true coefficient \(\:{\beta\:}_{\text{true},k}\) falls within the corresponding 95% confidence interval \(\:{\text{CI}}_{95\text{\%},k}\), and 0 otherwise.
The ECR quantifies the proportion of times these confidence intervals accurately capture the true parameter values. A value close to 95% indicates that the confidence intervals derived from the imputed data are reliable and effectively represent the uncertainty surrounding the parameter estimates. This metric is crucial for assessing the validity of statistical inferences made from imputed datasets.
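Given the true coefficients and the 95% intervals computed from the imputed data, the ECR is the fraction of intervals that contain their target, as sketched below (helper name and interval representation are our own):

```python
def ecr(beta_true, intervals_95):
    """Empirical coverage rate: the share of true coefficients that
    fall inside their 95% confidence intervals from the imputed data."""
    hits = sum(1 for b, (lo, hi) in zip(beta_true, intervals_95)
               if lo <= b <= hi)
    return hits / len(beta_true)
```

A well-calibrated imputation method should yield an ECR close to the nominal 0.95.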
Post-imputation predictive accuracy
This set of metrics assesses how well a predictive model performs when trained on imputed data, compared to its performance on all cases.
Time-dependent area under the ROC curve (AUC)
The AUC is a metric used to assess the discriminatory power of a predictive model in a time-dependent context, which is particularly relevant in cancer survival analysis. This measure evaluates the ability of the model to correctly distinguish between different outcome classes (e.g., event vs. no event) at various time points [38]. The time-dependent AUC is calculated as follows.
Where \(\:\text{AUC}\left(t\right)\) is the area under the ROC curve up to time \(\:t\); \(\:\text{Sensitivity}\left(u\right)\) is the true positive rate (sensitivity) at time \(\:u\); and \(\text{Specificity}'\left(u\right)\) is the derivative of the true negative rate (specificity) with respect to time at \(\:u\).
In practical terms, this metric integrates the sensitivity and the rate of change of specificity over time, providing a comprehensive measure of the model’s performance in accurately classifying individuals at risk over the entire follow-up period. A higher time-dependent AUC indicates better discriminative ability of the model at different time points, which is crucial for the accurate prediction of cancer survival outcomes in clinical research.
C-index
The C-index, or concordance index, is a widely used metric in cancer survival analysis to evaluate the concordance between predicted and observed outcomes. It provides a measure of the predictive accuracy of a model, particularly in the context of censored survival data [39]. The C-index is computed as follows.
Where \(\:{T}_{i}\) and \(\:{T}_{j}\) are the observed survival times for pairs of individuals \(\:i\) and \(\:j\), respectively; \(H_{i}\) and \({H}_{j}\) are the predicted instant hazards (or risks) for individuals \(\:i\) and \(\:j\), respectively; and \(\:I\left(\cdot\:\right)\) is an indicator function that equals 1 if the condition is true and 0 otherwise.
The C-index quantifies the proportion of all usable patient pairs in which the predictions and outcomes are concordant. A pair is considered usable if one of the pair’s members experiences the event of interest and the other is either censored or experiences the event at a later time. A C-index of 0.5 suggests no predictive discrimination (random chance), while a C-index of 1.0 indicates perfect discrimination. In practice, a high C-index indicates that the model has good predictive accuracy, successfully ranking individuals in the order of their observed times to the event, which is particularly important in the analysis of imputed datasets in survival studies.
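The pairwise logic just described can be sketched as a small Harrell-style implementation. This is an illustration, not the estimator of [39]; the tie-handling choices (tied risks count 0.5, pairs with tied observed times are skipped) are our own simplifications.

```python
def c_index(times, events, risks):
    """Harrell's concordance index for right-censored data.

    A pair is usable when the member with the shorter observed time
    experienced the event; it is concordant when that member also has
    the higher predicted risk."""
    concordant = 0.0
    usable = 0
    n = len(times)
    for i in range(n):
        for j in range(i + 1, n):
            # orient the pair so that subject `a` has the shorter time
            a, b = (i, j) if times[i] < times[j] else (j, i)
            if times[a] == times[b] or not events[a]:
                continue  # tie in time, or earlier subject censored
            usable += 1
            if risks[a] > risks[b]:
                concordant += 1.0
            elif risks[a] == risks[b]:
                concordant += 0.5  # tied predictions count half
    return concordant / usable
```

With risks perfectly ordered against survival times the index is 1.0; reversing the ordering gives 0.0.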
Importantly, in our evaluation approach, each of these metrics offers a unique perspective on the performance of imputation methods, covering aspects from the precision of imputed data to the impact on subsequent statistical analyses and predictive modeling. This comprehensive evaluation ensures a robust understanding of the strengths and limitations of the different imputation methods studied here.
Results
Simulation study: performance metrics
Substantive model-free performance metrics
Gower’s distance
The evaluation of imputed versus fully observed datasets revealed that CART and missCforest yielded the most accurate results. In other words, they exhibited the smallest Gower’s distance values, at 0.0128 and 0.013, respectively, indicating high fidelity of their imputations to the fully observed datasets (Fig. 1).
NRMSE
In the context of the continuous covariate (\(\:{X}_{2}\)), missForest and miceCART outperformed the other six methods, with the smallest NRMSE values at 0.2964 and 0.3065, respectively, demonstrating their effectiveness at imputing continuous data (Fig. 1).
PFC
For the binary covariate (\(\:{X}_{3}\)), missMDA and CART were the most effective methods, achieving the lowest PFC values at 0.4231 and 0.4328, respectively, suggesting superior performance in accurately classifying binary data (Fig. 1).
Post-imputation bias and accuracy of regression estimates
Bias
For the continuous covariate (\(\:{X}_{2}\)), CART and miceRF showed minimal bias, with bias values of 0.0012 and −0.0042 and relative biases of 0.3931% and −1.4110%, respectively. Conversely, missMDA exhibited the highest bias at 0.0455, with a relative bias of 15.1613% (Table 3). For the binary covariate (\(\:{X}_{3}\)), miceRF and miceCART were the least biased methods, with bias values of 0.005 and −0.0077 and relative biases of 0.8383% and −1.29%, respectively, while missMDA showed the highest bias at 0.1780, with a relative bias of 29.6638% (Table 4).
Efficacy
The empirical SE indicated that all imputation methods yielded more efficient estimates than CCA, which produced the least efficient estimates on average (Tables 3 and 4).
ECR
Except for miceCART and miceRF, all six other imputation methods provided suboptimal ECR for both covariate types. miceCART achieved 95.6% and 96% coverage for the continuous (\(\:{X}_{2}\)) and binary (\(\:{X}_{3}\)) covariates, respectively, while miceRF ensured 97% and 94.8% coverage, respectively (Tables 3 and 4).
Post-imputation predictive accuracy of Cox models
AUC and C-index
In terms of predictive accuracy, missMDA and missCforest provided the best results: missMDA attained an AUC of 0.734 and a C-index of 0.675, showcasing high predictive power, while missCforest had an AUC of 0.726 and a C-index of 0.667. In contrast, missRanger scored lowest on both the AUC (0.704) and C-index (0.652) metrics (Fig. 2).
Overview of the three types of performance metrics
For the continuous covariate, the plot (Fig. 3) visualizes the trade-offs between bias and NRMSE for each imputation method, with point size proportional to the C-index. The SI methods CART and missForest maintained a delicate equilibrium, combining low bias and low NRMSE with a moderate to high C-index. Conversely, although missMDA presented higher bias and NRMSE, it achieved a high C-index, as indicated by its larger point size. Additionally, the two MI methods, miceCART and miceRF, stood out by providing the least biased estimates, although miceRF exhibited a relatively high NRMSE.
For the binary covariate, the plot (Fig. 4) visualizes the trade-offs between bias and PFC for each imputation method, with point size reflecting the C-index value. Notably, the SI methods missForest and KNN exhibited balanced performance, with relatively low bias and moderate PFC while maintaining a moderate to high C-index. At the other end of the spectrum, missMDA demonstrated stronger bias but compensated with a low PFC and a high C-index (reflected by its larger point size). Moreover, the MI methods miceCART and miceRF provided the least biased estimates but relatively high PFC values.
Application to the motivating example: Cox PH model with imputed dataset
In this section we describe the application of our work to a real-world breast cancer study. We used the Cox PH model to estimate the effect of one or several variables on the time to event occurrence without requiring the specification of the underlying hazard function. This model is defined as:
where \(\:h\left(t\right)\) is the instantaneous hazard function at time \(\:t\), \(\:{h}_{0}\left(t\right)\) the baseline hazard at time \(\:t\), \(\:\beta\:\) is a column vector of coefficients \(\:{\left({\beta\:}_{1},{\beta\:}_{2},\ldots,{\beta\:}_{p}\right)}^{T}\), and \(\:X\) is a column vector of covariates \(\:{\left({X}_{1},{X}_{2},\ldots,{X}_{p}\right)}^{T}\). These coefficients describe how the risk of the event changes per one-unit increase in the covariates: a positive \(\:\beta\:\) reflects a higher risk, while a negative one reflects a lower risk. The model assumes proportional hazards and log-linearity, indicating constant covariate effects over time and linear relationships on the log-hazard scale. In our application, we used a simple Cox model without interaction terms and assumed proportional hazards, as reported in the original article [19].
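The hazard ratios reported below follow directly from the model definition: under the Cox PH model, the HR for a one-unit increase in a covariate is exp(β). A minimal sketch of this relationship (helper names are ours, for illustration only):

```python
import math

def hazard_ratio(beta, delta=1.0):
    """Hazard ratio for a `delta`-unit increase in one covariate:
    HR = exp(beta * delta) under the Cox PH model."""
    return math.exp(beta * delta)

def relative_hazard(beta, x):
    """h(t | X) / h0(t) = exp(beta' X): the covariate part of the
    model, constant over time under proportional hazards."""
    return math.exp(sum(b * xi for b, xi in zip(beta, x)))
```

An HR below 1 (negative β) indicates a protective effect; an HR above 1 (positive β) indicates increased risk.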
Our objective in this application was to highlight the variability in outcomes produced by CCA (i.e., with no imputation) and miceRF, as any variability could lead to contradictory epidemiological and clinical interpretations. We focused on five prognostic factors widely recognized in existing literature as follows: Radiotherapy, Lymph nodes, PgR, Hormone therapy, and ER status (see above) [40, 41].
To handle missing data in our motivating example, we used the miceRF algorithm for imputation. This method was shown in [9] to produce unbiased estimates with better ECR, and our simulation results confirmed this. Moreover, because the algorithm tends to yield less biased coefficient estimates with the most credible coverage rates, it is particularly effective in contexts where distortions in the estimated coefficients would have significant consequences for the validity and reliability of research findings.
With no imputation (i.e., CCA), a positive ER status was associated with a favorable prognosis (HR = 0.09, 95% CI: 0.02–0.38, p = 0.001), indicating a potential protective effect. Similarly, hormone therapy was also linked to a positive prognosis (HR = 0.11, 95% CI: 0.02–0.64, p = 0.015), acting as a protective factor. Furthermore, a negative PgR had a significant protective effect (HR = 0.01, 95% CI: 0.00–0.21, p = 0.003). In contrast, radiotherapy did not exhibit a significant impact on event-free survival (HR = 2.91, 95% CI: 0.48–17.50, p = 0.244) (Table 5).
Interestingly, pre-imputation results (i.e., CCA) diverged significantly from post-imputation findings after imputing with miceRF. For example, after imputation, radiotherapy was significantly associated with better event-free survival (HR = 0.49, 95% CI: 0.31–0.78, p = 0.003), suggesting a protective role. Moreover, unlike the CCA results, ER after imputation no longer had a significant prognostic effect (HR = 0.82, 95% CI: 0.47–1.47, p = 0.493). This was also true for PgR status (HR = 0.79, 95% CI: 0.40–1.59, p = 0.509), indicating no clear impact on event-free survival. Similarly, hormone therapy no longer showed a significant prognostic effect after imputation (HR = 0.88, 95% CI: 0.51–1.51, p = 0.635). However, a significant effect on event-free survival was observed in the case of N3 lymph node status (HR = 2.64, 95% CI: 1.27–5.5, p = 0.011) (Table 5).
Discussion
Missing data imputation methods that are easy to implement but which are potentially biased, such as CCA, are still commonly used, even if they rely on strong assumptions and despite the availability of potentially more appropriate techniques [42]. In order to investigate this issue, in the present study, we evaluated the performance of eight ML imputation methods for missing data through simulations using several performance metrics.
An imputation method is typically deemed superior when it demonstrates certain key characteristics: minimal bias and relative bias in regression coefficient estimates, coverage rates closely aligning with the nominal coverage probability, and lower values of PFC and NRMSE. However, an important distinction arises between SI and MI methods: SI methods often produce more biased regression coefficients than MI methods, yet tend to outperform them in terms of predictive accuracy. Therefore, it is essential to carefully consider the specific performance metrics being used to evaluate imputation methods. These metrics can range from the precision of regression estimates and the predictive accuracy of the model to the accuracy of variable imputation itself [8]. The choice of one metric over another can significantly influence the perceived performance of an imputation method.
Our study revealed that for binary covariate imputation, the algorithms missMDA and CART were the most precise, whereas missForest and miceCART excelled in imputing continuous covariates. Concerning the regression estimates, while the lowest bias in imputation of continuous covariates was observed with CART and miceRF; for binary covariates, the lowest bias was found with miceRF and miceCART. Additionally, miceCART and miceRF were superior in achieving optimal ECR. With regard to predictive accuracy, missMDA and missForest stood out for their high accuracy.
Overall, our findings underline that Random Forest within an MI framework is superior to SI techniques and CCA. This finding justifies the choice of applying this method to breast cancer in a context where the prognostic effect of factors is of great importance.
The discrepancy found in the prognostic effects of the four covariates examined before and after imputation is striking. Indeed, this discrepancy could potentially lead to misleading clinical conclusions, if not properly accounted for. Accordingly, we strongly recommend choosing robust imputation methods when handling missing data in clinical prognostic studies. Of course this choice depends on whether the objective of the study itself is predictive or inferential in nature. Moreover, this discrepancy demonstrates the importance of explicitly indicating whether the analysis is based on complete cases or whether imputation methods were used, and if so, which one.
Although the impact of missing data on a study’s final results can be quite significant, many studies do not sufficiently address it. A review of time-to-event studies in oncology highlighted this issue [43]. Specifically, out of 148 studies reviewed, 79 (53%) reported using complete cases, compared to 33 (22%) for MI. Importantly, of all the studies, 128 (86%) did not specify the assumptions their chosen analysis method made regarding missing data. These findings would suggest that while missing data are a common issue in many studies, there seems to be a lack of transparency as to how they were addressed or what assumptions were made.
Surprisingly, empirical studies investigating the performance of different imputation methods use inherently different metrics. For instance, missForest was reported as an effective tool for handling missing values, providing highly accurate imputations and outperforming other common imputation methods. Yet these evaluations of missForest were based on its predictive accuracy [14, 44]. This approach may be insufficient, especially in epidemiological studies which aim to identify factors significantly associated with the outcome. An imputation method that merely minimizes the prediction error can be problematic, as it does not attempt to preserve the joint distribution of the data. Such an approach could lead to biased parameter estimates [8, 9], as found in our simulation study. Therefore, studies comparing imputation methods should systematically report post-imputation bias and its coverage rate, and not just post-imputation predictive accuracy.
Our comparative study provides a succinct summary of the performance metrics associated with the eight evaluated imputation methods. It focuses on minimizing bias and imputation errors while simultaneously aiming to maximize predictive accuracy. Indeed, the choice of metrics to use when an imputation method is selected greatly depends on the analytical goal, be it estimation or prediction. Accordingly, the appropriateness of using a metric in an imputation method is intrinsically tied to the specific objectives of the analysis.
SI methods are typically discouraged for handling cases where more than 10% of data is missing, as they tend to underestimate the variability in the data, which leads to inaccuracies in coverage rates [13]. However, when incorporated into an MI framework, such as CART combined with MICE (Multiple Imputation by Chained Equations), the application of SI methods becomes more feasible and straightforward. MICE works by iteratively fitting a predictive model to each variable with missing values, while using the other variables as predictors. In this context, any predictive model can be employed to estimate missing values. This flexibility is evident in the use of various MICE models, including miceCART and miceRF. This demonstrates that even methods which are not traditionally recommended in situations with a high level of missingness can be effectively adapted within MI.
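The chained-equations loop just described can be sketched in a few lines. This is a bare-bones illustration, not the actual `mice` implementation (which, among other things, draws imputations from predictive distributions to propagate uncertainty and produces multiple completed datasets). The `fit_predict` hook and the mean-predicting default learner are our own simplifications; plugging in a tree learner at that hook is precisely what distinguishes miceCART from miceRF.

```python
def mice_sketch(data, n_iter=5, fit_predict=None):
    """Minimal sketch of the chained-equations loop.

    `data` is a list of rows with None marking missing entries
    (every variable is assumed to have at least one observed value).
    `fit_predict(X_obs, y_obs, X_mis)` fits any model on the rows
    where the current variable is observed and predicts its missing
    entries; any learner (CART, random forest, ...) can be used.
    """
    n, p = len(data), len(data[0])
    miss = [[data[i][j] is None for j in range(p)] for i in range(n)]

    if fit_predict is None:
        # placeholder learner: predict the observed mean, ignoring X
        def fit_predict(X_obs, y_obs, X_mis):
            mean = sum(y_obs) / len(y_obs)
            return [mean] * len(X_mis)

    # start from a simple mean fill
    filled = [row[:] for row in data]
    for j in range(p):
        obs = [data[i][j] for i in range(n) if not miss[i][j]]
        col_mean = sum(obs) / len(obs)
        for i in range(n):
            if miss[i][j]:
                filled[i][j] = col_mean

    # chained equations: cycle through variables, re-imputing each
    # one from the current fill of all the others
    for _ in range(n_iter):
        for j in range(p):
            rows_mis = [i for i in range(n) if miss[i][j]]
            if not rows_mis:
                continue
            rows_obs = [i for i in range(n) if not miss[i][j]]
            X_obs = [[filled[i][k] for k in range(p) if k != j] for i in rows_obs]
            y_obs = [filled[i][j] for i in rows_obs]
            X_mis = [[filled[i][k] for k in range(p) if k != j] for i in rows_mis]
            for i, v in zip(rows_mis, fit_predict(X_obs, y_obs, X_mis)):
                filled[i][j] = v
    return filled
```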
Moreover, in the face of model specification difficulties and/or the presence of complex interactions among variables, non-parametric, tree-based algorithms (CART, Cforest, Random Forest) present an excellent alternative. These algorithms do not require the same assumptions as conventional parametric models, which makes them more flexible and better able to capture complex relationships in the data [8, 9, 17].
Furthermore, a promising approach to the challenge of misspecification in an imputation model is the heterogeneous ensemble imputation strategy. One such approach is Super Learning, a method grounded in heterogeneous ensemble learning that integrates multiple algorithms within a single meta-algorithm. Imputers such as MISL (Multiple Imputation by Super Learning) and SuperMICE exemplify the application of the Super Learner algorithm, which aggregates predictions from an array of base algorithms and thereby offers compelling properties [45, 46]. A more in-depth exploration is required to evaluate its potential advantages over the methodologies currently in use.
It is worth noting that our analyses and conclusions are based on the assumption that the data are MAR. If the missing data mechanism deviates from MAR, especially in the case of MNAR, the effectiveness of the imputation methods and the applicability of our conclusions may differ. While our results would not significantly change under an MCAR assumption, the generalizability of our findings could be limited under different missingness conditions, particularly MNAR, which is difficult to check in practice.
Future extensions of this research could focus on exploring the imputation performance of ML methods, especially in contexts where covariates exhibit time-varying and non-linear effects. These complex scenarios, which pose considerable challenges for traditional statistical methods, might significantly benefit from the adaptability and sophistication offered by ML algorithms.
Conclusion
Addressing missing data in observational time-to-event studies is crucial for preserving the integrity and accuracy of findings. Different performance metrics can evaluate imputation methods from various perspectives; this highlights the importance of method selection based on the study’s final objective. Whether the goal is inference, predictive performance, or minimizing original data distortion independent of any analysis model, the choice of a particular imputation method should be carefully considered.
When a significant predictor has a high proportion of missing values, the results should be interpreted as a foundation for hypothesis generation rather than as definitive conclusions. This nuanced approach highlights the necessity of a robust strategy for handling missing data in cancer survival studies.
Although CCA remains a common strategy, we encourage researchers to explore modern techniques, including ML algorithms like Random Forest within an MI framework. These advanced methods offer advantages over SI techniques and CCA by adeptly managing complex missing data scenarios. Adopting a consistent methodology for managing missing data can significantly reduce bias in parameter estimates, thereby enhancing the credibility, reliability, and robustness of research findings.
Data availability
All data used in the simulation part were generated following our specific simulation design. The necessary code to reproduce our simulation is available via https://github.com/ielbadisy/MLimputersBCsimulation. The data used in the working example were kindly provided by its original author.
References
Rubin DB. Inference and missing data. Biometrika. 1976;63:581–92.
Graham JW. Missing data analysis: making it work in the real world. Ann Rev Psychol. 2009;60:549–76.
White IR, Carlin JB. Bias and efficiency of multiple imputation compared with completecase analysis for missing covariate values. Stat Med. 2010;29:2920–31.
Little RJA, Rubin DB. Single imputation methods. In: Statistical analysis with missing data. 2002. pp. 59–74.
Rubin DB. Multiple imputation for nonresponse in surveys. John Wiley; Sons; 2004.
Sterne JAC, White IR, Carlin JB, Spratt M, Royston P, Kenward MG et al. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ. 2009;338.
White IR, Royston P, Wood AM. Multiple imputation using chained equations: issues and guidance for practice. Stat Med. 2011;30:377–99.
Hong S, Lynn HS. Accuracy of randomforestbased imputation of missing data in the presence of nonnormality, nonlinearity, and interaction. BMC Med Res Methodol. 2020;20:1–12.
Shah AD, Bartlett JW, Carpenter J, Nicholas O, Hemingway H. Comparison of random forest and parametric imputation models for imputing missing data using MICE: a CALIBER study. Am J Epidemiol. 2014;179:764–74.
Burgette LF, Reiter JP. Multiple imputation for missing data via sequential regression trees. Am J Epidemiol. 2010;172 9:1070–6.
Jerez JM, Molina I, García-Laencina PJ, Alba E, Ribelles N, Martín M, et al. Missing data imputation using statistical and machine learning methods in a real breast cancer problem. Artif Intell Med. 2010;50:105–15.
Lakshminarayan K, Harp SA, Goldman RP, Samad T. Imputation of missing data using machine learning techniques. In: KDD. 1996.
Schwender H. Imputing missing genotypes with weighted k nearest neighbors. J Toxicol Environ Health Part A. 2012;75:438–46.
Waljee AK, Mukherjee A, Singal AG, Zhang Y, Warren J, Balis U, et al. Comparison of imputation methods for missing laboratory data in medicine. BMJ Open. 2013;3:002847.
Tang F, Ishwaran H. Random forest missing data algorithms. Statistical Analysis and Data Mining: The ASA Data Science Journal. 2017;10:363–77.
Solaro N, Barbiero A, Manzi G, Ferrari PA. A simulation comparison of imputation methods for quantitative data in the presence of multiple data patterns. J Stat Comput Simul. 2018;88:3588–619.
Doove LL, Van Buuren S, Dusseldorp E. Recursive partitioning for missing data imputation in the presence of interaction effects. Comput Stat Data Anal. 2014;72:92–104.
Oberman HI, Vink G. Toward a standardized evaluation of imputation methodology. Biom J. 2024;66:2200107.
Slaoui M, Mouh FZ, Ghanname I, Razine R, Mzibri ME, Amrani M. Outcome of breast cancer in Moroccan young women correlated to clinico-pathological features, risk factors and treatment: a comparative study of 716 cases in a single institution. PLoS ONE. 2016;11:e0164841.
Gower JC. A general coefficient of similarity and some of its properties. Biometrics. 1971;27:857.
Kowarik A, Templ M. Imputation with the r package VIM. J Stat Softw. 2016;74:1–16.
Breiman L, Friedman J, Stone CJ, Olshen RA. Classification and regression trees. CRC; 1984.
Hastie T, Tibshirani R, Friedman JH. The elements of statistical learning: data mining, inference, and prediction. 2nd ed. Springer; 2009.
Doove LL, Buuren S, Dusseldorp E. Recursive partitioning for missing data imputation in the presence of interaction effects. Comput Stat Data Anal. 2014;72:92–104.
Josse J, Husson F. missMDA: a package for handling missing values in multivariate data analysis. J Stat Softw. 2016;70:1–31.
Mayer M. missRanger: fast imputation of missing values. R package; 2019.
Wright MN, Ziegler A. ranger: a fast implementation of random forests for high dimensional data in C++ and R. 2015.
El Badisy I. missCforest: ensemble conditional trees for missing data imputation. 2023.
Hothorn T, Hornik K, Zeileis A. ctree: conditional inference trees. Comprehensive R Archive Network. 2015;8.
Strasser H, Weber C. On the asymptotic theory of permutation statistics. 1999.
van Buuren S, Groothuis-Oudshoorn K. mice: multivariate imputation by chained equations in R. J Stat Softw. 2011;45:1–67.
Slade E, Naylor MG. A fair comparison of treebased and parametric methods in multiple imputation by chained equations. Stat Med. 2020;39:1156–66.
Bender R, Augustin T, Blettner M. Generating survival times to simulate cox proportional hazards models. Stat Med. 2005;24:1713–23.
Giorgi R, Belot A, Gaudart J, Launoy G. The performance of multiple imputation for missing covariate data within the context of regression relative survival analysis. Stat Med. 2008;27:6310–31.
White IR, Royston P. Imputing missing covariate values for the cox model. Stat Med. 2009;28:1982–98.
Cox DR. Regression models and lifetables. J Royal Stat Soc Ser B. 1972;34:187–202.
Oba S, Sato M, Takemasa I, Monden M, Matsubara K, Ishii S. A bayesian missing value estimation method for gene expression profile data. Bioinformatics. 2003;19:2088–96.
Chambless L, Diao G. Estimation of timedependent area under the ROC curve for long‐term risk prediction. Stat Med. 2006;25.
Gerds TA, Kattan MW, Schumacher M, Yu C. Estimating a timedependent concordance index for survival prediction models with covariate dependent censoring. Stat Med. 2013;32:2173–84.
Group EBCTC, et al. Effects of radiotherapy and of differences in the extent of surgery for early breast cancer on local recurrence and 15year survival: an overview of the randomised trials. Lancet. 2005;366:2087–106.
Karihtala P, Jääskeläinen A, Roininen N, Jukkola A. Prognostic factors in metastatic breast cancer: a prospective singlecentre cohort study in a Finnish university hospital. BMJ open. 2020;10:e038798.
Marshall A, Altman DG, Holder RL. Comparison of imputation methods for handling missing covariate data when fitting a cox proportional hazards model: a resampling study. BMC Med Res Methodol. 2010;10:1–10.
Carroll OU, Morris TP, Keogh RH. How are missing data in covariates handled in observational timetoevent studies in oncology? A systematic review. BMC Med Res Methodol. 2020;20:1–15.
Ramosaj B, Pauly M. Predicting missing values: a comparative study on nonparametric approaches for imputation. Comput Stat. 2019;34:1741–64.
Carpenito T, Manjourides J. MISL: multiple imputation by super learning. Stat Methods Med Res. 2022;31:1904–15.
Laqueur HS, Shev AB, Kagawa RMC. SuperMICE: an ensemble machine learning approach to multiple imputation by chained equations. Am J Epidemiol. 2022;191:516–25.
Acknowledgements
We extend our gratitude to Meriem Slaoui for generously providing the Moroccan Breast Cancer dataset, which was originally published as open data (https://datadryad.org/stash/dataset/doi:10.5061/dryad.2pc43). This dataset served as the motivating example in our study. Centre de Calcul Intensif d’AixMarseille is acknowledged for granting access to its High Performance Computing resource.
Funding
This research received no specific grant from any funding agency in the public, commercial, or notforprofit sectors.
Author information
Authors and Affiliations
Contributions
IE: Conceptualization, Formal Analysis, Writing, Review, and Editing. NG: Review and Editing. MK: Supervision. RG: Review and Editing, Supervision, Validation.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any noncommercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
El Badisy, I., Graffeo, N., Khalis, M. et al. Multimetric comparison of machine learning imputation methods with application to breast cancer survival. BMC Med Res Methodol 24, 191 (2024). https://doi.org/10.1186/s12874024023053
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12874024023053