Overview of our simulation
Figure 3 shows an overview of our simulation approach. We compare the performance of MICE and SEM on incomplete categorical (discrete) data, and compare both against doing nothing (i.e., using only the complete cases). The main steps are as follows:
1. Generate a random graph. This graph is referred to as the original structure in the final comparison step.
2. Sample data points from the random graph to get the complete data.
3. Introduce missing values to the complete data.
4. Learn the Bayesian network structure, either: (a) from all complete cases, (b) from the data set completed via MICE, or (c) using SEM.
5. Compare learned Bayesian network structures with the original structure.
We analysed networks with between 2 and 20 variables. For each number of variables, we analysed missing proportions from 0.1 to 0.6 in steps of 0.1. Each variable-number/missing-proportion combination was repeated 100 times. We completed the whole analysis for each of 1000, 5000 and 10,000 sampled data points.
Simulated data
Random networks and sampled data
We first generated a randomly connected network structure with the specified number of nodes (variables) using Ide and Cozman’s Generating Multi-connected DAGs (ic-dag) method of the function random.graph from R package bnlearn [22]. We set the maximum in-degree of any node to 3, and each node had 3 discrete levels. Descriptive statistics of these random network structures are shown in Additional file 1; as network size increased, the networks showed the expected trends of increasing out-degree, decreasing density and clustering, and increasing diameter. We obtained conditional probability tables (CPTs) for each node by generating random vectors from the Dirichlet distribution using the function rdirichlet from R package MCMCpack [23]. The parameter \(\alpha\) of the Dirichlet distribution was 0.5 for nodes with parents and 5 for nodes without parents. This provided our randomly parameterised BN. We then randomly sampled 1000, 5000 or 10,000 data points from the parameterised BN to get our sampled data, using the function rbn from R package bnlearn [22].
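The graph generation, parameterisation and sampling steps could be implemented along the following lines. This is a sketch only: the node count, level names and the exact construction of the CPT arrays (bnlearn places the child variable on the first dimension and matches dimensions by name) are our assumptions, not code from the study.

```r
library(bnlearn)
library(MCMCpack)

n.nodes <- 10  # illustrative; the study used 2 to 20
dag <- random.graph(nodes = LETTERS[1:n.nodes], method = "ic-dag",
                    max.in.degree = 3)

# Random CPT for each node: Dirichlet alpha = 0.5 with parents, 5 without.
levels3 <- c("a", "b", "c")
cpts <- lapply(nodes(dag), function(node) {
  pars  <- parents(dag, node)
  alpha <- if (length(pars) > 0) 0.5 else 5
  n.cols <- 3^length(pars)                    # one column per parent configuration
  probs  <- t(rdirichlet(n.cols, rep(alpha, 3)))
  dim(probs) <- rep(3, length(pars) + 1)      # child dimension first
  dimnames(probs) <- rep(list(levels3), length(pars) + 1)
  names(dimnames(probs)) <- c(node, pars)
  probs
})
names(cpts) <- nodes(dag)

fitted <- custom.fit(dag, dist = cpts)   # randomly parameterised BN
complete.data <- rbn(fitted, n = 1000)   # sampled (complete) data
```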
Missing data
For each missing data mechanism, we introduced different amounts of missing data to the sampled data using the function ampute from R package mice [24]. This function requires a complete data set and a set of missing patterns (i.e., which variable or variables are missing in a given case). We used the default missing pattern matrix for all simulations, in which there is one pattern per variable and exactly that variable is missing in each pattern. We also used the default relative frequency vector for the missing patterns, so that each pattern is equally likely to occur; thus the probability of being missing is equal across variables. The data are split into subsets, one for each missing pattern. Based on the probabilities of missingness, each case in each subset can be either complete or incomplete. Finally, the subsets are merged to generate the required incomplete data. The probability that a given value is removed in each subset depends on the specified missing proportion and missing data mechanism [25]:
MCAR The missingness is generated purely by chance. Each value in the sampled data has the same probability of being incomplete, and this probability is determined once the missing proportion is specified [25].
MAR The probability of each value being incomplete depends on a weighted sum score calculated from the values of the other variables. We used the default weights matrix in our simulation, in which all variables except the missing one contribute to the weighted sum score [25].
MNAR Amputation under MNAR shares most of its procedure with MAR. The only difference is that the candidate missing value itself also contributes to the probability of its own missingness [25].
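The amputation step above can be sketched as follows. Note that ampute expects numeric input, so the three-level factors are converted to integer codes first; the data object name is illustrative, and the default patterns, frequencies and weights are left in place as in our simulations.

```r
library(mice)

# ampute() requires numeric data: convert factors to integer codes.
numeric.data <- as.data.frame(lapply(complete.data, as.integer))

# Missing proportion 0.3 under MCAR; mech can also be "MAR" or "MNAR".
amp <- ampute(numeric.data, prop = 0.3, mech = "MCAR")
incomplete.data <- amp$amp   # the amputated (incomplete) data set
```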
Bayesian network structure learning
Throughout the study, we used the same BN structure learning procedure, whether applied to data before or after processing; that is, the procedure was identical for the methods “None”, “MICE” and “SEM” in Fig. 3. We used a score-and-search approach, with the BDe score [2] and the tabu search algorithm to search for the best network structure [26]. The imaginary sample size used by BDe was set to 1 (the default value). To test the impact of the scoring function, we also assessed structures learned using the BIC and BDs scores for one data set configuration (MNAR data, 1000 data points, 0.3 missingness; the BDs imaginary sample size was set to the default of 1; BIC used the default penalty coefficient of log(number of data points)*0.5). For “None” and “MICE”, we applied the tabu function from R package bnlearn [22]; for SEM the search was incorporated into the iterative steps as described below.
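The common structure learning step, including the scoring-function sensitivity check, could look like this (the input data frame name is illustrative):

```r
library(bnlearn)

# Score-and-search learning: tabu search with the BDe score,
# imaginary sample size (iss) set to 1.
learned.bde <- tabu(training.data, score = "bde", iss = 1)

# Sensitivity check with alternative scores on one configuration:
learned.bds <- tabu(training.data, score = "bds", iss = 1)
learned.bic <- tabu(training.data, score = "bic")  # default penalty log(n)/2
```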
No imputation
We used the complete cases of the simulated incomplete data for BN structure learning.
Structural EM
We applied the SEM algorithm to the incomplete data using the function structural.em from R package bnlearn [22]. We used the default imputation method (“parents”) in the E-step, which imputes missing values based on their parents in the current network. In the M-step, we applied tabu search with the BDe score for structure learning, and the default Maximum Likelihood parameter estimation (mle) for parameter learning. The maximum number of iterations was the default of 5.
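A sketch of this call, with the settings described above made explicit (the data frame name is illustrative):

```r
library(bnlearn)

sem.net <- structural.em(
  incomplete.data,
  maximize      = "tabu",                        # structure search in the M-step
  maximize.args = list(score = "bde", iss = 1),  # BDe score, imaginary sample size 1
  fit           = "mle",                         # parameter learning in the M-step
  impute        = "parents",                     # E-step imputation method
  max.iter      = 5                              # default maximum number of iterations
)
```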
Multiple Imputation by Chained Equations
As all the variables in this study were categorical and unordered, we used the polytomous logistic regression model for prediction, using the function mice from R package mice [24]. The number of iterations was the default of 5.
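A minimal sketch of the MICE step followed by structure learning. Which completed data set is passed on to learning is our assumption (the first is taken here for illustration); object names are illustrative.

```r
library(mice)
library(bnlearn)

# Impute with polytomous logistic regression (for unordered factors).
imp <- mice(incomplete.data, method = "polyreg", maxit = 5)

# Extract a completed data set and learn the structure from it.
completed.data <- complete(imp, 1)
learned <- tabu(completed.data, score = "bde", iss = 1)
```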
Evaluation of recovered network structures
To compare the learned BN structures with the original ones, we compared their skeletons using the functions compare and skeleton from R package bnlearn [22]. We compared skeletons, which represent all links in the network as undirected links, to deal with variation in link direction between different equivalence classes. We explored comparing equivalence classes directly, but a single missing or extra link could substantially change the equivalence class, giving erroneous results even for dependencies that were accurately recovered. For example, a link that was directed in the equivalence class of the simulated network could, due to a missing link elsewhere, be undirected in the equivalence class of the recovered network; this would be recorded not only as one missing link but also as an additional, incorrect, extra link. Comparing the undirected skeletons resolved this issue. We measured the performance of each method by computing precision and recall (sensitivity) from the comparison results. Precision measures how well a method avoids adding false arcs to the network, while recall measures how well a method recovers the arcs present in the original network. Their equations are as follows:
$$\begin{aligned} Precision = \frac{True\;Positive}{True\;Positive+False\;Positive} \end{aligned}$$
(1)
$$\begin{aligned} Recall = \frac{True\;Positive}{True\;Positive + False\;Negative} \end{aligned}$$
(2)
where \(True\;Positive\) counts arcs found that are present in the original structure, \(False\;Positive\) counts arcs found that are not in the original structure, and \(False\;Negative\) counts arcs in the original structure that were not found (Fig. 4).
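The skeleton comparison and the two metrics in Eqs. (1) and (2) can be computed as follows (network object names are illustrative):

```r
library(bnlearn)

# compare() returns true positive (tp), false positive (fp) and
# false negative (fn) counts; applied to skeletons, these count
# undirected links rather than directed arcs.
counts <- compare(target  = skeleton(original.dag),
                  current = skeleton(learned.dag))

precision <- counts$tp / (counts$tp + counts$fp)   # Eq. (1)
recall    <- counts$tp / (counts$tp + counts$fn)   # Eq. (2)
```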
We divided the networks into 6 groups by number of variables: 2-5, 6-8, 9-11, 12-14, 15-17 and 18-20. For each group, missing proportion and sample size, we performed a one-way ANOVA to test whether there were any statistically significant differences between the means of the three methods. We applied a Bonferroni correction to the resulting p-values to account for these multiple comparisons. If a variable group/missing proportion combination gave a significant Bonferroni-corrected result (p < 0.05), we performed Tukey’s honestly significant difference (HSD) test on the pairwise comparisons between the three methods. The same procedures were applied for both precision and recall.
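A sketch of this testing procedure for one variable-group/missing-proportion/sample-size cell; the data frame layout and the total number of comparisons are our assumptions.

```r
# results: one row per simulation run in this cell, with a numeric
# `recall` column and a factor `method` ("None", "MICE", "SEM").
fit   <- aov(recall ~ method, data = results)
p.raw <- summary(fit)[[1]][["Pr(>F)"]][1]

# Bonferroni correction over all cells tested (n.comparisons is illustrative).
p.adj <- p.adjust(p.raw, method = "bonferroni", n = n.comparisons)

if (p.adj < 0.05) {
  print(TukeyHSD(fit))   # pairwise comparisons between the three methods
}
```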
Evaluation of imputed data values
We explored the accuracy of MICE’s and SEM’s imputation using a subset of the simulations. We extracted the completed datasets from the last iteration of SEM and MICE for each missing mechanism (MCAR, MAR, MNAR) for 1000 data points at missing proportion 0.3, using 10 datasets each of 10 and 20 variables. We calculated the Hamming distance between each imputed dataset and the original (no missing values) simulated dataset. We performed Student’s t-test to test whether the mean Hamming distance from the original data differed significantly between the two methods.
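This evaluation can be sketched as follows; the lists of completed datasets and the exact normalisation of the Hamming distance are our assumptions.

```r
# Hamming distance: number of cells where the imputed data set
# differs from the original complete data.
hamming <- function(imputed, original) sum(imputed != original)

# sem.completed and mice.completed: lists of completed data sets
# extracted from the last iteration of each method (illustrative names).
d.sem  <- sapply(sem.completed,  hamming, original = complete.data)
d.mice <- sapply(mice.completed, hamming, original = complete.data)

# Test for a difference in mean Hamming distance between the methods.
t.test(d.sem, d.mice)
```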
Real-world data application
We used self-reported and nurse-collected data from the United States Health and Retirement Study (HRS) [27,28,29], a representative study of adults aged 50 and older. We merged the interview data (N = 42233) [27] collected in 2016, the harmonised data (N = 42233) [29] and the laboratory data (N = 7399) [28] collected in the same year. As we are focusing on imputation methods, we set any provided imputed values back to missing (so that our own methods could handle them). To ensure a representative sample of older respondents, and given the focus on multimorbidity, we excluded those aged below 50 (N = 279). To ensure biomarker and survey data were collected concurrently, we excluded respondents whose interviews were completed in 2017 or 2018 (N = 1394). Our analysis dataset consisted of 29 categorical variables, each with two to four levels. Supplementary Table 1 in Additional file 1 gives a detailed description of each variable. This cleaned subset contained 5726 observations, of which only 2688 cases were complete (corresponding to a missingness proportion of 0.53).
We applied the best-performing method, SEM (see Results), to this real-world data. Because the SEM algorithm includes random elements, we averaged across multiple repeats to capture the most complete picture of relationships among the real-world variables. To accomplish this, we set different random seeds using the base function set.seed in the R environment before applying the function structural.em from R package bnlearn [22] (using tabu search and the BDe score in the M-step, as above). In this way, we learned 100 network structures from the whole incomplete data set using the SEM algorithm. We determined the average network across the 100 repetitions based on the arc strengths of each learned structure, calculated from the completed partially directed acyclic graph using the function arc.strength, also from bnlearn. As the resulting arc strengths were strongly bimodal (see Results), we included in the final average network all links in the higher mode. While the resulting networks were partially directed, we show as results the skeletons – all links as undirected – because we do not wish to imply causal relationships between these measured variables; we are presenting statistical associations only.
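One way to implement this repetition-and-averaging scheme is sketched below. Here arc strength is taken as the frequency of each arc across the 100 CPDAGs via custom.strength and thresholded with averaged.network; the seeds, the data frame name and the threshold value are illustrative assumptions, not the study’s exact settings.

```r
library(bnlearn)

nets <- vector("list", 100)
for (i in 1:100) {
  set.seed(i)  # a different random seed per repetition
  nets[[i]] <- cpdag(structural.em(hrs.data,
                                   maximize      = "tabu",
                                   maximize.args = list(score = "bde", iss = 1),
                                   impute        = "parents"))
}

# Arc strength as the frequency of each arc across the 100 CPDAGs;
# keep arcs falling in the higher mode of the bimodal distribution.
strengths <- custom.strength(nets, nodes = names(hrs.data))
avg.net   <- averaged.network(strengths, threshold = 0.5)  # threshold illustrative
avg.skel  <- skeleton(avg.net)   # reported as undirected links only
```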
We then further explored relationships among the real-world variables by applying hierarchical divisive clustering from the R package igraph [30] to detect densely connected variables in the learned average network. This identifies community groups of nodes that are densely connected to each other but sparsely connected to the rest of the network, based on edge betweenness and ignoring edge directions.
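The clustering step can be sketched as follows; the conversion of the averaged bnlearn network to an igraph object and the object name are our assumptions.

```r
library(bnlearn)
library(igraph)

# Convert the averaged network to an undirected igraph object and
# detect communities by edge betweenness (hierarchical divisive clustering).
g <- as.undirected(as.igraph(avg.net))
communities <- cluster_edge_betweenness(g)

membership(communities)   # community assignment for each variable
```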