A wide range of missing imputation approaches in longitudinal data: a simulation study and real data analysis
BMC Medical Research Methodology volume 23, Article number: 161 (2023)
Abstract
Background
Missing data is a pervasive problem in longitudinal data analysis. Several single-imputation (SI) and multiple-imputation (MI) approaches have been proposed to address this issue. In this study, for the first time, the performance of the longitudinal regression tree algorithm as a nonparametric method, applied after imputing missing data using SI and MI approaches, was investigated using simulated and real data.
Method
Using different simulation scenarios derived from a real data set, we compared the performance of cross, trajectory mean, interpolation, copy-mean, and MI methods (27 approaches) for imputing missing longitudinal data, using parametric and nonparametric longitudinal models; the performance of the methods was also assessed on the real data. The real data included 3,645 participants older than 18 years across six waves of the longitudinal Tehran cardiometabolic genetic study (TCGS). The data modeling was conducted using systolic and diastolic blood pressure (SBP/DBP) as the outcome variables and included predictor variables such as age, gender, and BMI. The efficiency of the imputation approaches was compared using the mean squared error (MSE), root-mean-squared error (RMSE), median absolute deviation (MAD), deviance, and Akaike information criterion (AIC).
Results
The longitudinal regression tree algorithm outperformed the linear mixed-effects model (LMM) in terms of criteria such as MSE, RMSE, and MAD for analyzing the TCGS and simulated data under the missing at random (MAR) mechanism. Overall, based on fitting the nonparametric model, the performance of the 27 imputation approaches was nearly similar. However, the SI traj-mean method improved performance compared with the other imputation approaches.
Conclusion
Both SI and MI approaches performed better with the longitudinal regression tree algorithm than with the parametric longitudinal models. Based on the results from both the real and simulated data, we recommend that researchers use the traj-mean method for imputing missing values in longitudinal data. That said, the imputation method with the best performance depends heavily on the models of interest and the data structure.
Background
Longitudinal data collected from the same subjects over time are frequently used in observational studies and clinical trials. Traditional models for longitudinal data analysis are generalized linear mixed-effects models (LMM), marginal models like generalized estimating equations (GEE), and transitional models. In various types of studies, especially longitudinal ones, researchers frequently face significant challenges, such as missing data. During follow-up, some subjects may withdraw or become lost to follow-up at planned visits. Subjects who participate only during a particular study period may complete only a subset of the information [1, 2].
Most conventional statistical models deal only with complete cases, and missing data are omitted before fitting statistical models (this is the default in most statistical software programs and is called the listwise deletion method). Excluding these observations has disadvantages, including loss of information, loss of precision, reduction in statistical power, and potentially biased estimates [3]. Therefore, different approaches have been introduced to impute missing values; they can be classified as either single-imputation (SI) or multiple-imputation (MI).
MI methods for imputing missing data in software programs are based on two approaches: joint modeling (JM) and fully conditional specification (FCS). JM approaches for MI are based on the multivariate distribution or the joint distribution of the incomplete variables (often, the multivariate normal (MVN) distribution is assumed, and this is referred to as multivariate normal imputation (MVNI)) [4]. In FCS approaches, missing observations of each incomplete variable are imputed given all the other predictor variables, cycling iteratively through a sequence of univariate imputation models [5].
Several JM and FCS methods, such as JM-MVN (joint multivariate normal imputation) and FCS-standard approaches, have been proposed to handle missing values in cross-sectional studies [6,7,8,9,10,11]. These approaches are also appropriate for imputing missing values in balanced longitudinal data, where longitudinal measurements are obtained at fixed time intervals. In this case, the JM-MVN and FCS-standard methods treat time-dependent variables as distinct variables (wide format) for the imputation of balanced missing longitudinal data [3, 5].
Sometimes, however, longitudinal data are collected at unequal time intervals along with many longitudinal predictor variables. Standard JM-MVN and FCS methods cannot be used for imputing missing values in this case, because the large number of time-dependent predictor variables may lead to problems such as overfitting and multicollinearity among the distinct predictor variables.
However, ignoring the longitudinal and multilevel structures when imputing missing values of longitudinal and multilevel data may lead to biased inferences for the estimates of regression coefficients and their standard errors [12, 13]. Recently, several studies extended MI methods for imputing missing values in multilevel data [14, 15] and longitudinal data [4, 16, 17]. These extensions are also available in several software programs such as R [9, 18,19,20,21,22,23,24,25,26], Mplus [27], STATA [28, 29], Blimp [30], REALCOM-IMPUTE [31], SAS [32], and StatJR [33]. In addition, in 2013 and 2016, Genolini et al. introduced several SI approaches to impute monotone/dropout and non-monotone/intermittent missing data in longitudinal studies [34, 35].
A few studies compared MI approaches for imputing missing values in longitudinal studies [16, 17, 36]. These studies used parametric approaches like LMM as an analysis model of interest. However, it is uncertain how well the different MI approaches perform when the statistical model of interest is a nonparametric longitudinal model. In addition, there is no comparison of SI and MI approaches in the literature, where the target analysis is a nonparametric longitudinal model. Hence, the present study is the first to consider nonparametric estimation methods for longitudinal data analysis following missing data imputation with SI and MI approaches.
In this study, the nonparametric longitudinal method of interest is the longitudinal regression tree algorithm proposed by Sela et al. This algorithm is named the random effects expectation–maximization (REEM) tree algorithm [37].
The primary purpose of this study is to evaluate the performance of MI and SI approaches for imputing missing values in longitudinal data. The longitudinal data for this study were obtained from the longitudinal Tehran cardiometabolic genetic study (TCGS) to assess the association between diastolic/systolic blood pressure (DBP/SBP) and predictor variables such as age, gender, and body mass index (BMI).
Methods
Tehran cardiometabolic genetic study (TCGS)
Subjects of the study were extracted from TCGS, an ongoing cohort study based on the framework of the Tehran Lipid and Glucose Study (TLGS). TLGS is the first prospective cohort study in West Asia and was conducted in Tehran, the capital of Iran. This study was designed to assess the epidemiology of non-communicable diseases among participants from district 13 of Tehran, with 24 years of follow-up. The first (baseline) phase of the TLGS study started in February 1999, and the individuals were selected via a multistage stratified cluster random sampling method with follow-up every three years. The primary purpose of the TLGS is to determine the prevalence of cardiovascular diseases (CVD) and the risk factors for the development of these diseases. The design of the TLGS study has been reported elsewhere [38,39,40,41,42].
In the present study, some participants of TCGS were used to evaluate the association between DBP/SBP and predictor variables such as age, sex, and BMI [43, 44]. The predictor variables sex and age had no missing values (Table S1 in the supplementary file provides sample data for the first 20 individuals of TCGS). However, the BMI and outcome variables (DBP/SBP) had missing values in all six waves of the TCGS study. SI and MI approaches were used to impute the missing values of the incomplete variables. Parametric linear mixed-effects and nonparametric longitudinal regression tree models were used for longitudinal data analysis after imputing the missing data. The study structure for selecting individuals and the statistical analysis plan are shown in Fig. 1. In the following sections, each step of the data analysis is fully described.
Missing data mechanisms
Three missing data mechanisms are defined for generating missing data: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) [3]. We can distinguish between the mechanisms by defining the outcome \(Y = (Y^{\mathrm{observed}}, Y^{\mathrm{missing}})\) and the missing indicator R (1: observed and 0: missing). Each missing data mechanism suggests a general imputation approach.
If the missingness probability is not related to the observed and unobserved data, then data are MCAR. Under this mechanism, the distribution of missing data is \(P\left(\mathrm{R}\mid \mathrm{Y}\right)=P(\mathrm{R})\). While this assumption is unrealistic, the listwise deletion method is an unbiased method for dealing with missing data when it holds. Multiple imputation (MI) approaches are developed under the MAR assumption, which states that the probability of missingness is related to the observed data and does not depend on the unobserved data. In this case, the distribution of missing data is defined as \(P\left(\mathrm{R}\mid \mathrm{Y}\right)=P\left(\mathrm{R}\mid {\mathrm{Y}}^{\mathrm{observed}}\right)\), and MI methods can generate unbiased and efficient results [45]. Traditional models for longitudinal data analysis like GEE, and models based on maximum likelihood estimation like GLMMs, also lead to valid estimates when the missing mechanisms are MCAR and MAR, respectively [46].
The MNAR mechanism occurs when the probability of missingness is related to both observed and unobserved data, so that the distribution of missing data is \(P\left(\mathrm{R}\mid {\mathrm{Y}}^{\mathrm{observed}},{\mathrm{Y}}^{\mathrm{missing}}\right)\). Selection and pattern-mixture models have been introduced to handle MNAR data [47]. Assuming that the mechanism is MCAR is a very strong assumption; there are several tests to assess this assumption against the alternative of not-MCAR [48, 49]. The MAR assumption is the most common in practice, and MCAR can be tested against it. However, it is never possible to rule out the MNAR assumption. The missing data mechanism of multiple variables may be a mixture of any or all of the mechanisms described here.
Missing data patterns
In longitudinal studies, missing values arise from non-monotone/intermittent and monotone/dropout processes [50]. A non-monotone pattern is created when study information is not available for a subject at one time point, but the subject returns at a subsequent time point. A monotone pattern is the opposite: once a subject misses a particular follow-up, that subject is not available again. In practice, these two patterns can occur together for different measures.
SI approaches to impute missing values in longitudinal data
In SI approaches, a single value is estimated for each missing data point. In 2013 and 2016, Genolini et al. introduced several SI approaches to impute monotone/dropout and non-monotone/intermittent missing data in longitudinal studies [34, 35].
To understand the computational strategy of this section, consider a data set of n clusters (subjects), with a time-dependent variable recorded at t time points for each cluster. The trajectory of cluster i is defined as the sequence \(y_{i.} = (y_{i1}, y_{i2}, \dots, y_{it})\), and the cross-sectional measurement at a particular time point k as the vector \(y_{.k} = (y_{1k}, y_{2k}, \dots, y_{nk})\). Let \(y_{ik}\) denote a missing value for cluster i at a specific time point k. Then \(y_{ik}\) is non-monotone missing if there exist time points a < k < b such that \(y_{ia}\) and \(y_{ib}\) are not missing, and \(y_{ik}\) is monotone missing if \(y_{ih}\) is missing for all time points h > k.
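As an illustration, this distinction can be sketched in a few lines of Python (the study's own code is in R; `classify_missing` is a hypothetical helper, not from the paper), with NaN marking a missing value:

```python
import math

def classify_missing(traj, k):
    """Classify the entry at index k of a trajectory.

    Returns 'non-monotone' if observed values exist both before and
    after time k, 'monotone' if every later time point is also missing,
    and 'observed' if traj[k] is not missing. NaN marks a missing value.
    """
    if not math.isnan(traj[k]):
        return "observed"
    before = any(not math.isnan(v) for v in traj[:k])
    after = any(not math.isnan(v) for v in traj[k + 1:])
    if before and after:
        return "non-monotone"
    if all(math.isnan(v) for v in traj[k + 1:]):
        return "monotone"
    return "other"  # e.g. missing values at the start of the trajectory
```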
SI methods are classified into three imputation classes: cross-sectional (methods such as cross-mean, cross-median, and cross-hot deck), longitudinal (methods such as traj-mean, traj-median, traj-hot deck, last observation carried forward (LOCF), next observation carried backward (NOCB), interpolation-LOCF, interpolation-global, interpolation-local, and interpolation-bisector), and cross-sectional-longitudinal (methods such as copy-mean LOCF, copy-mean global, copy-mean local, and copy-mean bisector). The cross-sectional imputation methods use the data observed at a specific time (across clusters) to replace the missing values at that time, whereas the longitudinal imputation methods use the observed data of the same cluster to impute that cluster's missing values. The cross-sectional-longitudinal imputation methods utilize both cross-sectional information (\(y_{.k}\)) and longitudinal information (\(y_{i.}\)).
Cross methods
The cross methods include cross-mean, cross-median, and cross-hot deck. In the cross-mean method, \(y_{ik}\) is estimated by the mean of all values observed at the time point of interest. Likewise, the cross-median method uses the median of all values observed at the time point of interest. In the cross-hot deck method, \(y_{ik}\) is estimated by a value randomly selected from those observed at the time point of interest.
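A minimal Python sketch of the three cross methods follows (the paper's implementations are in R; the function name and list-of-trajectories data layout are illustrative assumptions):

```python
import math
import random

def cross_impute(data, method="mean"):
    """Impute missing values column-wise (across clusters at each time).

    data: list of trajectories (rows = clusters, columns = time points).
    method: 'mean', 'median', or 'hot deck' (random observed donor).
    """
    t = len(data[0])
    out = [row[:] for row in data]
    for k in range(t):
        observed = sorted(row[k] for row in data if not math.isnan(row[k]))
        for row in out:
            if math.isnan(row[k]):
                if method == "mean":
                    row[k] = sum(observed) / len(observed)
                elif method == "median":
                    m = len(observed)
                    row[k] = (observed[m // 2] if m % 2
                              else (observed[m // 2 - 1] + observed[m // 2]) / 2)
                else:  # hot deck: copy a randomly chosen observed value
                    row[k] = random.choice(observed)
    return out
```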
Traj methods
The traj methods include traj-mean, traj-median, and traj-hot deck. In the traj-mean method, \(y_{ik}\) is estimated by the mean of all values observed in the trajectory of interest \(y_{i.}\). The traj-median method uses the median of all values observed in the trajectory of interest. In the traj-hot deck method, \(y_{ik}\) is estimated by a value randomly selected from those observed in the trajectory of interest.
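The traj methods are the row-wise counterparts of the cross methods; a hedged Python sketch (illustrative only, not the paper's R code):

```python
import math
import random

def traj_impute(traj, method="mean"):
    """Impute the missing values of a single trajectory from its own
    observed values (traj-mean / traj-median / traj-hot deck)."""
    observed = sorted(v for v in traj if not math.isnan(v))
    if method == "mean":
        fill = lambda: sum(observed) / len(observed)
    elif method == "median":
        m = len(observed)
        fill = lambda: (observed[m // 2] if m % 2
                        else (observed[m // 2 - 1] + observed[m // 2]) / 2)
    else:  # hot deck: draw a random observed value per missing entry
        fill = lambda: random.choice(observed)
    return [v if not math.isnan(v) else fill() for v in traj]
```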
LOCF (last observation carried forward) and NOCB (next observation carried backward)
In the LOCF and NOCB methods, y_{ik} is estimated by the last and next observed value of the trajectory of interest, respectively.
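These two rules can be sketched as follows (an illustrative Python version, not the paper's R code):

```python
import math

def locf_nocb(traj):
    """LOCF: carry the last observed value forward; NOCB: carry the
    next observed value backward. Returns both imputed copies."""
    fwd = traj[:]
    for k in range(1, len(fwd)):           # forward pass (LOCF)
        if math.isnan(fwd[k]):
            fwd[k] = fwd[k - 1]
    back = traj[:]
    for k in range(len(back) - 2, -1, -1):  # backward pass (NOCB)
        if math.isnan(back[k]):
            back[k] = back[k + 1]
    return fwd, back
```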
Interpolation methods
Interpolation methods include four methods: interpolation-LOCF, interpolation-global, interpolation-local, and interpolation-bisector. In all interpolation methods, a non-monotone missing value (\(y_{ik}\)) is replaced by drawing a line between the values immediately surrounding \(y_{ik}\); the mathematical formula of this step is \({\mathrm{y}}_{\mathrm{ia}}+(\mathrm{k}-\mathrm{a})\frac{{\mathrm{y}}_{\mathrm{ib}}-{\mathrm{y}}_{\mathrm{ia}}}{(\mathrm{b}-\mathrm{a})}\), where \(y_{ia}\) and \(y_{ib}\) are the values immediately surrounding \(y_{ik}\). However, these methods have different strategies for dealing with monotone missing values. The interpolation-LOCF method uses the LOCF or NOCB methods for this purpose. In the interpolation-global method, a monotone missing value is replaced by drawing a line linking the first and last observed values. In the interpolation-local method, monotone missing values at the beginning of the trajectory are imputed by drawing a line linking the first and second non-missing values, and monotone missing values at the end of the trajectory are replaced by drawing a line linking the last and penultimate non-missing values. Finally, the interpolation-bisector method provides an intermediate between interpolation-global and interpolation-local: the imputed values are chosen on the bisector of the two lines.
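The interpolation step, with LOCF/NOCB handling of monotone gaps (i.e., the interpolation-LOCF variant), can be sketched in Python as follows (illustrative only; the paper's code is in R):

```python
import math

def interpolate_locf(traj):
    """Linear interpolation for intermittent gaps; LOCF at the end of a
    trajectory and NOCB at its start (the interpolation-LOCF variant)."""
    obs = [i for i, v in enumerate(traj) if not math.isnan(v)]
    out = traj[:]
    for k, v in enumerate(traj):
        if not math.isnan(v):
            continue
        prev = [i for i in obs if i < k]
        nxt = [i for i in obs if i > k]
        if prev and nxt:   # intermittent gap: line between surrounding values
            a, b = prev[-1], nxt[0]
            out[k] = traj[a] + (k - a) * (traj[b] - traj[a]) / (b - a)
        elif prev:          # monotone tail: LOCF
            out[k] = traj[prev[-1]]
        else:               # missing head: NOCB
            out[k] = traj[nxt[0]]
    return out
```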
Copymean methods
The copy-mean methods include copy-mean LOCF, copy-mean global, copy-mean local, and copy-mean bisector. The copy-mean LOCF method imputes missing observations in two steps. First, missing values are imputed using the LOCF method to provide an initial approximation. Then, the mean trajectory of the population is used to refine this initial approximation. Let:

\(({\overline{\mathrm{y}} }_{.1}, \dots ,{\overline{\mathrm{y}} }_{.\mathrm{t}})\): the mean trajectory of the population;

\(y_{id}\): the first missing value of the i-th trajectory;

\({\mathrm{y}}_{\mathrm{ik}}^{\mathrm{LOCF}}\): the imputed value for \(y_{ik}\) using the LOCF method, for all time points k ≥ d;

\(\left({\overline{y} }_{.1}^{\mathrm{LOCF}}, \dots ,{\overline{\mathrm{y}} }_{.\mathrm{t}}^{\mathrm{LOCF}}\right)\): the mean trajectory of the population after imputing the missing values with the LOCF method;

\(AV_k\): the average variation at the k-th time point, equal to \({\overline{y} }_{.k}-{\overline{\mathrm{y}} }_{.\mathrm{k}}^{\mathrm{LOCF}}\).
The missing value \(y_{ik}\) is then obtained in the copy-mean LOCF method by adding \(AV_k\) to the LOCF-imputed value: \({\mathrm{y}}_{\mathrm{ik}}^{\mathrm{LOCF}}+{\mathrm{AV}}_{\mathrm{k}}\).
The computational strategies of copy-mean local, copy-mean global, and copy-mean bisector are similar to that of copy-mean LOCF, except that these methods use the interpolation-local, interpolation-global, and interpolation-bisector methods, respectively, to provide the initial approximation of the missing values.
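A Python sketch of the two-step copy-mean LOCF procedure (illustrative only; it assumes each trajectory starts with an observed value so that LOCF is well defined):

```python
import math

def col_mean(data, k):
    """Mean of the observed values at time point k."""
    vals = [row[k] for row in data if not math.isnan(row[k])]
    return sum(vals) / len(vals)

def locf(traj):
    """Carry the last observed value forward through missing entries."""
    out = traj[:]
    for k in range(1, len(out)):
        if math.isnan(out[k]):
            out[k] = out[k - 1]
    return out

def copy_mean_locf(data):
    """Copy-mean LOCF: impute by LOCF, then shift each imputed value by
    AV_k, the gap between the observed population mean trajectory and
    the mean trajectory of the LOCF-completed data at time k."""
    t = len(data[0])
    locf_data = [locf(row) for row in data]
    av = [col_mean(data, k) - col_mean(locf_data, k) for k in range(t)]
    out = []
    for row, lrow in zip(data, locf_data):
        out.append([lrow[k] + av[k] if math.isnan(row[k]) else row[k]
                    for k in range(t)])
    return out
```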
MI approaches to impute missing values in longitudinal data
MI approaches were proposed by Rubin in 1987 [51] and are flexible and popular methods for imputing missing values based on three steps: an imputation step, an analysis step, and a pooling step [52, 53]. MI approaches use a Bayesian strategy, where posterior estimation is conducted using the Markov chain Monte Carlo (MCMC) method. These approaches can be divided into the two classes, JM and FCS, described in the subsequent sections.
In the imputation step, missing data are estimated several times by sampling through their posterior predictive distribution given the observed data and the parameter values of the imputation model. In the next step, multiple complete data sets produced via the imputation step are analyzed using the statistical model of interest. Finally, the results obtained from the analysis step, such as the estimates of regression coefficients with their standard error and criteria of predictive performance, are pooled using Rubin's rules to account for the uncertainty of the imputation [3, 51]. This process is shown graphically in Figure S1 (supplementary file).
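The pooling step can be illustrated for a single scalar parameter. The sketch below applies Rubin's rules, combining the within- and between-imputation variance components (`rubins_rules` is a hypothetical helper written for illustration, not a function from the cited packages):

```python
def rubins_rules(estimates, variances):
    """Pool point estimates and within-imputation variances from m
    imputed datasets: pooled estimate = mean of the estimates; total
    variance = within-variance + (1 + 1/m) * between-variance."""
    m = len(estimates)
    qbar = sum(estimates) / m                       # pooled estimate
    ubar = sum(variances) / m                       # within-imputation variance
    b = sum((q - qbar) ** 2 for q in estimates) / (m - 1)  # between-imputation
    total = ubar + (1 + 1 / m) * b
    return qbar, total
```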
Generating more than ten datasets positively affects statistical power [54]; Enders (2010) argues that generating 20 imputed datasets is appropriate [55]. Based on this, we generated 20 imputed datasets for the missing data in the TCGS study using MI approaches.
MI approaches have an attractive property because auxiliary variables can be included in the imputation step without being in the analysis step. Auxiliary variables provide information about missing data and improve the quality of missing data imputation [45]. Researchers also can use a larger number of these variables to reduce the negative effects of MNAR [26].
The MI approaches are based on the MAR assumption, and the inclusion of auxiliary variables in the imputation step raises the plausibility of this assumption. In these imputation approaches, the imputation model should include all the variables used in the analysis step; otherwise, the analysis may generate biased estimates. In real applications, imputation models with many variables can lead to problems such as multicollinearity and non-convergence [26].
In the present study, we used a number of MI approaches available in R software for missing data imputation; these approaches include joint multivariate LMM (JM-MLMM) [4], joint multivariate LMM with heteroskedastic covariance matrices across all clusters (JM-MLMM-het) [56], full joint (JM-FJ) [57], the substantive model compatible joint modelling approach (JM-SMC) [58], the substantive model compatible joint modelling approach with heteroskedastic covariance matrices across all clusters (JM-SMC-het) [59], FCS-LMM [26], FCS-LMM with heteroskedastic residual variance across all clusters (FCS-LMM-het) [26], FCS-LMM-latent normal (FCS-LMM-LN) [60], FCS-LMM-LN with heteroskedastic residual variance across all clusters (FCS-LMM-LN-het) [60], FCS-MLMM with latent normal variables (FCS-MLMM-LN) [22], hierarchical multiple imputation (hmi) [23], and joint analysis and imputation (JointAI) [61]. Each approach is explained in the following sections. In addition, we mention the number of iterations, the burn-in period, and the convergence criteria for each MI approach.
JM-MLMM method
Schafer and Yucel (2002) introduced the JM-MLMM method for imputing missing values in longitudinal data using joint multivariate linear mixed models instead of treating time-dependent variables as distinct variables. In this method, qualitative variables are imputed as continuous or dummy variables. The JM-MLMM method also assumes that the random effects follow a normal distribution with constant covariance matrices across all subjects (clusters) [4].
JM-FJ method
In real applications, longitudinal data with missing values may be a mixture of qualitative and quantitative variables, so the normality assumption for these incomplete variables may be unrealistic. Goldstein et al. (2009) suggested the JM-MLMM-LN method, based on the JM-MLMM method, which includes latent normal (LN) variables for imputing a mixture of normal and non-normal variables [57]. Asparouhov and Muthen (2010) proposed the JM-FJ method, based on the JM-MLMM-LN method, to impute missing values in longitudinal data using all variables in the imputation process as outcome variables.
These two methods and the JM-MLMM method are implemented in the mitml package [19]. A potential scale reduction criterion near 1 (< 1.05) for all parameters, together with diagnostic plots, was used to assess convergence. If the potential scale reduction criterion was larger than 1.05, the number of burn-in iterations was increased.
JM-SMC and JM-SMC-het methods
Goldstein et al. (2014) extended the JM-MLMM-LN method to the JM-SMC method by defining the joint imputation model as the product of the analysis model and the joint distribution of the variables [58]. The JM-SMC method can also accommodate cluster-specific covariance matrices; this variant is defined as the JM-SMC-het method. These methods use diagnostic plots to assess convergence.
In the present study, all JM approaches were run for 1000 iterations with a burn-in period of 5000 iterations to establish the stability of the parameter distributions.
FCS-LMM and FCS-LMM-het methods
FCS-LMM is an FCS adaptation of the JM-MLMM method proposed by van Buuren et al. (2011). This method fits a multilevel LMM to impute the missing values of each incomplete variable conditional on the other variables, cycling iteratively through the univariate imputation models. In this method, qualitative variables are imputed as continuous or dummy variables. The FCS-LMM method assumes normal distributions for all variables with missing values and a fixed residual variance across all subjects [26]. Van Buuren (2011) extended the FCS-LMM method to the FCS-LMM-het method to deal with heteroskedastic residual variance across clusters [14].
FCS-LMM-LN and FCS-LMM-LN-het methods
Enders et al. (2017) suggested the FCS-LMM-LN method by extending the FCS-LMM method to LN variables [60]. This method imputes missing data using a value randomly selected from the observed values whose predicted means, based on the LMM, are nearest to that of the missing data point. In the FCS-LMM-LN-het method, the continuous variables with missing values are imputed using an LMM with heteroskedastic residual variance across clusters.
FCS-MLMM-LN method
Audigier and Resche-Rigon modified the JM-MLMM-LN approach to impute missing observations within an FCS framework where only one variable is considered missing at a time [22]. At each step, all of the variables in the imputation model are treated as outcomes (one variable is treated as incomplete and the rest are considered complete). In this approach, incomplete binary and categorical variables are imputed using latent normal variables, as in the JM-MLMM-LN method.
All of the FCS approaches mentioned were run for 20 iterations with a burn-in period of 5 iterations to establish the stability of the parameter distributions (in these approaches, the estimates can converge within 5 or 10 iterations). In addition, diagnostic graphs were used to examine convergence [26].
Nonparametric longitudinal analysis method
The longitudinal tree-based methods are nonparametric methods for analyzing longitudinal data. Medical studies have used these methods to determine disease risk factors and identify high- and low-risk subgroups of patients [62, 63] by extracting homogeneous subgroups of observations that can be appropriately used for subgroup analysis [64]. Since most studies evaluating longitudinal changes in the outcome variable are conducted in the context of a heterogeneous population, traditional parametric longitudinal models might not provide a good fit and could potentially result in biased estimates. In addition, the actual values of the model parameters may differ between homogeneous subgroups. Because tree-based models can extract homogeneous subgroups of observations and estimate heterogeneous treatment effects, they may be better positioned to assist the clinician in decision-making [65, 66].
Unlike traditional parametric longitudinal models, these methods do not require assumptions about the functional form of the data and are robust to outliers and multicollinearity. They can accommodate non-linear relationships and high-order interactions. Monotone transformations of the predictor variables have no effect on the results. The interpretation of tree methods is straightforward because the results are shown graphically [67,68,69,70].
The classification and regression tree (CART) algorithm is the best-known tree algorithm for cross-sectional data modeling [71]. Sela et al. (2012) extended this tree algorithm to longitudinal data by combining the LMM and the CART algorithm. This longitudinal regression tree algorithm is named the random effects expectation–maximization (REEM) tree algorithm [37].
As described in the previous section, the LMM uses a parametric linear form for the fixed effects; this form cannot easily handle complex non-linear relationships or datasets with very large numbers of predictor variables. The REEM tree algorithm solves this problem by using a nonparametric method, the CART algorithm, to estimate the fixed effects. The estimation method of REEM is as follows:

1. Set the initial values of \({\widehat{b}}_{i}\) to zero.

2. Run the following steps until \({\widehat{b}}_{i}\) converges (convergence is established when the change in the likelihood or restricted likelihood is smaller than a predetermined tolerance value, e.g. 0.001):

a) Fit a regression tree with the CART algorithm to obtain an initial approximation of \(f\), using the adjusted response variable \({y}_{it}-{Z}_{it}{\widehat{b}}_{i}\) and the predictor variables \({x}_{it}=({x}_{it1}, \dots , {x}_{itK})\), for \(i = 1, \dots, I\) and \(t = 1, \dots, {T}_{i}\). This regression tree generates a set of indicator variables \(\mathrm{I}({x}_{it}\in {\mathrm{g}}_{\mathrm{p}})\), where \({\mathrm{g}}_{\mathrm{p}}\) ranges over all terminal nodes of the tree.

b) Fit the LMM \({y}_{it}={Z}_{it}{b}_{i}+\sum_{p}\mathrm{I}\left({x}_{it}\in {\mathrm{g}}_{\mathrm{p}}\right){\mu }_{p}+{\varepsilon }_{it}\) and estimate \({\widehat{b}}_{i}\) from the fitted model.

3. Replace the predicted response at each terminal node of the tree with the estimate \({\widehat{\mu }}_{p}\) from the LMM fitted in step 2b.
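The alternating structure of this algorithm can be shown with a deliberately simplified Python sketch: a depth-1 tree (stump) stands in for the CART fit, and each random intercept is re-estimated as the subject-mean residual instead of via a full LMM fit, so there is no shrinkage. This only illustrates the EM-style loop; it does not reproduce the REEM tree of Sela et al.:

```python
import numpy as np

def fit_stump(x, y):
    """Best single split on a 1-D predictor (a depth-1 regression tree).
    Assumes x has at least two distinct values."""
    best = (np.inf, None, y.mean(), y.mean())
    for c in np.unique(x)[:-1]:
        left, right = y[x <= c], y[x > c]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best[0]:
            best = (sse, c, left.mean(), right.mean())
    return best[1:]  # (cut, left_mean, right_mean)

def predict_stump(stump, x):
    cut, lm, rm = stump
    return np.where(x <= cut, lm, rm)

def reem_stump(x, y, subject, n_iter=50, tol=1e-6):
    """Simplified REEM-style loop with a random intercept per subject:
    alternate a stump fit on y minus the current intercepts with a
    subject-mean-residual update of the intercepts."""
    subjects = np.unique(subject)
    b = np.zeros(len(subjects))
    stump = None
    for _ in range(n_iter):
        bi = b[np.searchsorted(subjects, subject)]   # intercept per row
        stump = fit_stump(x, y - bi)                 # step 2a (tree fit)
        resid = y - predict_stump(stump, x)
        new_b = np.array([resid[subject == s].mean() for s in subjects])
        converged = np.max(np.abs(new_b - b)) < tol  # step 2 stopping rule
        b = new_b
        if converged:
            break
    return stump, dict(zip(subjects, b))
```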
Simulation study
The performance of the various imputation approaches was compared using simulated data. We generated 1000 datasets, each of which included 1000 individuals, mimicking the TCGS data. In each simulated dataset, variables were generated as follows (all parameters in the data-generating models were estimated from the original data to ensure that the simulated datasets were comparable to the real data example):

1. The sex variable was generated from a binomial distribution with probability 0.5.

2. Age at the first wave was generated from a truncated normal distribution with minimum = 1, maximum = 84, mean = 39.34, and standard deviation = 16.23. Age at the other waves was generated as follows:

3. The main predictor variable (BMI) at each wave was generated based on age and sex, together with an individual-level random effect and individual-level noise in each wave:

where \({\mathrm{\varnothing }}_{0\mathrm{i}}\sim N(0, 4.50)\) is the random intercept and \({\mathrm{\varnothing }}_{\mathrm{ij}}\sim N(0, 1.57)\) is the residual error.

4. The outcome variable, DBP, at each wave was generated using the following linear process:

where \({\gamma }_{0\mathrm{i}}\sim N(0, 6.35)\) is the random intercept and \({\upgamma }_{\mathrm{ij}}\sim N(0, 6.88)\) is the residual error.
After generating the simulated data, missingness for some observations of BMI and DBP was generated based on the MAR mechanism, using the following equations:
The parameters \(\beta_{0j}\), \(\beta_{1j}\), \(\beta_{2j}\), \(\varphi_{0j}\), \(\varphi_{1j}\), and \(\varphi_{2j}\) were determined based on the TCGS data to ensure a similar proportion of missing data for each variable at each wave; the proportions in both the TCGS study and the simulation study are shown in Table 1. These parameters are as follows:
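An MAR missingness step of this kind can be sketched as a logistic model on fully observed covariates; the coefficients below are illustrative placeholders, not the parameters estimated from the TCGS data:

```python
import math
import random

def mar_prob(age, sex, beta0=-3.0, beta1=0.03, beta2=0.5):
    """Probability of missingness from a logistic model on fully
    observed covariates (an MAR mechanism). The coefficients are
    illustrative placeholders, not the paper's estimates."""
    lp = beta0 + beta1 * age + beta2 * sex
    return 1.0 / (1.0 + math.exp(-lp))

def apply_mar(values, covariates, prob=mar_prob):
    """Mask each value with its covariate-dependent probability,
    returning a copy with NaN in place of the masked entries."""
    out = []
    for v, (age, sex) in zip(values, covariates):
        out.append(float("nan") if random.random() < prob(age, sex) else v)
    return out
```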
Criteria for comparing the performance of missing data imputation methods
The performance of the imputation approaches under the MAR mechanism, and of the statistical methods applied to the real and simulated data sets, was compared by evaluating the standard errors of the regression coefficients (SE), MSE, RMSE, MAD, deviance, and AIC. Smaller values of SE, MSE, RMSE, MAD, deviance, and AIC indicate better performance.
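For clarity, the prediction criteria can be computed as follows (an illustrative Python sketch; MAD is read here as the median of the absolute prediction errors, which is one common reading of "median absolute deviation"):

```python
import math

def prediction_criteria(y_true, y_pred):
    """MSE, RMSE, and MAD (median of the absolute errors) for
    comparing imputation/analysis pipelines on their predictions."""
    errors = [yt - yp for yt, yp in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / len(errors)
    abs_err = sorted(abs(e) for e in errors)
    m = len(abs_err)
    mad = (abs_err[m // 2] if m % 2
           else (abs_err[m // 2 - 1] + abs_err[m // 2]) / 2)
    return {"MSE": mse, "RMSE": math.sqrt(mse), "MAD": mad}
```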
Software programs
R software was used for imputing the missing longitudinal data and for data analysis; the R packages used are listed in Table 2. The R code of the SI and MI approaches for missing data imputation and data simulation is available at https://github.com/MinaJahangiri/Rcodesofmissingimputationmethods.
Results
The study's variables with missing values are BMI (a predictor variable) and DBP/SBP (the outcome variables). Figure 2 shows the frequency of missing values for these variables at each phase of the TCGS study. The descriptive statistics of the TCGS data are shown in Table S2 (supplementary file). The percentage of missing values in particular combinations of variables, based on the long data format, is shown visually in Figure S2 (supplementary file). Seventy-six percent of observations have no missing values, and 21% have missing values for BMI, DBP, and SBP simultaneously. The missing data pattern of the TCGS data indicated that both monotone and non-monotone patterns were present. The formal MCAR test indicated that the MCAR assumption is not reasonable (P < 0.001, so the MCAR assumption is rejected at a significance level of 0.05).
After the imputation of missing values, the data modeling results based on the LMM and the REEM longitudinal tree algorithm for the two outcome variables (DBP and SBP) are shown in Tables 3, 4, 5 and 6. According to Tables 3 and 5, all SI approaches except cross-median and cross-hot deck performed similarly with respect to the estimates of the regression coefficients and their standard errors.
The results shown in Tables 3, 4, 5 and 6 indicated that parametric longitudinal models are not appropriate for analyzing the TCGS data. There appear to be non-linear relationships (for the BMI variable), and diagnostic plots (standardized residuals versus the BMI variable, the quantile–quantile (Q–Q) plot of residuals, and standardized residuals versus fitted values for the linear mixed-effects model) show clearly that the assumptions of homoscedasticity and normality of the residuals do not hold. These assumptions for the LMM appear to have been violated regardless of the SI approach used; as an illustration, we show the diagnostic plots for the traj-mean method in Figures S3, S4, S5, S6, S7 and S8 (supplementary file). We compared the parametric and nonparametric models using SI approaches, and the parametric longitudinal models resulted in larger MSE, RMSE, and MAD than the longitudinal regression tree algorithm (Tables 3, 4, 5 and 6). Given these advantages of the nonparametric model, the comparison of the SI and MI imputation approaches is only reported based on the REEM tree algorithm.
When comparing SI approaches used in conjunction with the REEM tree algorithm, the trajmean method performed best (lowest MSE, RMSE, MAD, and deviance), and cross-hot-deck performed worst (Tables 4 and 6). The tree structures of the REEM tree algorithm, using trajmean to impute missing values for the two outcome variables DBP and SBP, are shown in Figures S9 and S10 (supplementary file).
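The trajmean idea, replacing a subject's missing measurement with the mean of that subject's own observed trajectory, is simple to express. A minimal pandas sketch (the paper's actual implementation uses R's longitudinalData package; the toy values here are invented) contrasts it with cross-sectional mean imputation, which instead pools other subjects at the same wave:

```python
import numpy as np
import pandas as pd

# Toy long-format data: one row per subject-wave (values are illustrative).
df = pd.DataFrame({
    "Id":   [1, 1, 1, 2, 2, 2],
    "wave": [1, 2, 3, 1, 2, 3],
    "SBP":  [120.0, np.nan, 130.0, 110.0, 112.0, np.nan],
})

# Trajectory-mean (trajmean) imputation: fill a missing value with the
# mean of the same subject's observed measurements.
df["SBP_trajmean"] = df.groupby("Id")["SBP"].transform(
    lambda s: s.fillna(s.mean())
)

# Cross-sectional mean imputation, for contrast: fill with the mean of
# all subjects observed at the same wave.
df["SBP_cross"] = df.groupby("wave")["SBP"].transform(
    lambda s: s.fillna(s.mean())
)
print(df)
```

For subject 1 at wave 2, trajmean gives (120 + 130) / 2 = 125, while the cross-sectional fill uses the wave-2 mean of the other subjects.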
Density plots of the observed and imputed data for the incomplete variables BMI, DBP, and SBP, obtained with the mice package, are shown in Figures S11, S12, S13 and S14 (supplementary file). Figure S15 (supplementary file) also demonstrates the trace line plots of the mean and standard deviation of the imputed values against the iteration number for each replication. These trace lines are intermingled without any particular trend, so the estimation appears to have converged. Due to space limitations, trace line plots for the other FCS approaches are not shown, though these plots also indicated convergence. In addition, the R-hat statistic of the means and variances for all incomplete variables based on the FCS-LMM-LN and FCS-LMM-LN-het methods is near one, so the convergence of these methods is also established (Table S3 in the supplementary file).
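The R-hat (potential scale reduction) criterion invoked here compares between-chain and within-chain variance; values near 1 indicate the parallel chains have mixed. A minimal numpy sketch of the classic Gelman-Rubin formula (a simplified version, without the split-chain or rank-normalization refinements used by modern samplers):

```python
import numpy as np

def rhat(chains: np.ndarray) -> float:
    """Potential scale reduction factor (Gelman-Rubin R-hat).

    chains: shape (m, n) -- m parallel chains, n draws each.
    Values near 1 suggest the chains have converged to a common distribution.
    """
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)          # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()    # mean within-chain variance
    var_plus = (n - 1) / n * W + B / n       # pooled variance estimate
    return float(np.sqrt(var_plus / W))

rng = np.random.default_rng(0)
mixed = rng.normal(0.0, 1.0, size=(4, 1000))   # well-mixed chains
stuck = mixed + np.arange(4)[:, None] * 3.0    # chains stuck at different levels
print(rhat(mixed), rhat(stuck))  # near 1 vs. clearly above 1
```

The intermingled trace lines described in the text correspond to the first case; separated, trending traces would push R-hat well above 1.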
When comparing FCS approaches, FCS-LMM-LN performed best and FCS-LMM/FCS-LMM-het performed worst (Tables 4 and 6). All JM approaches performed similarly (Tables 4 and 6). The diagnostic plots of the JM approaches indicate convergence, and the potential scale reduction criterion was near 1 for all parameters under the JM-FJ and JM-MLMM methods. Due to space limitations, diagnostic plots are included only for the JointAI and JM-MLMM methods, in Figures S16, S17, S18, S19 and S20 (supplementary file).
The simulation results were consistent with the real data analyses. The longitudinal regression tree algorithm performed better than the LMM for analyzing the simulated data under the missing at random (MAR) mechanism. In addition, the SI trajmean method performed better (lowest MSE, RMSE, and MAD) than the other imputation approaches (Tables 7 and 8). We did not assess bias because the longitudinal tree algorithm, unlike the LMM, does not produce regression coefficient estimates; instead, we based our evaluation of the methods on prediction performance.
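The prediction criteria themselves are straightforward to compute. A minimal Python sketch (MAD is implemented here as the median absolute deviation of the prediction errors; the paper does not spell out its exact convention, so treat this as one reasonable reading):

```python
import numpy as np

def prediction_metrics(y_true, y_pred):
    """MSE, RMSE, and MAD -- the criteria used to compare prediction quality."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    rmse = np.sqrt(mse)
    # Median absolute deviation of the errors around their median
    # (assumed convention -- robust to a few badly predicted points).
    mad = np.median(np.abs(err - np.median(err)))
    return {"MSE": mse, "RMSE": rmse, "MAD": mad}

# Toy example: three observed vs. predicted blood pressure values.
m = prediction_metrics([120, 130, 110], [118, 133, 111])
print(m)
```

Because MAD is median-based, it is far less sensitive than MSE/RMSE to occasional large errors, which is why reporting all three gives a fuller picture of each imputation-model combination.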
Discussion
Missing values are a significant problem in longitudinal studies, and managing this problem is essential. In the current study, we compared the performance of SI and MI approaches to impute longitudinal missing data in the context of using LMM and the REEM tree algorithm for data modelling. Previous studies have compared the performance of MI approaches when the statistical model of interest is a parametric longitudinal model; the performance of MI approaches when the statistical model of interest is a nonparametric longitudinal model is less well understood.
The current study provides a comprehensive assessment of imputation approaches for handling missing data in the TCGS dataset and in simulated data under the MAR mechanism. To this end, we compared the performance of 16 SI approaches and 12 MI approaches when fitting the REEM tree algorithm and the LMM to assess the association between DBP/SBP and predictor variables such as age, gender, and BMI. We also focused on the relevant R packages and provide R code for data modeling after using the SI and MI approaches, as well as for simulating missing longitudinal data.
The real and simulated data results suggest that the REEM tree algorithm can perform better than parametric longitudinal models. Tree algorithms have several advantages over parametric longitudinal models, and we propose that researchers use these methods in future longitudinal studies. These algorithms can accommodate large data sets, nonlinear relationships, and interactions, and can extract homogeneous subgroups of the data. Their interpretation is straightforward because the result is displayed graphically, and they are robust to multicollinearity and outliers. These algorithms are also invariant to monotone transformations of the independent variables and do not require additional distributional assumptions [67, 68, 72,73,74,75].
Generally, the comparisons of imputation methods indicated little difference between them. However, one SI approach (trajmean) performed best among all imputation approaches when fitting the REEM tree algorithm for both outcome variables, DBP and SBP.
In addition, we evaluated the computational time of the imputation approaches. JM approaches are much more resource-intensive than FCS approaches and may not be practicable in longitudinal studies with many clusters and predictor variables with high missing rates. For FCS approaches, convergence of the estimates can occur within 5 or 10 iterations, whereas JM approaches require many more iterations to converge; FCS approaches are therefore computationally faster than JM approaches. Overall, MI approaches require more computing resources than SI approaches, so in real applications with many clusters (e.g., TCGS), SI approaches are likely to be more cost-effective in terms of computational time. The SI approaches are not based on maximum likelihood estimation and are classified as nonparametric imputation approaches; they are appropriate for nonparametric trajectories (e.g., the BMI trajectories in the TCGS data). Based on its computational time and performance, the SI trajmean method appears to be an excellent approach for dealing with missing observations in data similar to the TCGS data set.
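Another of the compared SI approaches, interpolation, also works within each subject's own trajectory: intermittent gaps are filled linearly between the nearest observed waves. A one-trajectory pandas sketch (values are invented; the paper's R implementation lives in the longitudinalData package):

```python
import numpy as np
import pandas as pd

# One subject's BMI trajectory over six equally spaced waves (illustrative).
traj = pd.Series([24.0, np.nan, 26.0, np.nan, np.nan, 29.0],
                 index=[1, 2, 3, 4, 5, 6])

# Interpolation imputation: each intermittent gap is filled linearly
# between the nearest observed waves of the same subject.
filled = traj.interpolate(method="linear")
print(filled.tolist())  # [24.0, 25.0, 26.0, 27.0, 28.0, 29.0]
```

Like trajmean, this uses only the subject's own observed values, which is why such SI approaches scale so cheaply to many clusters; unlike trajmean, it preserves the within-subject trend rather than flattening gaps to a single mean.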
In the current study, we only assessed the MAR mechanism. Sensitivity analyses (selection models and pattern-mixture models) can be performed to assess the appropriateness of the MAR assumption; unfortunately, these models are unavailable for longitudinal data with missing values in both longitudinal quantitative predictor and outcome variables [17, 76,77,78].
Past studies compared MI approaches for fitting parametric longitudinal models, such as the LMM with random intercepts and the LMM with random intercepts and slopes, and indicated that all MI approaches provide consistent regression coefficients [16, 17]. Some studies also compared these imputation methods for missing data in the context of multilevel data and reached the same conclusion [15]. In addition, two studies comprehensively compared SI approaches for imputing monotone and non-monotone missing data in longitudinal studies; unlike in the current study, the copy-mean method was more effective than the other SI approaches [34]. Like these two studies, Zhang (2016) also indicated that the copy-mean method performed better for imputing missing values [79].
To the best of our knowledge, the present study is the first to compare SI and MI approaches for imputing longitudinal missing data with many time points, clusters, and missing values using both real and simulated data. However, the present study has two limitations: the violation of the parametric longitudinal model assumptions, and the computational time of the simulation study. We used SI approaches and a nonparametric longitudinal model (the REEM tree algorithm) to mitigate these limitations. Future studies could impute missing longitudinal data with nonparametric imputation methods that use a multivariate skew-normal distribution for the random effects; in addition, when time intervals are unequal, functional data analysis could be helpful.
Conclusion
The results of this study should be generalized to other data sets with different characteristics only with caution. Imputation methods can perform differently on different data sets, and conditions such as the missing data mechanism or the rate of missingness might lead analysts to prefer different imputation options. We therefore recommend that researchers apply all imputation methods (SI and MI) in the context of fitting their statistical models, and then select the imputation method that demonstrates the best performance based on the criteria highlighted in this paper.
Availability of data and materials
The datasets analyzed during the current study are not publicly available because they contain information that could compromise the privacy of research participants, but they are available from the corresponding authors on reasonable request. However, the R code of the SI and MI approaches for missing data imputation, and for the data simulation, is available at https://github.com/MinaJahangiri/Rcodesofmissingimputationmethods.
Abbreviations
DBP: Diastolic blood pressure
SBP: Systolic blood pressure
BMI: Body mass index
JM: Joint modelling
FCS: Fully conditional specification
JMMVN: Joint multivariate normal imputation
SI: Single imputation
MI: Multiple imputation
LMM: Linear mixed-effects model
REEM: Random effects expectation–maximization
TCGS: Tehran cardiometabolic genetic study
References
Fitzmaurice GM, Laird NM, Ware JH. Applied longitudinal analysis, vol. 998: John Wiley & Sons; 2012.
Chen SX, Zhong PS. ANOVA for longitudinal data with missing values. 2010.
Little RJ, Rubin DB. Statistical analysis with missing data, vol. 793: John Wiley & Sons; 2019.
Schafer JL, Yucel RM. Computational strategies for multivariate linear mixedeffects models with missing values. J Comput Graph Stat. 2002;11(2):437–57.
Van Buuren S, Brand JP, GroothuisOudshoorn CG, Rubin DB. Fully conditional specification in multivariate imputation. J Stat Comput Simul. 2006;76(12):1049–64.
Fox MJ. Package ‘norm.’ 2013.
Schafer JL, Tusell MF. Package ‘cat.’ 2012.
Kowarik A, Templ M. Imputation with the R Package VIM. J Stat Softw. 2016;74(1):1–16.
van Buuren S, GroothuisOudshoorn K, Robitzsch A, Vink G, Doove L, Jolani S. Package ‘mice’. Computer software. 2015.
Gelman A, Hill J, Su YS, Yajima M, Pittau M, Goodrich B, Si Y, Kropko J, Goodrich MB. Package ‘mi’. R CRAN R Foundation for Statistical Computing. 2015.
Husson F, Josse J, Husson MF, FactoMineR I. Package ‘missMDA.’ methods. 2013;153(2):79–99.
Lüdtke O, Robitzsch A, Grund S. Multiple imputation of missing data in multilevel designs: A comparison of different strategies. Psychol Methods. 2017;22(1):141.
Enders CK, Mistler SA, Keller BT. Multilevel multiple imputation: A review and evaluation of joint modeling and chained equations imputation. Psychol Methods. 2016;21(2):222.
Van Buuren S. Multiple imputation of multilevel data: Routledge; 2011.
Wijesuriya R, MorenoBetancur M, Carlin JB, Lee KJ. Evaluation of approaches for multiple imputation of threelevel data. BMC Med Res Methodol. 2020;20(1):1–15.
Huque MH, Carlin JB, Simpson JA, Lee KJ. A comparison of multiple imputation methods for missing data in longitudinal studies. BMC Med Res Methodol. 2018;18(1):1–16.
Huque MH, MorenoBetancur M, Quartagno M, Simpson JA, Carlin JB, Lee KJ. Multiple imputation methods for handling incomplete longitudinal and clustered data where the target analysis is a linear mixed effects model. Biom J. 2020;62(2):444–66.
Quartagno M, Carpenter J, Quartagno MM, BaBooN S. Package ‘jomo.’ 2020.
Grund S, Robitzsch A, Luedtke O, Grund MS. Package ‘mitml.’ 2019.
Robitzsch A, Grund S, Henke T, Robitzsch MA. Package ‘miceadds.’ R Package: Madison; 2017.
Grund S, Lüdtke O, Robitzsch A. Multiple imputation of multilevel missing data: An introduction to the R package pan. SAGE Open. 2016;6(4):2158244016668220.
Audigier V, RescheRigon M. micemd: multiple imputation by chained equations with multilevel data. R package version 160. 2019.
Speidel M, Drechsler J, Jolani S. R package hmi: a convenient tool for hierarchical multiple imputation and beyond. In: IABDiscussion Paper; 2018.
Erler NS, Rizopoulos D, Lesaffre EM. JointAI: joint analysis and imputation of incomplete data in R. 2019. arXiv preprint arXiv:1907.10867.
Genolini C, Falissard B, Fang D, Tierney L, Genolini MC. Package ‘longitudinalData.’ 2016.
Van Buuren S, GroothuisOudshoorn K. mice: Multivariate imputation by chained equations in R. J Stat Softw. 2011;45(1):1–67.
Muthen Linda K, Muthen Bengt O. Mplus: Statistical Analysis with Latent Variables. Los Angeles: Muthen & Muthen; 2007.
Royston P, White IR. Multiple imputation by chained equations (MICE): implementation in Stata. J Stat Softw. 2011;45(4):1–20.
Welch C, Bartlett J, Petersen I. Application of multiple imputation using the twofold fully conditional specification algorithm in longitudinal clinical data. Stand Genomic Sci. 2014;14(2):418–31.
Keller BT, Enders CK. Blimp Software Manual (Version Beta 6.7). Los Angeles. 2017.
Bartlett J. REALCOMIMPUTE: Stata module to export and import data to the realcomImpute software package. 2018.
Mistler SA. A SAS macro for applying multiple imputation to multilevel data. In: Proceedings of the SAS Global Forum: 2013: Citeseer; 2013:1–8.
Charlton C, Michaelides D, Cameron B, Szmaragd C, Parker R, Yang H. StatJR software. 2012.
Genolini C, JacqminGadda H. Copy mean: a new method to impute intermittent missing values in longitudinal studies. Open J Stat. 2013;3(04):26.
Genolini C, Lacombe A, Écochard R, Subtil F. CopyMean: a new method to predict monotone missing values in longitudinal studies. Comput Methods Programs Biomed. 2016;132:29–44.
De Silva AP, MorenoBetancur M, De Livera AM, Lee KJ, Simpson JA. A comparison of multiple imputation methods for handling missing values in longitudinal data in the presence of a timevarying covariate with a nonlinear association with time: a simulation study. BMC Med Res Methodol. 2017;17(1):1–11.
Sela RJ, Simonoff JS. REEM trees: a data mining approach for longitudinal and clustered data. Mach Learn. 2012;86(2):169–207.
Akbarzadeh M, Moghimbeigi A, Mahjub H, Soltanian AR, Daneshpour M, Morris N. Trajectories of change in obesity among tehranian families: multilevel latent growth curve modeling. Int J Fam Med. 2016;2016.
Akbarzadeh M, Moghimbeigi A, Morris N, Daneshpour MS, Mahjub H, Soltanian AR. A Bayesian structural equation model in general pedigree data analysis. Stat Analysis Data Mining. 2019;12(5):404–11.
Daneshpour MS, Hedayati M, SedaghatiKhayat B, Guity K, Zarkesh M, Akbarzadeh M, et al. Genetic identification for noncommunicable disease: Findings from 20 years of the Tehran Lipid and Glucose Study. Int J Endocrinol Metab. 2018;16(4 Suppl).
Zahedi AS, Akbarzadeh M, SedaghatiKhayat B, Seyedhamzehzadeh A, Daneshpour MS. GCKR common functional polymorphisms are associated with metabolic syndrome and its components: a 10year retrospective cohort study in Iranian adults. Diabetol Metab Syndr. 2021;13(1):1–10.
SedaghatiKhayat B, Barzin M, Akbarzadeh M, Guity K, Fallah MS, Pourhassan H, et al. Lack of association between FTO gene variations and metabolic healthy obese (MHO) phenotype: Tehran Cardiometabolic Genetic Study (TCGS). Eat Weight Disord Stud Anorexia Bulimia Obes. 2020;25(1):25–35.
Kolifarhood G, Daneshpour M, Hadaegh F, Sabour S, Mozafar Saadati H, Akbar Haghdoust A, et al. Heritability of blood pressure traits in diverse populations: a systematic review and metaanalysis. J Hum Hypertens. 2019;33(11):775–85.
Kolifarhood G, Daneshpour MS, Khayat BS, Saadati HM, Guity K, Khosravi N, et al. Generality of genomic findings on blood pressure traits and its usefulness in precision medicine in diverse populations: A systematic review. Clin Genet. 2019;96(1):17–27.
Pedersen AB, Mikkelsen EM, CroninFenton D, Kristensen NR, Pham TM, Pedersen L, et al. Missing data and multiple imputation in clinical epidemiological research. Clin Epidemiol. 2017;9:157.
Twisk JW. Applied longitudinal data analysis for epidemiology: a practical guide: Cambridge university press; 2013.
Little R. Selection and patternmixture models. Longitudinal data analysis: Chapman and Hall/CRC; 2008. p. 423–46.
Jamshidian M, Jalal S, Jansen C. MissMech: An R package for testing homoscedasticity, multivariate normality, and missing completely at random (MCAR). J Stat Softw. 2014;56(1):1–31.
Little RJ. A test of missing completely at random for multivariate data with missing values. J Am Stat Assoc. 1988;83(404):1198–202.
Ibrahim JG, Molenberghs G. Missing data methods in longitudinal studies: a review. TEST. 2009;18(1):1–43.
Rubin DB. Multiple imputation for nonresponse in surveys, vol. 81: John Wiley & Sons; 2004.
Sterne JA, White IR, Carlin JB, Spratt M, Royston P, Kenward MG, et al. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ. 2009;338.
Rezvan PH, Lee KJ, Simpson JA. The rise of multiple imputation: a review of the reporting and implementation of the method in medical research. BMC Med Res Methodol. 2015;15(1):1–14.
Graham JW, Olchowski AE, Gilreath TD. How many imputations are really needed? Some practical clarifications of multiple imputation theory. Prev Sci. 2007;8(3):206–13.
Enders CK. Applied missing data analysis: Guilford Publications; 2022.
Yucel RM. Random covariances and mixedeffects models for imputing multivariate multilevel continuous data. Stat Model. 2011;11(4):351–70.
Goldstein H, Carpenter J, Kenward MG, Levin KA. Multilevel models with multivariate mixed response types. Stat Model. 2009;9(3):173–97.
Goldstein H, Carpenter JR, Browne WJ. Fitting multilevel multivariate models with missing data in responses and covariates that may include interactions and nonlinear terms. J Royal Stat Soc Series A (Statistics in Society). 2014:553–64.
Quartagno M, Carpenter J. Multiple imputation for IPD metaanalysis: allowing for heterogeneity and studies with missing covariates. Stat Med. 2016;35(17):2938–54.
Enders CK, Keller BT, Levy R. A fully conditional specification approach to multilevel imputation of categorical and continuous variables. Psychol Methods. 2018;23(2):298.
Erler NS, Rizopoulos D, Jaddoe VW, Franco OH, Lesaffre EM. Bayesian imputation of timevarying covariates in linear mixed models. Stat Methods Med Res. 2019;28(2):555–68.
Camp NJ, Slattery ML. Classification tree analysis: a statistical tool to investigate risk factor interactions with an example for colon cancer (United States). Cancer Causes Control. 2002;13(9):813–23.
Jahangiri M, Khodadi E, Rahim F, Saki N, Saki Malehi A. Decision‐tree‐based methods for differential diagnosis of β‐thalassemia trait from iron deficiency anemia. Expert Syst. 2017;34(3).
Loh WY, He X, Man M. A regression tree approach to identifying subgroups with differential treatment effects. Stat Med. 2015;34(11):1818–33.
Kundu MG, Harezlak J. Regression trees for longitudinal data with baseline covariates. Biostatistics & epidemiology. 2019;3(1):1–22.
Eo SH, Cho H. Treestructured mixedeffects regression modeling for longitudinal data. J Comput Graph Stat. 2014;23(3):740–60.
Lemon SC, Roy J, Clark MA, Friedmann PD, Rakowski W. Classification and regression tree analysis in public health: methodological review and comparison with logistic regression. Ann Behav Med. 2003;26(3):172–81.
Malehi AS, Jahangiri M. Classic and Bayesian TreeBased Methods. In: Enhanced Expert Systems. edn.: IntechOpen; 2019.
Jahangiri M, Rahim F, Saki N, Saki Malehi A. Application of Bayesian Decision Tree in Hematology Research: Differential Diagnosis of βThalassemia Trait from Iron Deficiency Anemia. Comput Math Methods Med. 2021;2021.
Rahim F, Kazemnejad A, Jahangiri M, Malehi AS, Gohari K. Diagnostic performance of classification trees and hematological functions in hematologic disorders: an application of multidimensional scaling and cluster analysis. BMC Med Inform Decis Mak. 2021;21(1):1–13.
Breiman L, Friedman J, Stone CJ, Olshen RA. Classification and regression trees: CRC press; 1984.
De’ath G, Fabricius KE. Classification and regression trees: a powerful yet simple technique for ecological data analysis. Ecology. 2000;81(11):3178–92.
Speybroeck N, Berkvens D, MfoukouNtsakala A, Aerts M, Hens N, Van Huylenbroeck G, et al. Classification trees versus multinomial models in the analysis of urban farming systems in Central Africa. Agric Syst. 2004;80(2):133–49.
Feldesman MR. Classification trees as an alternative to linear discriminant analysis. Am J Phys Anthropol. 2002;119(3):257–75.
Chan KY, Loh WY. LOTUS: An algorithm for building accurate and comprehensible logistic regression trees. J Comput Graph Stat. 2004;13(4):826–52.
Rezvan PH, Lee KJ, Simpson JA. Sensitivity analysis within multiple imputation framework using deltaadjustment: application to longitudinal study of Australian Children. Longitudinal Life Course Stud. 2018;9(3):259–78.
MorenoBetancur M, Chavance M. Sensitivity analysis of incomplete longitudinal data departing from the missing at random assumption: Methodology and application in a clinical trial with dropouts. Stat Methods Med Res. 2016;25(4):1471–89.
Fiero MH, Hsu CH, Bell ML. A patternmixture model approach for handling missing continuous outcome data in longitudinal cluster randomized trials. Stat Med. 2017;36(26):4094–105.
Zhang Z. Missing data imputation: focusing on single imputation. Ann Transl Med. 2016;4(1).
Templ M, Alfons A, Kowarik A, Prantner B, Templ MM. Package ‘VIM’. 2021.
Bates D, Sarkar D, Bates MD, Matrix L. The lme4 package. R package version. 2007;2(1):74.
Sela RJ, Simonoff JS. REEM trees: a data mining approach for longitudinal and clustered data. Mach Learn. 2012;86:169–207.
Goldfeld K, WujciakJens J. simstudy: Illuminating research methods through data generation. J Open Source Softw. 2020;5(54):2763.
Acknowledgements
The authors would like to express their gratitude to the staff and participants in the TCGS project.
Funding
No funding was received for this study.
Author information
Authors and Affiliations
Contributions
Conceptualization: Mahdi Akbarzadeh, and Mina Jahangiri; Formal analysis: Mina Jahangiri, Mahdi Akbarzadeh, and Shayan Mostafaei; Methodology: Mina Jahangiri, Mahdi Akbarzadeh, Anoshirvan Kazemnejad, Maryam S Daneshpour, and Davood Khalili; Medical consultant: Maryam S Daneshpour; Software and data simulation: Mina Jahangiri, Keith Goldfeld, Mahdi Akbarzadeh, Shayan Mostafaei, and Mohammad Reza Moghadas; Writing original draft: Mina Jahangiri and Keith Goldfeld; Supervision: Anoshirvan Kazemnejad and Mahdi Akbarzadeh; All authors reviewed and accepted the manuscript.
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
The ethical committee approved this study at Research Institute for Endocrine Sciences; Shahid Beheshti University of Medical Sciences (Research Approval Code: 28778 & Research Ethical Code: IR.SBMU.ENDOCRINE.REC.1400.084). In this study, all participants provided written informed consent for participating in the study. This study has been performed in accordance with the Declaration of Helsinki.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1:
Figure S1. The process of the multiple imputation approach (e.g., the number of multiple imputed data sets is equal to 5).
Additional file 2:
Figure S2. Missing data pattern (percentage of missing values in a particular combination of variables based on the long data format) using the VIM package (blue color: observed values and red color: missing values).
Additional file 3:
Figure S3. The plot of standardized residuals versus fitted values for the linear mixed-effects model with random intercepts using the lme4 package based on the model: DBP ~ Age + Sex + BMI + Time + (1|Id) after using the trajmean method for the imputation of missing values of longitudinal data.
Additional file 4:
Figure S4. The plot of standardized residuals versus fitted values for the linear mixed-effects model with random intercepts using the lme4 package based on the model: SBP ~ Age + Sex + BMI + Time + (1|Id) after using the trajmean method for the imputation of missing values of longitudinal data.
Additional file 5:
Figure S5. The quantile-quantile plot of residuals for the linear mixed-effects model with random intercepts using the lme4 package based on the model: DBP ~ Age + Sex + BMI + Time + (1|Id) after using the trajmean method for the imputation of missing values of longitudinal data.
Additional file 6:
Figure S6. The quantile-quantile plot of residuals for the linear mixed-effects model with random intercepts using the lme4 package based on the model: SBP ~ Age + Sex + BMI + Time + (1|Id) after using the trajmean method for the imputation of missing values of longitudinal data.
Additional file 7:
Figure S7. Standardized residuals of the linear mixed-effects model with random intercepts versus the BMI variable using the lme4 package based on the model: DBP ~ Age + Sex + BMI + Time + (1|Id) after using the trajmean method for the imputation of missing values of longitudinal data.
Additional file 8:
Figure S8. Standardized residuals of the linear mixed-effects model with random intercepts versus the BMI variable using the lme4 package based on the model: SBP ~ Age + Sex + BMI + Time + (1|Id) after using the trajmean method for the imputation of missing values of longitudinal data.
Additional file 9:
Figure S9. The tree structure of the REEM tree algorithm based on the trajmean method to impute missing values, for extracting homogeneous subgroups of observations for diastolic blood pressure (DBP) using the REEMtree package. This tree algorithm extracted 7 homogeneous subgroups of observations; the lowest and highest subgroups were subjects with "BMI < 23.70 & age < 41.50" and subjects with "BMI ≥ 30.41 & age ≥ 43.50", respectively.
Additional file 10:
Figure S10. The tree structure of the REEM tree algorithm based on the trajmean method to impute missing values, for extracting homogeneous subgroups of observations for systolic blood pressure (SBP) using the REEMtree package. This tree algorithm extracted 9 homogeneous subgroups of observations; the lowest and highest subgroups were subjects with "BMI < 23.70 & age < 49.50" and subjects with "age ≥ 59.50 & BMI ≥ 26.50", respectively.
Additional file 11:
Figure S11. Density plots of the observed and imputed data for incomplete variables like BMI and DBP for each iteration (the number of iterations = 20) using the mice package (observed data: blue and imputed data: red).
Additional file 12:
Figure S12. Density plots of the observed and imputed data for the BMI variable for each iteration using the mice package (observed data: blue and imputed data: red).
Additional file 13:
Figure S13. Density plots of the observed and imputed data for the DBP variable for each iteration using the mice package (observed data: blue and imputed data: red).
Additional file 14:
Figure S14. Density plots of the observed and imputed data for the SBP variable for each iteration using the mice package (observed data: blue and imputed data: red).
Additional file 15:
Figure S15. The trace line plots of the mean and standard deviation of the imputed values against the iteration number for each replication using the mice package.
Additional file 16:
Figure S16. Trace plot using the JointAI package based on the model: DBP ~ Age + Sex + BMI + Time + (1|Id).
Additional file 17:
Figure S17. MC plot using the JointAI package based on the model: DBP ~ Age + Sex + BMI + Time + (1|Id).
Additional file 18:
Figure S18. Trace plot using the JointAI package based on the model: SBP ~ Age + Sex + BMI + Time + (1|Id).
Additional file 19:
Figure S19. MC plot using the JointAI package based on the model: SBP ~ Age + Sex + BMI + Time + (1|Id).
Additional file 20:
Figure S20. The convergence plot of the JM-MLMM method.
Additional file 21:
Table S1. The sample data for the first 20 individuals of TCGS (NA: missing value).
Additional file 22:
Table S2. Descriptive statistics for continuous variables of TCGS participants included in the present study at each phase (BMI: body mass index, DBP: diastolic blood pressure, SBP: systolic blood pressure, and SD: std. deviation).
Additional file 23:
Table S3. The R-hat statistic for the means and variances of all incomplete variables using the miceadds package for the FCS-LMM-LN and FCS-LMM-LN-het imputation methods.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Jahangiri, M., Kazemnejad, A., Goldfeld, K.S. et al. A wide range of missing imputation approaches in longitudinal data: a simulation study and real data analysis. BMC Med Res Methodol 23, 161 (2023). https://doi.org/10.1186/s12874023019688
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12874023019688