A wide range of missing imputation approaches in longitudinal data: a simulation study and real data analysis
BMC Medical Research Methodology volume 23, Article number: 161 (2023)
Abstract
Background
Missing data is a pervasive problem in longitudinal data analysis. Several single-imputation (SI) and multiple-imputation (MI) approaches have been proposed to address this issue. In this study, for the first time, the performance of the longitudinal regression tree algorithm as a nonparametric method, applied after imputing missing data using SI and MI approaches, was investigated using simulated and real data.
Method
Using different simulation scenarios derived from a real data set, we compared the performance of cross, trajectory mean, interpolation, copy-mean, and MI methods (27 approaches) for imputing missing longitudinal data, using parametric and nonparametric longitudinal models; the performance of the methods was also assessed on the real data. The real data included 3,645 participants older than 18 years across six waves of the longitudinal Tehran cardiometabolic genetic study (TCGS). The data modeling was conducted using systolic and diastolic blood pressure (SBP/DBP) as the outcome variables and included predictor variables such as age, gender, and BMI. The efficiency of the imputation approaches was compared using the mean squared error (MSE), root-mean-squared error (RMSE), median absolute deviation (MAD), deviance, and Akaike information criterion (AIC).
Results
The longitudinal regression tree algorithm outperformed the linear mixed-effects model (LMM) in terms of criteria such as MSE, RMSE, and MAD for analyzing the TCGS and simulated data under the missing at random (MAR) mechanism. Overall, based on fitting the nonparametric model, the performance of the 27 imputation approaches was nearly similar. However, the SI traj-mean method improved performance compared with the other imputation approaches.
Conclusion
Both SI and MI approaches performed better with the longitudinal regression tree algorithm than with the parametric longitudinal models. Based on the results from both the real and simulated data, we recommend that researchers use the traj-mean method for imputing missing values in longitudinal data. That said, the imputation method with the best performance depends heavily on the models of interest and the data structure.
Background
Longitudinal data collected from the same subjects over time are frequently used in observational studies and clinical trials. Traditional models for longitudinal data analysis are generalized linear mixed-effects models (LMM), marginal models like generalized estimating equations (GEE), and transitional models. In various types of studies, especially longitudinal ones, researchers frequently face significant challenges, such as missing data. During follow-up, some subjects may withdraw or become lost to follow-up at planned visits. Subjects who participate only during a particular study period may complete only a subset of the information [1, 2].
Most conventional statistical models deal only with complete cases, and missing data are omitted before fitting statistical models (this is the default in most statistical software programs and is called the listwise deletion method). Excluding these observations has disadvantages, including loss of information, loss of precision, reduction in statistical power, and potentially biased estimates [3]. Therefore, different approaches have been introduced to impute missing values; they can be classified as either single-imputation (SI) or multiple-imputation (MI).
MI methods for imputing missing data in software programs are based on two approaches: joint modeling (JM) and fully conditional specification (FCS). JM approaches for MI are based on the multivariate distribution or the joint distribution of the incomplete variables (often, the multivariate normal (MVN) distribution is assumed, and this is referred to as multivariate normal imputation (MVNI)) [4]. In FCS approaches, missing observations of each incomplete variable are imputed given all the other predictor variables, cycling iteratively through a sequence of univariate imputation models [5].
Several JM and FCS methods, such as JM-MVN (joint multivariate normal imputation) and FCS-standard approaches, have been proposed to handle missing values in cross-sectional studies [6,7,8,9,10,11]. These approaches are also appropriate for imputing missing values in balanced longitudinal data, where longitudinal measurements are obtained at fixed time intervals. In this case, the JM-MVN and FCS-standard methods treat time-dependent variables as distinct variables (wide format) for the imputation of balanced missing longitudinal data [3, 5].
Sometimes, however, longitudinal data are collected at unequal time intervals along with many longitudinal predictor variables. Standard JM-MVN and FCS methods cannot be used for imputing missing values in this case, because the large number of time-dependent predictor variables may lead to problems such as overfitting and multicollinearity among the distinct predictor variables.
However, ignoring the longitudinal and multilevel structures when imputing missing values of longitudinal and multilevel data may lead to biased inferences for the estimates of regression coefficients and their standard errors [12, 13]. Recently, several studies extended MI methods for imputing missing values in multilevel data [14, 15] and longitudinal data [4, 16, 17]. These extensions are also available in several software programs such as R [9, 18,19,20,21,22,23,24,25,26], Mplus [27], STATA [28, 29], Blimp [30], REALCOM-IMPUTE [31], SAS [32], and StatJR [33]. In addition, in 2013 and 2016, Genolini et al. introduced several SI approaches to impute monotone/dropout and non-monotone/intermittent missing data in longitudinal studies [34, 35].
A few studies compared MI approaches for imputing missing values in longitudinal studies [16, 17, 36]. These studies used parametric approaches like LMM as an analysis model of interest. However, it is uncertain how well the different MI approaches perform when the statistical model of interest is a nonparametric longitudinal model. In addition, there is no comparison of SI and MI approaches in the literature, where the target analysis is a nonparametric longitudinal model. Hence, the present study is the first to consider nonparametric estimation methods for longitudinal data analysis following missing data imputation with SI and MI approaches.
In this study, the nonparametric longitudinal method of interest is the longitudinal regression tree algorithm proposed by Sela et al. This algorithm is named the random effects expectation–maximization (REEM) tree algorithm [37].
The primary purpose of this study is to evaluate the performance of MI and SI approaches for imputing missing values in longitudinal data. The longitudinal data for this study were obtained from the longitudinal Tehran cardiometabolic genetic study (TCGS) to assess the association between diastolic/systolic blood pressure (DBP/SBP) and predictor variables such as age, gender, and body mass index (BMI).
Methods
Tehran cardiometabolic genetic study (TCGS)
Subjects of the study were extracted from TCGS, an ongoing cohort study based on the framework of the Tehran Lipid and Glucose Study (TLGS). TLGS is the first prospective cohort study in West Asia and was conducted in Tehran, the capital of Iran. This study was designed to assess the epidemiology of non-communicable diseases among participants from district 13 of Tehran, with 24 years of follow-up. The first (baseline) phase of the TLGS study started in February 1999, and the individuals were selected via a multistage stratified cluster random sampling method with follow-up every three years. The primary purpose of the TLGS is to determine the prevalence of cardiovascular diseases (CVD) and the risk factors for the development of these diseases. The design of the TLGS study has been reported elsewhere [38,39,40,41,42].
In the present study, some participants of TCGS were used to evaluate the association between DBP/SBP and predictor variables such as age, sex, and BMI [43, 44]. The predictor variables sex and age had no missing values (Table S1 in the supplementary file provides sample data for the first 20 individuals of TCGS). However, the BMI and outcome variables (DBP/SBP) had missing values in all six waves of the TCGS study. SI and MI approaches were used to impute the missing values of the incomplete variables. Parametric linear mixed-effects and nonparametric longitudinal regression tree models were used for longitudinal data analysis after imputing the missing data. The study structure for selecting individuals and the statistical analysis plan are shown in Fig. 1. In the following sections, each step of the data analysis is fully described.
Missing data mechanisms
Three missing data mechanisms are defined for generating missing data: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) [3]. We can distinguish between the mechanisms by defining the outcome \(Y = (Y^{\mathrm{observed}}, Y^{\mathrm{missing}})\) and the missing indicator R (1: observed and 0: missing). Each missing data mechanism suggests a general imputation approach.
If the missingness probability is not related to the observed and unobserved data, then data are MCAR. Under this mechanism, the distribution of missing data is \(P\left(\mathrm{R}\mid \mathrm{Y}\right)=P(\mathrm{R})\). While this assumption is unrealistic, the listwise deletion method is an unbiased method for dealing with missing data when it holds. Multiple imputation (MI) approaches are developed under the MAR assumption, which states that the probability of missingness is related to the observed data and does not depend on the unobserved data. In this case, the distribution of missing data is defined as \(P\left(\mathrm{R}\mid \mathrm{Y}\right)=P\left(\mathrm{R}\mid {\mathrm{Y}}^{\mathrm{observed}}\right)\), and MI methods can generate unbiased and efficient results [45]. Traditional models for longitudinal data analysis like GEE, and models based on maximum likelihood estimation like GLMMs, also lead to valid estimates when the missing mechanisms are MCAR and MAR, respectively [46].
The MNAR mechanism occurs when the probability of missingness is related to both observed and unobserved data, so that the distribution of missing data is \(P\left(\mathrm{R}\mid {\mathrm{Y}}^{\mathrm{observed}},{\mathrm{Y}}^{\mathrm{missing}}\right)\). Selection and pattern-mixture models have been introduced to handle MNAR data [47]. Assuming that the mechanism is MCAR is a very strong assumption; there are several tests to assess this assumption against the alternative of not-MCAR [48, 49]. The MAR assumption is the most common in practice, and MCAR can be tested against it. However, it is never possible to rule out the MNAR assumption. The missing data mechanism of multiple variables may be a mixture of any or all of the mechanisms described here.
Missing data patterns
In longitudinal studies, missing values arise from non-monotone/intermittent and monotone/dropout processes [50]. A non-monotone pattern is created when study information is not available for a subject at one time point, but the subject returns at a subsequent time point. A monotone pattern is the opposite: once a subject misses a particular follow-up, that subject is not available again. In practice, these two patterns can occur together for different measures.
SI approaches to impute missing values in longitudinal data
In SI approaches, a single value is estimated for each missing data point. In 2013 and 2016, Genolini et al. introduced several SI approaches to impute monotone/dropout and non-monotone/intermittent missing data in longitudinal studies [34, 35].
To understand the computational strategy of this section, consider a data set of n clusters (subjects), with a time-dependent variable recorded at t time points for each cluster. The trajectory of cluster i is defined as the sequence \(y_{i.} = (y_{i1}, y_{i2}, \dots, y_{it})\), and the cross-sectional measurement at a particular time point k as the vector \(y_{.k} = (y_{1k}, y_{2k}, \dots, y_{nk})\). Let \(y_{ik}\) denote a missing value for cluster i at a specific time point k. Then \(y_{ik}\) is non-monotone missing if there exist time points a < k < b such that \(y_{ia}\) and \(y_{ib}\) are not missing, and \(y_{ik}\) is monotone missing if \(y_{ih}\) is missing for all time points h > k.
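As an illustration, this distinction can be sketched in a few lines of Python (the study's own code is in R; `classify_missing` is a hypothetical helper, not from the paper), with NaN marking a missing value:

```python
import math

def classify_missing(traj, k):
    """Classify the entry at index k of a trajectory.

    Returns 'non-monotone' if observed values exist both before and
    after time k, 'monotone' if every later time point is also missing,
    and 'observed' if traj[k] is not missing. NaN marks a missing value.
    """
    if not math.isnan(traj[k]):
        return "observed"
    before = any(not math.isnan(v) for v in traj[:k])
    after = any(not math.isnan(v) for v in traj[k + 1:])
    if before and after:
        return "non-monotone"
    if all(math.isnan(v) for v in traj[k + 1:]):
        return "monotone"
    return "other"  # e.g. missing values at the start of the trajectory
```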
SI methods are classified into three imputation classes: cross-sectional (methods such as cross-mean, cross-median, and cross-hot deck), longitudinal (methods such as traj-mean, traj-median, traj-hot deck, last observation carried forward (LOCF), next observation carried backward (NOCB), interpolation-LOCF, interpolation-global, interpolation-local, and interpolation-bisector), and cross-sectional-longitudinal (methods such as copy-mean LOCF, copy-mean global, copy-mean local, and copy-mean bisector). The cross-sectional imputation methods use the data observed at a specific time (across clusters) to replace the missing values at that time, whereas the longitudinal imputation methods use the observed data of the same cluster to impute that cluster's missing values. The cross-sectional-longitudinal imputation methods utilize both cross-sectional information (\(y_{.k}\)) and longitudinal information (\(y_{i.}\)).
Cross methods
The cross methods include cross-mean, cross-median, and cross-hot deck. In the cross-mean method, \(y_{ik}\) is estimated by the mean of all values observed at the time point of interest. Likewise, the cross-median method uses the median of all values observed at the time point of interest. In the cross-hot deck method, \(y_{ik}\) is estimated by a value randomly selected from those observed at the time point of interest.
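A minimal Python sketch of the three cross methods follows (the paper's implementations are in R; the function name and list-of-trajectories data layout are illustrative assumptions):

```python
import math
import random

def cross_impute(data, method="mean"):
    """Impute missing values column-wise (across clusters at each time).

    data: list of trajectories (rows = clusters, columns = time points).
    method: 'mean', 'median', or 'hot deck' (random observed donor).
    """
    t = len(data[0])
    out = [row[:] for row in data]
    for k in range(t):
        observed = sorted(row[k] for row in data if not math.isnan(row[k]))
        for row in out:
            if math.isnan(row[k]):
                if method == "mean":
                    row[k] = sum(observed) / len(observed)
                elif method == "median":
                    m = len(observed)
                    row[k] = (observed[m // 2] if m % 2
                              else (observed[m // 2 - 1] + observed[m // 2]) / 2)
                else:  # hot deck: copy a randomly chosen observed value
                    row[k] = random.choice(observed)
    return out
```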
Traj methods
The traj methods include traj-mean, traj-median, and traj-hot deck. In the traj-mean method, \(y_{ik}\) is estimated by the mean of all values observed in the trajectory of interest \(y_{i.}\). The traj-median method uses the median of all values observed in the trajectory of interest. In the traj-hot deck method, \(y_{ik}\) is estimated by a value randomly selected from those observed in the trajectory of interest.
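The traj methods are the row-wise counterparts of the cross methods; a hedged Python sketch (illustrative only, not the paper's R code):

```python
import math
import random

def traj_impute(traj, method="mean"):
    """Impute the missing values of a single trajectory from its own
    observed values (traj-mean / traj-median / traj-hot deck)."""
    observed = sorted(v for v in traj if not math.isnan(v))
    if method == "mean":
        fill = lambda: sum(observed) / len(observed)
    elif method == "median":
        m = len(observed)
        fill = lambda: (observed[m // 2] if m % 2
                        else (observed[m // 2 - 1] + observed[m // 2]) / 2)
    else:  # hot deck: draw a random observed value per missing entry
        fill = lambda: random.choice(observed)
    return [v if not math.isnan(v) else fill() for v in traj]
```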
LOCF (last observation carried forward) and NOCB (next observation carried backward)
In the LOCF and NOCB methods, y_{ik} is estimated by the last and next observed value of the trajectory of interest, respectively.
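These two rules can be sketched as follows (an illustrative Python version, not the paper's R code):

```python
import math

def locf_nocb(traj):
    """LOCF: carry the last observed value forward; NOCB: carry the
    next observed value backward. Returns both imputed copies."""
    fwd = traj[:]
    for k in range(1, len(fwd)):           # forward pass (LOCF)
        if math.isnan(fwd[k]):
            fwd[k] = fwd[k - 1]
    back = traj[:]
    for k in range(len(back) - 2, -1, -1):  # backward pass (NOCB)
        if math.isnan(back[k]):
            back[k] = back[k + 1]
    return fwd, back
```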
Interpolation methods
Interpolation methods include four methods: interpolation-LOCF, interpolation-global, interpolation-local, and interpolation-bisector. In all interpolation methods, a non-monotone missing value (\(y_{ik}\)) is replaced by drawing a line between the values immediately surrounding \(y_{ik}\); the mathematical formula of this step is \({\mathrm{y}}_{\mathrm{ia}}+(\mathrm{k}-\mathrm{a})\frac{{\mathrm{y}}_{\mathrm{ib}}-{\mathrm{y}}_{\mathrm{ia}}}{(\mathrm{b}-\mathrm{a})}\), where \(y_{ia}\) and \(y_{ib}\) are the values immediately surrounding \(y_{ik}\). However, these methods have different strategies for dealing with monotone missing values. The interpolation-LOCF method uses the LOCF or NOCB methods for this purpose. In the interpolation-global method, a monotone missing value is replaced by drawing a line linking the first and last observed values. In the interpolation-local method, monotone missing values at the beginning of the trajectory are imputed by drawing a line linking the first and second non-missing values, and monotone missing values at the end of the trajectory are replaced by drawing a line linking the last and penultimate non-missing values. Finally, the interpolation-bisector method provides an intermediate between interpolation-global and interpolation-local: the imputed values are chosen on the bisector of the two lines.
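The interpolation step, with LOCF/NOCB handling of monotone gaps (i.e., the interpolation-LOCF variant), can be sketched in Python as follows (illustrative only; the paper's code is in R):

```python
import math

def interpolate_locf(traj):
    """Linear interpolation for intermittent gaps; LOCF at the end of a
    trajectory and NOCB at its start (the interpolation-LOCF variant)."""
    obs = [i for i, v in enumerate(traj) if not math.isnan(v)]
    out = traj[:]
    for k, v in enumerate(traj):
        if not math.isnan(v):
            continue
        prev = [i for i in obs if i < k]
        nxt = [i for i in obs if i > k]
        if prev and nxt:   # intermittent gap: line between surrounding values
            a, b = prev[-1], nxt[0]
            out[k] = traj[a] + (k - a) * (traj[b] - traj[a]) / (b - a)
        elif prev:          # monotone tail: LOCF
            out[k] = traj[prev[-1]]
        else:               # missing head: NOCB
            out[k] = traj[nxt[0]]
    return out
```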
Copymean methods
The copy-mean methods include copy-mean LOCF, copy-mean global, copy-mean local, and copy-mean bisector. The copy-mean LOCF method imputes missing observations in two steps. First, missing values are imputed using the LOCF method to provide an initial approximation. Then, the mean trajectory of the population is used to refine this initial approximation. Let:

\(({\overline{\mathrm{y}} }_{.1}, \dots ,{\overline{\mathrm{y}} }_{.\mathrm{t}})\): the mean trajectory of the population;

\(y_{id}\): the first missing value of the i-th trajectory;

\({\mathrm{y}}_{\mathrm{ik}}^{\mathrm{LOCF}}\): the imputed value for \(y_{ik}\) using the LOCF method, for all time points k ≥ d;

\(\left({\overline{y} }_{.1}^{\mathrm{LOCF}}, \dots ,{\overline{\mathrm{y}} }_{.\mathrm{t}}^{\mathrm{LOCF}}\right)\): the mean trajectory of the population after imputing the missing values with the LOCF method;

\(AV_k\): the average variation at the k-th time point, equal to \({\overline{y} }_{.k}-{\overline{\mathrm{y}} }_{.\mathrm{k}}^{\mathrm{LOCF}}\).
The missing value \(y_{ik}\) is then obtained in the copy-mean LOCF method by adding \(AV_k\) to the LOCF-imputed value: \({\mathrm{y}}_{\mathrm{ik}}^{\mathrm{LOCF}}+{\mathrm{AV}}_{\mathrm{k}}\).
The computational strategies of copy-mean local, copy-mean global, and copy-mean bisector are similar to that of copy-mean LOCF, except that these methods use the interpolation-local, interpolation-global, and interpolation-bisector methods, respectively, to provide the initial approximation of the missing values.
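A Python sketch of the two-step copy-mean LOCF procedure (illustrative only; it assumes each trajectory starts with an observed value so that LOCF is well defined):

```python
import math

def col_mean(data, k):
    """Mean of the observed values at time point k."""
    vals = [row[k] for row in data if not math.isnan(row[k])]
    return sum(vals) / len(vals)

def locf(traj):
    """Carry the last observed value forward through missing entries."""
    out = traj[:]
    for k in range(1, len(out)):
        if math.isnan(out[k]):
            out[k] = out[k - 1]
    return out

def copy_mean_locf(data):
    """Copy-mean LOCF: impute by LOCF, then shift each imputed value by
    AV_k, the gap between the observed population mean trajectory and
    the mean trajectory of the LOCF-completed data at time k."""
    t = len(data[0])
    locf_data = [locf(row) for row in data]
    av = [col_mean(data, k) - col_mean(locf_data, k) for k in range(t)]
    out = []
    for row, lrow in zip(data, locf_data):
        out.append([lrow[k] + av[k] if math.isnan(row[k]) else row[k]
                    for k in range(t)])
    return out
```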
MI approaches to impute missing values in longitudinal data
MI approaches were proposed by Rubin in 1987 [51] and are flexible and popular methods for imputing missing values based on three steps: an imputation step, an analysis step, and a pooling step [52, 53]. MI approaches use a Bayesian strategy, where posterior estimation is conducted using the Markov chain Monte Carlo (MCMC) method. These approaches can be divided into the two classes, JM and FCS, described in the subsequent sections.
In the imputation step, missing data are estimated several times by sampling through their posterior predictive distribution given the observed data and the parameter values of the imputation model. In the next step, multiple complete data sets produced via the imputation step are analyzed using the statistical model of interest. Finally, the results obtained from the analysis step, such as the estimates of regression coefficients with their standard error and criteria of predictive performance, are pooled using Rubin's rules to account for the uncertainty of the imputation [3, 51]. This process is shown graphically in Figure S1 (supplementary file).
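The pooling step can be illustrated for a single scalar parameter. The sketch below applies Rubin's rules, combining the within- and between-imputation variance components (`rubins_rules` is a hypothetical helper written for illustration, not a function from the cited packages):

```python
def rubins_rules(estimates, variances):
    """Pool point estimates and within-imputation variances from m
    imputed datasets: pooled estimate = mean of the estimates; total
    variance = within-variance + (1 + 1/m) * between-variance."""
    m = len(estimates)
    qbar = sum(estimates) / m                       # pooled estimate
    ubar = sum(variances) / m                       # within-imputation variance
    b = sum((q - qbar) ** 2 for q in estimates) / (m - 1)  # between-imputation
    total = ubar + (1 + 1 / m) * b
    return qbar, total
```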
Generating more than ten datasets positively affects statistical power [54]; Enders (2010) argues that generating 20 imputed datasets is appropriate [55]. Based on this, we generated 20 imputed datasets for the missing data in the TCGS study using MI approaches.
MI approaches have an attractive property because auxiliary variables can be included in the imputation step without being in the analysis step. Auxiliary variables provide information about missing data and improve the quality of missing data imputation [45]. Researchers also can use a larger number of these variables to reduce the negative effects of MNAR [26].
The MI approaches are based on the MAR assumption, and the inclusion of auxiliary variables in the imputation step raises the plausibility of this assumption. In these imputation approaches, the imputation model should include all the variables used in the analysis step; otherwise, the analysis may generate biased estimates. In real applications, imputation models with many variables can lead to problems such as multicollinearity and non-convergence [26].
In the present study, we used a number of MI approaches available in R software for missing data imputation; these approaches include joint multivariate LMM (JM-MLMM) [4], joint multivariate LMM with heteroskedastic covariance matrices across all clusters (JM-MLMM-het) [56], full joint (JM-FJ) [57], the substantive model compatible joint modelling approach (JM-SMC) [58], the substantive model compatible joint modelling approach with heteroskedastic covariance matrices across all clusters (JM-SMC-het) [59], FCS-LMM [26], FCS-LMM with heteroskedastic residual variance across all clusters (FCS-LMM-het) [26], FCS-LMM-latent normal (FCS-LMM-LN) [60], FCS-LMM-LN with heteroskedastic residual variance across all clusters (FCS-LMM-LN-het) [60], FCS-MLMM with latent normal variables (FCS-MLMM-LN) [22], hierarchical multiple imputation (hmi) [23], and joint analysis and imputation (JointAI) [61]. Each approach is explained in the following sections. In addition, we mention the number of iterations, the burn-in period, and the convergence criteria for each MI approach.
JM-MLMM method
Schafer and Yucel (2002) introduced the JM-MLMM method for imputing missing values in longitudinal data using joint multivariate linear mixed models instead of treating time-dependent variables as distinct variables. In this method, qualitative variables are imputed as continuous or dummy variables. The JM-MLMM method also assumes that the random effects follow a normal distribution with constant covariance matrices across all subjects (clusters) [4].
JM-FJ method
In real applications, longitudinal data with missing values may be a mixture of qualitative and quantitative variables, so the normality assumption for these incomplete variables may be unrealistic. Goldstein et al. (2009) suggested the JM-MLMM-LN method, based on the JM-MLMM method, which includes latent normal (LN) variables for imputing a mixture of normal and non-normal variables [57]. Asparouhov and Muthen (2010) proposed the JM-FJ method, based on the JM-MLMM-LN method, to impute missing values in longitudinal data using all variables in the imputation process as outcome variables.
These two methods and the JM-MLMM method are implemented in the mitml package [19]. A potential scale reduction criterion near 1 (< 1.05) for all parameters, together with diagnostic plots, was used to assess convergence. If the potential scale reduction criterion was larger than 1.05, the number of burn-in iterations was increased.
JM-SMC and JM-SMC-het methods
Goldstein et al. (2014) extended the JM-MLMM-LN method to the JM-SMC method by defining the joint imputation model as the product of the analysis model and the joint distribution of the variables [58]. The JM-SMC method can also accommodate cluster-specific covariance matrices; this variant is defined as the JM-SMC-het method. These methods use diagnostic plots to assess convergence.
In the present study, all JM approaches were run for 1000 iterations with a burn-in period of 5000 iterations to establish the stability of the parameter distributions.
FCS-LMM and FCS-LMM-het methods
FCS-LMM is an FCS adaptation of the JM-MLMM method proposed by van Buuren et al. (2011). This method fits a multilevel LMM to impute the missing values of each incomplete variable conditional on the other variables, cycling iteratively through the univariate imputation models. In this method, qualitative variables are imputed as continuous or dummy variables. The FCS-LMM method assumes normal distributions for all variables with missing values and a fixed residual variance across all subjects [26]. Van Buuren (2011) extended the FCS-LMM method to the FCS-LMM-het method to deal with heteroskedastic residual variance across clusters [14].
FCS-LMM-LN and FCS-LMM-LN-het methods
Enders et al. (2017) suggested the FCS-LMM-LN method by extending the FCS-LMM method to LN variables [60]. This method imputes missing data using a value randomly selected from the observed values whose predicted means, based on the LMM, are nearest to that of the missing data point. In the FCS-LMM-LN-het method, the continuous variables with missing values are imputed using an LMM with heteroskedastic residual variance across clusters.
FCS-MLMM-LN method
Audigier and Resche-Rigon modified the JM-MLMM-LN approach to impute missing observations within an FCS framework where only one variable is considered missing at a time [22]. At each step, all of the variables in the imputation model are treated as outcomes (one variable is treated as incomplete and the rest are considered complete). In this approach, incomplete binary and categorical variables are imputed using latent normal variables, as in the JM-MLMM-LN method.
All of the FCS approaches mentioned were run for 20 iterations with a burn-in period of 5 iterations to establish the stability of the parameter distributions (in these approaches, the estimates can converge within 5 or 10 iterations). In addition, diagnostic graphs were used to examine convergence [26].
Nonparametric longitudinal analysis method
The longitudinal tree-based methods are nonparametric methods for analyzing longitudinal data. Medical studies have used these methods to determine disease risk factors and identify high- and low-risk subgroups of patients [62, 63] by extracting homogeneous subgroups of observations that can be appropriately used for subgroup analysis [64]. Since most studies evaluating longitudinal changes in the outcome variable are conducted in the context of a heterogeneous population, traditional parametric longitudinal models might not provide a good fit and could potentially result in biased estimates. In addition, the actual values of the model parameters may differ between homogeneous subgroups. Because tree-based models can extract homogeneous subgroups of observations and estimate heterogeneous treatment effects, they may be better positioned to assist the clinician in decision-making [65, 66].
Unlike traditional parametric longitudinal models, these methods do not require assumptions about the functional form of the data and are robust to outliers and multicollinearity. They can accommodate non-linear relationships and high-order interactions. Monotone transformations of the predictor variables have no effect on the results. The interpretation of tree methods is straightforward because the results are shown graphically [67,68,69,70].
The classification and regression tree (CART) algorithm is the best-known tree algorithm for cross-sectional data modeling [71]. Sela et al. (2012) extended this tree algorithm to longitudinal data by combining the LMM and the CART algorithm. This longitudinal regression tree algorithm is named the random effects expectation–maximization (REEM) tree algorithm [37].
As described in the previous section, the LMM uses a parametric linear form for the fixed effects; this form cannot easily handle complex non-linear relationships or datasets with very large numbers of predictor variables. The REEM tree algorithm solves this problem by using a nonparametric method, the CART algorithm, to estimate the fixed effects. The estimation method of REEM is as follows:

1. Set the initial values of \({\widehat{b}}_{i}\) to zero.

2. Run the following steps until \({\widehat{b}}_{i}\) converges (convergence is established when the change in the likelihood or restricted likelihood is smaller than a predetermined tolerance value, e.g. 0.001):

a) Fit a regression tree with the CART algorithm to obtain an initial approximation of \(f\), using the adjusted response variable \({y}_{it}-{Z}_{it}{\widehat{b}}_{i}\) and the predictor variables \({x}_{it}=({x}_{it1}, \dots , {x}_{itK})\), for \(i = 1, \dots, I\) and \(t = 1, \dots, {T}_{i}\). This regression tree generates a set of indicator variables \(\mathrm{I}({x}_{it}\in {\mathrm{g}}_{\mathrm{p}})\), where \({\mathrm{g}}_{\mathrm{p}}\) ranges over all terminal nodes of the tree.

b) Fit the LMM \({y}_{it}={Z}_{it}{b}_{i}+\sum_{p}\mathrm{I}\left({x}_{it}\in {\mathrm{g}}_{\mathrm{p}}\right){\mu }_{p}+{\varepsilon }_{it}\) and estimate \({\widehat{b}}_{i}\) from the fitted model.

3. Replace the predicted response at each terminal node of the tree with the estimate \({\widehat{\mu }}_{p}\) from the LMM fitted in step 2b.
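The alternating structure of this algorithm can be shown with a deliberately simplified Python sketch: a depth-1 tree (stump) stands in for the CART fit, and each random intercept is re-estimated as the subject-mean residual instead of via a full LMM fit, so there is no shrinkage. This only illustrates the EM-style loop; it does not reproduce the REEM tree of Sela et al.:

```python
import numpy as np

def fit_stump(x, y):
    """Best single split on a 1-D predictor (a depth-1 regression tree).
    Assumes x has at least two distinct values."""
    best = (np.inf, None, y.mean(), y.mean())
    for c in np.unique(x)[:-1]:
        left, right = y[x <= c], y[x > c]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best[0]:
            best = (sse, c, left.mean(), right.mean())
    return best[1:]  # (cut, left_mean, right_mean)

def predict_stump(stump, x):
    cut, lm, rm = stump
    return np.where(x <= cut, lm, rm)

def reem_stump(x, y, subject, n_iter=50, tol=1e-6):
    """Simplified REEM-style loop with a random intercept per subject:
    alternate a stump fit on y minus the current intercepts with a
    subject-mean-residual update of the intercepts."""
    subjects = np.unique(subject)
    b = np.zeros(len(subjects))
    stump = None
    for _ in range(n_iter):
        bi = b[np.searchsorted(subjects, subject)]   # intercept per row
        stump = fit_stump(x, y - bi)                 # step 2a (tree fit)
        resid = y - predict_stump(stump, x)
        new_b = np.array([resid[subject == s].mean() for s in subjects])
        converged = np.max(np.abs(new_b - b)) < tol  # step 2 stopping rule
        b = new_b
        if converged:
            break
    return stump, dict(zip(subjects, b))
```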
Simulation study
The performance of the various imputation approaches was compared using simulated data. We generated 1000 datasets, each of which included 1000 individuals, mimicking the TCGS data. In each simulated dataset, variables were generated as follows (all parameters in the data-generating models were estimated from the original data to ensure that the simulated datasets were comparable to the real data example):

1. The sex variable was generated from a binomial distribution with probability 0.5.

2. Age at the first wave was generated from a truncated normal distribution with minimum = 1, maximum = 84, mean = 39.34, and standard deviation = 16.23. Age at the other waves was generated as follows:

3. The main predictor variable (BMI) at each wave was generated based on age and sex, together with an individual-level random effect and individual-level noise in each wave:

where \({\mathrm{\varnothing }}_{0\mathrm{i}}\sim N(0, 4.50)\) is the random intercept and \({\mathrm{\varnothing }}_{\mathrm{ij}}\sim N(0, 1.57)\) is the residual error.

4. The outcome variable, DBP, at each wave was generated using the following linear process:

where \({\gamma }_{0\mathrm{i}}\sim N(0, 6.35)\) is the random intercept and \({\upgamma }_{\mathrm{ij}}\sim N(0, 6.88)\) is the residual error.
After generating the simulated data, missingness for some observations of BMI and DBP was generated based on the MAR mechanism, using the following equations:
The parameters \(\beta_{0j}\), \(\beta_{1j}\), \(\beta_{2j}\), \(\varphi_{0j}\), \(\varphi_{1j}\), and \(\varphi_{2j}\) were determined based on the TCGS data to ensure a similar proportion of missing data for each variable at each wave; the proportions in both the TCGS study and the simulation study are shown in Table 1. These parameters are as follows:
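An MAR missingness step of this kind can be sketched as a logistic model on fully observed covariates; the coefficients below are illustrative placeholders, not the parameters estimated from the TCGS data:

```python
import math
import random

def mar_prob(age, sex, beta0=-3.0, beta1=0.03, beta2=0.5):
    """Probability of missingness from a logistic model on fully
    observed covariates (an MAR mechanism). The coefficients are
    illustrative placeholders, not the paper's estimates."""
    lp = beta0 + beta1 * age + beta2 * sex
    return 1.0 / (1.0 + math.exp(-lp))

def apply_mar(values, covariates, prob=mar_prob):
    """Mask each value with its covariate-dependent probability,
    returning a copy with NaN in place of the masked entries."""
    out = []
    for v, (age, sex) in zip(values, covariates):
        out.append(float("nan") if random.random() < prob(age, sex) else v)
    return out
```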
Criteria for comparing the performance of missing data imputation methods
The performance of the imputation approaches under the MAR mechanism, and of the statistical methods applied to the real and simulated data sets, was compared by evaluating the standard errors of the regression coefficients (SE), MSE, RMSE, MAD, deviance, and AIC. Smaller values of SE, MSE, RMSE, MAD, deviance, and AIC indicate better performance.
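For clarity, the prediction criteria can be computed as follows (an illustrative Python sketch; MAD is read here as the median of the absolute prediction errors, which is one common reading of "median absolute deviation"):

```python
import math

def prediction_criteria(y_true, y_pred):
    """MSE, RMSE, and MAD (median of the absolute errors) for
    comparing imputation/analysis pipelines on their predictions."""
    errors = [yt - yp for yt, yp in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / len(errors)
    abs_err = sorted(abs(e) for e in errors)
    m = len(abs_err)
    mad = (abs_err[m // 2] if m % 2
           else (abs_err[m // 2 - 1] + abs_err[m // 2]) / 2)
    return {"MSE": mse, "RMSE": math.sqrt(mse), "MAD": mad}
```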
Software programs
R software was used for imputing the missing longitudinal data and for data analysis; the R packages used are listed in Table 2. The R code of the SI and MI approaches for missing data imputation and data simulation is available at https://github.com/MinaJahangiri/Rcodesofmissingimputationmethods.
Results
The study's variables with missing values are BMI (a predictor variable) and DBP/SBP (the outcome variables). Figure 2 shows the frequency of missing values for these variables at each phase of the TCGS study. The descriptive statistics of the TCGS data are shown in Table S2 (supplementary file). The percentage of missing values in particular combinations of variables, based on the long data format, is shown visually in Figure S2 (supplementary file). Seventy-six percent of observations have no missing values, and 21% have missing values for BMI, DBP, and SBP simultaneously. The missing data pattern of the TCGS data indicated that both monotone and non-monotone patterns were present. The formal MCAR test indicated that the MCAR assumption is not reasonable (P < 0.001, so the MCAR assumption is rejected at a significance level of 0.05).
After the imputation of missing values, the data modeling results based on the LMM and the REEM longitudinal tree algorithm for the two outcome variables (DBP and SBP) are shown in Tables 3, 4, 5 and 6. According to Tables 3 and 5, all SI approaches except cross-median and cross-hot deck performed similarly with respect to the estimates of the regression coefficients and their standard errors.
The results shown in Tables 3, 4, 5 and 6 indicated that parametric longitudinal models are not appropriate for analyzing the TCGS data. There appear to be non-linear relationships (for the BMI variable), and diagnostic plots (standardized residuals versus the BMI variable, the quantile–quantile (Q–Q) plot of residuals, and standardized residuals versus fitted values for the linear mixed-effects model) show clearly that the assumptions of homoscedasticity and normality of the residuals do not hold. These assumptions for the LMM appear to have been violated regardless of the SI approach used; as an illustration, we show the diagnostic plots for the traj-mean method in Figures S3, S4, S5, S6, S7 and S8 (supplementary file). We compared the parametric and nonparametric models using SI approaches, and the parametric longitudinal models resulted in larger MSE, RMSE, and MAD than the longitudinal regression tree algorithm (Tables 3, 4, 5 and 6). Given these advantages of the nonparametric model, the comparison of the SI and MI imputation approaches is only reported based on the REEM tree algorithm.
When comparing SI approaches used in conjunction with the REEM tree algorithm, the trajmean method performed best (lowest MSE, RMSE, MAD, and deviance), and cross-hot-deck performed worst (Tables 4 and 6). The tree structures of the REEM tree algorithm, using trajmean to impute missing values for the two outcome variables DBP and SBP, are shown in Figures S9 and S10 (supplementary file).
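The trajmean idea, replacing a subject's missing measurement with the mean of that subject's own observed trajectory, is simple to express. A minimal pandas sketch (the paper's actual implementation uses R's longitudinalData package; the toy values here are invented) contrasts it with cross-sectional mean imputation, which instead pools other subjects at the same wave:

```python
import numpy as np
import pandas as pd

# Toy long-format data: one row per subject-wave (values are illustrative).
df = pd.DataFrame({
    "Id":   [1, 1, 1, 2, 2, 2],
    "wave": [1, 2, 3, 1, 2, 3],
    "SBP":  [120.0, np.nan, 130.0, 110.0, 112.0, np.nan],
})

# Trajectory-mean (trajmean) imputation: fill a missing value with the
# mean of the same subject's observed measurements.
df["SBP_trajmean"] = df.groupby("Id")["SBP"].transform(
    lambda s: s.fillna(s.mean())
)

# Cross-sectional mean imputation, for contrast: fill with the mean of
# all subjects observed at the same wave.
df["SBP_cross"] = df.groupby("wave")["SBP"].transform(
    lambda s: s.fillna(s.mean())
)
print(df)
```

For subject 1 at wave 2, trajmean gives (120 + 130) / 2 = 125, while the cross-sectional fill uses the wave-2 mean of the other subjects.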
Density plots of the observed and imputed data for the incomplete variables BMI, DBP, and SBP, obtained with the mice package, are shown in Figures S11, S12, S13 and S14 (supplementary file). Figure S15 (supplementary file) also demonstrates the trace line plots of the mean and standard deviation of the imputed values against the iteration number for each replication. These trace lines are intermingled without any particular trend, so the estimation appears to have converged. Due to space limitations, trace line plots for the other FCS approaches are not shown, though these plots also indicated convergence. In addition, the R-hat statistic of the means and variances for all incomplete variables based on the FCS-LMM-LN and FCS-LMM-LN-het methods is near one, so the convergence of these methods is also established (Table S3 in the supplementary file).
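The R-hat (potential scale reduction) criterion invoked here compares between-chain and within-chain variance; values near 1 indicate the parallel chains have mixed. A minimal numpy sketch of the classic Gelman-Rubin formula (a simplified version, without the split-chain or rank-normalization refinements used by modern samplers):

```python
import numpy as np

def rhat(chains: np.ndarray) -> float:
    """Potential scale reduction factor (Gelman-Rubin R-hat).

    chains: shape (m, n) -- m parallel chains, n draws each.
    Values near 1 suggest the chains have converged to a common distribution.
    """
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)          # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()    # mean within-chain variance
    var_plus = (n - 1) / n * W + B / n       # pooled variance estimate
    return float(np.sqrt(var_plus / W))

rng = np.random.default_rng(0)
mixed = rng.normal(0.0, 1.0, size=(4, 1000))   # well-mixed chains
stuck = mixed + np.arange(4)[:, None] * 3.0    # chains stuck at different levels
print(rhat(mixed), rhat(stuck))  # near 1 vs. clearly above 1
```

The intermingled trace lines described in the text correspond to the first case; separated, trending traces would push R-hat well above 1.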
When comparing FCS approaches, FCS-LMM-LN performed best and FCS-LMM/FCS-LMM-het performed worst (Tables 4 and 6). All JM approaches performed similarly (Tables 4 and 6). The diagnostic plots of the JM approaches indicate convergence, and the potential scale reduction criterion was near 1 for all parameters under the JM-FJ and JM-MLMM methods. Due to space limitations, diagnostic plots are included only for the JointAI and JM-MLMM methods, in Figures S16, S17, S18, S19 and S20 (supplementary file).
The simulation results were consistent with the real data analyses. The longitudinal regression tree algorithm performed better than the LMM for analyzing the simulated data under the missing at random (MAR) mechanism. In addition, the SI trajmean method performed better (lowest MSE, RMSE, and MAD) than the other imputation approaches (Tables 7 and 8). We did not assess bias because the longitudinal tree algorithm, unlike the LMM, does not produce regression coefficient estimates; instead, we based our evaluation of the methods on prediction performance.
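The prediction criteria themselves are straightforward to compute. A minimal Python sketch (MAD is implemented here as the median absolute deviation of the prediction errors; the paper does not spell out its exact convention, so treat this as one reasonable reading):

```python
import numpy as np

def prediction_metrics(y_true, y_pred):
    """MSE, RMSE, and MAD -- the criteria used to compare prediction quality."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    rmse = np.sqrt(mse)
    # Median absolute deviation of the errors around their median
    # (assumed convention -- robust to a few badly predicted points).
    mad = np.median(np.abs(err - np.median(err)))
    return {"MSE": mse, "RMSE": rmse, "MAD": mad}

# Toy example: three observed vs. predicted blood pressure values.
m = prediction_metrics([120, 130, 110], [118, 133, 111])
print(m)
```

Because MAD is median-based, it is far less sensitive than MSE/RMSE to occasional large errors, which is why reporting all three gives a fuller picture of each imputation-model combination.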
Discussion
Missing values are a significant problem in longitudinal studies, and managing this problem is essential. In the current study, we compared the performance of SI and MI approaches to impute longitudinal missing data in the context of using LMM and the REEM tree algorithm for data modelling. Previous studies have compared the performance of MI approaches when the statistical model of interest is a parametric longitudinal model; the performance of MI approaches when the statistical model of interest is a nonparametric longitudinal model is less well understood.
The current study provides a comprehensive assessment of imputation approaches for handling missing data in the TCGS dataset and in simulated data under the MAR mechanism. To this end, we compared the performance of 16 SI approaches and 12 MI approaches when fitting the REEM tree algorithm and the LMM to assess the association between DBP/SBP and predictor variables such as age, gender, and BMI. We also focused on the relevant R packages and provide R code for data modeling after using the SI and MI approaches, as well as for simulating missing longitudinal data.
The real and simulated data results suggest that the REEM tree algorithm can perform better than parametric longitudinal models. Tree algorithms have several advantages over parametric longitudinal models, and we propose that researchers use these methods in future longitudinal studies. These algorithms can accommodate large data sets, nonlinear relationships, and interactions, and can extract homogeneous subgroups of the data. Their interpretation is straightforward because the result is displayed graphically, and they are robust to multicollinearity and outliers. These algorithms are also invariant to monotone transformations of the independent variables and do not require additional distributional assumptions [67, 68, 72,73,74,75].
Generally, the comparisons of imputation methods indicated little difference between them. However, one SI approach (trajmean) performed best among all imputation approaches when fitting the REEM tree algorithm for both outcome variables, DBP and SBP.
In addition, we evaluated the computational time of the imputation approaches. JM approaches are much more resource-intensive than FCS approaches and may not be practicable in longitudinal studies with many clusters and predictor variables with high missing rates. For FCS approaches, convergence of the estimates can occur within 5 or 10 iterations, whereas JM approaches require many more iterations to converge; FCS approaches are therefore computationally faster than JM approaches. Overall, MI approaches require more computing resources than SI approaches, so in real applications with many clusters (e.g., TCGS), SI approaches are likely to be more cost-effective in terms of computational time. The SI approaches are not based on maximum likelihood estimation and are classified as nonparametric imputation approaches; they are appropriate for nonparametric trajectories (e.g., the BMI trajectories in the TCGS data). Based on its computational time and performance, the SI trajmean method appears to be an excellent approach for dealing with missing observations in data similar to the TCGS data set.
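Another of the compared SI approaches, interpolation, also works within each subject's own trajectory: intermittent gaps are filled linearly between the nearest observed waves. A one-trajectory pandas sketch (values are invented; the paper's R implementation lives in the longitudinalData package):

```python
import numpy as np
import pandas as pd

# One subject's BMI trajectory over six equally spaced waves (illustrative).
traj = pd.Series([24.0, np.nan, 26.0, np.nan, np.nan, 29.0],
                 index=[1, 2, 3, 4, 5, 6])

# Interpolation imputation: each intermittent gap is filled linearly
# between the nearest observed waves of the same subject.
filled = traj.interpolate(method="linear")
print(filled.tolist())  # [24.0, 25.0, 26.0, 27.0, 28.0, 29.0]
```

Like trajmean, this uses only the subject's own observed values, which is why such SI approaches scale so cheaply to many clusters; unlike trajmean, it preserves the within-subject trend rather than flattening gaps to a single mean.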
In the current study, we only assessed the MAR mechanism. Sensitivity analyses (selection models and pattern-mixture models) can be performed to assess the appropriateness of the MAR assumption; unfortunately, these models are unavailable for longitudinal data with missing values in both longitudinal quantitative predictor and outcome variables [17, 76,77,78].
Past studies compared MI approaches for fitting parametric longitudinal models, such as the LMM with random intercepts and the LMM with random intercepts and slopes, and indicated that all MI approaches provide consistent regression coefficients [16, 17]. Some studies also compared these imputation methods for missing data in the context of multilevel data and reached the same conclusion [15]. In addition, two studies comprehensively compared SI approaches for imputing monotone and non-monotone missing data in longitudinal studies; unlike in the current study, the copy-mean method was more effective than the other SI approaches [34]. Like these two studies, Zhang (2016) also indicated that the copy-mean method performed better for imputing missing values [79].
To the best of our knowledge, the present study is the first to compare SI and MI approaches for imputing longitudinal missing data with many time points, clusters, and missing values using both real and simulated data. However, the present study has two limitations: the violation of the parametric longitudinal model assumptions, and the computational time of the simulation study. We used SI approaches and a nonparametric longitudinal model (the REEM tree algorithm) to mitigate these limitations. Future studies could impute missing longitudinal data with nonparametric imputation methods that use a multivariate skew-normal distribution for the random effects; in addition, when time intervals are unequal, functional data analysis could be helpful.
Conclusion
The results of this study should be generalized to other data sets with different characteristics only with caution. Imputation methods can perform differently on different data sets, and conditions such as the missing data mechanism or the rate of missingness might lead analysts to prefer different imputation options. We therefore recommend that researchers apply all imputation methods (SI and MI) in the context of fitting their statistical models, and then select the imputation method that demonstrates the best performance based on the criteria highlighted in this paper.
Availability of data and materials
The datasets analyzed during the current study are not publicly available because they contain information that could compromise the privacy of research participants, but they are available from the corresponding authors on reasonable request. However, the R code of the SI and MI approaches for missing data imputation, and for the data simulation, is available at https://github.com/MinaJahangiri/Rcodesofmissingimputationmethods.
Abbreviations
DBP: Diastolic blood pressure
SBP: Systolic blood pressure
BMI: Body mass index
JM: Joint modelling
FCS: Fully conditional specification
JMMVN: Joint multivariate normal imputation
SI: Single imputation
MI: Multiple imputation
LMM: Linear mixed-effects model
REEM: Random effects expectation–maximization
TCGS: Tehran cardiometabolic genetic study
References
Fitzmaurice GM, Laird NM, Ware JH. Applied longitudinal analysis, vol. 998: John Wiley & Sons; 2012.
Chen SX, Zhong PS. ANOVA for longitudinal data with missing values. 2010.
Little RJ, Rubin DB. Statistical analysis with missing data, vol. 793: John Wiley & Sons; 2019.
Schafer JL, Yucel RM. Computational strategies for multivariate linear mixedeffects models with missing values. J Comput Graph Stat. 2002;11(2):437–57.
Van Buuren S, Brand JP, GroothuisOudshoorn CG, Rubin DB. Fully conditional specification in multivariate imputation. J Stat Comput Simul. 2006;76(12):1049–64.
Fox MJ. Package ‘norm.’ 2013.
Schafer JL, Tusell MF. Package ‘cat.’ 2012.
Kowarik A, Templ M. Imputation with the R Package VIM. J Stat Softw. 2016;74(1):1–16.
van Buuren S, GroothuisOudshoorn K, Robitzsch A, Vink G, Doove L, Jolani S. Package ‘mice’. Computer software. 2015.
Gelman A, Hill J, Su YS, Yajima M, Pittau M, Goodrich B, Si Y, Kropko J, Goodrich MB. Package ‘mi’. R CRAN R Foundation for Statistical Computing. 2015.
Husson F, Josse J, Husson MF, FactoMineR I. Package ‘missMDA.’ methods. 2013;153(2):79–99.
Lüdtke O, Robitzsch A, Grund S. Multiple imputation of missing data in multilevel designs: A comparison of different strategies. Psychol Methods. 2017;22(1):141.
Enders CK, Mistler SA, Keller BT. Multilevel multiple imputation: A review and evaluation of joint modeling and chained equations imputation. Psychol Methods. 2016;21(2):222.
Van Buuren S. Multiple imputation of multilevel data: Routledge; 2011.
Wijesuriya R, MorenoBetancur M, Carlin JB, Lee KJ. Evaluation of approaches for multiple imputation of threelevel data. BMC Med Res Methodol. 2020;20(1):1–15.
Huque MH, Carlin JB, Simpson JA, Lee KJ. A comparison of multiple imputation methods for missing data in longitudinal studies. BMC Med Res Methodol. 2018;18(1):1–16.
Huque MH, MorenoBetancur M, Quartagno M, Simpson JA, Carlin JB, Lee KJ. Multiple imputation methods for handling incomplete longitudinal and clustered data where the target analysis is a linear mixed effects model. Biom J. 2020;62(2):444–66.
Quartagno M, Carpenter J, Quartagno MM, BaBooN S. Package ‘jomo.’ 2020.
Grund S, Robitzsch A, Luedtke O, Grund MS. Package ‘mitml.’ 2019.
Robitzsch A, Grund S, Henke T, Robitzsch MA. Package ‘miceadds.’ R Package: Madison; 2017.
Grund S, Lüdtke O, Robitzsch A. Multiple imputation of multilevel missing data: An introduction to the R package pan. SAGE Open. 2016;6(4):2158244016668220.
Audigier V, RescheRigon M. micemd: multiple imputation by chained equations with multilevel data. R package version 160. 2019.
Speidel M, Drechsler J, Jolani S. R package hmi: a convenient tool for hierarchical multiple imputation and beyond. In: IABDiscussion Paper; 2018.
Erler NS, Rizopoulos D, Lesaffre EM. JointAI: joint analysis and imputation of incomplete data in R. 2019. arXiv preprint arXiv:1907.10867.
Genolini C, Falissard B, Fang D, Tierney L, Genolini MC. Package ‘longitudinalData.’ 2016.
Van Buuren S, GroothuisOudshoorn K. mice: Multivariate imputation by chained equations in R. J Stat Softw. 2011;45(1):1–67.
Muthen Linda K, Muthen Bengt O. Mplus: Statistical Analysis with Latent Variables. Los Angeles: Muthen & Muthen; 2007.
Royston P, White IR. Multiple imputation by chained equations (MICE): implementation in Stata. J Stat Softw. 2011;45(4):1–20.
Welch C, Bartlett J, Petersen I. Application of multiple imputation using the twofold fully conditional specification algorithm in longitudinal clinical data. Stand Genomic Sci. 2014;14(2):418–31.
Keller BT, Enders CK. Blimp Software Manual (Version Beta 6.7). Los Angeles. 2017.
Bartlett J. REALCOMIMPUTE: Stata module to export and import data to the realcomImpute software package. 2018.
Mistler SA. A SAS macro for applying multiple imputation to multilevel data. In: Proceedings of the SAS Global Forum: 2013: Citeseer; 2013:1–8.
Charlton C, Michaelides D, Cameron B, Szmaragd C, Parker R, Yang H. StatJR software. 2012.
Genolini C, JacqminGadda H. Copy mean: a new method to impute intermittent missing values in longitudinal studies. Open J Stat. 2013;3(04):26.
Genolini C, Lacombe A, Écochard R, Subtil F. CopyMean: a new method to predict monotone missing values in longitudinal studies. Comput Methods Programs Biomed. 2016;132:29–44.
De Silva AP, MorenoBetancur M, De Livera AM, Lee KJ, Simpson JA. A comparison of multiple imputation methods for handling missing values in longitudinal data in the presence of a timevarying covariate with a nonlinear association with time: a simulation study. BMC Med Res Methodol. 2017;17(1):1–11.
Sela RJ, Simonoff JS. REEM trees: a data mining approach for longitudinal and clustered data. Mach Learn. 2012;86(2):169–207.
Akbarzadeh M, Moghimbeigi A, Mahjub H, Soltanian AR, Daneshpour M, Morris N. Trajectories of change in obesity among tehranian families: multilevel latent growth curve modeling. Int J Fam Med. 2016;2016.
Akbarzadeh M, Moghimbeigi A, Morris N, Daneshpour MS, Mahjub H, Soltanian AR. A Bayesian structural equation model in general pedigree data analysis. Stat Analysis Data Mining. 2019;12(5):404–11.
Daneshpour MS, Hedayati M, SedaghatiKhayat B, Guity K, Zarkesh M, Akbarzadeh M, et al. Genetic identification for noncommunicable disease: Findings from 20 years of the Tehran Lipid and Glucose Study. Int J Endocrinol Metab. 2018;16(4 Suppl).
Zahedi AS, Akbarzadeh M, SedaghatiKhayat B, Seyedhamzehzadeh A, Daneshpour MS. GCKR common functional polymorphisms are associated with metabolic syndrome and its components: a 10year retrospective cohort study in Iranian adults. Diabetol Metab Syndr. 2021;13(1):1–10.
SedaghatiKhayat B, Barzin M, Akbarzadeh M, Guity K, Fallah MS, Pourhassan H, et al. Lack of association between FTO gene variations and metabolic healthy obese (MHO) phenotype: Tehran Cardiometabolic Genetic Study (TCGS). Eat Weight Disord Stud Anorexia Bulimia Obes. 2020;25(1):25–35.
Kolifarhood G, Daneshpour M, Hadaegh F, Sabour S, Mozafar Saadati H, Akbar Haghdoust A, et al. Heritability of blood pressure traits in diverse populations: a systematic review and metaanalysis. J Hum Hypertens. 2019;33(11):775–85.
Kolifarhood G, Daneshpour MS, Khayat BS, Saadati HM, Guity K, Khosravi N, et al. Generality of genomic findings on blood pressure traits and its usefulness in precision medicine in diverse populations: A systematic review. Clin Genet. 2019;96(1):17–27.
Pedersen AB, Mikkelsen EM, CroninFenton D, Kristensen NR, Pham TM, Pedersen L, et al. Missing data and multiple imputation in clinical epidemiological research. Clin Epidemiol. 2017;9:157.
Twisk JW. Applied longitudinal data analysis for epidemiology: a practical guide: Cambridge university press; 2013.
Little R. Selection and patternmixture models. Longitudinal data analysis: Chapman and Hall/CRC; 2008. p. 423–46.
Jamshidian M, Jalal S, Jansen C. MissMech: An R package for testing homoscedasticity, multivariate normality, and missing completely at random (MCAR). J Stat Softw. 2014;56(1):1–31.
Little RJ. A test of missing completely at random for multivariate data with missing values. J Am Stat Assoc. 1988;83(404):1198–202.
Ibrahim JG, Molenberghs G. Missing data methods in longitudinal studies: a review. TEST. 2009;18(1):1–43.
Rubin DB. Multiple imputation for nonresponse in surveys, vol. 81: John Wiley & Sons; 2004.
Sterne JA, White IR, Carlin JB, Spratt M, Royston P, Kenward MG, et al. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ. 2009;338.
Rezvan PH, Lee KJ, Simpson JA. The rise of multiple imputation: a review of the reporting and implementation of the method in medical research. BMC Med Res Methodol. 2015;15(1):1–14.
Graham JW, Olchowski AE, Gilreath TD. How many imputations are really needed? Some practical clarifications of multiple imputation theory. Prev Sci. 2007;8(3):206–13.
Enders CK. Applied missing data analysis: Guilford Publications; 2022.
Yucel RM. Random covariances and mixedeffects models for imputing multivariate multilevel continuous data. Stat Model. 2011;11(4):351–70.
Goldstein H, Carpenter J, Kenward MG, Levin KA. Multilevel models with multivariate mixed response types. Stat Model. 2009;9(3):173–97.
Goldstein H, Carpenter JR, Browne WJ. Fitting multilevel multivariate models with missing data in responses and covariates that may include interactions and nonlinear terms. J Royal Stat Soc Series A (Statistics in Society). 2014:553–64.
Quartagno M, Carpenter J. Multiple imputation for IPD metaanalysis: allowing for heterogeneity and studies with missing covariates. Stat Med. 2016;35(17):2938–54.
Enders CK, Keller BT, Levy R. A fully conditional specification approach to multilevel imputation of categorical and continuous variables. Psychol Methods. 2018;23(2):298.
Erler NS, Rizopoulos D, Jaddoe VW, Franco OH, Lesaffre EM. Bayesian imputation of timevarying covariates in linear mixed models. Stat Methods Med Res. 2019;28(2):555–68.
Camp NJ, Slattery ML. Classification tree analysis: a statistical tool to investigate risk factor interactions with an example for colon cancer (United States). Cancer Causes Control. 2002;13(9):813–23.
Jahangiri M, Khodadi E, Rahim F, Saki N, Saki Malehi A. Decision‐tree‐based methods for differential diagnosis of β‐thalassemia trait from iron deficiency anemia. Expert Syst. 2017;34(3).
Loh WY, He X, Man M. A regression tree approach to identifying subgroups with differential treatment effects. Stat Med. 2015;34(11):1818–33.
Kundu MG, Harezlak J. Regression trees for longitudinal data with baseline covariates. Biostatistics & epidemiology. 2019;3(1):1–22.
Eo SH, Cho H. Treestructured mixedeffects regression modeling for longitudinal data. J Comput Graph Stat. 2014;23(3):740–60.
Lemon SC, Roy J, Clark MA, Friedmann PD, Rakowski W. Classification and regression tree analysis in public health: methodological review and comparison with logistic regression. Ann Behav Med. 2003;26(3):172–81.
Malehi AS, Jahangiri M. Classic and Bayesian TreeBased Methods. In: Enhanced Expert Systems. edn.: IntechOpen; 2019.
Jahangiri M, Rahim F, Saki N, Saki Malehi A. Application of Bayesian Decision Tree in Hematology Research: Differential Diagnosis of βThalassemia Trait from Iron Deficiency Anemia. Comput Math Methods Med. 2021;2021.
Rahim F, Kazemnejad A, Jahangiri M, Malehi AS, Gohari K. Diagnostic performance of classification trees and hematological functions in hematologic disorders: an application of multidimensional scaling and cluster analysis. BMC Med Inform Decis Mak. 2021;21(1):1–13.
Breiman L, Friedman J, Stone CJ, Olshen RA. Classification and regression trees: CRC press; 1984.
De’ath G, Fabricius KE. Classification and regression trees: a powerful yet simple technique for ecological data analysis. Ecology. 2000;81(11):3178–92.
Speybroeck N, Berkvens D, MfoukouNtsakala A, Aerts M, Hens N, Van Huylenbroeck G, et al. Classification trees versus multinomial models in the analysis of urban farming systems in Central Africa. Agric Syst. 2004;80(2):133–49.
Feldesman MR. Classification trees as an alternative to linear discriminant analysis. Am J Phys Anthropol. 2002;119(3):257–75.
Chan KY, Loh WY. LOTUS: An algorithm for building accurate and comprehensible logistic regression trees. J Comput Graph Stat. 2004;13(4):826–52.
Rezvan PH, Lee KJ, Simpson JA. Sensitivity analysis within multiple imputation framework using deltaadjustment: application to longitudinal study of Australian Children. Longitudinal Life Course Stud. 2018;9(3):259–78.
MorenoBetancur M, Chavance M. Sensitivity analysis of incomplete longitudinal data departing from the missing at random assumption: Methodology and application in a clinical trial with dropouts. Stat Methods Med Res. 2016;25(4):1471–89.
Fiero MH, Hsu CH, Bell ML. A patternmixture model approach for handling missing continuous outcome data in longitudinal cluster randomized trials. Stat Med. 2017;36(26):4094–105.
Zhang Z. Missing data imputation: focusing on single imputation. Ann Transl Med. 2016;4(1).
Templ M, Alfons A, Kowarik A, Prantner B, Templ MM. Package ‘VIM’. 2021.
Bates D, Sarkar D, Bates MD, Matrix L. The lme4 package. R package version. 2007;2(1):74.
Sela RJ, Simonoff JS. REEM trees: a data mining approach for longitudinal and clustered data. Mach Learn. 2012;86:169–207.
Goldfeld K, WujciakJens J. simstudy: Illuminating research methods through data generation. J Open Source Softw. 2020;5(54):2763.
Acknowledgements
The authors would like to express their gratitude to the staff and participants in the TCGS project.
Funding
No funding was received for this study.
Author information
Authors and Affiliations
Contributions
Conceptualization: Mahdi Akbarzadeh, and Mina Jahangiri; Formal analysis: Mina Jahangiri, Mahdi Akbarzadeh, and Shayan Mostafaei; Methodology: Mina Jahangiri, Mahdi Akbarzadeh, Anoshirvan Kazemnejad, Maryam S Daneshpour, and Davood Khalili; Medical consultant: Maryam S Daneshpour; Software and data simulation: Mina Jahangiri, Keith Goldfeld, Mahdi Akbarzadeh, Shayan Mostafaei, and Mohammad Reza Moghadas; Writing original draft: Mina Jahangiri and Keith Goldfeld; Supervision: Anoshirvan Kazemnejad and Mahdi Akbarzadeh; All authors reviewed and accepted the manuscript.
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
The ethical committee approved this study at Research Institute for Endocrine Sciences; Shahid Beheshti University of Medical Sciences (Research Approval Code: 28778 & Research Ethical Code: IR.SBMU.ENDOCRINE.REC.1400.084). In this study, all participants provided written informed consent for participating in the study. This study has been performed in accordance with the Declaration of Helsinki.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1:
Figure S1. The process of the multiple imputation approach (e.g., the number of multiple imputed data sets is equal to 5).
Additional file 2:
Figure S2. Missing data pattern (percentage of missing values in a particular combination of variables based on the long data format) using the VIM package (blue color: observed values and red color: missing values).
Additional file 3:
Figure S3. The plot of standardized residuals versus fitted values for the linear mixed-effects model with random intercepts using the lme4 package based on the model: DBP ~ Age + Sex + BMI + Time + (1|Id) after using the trajmean method for the imputation of missing values of longitudinal data.
Additional file 4:
Figure S4. The plot of standardized residuals versus fitted values for the linear mixed-effects model with random intercepts using the lme4 package based on the model: SBP ~ Age + Sex + BMI + Time + (1|Id) after using the trajmean method for the imputation of missing values of longitudinal data.
Additional file 5:
Figure S5. The quantile-quantile plot of residuals for the linear mixed-effects model with random intercepts using the lme4 package based on the model: DBP ~ Age + Sex + BMI + Time + (1|Id) after using the trajmean method for the imputation of missing values of longitudinal data.
Additional file 6:
Figure S6. The quantile-quantile plot of residuals for the linear mixed-effects model with random intercepts using the lme4 package based on the model: SBP ~ Age + Sex + BMI + Time + (1|Id) after using the trajmean method for the imputation of missing values of longitudinal data.
Additional file 7:
Figure S7. Standardized residuals of the linear mixed-effects model with random intercepts versus the BMI variable using the lme4 package based on the model: DBP ~ Age + Sex + BMI + Time + (1|Id) after using the trajmean method for the imputation of missing values of longitudinal data.
Additional file 8:
Figure S8. Standardized residuals of the linear mixed-effects model with random intercepts versus the BMI variable using the lme4 package based on the model: SBP ~ Age + Sex + BMI + Time + (1|Id) after using the trajmean method for the imputation of missing values of longitudinal data.
Additional file 9:
Figure S9. The tree structure of the REEM tree algorithm based on the trajmean method to impute missing values, for extracting homogeneous subgroups of observations for diastolic blood pressure (DBP) using the REEMtree package. This tree algorithm extracted 7 homogeneous subgroups of observations; the lowest and highest subgroups were subjects with "BMI < 23.70 & age < 41.50" and subjects with "BMI ≥ 30.41 & age ≥ 43.50", respectively.
Additional file 10:
Figure S10. The tree structure of the REEM tree algorithm based on the trajmean method to impute missing values, for extracting homogeneous subgroups of observations for systolic blood pressure (SBP) using the REEMtree package. This tree algorithm extracted 9 homogeneous subgroups of observations; the lowest and highest subgroups were subjects with "BMI < 23.70 & age < 49.50" and subjects with "age ≥ 59.50 & BMI ≥ 26.50", respectively.
Additional file 11:
Figure S11. Density plots of the observed and imputed data for incomplete variables like BMI and DBP for each iteration (the number of iterations = 20) using the mice package (observed data: blue and imputed data: red).
Additional file 12:
Figure S12. Density plots of the observed and imputed data for the BMI variable for each iteration using the mice package (observed data: blue and imputed data: red).
Additional file 13:
Figure S13. Density plots of the observed and imputed data for the DBP variable for each iteration using the mice package (observed data: blue and imputed data: red).
Additional file 14:
Figure S14. Density plots of the observed and imputed data for the SBP variable for each iteration using the mice package (observed data: blue and imputed data: red).
Additional file 15:
Figure S15. The trace line plots of the mean and standard deviation of the imputed values against the iteration number for each replication using the mice package.
Additional file 16:
Figure S16. Trace plot using the JointAI package based on the model: DBP ~ Age + Sex + BMI + Time + (1|Id).
Additional file 17:
Figure S17. MC plot using the JointAI package based on the model: DBP ~ Age + Sex + BMI + Time + (1|Id).
Additional file 18:
Figure S18. Trace plot using the JointAI package based on the model: SBP ~ Age + Sex + BMI + Time + (1|Id).
Additional file 19:
Figure S19. MC plot using the JointAI package based on the model: SBP ~ Age + Sex + BMI + Time + (1|Id).
Additional file 20:
Figure S20. The convergence plot of the JM-MLMM method.
Additional file 21:
Table S1. The sample data for the first 20 individuals of TCGS (NA: missing value).
Additional file 22:
Table S2. Descriptive statistics for continuous variables of TCGS participants included in the present study at each phase (BMI: body mass index, DBP: diastolic blood pressure, SBP: systolic blood pressure, and SD: std. deviation).
Additional file 23:
Table S3. The R-hat statistic for the means and variances of all incomplete variables using the miceadds package for the FCS-LMM-LN and FCS-LMM-LN-het imputation methods.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Jahangiri, M., Kazemnejad, A., Goldfeld, K.S. et al. A wide range of missing imputation approaches in longitudinal data: a simulation study and real data analysis. BMC Med Res Methodol 23, 161 (2023). https://doi.org/10.1186/s12874023019688
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12874023019688