Generative adversarial networks for imputing missing data for big data clinical research

Dong, Weinan; Fong, Daniel Yee Tak; Yoon, Jin-sun; Wan, Eric Yuk Fai; Bedford, Laura Elizabeth; Tang, Eric Ho Man; Lam, Cindy Lo Kuen

doi:10.1186/s12874-021-01272-3

BMC Medical Research Methodology

Table 2 Imputation errors of different methods in HT-data


Variable	Skewness or proportion of minority class	MICE	missForest	GAIN
Missingness rate = 20%
*Continuous variables*
Age, years	−0.018	0.063 ± 0.002	0.049 ± 0.001 ^a,b	0.057 ± 0.004 ^a
SBP	0.492	0.075 ± 0.001	0.058 ± 0.000 ^a	0.048 ± 0.000 ^a,c
Charlson index	0.146	0.154 ± 0.002	0.121 ± 0.001 ^a,b	0.144 ± 0.003 ^a
TC/HDL-C ratio	3.139	0.175 ± 0.003	0.137 ± 0.001 ^a	0.115 ± 0.001 ^a,c
Hospital admission times	7.037	2.379 ± 0.069	1.885 ± 0.042 ^a	1.752 ± 0.141 ^a,c
*Categorical variables*
Smoking	7.45%	0.133 ± 0.007	0.123 ± 0.003 ^a	0.098 ± 0.010 ^a,c
Hypertensive drugs	8.10%	0.149 ± 0.006	0.126 ± 0.003 ^a	0.098 ± 0.002 ^a,c
Lipid Lowering drugs	9.99%	0.173 ± 0.007	0.159 ± 0.003 ^a	0.129 ± 0.006 ^a,c
Overweight	37.89%	0.433 ± 0.01	0.400 ± 0.005 ^a	0.359 ± 0.003 ^a,c
Sex	41.21%	0.448 ± 0.019	0.412 ± 0.004 ^a	0.405 ± 0.022 ^a
Missingness rate = 50%
*Continuous variables*
Age, years	−0.018	0.129 ± 0.002	0.102 ± 0.001 ^a	0.094 ± 0.007 ^a,c
SBP	0.492	0.115 ± 0.001	0.095 ± 0.001 ^a	0.080 ± 0.002 ^a
Charlson index	0.146	0.295 ± 0.001	0.239 ± 0.002 ^a	0.241 ± 0.009 ^a
TC/HDL-C ratio	3.139	0.279 ± 0.004	0.235 ± 0.003 ^a	0.183 ± 0.002 ^a,c
Hospital admission times	7.037	3.766 ± 0.12	3.199 ± 0.057 ^a	3.004 ± 0.246 ^a,c
*Categorical variables*
Smoking	7.45%	0.335 ± 0.006	0.277 ± 0.015 ^a	0.267 ± 0.012 ^a,c
Hypertensive drugs	8.10%	0.368 ± 0.014	0.305 ± 0.004 ^a	0.276 ± 0.005 ^a,c
Lipid Lowering drugs	9.99%	0.441 ± 0.015	0.319 ± 0.006 ^a	0.304 ± 0.009 ^a,c
Overweight	37.89%	1.135 ± 0.018	1.029 ± 0.019 ^a	0.850 ± 0.020 ^a,c
Sex	41.21%	1.149 ± 0.02	1.050 ± 0.013 ^a	1.007 ± 0.055 ^a

Notes
SBP: systolic blood pressure; TC: total cholesterol; HDL-C: high-density lipoprotein cholesterol; NRMSE: normalized root mean squared error; PFC: proportion of falsely classified entries
Since NRMSE and PFC both followed a normal distribution (Shapiro-Wilk normality test, p > 0.05), the imputation errors of the different methods were compared using one-way ANOVA. If p < 0.05, pairs of methods were then compared using independent-sample t-tests.
^aThe mean imputation error is significantly lower than that of MICE (p < 0.05)
^bThe mean imputation error is significantly lower than that of GAIN (p < 0.05)
^cThe mean imputation error is significantly lower than that of missForest (p < 0.05)
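
The comparison procedure described in the notes (normality check, one-way ANOVA, then pairwise independent-sample t-tests) can be sketched roughly as follows. This is a minimal illustration, not the authors' code. The per-run errors are simulated around the TC/HDL-C ratio row at 20% missingness, and the number of runs (`n_runs = 10`), the random seed, the two-sided form of the t-test, and all variable names are assumptions made for the example.

```python
from itertools import combinations

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_runs = 10   # hypothetical number of repeated imputation runs
alpha = 0.05  # significance threshold used in the table notes

# Simulated (not real) per-run errors, centred on the reported means and spreads.
errors = {
    "MICE": rng.normal(0.175, 0.003, n_runs),
    "missForest": rng.normal(0.137, 0.001, n_runs),
    "GAIN": rng.normal(0.115, 0.001, n_runs),
}

# Step 1: Shapiro-Wilk normality test for each method's errors.
normality_p = {}
for method, values in errors.items():
    _, p_value = stats.shapiro(values)
    normality_p[method] = p_value

# Step 2: one-way ANOVA across the three methods.
_, anova_p = stats.f_oneway(*errors.values())

# Step 3: pairwise independent-sample t-tests, run only if the ANOVA is significant.
# Each entry is (method with lower mean error, method with higher mean error, p-value).
significant = []
if anova_p < alpha:
    for a, b in combinations(errors, 2):
        _, p_value = stats.ttest_ind(errors[a], errors[b])
        if p_value < alpha:
            if errors[a].mean() < errors[b].mean():
                significant.append((a, b, p_value))
            else:
                significant.append((b, a, p_value))

for lower, higher, p_value in significant:
    print(f"{lower} has a significantly lower mean error than {higher} (p = {p_value:.3g})")
```

Each tuple in `significant` corresponds to one superscript marker in the table: a pair with `MICE` as the higher-error method maps to marker a, and a pair with `missForest` as the higher-error method maps to marker c.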


ISSN: 1471-2288
