Table 2 Description of data pre-processing steps

From: Machine learning in medicine: a practical introduction to techniques for data pre-processing, hyperparameter tuning, and model comparison

| Step | Description |
| --- | --- |
| step_knnimputation | Impute missing values using the k-nearest neighbor algorithm |
| step_BoxCox | Transform numeric data using a simple Box-Cox transformation |
| step_other | Pool less frequent categories of categorical variables into an "other" category |
| step_zv | Remove variables that have a single value |
| step_nzv | Remove variables whose ratio of the most frequent to the second most frequent value exceeds 95/5 and whose number of unique values is below 10% of the total number of samples |
| step_normalize | Normalize numeric variables to have zero mean and unit variance (standard deviation = 1) |
| step_dummy | Convert each level of categorical variables into a numeric binary term |
| step_corr | Remove variables that are highly correlated with other variables (absolute correlation ≥ 0.9) |

Note. A thorough introduction to each step can be found in the documentation of the “recipes” package [20].
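To make the numeric thresholds in the table concrete, the following is a minimal plain-Python sketch of three of the rules: the near-zero-variance filter (step_nzv), normalization (step_normalize), and the correlation filter (step_corr). It is not the "recipes" implementation. The function names, the example data, the use of the sample standard deviation (n - 1), and the simple greedy order in the correlation filter are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def is_near_zero_variance(values, freq_cut=95 / 5, unique_cut=10.0):
    """Hypothetical helper: apply the step_nzv rule as worded in the table."""
    counts = Counter(values).most_common(2)
    if len(counts) < 2:
        # Only one distinct value: this is the zero-variance case (step_zv).
        return True
    # Ratio of the most frequent to the second most frequent value.
    freq_ratio = counts[0][1] / counts[1][1]
    # Unique values as a percentage of the number of samples.
    percent_unique = 100.0 * len(set(values)) / len(values)
    return freq_ratio > freq_cut and percent_unique < unique_cut


def normalize(values):
    """Center to mean 0 and scale to SD 1 (sample SD, n - 1: an assumption).

    Constant columns give SD 0 and must be removed beforehand.
    """
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def pearson(x, y):
    """Pearson correlation computed from the normalized values."""
    zx, zy = normalize(x), normalize(y)
    return sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)


def drop_correlated(columns, threshold=0.9):
    """Simplified greedy filter: keep a column only if its absolute
    correlation with every column kept so far is below the threshold."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, other)) < threshold for other in kept.values()):
            kept[name] = values
    return list(kept)


# Made-up example data, for illustration only.
columns = {
    "age": [30.0, 40.0, 50.0, 60.0, 70.0],
    "age_months": [360.0, 480.0, 600.0, 720.0, 840.0],
    "bmi": [22.0, 31.0, 19.0, 27.0, 24.0],
}
rare_flag = [0] * 99 + [1]  # 99:1 split, 2 unique values in 100 samples

nzv_rare = is_near_zero_variance(rare_flag)
nzv_age = is_near_zero_variance(columns["age"])
z_age = normalize(columns["age"])
kept_columns = drop_correlated(columns)
```

In this sketch, `rare_flag` is flagged because its frequency ratio (99/1) exceeds 95/5 and only 2% of its values are unique, while `age_months` is dropped because it is a rescaled copy of `age`. Which member of a correlated pair is removed depends on the selection rule, so a different rule may retain a different variable.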