Accommodating heterogeneous missing data patterns for prostate cancer risk prediction

Table 1 Methods for fitting individual predictor-specific risk models for members of a test set by combining data from multiple cohorts. All individuals in the training and test cohorts have 2 predictors, PSA and age, plus any subset (possibly empty) of 10 additional predictors, for a total of 12 predictors, denoted by \(\mathrm{X}\). The set of predictors available for the new individual is denoted by \({\mathrm{X}}^{*}\). All models use logistic regression to predict clinically significant prostate cancer. MICE = Multiple imputation by chained equations; BIC = Bayesian Information Criterion, defined as \(-2 \times\) (maximized log likelihood) + (number of covariates) \(\times\) log(sample size).

Method	Definition
Available cases	Pool individual-level data that have \({\mathrm{X}}^{*}\) measured across all cohorts and fit a model including \({\mathrm{X}}^{*}\) as main effects
Iterative BIC selection	Same as available cases, but with an iterative stepwise BIC-based model selection to determine the optimal subset of \({\mathrm{X}}^{*}\) and interactions
Cohort ensemble	A separate model is fit to each cohort, using the predictors that the cohort and the new individual have in common
Categorization	All individuals in all cohorts are used. Predictors are categorized with missing as one of the categories so that the complete list of predictors \(\mathrm{X}\) is used
Missing indicator	For each continuous predictor, include an indicator that its value is missing, together with the interaction between that indicator and the predictor, as additional variables in the analysis. Largely similar to Categorization
Imputation	Impute missing covariates in the training set using the MICE method. At prediction time, missing values for the new individual are filled in by mean imputation
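
The data-selection steps behind the first three rows of Table 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy rows, the predictor names `dre` and `prior_biopsy`, and the helper names are hypothetical, and a cohort is assumed to "have" a predictor if it is measured on at least one of its members. Model fitting itself is omitted; only the choice of rows and predictors, and the BIC formula from the caption, are shown.

```python
import math

# Hypothetical pooled training rows; None marks a predictor that was not measured.
# PSA and age are always present, as stated in the caption of Table 1.
pooled = [
    {"cohort": "A", "psa": 4.1, "age": 63, "dre": 1, "prior_biopsy": 0},
    {"cohort": "A", "psa": 6.8, "age": 70, "dre": 0, "prior_biopsy": None},
    {"cohort": "B", "psa": 3.2, "age": 58, "dre": None, "prior_biopsy": 1},
    {"cohort": "B", "psa": 9.5, "age": 66, "dre": None, "prior_biopsy": 0},
]


def available_cases(rows, x_star):
    """Available cases: keep rows from any cohort with every predictor in x_star measured."""
    return [r for r in rows if all(r.get(p) is not None for p in x_star)]


def coinciding_predictors(rows, x_star):
    """Cohort ensemble: per cohort, the predictors in x_star that the cohort also has.

    Assumption: a cohort has a predictor if at least one member has it measured.
    """
    measured = {}
    for r in rows:
        seen = measured.setdefault(r["cohort"], set())
        seen.update(p for p in x_star if r.get(p) is not None)
    return {c: [p for p in x_star if p in seen] for c, seen in measured.items()}


def bic(max_loglik, n_covariates, n):
    """BIC as defined in the caption: -2 * loglik + n_covariates * log(n)."""
    return -2.0 * max_loglik + n_covariates * math.log(n)


# Predictors available for a hypothetical new individual (X*).
x_star = ["psa", "age", "dre"]
training_rows = available_cases(pooled, x_star)
per_cohort = coinciding_predictors(pooled, x_star)
```

In this toy example only the two cohort A rows qualify as available cases, because cohort B never records `dre`. Under the cohort ensemble, cohort B would still contribute a model, built on `psa` and `age` alone.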

ISSN: 1471-2288