
Handling missing data and measurement error for early-onset myopia risk prediction models

Abstract

Background

Early identification of children at high risk of developing myopia is essential to prevent myopia progression by introducing timely interventions. However, missing data and measurement error (ME) are common challenges in risk prediction modelling that can introduce bias in myopia prediction.

Methods

We explore four imputation methods to address missing data and ME: single imputation (SI), multiple imputation under missing at random (MI-MAR), multiple imputation with calibration procedure (MI-ME), and multiple imputation under missing not at random (MI-MNAR). We compare four machine-learning models (Decision Tree, Naive Bayes, Random Forest, and Xgboost) and three statistical models (logistic regression, stepwise logistic regression, and least absolute shrinkage and selection operator logistic regression) in myopia risk prediction. We apply these models to the Shanghai Jinshan Myopia Cohort Study and also conduct a simulation study to investigate the impact of missing mechanisms, the degree of ME, and the importance of predictors on model performance. Model performance is evaluated using the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC).

Results

Our findings indicate that in scenarios with missing data and ME, using MI-ME in combination with logistic regression yields the best prediction results. In scenarios without ME, employing MI-MAR to handle missing data outperforms SI regardless of the missing mechanism. When ME has a greater impact on prediction than missing data, the relative advantage of MI-MAR diminishes, and MI-ME becomes the better choice. Furthermore, our results demonstrate that statistical models exhibit better prediction performance than machine-learning models.

Conclusion

MI-ME emerges as a reliable method for handling missing data and ME in important predictors for early-onset myopia risk prediction.


Introduction

Uncorrected refractive error, particularly myopia, is a significant cause of vision impairment globally [1]. The high prevalence of myopia has become an important public health issue in Asia, where it affects around 80-90% of young adults, of whom 10-20% have high myopia (\(\le\) -6.00 diopters [D]) [2]. Early-onset myopia in children increases the likelihood of developing high myopia later in life [3,4,5,6], which is associated with a higher risk of maculopathy, glaucoma, and retinal detachment [7, 8]. It is widely recognized that myopia developed during childhood progresses rapidly and irreversibly. Therefore, it is essential to identify children at high risk of developing early-onset myopia to introduce timely interventions at a young age [5, 9].

Previous attempts to predict early-onset myopia in school-aged children have utilized statistical models, such as discrete-time survival analysis [5], Cox proportional hazard models [9], ordinary logistic regression [10], multilevel ordinal logistic regression [11], and generalized estimating equations [12]. More recent studies have used machine learning (ML) algorithms to predict the onset of high myopia, demonstrating good predictive accuracy [13, 14]. Most of these studies have focused on predicting high myopia development among high school students, rather than early-onset myopia among younger school children aged 6 to 8 years. This specific age group poses unique challenges in capturing the impact of risk factors on myopia onset during the early stages of myopia development [10]. Additionally, while some studies have reported that ML methods outperform statistical models in prediction and classification tasks [15, 16], others have reached the opposite conclusion [17, 18]. It is therefore of interest to compare the performance of ML versus statistical models in predicting early-onset myopia.

In many myopia studies, lifestyle and habits data for risk prediction have been collected using questionnaires [5, 10, 19, 20]. While these questionnaires are typically implemented following a designed and validated guideline [20], they often encounter challenges with missing data, such as unit non-response and item non-response, especially when data are collected through interviews [21]. Popular approaches such as complete-case analysis, which exclude incomplete records from the analysis, result in the loss of valuable information and fail to adequately address the mechanisms behind missing data [22, 23]. Researchers have explored various imputation techniques, including mean imputation [22, 24], median imputation [25], mode imputation [23], multiple imputation (MI) [22, 23, 26,27,28,29,30], and single imputation by deep learning algorithms [31], to handle missing data in risk prediction studies. However, these studies predominantly concentrate on the mechanisms of missing completely at random (MCAR) or missing at random (MAR) [32], neglecting the frequently encountered missing not at random (MNAR) scenarios in real-world analysis. Since methods like MI allow unbiased estimation only under the assumption of MAR, it is crucial for researchers to carefully consider the plausibility of these underlying assumptions about the missing data mechanism, to avoid the improper use of methods for handling missing data in risk prediction studies. Furthermore, recent attention has been given to using ML to establish risk prediction models [13, 33, 34]. However, most of these studies overlook missing data or handle it inadequately [35].

Spherical equivalent (SE) has been widely used as a myopia outcome measure (SE \(\le\) -0.5 or -0.75 dioptres [D]) for the identification of putative risk factors for myopia or high myopia (SE \(\le\) -6.0 D) among children [36, 37]. Non-cycloplegic autorefraction of SE is free of the side effects of cycloplegia [38] and less time-consuming, making it popular in campus vision screening, especially in large population-based studies [39,40,41]. However, non-cycloplegic autorefraction of SE may introduce measurement error (ME) due to children’s accommodative responses during vision examination [42]. Ignoring ME in predictors weakens the predictive power of models [43, 44]. In risk prediction studies using statistical models, various approaches, such as regression calibration [45, 46], semiparametric maximum likelihood [47], Bayesian approaches [48, 49], and MI [50], are commonly used to handle ME in predictors. In risk prediction studies using ML models, the focus tends to be on the ME of the outcome (label noise) rather than the predictors [51]. Methods such as data filtering, which removes predictors with ME from the dataset, and data polishing, which predicts values of erroneous predictors based on models built from the original data, have been discussed for handling ME in predictors [52, 53]. These methods are similar to complete-case analysis and regression imputation in missing data research. In fact, ME can be considered a missing data problem, allowing statistical methods like MI to address both missing data and ME issues [50, 54]. However, there is a lack of literature on the use of MI methods to simultaneously address missing data and ME issues in myopia risk prediction.

The objective of this study is to propose an effective solution for predicting early-onset myopia risk that addresses the issues of missing data and ME in baseline predictors, as well as the selection of prediction models (ML vs. statistical methods). To address missing data and ME in predictors, we explored four imputation techniques: MI under missing at random (MI-MAR), MI with calibration procedure (MI-ME), MI under missing not at random (MI-MNAR) for sensitivity analysis of MI-MAR, and single imputation (SI) as commonly used in practice. To mitigate the risk of misusing prediction models, we investigated seven prediction models: four ML algorithms (Decision Tree, Naive Bayes, Random Forest, and Xgboost) and three statistical methods (logistic regression, stepwise logistic regression, and least absolute shrinkage and selection operator logistic regression). Using a case study from the Shanghai Jinshan Myopia Cohort Study (SJMC), we compared the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC) of prediction models after addressing missing data and ME through imputation methods. Additionally, we conducted a simulation study to further investigate the impact of missing mechanisms, the degree of ME, and the importance of predictors on the performances of models.

The paper is organized as follows. The ‘Methods’ section begins with the SJMC motivating example and provides details about the four imputation methods and the seven prediction models. The ‘Simulation study’ section details the data generation, scenario settings, and data analysis for our simulation experiments. In the ‘Results’ section, the performances of models from the simulation study and case study are presented. Finally, the ‘Discussion and conclusion’ section discusses the implications of our findings and concludes the study.

Methods

Shanghai Jinshan Myopia Cohort Study (SJMC) data

Participants were recruited from the Shanghai Jinshan Myopia Cohort Study (SJMC), a longitudinal observational study conducted from September 1, 2013 to October 30, 2018. The SJMC, which aimed to investigate the progression and risk factors of myopia, enrolled 3263 students in grades 1 to 5 at baseline in Jinshan primary schools and followed them for 5 years. We defined the myopia outcome as SE \(\le\) -0.50 D within two years after baseline [36, 37]. SE was obtained through non-cycloplegic autorefraction. The annual visit was completed between September 1 and October 30. In this study, we excluded students: 1) whose age was outside the 6-8 years range at baseline, 2) whose SE was \(\le\) -0.50 D at baseline, and 3) who had incomplete three-year follow-ups, resulting in 1114 students in the final analysis dataset. The study was approved by the Ethics Committee of Jinshan Hospital of Fudan University, Shanghai. All study procedures adhered to the tenets of the Declaration of Helsinki. Written informed consent was obtained from the guardians of all children. Data used throughout our study were anonymized before they were made available to the researchers.

Ocular biometrics, lifestyle, and habits data in the SJMC were collected annually following the same procedures for all participants (see collection procedures in Appendix A). We considered 19 candidate risk factors in the initial prediction models, including (i) demographics: age at baseline and gender (female/male); (ii) birth parameters, height, and weight: birth weight (\(\le\)2500 grams, 2500-4000 grams or \(\ge\) 4000 grams), premature (no/yes) and baseline body mass index (BMI, kg/m\(^2\)); (iii) baseline ocular biometrics: axial length (AL), SE, and mean anterior keratometry reading (Km); (iv) the number of parents with self-reported myopia (0, 1, or 2 myopic parents); (v) daily activities at baseline: time spent outdoors (hours/day; separately for weekdays and weekends), time for homework (hours/day; separately for weekdays and weekends), time for after-class tutoring, sports, and arts (hours/week); (vi) visual habits at baseline: continuous near work for at least 30 minutes, excessive bending, and wrong sitting posture (each coded as never, occasionally, or usually). The outcome of the prediction models was defined as SE \(\le\) -0.50 D (SE outcome event) two years after the baseline visit. The descriptive statistics of the 19 baseline predictors are shown in Appendix A Table A1.

Appendix A Table A1 presents the details of the missing data proportions for each variable. Three variables (BMI, Premature, and Birth weight) had a low proportion of missing data, less than 1%. However, the other four variables (Continuous near work, Excessive bending, Wrong sitting posture, and No. of Myopic parents) had a high proportion of missing data, ranging from 31.6% to 52.1%. We employed simple methods to address the missing data issue for variables with very low missing proportions. For continuous variables, we imputed missing values by using the average of the observed values [55]. For categorical variables, missing values were imputed using the mode of the observed values [55]. Considering the impact of the high proportion of missing data and measurement error on subsequent prediction, we focused on addressing both the missing data problem in the latter four predictors and the measurement error problem in the baseline SE through imputation techniques.

Imputation methods for addressing missing data

We denote the dataset of sample size N as D, specifically defined as \(D=\left\{ X_i^{mis}, X_i^{err}, X_i^{obs}, y_i\right\} _{i=1}^N, i=1,\cdots , N\), where \(X_i^{mis}\), \(X_i^{err}\), and \(X_i^{obs}\) represent predictors with missing data, predictors with measurement error, and predictors with complete and accurate measurement, respectively. We assume that there is no overlap among these three types of predictors; that is, a predictor cannot have both missing values and measurement error. \(y_i\) is a binary outcome variable that indicates the onset of myopia within two years after baseline, with \(y_i=1\) representing myopia onset. In our analysis, \(f_\theta (\cdot )\) and \(g_{\beta }(\cdot )\) represent the imputation model and the prediction model, respectively.

Although using mean imputation (applicable under MCAR [55]) to handle missing predictor data might worsen the subsequent prediction [56, 57], we still include it for comparison since it is simple to implement and commonly used. The mean imputation method is a single imputation method, denoted as SI for simplicity. For continuous predictors, SI uses the average of the observed values of the variable to impute the missing values; for categorical predictors, SI uses the mode of the observed part of the variable to replace missing values [23] (see Algorithm 1). After processing by SI, we obtain a complete imputed dataset to train the subsequent prediction model.


Algorithm 1 SI
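As a concrete sketch of Algorithm 1, the following is a minimal single-imputation routine assuming records stored as a list of dictionaries with `None` marking missing values; the field names (`AL`, `near_work`) and the toy values are made up for illustration:

```python
from statistics import mean
from collections import Counter

def single_impute(rows, continuous, categorical):
    # Algorithm 1 (SI): mean-impute continuous fields, mode-impute categorical ones.
    fill = {}
    for col in continuous:
        fill[col] = mean(r[col] for r in rows if r[col] is not None)
    for col in categorical:
        observed = [r[col] for r in rows if r[col] is not None]
        fill[col] = Counter(observed).most_common(1)[0][0]
    # Return new records with every missing entry replaced by its fill value.
    return [{k: (fill[k] if v is None and k in fill else v) for k, v in r.items()}
            for r in rows]

# Toy records with hypothetical fields: AL (axial length) and a near-work habit.
rows = [
    {"AL": 22.5, "near_work": "never"},
    {"AL": None, "near_work": "usually"},
    {"AL": 23.1, "near_work": None},
    {"AL": 22.8, "near_work": "never"},
]
imputed = single_impute(rows, continuous=["AL"], categorical=["near_work"])
```

The result is a single completed dataset, ready for the downstream prediction model.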

In epidemiological analyses, MI techniques are popular methods for handling missing data in predictors [22, 23, 26,27,28,29,30]. Ordinary multiple imputation by chained equations (MICE) is reliable under the MAR mechanism [55, 58], and we refer to it as MI-MAR in the following. MI-MAR starts with an incomplete raw dataset and outputs several complete imputed datasets by replacing the missing values with estimated values drawn from a distribution specifically modeled for each missing entry [55] (see Algorithm 2). The subsequent risk prediction task is implemented on each imputed dataset, and the several results are pooled into one final result by averaging based on Rubin’s rules [55, 59].


Algorithm 2 MI-MAR
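The impute-then-pool workflow of Algorithm 2 can be sketched as follows. This is an illustrative skeleton, not a full MICE implementation: for brevity, each missing value is drawn from a normal distribution fitted to the observed part of a single variable, whereas real MICE cycles through conditional models for every incomplete variable. The pooling step, however, follows Rubin's rules as described above; the analysis "model" here is simply estimating a mean:

```python
import random
from statistics import mean, variance

def impute_once(values, rng):
    # One stochastic imputation: draw each missing value from a normal
    # distribution fitted to the observed part (a stand-in for the
    # conditional models MICE cycles through).
    obs = [v for v in values if v is not None]
    mu, sd = mean(obs), variance(obs) ** 0.5
    return [rng.gauss(mu, sd) if v is None else v for v in values]

def rubins_rule(estimates, variances):
    # Pool per-imputation point estimates and variances (Rubin's rules).
    m = len(estimates)
    q_bar = mean(estimates)              # pooled point estimate
    w = mean(variances)                  # within-imputation variance
    b = variance(estimates)              # between-imputation variance
    return q_bar, w + (1 + 1 / m) * b    # pooled estimate, total variance

rng = random.Random(0)
values = [1.0, 2.0, None, 3.0, None, 2.5, 1.5]
ests, vars_ = [], []
for _ in range(5):                       # m = 5 imputations
    completed = impute_once(values, rng)
    ests.append(mean(completed))         # analysis model: estimate the mean
    vars_.append(variance(completed) / len(completed))
pooled, total_var = rubins_rule(ests, vars_)
```

In the paper's setting, the per-imputation analysis would instead be a trained prediction model, with its performance metrics averaged across the m imputed datasets.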

Missing mechanisms can be complicated in practice, because MAR and MNAR might co-exist in an analysis [35]. Therefore, we include an MI method that can handle a special MNAR case [55] in which the probability of being missing depends on the values of the variable itself, and refer to this method as MI-MNAR. MI-MNAR serves as a sensitivity analysis of how much the MAR assumption is violated when the original MICE is applied [60, 61]. We assume that \(P(X_i^{mis}|R_i=1) \ne P(X_i^{mis}|R_i=0)\), where \(R_i=1\) indicates unobserved records and \(R_i=0\) indicates observed records. We introduce a constant \(\delta\) to describe the difference between \(P(X_i^{mis}|R=1)\) and \(P(X_i^{mis}|R=0)\), and then adjust the parameters of the imputation model by the constant \(\delta\) to restore the true distribution of \(X_i^{mis}\) (see Algorithm 3). The impact of MNAR on the imputation model increases with increasing \(\delta\); when \(\delta =0\), MI-MNAR is equivalent to MI-MAR. However, adjusting the imputation of one specific variable based on \(\delta\) might affect the imputation of other variables with missing data, since this variable is very likely used to impute the others. The degree of impact depends on the correlation of this covariate with the other variables. Therefore, \(\delta\) should be selected carefully based on background knowledge and the relationships between variables. As with MI-MAR, each of the imputed datasets generated by MI-MNAR is used for prediction modeling, and the results are pooled into one final result.


Algorithm 3 MI-MNAR
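The δ-adjustment idea behind Algorithm 3 can be sketched for a continuous variable: impute under a MAR-style model, then shift each imputed value by δ, the assumed difference in expectation between non-responders (R=1) and responders (R=0); δ = 0 reduces to MI-MAR. The toy data below are made up:

```python
import random
from statistics import mean, variance

def delta_adjusted_impute(values, delta, rng):
    # Impute under a MAR-style model, then shift each imputed value by delta,
    # the assumed difference in expectation between non-responders (R=1)
    # and responders (R=0). delta = 0 reduces to MI-MAR.
    obs = [v for v in values if v is not None]
    mu, sd = mean(obs), variance(obs) ** 0.5
    return [rng.gauss(mu + delta, sd) if v is None else v for v in values]

values = [1.0, 2.0, None, 3.0, None, 2.5, 1.5]
# Same seed for both runs so the only difference is the delta shift.
mar  = delta_adjusted_impute(values, delta=0.0, rng=random.Random(1))
mnar = delta_adjusted_impute(values, delta=0.5, rng=random.Random(1))
```

Running this across a grid of δ values, as the paper does, shows how sensitive the downstream prediction is to departures from MAR.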

Multiple imputation for addressing measurement error

In this paper, we assume that the ME of a continuous predictor, i.e., non-cycloplegic SE, is systematic. The observed variable is a biased representation of the true variable, and the average of repeated observed measurements is no longer a good proxy for the actual variable value [46, 50]. The error is independent of the outcome conditional on the values of the true variable, which is called the non-differential assumption. Under non-differential systematic ME, the popular correction method is regression calibration [46]. To some degree, it can be regarded as a ‘conditional mean imputation’ approach [62], since a calibrated value is imputed as the estimated conditional mean given other observed variables, auxiliary variables, and the error-prone measurements. ME in predictors can be viewed as data being partially missing, or vice versa [50]; therefore, MI can also be used to correct ME [62,63,64,65]. Blackwell et al. proposed an MI approach to accommodate both ME and missing values, but it is restricted by the requirement of a multivariate normal distribution [50] and therefore does not apply to a range of prediction models such as logistic regression and time-to-event models. Bartlett et al. addressed this problem by proposing the substantive model compatible method [66], but it was designed for the missing-data context. We adopted the extended version [54] for ME and missing data and denote this method as MI-ME for simplicity. Algorithm 4 gives the MI-ME procedure for handling ME. Besides inputs similar to those of MI-MAR, MI-ME requires the form of the subsequent prediction model, the variable with ME, and the imputation model in order to impute missing data and correct ME. The prediction model is trained on each imputed dataset before pooling the results. More details in the


Algorithm 4 MI-ME
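The regression-calibration idea mentioned earlier (a 'conditional mean imputation') can be illustrated with a toy example; this is not the full substantive-model-compatible MI-ME procedure, and the validation data and error model below are invented. A hypothetical validation subset in which both the true SE and the error-prone SE* are available is used to estimate E[SE | SE*], which then replaces the error-prone value:

```python
from statistics import mean

def fit_simple_ols(x, y):
    # Least-squares intercept and slope for y ~ x.
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

# Hypothetical validation subset where both the true SE and the error-prone
# SE* are known; the systematic error model (scale 0.8, shift 0.1 D) is invented.
se_true = [-0.2, 0.1, 0.5, -0.6, 0.3, -0.1]
se_err  = [t * 0.8 + 0.1 for t in se_true]

a, b = fit_simple_ols(se_err, se_true)      # estimate E[SE | SE*]
calibrated = [a + b * x for x in se_err]    # replace SE* by its calibrated value
```

MI-ME goes further than this sketch by drawing multiple imputations compatible with the prediction model rather than plugging in a single conditional mean.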

Prediction models: ML and statistical methods

Four machine-learning methods were explored in this paper: CART Decision Tree (tree) [33, 67], Naive Bayes (nb) [68], Random Forest (rf) [69], and Xgboost (xgb) [70]. The models were constructed to predict the probability of developing early-onset myopia 2 years after baseline. Considering the small sample size, we did not include deep learning algorithms in the analysis. We chose models that can handle the classification task, and the four models cover different types of algorithms to avoid the misuse of a single ML algorithm [71]. In addition, Volovici et al. recommended using traditional statistical models as sensitivity analyses [71]. Therefore, to evaluate the performance of the ML methods, we also considered statistical methods: logistic regression [72], stepwise logistic regression [73], and least absolute shrinkage and selection operator (Lasso) logistic regression [74]. Data were randomly split into a training dataset (80%) and a testing dataset (20%). Ten-fold cross-validation on the training dataset was used to search for optimal parameters of the ensemble learning algorithms (Random Forest and Xgboost), such as the number of trees in a random forest model. AUROC and AUPRC were used to assess model performance.
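Both evaluation metrics can be computed without ML libraries. The sketch below implements AUROC via the rank-sum (Mann-Whitney) identity and AUPRC as average precision, on made-up labels and predicted scores:

```python
def auroc(labels, scores):
    # Mann-Whitney form: fraction of (positive, negative) pairs ranked
    # correctly by the score, with ties counted as half.
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auprc(labels, scores):
    # Average precision: mean of precision at the rank of each positive.
    order = sorted(range(len(labels)), key=lambda i: -scores[i])
    tp, ap = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            tp += 1
            ap += tp / rank
    return ap / tp

# Made-up test-set labels and model scores for illustration.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.6]
```

Unlike AUROC, average precision depends on the class prevalence, which is why AUPRC is the more informative metric under the class imbalance discussed later.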

In this paper, the strategy we used to combine internal validation with MI follows previous works [22, 27, 28]: first obtain m imputed complete datasets, then train a prediction model on each dataset, and finally pool the results. Furthermore, we included the outcome variable in the imputation model for better imputation, as prior studies did [26, 75].

Simulation study

Data generation

The simulation study was based on the SJMC data, which included 1114 children at baseline. Data including nineteen predictors and one outcome were generated as specified below in Fig. 1 based on the existing literature [9, 10, 12, 20, 28, 36, 37, 76,77,78,79,80,81,82,83]. Time-independent and time-dependent variables are presented in Fig. 1. This process was repeated to generate 100 complete datasets with a sample size of 1500. The details of the simulation procedure are provided in Appendix B.

Fig. 1

Diagram for the association between myopia onset and predictors in baseline based on the literature. “myopic parents” denotes No. of Myopic parents; “ocular biometry at baseline” indicates three variables at baseline: AL, SE, Km; “visual habits” consist of time for homework and outdoor activities, continuous near work, excessive bending, and wrong sitting posture; orange blocks denote time-independent variables: myopic parents, prematurity, birth weight, and gender; yellow blocks denote time-dependent variables

Four time-independent variables are generated first. Prematurity and gender are generated from Bernoulli distributions with probabilities \(p(\text {premature}=1 )=0.05\) and \(p(\text {female}=1)=0.5\), respectively. The categorical variables myopic parents (0, 1, and 2) and birth weight class (0, 1, and 2) are generated from multinomial distributions with probabilities \(\lbrace 0.1, 0.5, 0.4\rbrace\) and \(\lbrace 0.93, 0.04, 0.03\rbrace\), respectively. The generation of birth weight class is related to prematurity and gender based on prior works [76, 77]. After simulating the time-independent variables, the time-dependent variables at baseline are simulated in a similar way to mimic the SJMC data. Specifically, SE at baseline is generated by a linear regression model with time-independent variables, BMI, and visual habits at baseline (see Appendix B). SE two years after baseline (outcome_SE) is generated from the time-independent variables, BMI, ocular biometry (i.e., SE, AL, and Km) at baseline, and visual habits (mean time for outdoor activities and homework across the five weekdays) at baseline, where myopic parents and the three ocular biometry variables, being more important predictors, have larger parameters than the visual habits variables. Finally, the outcome variable (outcome_myopia) is derived from the classification criterion of \(SE \le -0.5\) D [36, 37] as follows:

$$\begin{aligned} outcome\_myopia_i = \left\{ \begin{array}{ll} 1,& \quad \text {if} \quad outcome\_SE_i \le -0.5 \\ 0, & \quad \text {if} \quad outcome\_SE_i > -0.5 \\ \end{array}\right. \end{aligned}$$
(1)

where \(i=1,\cdots ,N,\) and \(N=1500\).

In general, all parameter values in the simulation were chosen such that the simulated data were representative of the SJMC data, with the parameter values presented in Appendix B. The complete data consist of 19 predictors and one outcome. All subsequent simulation experiments are implemented using these original data.
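As an illustration of the first step of this data-generating process, the marginal draws of the four time-independent variables (with the probabilities given above) might look as follows; note that the actual simulation additionally links birth weight class to prematurity and gender, which this sketch omits, and the seed is arbitrary:

```python
import random

rng = random.Random(2024)
N = 1500  # sample size of each simulated dataset

def bernoulli(p):
    return 1 if rng.random() < p else 0

def categorical(probs):
    # One draw from a multinomial with the given category probabilities.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Marginal draws using the probabilities stated in the text.
premature      = [bernoulli(0.05) for _ in range(N)]
female         = [bernoulli(0.5)  for _ in range(N)]
myopic_parents = [categorical([0.1, 0.5, 0.4])    for _ in range(N)]
birth_weight   = [categorical([0.93, 0.04, 0.03]) for _ in range(N)]
```

The time-dependent variables would then be generated conditionally on these draws via the linear models described above.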

Simulation scenarios

We conducted nine simulation scenarios to investigate the impact of a) the degree of ME, b) missing mechanisms, and c) the importance of predictors on the performance of the imputation and prediction models. The missing data and ME settings of scenarios 1 to 9 are summarized in Table 1.

Table 1 Summary of missing data and ME settings in scenario 1 to 9

Specifically, scenario 1 explored the impact of the degree of ME in predictors on the final prediction. Three datasets with no error, small ME, and large ME were set in scenario 1. The available SE at baseline with a small ME, \(SE_i^*\), is generated as follows:

$$\begin{aligned} SE_i^* = SE_i + \delta _{se1,i}, \end{aligned}$$

where \(\delta _{se1,i} \sim N(0,\sigma _{SE^*}^2)\). The available SE at baseline with a large ME, \(SE_i^{**}\), is generated as follows:

$$\begin{aligned} SE_i^{**} = \delta _{se2,i} + \gamma _{0} + \gamma _{se} SE_i + \gamma _{al} AL_i + \gamma _{km} Km_i + \sum \limits _{a=0}^{a=2} \gamma _{se,a} [parents\_myopia_i=a] \end{aligned}$$

where \(\delta _{se2,i} \sim N(0,\sigma _{SE^{**}}^2)\); \(\gamma _{0}\) is the intercept and other \(\gamma\) with subscripts are parameters for variables related to the true SE.
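The two error structures can be sketched as follows. The distributions of the true biometry values and the γ coefficients are made up for illustration, and the myopic-parents term of the large-ME model is omitted for brevity:

```python
import random

rng = random.Random(7)
N = 1000

# Illustrative true baseline biometry (distributions assumed, not the paper's).
se = [rng.gauss(0.5, 0.8) for _ in range(N)]   # true spherical equivalent (D)
al = [rng.gauss(23.0, 0.7) for _ in range(N)]  # axial length (mm)
km = [rng.gauss(43.0, 1.3) for _ in range(N)]  # mean keratometry (D)

# Small ME: purely additive random noise around the true value.
se_small = [s + rng.gauss(0, 0.25) for s in se]

# Large (systematic) ME: depends on the true SE and related variables;
# the gamma coefficients below are invented.
g0, g_se, g_al, g_km = 5.0, 0.9, -0.15, -0.05
se_large = [g0 + g_se * s + g_al * a + g_km * k + rng.gauss(0, 0.25)
            for s, a, k in zip(se, al, km)]
```

Under the small-ME model the observed values are unbiased on average, whereas the systematic model shifts them away from the truth, which is what the calibration procedure must undo.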

Scenarios 2 to 5 have missing data but no ME. Scenarios 6 to 9 have missing mechanisms similar to those of scenarios 2 to 5, except that SE at baseline has a large ME, that is, \(SE_i^{**}\) from scenario 1. For simplicity, we describe the generation of missing data for scenarios 2 to 5.

Four important predictors at baseline (i.e., prematurity, myopic parents, AL, and SE) have missing data in scenarios 2 and 3. All four predictors share the same MAR missing mechanism in scenario 2:

$$\begin{aligned} \text {MAR: } logit\lbrace P(R_i=1) \rbrace = \sigma _{s2} + \sigma _{s2a} age_i \end{aligned}$$

where \(R_i=1\) indicates that the corresponding variables of individual i are missing; \(\sigma _{s2}\) is the intercept; \(\sigma _{s2}\) and \(\sigma _{s2a}\) are chosen to yield around 40% missing data in each variable.
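This MAR mechanism can be sketched as follows; the age distribution and the values of \(\sigma _{s2}\) and \(\sigma _{s2a}\) are assumptions chosen only to hit roughly 40% missingness, not the paper's actual parameters:

```python
import math
import random

rng = random.Random(42)
N = 2000
ages = [rng.uniform(6, 8) for _ in range(N)]   # baseline ages (years)
se = [rng.gauss(0.5, 0.8) for _ in range(N)]   # a predictor to be masked

def p_missing(age, sigma0, sigma_age):
    # MAR model from the text: logit P(R=1) = sigma0 + sigma_age * age.
    return 1 / (1 + math.exp(-(sigma0 + sigma_age * age)))

# Assumed coefficients tuned so that roughly 40% of values go missing:
# at the midpoint age 7, logit = -3.9 + 0.5 * 7 = -0.4, i.e. P(R=1) ≈ 0.40.
sigma0, sigma_age = -3.9, 0.5
R = [1 if rng.random() < p_missing(a, sigma0, sigma_age) else 0 for a in ages]
se_observed = [None if r == 1 else s for r, s in zip(R, se)]
```

Because the missingness probability depends only on the fully observed age, this mechanism is MAR; making it depend on the masked value itself (as in scenario 3) would turn it into self-masked MNAR.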

In contrast to scenario 2, SE at baseline in scenario 3 is self-masked MNAR, while the other three predictors (i.e., prematurity, myopic parents, and AL) remain MAR.

$$\begin{aligned} \text {MAR:}logit\lbrace P(R_{1,i}=1) \rbrace & = \sigma _{s31} +\sigma _{s3a} age_i \\ \text {MNAR:}logit\lbrace P(R_{2,i}=1) \rbrace & =\sigma _{s32} +\sigma _{s3s} SE_i \end{aligned}$$

where \(R_{1,i}\) indicates the missing mechanism of MAR; \(R_{2,i}\) indicates the missing mechanism of MNAR; the parameters \(\sigma\) with subscripts above are chosen to yield around 40% missing data in each of the four predictors.

Four predictors at baseline with slight impacts on prediction (i.e., continue using eyes, excessive bend, not straight posture, and myopic parents) have missing data in scenarios 4 and 5. All four predictors share the same MAR missing mechanism in scenario 4:

$$\begin{aligned} \text {MAR: } logit\lbrace P(R_i=1) \rbrace = \sigma _{s4a} age_i + \sigma _{s4} \end{aligned}$$

where \(\sigma _{s4a}\) and \(\sigma _{s4}\) are chosen to yield around 40% missing data in each of the four predictors.

In contrast to scenario 4, myopic parents in scenario 5 is self-masked MNAR, while the other three predictors (i.e., continue using eyes, excessive bend, and not straight posture) remain MAR.

$$\begin{aligned} \text {MAR: } logit\lbrace P(R_{1,i}=1) \rbrace & = \sigma _{s5a} age_i + \sigma _{s51} \\ \text {MNAR:} P(R_{2,i}=1) & = \left\{ \begin{array}{ll} 0.5,& \quad \text {if} \quad parent\_myopia_i =0 \\ 0.55, & \quad \text {if} \quad parent\_myopia_i =1 \\ 0.55, & \quad \text {if} \quad parent\_myopia_i =2 \end{array}\right. \end{aligned}$$

where the parameters \(\sigma _{s5a}\) and \(\sigma _{s51}\) are chosen to yield around 40% missing data in each of the three MAR predictors; the missing proportion of myopic parents is around 50%.

As mentioned above, scenarios 6 to 9 consider the ME problem in addition to missing data problems similar to those of scenarios 2 to 5. The available SE at baseline in scenarios 6 to 9 is \(SE_i^{**}\), not the true SE. We used \(\tilde{SE_i}\), the true SE plus a random error from a normal distribution, as an auxiliary variable, \(\tilde{SE_i} = SE_i + \delta _{se3},\delta _{se3} \sim N(0,\sigma _{aux}^2)\), to facilitate the calibration procedure.

Data analysis

Scenario 1 serves as a reference: its datasets are complete, but the predictor SE has varying degrees of ME. We train models directly on these datasets without using imputation techniques.

In scenarios 2 to 9, we choose a logistic regression model as the imputation model for the binary predictor (premature), polytomous logistic regression for the categorical predictors (continue_using_eyes, excessive_bend, not_straight_posture, and parents_myopia), and Bayesian linear regression for the continuous predictors (AL and SE). We conduct five imputations with five iterations when implementing the MI techniques and include the outcome in the imputation. Furthermore, we use \(\tilde{SE}\) as an auxiliary variable to calibrate the ME (\(\tilde{SE} = SE + \delta _{n}\)). For the SI method, we use mean imputation for the missing data of AL and SE, and mode imputation for the five categorical variables. Scenario 3 (MNAR) shares the same missing mechanisms as scenario 7 (MNAR), as do scenario 5 (MNAR) and scenario 9 (MNAR). We use \(\delta =\lbrace 0.001,0.005,0.01,0.1,0.5\rbrace\) for the sensitivity analysis in scenarios 2-8. When the predictors with missing data are categorical, the supplementary parameters \(\delta\) are expressed as odds ratios (the excess odds of presenting the modality of interest for non-responders compared with responders). When the predictors with missing data are continuous, \(\delta\) is the difference between the expected values in responders and non-responders. When \(\delta\) is set to zero, the underlying missing-mechanism assumption is actually MAR.

After model tuning, the number of trees to grow in Random Forest (rf) is 500, and the number of variables randomly sampled as candidates at each split is 6. The maximum number of iterations of Xgboost (xgb) is 500, and the number of threads used in training is 8. All other parameters of the prediction models are left at their default settings.

For the simulation study, we randomly generated 100 datasets according to the data generation setting and performed nine different scenarios specified in the Simulation scenarios section on each complete dataset. For the case study, we repeated the “impute first and then predict” process on the SJMC data 100 times. Finally, AUROCs and AUPRCs in each training across nine scenarios and the SJMC data are aggregated, and their mean and standard deviation are calculated.

Results

Results from simulation study

In scenario 1 with complete data (Table 2), the lasso logistic regression model performed best among all models, with the highest mean AUROC of 0.828; among the four ML models, XGBoost obtained the highest mean AUROC (0.822) and random forest the second highest (0.809). Overall, in the other eight scenarios (with missing data and ME), the prediction performances of the three statistical models are better than those of the four ML models, regardless of which imputation method is used beforehand (see Tables 3 and 4).

Table 2 Mean AUROC (sd) of prediction models in scenario 1 of simulation study
Table 3 Mean AUROC (sd) of prediction models after imputing missing data in scenarios 2-5
Table 4 Mean AUROC (sd) of prediction models after imputing missing data in scenarios 6-9

As expected, we found that when important predictors have ME, using MI-ME to handle missing data and ME can significantly improve the prediction models’ AUROC and AUPRC (see Table 4). We set up scenarios 8 (MAR) and 9 (MNAR) to mimic the SJMC data with similar missing data and ME. As Table 4 shows, MI-ME, i.e., MI with a calibration procedure to address ME, achieved the highest mean AUROC. Compared with the other imputation methods, the relative advantage of MI-ME is smaller in scenario 9 (MNAR) than in scenario 8 (MAR). The results of MI-MNAR with different sensitivity parameters confirmed the robustness of the performance of MI-ME (see Figs. 2 and 3).

Fig. 2

Mean AUROC of prediction models in sensitivity analysis of scenarios 2-5 in the simulation study. Scenario 2: important predictors with missing data under MAR; Scenario 3: important predictors with missing data under MNAR; Scenario 4: non-important predictors with missing data under MAR; Scenario 5: non-important predictors with missing data under MNAR. Abbreviation: rf, random forest; Nbeyes, naive bayes; logreg, logistic regression; lassolog, lasso logistic regression; steplog, stepwise logistic regression

Fig. 3

Mean AUROC of prediction models in sensitivity analysis of scenarios 6-9 in the simulation study. Scenario 6: important predictors with missing data under MAR and ME; Scenario 7: important predictors with missing data under MNAR and ME; Scenario 8: non-important predictors with missing data under MAR and ME; Scenario 9: non-important predictors with missing data under MNAR and ME. Abbreviations: rf, random forest; Nbeyes, naive Bayes; logreg, logistic regression; lassolog, lasso logistic regression; steplog, stepwise logistic regression

When predictors with missing data strongly impact the outcome, using MI-MAR to handle the missing data yields better predictions than SI, regardless of whether the missing mechanism is MAR or MNAR (see Tables 3 and 4). For example, the relative advantages of using MI-MAR to reduce bias from missing data, and thereby improve prediction, are larger in scenarios 2 and 3 (important predictors: MAR vs MNAR) than in scenarios 4 and 5 (non-important predictors: MAR vs MNAR). For the detailed results of scenarios 2-8, see Appendix C, Tables C6-C28.

When the impact of ME on prediction is larger than that of missing data, the advantage of MI-MAR over SI in reducing bias caused by missing data is weakened by the additional bias caused by ME (Tables 3 and 4). For example, in the sensitivity analysis for MNAR in scenario 3 (MNAR without ME) and scenario 7 (MNAR with ME), the mean AUROCs are relatively more sensitive to the sensitivity parameter \(\delta\) in scenario 3 than in scenario 7 (see Figs. 2 and 3).
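The logic of a \(\delta\)-based MNAR sensitivity analysis can be illustrated with a minimal Python sketch (this is a generic delta-adjustment illustration with hypothetical variables, not the paper's actual implementation): missing values are first imputed under MAR, then the imputed values are shifted by a sensitivity parameter \(\delta\) to probe departures from MAR.

```python
# Hypothetical delta-adjustment sketch: impute under MAR, then shift the
# imputed values by a sensitivity parameter delta to mimic MNAR.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(scale=0.5, size=n)  # correlated second predictor
X = np.column_stack([x1, x2])

miss = rng.random(n) < 0.3       # make ~30% of x2 missing
X_obs = X.copy()
X_obs[miss, 1] = np.nan

means = {}
for delta in (-0.5, 0.0, 0.5):   # sensitivity parameters
    X_imp = IterativeImputer(random_state=0).fit_transform(X_obs)
    X_imp[miss, 1] += delta      # shift only the imputed values
    means[delta] = X_imp[miss, 1].mean()
# delta = 0 recovers the MAR imputation; nonzero delta probes MNAR mechanisms.
```

Comparing downstream results across a grid of \(\delta\) values shows how sensitive the analysis is to the assumed missingness mechanism.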

The trend of the AUPRC values is similar to that of the AUROC values (see Appendix C). The absolute AUPRC values are always smaller than the AUROC values because of the class-imbalance problem in the dataset, and the standard deviation of the AUPRC is always higher than that of the AUROC, since the AUPRC is more sensitive to class imbalance.
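The relationship between the two metrics under class imbalance can be reproduced with a short Python sketch (an illustrative simulation with synthetic data, not the study's data): chance-level AUPRC equals the prevalence of the positive class, whereas chance-level AUROC is 0.5, so AUPRC sits on a lower baseline.

```python
# Illustrative comparison of AUROC and AUPRC on imbalanced synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Simulate a binary outcome with ~10% positives.
X, y = make_classification(n_samples=2000, n_features=8,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

p = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
auroc = roc_auc_score(y_te, p)
auprc = average_precision_score(y_te, p)  # area under the precision-recall curve
prevalence = y_te.mean()
# Chance-level AUPRC equals the positive-class prevalence (here ~0.1),
# while chance-level AUROC is 0.5, so AUPRC is typically the smaller number.
```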

Results from case study

Consistent with the simulation results in scenarios 8 and 9, using MI-ME improves the performance of the prediction models, except for Naive Bayes. The highest mean AUROC, 0.833, is obtained by logistic regression after using the MI-ME strategy to handle missing data and ME, compared with 0.672 for logistic regression after MI-MAR (see Table 5). Among the four ML models using the MI-ME strategy, random forest obtains the highest mean AUROC (0.755) and XGBoost the second highest (0.713) (see Table 5).

Table 5 Mean AUROC (sd) of prediction models under three imputation methods in case study

The standard deviation of the AUROCs after using MI-ME is also lower than with the other imputation methods. Baseline SE and the number of myopic parents are validated risk factors for myopia onset [36]; addressing their ME can reduce bias and improve the efficiency with which information is extracted from the data. In addition, a variable with ME is also used to impute a variable with missing data. MI-ME handles the two data-quality problems simultaneously, so that each task facilitates the other.

As shown in Table 5, the three logistic regression series models have AUROC values about 0.1 higher after MI-ME than after MI-MAR. However, the advantages of MI-ME are not obvious when the prediction models are the four ML algorithms. The form of the substantive model assumed in MI-ME should be consistent with the prediction model. In the case study, MI-ME uses logistic regression as the substantive model, which may be one reason why the advantages of MI-ME appear in the statistical models. When the subsequent prediction model differs substantially from the substantive model assumed in MI-ME, the good performance of MI-ME may be weakened. Therefore, the form of the subsequent prediction model should be taken into account when using MI-ME.

Using MI-MAR did not achieve significantly better AUROC than SI. This may be because ME has a greater impact on myopia prediction than the missing-data problem in the SJMC. As this impact grows, the advantages of MI-MAR become harder to capture through the AUROC of a prediction model.

We found no obvious differences in AUROC between MI-MAR and MI-MNAR with five different values of \(\delta\) (Appendix Table 30). It is possible that the four missing variables in the case study were truly MAR. The specific MNAR mechanism we assumed may not match the true missingness of our data: it is a simple assumption, whereas reality may be too complex to capture perfectly.

Discussion and conclusion

This study developed four imputation techniques (SI, MI-MAR, MI-ME, and MI-MNAR) for handling missing data and ME in predictors, together with seven prediction models for myopia risk prediction. We compared their performance by AUROC and AUPRC in the simulation study and the SJMC case study. Our findings suggest that: 1) as expected, in scenarios where predictors have missing data and ME, the MI-ME method significantly improves prediction performance regardless of the prediction model used; 2) in scenarios without ME, MI techniques (MI-MAR and MI-MNAR) achieve better prediction performance than SI (mean/mode imputation), regardless of whether the missing mechanism is MAR or MNAR; 3) when ME has a greater impact on prediction than missing data, the relative advantage of the standard MI methods (MI-MAR and MI-MNAR) is weakened and MI-ME is more recommended; and 4) statistical methods (logistic regression variants) predicted better than ML models in both the case study and the simulation study.

MI with a calibration procedure (MI-ME) can inherently handle missing data well, since it enhances the imputation of missing data by utilizing variables that have been denoised in the imputation process. This paper focused on developing a feasible prediction model, the “development stage”. Accordingly, our dataset contained the outcome variable for training the model, and we could use it in the MI models for better imputation. At the “application stage”, when the well-trained prediction model is used to predict the future risk of a new participant, the outcome is an unknown risk that needs to be predicted. Thus, some works argue that the imputation model at the development stage should not include the outcome, because it is unavailable at the application stage [26, 29, 84]. However, no imputation technique proposed at the development stage can precisely anticipate the missingness of data at the application stage; the datasets of the two stages may differ in collection procedures, populations, areas, and study interests [29]. Furthermore, incorporating all other variables, including the outcome, to impute missing values creates no new information, so MI methods that use the outcome do not cause self-fulfilling prediction [26]. Moreover, the predictor-outcome relationship estimated after MI would be less biased, provided that variation in the imputed predictors aligns closely with that of the observed predictors [26]. However, caution is needed when using MI-ME in more complex cases, such as the inclusion of non-linear terms for X in the substantive model, although it provides greater flexibility in imputation modelling assumptions [66]. In such cases, researchers should carefully derive the appropriate form of the imputation model from additional information [85]. In conclusion, our findings regarding MI methods are reliable to some extent and can be extended to similar cases.
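The point about using the outcome in the development-stage imputation model can be illustrated with a minimal Python sketch (a generic illustration with synthetic data and an off-the-shelf imputer, not the paper's MI-ME procedure): including the observed outcome in the imputation model lets the imputed predictor values reflect the predictor-outcome association.

```python
# Sketch: at the development stage the outcome y is observed, so it can be
# included in the model that imputes the missing predictor x.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(1)
n = 800
x = rng.normal(size=n)                                           # predictor
y = (rng.random(n) < 1 / (1 + np.exp(-1.5 * x))).astype(float)   # binary outcome

x_obs = x.copy()
miss = rng.random(n) < 0.4        # ~40% of x missing
x_obs[miss] = np.nan

# Impute x using a model that includes the outcome column.
dev = np.column_stack([x_obs, y])
x_imp = IterativeImputer(random_state=0).fit_transform(dev)[:, 0]
# The imputed x values now carry the x-y association, which helps keep the
# estimated predictor-outcome relationship unbiased at the development stage.
```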

It is easy to ignore the ME of non-cycloplegic SE and its impact on subsequent analysis, since non-cycloplegic SE is very common in on-campus screening [86]. However, our findings indicate that ME in an observed predictor can cause biased estimates of predictor-outcome associations, especially when the predictor strongly impacts the outcome, which is consistent with the conclusions of previous work [43]. Therefore, to ensure the reliability and validity of prediction models in similar cases, we suggest that researchers address the ME of predictors with MI methods such as MI-ME or other calibration methods. Of course, we recommend optimizing data collection to avoid missing values and ME in the first place, because essential clinical decision-making needs supportable evidence derived from high-quality data [57]. Where tests are expensive, complicated, or even unavailable, high-quality data are difficult to obtain; in that case, statistical methods or ML algorithms that address missing data and ME are good tools for extracting as much information as possible about the predictor-outcome relationship from the available data [57]. Additionally, the measurement of AL has been encouraged as an important indicator for myopia prevention [10], since it is highly correlated with myopia progression [19] and its measurement is accurate and less time-consuming than cycloplegic autorefraction [87]. As an alternative measurement to facilitate on-campus myopia screening, AL has received increasing attention from ophthalmologists. In practice, solutions to data-quality problems are flexible, depending on the resources and purposes of a specific medical analysis.
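As an illustration of the calibration idea mentioned above, a minimal regression-calibration sketch in Python follows (a textbook illustration with synthetic data: think of `w` as an error-prone measurement such as non-cycloplegic SE and `x` as a gold standard observed in a validation subsample; this is not the paper's MI-ME algorithm):

```python
# Regression calibration for a predictor measured with error:
# replace the error-prone W with an estimate of E[X | W] fitted on a
# validation subsample where the gold-standard X is observed.
import numpy as np

rng = np.random.default_rng(3)
n = 1000
x = rng.normal(size=n)                  # true predictor (gold standard)
w = x + rng.normal(scale=0.5, size=n)   # error-prone measurement

val = rng.random(n) < 0.2               # ~20% validation subsample with x observed
b1, b0 = np.polyfit(w[val], x[val], 1)  # fit E[X | W] = b0 + b1 * W
x_hat = b0 + b1 * w                     # calibrated values for the full sample
# x_hat shrinks toward the mean (b1 < 1) and is, on average, closer to the
# true x than the raw error-prone measurement w.
```

Using `x_hat` in place of `w` in a subsequent regression reduces the attenuation bias that ME induces in estimated predictor-outcome associations.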

ML techniques are powerful in prediction tasks involving nonlinear or complex, high-dimensional relationships [71]. This paper gave the opposite result: the ML techniques did not perform as well as expected on AUROC and AUPRC, and statistical methods performed better under the same conditions. One possible reason is that our sample size is relatively small and the predictors, chosen on the basis of prior research, are strongly related to the outcome, making the prediction problem less complex and possibly linear [71]. The benefits of ML techniques may be insignificant in certain medical prediction tasks. Additionally, the class imbalance in the data, due to the minority of children aged 8-10 with myopia among their peers, potentially affects the performance of the prediction models and contributes to the low AUPRC values in our case study. We did not explore the impact of label imbalance on prediction in the simulation study, as addressing it was beyond the goal of this project; future work could address both missing data and imbalanced data in prediction. Lastly, this paper used subsampling as the internal validation strategy during model development, which combines easily with imputation techniques. For other internal validation strategies, such as bootstrapping and K-fold cross-validation, how to combine MI methods with the validation strategy needs further exploration [29]. After all, the internal validation strategy affects the final effect of model training.
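The combination of imputation with subsampling internal validation can be sketched as follows (an illustrative Python pipeline on synthetic data, not the paper's code): the imputer is refitted on each training subsample, so no information leaks from the held-out data into the imputation.

```python
# Sketch: impute within each subsampling split so the held-out data never
# inform the imputation model.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import ShuffleSplit
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(2)
n = 600
X = rng.normal(size=(n, 4))
y = (rng.random(n) < 1 / (1 + np.exp(-(X[:, 0] + X[:, 1])))).astype(int)
X[rng.random((n, 4)) < 0.2] = np.nan   # ~20% of cells missing

pipe = make_pipeline(IterativeImputer(random_state=0),
                     LogisticRegression(max_iter=1000))
aurocs = []
for tr, te in ShuffleSplit(n_splits=5, test_size=0.3, random_state=0).split(X):
    pipe.fit(X[tr], y[tr])             # imputer and model fitted on the subsample only
    aurocs.append(roc_auc_score(y[te], pipe.predict_proba(X[te])[:, 1]))
# The mean and sd of `aurocs` give the internally validated performance.
```

Extending the same refit-per-split discipline to bootstrapping or K-fold cross-validation is the open question noted above.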

Some limitations of the applicability of the proposed methods should be noted. First, the sample in our study is unlikely to be representative of the whole Chinese population, as the study is school-based and only evaluated participants from Shanghai. Second, although we discussed the impact of missing mechanisms on model performance, it is impossible to perfectly characterize the missingness in future clinical or cohort data; predictably, with a larger proportion of missing data and more complicated missing mechanisms, the performance of our proposed methods might deteriorate. Lastly, we only investigated the ME of baseline SE, assuming that all other predictors are precisely measured; however, recall bias may exist in other predictors obtained from a questionnaire, such as lifestyle and behaviour data.

In summary, we utilized four imputation methods and seven prediction models to propose a solution for improving myopia risk prediction when predictors have missing data and ME. The results from the case study and the simulation study indicate that using the MI-ME method to simultaneously address missing data and ME in predictors, together with logistic regression to develop the prediction model, yields the best performance in terms of AUROC and AUPRC. In similar clinical cases, we recommend using MI with a calibration procedure to handle ME and missing data in important predictors. Overall, our findings shed light on future on-campus screening of myopic students and the development of interventions to prevent myopia.

Availability of data and materials

Not applicable.


References

  1. Dolgin E. The myopia boom. Nature. 2015;519(7543):276.
  2. Morgan IG, French AN, Ashby RS, Guo X, Ding X, He M, et al. The epidemics of myopia: aetiology and prevention. Prog Retin Eye Res. 2018;62:134–49.
  3. Jensen H. Myopia in teenagers: an eight-year follow-up study on myopia progression and risk factors. Acta Ophthalmol Scand. 1995;73(5):389–93.
  4. COMET Group, et al. Myopia stabilization and associated factors among participants in the Correction of Myopia Evaluation Trial (COMET). Investig Ophthalmol Vis Sci. 2013;54(13):7871.
  5. Zadnik K, Sinnott LT, Cotter SA, Jones-Jordan LA, Kleinstein RN, Manny RE, et al. Prediction of juvenile-onset myopia. JAMA Ophthalmol. 2015;133(6):683–9.
  6. Chua SY, Sabanayagam C, Cheung YB, Chia A, Valenzuela RK, Tan D, et al. Age of onset of myopia predicts risk of high myopia in later childhood in myopic Singapore children. Ophthalmic Physiol Opt. 2016;36(4):388–94.
  7. Cho BJ, Shin JY, Yu HG. Complications of pathologic myopia. Eye Contact Lens. 2016;42(1):9–15.
  8. Ohno-Matsui K, Lai TY, Lai CC, Cheung CMG. Updates of pathologic myopia. Prog Retin Eye Res. 2016;52:156–87.
  9. Wang SK, Guo Y, Liao C, Chen Y, Su G, Zhang G, et al. Incidence of and factors associated with myopia and high myopia in Chinese children, based on refraction without cycloplegia. JAMA Ophthalmol. 2018;136(9):1017–24.
  10. Tideman JWL, Polling JR, Jaddoe VW, Vingerling JR, Klaver CC. Environmental risk factors can reduce axial length elongation and myopia incidence in 6- to 9-year-old children. Ophthalmology. 2019;126(1):127–36.
  11. Liao C, Ding X, Han X, Jiang Y, Zhang J, Scheetz J, et al. Role of parental refractive status in myopia progression: 12-year annual observation from the Guangzhou twin eye study. Investig Ophthalmol Vis Sci. 2019;60(10):3499–506.
  12. Zhang M, Gazzard G, Fu Z, Li L, Chen B, Saw SM, et al. Validating the accuracy of a model to predict the onset of myopia in children. Investig Ophthalmol Vis Sci. 2011;52(8):5836–41.
  13. Lin H, Long E, Ding X, Diao H, Chen Z, Liu R, et al. Prediction of myopia development among Chinese school-aged children using refraction data from electronic medical records: a retrospective, multicentre machine learning study. PLoS Med. 2018;15(11):e1002674.
  14. Chen Y, Xiaobo G, He M. Optimization of machine learning-based prediction models for myopia development in a long-term longitudinal cohort of Chinese children. Investig Ophthalmol Vis Sci. 2020;61(7):89.
  15. Steele AJ, Denaxas SC, Shah AD, Hemingway H, Luscombe NM. Machine learning models in electronic health records can outperform conventional survival models for predicting patient mortality in coronary artery disease. PLoS ONE. 2018;13(8):e0202344.
  16. Al’Aref SJ, Anchouche K, Singh G, Slomka PJ, Kolli KK, Kumar A, et al. Clinical applications of machine learning in cardiovascular disease and its relevance to cardiac imaging. Eur Heart J. 2019;40(24):1975–86.
  17. Gravesteijn BY, Nieboer D, Ercole A, Lingsma HF, Nelson D, Van Calster B, et al. Machine learning algorithms performed no better than regression models for prognostication in traumatic brain injury. J Clin Epidemiol. 2020;122:95–107.
  18. Christodoulou E, Ma J, Collins GS, Steyerberg EW, Verbakel JY, Van Calster B. A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. J Clin Epidemiol. 2019;110:12–22.
  19. Tideman JWL, Snabel MC, Tedja MS, Van Rijn GA, Wong KT, Kuijpers RW, et al. Association of axial length with risk of uncorrectable visual impairment for Europeans with myopia. JAMA Ophthalmol. 2016;134(12):1355–63.
  20. Ku PW, Steptoe A, Lai YJ, Hu HY, Chu D, Yen YF, et al. The associations between near visual activity and incident myopia in children: a nationwide 4-year follow-up study. Ophthalmology. 2019;126(2):214–20.
  21. De Leeuw ED. Reducing missing data in surveys: an overview of methods. Qual Quant. 2001;35:147–60.
  22. Van der Heijden GJ, Donders ART, Stijnen T, Moons KG. Imputation of missing values is superior to complete case analysis and the missing-indicator method in multivariable diagnostic research: a clinical example. J Clin Epidemiol. 2006;59(10):1102–9.
  23. Held U, Kessels A, Garcia Aymerich J, Basagaña X, Ter Riet G, Moons KG, et al. Methods for handling missing variables in risk prediction models. Am J Epidemiol. 2016;184(7):545–51.
  24. Tsvetanova A, Sperrin M, Peek N, Buchan I, Hyland S, Martin GP. Missing data was handled inconsistently in UK prediction models: a review of method used. J Clin Epidemiol. 2021;140:149–58.
  25. Berkelmans GF, Read SH, Gudbjörnsdottir S, Wild SH, Franzen S, Van Der Graaf Y, et al. Population median imputation was noninferior to complex approaches for imputing missing values in cardiovascular prediction models in clinical practice. J Clin Epidemiol. 2022;145:70–80.
  26. Moons KG, Donders RA, Stijnen T, Harrell FE Jr. Using the outcome for imputation of missing predictor values was preferred. J Clin Epidemiol. 2006;59(10):1092–101.
  27. Mühlenbruch K, Kuxhaus O, di Giuseppe R, Boeing H, Weikert C, Schulze MB. Multiple imputation was a valid approach to estimate absolute risk from a prediction model based on case-cohort data. J Clin Epidemiol. 2017;84:130–41.
  28. De Silva AP, Moreno-Betancur M, De Livera AM, Lee KJ, Simpson JA. Multiple imputation methods for handling missing values in a longitudinal categorical variable with restrictions on transitions over time: a simulation study. BMC Med Res Methodol. 2019;19(1):1–14.
  29. Wahl S, Boulesteix AL, Zierer A, Thorand B, van de Wiel MA. Assessment of predictive performance in incomplete data by combining internal validation and multiple imputation. BMC Med Res Methodol. 2016;16(1):1–18.
  30. Vergouwe Y, Royston P, Moons KG, Altman DG. Development and validation of a prediction model with missing predictor data: a practical approach. J Clin Epidemiol. 2010;63(2):205–14.
  31. Fan M, Peng X, Niu X, Cui T, He Q. Missing data imputation, prediction, and feature selection in diagnosis of vaginal prolapse. BMC Med Res Methodol. 2023;23(1):259.
  32. Little RJ, Rubin DB. Statistical analysis with missing data (vol. 793). New Jersey: Wiley; 2019.
  33. Bi Q, Goodman KE, Kaminsky J, Lessler J. What is machine learning? A primer for the epidemiologist. Am J Epidemiol. 2019;188(12):2222–39.
  34. Frizzell JD, Liang L, Schulte PJ, Yancy CW, Heidenreich PA, Hernandez AF, et al. Prediction of 30-day all-cause readmissions in patients hospitalized for heart failure: comparison of machine learning and other statistical approaches. JAMA Cardiol. 2017;2(2):204–9.
  35. Nijman S, Leeuwenberg A, Beekers I, Verkouter I, Jacobs J, Bots M, et al. Missing data is poorly handled and reported in prediction model studies using machine learning: a literature review. J Clin Epidemiol. 2022;142:218–29.
  36. Wu PC, Huang HM, Yu HJ, Fang PC, Chen CT. Epidemiology of myopia. Asia Pac J Ophthalmol. 2016;5(6):386–93.
  37. Flitcroft DI, He M, Jonas JB, Jong M, Naidoo K, Ohno-Matsui K, et al. IMI-Defining and classifying myopia: a proposed set of standards for clinical and epidemiologic studies. Investig Ophthalmol Vis Sci. 2019;60(3):M20–30.
  38. van Minderhout HM, Joosse MV, Grootendorst DC, Schalij-Delfos NE. Adverse reactions following routine anticholinergic eye drops in a paediatric population: an observational cohort study. BMJ Open. 2015;5(12):e008798.
  39. Williams KM, Bertelsen G, Cumberland P, Wolfram C, Verhoeven VJ, Anastasopoulos E, et al. Increasing prevalence of myopia in Europe and the impact of education. Ophthalmology. 2015;122(7):1489–97.
  40. Nartey ET, van Staden DB, Amedo AO. Prevalence of ocular anomalies among schoolchildren in Ashaiman, Ghana. Optom Vis Sci. 2016;93(6):607–11.
  41. Yotsukura E, Torii H, Inokuchi M, Tokumura M, Uchino M, Nakamura K, et al. Current prevalence of myopia and association of myopia with environmental factors among schoolchildren in Japan. JAMA Ophthalmol. 2019;137(11):1233–9.
  42. Mutti D, Zadnik K, Egashira S, Kish L, Twelker J, Adams A. The effect of cycloplegia on measurement of the ocular components. Investig Ophthalmol Vis Sci. 1994;35(2):515–27.
  43. Whittle R, Peat G, Belcher J, Collins GS, Riley RD. Measurement error and timing of predictor values for multivariable risk prediction models are poorly reported. J Clin Epidemiol. 2018;102:38–49.
  44. Khudyakov P, Gorfine M, Zucker D, Spiegelman D. The impact of covariate measurement error on risk prediction. Stat Med. 2015;34(15):2353–67.
  45. Rosner B, Willett W, Spiegelman D. Correction of logistic regression relative risk estimates and confidence intervals for systematic within-person measurement error. Stat Med. 1989;8(9):1051–69.
  46. Brakenhoff TB, Mitroiu M, Keogh RH, Moons KG, Groenwold RH, van Smeden M. Measurement error is often neglected in medical literature: a systematic review. J Clin Epidemiol. 2018;98:89–97.
  47. Schafer DW. Semiparametric maximum likelihood for measurement error model regression. Biometrics. 2001;57(1):53–61.
  48. Richardson S, Gilks WR. A Bayesian approach to measurement error problems in epidemiology using conditional independence models. Am J Epidemiol. 1993;138(6):430–42.
  49. Muff S, Riebler A, Held L, Rue H, Saner P. Bayesian analysis of measurement error models using integrated nested Laplace approximations. J R Stat Soc Ser C Appl Stat. 2015;64(2):231–52.
  50. Blackwell M, Honaker J, King G. A unified approach to measurement error and missing data: details and extensions. Sociol Methods Res. 2017;46(3):342–69.
  51. Frénay B, Verleysen M. Classification in the presence of label noise: a survey. IEEE Trans Neural Netw Learn Syst. 2013;25(5):845–69.
  52. Gupta S, Gupta A. Dealing with noise problem in machine learning data-sets: a systematic review. Procedia Comput Sci. 2019;161:466–74.
  53. Zhu X, Wu X. Class noise vs. attribute noise: a quantitative study. Artif Intell Rev. 2004;22:177–210.
  54. Grace YY, Delaigle A, Gustafson P. Handbook of measurement error models. Boca Raton: CRC Press; 2021.
  55. Van Buuren S. Flexible imputation of missing data. Boca Raton: CRC Press; 2018.
  56. Donders ART, Van Der Heijden GJ, Stijnen T, Moons KG. A gentle introduction to imputation of missing values. J Clin Epidemiol. 2006;59(10):1087–91.
  57. Nijman SWJ, Groenhof TKJ, Hoogland J, Bots ML, Brandjes M, Jacobs JJ, et al. Real-time imputation of missing predictor values improved the application of prediction models in daily practice. J Clin Epidemiol. 2021;134:22–34.
  58. Van Buuren S, Boshuizen HC, Knook DL. Multiple imputation of missing blood pressure covariates in survival analysis. Stat Med. 1999;18(6):681–94.
  59. Rubin DB. Multiple imputation after 18+ years. J Am Stat Assoc. 1996;91(434):473–89.
  60. Sterne JA, White IR, Carlin JB, Spratt M, Royston P, Kenward MG, et al. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ. 2009;338:b2393.
  61. Carpenter JR, Kenward MG, White IR. Sensitivity analysis after multiple imputation under missing at random: a weighting approach. Stat Methods Med Res. 2007;16(3):259–75.
  62. Cole SR, Chu H, Greenland S. Multiple-imputation for measurement-error correction. Int J Epidemiol. 2006;35(4):1074–81.
  63. Brownstone D, Valletta RG. Modeling earnings measurement error: a multiple imputation approach. Rev Econ Stat. 1996;78(4):705–17.
  64. Freedman LS, Midthune D, Carroll RJ, Kipnis V. A comparison of regression calibration, moment reconstruction and imputation for adjusting for covariate measurement error in regression. Stat Med. 2008;27(25):5195–216.
  65. Keogh RH, White IR. A toolkit for measurement error correction, with a focus on nutritional epidemiology. Stat Med. 2014;33(12):2137–55.
  66. Bartlett JW, Seaman SR, White IR, Carpenter JR, Alzheimer’s Disease Neuroimaging Initiative. Multiple imputation of covariates by fully conditional specification: accommodating the substantive model. Stat Methods Med Res. 2015;24(4):462–87.
  67. Breiman L, Friedman JH, Olshen RA, Stone CJ. Classification and regression trees. New York: CRC; 2017.
  68. Lewis DD. Naive (Bayes) at forty: the independence assumption in information retrieval. In: Machine Learning: ECML-98: 10th European Conference on Machine Learning, Chemnitz, Germany, April 21–23, 1998, Proceedings 10. Berlin: Springer; 1998. p. 4–15.
  69. Breiman L. Bagging predictors. Mach Learn. 1996;24:123–40.
  70. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: Association for Computing Machinery; 2016. p. 785–94.
  71. Volovici V, Syn NL, Ercole A, Zhao JJ, Liu N. Steps to avoid overuse and misuse of machine learning in clinical research. Nat Med. 2022;28(10):1996–9.
  72. Katz MH. Multivariable analysis: a practical guide for clinicians and public health researchers. San Francisco: Cambridge University Press; 2011.
  73. Steyerberg EW, Eijkemans MJ, Habbema JDF. Stepwise selection in small data sets: a simulation study of bias in logistic regression analysis. J Clin Epidemiol. 1999;52(10):935–42.
  74. Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Ser B Stat Methodol. 1996;58(1):267–88.
  75. Leyrat C, Seaman SR, White IR, Douglas I, Smeeth L, Kim J, et al. Propensity score analysis with partially observed covariates: how should multiple imputation be used? Stat Methods Med Res. 2019;28(1):3–19.
  76. McIntire DD, Bloom SL, Casey BM, Leveno KJ. Birth weight in relation to morbidity and mortality among newborn infants. N Engl J Med. 1999;340(16):1234–8.
  77. Lubchenco LO, Hansman C, Dressler M, Boyd E. Intrauterine growth as estimated from liveborn birth-weight data at 24 to 42 weeks of gestation. Pediatrics. 1963;32(5):793–800.
  78. Thurber KA, Dobbins T, Kirk M, Dance P, Banwell C. Early life predictors of increased body mass index among Indigenous Australian children. PLoS ONE. 2015;10(6):e0130039.
  79. Jones LA, Sinnott LT, Mutti DO, Mitchell GL, Moeschberger ML, Zadnik K. Parental history of myopia, sports and outdoor activities, and future myopia. Investig Ophthalmol Vis Sci. 2007;48(8):3524–32.
  80. Algawi K, Goggin M, O’Keefe M. Refractive outcome following diode laser versus cryotherapy for eyes with retinopathy of prematurity. Br J Ophthalmol. 1994;78(8):612–4.
  81. Fieß A, Schuster AKG, Nickels S, Elflein HM, Schulz A, Beutel ME, et al. Association of low birth weight with myopic refractive error and lower visual acuity in adulthood: results from the population-based Gutenberg Health Study (GHS). Br J Ophthalmol. 2019;103(1):99–105.
  82. Ojaimi E, Rose K, Rochtchina E, Mai T, Mitchell P. Axial length and its association with gender and anthropometric parameters in a cohort of 6 year old children. Investig Ophthalmol Vis Sci. 2004;45(13):2743.
  83. Jin JX, Hua WJ, Jiang X, Wu XY, Yang JW, Gao GP, et al. Effect of outdoor activity on myopia onset and progression in school-aged children in northeast China: the Sujiatun Eye Care Study. BMC Ophthalmol. 2015;15:1–11.
  84. Sperrin M, Martin GP, Sisk R, Peek N. Missing data should be handled differently for prediction than for description or causal explanation. J Clin Epidemiol. 2020;125:183–7.
  85. Keogh RH, Bartlett JW. Measurement error as a missing data problem. In: Handbook of measurement error models. Boca Raton: CRC Press; 2021. p. 429–50.
  86. Wong YL, Yuan Y, Su B, Tufail S, Ding Y, Ye Y, et al. Prediction of myopia onset with refractive error measured using non-cycloplegic subjective refraction: the WEPrOM Study. BMJ Open Ophthalmol. 2021;6(1):e000628.
  87. Sheng H, Bottjer CA, Bullimore MA. Ocular component measurement using the Zeiss IOLMaster. Optom Vis Sci. 2004;81(1):27–34.


Acknowledgements

The authors are grateful to the editors and two anonymous referees for their helpful comments.

Funding

This work was supported by Key Research and Development Project of the Ministry of Science and Technology of China (Grant No.2022YFC3600903 and No.2022YFC3600901) and National Natural Science Foundation of China (Grant No. 12071089 and No.71991470).

Author information

Authors and Affiliations

Authors

Contributions

Conceptualization: H. Lai and B. Fu. Data curation: H. Lai, M. Li, T. Li, X.D. Zhou, and X.T. Zhou. Formal analysis and methodology: H. Lai, K. Gao, H. Guo, and B. Fu. Writing of the original draft: H. Lai, K. Gao, M. Li, T. Li, H. Guo, and B. Fu. All authors reviewed the manuscript and approved the final version.

Corresponding author

Correspondence to Bo Fu.

Ethics declarations

Ethics approval and consent to participate

In the case study, the Shanghai Jinshan Myopia Cohort Study (SJMC) received ethical clearance from the Ethics Committee of Jinshan Hospital of Fudan University, Shanghai. The authors were approved to use the SJMC data and were granted access to them. Informed consent was obtained from a caregiver on behalf of each study child, as the children were minors at the time of data collection.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Lai, H., Gao, K., Li, M. et al. Handling missing data and measurement error for early-onset myopia risk prediction models. BMC Med Res Methodol 24, 194 (2024). https://doi.org/10.1186/s12874-024-02319-x
