Implementing multiple imputations for addressing missing data in multireader multicase design studies
BMC Medical Research Methodology volume 24, Article number: 217 (2024)
Abstract
Background
In computer-aided diagnosis (CAD) studies utilizing multireader multicase (MRMC) designs, missing data might occur when there are instances of misinterpretation or oversight by the reader or problems with measurement techniques. Improper handling of these missing data can lead to bias. However, little research has been conducted on addressing the missing data issue within the MRMC framework.
Methods
We introduced a novel approach that integrates multiple imputation with MRMC analysis (MIMRMC). An elaborate simulation study was conducted to compare the efficacy of our proposed approach with that of the traditional complete case analysis strategy within the MRMC design. Furthermore, we applied these approaches to a real MRMC design CAD study on aneurysm detection via head and neck CT angiograms to further validate their practicality.
Results
The simulation study demonstrated that, compared with traditional complete case analysis, the MIMRMC approach provides an almost unbiased estimate of diagnostic capability, alongside satisfactory statistical power and type I error rate within the MRMC framework, even in small-sample scenarios. In the real CAD study, the proposed MIMRMC method further demonstrated strong performance in terms of both point estimates and confidence intervals compared with traditional complete case analysis.
Conclusion
Within MRMC design settings, the adoption of an MIMRMC approach in the face of missing data can facilitate the attainment of unbiased and robust estimates of diagnostic capability.
Introduction
The accuracy of imaging diagnostic modalities is shaped by not only the technical specifications of the diagnostic equipment or the algorithms but also the skill set, education, and sensory and cognitive capacities of the interpreting clinicians/readers (e.g., radiologists) [1,2,3]. The multireader multicase (MRMC) design, in which multiple readers assess each case, enables quantification of the impact of reader variability on the accuracy of imaging diagnostic modalities. As a result, MRMC design studies can enhance the generalizability of study findings and strengthen the overall robustness of the research [4]. The MRMC design is currently required for the clinical evaluation of computer-aided diagnostic (CAD) devices and imaging diagnostic modalities by regulatory agencies, including the Food and Drug Administration in the United States [5] and the National Medical Products Administration in China [6, 7].
For the analysis of MRMC design data, a lack of independence in reader performance is a critical consideration [8]. Traditional statistical methods may not be suitable for this complexity. The Dorfman–Berbaum–Metz (DBM) [9] method and the Obuchowski–Rockette (OR) [8] method are commonly used approaches to address the intricate correlations present in MRMC studies [10]. In DBM analysis, to address the lack of independence in readers’ performance, jackknife pseudovalues are computed for each test-reader combination, and a mixed-effects analysis of variance (ANOVA) is subsequently performed on these pseudovalues to carry out significance testing. In OR analysis, the correlations are addressed by adjusting the F statistic to account for the underlying correlation structures.
As with any study, the challenge of missing data is ubiquitous. Missing data can occur in MRMC design studies when there are instances of misreading or omissions by the reader, substandard specimen collection, issues with measurement techniques, errors during the data collection process, or when results exceed threshold values [4, 11,12,13]. Despite this commonality, the majority of MRMC design clinical trials fail to disclose whether they grappled with missing data issues [10]. Consequently, it remains unclear whether the analytical outcomes were derived from complete or incomplete datasets or if a suitable method for handling missing data was employed. This stands in contrast to the Checklist for Artificial Intelligence in Medical Imaging [14] and the Standards for Reporting of Diagnostic Accuracy Studies [15] guidelines, which both explicitly mandate the transparent reporting of missing data and the strategies employed to address them. Within the framework of causal inference, the ambiguity surrounding the status of missing data can introduce uncertainties about the conditions under which results are inferred and may even potentially result in biased estimates [16, 17].
Currently, there is limited research on methods specifically designed for handling missing data in MRMC studies. This might explain why missing data are rarely reported in such studies. For those that do address missing data issues, the complete case analysis method is arguably the most commonly used approach. This method involves discarding any case that contains missing data, including all evaluations of that case by all readers [18, 19]. This approach typically requires that the data be missing completely at random (MCAR); otherwise, the results obtained might be biased. Furthermore, the complete case method loses information through the reduction in sample size, which might also affect the accuracy of the trial results and decrease the statistical power [17]. Additionally, from the perspective of causal inference, accuracy estimates derived from complete case analyses represent only the subset of the population with complete records and therefore may not reflect the estimand for the entire target population [20]. Hence, missing data handling approaches, especially for MRMC designs, are urgently needed.
In 1976, Donald Rubin [21] introduced the concept of multiple imputation (MI), which involves imputing each missing value multiple times according to a selected imputation model, analyzing the imputed datasets individually, and combining the results on the basis of Rubin’s rules. Thus, MI is able to reflect the uncertainty associated with the data imputation process by increasing the variability of the imputed data. This approach has gained widespread adoption for managing missing data in various research contexts, including drug clinical trials [22] and observational studies [23], and addressing verification bias in diagnostic studies [24, 25]. However, the implementation of MI within the MRMC design framework remains relatively unexplored. The successful implementation of MI hinges on the congruence between the imputation model and the analysis model, necessitating that the imputation method captures all variables and characteristics pertinent to the analysis model. This requirement ensures unbiased parameter estimates and correctly calculated standard errors [26, 27]. The complexity inherent in MRMC designs, however, poses significant challenges to this congruence.
In light of these gaps, in this study, we aim to establish a missing data handling approach that integrates MI theory with MRMC analysis to maximize the use of available data and minimize biases resulting from the exclusion of cases with missing information. We intend to validate the feasibility and suitability of the proposed approach through both simulation studies and a real CAD study. Our goal is to provide a reliable solution for managing missing data within the MRMC framework and thereby enhance the reliability of diagnostic trial outcomes in real-world clinical settings.
The structure of this paper is as follows: First, the approach to address the issue of missing data within MRMC design studies is presented. The specifics of the simulation study are then detailed, including both the setup and the findings obtained. Subsequently, the proposed approach is implemented in a real MRMC design study that includes instances of missing data. The paper concludes with a discussion of the implications of the work and offers practical recommendations.
Methods
Basic settings and notations
In this study, a two-test receiver operating characteristic (ROC) paradigm MRMC study design is assumed. Each reader is tasked with interpreting all cases and assigning a confidence-of-disease score that reflects their assessment of the presence of disease. The true disease status of the patients is verified by experienced, independent readers who serve as the gold standard. Instances of missing data may occur during the evaluation phase, leading to the absence of interpretation results. We assume that these data are missing completely at random (MCAR) or missing at random (MAR). The term ‘test’ will be used to refer to the imaging system, modality, or image processing throughout this article.
In terms of notation, \(\:{X}_{ijk}\) represents the confidence-of-disease score assigned to the \(\:k\)th case by reader \(\:j\) on the basis of the \(\:i\)th test. The observed data consist of \(\:{X}_{ijk}\), with \(\:i=1,\dots\:,I\), \(\:j=1,\dots\:,J\), and \(\:k=1,\dots\:,K\), where \(\:I\) is the number of diagnostic tests evaluated (set to two here for ease of illustration), \(\:J\) denotes the number of readers, and \(\:K\) is the total number of cases examined.
Traditional approach: complete case (CC) analysis
Under the complete case (CC) analysis framework, any instance where a single reading record or assessment is missing results in the exclusion of all interpretation results associated with that case. This exclusion applies across all readers and modalities, ensuring that the dataset—referred to as the complete case dataset—comprises only cases with fully observed data.
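As a minimal sketch of this case-wise deletion, assume the ratings are held in a nested list `scores[i][j][k]` (test, reader, case) with `None` marking a missing reading. The function name and data layout are illustrative assumptions, not part of the study's software.

```python
def complete_cases(scores):
    """Keep only cases that every reader scored under every test.

    scores[i][j][k] is the rating for test i, reader j, case k; None marks a
    missing reading. Returns the filtered array and the retained case indices.
    """
    n_cases = len(scores[0][0])
    keep = [
        k for k in range(n_cases)
        if all(reader[k] is not None for test in scores for reader in test)
    ]
    filtered = [[[reader[k] for k in keep] for reader in test] for test in scores]
    return filtered, keep


# Toy data: 2 tests x 2 readers x 4 cases, with a single missing reading.
scores = [
    [[3, 7, 1, 9], [2, None, 1, 8]],
    [[4, 8, 2, 9], [3, 7, 1, 9]],
]
cc_scores, kept = complete_cases(scores)
print(kept)  # [0, 2, 3]
```

One missing reading removes the case for all readers and both tests, so three other observed ratings are discarded along with it.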
In this study, DBM [9, 28, 29] analysis was subsequently conducted on the complete case dataset. This analysis method transforms correlated figures of merit (FOM), specifically the area under the ROC curve (AUC), into independent test-reader-case-level jackknife pseudovalues, thereby addressing the complex correlation structure inherent in MRMC data.
The formula for calculating the jackknife pseudovalue is as follows:
where \(\:{Y}_{ijk}\) represents the jackknife pseudovalue of the AUC for the \(\:i\)th test, \(\:j\)th reader, and \(\:k\)th case. \(\:{\widehat{\theta\:}}_{ij}\) is the AUC estimate derived from all the cases for the \(\:i\)th test and the \(\:j\)th reader. \(\:{\widehat{\theta\:}}_{ij\left(k\right)}\) corresponds to the AUC estimate computed excluding the \(\:k\)th case. The jackknife pseudovalue of the \(\:k\)th case can be viewed as the weighted difference in accuracy. When the FOM is the Wilcoxon AUC, the pseudovalues averaged across the case index are identical to the respective FOM estimates.
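In the notation just defined, the standard jackknife pseudovalue is

\[ Y_{ijk} = K\,\widehat{\theta}_{ij} - (K-1)\,\widehat{\theta}_{ij(k)} . \]

The sketch below computes it for one test-reader pair with the Wilcoxon AUC as the FOM. The function names and toy data are illustrative assumptions.

```python
def wilcoxon_auc(scores, truth):
    """Empirical (Wilcoxon) AUC: P(diseased > nondiseased) + 0.5 * P(tie)."""
    pos = [s for s, t in zip(scores, truth) if t == 1]
    neg = [s for s, t in zip(scores, truth) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def jackknife_pseudovalues(scores, truth):
    """Y_k = K * theta_hat - (K - 1) * theta_hat_(k) for one test-reader pair."""
    K = len(scores)
    theta_all = wilcoxon_auc(scores, truth)
    pseudo = []
    for k in range(K):
        # Recompute the AUC with case k left out.
        s = scores[:k] + scores[k + 1:]
        t = truth[:k] + truth[k + 1:]
        pseudo.append(K * theta_all - (K - 1) * wilcoxon_auc(s, t))
    return pseudo


scores = [1, 3, 2, 5, 4, 6, 7, 5]
truth = [0, 0, 0, 0, 1, 1, 1, 1]
auc = wilcoxon_auc(scores, truth)
pv = jackknife_pseudovalues(scores, truth)
mean_pv = sum(pv) / len(pv)  # equals auc for the Wilcoxon AUC
```

Averaging `pv` reproduces `auc` up to floating-point error, which is the property of the Wilcoxon AUC noted in the text.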
Using \(\:{Y}_{ijk}\) as the response, the DBM method for testing the effect of the imaging diagnostic tests can be specified via three-factor ANOVA, with the test effect treated as a fixed factor and the reader and case effects treated as random factors to account for the variability among different readers, cases and interactions.
where \(\:{\tau\:}_{i}\) represents the fixed effect attributable to the \(\:i\)th imaging test. \(\:{R}_{j}\) and \(\:{C}_{k}\) are the random effects associated with the\(\:\:j\)th reader and \(\:k\)th case, respectively. Interaction terms, represented by multiple symbols in parentheses, are considered random effects. The error term \(\:{\epsilon\:}_{ijk}\) captures the residual variability not explained by the model. The DBM approach assumes that the random effects, including the interaction terms, are mutually independent and follow normal distributions with means of zero.
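Written out with the terms described in this paragraph, the three-factor model reads as follows. This is the standard DBM form, given as a reconstruction from the definitions above rather than a quotation:

\[ Y_{ijk} = \mu + \tau_{i} + R_{j} + C_{k} + (\tau R)_{ij} + (\tau C)_{ik} + (RC)_{jk} + (\tau RC)_{ijk} + \epsilon_{ijk}, \]

where \(\mu\) denotes the overall mean. Without replicated readings, \((\tau RC)_{ijk}\) and \(\epsilon_{ijk}\) cannot be estimated separately and are usually treated as a single term.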
The DBM F statistic for testing the test effect is based on the conventional mixed model and was later modified by Hillis to ensure that the type I error rates are within acceptable bounds [30].
Consequently, for the complete case dataset, the estimated effect size (the difference in the FOM across tests) and corresponding statistics are as follows, where the subscript CC denotes metrics calculated for the complete case dataset:
Mean-square quantities calculated from the pseudovalues [29]:
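The sketch below shows how these quantities are commonly computed for two tests from an array of pseudovalues `Y[i][j][k]`. It uses the standard DBM mean squares together with the Hillis denominator \(MS(T{\ast}R) + \max\{MS(T{\ast}C) - MS(T{\ast}R{\ast}C), 0\}\) and the matching denominator degrees of freedom. The function name, return keys, and toy data are assumptions made for illustration, and the code is not claimed to match the article's equations term for term.

```python
import math
import random


def dbm_two_test(Y):
    """DBM summary for pseudovalues Y[i][j][k] with I = 2 tests.

    Returns the AUC difference (test 1 minus test 2), its standard error,
    the F statistic, and the Hillis denominator degrees of freedom.
    """
    I, J, K = len(Y), len(Y[0]), len(Y[0][0])
    mean = lambda xs: sum(xs) / len(xs)
    g = mean([Y[i][j][k] for i in range(I) for j in range(J) for k in range(K)])
    m_i = [mean([Y[i][j][k] for j in range(J) for k in range(K)]) for i in range(I)]
    m_j = [mean([Y[i][j][k] for i in range(I) for k in range(K)]) for j in range(J)]
    m_k = [mean([Y[i][j][k] for i in range(I) for j in range(J)]) for k in range(K)]
    m_ij = [[mean(Y[i][j]) for j in range(J)] for i in range(I)]
    m_ik = [[mean([Y[i][j][k] for j in range(J)]) for k in range(K)] for i in range(I)]
    m_jk = [[mean([Y[i][j][k] for i in range(I)]) for k in range(K)] for j in range(J)]

    ms_t = J * K * sum((m_i[i] - g) ** 2 for i in range(I)) / (I - 1)
    ms_tr = K * sum((m_ij[i][j] - m_i[i] - m_j[j] + g) ** 2
                    for i in range(I) for j in range(J)) / ((I - 1) * (J - 1))
    ms_tc = J * sum((m_ik[i][k] - m_i[i] - m_k[k] + g) ** 2
                    for i in range(I) for k in range(K)) / ((I - 1) * (K - 1))
    ms_trc = sum((Y[i][j][k] - m_ij[i][j] - m_ik[i][k] - m_jk[j][k]
                  + m_i[i] + m_j[j] + m_k[k] - g) ** 2
                 for i in range(I) for j in range(J) for k in range(K)
                 ) / ((I - 1) * (J - 1) * (K - 1))

    denom = ms_tr + max(ms_tc - ms_trc, 0.0)   # Hillis-constrained denominator
    diff = m_i[0] - m_i[1]
    se = math.sqrt(2.0 * denom / (J * K))      # valid for I = 2
    ddf = denom ** 2 / (ms_tr ** 2 / ((I - 1) * (J - 1)))
    return {"diff": diff, "se": se, "F": ms_t / denom, "ddf": ddf}


# Toy pseudovalues: 2 tests x 4 readers x 20 cases.
rng = random.Random(0)
Y = [[[0.80 + 0.05 * i + rng.gauss(0, 0.3) for _ in range(20)]
      for _ in range(4)] for i in range(2)]
res = dbm_two_test(Y)
```

With two tests the F statistic equals the square of `diff / se`, so the same quantities give both the test and the confidence interval.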
Proposed MIMRMC Approach
Step 1. Imputation
The multiple imputation by chained equations (MICE) algorithm was implemented to construct the imputation of missing data [31]. The MICE algorithm addresses missing data by generating multiple imputations that reflect the posterior predictive distribution, \(\:P\left({X}_{miss}\mid{X}_{obs}\right)\). This process involves constructing a sequence of prediction models, with the imputation of each variable being conditional on the observed and previously imputed values of the other variables. By iteratively producing multiple imputed datasets (\(\:M\) datasets in our implementation), the MICE approach encapsulates the uncertainty inherent in the imputation process.
In the construction of the above-mentioned predictive models for the MICE algorithm within MRMC studies, the typical scarcity of auxiliary variables poses a methodological challenge. To circumvent this limitation, an imputation model is proposed that leverages the intrinsic correlations among different readers’ interpretations. Since these readers assess identical case sets, their interrelated evaluations provide a solid basis for the imputation model. In addition, given that the interpretation ratings by each reader are typically treated as continuous variables, the predictive mean matching method was incorporated to enhance the imputation process [32]. Moreover, to accommodate potential variations that may arise when readers evaluate cases across different tests and disease statuses, the model is further calibrated using a subset of the data stratified by modality and disease status.
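To illustrate the idea of predictive mean matching, the sketch below imputes one reader's missing ratings from a single other reader who scored the same cases, within one modality-by-disease-status stratum. It is deliberately simplified: it uses one predictor and point estimates of the regression coefficients, whereas a full MICE implementation would condition on all other readers and draw the coefficients from their posterior. The function name `pmm_impute` and the toy ratings are assumptions.

```python
import random


def pmm_impute(x, y, n_donors=5, rng=None):
    """Impute None entries of y by predictive mean matching on predictor x.

    x: fully observed scores (e.g. another reader, same cases).
    y: target scores with None for missing readings.
    Simplification: regression coefficients are point estimates rather than
    posterior draws, so imputation uncertainty is understated.
    """
    rng = rng or random.Random()
    obs = [i for i, v in enumerate(y) if v is not None]
    mx = sum(x[i] for i in obs) / len(obs)
    my = sum(y[i] for i in obs) / len(obs)
    sxx = sum((x[i] - mx) ** 2 for i in obs)
    slope = sum((x[i] - mx) * (y[i] - my) for i in obs) / sxx
    intercept = my - slope * mx
    pred = [intercept + slope * xi for xi in x]

    completed = list(y)
    for i, v in enumerate(y):
        if v is None:
            # Donors: observed cases whose predicted value is closest to case i's.
            donors = sorted(obs, key=lambda d: abs(pred[d] - pred[i]))[:n_donors]
            completed[i] = y[rng.choice(donors)]  # borrow an observed rating
    return completed


reader_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
reader_b = [1, 3, None, 4, 6, None, 7, 9, 8, 10]
filled = pmm_impute(reader_a, reader_b, n_donors=3, rng=random.Random(42))
```

Because every imputed value is borrowed from an observed rating, imputations stay on the rating scale actually used by the reader.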
For diseased cases under test 1, let variable \(\:{X}_{j}\:\) represent the interpretation results of reader\(\:\:j\) (\(\:\:j=\text{1,2},\dots\:,J\)). The observed dataset comprising the results from all readers is denoted as \(\:{x}_{\left(0\right)}\), where \(\:{x}_{\left(0\right)}=\{{X}_{1\left(0\right)},\dots\:,{X}_{j\left(0\right)},\dots\:,\:{X}_{J\left(0\right)}\}\). \(\:{x}_{j\left(1\right)}\) represents the missing part of \(\:{X}_{j}\). The imputation of missing data proceeds through the following process:

a) Create the initial imputation for the missing data: \(\:{x}_{1\left(1\right)}^{\left(0\right)},{x}_{2\left(1\right)}^{\left(0\right)},\dots\:,{x}_{J\left(1\right)}^{\left(0\right)}\).

b) In the current iteration (t + 1), the imputed values from the previous iteration (t), denoted as \(\:{x}_{1\left(1\right)}^{\left(t\right)},\dots\:,{x}_{J\left(1\right)}^{\left(t\right)}\), are updated for each variable. This update is achieved by applying the specific predictive formula provided below:
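In generic chained-equations form, the update for reader \(j\) can be written as below. This is the usual MICE step expressed in the notation above and is offered as an assumption about the intended formula, not a quotation of it:

\[ x_{j(1)}^{(t+1)} \sim P\left(x_{j(1)} \mid X_{j(0)},\, x_{1}^{(t+1)}, \dots, x_{j-1}^{(t+1)},\, x_{j+1}^{(t)}, \dots, x_{J}^{(t)}\right), \quad j = 1, \dots, J, \]

where \(x_{l}^{(t)}\) denotes reader \(l\)'s ratings with the missing part filled in by the iteration-\(t\) imputations. Readers already visited in the current sweep contribute their updated values, and the remaining readers contribute their values from the previous sweep.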
Step 2. Analysis of the individual imputed datasets
For the analysis of the \(\:M\) imputed datasets, the DBM method was also utilized for comparison purposes. Thus, for the \(\:m\)th imputed dataset (\(\:m=\text{1,2},\dots\:,M\)), the estimated effect size and corresponding statistics are as follows, and the mean-square quantities are calculated in the same way as in Eq. (7):
Step 3. Pooling results
After the analysis of the individual imputed datasets, the estimates obtained from each imputed dataset are combined according to Rubin’s rules [31].
Step 3a. Pooling the effect size
The point estimate of \(\:\theta\:\), derived from multiple imputation, is calculated as the mean of the point estimates \(\:{\widehat{\theta\:}}_{m}\) obtained from each of the \(\:M\) imputed datasets (\(\:m=\text{1,2},\dots\:,M\)).
Step 3b. Pooling variance
The total variance of the parameter estimate \(\:\theta\:\) is composed of two components: the between-imputation variance (\(\:{V}_{B}\)) and the within-imputation variance (\(\:{V}_{W}\)), where the between-imputation variance captures the variability among the different imputed dataset estimates, and the within-imputation variance is determined by each individual imputed dataset itself.
Within imputation variance:
Between imputation variance:
Total variance:
The pooled standard error:
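In the notation of this step, Rubin's rules take the following standard form. The symbol \(SE_{m}\), the standard error of \(\widehat{\theta}_{m}\) from the \(m\)th imputed dataset, is introduced here for illustration:

\[ \bar{\theta} = \frac{1}{M}\sum_{m=1}^{M}\widehat{\theta}_{m}, \qquad V_{W} = \frac{1}{M}\sum_{m=1}^{M} SE_{m}^{2}, \qquad V_{B} = \frac{1}{M-1}\sum_{m=1}^{M}\left(\widehat{\theta}_{m}-\bar{\theta}\right)^{2}, \]

\[ V_{T} = V_{W} + \left(1+\frac{1}{M}\right)V_{B}, \qquad SE_{pooled} = \sqrt{V_{T}} . \]

The factor \(1 + 1/M\) accounts for the extra uncertainty from using a finite number of imputations.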
Step 3c. Significance testing

i. Wald statistic for MIMRMC
The Wald statistic is constructed by dividing the estimated effect size by its pooled standard error; under the null hypothesis, this ratio follows a t-distribution.

ii. Degrees of freedom for MIMRMC.
It is proposed that the degrees of freedom for statistical inference should adequately represent the uncertainty from both the MRMC process and the MI process. To achieve this, the average degrees of freedom from the \(\:M\) imputed datasets were chosen as a proxy for the degrees of freedom attributable to the MRMC phase. This average is then integrated with the degrees of freedom as prescribed by the multiple imputation procedure, in accordance with the principles outlined in Rubin’s rules [33]. This composite degree of freedom is then used to conduct statistical tests, ensuring that our final inferences are sensitive to the complexities and uncertainties inherent in both the MRMC and MI processes.
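A minimal sketch of this pooling step follows. It combines the per-imputation estimates, standard errors, and DBM degrees of freedom using Rubin's rules and the Barnard and Rubin small-sample adjustment, with the average DBM degrees of freedom used as the complete-data value as described above. The function name, argument names, and numbers are illustrative assumptions, not the study's results.

```python
import math


def pool_mi_mrmc(estimates, std_errors, dfs):
    """Pool M per-imputation results with Rubin's rules.

    estimates:  AUC differences, one per imputed dataset.
    std_errors: their standard errors from the MRMC analysis.
    dfs:        their MRMC denominator degrees of freedom.
    """
    M = len(estimates)
    theta = sum(estimates) / M
    v_w = sum(se ** 2 for se in std_errors) / M            # within-imputation
    v_b = sum((e - theta) ** 2 for e in estimates) / (M - 1)  # between-imputation
    v_t = v_w + (1 + 1 / M) * v_b
    se = math.sqrt(v_t)

    df_com = sum(dfs) / M                       # proxy for the MRMC-phase df
    lam = (1 + 1 / M) * v_b / v_t               # share of variance due to missingness
    df_obs = (df_com + 1) / (df_com + 3) * df_com * (1 - lam)
    if lam > 0:
        df_old = (M - 1) / lam ** 2             # Rubin (1987) degrees of freedom
        df = df_old * df_obs / (df_old + df_obs)
    else:
        df = df_obs                             # no between-imputation variance
    return {"theta": theta, "se": se, "t": theta / se, "df": df}


# Hypothetical results from M = 5 imputed datasets.
pooled = pool_mi_mrmc(
    estimates=[0.050, 0.054, 0.048, 0.052, 0.051],
    std_errors=[0.020, 0.021, 0.019, 0.020, 0.020],
    dfs=[30, 28, 33, 31, 29],
)
```

The composite degrees of freedom can never exceed the complete-data value, which avoids the inflation that occurs when the between-imputation variance is very small.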

iii. Confidence interval for MIMRMC.
The confidence interval can then be obtained via the equation below:
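Given the pooled estimate, pooled standard error, and composite degrees of freedom, a two-sided \(100(1-\alpha)\%\) interval has the usual t-based form. The symbols \(\bar{\theta}\), \(SE_{pooled}\), and \(\nu\) are labels chosen here for those three quantities:

\[ \bar{\theta} \;\pm\; t_{1-\alpha/2,\;\nu}\; SE_{pooled}, \]

where \(t_{1-\alpha/2,\,\nu}\) is the \(1-\alpha/2\) quantile of the t-distribution with \(\nu\) degrees of freedom.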
Simulation study

a. Original complete dataset generation.
The original complete datasets were generated from the Roe and Metz model [34], which is built on a binormal distribution framework. In this simulation, it was assumed that all simulated readers evaluated all cases across two imaging modalities and assigned a confidence-of-disease score for each interpretation.
Let \(\:{X}_{ijkt}\) represent the confidence-of-disease score of the Roe and Metz model for test \(\:i\) (\(\:i=1,\dots\:,I\)), reader \(\:j\) (\(\:j=\text{1,2},\dots\:,J\)), case \(\:k\) (\(\:k=\text{1,2},\dots\:,K\)), and truth \(\:t\) (\(\:t=0\) denotes a nondiseased case image, \(\:t=1\) denotes a diseased case image),
where \(\:{\mu\:}_{t}\) is 0 for nondiseased cases, \(\:{\tau\:}_{it}\) is the fixed effect of each modality, and the remaining terms are random effects that are mutually independent and normally distributed with zero means. The test×reader×case random effect was absorbed into the error term because the two are inseparable without repeated readings.
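A model of this type, written with the terms just described, has the following form. This is a reconstruction based on the usual Roe and Metz formulation with truth-specific effects, and the exact subscripts used in the article may differ:

\[ X_{ijkt} = \mu_{t} + \tau_{it} + R_{jt} + C_{kt} + (\tau R)_{ijt} + (\tau C)_{ikt} + (RC)_{jkt} + \epsilon_{ijkt} . \]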
To simplify, it is assumed that the variances of the random effects are identical in nondiseased and diseased cases.
Thus, for nondiseased cases,
For diseased cases,
In the context of hypothesis testing, under the null hypothesis, it is proposed that both \(\:{\tau\:}_{A0}\) and \(\:{\tau\:}_{B0}\) are equal to zero and that \(\:{\tau\:}_{A1}\:\) is equal to \(\:{\tau\:}_{B1}\). Conversely, under the alternative hypothesis, while \(\:{\tau\:}_{A0}\) and \(\:{\tau\:}_{B0}\) remain zero, \(\:{\tau\:}_{A1}\) and \(\:{\tau\:}_{B1}\) are not equal, indicating a difference in the effects on diagnostic ability.
Within-reader correlation \(\:{\rho\:}_{WR}\) and between-reader correlation \(\:{\rho\:}_{BR}\) were also specified to represent different correlation structure settings [35].
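The data-generating step can be sketched as below for two tests. The function draws one dataset from a Roe and Metz type model with equal variance components for nondiseased and diseased cases. The function name, argument names, and all numeric values are illustrative assumptions and are not the settings listed in the article's tables.

```python
import math
import random


def simulate_roe_metz(J, K0, K1, mu1, tau1, var, seed=None):
    """Draw scores X[i][j][k] for I = 2 tests under a Roe-Metz-type model.

    K0 / K1: numbers of nondiseased / diseased cases.
    mu1: mean separation for diseased cases; tau1[i]: test effect (diseased).
    var: variance components with keys R, C, TR, TC, RC, E.
    Returns (X, truth), where truth[k] is 0 or 1.
    """
    rng = random.Random(seed)
    sd = {name: math.sqrt(v) for name, v in var.items()}
    truth = [0] * K0 + [1] * K1
    K = K0 + K1
    # Reader-level effects are drawn separately for each truth state.
    R = [[rng.gauss(0, sd["R"]) for _ in range(2)] for _ in range(J)]
    TR = [[[rng.gauss(0, sd["TR"]) for _ in range(2)] for _ in range(J)]
          for _ in range(2)]
    # Case-level effects: each case has exactly one truth state.
    C = [rng.gauss(0, sd["C"]) for _ in range(K)]
    TC = [[rng.gauss(0, sd["TC"]) for _ in range(K)] for _ in range(2)]
    RC = [[rng.gauss(0, sd["RC"]) for _ in range(K)] for _ in range(J)]
    X = [[[(mu1 + tau1[i]) * truth[k]          # mean is 0 for nondiseased cases
           + R[j][truth[k]] + C[k] + TR[i][j][truth[k]]
           + TC[i][k] + RC[j][k] + rng.gauss(0, sd["E"])
           for k in range(K)] for j in range(J)] for i in range(2)]
    return X, truth


variances = {"R": 0.01, "TR": 0.01, "C": 0.3, "TC": 0.3, "RC": 0.2, "E": 0.2}
X, truth = simulate_roe_metz(J=5, K0=50, K1=50, mu1=1.5,
                             tau1=[0.0, 0.3], var=variances, seed=1)
```

Setting both entries of `tau1` equal corresponds to the null hypothesis of no difference between tests, and unequal entries correspond to the alternative.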

b. Introducing missingness
The simulation study employed two missing data mechanisms, MCAR and MAR, to evaluate their impact on the analytical results.
Under the MAR mechanism, it is posited that the probability of a missing observation is related to observable variables, specifically the reader and the test. The missingness indicator \(\:{R}_{ijk}\), which denotes whether the interpretation by reader\(\:\:j\) for case\(\:\:k\) under test \(\:i\) is observed (\(\:{R}_{ijk}\)=0) or missing (\(\:{R}_{ijk}\)=1), is modeled via logistic regression:
The parameters \(\:{\gamma\:}_{1}\) and \(\:{\gamma\:}_{2}\) represent the effects of the reader and test, respectively, on the log-odds of the observation being missing. Specifically, \(\:{\gamma\:}_{1}\) is set to 0.1 and \(\:{\gamma\:}_{2}\) is set to 0.15, and the parameter \(\:{\gamma\:}_{0}\) is varied to achieve different missing rates.
Conversely, by setting \(\:{\gamma\:}_{1}={\gamma\:}_{2}=0\), the MCAR mechanism is simulated, where the missingness is independent of the observed data. In this case, the missingness indicator \(\:{R}_{ijk}\) is solely determined by the intercept \(\:{\gamma\:}_{0}\):
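The two mechanisms can be sketched with a single logistic model, \(\mathrm{logit}\,P(R_{ijk}=1) = \gamma_{0} + \gamma_{1}\cdot\text{reader} + \gamma_{2}\cdot\text{test}\). How the reader and test enter the model is not specified in the text, so the code below assumes they are coded by their integer indices starting at 1. The function names and the toy data are likewise assumptions.

```python
import math
import random


def missing_prob(reader, test, g0, g1=0.1, g2=0.15):
    """P(R_ijk = 1) under logit(p) = g0 + g1 * reader + g2 * test."""
    eta = g0 + g1 * reader + g2 * test
    return 1.0 / (1.0 + math.exp(-eta))


def apply_missingness(X, g0, g1=0.1, g2=0.15, seed=None):
    """Return a copy of X[i][j][k] with readings set to None at random.

    g1 = g2 = 0 gives MCAR; nonzero values make missingness depend on the
    observed reader and test indices (MAR).
    """
    rng = random.Random(seed)
    masked = []
    for i, test_block in enumerate(X, start=1):
        masked.append([
            [None if rng.random() < missing_prob(j, i, g0, g1, g2) else x
             for x in row]
            for j, row in enumerate(test_block, start=1)
        ])
    return masked


X_demo = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
          [[1.5, 2.5, 3.5], [4.5, 5.5, 6.5]]]
mar = apply_missingness(X_demo, g0=-2.0, seed=3)
mcar = apply_missingness(X_demo, g0=-2.0, g1=0.0, g2=0.0, seed=3)
```

Lowering `g0` reduces the overall missing rate without changing how strongly missingness depends on the reader and test.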
The simulation scenarios are detailed in Table 1 and Supplementary Table S1 and are primarily based on the settings established by Roe and Metz [34]. A total of 1728 scenarios were considered, and under each scenario, 1000 simulations were conducted to mitigate sampling bias.

c. Evaluation of analysis approaches
In this simulation study, datasets incorporating instances of missing data were analyzed via the MIMRMC approach, as well as via the CC approach, to obtain estimates of the parameter of interest—namely, the difference in the ROC AUC between two diagnostic tests. For the purpose of comparison, the DBM analysis was also conducted on the original complete datasets, which were void of any missing data. This approach will be referred to as ‘original’ hereafter.
The following metrics were calculated to compare the performance of the proposed approach in terms of statistical performance, point estimation accuracy and confidence interval coverage: (1) type I error rate (under null hypothesis settings); (2) power (under alternative hypothesis settings); (3) root mean squared error (RMSE); (4) bias; (5) 95% confidence interval coverage rate; and (6) confidence interval width.
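These metrics can be computed from the replicate-level results of one scenario as sketched below. The rejection rate is read as the type I error rate under null settings and as power under alternative settings. The function name, argument names, and the four toy replicates are illustrative assumptions.

```python
import math


def summarize(estimates, lowers, uppers, p_values, true_value, alpha=0.05):
    """Summarize simulation replicates for one scenario and one approach."""
    n = len(estimates)
    bias = sum(estimates) / n - true_value
    rmse = math.sqrt(sum((e - true_value) ** 2 for e in estimates) / n)
    coverage = sum(lo <= true_value <= hi for lo, hi in zip(lowers, uppers)) / n
    width = sum(hi - lo for lo, hi in zip(lowers, uppers)) / n
    reject = sum(p < alpha for p in p_values) / n  # type I error or power
    return {"bias": bias, "rmse": rmse, "coverage": coverage,
            "width": width, "reject": reject}


# Four toy replicates with a true AUC difference of 0.05.
stats = summarize(
    estimates=[0.04, 0.06, 0.05, 0.07],
    lowers=[0.00, 0.02, 0.01, 0.055],
    uppers=[0.08, 0.10, 0.09, 0.11],
    p_values=[0.04, 0.20, 0.01, 0.03],
    true_value=0.05,
)
```

In this toy example three of the four intervals contain the true value, so the coverage is 0.75.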
All simulation computations were executed via R (version 4.1.2) [36].
Real example

a. Data Source
The proposed analysis approach was applied to a real MRMC design CAD study, which is an ROC paradigm MRMC design. The study was conducted at the Affiliated Hospital of Hebei University, the China-Japan Union Hospital of Jilin University, and Peking University People's Hospital, with ethics approval obtained from the ethics committees of these hospitals.

b. Study Design
This study evaluated the efficacy of aneurysm detection with and without the assistance of a deep learning model in the context of head and neck CT angiograms. A total of 280 subjects were included, 135 of whom had at least one aneurysm. Ten qualified radiologists interpreted all the images and rated each image, where 0 indicated a definitive absence of an aneurysm and 10 indicated a definitive presence.
Out of the 5,600 interpretations (280 subjects × 10 radiologists × 2 tests), there were 17 instances of missing data. Twelve of these were due to radiologists not evaluating some images, whereas 5 were attributable to failure in generating reconstructed images. The overall missing data rate was 0.30%. The MIMRMC and the CC approaches were applied to handle missing data. To establish a benchmark for evaluating our analysis, an “original complete dataset” was also created, in which the previously missing interpretations were subsequently reinterpreted by the original radiologists. A DBM analysis was then conducted on this dataset.
Results
Simulation study
Figure 1 displays the mean Type I error rates under the null hypothesis setting for the MIMRMC, CC, and the original approach by factor level for each of the simulation study factors across various scenarios, differentiated by sample size. The MIMRMC approach exhibits a relatively lower Type I error rate compared with the results from the original complete datasets. In contrast, the CC approach demonstrates context-dependent performance. Despite these slight variations, both the CC and the MIMRMC approaches yield generally comparable results for MAR and MCAR conditions. Specifically, the Type I error rates of both approaches closely align with those observed in the original complete datasets and approximate the nominal significance level of 0.05.
The statistical power under the alternative hypothesis (Fig. 2) reveals that the MIMRMC approach maintains strong performance in terms of power. Notably, for this approach, any reduction in power is slight, even as the rate of missing data increases. In contrast, the CC approach results in a significant decrease in power, which is exacerbated by increases in both the missing data rate and the total number of readers. Furthermore, for both approaches, a decrease in the AUC is associated with a reduction in statistical power. Performance comparisons across different settings of variance components show that the outcomes are broadly similar, indicating that the statistical power of these approaches is relatively consistent regardless of variance component configurations.
The mean RMSE values are detailed in Fig. 3 and Supplementary Figure S1. For all the considered scenarios, whether under MAR or MCAR conditions, the RMSE associated with the CC approach is greater than that associated with the MIMRMC approach. Moreover, the RMSE for the CC approach increases significantly as the sample size diminishes and the rates of missing data increase.
Similarly, in line with the RMSE findings, in comparison with the MIMRMC approach, the bias is greater when the CC approach is employed, particularly in conditions of limited sample size, elevated missing rates, and lower AUC values (Fig. 4, Supplementary Figure S2).
The 95% confidence interval coverage rate, as shown in Fig. 5 and Supplementary Figure S3, is consistent across all the scenarios for these three approaches, closely approximating the ideal 95%.
Regarding the width of the confidence interval, for all approaches, scenarios with smaller sample sizes, lower AUC settings, and higher missing rates are associated with wider confidence intervals. Compared with the performance of original complete datasets, the MIMRMC approach shows a modest increase in confidence interval width. The CC approach, however, results in even wider confidence intervals, particularly under higher missing rates, as illustrated in Supplementary Figure S4.
The detailed simulation results can be found in Supplementary Table S2.
Real example
Table 2 summarizes the results of the CAD study. All methods indicate a significant difference in the AUC when comparing scenarios with and without the use of the deep learning model for aneurysm detection by head and neck CT angiograms. Notably, the proposed MIMRMC approach produces results that align more closely with those from the original complete datasets—more so than the CC approach—regarding point estimates, confidence intervals, and P values.
Discussion
In this study, an MIMRMC approach was developed to handle missing data issues within MRMC design CAD studies. To assess the feasibility and suitability of this approach, we conducted a simulation study comparing its performance against that of the CC approach across 1728 scenarios. Additionally, we implemented a real-world CAD study to evaluate the performance of the MIMRMC approach under actual clinical conditions.
Our findings reveal that with respect to point estimation, the CC approach demonstrates marginally inferior performance compared with that of the MIMRMC and the original approach, resulting in slightly elevated bias and RMSE. However, the CC approach yields substantially wider confidence intervals, which consequently leads to markedly reduced statistical power in comparison to both the MIMRMC and the original approach. This disparity in power becomes more pronounced with an increase in the rate of missing data and the size of the reader sample. These findings underscore the potential for inherent bias in the CC approach and highlight its inadequacy in effectively managing missing data within MRMC settings. Our results align with observations from other diagnostic test trials beyond MRMC studies, where the limitations of complete case analysis have been similarly noted [37,38,39,40]. Notably, Newman has labeled the complete case analysis as ‘suboptimal, potentially unethical, and totally unnecessary’, noting that even minimal missing data can reduce study power and bias results, making findings applicable only to those with complete data [41]. Despite these identified limitations and the critique from the broader research community, it is important to acknowledge that the CC approach remains the most commonly employed approach in CAD studies. This prevalent use, juxtaposed with the method’s recognized deficiencies, emphasizes the necessity for a paradigm shift toward more robust and reliable methods in the handling of missing data in MRMC design.
In contrast, the MIMRMC approach consistently demonstrates strong statistical power while maintaining the type I error rate close to the nominal 5% level. This is complemented by superior performance metrics, including low RMSE, minimal bias, and accurate 95% confidence interval coverage. These favorable outcomes persist across various conditions, encompassing different missing data mechanisms, diverse sample sizes of cases and readers, a range of missing rates, and various variance structures. Regarding confidence interval width, our findings indicate that MIMRMC tends to produce slightly wider confidence intervals compared with the original complete dataset. This observation aligns with previous literature on MI, which suggests that the wider MI confidence intervals reflect a realistic addition of uncertainty introduced by missing data and the subsequent imputation process [42]. We observed that MIMRMC demonstrates relatively wider confidence intervals, particularly in scenarios with low correlation structures (LL, LH) or limited reader sample sizes. This results in a comparatively lower type I error rate under these conditions. However, it’s important to note that despite the relatively lower type I error rate, MIMRMC still maintains strong statistical power compared with the traditional CC approach.
When deriving the degrees of freedom for the MIMRMC, we adopted the methodology proposed by Barnard and Rubin [33] over the framework suggested by Rubin in 1987 [31]. This decision is informed by the unique characteristics of MRMC studies, which typically feature a modest proportion of missing data and where individual observations, such as the confidence-of-disease score, exert limited influence on the endpoint, specifically the AUC. This results in minimal between-imputation variance. In addition, Rubin’s 1987 method in this context may inflate the degrees of freedom compared with those derived from the original complete data, potentially skewing significance testing toward optimism. Conversely, the approach of Barnard and Rubin [33], which accounts for the degrees of freedom from both the observed datasets and the imputation phase, offers a more accurate estimation. It enables the integration of the degrees of freedom inherent to the MRMC phase, optimized through Hillis’s contributions [28], ensuring a balanced and precise evaluation of statistical significance.
In our exploratory studies, the joint model algorithm was also evaluated, yielding results comparable to those obtained with the MICE algorithm. Given that the joint model algorithm requires a stringent assumption of multivariate normal distribution [43], the MICE algorithm was selected. Regarding the optimal number of imputations, a preliminary simulation study was conducted using ten imputations. The results indicated marginal gains in precision beyond five imputations, consistent with the recommendations of Little and Rubin [17]. Consequently, five imputations were deemed sufficient for this investigation. Future studies may explore the impact of varying the number of imputations, taking into account real-world application situations and computational constraints.
Multiple imputation, which originated in the 1970s [21], addresses the uncertainty associated with missing data by generating multiple imputed datasets. Since its inception, MI has gained widespread acceptance across various fields, including survey research [44], clinical trials [22, 45], and observational studies [23]. Specifically, in the realm of diagnostic testing, MI has been explored as a solution for mitigating verification bias caused by missing gold standard data [24, 25], as well as for handling missing data in index tests in the non-MRMC design context [37, 46, 47]. Through comprehensive simulations and practical diagnostic trials, MI has proven to be highly effective in these areas [48], establishing itself as a key technique for addressing challenges associated with missing data. Consistent with these prior findings, and by integrating MI within the MRMC framework, our approach further underscores the robustness of the MI theory. It shows compelling statistical performance even when dealing with missing data within the complex correlated data structures characteristic of MRMC designs, adding to the growing evidence of MI’s potential to enhance research methodologies in scenarios affected by missing data.
Furthermore, the estimate of MIMRMC corresponds to the randomized/enrolled population, aligning with the ICH E9(R1) framework [49] and principles of causal inference [50]. In contrast, the CC approach violates the randomization principle and may introduce selection bias because cases with missing data are deleted. Thus, MIMRMC could be an actionable sensitivity analysis approach when missing data occur in real clinical settings.
It is important to acknowledge the limitations of this research. First, in our real-case study, the original complete dataset relied on ad hoc reinterpretation, which may introduce biases such as inter-reader variance. However, balancing a faithful representation of the actual missing data scenario against dataset integrity proved challenging. Second, our simulation study, while covering 1728 scenarios, may not fully replicate real-world conditions. For example, variances may differ across tests and truth statuses [51]. Therefore, future research should consider applying our approach to more sophisticated scenarios to further evaluate its efficacy. Finally, our investigation focused solely on the MCAR and MAR missing data mechanisms, given that missing not at random occurrences are infrequent in CAD studies. To increase robustness, future studies could incorporate other sensitivity analysis methods, such as the tipping point approach, alongside our proposed MIMRMC framework [13].
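The distinction between the MCAR and MAR mechanisms can be illustrated with a toy simulation. The Python sketch below is not the simulation design of this study: the rating model, the missingness probabilities and the function names are hypothetical, and the summary is a simple mean rating rather than an MRMC estimate of AUC. Under MCAR every rating has the same chance of being missing, whereas under MAR the chance depends only on an observed variable (here, truth status).

```python
import random
from statistics import mean


def simulate_ratings(n, seed):
    """Toy ratings: diseased cases score higher on average (hypothetical model)."""
    rng = random.Random(seed)
    cases = []
    for _ in range(n):
        diseased = rng.random() < 0.5
        rating = rng.gauss(1.5 if diseased else 0.0, 1.0)
        cases.append((diseased, rating))
    return cases


def complete_case_mean(cases, mechanism, seed):
    """Delete ratings under MCAR or MAR, then average the ratings that remain."""
    rng = random.Random(seed)
    kept = []
    for diseased, rating in cases:
        if mechanism == "MCAR":
            p_missing = 0.3                       # identical for every case
        elif mechanism == "MAR":
            p_missing = 0.5 if diseased else 0.1  # depends only on observed truth status
        else:
            raise ValueError("mechanism must be 'MCAR' or 'MAR'")
        if rng.random() >= p_missing:
            kept.append(rating)
    return mean(kept)


cases = simulate_ratings(n=20000, seed=1)
full_mean = mean(rating for _, rating in cases)
mcar_mean = complete_case_mean(cases, "MCAR", seed=2)
mar_mean = complete_case_mean(cases, "MAR", seed=3)
print(f"full={full_mean:.3f}  CC under MCAR={mcar_mean:.3f}  CC under MAR={mar_mean:.3f}")
```

In this toy setting the complete-case mean stays close to the full-data mean under MCAR but is pulled downward under MAR, because diseased cases, which tend to have higher ratings, are deleted more often.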
Conclusion
In conclusion, this study is, to our knowledge, the first to address the critical yet often overlooked issue of missing data in MRMC designs. The proposed MIMRMC approach addresses this issue through multiple imputation, thereby producing estimates that are representative of the randomized/enrolled population. Comparisons of the traditional CC approach with the MIMRMC approach, in both simulation studies and a real-world application, highlight the substantial benefits of MIMRMC, particularly in enhancing accuracy and statistical power while maintaining good control of the type I error rate in the presence of missing data. Consequently, this method offers an effective solution for managing missing data in MRMC designs and can serve as a sensitivity analysis approach in real clinical environments, supporting more robust and reliable research outcomes.
Availability of data and materials
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.
References
Gallas BD, Chan HP, D’Orsi CJ, Dodd LE, Giger ML, Gur D, Krupinski EA, Metz CE, Myers KJ, Obuchowski NA, et al. Evaluating imaging and computer-aided detection and diagnosis devices at the FDA. Acad Radiol. 2012;19(4):463–77.
Wagner RF, Metz CE, Campbell G. Assessment of medical imaging systems and computer aids: a tutorial review. Acad Radiol. 2007;14(6):723–48.
Beam CA, Layde PM, Sullivan DC. Variability in the interpretation of screening mammograms by US radiologists. Findings from a national sample. Arch Intern Med. 1996;156(2):209–13.
Yu T, Li Q, Gray G, Yue LQ. Statistical innovations in diagnostic device evaluation. J Biopharm Stat. 2016;26(6):1067–77.
Clinical Performance Assessment: Considerations for Computer-Assisted Detection Devices Applied to Radiology Images and Radiology Device Data in Premarket Notification (510(k)) Submissions: Guidance for Industry and Food and Drug Administration Staff [https://www.fda.gov/media/77642/download].
Guiding Principles for Technical Review of Breast X-ray System Registration. [https://www.cmde.org.cn//flfg/zdyz/zdyzwbk/20210701103258337.html].
Key Points for Review of Medical Device Software Assisted by Deep Learning. [https://www.cmde.org.cn//xwdt/zxyw/20190628151300923.html].
Obuchowski NA. Multireader, multimodality receiver operating characteristic curve studies: hypothesis testing and sample size estimation using an analysis of variance approach with dependent observations. Acad Radiol. 1995;2(Suppl 1):S22–29. discussion S5764, S7021 pas.
Dorfman DD, Berbaum KS, Metz CE. Receiver operating characteristic rating analysis. Generalization to the population of readers and patients with the jackknife method. Invest Radiol. 1992;27(9):723–31.
Wang L, Wang H, Xia C, Wang Y, Tang Q, Li J, Zhou XH. Toward standardized premarket evaluation of computer aided diagnosis/detection products: insights from FDA-approved products. Expert Rev Med Devices. 2020;17(9):899–918.
Obuchowski NA, Bullen J. Multireader diagnostic accuracy imaging studies: fundamentals of design and analysis. Radiology. 2022;303(1):26–34.
Campbell G, Pennello G, Yue L. Missing data in the regulation of medical devices. J Biopharm Stat. 2011;21(2):180–95.
Campbell G, Yue LQ. Statistical innovations in the medical device world sparked by the FDA. J Biopharm Stat. 2016;26(1):3–16.
Mongan J, Moy L, Kahn CE Jr. Checklist for Artificial Intelligence in Medical Imaging (CLAIM): a guide for authors and reviewers. Radiol Artif Intell. 2020;2(2):e200029.
Cohen JF, Korevaar DA, Altman DG, Bruns DE, Gatsonis CA, Hooft L, Irwig L, Levine D, Reitsma JB, de Vet HC, et al. STARD 2015 guidelines for reporting diagnostic accuracy studies: explanation and elaboration. BMJ Open. 2016;6(11):e012799.
Stahlmann K, Reitsma JB, Zapf A. Missing values and inconclusive results in diagnostic studies - a scoping review of methods. Stat Methods Med Res. 2023;32(9):1842–55.
Little RJA, Rubin DB. Statistical Analysis with Missing Data, 3rd Edition. John Wiley & Sons; 2020.
Schuetz GM, Schlattmann P, Dewey M. Use of 3x2 tables with an intention to diagnose approach to assess clinical performance of diagnostic tests: meta-analytical evaluation of coronary CT angiography studies. BMJ. 2012;345:e6717.
Shinkins B, Thompson M, Mallett S, Perera R. Diagnostic accuracy studies: how to report and analyse inconclusive test results. BMJ. 2013;346:f2778.
Mitroiu M, Oude Rengerink K, Teerenstra S, Petavy F, Roes KCB. A narrative review of estimands in drug development and regulatory evaluation: old wine in new barrels? Trials. 2020;21(1):671.
Rubin DB. Inference and missing data. Biometrika. 1976;63(3):581–92.
Jakobsen JC, Gluud C, Wetterslev J, Winkel P. When and how should multiple imputation be used for handling missing data in randomised clinical trials - a practical guide with flowcharts. BMC Med Res Methodol. 2017;17(1):162.
Pedersen AB, Mikkelsen EM, Cronin-Fenton D, Kristensen NR, Pham TM, Pedersen L, Petersen I. Missing data and multiple imputation in clinical epidemiological research. Clin Epidemiol. 2017;9:157–66.
Harel O, Zhou XH. Multiple imputation for correcting verification bias. Stat Med. 2006;25(22):3769–86.
Harel O, Zhou XH. Multiple imputation for the comparison of two screening tests in two-phase Alzheimer studies. Stat Med. 2007;26(11):2370–88.
Meng XL. Multiple-imputation inferences with uncongenial sources of input. Stat Sci. 1994;9(4):538–73.
Bartlett JW, Seaman SR, White IR, Carpenter JR, Alzheimer’s Disease Neuroimaging Initiative. Multiple imputation of covariates by fully conditional specification: accommodating the substantive model. Stat Methods Med Res. 2015;24(4):462–87.
Hillis SL, Berbaum KS, Metz CE. Recent developments in the Dorfman-Berbaum-Metz procedure for multireader ROC study analysis. Acad Radiol. 2008;15(5):647–61.
Chakraborty DP. Observer performance methods for diagnostic imaging: foundations, modeling, and applications with R-based examples. 1st edition. Boca Raton: CRC Press; 2017.
Hillis SL. A comparison of denominator degrees of freedom methods for multiple observer ROC analysis. Stat Med. 2007;26(3):596–619.
Rubin DB. Multiple imputation for nonresponse in surveys. New York: Wiley; 1987.
Landerman LR, Land KC, Pieper CF. An empirical evaluation of the predictive mean matching method for imputing missing values. Sociol Methods Res. 1997;26(1):3–33.
Barnard J, Rubin DB. Miscellanea. Small-sample degrees of freedom with multiple imputation. Biometrika. 1999;86(4):948–55.
Roe CA, Metz CE. Dorfman-Berbaum-Metz method for statistical analysis of multireader, multimodality receiver operating characteristic data: validation with computer simulation. Acad Radiol. 1997;4(4):298–303.
Hillis SL. Relationship between Roe and Metz simulation model for multireader diagnostic data and Obuchowski-Rockette model parameters. Stat Med. 2018;37(13):2067–93. https://doi.org/10.1002/sim.7616.
R: A Language and Environment for Statistical Computing [https://www.R-project.org/].
Gad AM, Ali AA, Mohamed RH. A multiple imputation approach to evaluate the accuracy of diagnostic tests in presence of missing values. Commun Math Biol Neurosci. 2022;21:1–19.
Kohn MA, Carpenter CR, Newman TB. Understanding the direction of bias in studies of diagnostic test accuracy. Acad Emerg Med. 2013;20(11):1194–206.
Whiting PF, Rutjes AW, Westwood ME, Mallett S, QUADAS-2 Steering Group. A systematic review classifies sources of bias and variation in diagnostic test accuracy studies. J Clin Epidemiol. 2013;66(10):1093–104.
Van der Heijden GJ, Donders ART, Stijnen T, Moons KG. Imputation of missing values is superior to complete case analysis and the missing-indicator method in multivariable diagnostic research: a clinical example. J Clin Epidemiol. 2006;59(10):1102–9.
Newman DA. Missing data: five practical guidelines. Organizational Res Methods. 2014;17(4):372–411.
van Buuren S. Flexible imputation of missing data. Boca Raton, FL: CRC; 2012.
Hickey GL, Philipson P, Jorgensen A, Kolamunnage-Dona R. Joint modelling of time-to-event and multivariate longitudinal outcomes: recent developments and issues. BMC Med Res Methodol. 2016;16(1):117.
He Y, Zaslavsky AM, Landrum MB, Harrington DP, Catalano P. Multiple imputation in a large-scale complex survey: a practical guide. Stat Methods Med Res. 2010;19(6):653–70.
Barnes SA, Lindborg SR, Seaman JW Jr. Multiple imputation techniques in small sample clinical trials. Stat Med. 2006;25(2):233–45.
Long Q, Zhang X, Hsu CH. Nonparametric multiple imputation for receiver operating characteristics analysis when some biomarker values are missing at random. Stat Med. 2011;30(26):3149–61.
Cheng W, Tang N. Smoothed empirical likelihood inference for ROC curve in the presence of missing biomarker values. Biom J. 2020;62(4):1038–59.
Karakaya J, Karabulut E, Yucel RM. Sensitivity to imputation models and assumptions in receiver operating characteristic analysis with incomplete data. J Stat Comput Simul. 2015;85(17):3498–511.
FDA. E9(R1) Statistical Principles for Clinical Trials: Addendum: Estimands and Sensitivity Analysis in Clinical Trials. https://www.fda.gov/regulatory-information/search-fda-guidance-documents/e9r1-statistical-principles-clinical-trials-addendum-estimands-and-sensitivity-analysis-clinical. Accessed 5 Sep 2024.
Westreich D, Edwards JK, Cole SR, Platt RW, Mumford SL, Schisterman EF. Imputation approaches for potential outcomes in causal inference. Int J Epidemiol. 2015;44(5):1731–7.
Hillis SL. Simulation of unequal-variance binormal multireader ROC decision data: an extension of the Roe and Metz simulation model. Acad Radiol. 2012;19(12):1518–28.
Acknowledgements
We would like to express our sincere gratitude to Prof. Stephen L. Hillis, Prof. Dev P. Chakraborty, and Mr. Ning Li for their invaluable support and guidance throughout the development of this paper. We extend our gratitude to Shanghai United Imaging Intelligence Co., Ltd., for sponsoring the real example study and sharing the data. We also acknowledge the valuable support from the investigators of the real example study: Xiaoping Yin and Jianing Wang from the Affiliated Hospital of HeBei University; Lin Liu and Zhanhao Mo from the China-Japan Union Hospital of Jilin University; and Nan Hong and Lei Chen from Peking University People’s Hospital.
Funding
This study was conducted under grants from the Shanghai Municipal Health Commission Special Research Project in Emerging Interdisciplinary Fields (2022JC011) and Shanghai Science and Technology Development Funds (22QA1411400).
Author information
Contributions
ZM Pan and YY Qin contributed equally to this work. ZM Pan and YY Qin designed the simulation and wrote the main manuscript text. WY Bai prepared figures and tables. Q He conducted the analysis of the real example. XP Ying provided substantial contributions during the revisions. J He provided critical input to the manuscript. All authors reviewed the manuscript and approved the final version of this paper.
Ethics declarations
Ethics approval and consent to participate
The case study received ethics approval from the Ethics Committee of Peking University People’s Hospital (9 November 2022), the Ethics Committee of China-Japan Union Hospital of Jilin University (25 November 2022), and the Ethics Committee of the Affiliated Hospital of HeBei University (26 December 2022). Participating patients provided informed consent, and the research methods followed national and international guidelines.
Competing interests
The authors declare no competing interests.
Consent for publication
Not applicable.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
12874_2024_2321_MOESM1_ESM.tiff
Supplementary Material 1: Supplementary Figure S1. Mean RMSE under different scenarios differentiated by sample size for the original, CC and MIMRMC approaches. A) Under the MCAR mechanism, B) under the MAR mechanism. CC: complete case analysis, MIMRMC: multiple imputation under the MRMC framework, Original: DBM analysis on the original complete dataset.
12874_2024_2321_MOESM2_ESM.tiff
Supplementary Material 2: Supplementary Figure S2. Mean bias under different scenarios differentiated by sample size for the original, CC and MIMRMC approaches. A) Under the MCAR mechanism, B) under the MAR mechanism. CC: complete case analysis, MIMRMC: multiple imputation under the MRMC framework, Original: DBM analysis on the original complete dataset.
12874_2024_2321_MOESM3_ESM.tiff
Supplementary Material 3: Supplementary Figure S3. Mean 95% confidence interval coverage rate under different scenarios differentiated by sample size for the original, CC and MIMRMC approaches. A) Under the MCAR mechanism, B) under the MAR mechanism. CC: complete case analysis, MIMRMC: multiple imputation under the MRMC framework, Original: DBM analysis on the original complete dataset.
12874_2024_2321_MOESM4_ESM.tiff
Supplementary Material 4: Supplementary Figure S4. Mean confidence interval width under different scenarios differentiated by sample size for the original, CC and MIMRMC approaches. A) Under the MCAR mechanism, B) under the MAR mechanism. CC: complete case analysis, MIMRMC: multiple imputation under the MRMC framework, Original: DBM analysis on the original complete dataset.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Pan, Z., Qin, Y., Bai, W. et al. Implementing multiple imputations for addressing missing data in multireader multicase design studies. BMC Med Res Methodol 24, 217 (2024). https://doi.org/10.1186/s12874-024-02321-3
DOI: https://doi.org/10.1186/s12874-024-02321-3