Computer-aided assessment of diagnostic images for epidemiological research
© Abraham et al; licensee BioMed Central Ltd. 2009
Received: 23 April 2009
Accepted: 11 November 2009
Published: 11 November 2009
Diagnostic images are often assessed for clinical outcomes using subjective methods, which are limited by the skill of the reviewer. Computer-aided diagnosis (CAD) algorithms that assist reviewers in their decisions concerning outcomes have been developed to increase sensitivity and specificity in the clinical setting. However, these systems have not been well utilized in research settings to improve the measurement of clinical endpoints. Reductions in bias through their use could have important implications for etiologic research.
Using the example of cortical cataract detection, we developed an algorithm for assisting a reviewer in evaluating digital images for the presence and severity of lesions. Available image processing and statistical methods that were easily implementable were used as the basis for the CAD algorithm. The performance of the system was compared to the subjective assessment of five reviewers using 60 simulated images. Cortical cataract severity scores from 0 to 16 were assigned to the images by the reviewers and the CAD system, with each image assessed twice to obtain a measure of variability. Image characteristics that affected reviewer bias were also assessed by systematically varying the appearance of the simulated images.
The algorithm yielded severity scores with smaller bias on images where cataract severity was mild to moderate (approximately ≤ 6/16ths). On high severity images, the bias of the CAD system exceeded that of the reviewers. The variability of the CAD system was zero on repeated images but ranged from 0.48 to 1.22 for the reviewers. The direction and magnitude of the bias exhibited by the reviewers was a function of the number of cataract opacities, the shape and the contrast of the lesions in the simulated images.
CAD systems are feasible to implement with available software and can be valuable when medical images contain exposure or outcome information for etiologic research. Our results indicate that such systems have the potential to decrease bias and discriminate very small changes in disease severity. Simulated images are a tool that can be used to assess performance of a CAD system when a gold standard is not available.
Diagnostics are becoming increasingly image based. Whether the setting is clinical practice or research, information must be extracted from an image to determine disease status. The determination of the presence or severity of disease will impact clinical care for a patient or outcome status in a study. In many clinical arenas images are assessed using subjective methods that depend upon the skill and consistency of a reviewer. The performance of screening mammography has been shown to be highly dependent upon the reader's skill and training [1, 2]. The use of computer-aided diagnosis (CAD) systems to improve the sensitivity and specificity of lesion detection has become a focus of medical imaging and diagnostic radiology research [3]. Such systems have been explored extensively as a method for improving the detection of breast cancers from mammography [4, 5] and the evidence indicates CAD can improve the accuracy of detection [6]. These CAD systems have also been employed in lung cancer and other tumor diagnosis. Evaluation of such systems can be challenging since the quality of the images, the application and the expertise of the user will all contribute to the detection performance. Established methods such as receiver operating characteristic (ROC) analysis and free-response receiver operating characteristic (FROC) analysis can provide metrics for assessing performance given knowledge of the true disease classification. Such methods are not easily adapted, however, to assessing performance when the outcome is polytomous or continuous, though methods have been explored for handling multi-class and continuous measurements [7–11]. Another consideration is the ascertainment of the true disease status. Biopsy can provide a gold standard (true tumor presence) for cancer diagnostics but simple gold standards for other image diagnostics or for outcomes other than presence of disease (e.g. disease progression) may be challenging to find.
Perhaps as a result of a limited ability to assess performance when ROC analysis isn't practical, CAD systems have primarily been used clinically to locate lesions such as breast tumors. However, their application could be extended to the research setting. Disease incidence is a primary outcome in epidemiologic studies. Further, CAD systems could be adapted to outcomes other than the presence or absence of disease. Progression and severity are disease outcomes of interest that can be assessed in diagnostic images. Regardless of the outcome, minimizing measurement error is important for making valid inferences, and CAD systems have the potential to reduce bias and misclassification in applications other than tumor detection. An additional advantage is the ability to calibrate the detection algorithm, adjusting the balance between false positives and false negatives to reflect the cost of missing true cases or falsely identifying non-cases.
Using the example of cortical cataract detection, we developed a software algorithm for aiding in the evaluation of digital images for the presence and severity of lesions. The CAD system was designed 1) to assist a subjective reviewer in identifying lesions in a lens image and assigning a severity score and 2) to use accessible statistical and image processing methods that could be readily implemented through available software. Standard assessment of lens images is done using semi-qualitative classification schemes involving a trained reviewer and a standardized scale of severity [12–18]. Stand-alone software algorithms have been used previously to attempt to improve cataract severity measurement and have shown reasonable agreement with standard reviewer-based methods but were limited in their application and subsequent use [19–27]. We hypothesized that a CAD system that assisted a trained reviewer could reduce measurement error and would be feasible to implement with standard software.
Since the true severity of cortical cataract is unknown, we assessed the performance of the CAD algorithm using simulated images with known severity created to mimic the characteristic appearance of diagnostic lens images. Cataract is a disease process that alters the structure of the lens and degrades lens transparency. Cataract severity is primarily of interest in the research setting as an outcome for studying risk factor associations and, potentially, for evaluating treatments or interventions. Clinically, an assessment of vision and patient perception of vision difficulty are the metrics used to indicate for cataract surgery. Thus cataract severity does not solely determine the occurrence of surgery; individual and physician factors contribute as well. However, epidemiologic research based on an assessment of cataract severity is important given the high prevalence in older age groups, estimated to be 54.2% among African Americans and 24.2% among Caucasians. Improving upon cataract severity measurement using computerized assessment methods could provide a means for cataract researchers to explore more subtle risk factors associated with disease progression. The ability to develop a CAD system that minimized measurement error using available software packages would, in general, indicate the feasibility of wider application of CAD in research settings.
Eliminating noise, when feasible, improves measurement in general and, for CAD algorithms, increases the consistency of performance. Retroillumination images are taken using cross-polarized light to reduce the light reflex artifact in the images. The result is light intensity heterogeneity across an image that can amplify or attenuate the appearance of cortical cataract opacities. Thus a filtering or image processing step is needed prior to attempting to identify cataract opacities. If the distribution of light across the image were known, an image could be standardized to remove the effect of background intensity on an estimate of opacity severity. We estimated the background intensity, B, by averaging intensity information locally using mathematical morphologic procedures called erosion/dilation operations [30]. Implementation was accomplished using the Matlab Image Processing Toolbox (The Mathworks, Inc) imdilate and imerode functions with an ellipsoid structuring element. Dividing the original image, M, by the result, B, yields an image with a standardized background lighting.
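The standardization step can be illustrated with a minimal sketch. It is written in plain Python rather than Matlab and is not the implementation used in the study. It assumes that dilation is applied before erosion (a morphological closing, which fills dark opacities smaller than the window) and it uses a square window in place of the ellipsoid structuring element; the function names, the window radius and the toy image are all hypothetical.

```python
def _window_op(img, radius, op):
    """Apply op (min or max) over a square window, clipped at the image edge."""
    rows, cols = len(img), len(img[0])
    return [
        [
            op(
                img[rr][cc]
                for rr in range(max(0, r - radius), min(rows, r + radius + 1))
                for cc in range(max(0, c - radius), min(cols, c + radius + 1))
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def dilate(img, radius):
    """Grey-level dilation: local maximum."""
    return _window_op(img, radius, max)


def erode(img, radius):
    """Grey-level erosion: local minimum."""
    return _window_op(img, radius, min)


def standardize_background(img, radius):
    """Estimate background B by dilation then erosion (assumed order), return M / B."""
    background = erode(dilate(img, radius), radius)
    return [
        [m / b if b > 0 else 0.0 for m, b in zip(m_row, b_row)]
        for m_row, b_row in zip(img, background)
    ]


# Toy image (hypothetical): illumination rises left to right, one pixel is half as bright.
image = [[100.0 + 10.0 * c for c in range(9)] for _ in range(5)]
image[2][4] *= 0.5
standardized = standardize_background(image, radius=1)
print(standardized[2][4], standardized[2][6])
```

In this toy case the dark pixel keeps its relative darkness after division while an unaffected interior pixel is mapped to 1, so the illumination gradient no longer influences the grey level. Pixels at the image edge are handled only approximately because the window is clipped there.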
Often images contain structures or areas that are not of interest. For example, we could rule out ribs in a chest X-ray when assessing for lung tumors. Thus if we can define the borders of the regions of interest (segment the image) we can often simplify the decision or classification rules necessary for separating normal from abnormal. The pupillary margin bounds the area within a retroillumination image that contains cataract severity information, so segmentation is used to eliminate the pixels outside the margin that contribute no information about opacity. We need to identify this boundary, which is equivalent to estimating the function that describes the boundary shape and placement in the image. For this application, we used a specialized edge detector called a deformable contour model, which can find irregularly shaped contours. First formulated by Witkin et al. [31] and improved by Cohen [32], deformable contour models are constrained splines that can be used in a variety of image applications. The contour models were implemented by adapting Matlab code available from the work of Xu and Prince [33].
Decision thresholds are used with any surrogate measure of disease to define the subgroup who will receive intervention, treatment, further diagnostics, or be considered to have the outcome for the purpose of epidemiologic research. The goal is to minimize the percent of false positives and false negatives, which is challenging as diseased and non-diseased individuals have distributions of values of the surrogate measure that often overlap. Defining cortical cataract for each observation (pixel) in a retroillumination image is based on the surrogate measure of grey level. The grey level values in a retroillumination image are a function of the external illumination and opaqueness due to disease. After compensating for the variation in light intensity across the image we assume that only the degree of disease in the standardized image determines the grey level. To discriminate between diseased (dark) and non-diseased (light) pixels we used fuzzy c-means clustering [34, 35]. Fuzzy clustering is a method of classification that allows membership in a cluster to be partial. For each pixel observation i = 1... N, the degree of membership in the diseased and non-diseased clusters was estimated through minimization of a function that describes the cluster criteria and how proximate each pixel is to the criteria. Implementation was accomplished by defining the membership functions and iterating to obtain the degree of membership for each pixel. Final classification was taken as the cluster (cataract or normal) for which a pixel had the highest degree of membership. The methods described above yielded a CAD algorithm for cortical cataract that suggested to the user which pixels were cataractous and provided an estimate of the percent of the total viewable lens area covered by cataract, a standard measure of cortical cataract severity. This severity score was a continuous measure and was normalized to the scale of 0.0 to 16.0 to mimic standard grading methods. 
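The classification and scoring steps described above can be sketched as follows. This is an illustrative plain-Python version of two-cluster fuzzy c-means on grey levels, not the Matlab implementation used in the study. The fuzzifier value m = 2, the evenly spaced starting centres, the stopping tolerance, the function names and the example pixel values are all assumptions made for the illustration.

```python
def _memberships(x, centres, m):
    """Degree of membership of grey level x in each cluster (memberships sum to 1)."""
    dists = [abs(x - c) for c in centres]
    zeros = [i for i, d in enumerate(dists) if d == 0.0]
    if zeros:  # x coincides with one or more centres
        return [1.0 / len(zeros) if i in zeros else 0.0 for i in range(len(centres))]
    power = 2.0 / (m - 1.0)
    inv = [d ** -power for d in dists]
    total = sum(inv)
    return [v / total for v in inv]


def fuzzy_c_means_1d(values, n_clusters=2, m=2.0, n_iter=100, tol=1e-6):
    """Alternate membership and centre updates until the centres stop moving."""
    lo, hi = min(values), max(values)
    centres = [lo + (hi - lo) * (i + 0.5) / n_clusters for i in range(n_clusters)]
    for _ in range(n_iter):
        u = [_memberships(x, centres, m) for x in values]
        new_centres = []
        for i in range(n_clusters):
            weights = [row[i] ** m for row in u]
            new_centres.append(sum(w * x for w, x in zip(weights, values)) / sum(weights))
        shift = max(abs(a - b) for a, b in zip(centres, new_centres))
        centres = new_centres
        if shift < tol:
            break
    return centres, [_memberships(x, centres, m) for x in values]


def severity_score(pixels, scale=16.0):
    """Fraction of lens pixels assigned to the darker cluster, rescaled to 0..scale."""
    centres, u = fuzzy_c_means_1d(pixels, n_clusters=2)
    dark = centres.index(min(centres))
    light = 1 - dark
    n_cataract = sum(1 for row in u if row[dark] > row[light])
    return scale * n_cataract / len(pixels)


# Hypothetical standardized grey levels: 12 clear pixels and 4 opaque pixels.
clear = [1.0, 0.98, 0.95, 1.02, 0.97, 1.0, 0.99, 1.01, 0.96, 1.0, 0.98, 1.03]
opaque = [0.35, 0.4, 0.3, 0.38]
score = severity_score(clear + opaque)
print(score)
```

With 4 of 16 pixels falling in the darker cluster, the sketch returns a severity of 4 on the 0 to 16 scale. Forcing two clusters means that an image with no opacity still yields two nearly coincident clusters, so a practical system would need a rule for that case.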
The implementation was done in Matlab version 7.0.4 (Mathworks Inc) and an interface was built in LabView 7.1 (National Instruments).
The validity of reviewer-based or computer-based cataract severity measurement has never been assessed since no gold standard exists. Simulated data are drawn from known distributions such that the true disease status is known. Thus simulation studies are an inexpensive way to obtain an estimate of validity, albeit in an idealized setting. For evaluating a CAD algorithm, simulation studies are easily implemented. Digital images can be created that capture various aspects and stages of the lesion of interest. Noise and artifact may be added to challenge the system or all noise can be eliminated to test the optimal performance.
To test the dependence of the CAD performance on image characteristics, the appearance of the images was systematically altered by varying the number of opacities (5 levels), the contrast between diseased and non-diseased areas (3 levels), the width of opacities (2 levels) and the length of the opacities (2 levels). This resulted in 60 simulated images for assessment. The images were each assessed twice by the CAD algorithm and separately by five trained reviewers using a standard assessment method, the Wilmer Eye Institute cortical cataract classification system. The Wilmer classification system uses a seventeen category severity scale with possible scores ranging from zero to sixteen. Reviewers were told to identify all cortical opacities in the image and estimate the area they cover in 16ths. A circle divided into sixteen pie-shaped wedges is overlaid on the images to provide a visual guide for estimating the area involved. To standardize the assessment, a training set of retroillumination images was presented to all the reviewers. Consensus was achieved to within one severity unit on all training images.
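The full factorial design behind the 60 images can be enumerated directly. In the short sketch below the level labels are hypothetical placeholders, since only the number of levels per factor is stated.

```python
from itertools import product

# Number of levels per factor follows the text; the labels are placeholders.
factors = {
    "n_opacities": [1, 2, 3, 4, 5],
    "contrast": ["low", "medium", "high"],
    "width": ["narrow", "wide"],
    "length": ["short", "long"],
}

# One dictionary per simulated image: every combination of factor levels.
design = [dict(zip(factors, levels)) for levels in product(*factors.values())]
print(len(design))  # 5 * 3 * 2 * 2 = 60
```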
Using the R language, between- and within-reviewer variability were estimated. The bias between the mean estimated severity assigned by each reviewer and the true severity was determined and the agreement between the CAD algorithm and the reviewers was assessed. A mixed-effects model was used to examine the effect of each parameter and the choice of method (reviewer or CAD) on the bias between the estimate and the truth.
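To make the bias and within-reviewer variability measures concrete, the following sketch computes both from duplicate readings. It is written in Python rather than R, the scores are invented for illustration, and the variance estimator (half the mean squared difference between duplicates) is one common choice that may differ from the estimator used in the analysis.

```python
from statistics import mean


def bias(readings, truth):
    """Mean difference between the average of the two readings and the true severity."""
    return mean((a + b) / 2 - t for (a, b), t in zip(readings, truth))


def within_variance(readings):
    """Within-rater variance from duplicate readings: sum(d^2) / (2n)."""
    return sum((a - b) ** 2 for a, b in readings) / (2 * len(readings))


# Invented example: three images, each scored twice.
truth = [2.0, 4.0, 8.0]
reviewer = [(3, 2), (5, 5), (7, 8)]           # categorical scores, 0 to 16
cad = [(2.1, 2.1), (4.2, 4.2), (7.5, 7.5)]    # deterministic, so duplicates agree

print(bias(reviewer, truth), within_variance(reviewer))
print(bias(cad, truth), within_variance(cad))
```

A deterministic algorithm gives identical duplicate scores and therefore a within-rater variance of exactly zero, while its bias can still be non-zero.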
[Table: Bias and variance. Recovered column heading: Average of Reviewer.]
[Table: Predictors of reviewer bias. Recovered column heading: Fixed parameter estimate.]
Multiple reviewers provided an estimate of the between- and within-reviewer variability. The between-reviewer variability tended to increase with increasing opacity severity while the within-reviewer variability did not show a consistent trend. The variance marginal across severity for the reviewers ranged from 0.48 to 1.22 with an average of 0.80. The CAD system in isolation had no variability, as it processed the same image identically each time. Thus, the variability is dependent upon the reviewer using the CAD system.
As diagnostic imaging has become more widely used in the clinical setting, the opportunity for images to be a source for outcome and exposure assessment in epidemiologic research is growing. Assessment by a clinician or trained reviewer is one standard diagnostic methodology for making a disease determination from such images. Computer-aided diagnosis systems were introduced to facilitate this task. In this article we detailed how a CAD system can be developed for research purposes when medical images contain valuable exposure or outcome information for answering a research question. In our example in the field of ophthalmologic epidemiology, a CAD system for assessing cortical cataract severity from retroillumination images was designed using available and established image processing and statistical techniques. We further examined the utility of using simulated images to validate image-based measurement or assessment in the absence of a gold standard. From our simulation study we found that the CAD algorithm outperformed the trained reviewer in estimating cataract severity from images with mild to moderate cataract involvement. The reduction in bias observed with the CAD system likely reflects both improved performance at assessing severity and the ability of the system to describe severity on a continuous scale. The coarsening of the data through the use of a categorical scale limits how close the reviewers can be, on average, to the true severity. When the true disease status was dichotomized for the ROC analysis, we found that the performance of the reviewers was equivalent to the CAD system, suggesting that, in this example, most of the improvement arises from the CAD algorithm's ability to discriminate very small changes in severity. On more severe cases, the CAD system had higher bias due to problems in the background noise filtering methods.
Replacing the erosion/dilation operations with a more robust method of adjustment for uneven background lighting would likely result in an assessment algorithm with low bias at all severities. Methods are available for improving upon background intensity standardization in images [36–39]. We chose erosion/dilation operations for their ease of implementation with standard software. It should be noted that severe cases of cortical cataract are rare since cataract surgery is often performed before the cataract progresses to such an extent.
A feature of the CAD system worth highlighting is the zero variability. The suggested areas of opacity will not vary with repeated assessment of the same image. The reviewer using the CAD system may be more or less adherent to the suggestions of the system, which, we hypothesize, would increase the variability up to a maximum equal to the variability of the reviewer making unassisted decisions about severity. Therefore, we would expect that the algorithm-assisted reviewer would have, on average, lower within-reviewer variability. While we did not evaluate the impact of the CAD system on reviewers' performance, this is an important question that would need evaluation prior to implementation. Differences in the effect of CAD on a reviewer's assessment of an image would best be evaluated using real image data, where image interpretation would be most challenging and results would represent performance in practice. It is clear that trained reviewers are sensitive to various aspects of the images or lesions and this could result in biases that vary from study to study. Reviewers were sensitive to the contrast and performed better with certain opacity shape characteristics, which may have implications for cataract research using standard severity assessment methods. Cortical cataract opacities tend to take on a variety of appearances and, to the extent that the shape may be related to the mechanism, studies of some risk factors may be more prone to bias, potentially differential.
There are numerous aspects of reviewer behavior and performance that could be studied using simulated images. It is difficult to assure that the assessment process of a reviewer would be the same with simulated versus real images. Simulation studies cannot stand in isolation as the only evaluation of an assessment method and are valuable only when knowledge of true disease status cannot be attained. However, simulation studies are low cost, do not impact patients, and allow for a fuller exploration of the strengths and weaknesses of the assessment method.
Subjective reviewers of lens images can accomplish complex discrimination tasks that cannot be fully automated at present. However, we found that the performance of reviewers is affected by various features in the lens image and the degree of bias in their assessment may vary from image to image. Augmenting the assessment process with a computer algorithm is a means of standardizing the measurement and minimizing some of the variability of subjective image assessment. Such CAD systems can be designed for many different applications with readily available image processing and statistical software. Testing and validation can readily be performed using simulated images that capture the main features of interest.
The authors would like to acknowledge Rob Burke for his many efforts related to this project. We would also like to thank the lens graders who generously gave their time to help us validate the algorithm and compare it with the current lens grading methodology. This work was supported by the Department of Health and Human Services, National Institutes of Health, National Eye Institute Training Grant [EY 07127] Clinical Trials Training Program in Vision Research.
1. Miglioretti D, Smith-Bindman R, Abraham L, et al: Radiologist characteristics associated with interpretive performance of diagnostic mammography. J Natl Cancer Inst. 2007, 99 (24): 1854-1863.
2. Biggelaar van den F, Nelemans P, Flobbe K: Performance of radiographers in mammogram interpretation: A systematic review. Breast. 2008, 17: 85-90.
3. Doi K: Computer-aided diagnosis in medical imaging: historical review, current status and future potential. Comput Med Imaging Graph. 2007, 31: 198-211.
4. Elter M, Schulz-Wendtland R, Wittenberg T: The prediction of breast cancer biopsy outcomes using two CAD approaches that both emphasize an intelligible decision process. Med Phys. 2007, 34 (11): 4164-4172.
5. Helvie M: Improving mammographic interpretation: Double reading and computer-aided diagnosis. Radiol Clin North Am. 2007, 45 (5): 801-811.
6. Hadjiiski L, Sahiner B, Chan H: Advances in computer-aided diagnosis for breast cancer. Curr Opin Obstet Gynecol. 2006, 18: 64-70.
7. Scurfield B: Multiple-event forced-choice tasks in the theory of signal detectability. Journal of Mathematical Psychology. 1996, 40: 253-269.
8. Scurfield B: Generalization of the theory of signal detectability to n-event m-dimensional forced-choice tasks. Journal of Mathematical Psychology. 1998, 42: 5-31.
9. Mossman D: Three-way ROCs. Medical Decision Making. 1999, 19: 78-89.
10. Nakas C, Yiannoutsos C: Ordered multiple-class ROC analysis with continuous measurements. Statistics in Medicine. 2004, 23: 3437-3449.
11. Li J: ROC analysis with multiple classes and multiple tests: methodology and its application in microarray studies. Biostatistics. 2008, 9 (3): 566-576.
12. West S, Rosenthal F, Newland H, et al: Use of photographic techniques to grade nuclear cataracts. Invest Ophthalmol Vis Sci. 1988, 29: 73-77.
13. Taylor H, West S: The clinical grading of lens opacities. Aust N Z J Ophthalmol. 1989, 17: 81-86.
14. Sparrow J, Ayliffe W, Bron A, et al: Inter-observer and intra-observer variability of the Oxford clinical cataract classification and grading system. Int Ophthalmol. 1988, 11: 151-157.
15. Sparrow J, Bron A, Brown N, et al: The Oxford clinical cataract classification and grading system. Int Ophthalmol. 1986, 9: 207-225.
16. Chylack L, Wolfe J, Singer D, et al: The Lens Opacities Classification System III. The Longitudinal Study of Cataract Study Group. Arch Ophthalmol. 1993, 111 (6): 831-836.
17. Chylack L, Leske M, Sperduto R, et al: Lens opacities classification system. Arch Ophthalmol. 1988, 106: 330-334.
18. Klein B, Klein R, Linton K, et al: Assessment of cataracts from photographs in the Beaver Dam Eye Study. Ophthalmology. 1990, 97: 1428-1433.
19. Brown N, Bron A, Ayliffe W, et al: The objective assessment of cataract. Eye. 1987, 1: 234-246.
20. Sparrow J, Brown N, Shun-Shin A, et al: The Oxford modular cataract image analysis system. Eye. 1990, 4: 638-648.
21. Harris M, Hanna K, Shun-Shin G, et al: Analysis of retro-illumination photographs for use in longitudinal studies of cataract. Eye. 1993, 7: 572-577.
22. Gershenzon A, Robman L: New software for lens retro-illumination digital image analysis. Aust N Z J Ophthalmol. 1999, 27: 170-172.
23. Lopez M, Datiles M, Podgor M, et al: Reproducibility study of posterior subcapsular opacities on the NEI retroillumination image analysis system. Eye. 1994, 8: 657-661.
24. Robman L, McCarty C, Garrett S, et al: Variability in digital assessment of cortical and posterior subcapsular cataract. Ophthalmic Res. 1999, 31: 110-118.
25. Vivino M, Mahurkar A, Trus B, et al: Quantitative analysis of retroillumination images. Eye. 1995, 9 (Pt 1): 77-84.
26. Wolfe J, Chylack L: Objective measurement of cortical and subcapsular opacification in retroillumination photographs. Ophthalmic Res. 1990, 22 (Suppl 1): 62-67.
27. Miyauchi A, Mukai S, Sakamoto Y: A new analysis method for cataractous images taken by retroillumination photography. Ophthalmic Res. 1990, 22 (Suppl 1): 74-77.
28. West S, Munoz B, Schein O, et al: Racial differences in lens opacities: The Salisbury Eye Evaluation (SEE) Project. Am J Epidemiol. 1998, 148 (11): 1033-1039.
29. West S, Munoz B, Rubin G, Schein O, Bandeen-Roche K, Zeger S, German P, Fried L: Function and visual impairment in a population-based study of older adults. Invest Ophthalmol Vis Sci. 1997, 38: 72-82.
30. Serra J: Image Analysis and Mathematical Morphology. 1982, London: Academic Press, 1.
31. Witkin A, Kass M, Terzopoulos D: Snakes: Active contour models. Int J Computer Vision. 1988, 1 (4): 321-331.
32. Cohen L: On active contour models and balloons. CVGIP: Image Underst. 1991, 53 (2): 211-218.
33. Xu C, Prince J: Snakes, Shapes, and Gradient Vector Flow. IEEE Transactions on Image Processing. 1998, 7 (3): 359-369.
34. Dunn J: A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. J Cybernetics. 1973, 3: 32-57.
35. Bezdek J: Pattern recognition with fuzzy objective function algorithms. 1981, New York: Plenum Press.
36. Pappas T: An adaptive clustering algorithm for image segmentation. IEEE Trans on Signal Processing. 1992, 40: 901-914.
37. Unser M: Multigrid adaptive image processing. Proc of the IEEE Conf on Image Processing. 1995, 1: 49-52.
38. Wells W, Grimson W, Kikinis R, et al: Adaptive segmentation of MRI data. IEEE Trans on Med Imag. 1996, 15: 429-442.
39. Pham D, Prince J: An adaptive fuzzy c-means algorithm for image segmentation in the presence of intensity inhomogeneities. Proc SPIE Medical Imaging 1998: Image Processing. 1998, 3338: 555-563.
The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1471-2288/9/74/prepub