1
Petrick N, Sahiner B, Armato SG, Bert A, Correale L, Delsanto S, Freedman MT, Fryd D, Gur D, Hadjiiski L, Huo Z, Jiang Y, Morra L, Paquerault S, Raykar V, Samuelson F, Summers RM, Tourassi G, Yoshida H, Zheng B, Zhou C, Chan HP. Evaluation of computer-aided detection and diagnosis systems. Med Phys 2014; 40:087001. [PMID: 23927365 DOI: 10.1118/1.4816310] [Citation(s) in RCA: 65] [Impact Index Per Article: 6.5] [Indexed: 01/09/2023]
Abstract
Computer-aided detection and diagnosis (CAD) systems are increasingly being used as an aid by clinicians for detection and interpretation of diseases. Computer-aided detection systems mark regions of an image that may reveal specific abnormalities and are used to alert clinicians to these regions during image interpretation. Computer-aided diagnosis systems provide an assessment of a disease using image-based information alone or in combination with other relevant diagnostic data and are used by clinicians as a decision support in developing their diagnoses. While CAD systems are commercially available, standardized approaches for evaluating and reporting their performance have not yet been fully formalized in the literature or in a standardization effort. This deficiency has led to difficulty in the comparison of CAD devices and in understanding how the reported performance might translate into clinical practice. To address these important issues, the American Association of Physicists in Medicine (AAPM) formed the Computer Aided Detection in Diagnostic Imaging Subcommittee (CADSC), in part, to develop recommendations on approaches for assessing CAD system performance. The purpose of this paper is to convey the opinions of the AAPM CADSC members and to stimulate the development of consensus approaches and "best practices" for evaluating CAD systems. Both the assessment of a standalone CAD system and the evaluation of the impact of CAD on end-users are discussed. It is hoped that awareness of these important evaluation elements and the CADSC recommendations will lead to further development of structured guidelines for CAD performance assessment. 
Proper assessment of CAD system performance is expected to increase the understanding of a CAD system's effectiveness and limitations, which is expected to stimulate further research and development efforts on CAD technologies, reduce problems due to improper use, and eventually improve the utility and efficacy of CAD in clinical practice.
Affiliation(s)
- Nicholas Petrick
- Center for Devices and Radiological Health, U.S. Food and Drug Administration, 10903 New Hampshire Avenue, Silver Spring, Maryland 20993, USA
2
Validation of Monte Carlo estimates of three-class ideal observer operating points for normal data. Acad Radiol 2013; 20:908-14. [PMID: 23747155 DOI: 10.1016/j.acra.2013.04.002] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 01/31/2013] [Revised: 04/15/2013] [Accepted: 04/16/2013] [Indexed: 11/23/2022]
Abstract
RATIONALE AND OBJECTIVES: Traditional two-class receiver operating characteristic (ROC) analysis is inadequate for the complete evaluation of observer performance in tasks with more than two classes. MATERIALS AND METHODS: Here, a Monte Carlo estimation method for operating point coordinates on a three-class ROC surface is developed and compared with analytically calculated coordinates in two special cases: (1) univariate and (2) restricted bivariate trinormal underlying data. RESULTS: In both cases, the statistical estimates were found to be good in the sense that the analytical values lay within the 95% confidence interval of the estimated values about 95% of the time. CONCLUSIONS: The statistical estimation method should be key in the development of a pragmatic performance metric for evaluation of observers in classification tasks with three or more classes.
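The comparison described in this abstract can be illustrated with a minimal Python sketch. The abstract does not give the estimation procedure, so everything below is an assumption made for illustration: three equal-variance normal classes on a single decision variable, a two-threshold decision rule, a normal-approximation 95% confidence interval, and the hypothetical function names `analytic_operating_point` and `monte_carlo_operating_point`. It is not the authors' method.

```python
import math
import random


def normal_cdf(x, mean=0.0, sd=1.0):
    """Cumulative distribution function of a normal variable."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def analytic_operating_point(means, sd, t1, t2):
    """True class fractions for the rule: class 1 if x < t1,
    class 2 if t1 <= x < t2, class 3 otherwise (assumed rule)."""
    m1, m2, m3 = means
    tcf1 = normal_cdf(t1, m1, sd)
    tcf2 = normal_cdf(t2, m2, sd) - normal_cdf(t1, m2, sd)
    tcf3 = 1.0 - normal_cdf(t2, m3, sd)
    return (tcf1, tcf2, tcf3)


def monte_carlo_operating_point(means, sd, t1, t2, n, rng):
    """Estimate each true class fraction from n simulated cases per class.
    Returns (estimate, ci_low, ci_high) per class, using a
    normal-approximation 95% interval (an illustrative choice)."""
    estimates = []
    for k, mean in enumerate(means):
        correct = 0
        for _ in range(n):
            x = rng.gauss(mean, sd)
            decided = 0 if x < t1 else (1 if x < t2 else 2)
            correct += (decided == k)
        p = correct / n
        half = 1.96 * math.sqrt(p * (1.0 - p) / n)
        estimates.append((p, p - half, p + half))
    return estimates


if __name__ == "__main__":
    rng = random.Random(0)
    means = (-2.0, 0.0, 2.0)  # hypothetical class means
    exact = analytic_operating_point(means, 1.0, -1.0, 1.0)
    approx = monte_carlo_operating_point(means, 1.0, -1.0, 1.0, 20000, rng)
    for a, (p, lo, hi) in zip(exact, approx):
        print(f"analytic={a:.4f} estimate={p:.4f} "
              f"CI=[{lo:.4f}, {hi:.4f}] covered={lo <= a <= hi}")
```

Repeating the simulation with many different seeds and counting how often each analytic value falls inside its interval would give a coverage rate comparable in spirit to the "about 95% of the time" result reported above.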
3
He X, Gallas BD, Frey EC. Three-class ROC analysis--toward a general decision theoretic solution. IEEE Trans Med Imaging 2010; 29:206-215. [PMID: 19884079 PMCID: PMC2821068 DOI: 10.1109/tmi.2009.2034516] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 05/28/2023]
Abstract
Multiclass receiver operating characteristic (ROC) analysis has remained an open theoretical problem since the introduction of binary ROC analysis in the 1950s. Previously, we have developed a paradigm for three-class ROC analysis that extends and unifies decision theoretic, linear discriminant analysis, and probabilistic foundations of binary ROC analysis in a three-class paradigm. One critical element in this paradigm is the equal error utility (EEU) assumption. This assumption allows us to reduce the intrinsic space of the three-class ROC analysis (5-D hypersurface in 6-D hyperspace) to a 2-D surface in the 3-D space of true positive fractions (sensitivity space). In this work, we show that this 2-D ROC surface fully and uniquely provides a complete descriptor for the optimal performance of a system for a three-class classification task, i.e., the triplet of likelihood ratio distributions, assuming such a triplet exists. To be specific, we consider two classifiers that utilize likelihood ratios, and we assume each classifier has a continuous and differentiable 2-D sensitivity-space ROC surface. Under these conditions, we prove that the classifiers have the same triplet of likelihood ratio distributions if and only if they have the same 2-D sensitivity-space ROC surfaces. As a result, the 2-D sensitivity surface contains complete information on the optimal three-class task performance for the corresponding likelihood ratio classifier.
Affiliation(s)
- Xin He
- Department of Radiology, Johns Hopkins School of Medicine, Baltimore, MD 21287 USA
- Brandon D. Gallas
- DIAM/OSEL/CDRH, Food and Drug Administration, Silver Spring, MD, 20993 USA
- Eric C. Frey
- Department of Radiology, Johns Hopkins School of Medicine, Baltimore, MD 21287 USA
4
Giger ML, Chan HP, Boone J. Anniversary paper: History and status of CAD and quantitative image analysis: the role of Medical Physics and AAPM. Med Phys 2009; 35:5799-820. [PMID: 19175137 PMCID: PMC2673617 DOI: 10.1118/1.3013555] [Citation(s) in RCA: 165] [Impact Index Per Article: 11.0] [Indexed: 01/17/2023]
Abstract
The roles of physicists in medical imaging have expanded over the years, from the study of imaging systems (sources and detectors) and dose to the assessment of image quality and perception, the development of image processing techniques, and the development of image analysis methods to assist in detection and diagnosis. The latter is a natural extension of medical physicists' goals in developing imaging techniques to help physicians acquire diagnostic information and improve clinical decisions. Studies indicate that radiologists do not detect all abnormalities on images that are visible on retrospective review, and they do not always correctly characterize abnormalities that are found. Since the 1950s, the potential use of computers had been considered for analysis of radiographic abnormalities. In the mid-1980s, however, medical physicists and radiologists began major research efforts for computer-aided detection or computer-aided diagnosis (CAD), that is, using the computer output as an aid to radiologists (as opposed to a completely automatic computer interpretation), focusing initially on methods for the detection of lesions on chest radiographs and mammograms. Since then, extensive investigations of computerized image analysis for detection or diagnosis of abnormalities in a variety of 2D and 3D medical images have been conducted. The growth of CAD over the past 20 years has been tremendous, from the early days of time-consuming film digitization and CPU-intensive computations on a limited number of cases to its current status in which developed CAD approaches are evaluated rigorously on large clinically relevant databases.
CAD research by medical physicists includes many aspects: collecting relevant normal and pathological cases; developing computer algorithms appropriate for the medical interpretation task including those for segmentation, feature extraction, and classifier design; developing methodology for assessing CAD performance; validating the algorithms using appropriate cases to measure performance and robustness; conducting observer studies with which to evaluate radiologists in the diagnostic task without and with the use of the computer aid; and ultimately assessing performance with a clinical trial. Medical physicists also have an important role in quantitative imaging, by validating the quantitative integrity of scanners and developing imaging techniques, and image analysis tools that extract quantitative data in a more accurate and automated fashion. As imaging systems become more complex and the need for better quantitative information from images grows, the future includes the combined research efforts from physicists working in CAD with those working on quantitative imaging systems to readily yield information on morphology, function, molecular structure, and more, from animal imaging research to clinical patient care. A historical review of CAD and a discussion of challenges for the future are presented here, along with the extension to quantitative image analysis.
Affiliation(s)
- Maryellen L Giger
- Department of Radiology, University of Chicago, Chicago, Illinois 60637, USA.
5
He X, Frey EC. The validity of three-class Hotelling trace (3-HT) in describing three-class task performance: comparison of three-class volume under ROC surface (VUS) and 3-HT. IEEE Trans Med Imaging 2009; 28:185-193. [PMID: 19188107 PMCID: PMC2760394 DOI: 10.1109/tmi.2008.928919] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 05/26/2023]
Abstract
In order to describe multiclass classification performance, several figures of merit (FOM) have been proposed. Among the earliest and most widely known of these is the three-class Hotelling trace (3-HT). The goal of this paper is to present theoretical and empirical data demonstrating the failure of 3-HT as a measure of three-class task performance. To help do this, we contrast it to a newly proposed three-class FOM, the volume under the three-class receiver operating characteristic (ROC) surface (VUS). The VUS is obtained from a decision theory based three-class ROC analysis method which has been proved to extend the decision theoretic, linear discriminant analysis (LDA), and psychophysical foundations of binary ROC analysis to a three-class paradigm. We demonstrate empirically that the VUS and 3-HT do not have a monotonic relationship in general when describing three-class task performance. Numerical experiments demonstrated that the VUS provided reasonable results, while the 3-HT failed to distinguish between the case where all objects could be perfectly classified from the case where only one pair of the classes could be perfectly classified. We have provided theoretical explanations of this failure of 3-HT. The significance of this work goes beyond merely demonstrating the problems of the 3-HT, it demonstrates that a FOM that is mathematically correct and has a strong theoretical basis can provide results that violate a common sense understanding of three-class task performance. This fact raises the question of "how to evaluate a classification performance evaluation method?" We believe the answer to this question lies in the theoretical foundations of binary ROC analysis. We have thus contrasted the two FOMs in terms of three fundamental theories underlying binary ROC analysis: decision theory, binary linear discriminant analysis, and the equivalence of two psychophysical classification procedures. 
These theoretical investigations demonstrated the importance of extending and unifying all the fundamental theories of binary classification in the development of a three-class FOM; violating one of these fundamental binary classification theories may, as it did for the 3-HT, provide predictions of three-class task performance that do not agree with a common sense understanding of three-class task performance.
Affiliation(s)
- Xin He
- Department of Radiology, Johns Hopkins School of Medicine, Baltimore, MD 21287 USA
- Eric C. Frey
- Department of Radiology, Johns Hopkins School of Medicine, Baltimore, MD 21287 USA
6
Sampat MP, Patel AC, Wang Y, Gupta S, Kan CW, Bovik AC, Markey MK. Indexes for three-class classification performance assessment--an empirical comparison. IEEE Trans Inf Technol Biomed 2009; 13:300-12. [PMID: 19171528 DOI: 10.1109/titb.2008.2009440] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Indexed: 11/06/2022]
Abstract
Assessment of classifier performance is critical for fair comparison of methods, including considering alternative models or parameters during system design. The assessment must not only provide meaningful data on the classifier efficacy, but it must do so in a concise and clear manner. For two-class classification problems, receiver operating characteristic analysis provides a clear and concise assessment methodology for reporting performance and comparing competing systems. However, many other important biomedical questions cannot be posed as "two-class" classification tasks and more than two classes are often necessary. While several methods have been proposed for assessing the performance of classifiers for such multiclass problems, none has been widely accepted. The purpose of this paper is to critically review methods that have been proposed for assessing multiclass classifiers. A number of these methods provide a classifier performance index called the volume under surface (VUS). Empirical comparisons are carried out using 4 three-class case studies, in which three popular classification techniques are evaluated with these methods. Since the same classifier was assessed using multiple performance indexes, it is possible to gain insight into the relative strengths and weaknesses of the measures. We conclude that: 1) the method proposed by Scurfield provides the most detailed description of classifier performance and insight about the sources of error in a given classification task and 2) the methods proposed by He and Nakas also have great practical utility as they provide both the VUS and an estimate of the variance of the VUS. These estimates can be used to statistically compare two classification algorithms.
Affiliation(s)
- Mehul P Sampat
- Center for Neurological Imaging, Department of Radiology, Brigham and Women's Hospital, Boston, MA 02115, USA.
7
He X, Song X, Frey EC. Application of three-class ROC analysis to task-based image quality assessment of simultaneous dual-isotope myocardial perfusion SPECT (MPS). IEEE Trans Med Imaging 2008; 27:1556-67. [PMID: 18955172 PMCID: PMC2668219 DOI: 10.1109/tmi.2008.928921] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Indexed: 05/20/2023]
Abstract
The diagnosis of cardiac disease using dual-isotope myocardial perfusion SPECT (MPS) is based on the defect status in both stress and rest images, and can be modeled as a three-class task of classifying patients as having no, reversible, or fixed perfusion defects. Simultaneous acquisition protocols for dual-isotope MPS imaging have gained much interest due to their advantages including perfect registration of the (201)Tl and (99m)Tc images in space and time, increased patient comfort, and higher clinical throughput. As a result of simultaneous acquisition, however, crosstalk contamination, where photons emitted by one isotope contribute to the image of the other isotope, degrades image quality. Minimizing the crosstalk is important in obtaining the best possible image quality. One way to minimize the crosstalk is to optimize the injected activity of the two isotopes by considering the three-class nature of the diagnostic problem. To effectively do so, we have previously developed a three-class receiver operating characteristic (ROC) analysis methodology that extends and unifies the decision theoretic, linear discriminant analysis, and psychophysical foundations of binary ROC analysis in a three-class paradigm. In this work, we applied the proposed three-class ROC methodology to the assessment of the image quality of simultaneous dual-isotope MPS imaging techniques and the determination of the optimal injected activity combination. In addition to this application, the rapid development of diagnostic imaging techniques has produced an increasing number of clinical diagnostic tasks that involve not only disease detection, but also disease characterization and are thus multiclass tasks. This paper provides a practical example of the application of the proposed three-class ROC analysis methodology to medical problems.
Affiliation(s)
- Xin He
- Department of Radiology, Johns Hopkins School of Medicine, 601 N. Caroline Street, Baltimore, MD 21287 USA
- Xiyun Song
- Philips Medical Systems, San Jose, CA 95134 USA
- Eric C. Frey
- Department of Radiology, Johns Hopkins School of Medicine, Baltimore, MD 21287 USA
8
He X, Frey EC. The meaning and use of the volume under a three-class ROC surface (VUS). IEEE Trans Med Imaging 2008; 27:577-88. [PMID: 18450532 PMCID: PMC2654215 DOI: 10.1109/tmi.2007.908687] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.1] [Indexed: 05/26/2023]
Abstract
Previously, we have proposed a method for three-class receiver operating characteristic (ROC) analysis based on decision theory. In this method, the volume under a three-class ROC surface (VUS) serves as a figure-of-merit (FOM) and measures three-class task performance. The proposed three-class ROC analysis method was demonstrated to be optimal under decision theory according to several decision criteria. Further, an optimal three-class linear observer was proposed to simultaneously maximize the signal-to-noise ratio (SNR) between the test statistics of each pair of the classes, provided that a certain data linearity condition holds. Applicability of this three-class ROC analysis method would be further enhanced by the development of an intuitive meaning of the VUS and a more general method to calculate the VUS that provides an estimate of its standard error. In this paper, we investigated the general meaning and usage of VUS as a FOM for three-class classification task performance. We showed that the VUS value, which is obtained from a rating procedure, equals the percent correct in a corresponding categorization procedure for continuous rating data. The significance of this relationship goes beyond providing another theoretical basis for three-class ROC analysis: it enables statistical analysis of the VUS value. Based on this relationship, we developed and tested algorithms for calculating the VUS and its variance. Finally, we reviewed the current status of the proposed three-class ROC analysis methodology, and concluded that it extends and unifies decision theoretic, linear discriminant analysis, and psychophysical foundations of binary ROC analysis in a three-class paradigm.
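The idea that a volume under a three-class ROC surface can be read as a percent correct in a categorization experiment can be sketched in Python for the simplest setting. The abstract does not describe the authors' algorithm, so the block below swaps in a different, simpler technique: the ordinal, scalar-score VUS, computed as the fraction of triples (one case drawn from each class) whose scores fall in the expected order. The function name `empirical_vus` and the example scores are hypothetical.

```python
from itertools import product


def empirical_vus(scores_1, scores_2, scores_3):
    """Fraction of (class 1, class 2, class 3) triples whose scores are
    strictly ordered a < b < c. This is an ordinal scalar-score VUS,
    used here as an illustrative stand-in, not the decision theoretic
    VUS of the paper. Ties are counted as incorrect for simplicity."""
    total = len(scores_1) * len(scores_2) * len(scores_3)
    if total == 0:
        raise ValueError("each class needs at least one score")
    correct = sum(
        1 for a, b, c in product(scores_1, scores_2, scores_3) if a < b < c
    )
    return correct / total


if __name__ == "__main__":
    # Hypothetical rating data for three classes.
    class_1 = [0.1, 0.4, 0.35]
    class_2 = [0.5, 0.3, 0.7]
    class_3 = [0.9, 0.6, 0.8]
    print(empirical_vus(class_1, class_2, class_3))
```

Under this definition, scores unrelated to class membership give a value near 1/6, since only one of the six possible orderings of a triple is correct, and perfectly separated classes give 1. The brute-force loop costs time proportional to the product of the three sample sizes, which is acceptable for a sketch but not for large data sets.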
Affiliation(s)
- Xin He
- Department of Radiology, Johns Hopkins School of Medicine, 601 N. Caroline Street, Baltimore, MD 21287 USA
- Eric C. Frey
- Department of Radiology, Johns Hopkins School of Medicine, Baltimore, MD 21287 USA
9
Sahiner B, Chan HP, Hadjiiski LM. Performance analysis of three-class classifiers: properties of a 3-D ROC surface and the normalized volume under the surface for the ideal observer. IEEE Trans Med Imaging 2008; 27:215-227. [PMID: 18334443 PMCID: PMC3023151 DOI: 10.1109/tmi.2007.905822] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 05/26/2023]
Abstract
Classification of a given observation to one of three classes is an important task in many decision processes or pattern recognition applications. A general analysis of the performance of three-class classifiers results in a complex 6-D receiver operating characteristic (ROC) space, for which no simple analytical tool exists at present. We investigate the performance of an ideal observer under a specific set of assumptions that reduces the 6-D ROC space to 3-D by constraining the utilities of some of the decisions in the classification task. These assumptions lead to a 3-D ROC space in which the true-positive fraction (TPF) can be expressed in terms of the two types of false-positive fractions (FPFs). We demonstrate that the TPF is uniquely determined by, and therefore is a function of, the two FPFs. The domain of this function is shown to be related to the decision boundaries in the likelihood ratio plane. Based on these properties of the 3-D ROC space, we can define a summary measure, referred to as the normalized volume under the surface (NVUS), that is analogous to the area under the ROC curve (AUC) for a two-class classifier. We further investigate the properties of the 3-D ROC surface and the NVUS for the ideal observer under the condition that the three class distributions are multivariate normal with equal covariance matrices. The probability density functions (pdfs) of the decision variables are shown to follow a bivariate log-normal distribution. By considering these pdfs, we express the TPF in terms of the FPFs, and integrate the TPF over its domain numerically to obtain the NVUS. In addition, we performed a Monte Carlo simulation study, in which the 3-D ROC surface was generated by empirical "optimal" classification of case samples in the multidimensional feature space following the assumed distributions, to obtain an independent estimate of NVUS. 
The NVUS value obtained by using the analytical pdfs was found to be in good agreement with that obtained from the Monte Carlo simulation study. We also found that, under all conditions studied, the NVUS increased when the difficulty of the classification task was reduced by changing the parameters of the class distributions, thereby exhibiting the properties of a performance metric analogous to AUC. Our results indicate that, under the conditions that lead to our 3-D ROC analysis, the performance of a three-class classifier may be analyzed by considering the ROC surface, and its accuracy characterized by the NVUS.
Affiliation(s)
- Berkman Sahiner
- Department of Radiology, University of Michigan, Ann Arbor, MI 48109, USA.