1
Yu Z, Rahman A, Laforest R, Schindler TH, Gropler RJ, Wahl RL, Siegel BA, Jha AK. Need for objective task-based evaluation of deep learning-based denoising methods: A study in the context of myocardial perfusion SPECT. Med Phys 2023; 50:4122-4137. [PMID: 37010001 PMCID: PMC10524194 DOI: 10.1002/mp.16407] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Received: 09/05/2022] [Revised: 01/20/2023] [Accepted: 03/01/2023] [Indexed: 04/04/2023]
Abstract
BACKGROUND Artificial intelligence-based methods have generated substantial interest in nuclear medicine. An area of significant interest has been the use of deep-learning (DL)-based approaches for denoising images acquired with lower doses, shorter acquisition times, or both. Objective evaluation of these approaches is essential for clinical application.
PURPOSE DL-based approaches for denoising nuclear-medicine images have typically been evaluated using fidelity-based figures of merit (FoMs) such as root mean squared error (RMSE) and structural similarity index measure (SSIM). However, these images are acquired for clinical tasks and thus should be evaluated based on their performance in these tasks. Our objectives were to: (1) investigate whether evaluation with these FoMs is consistent with objective clinical-task-based evaluation; (2) provide a theoretical analysis for determining the impact of denoising on signal-detection tasks; and (3) demonstrate the utility of virtual imaging trials (VITs) to evaluate DL-based methods.
METHODS A VIT to evaluate a DL-based method for denoising myocardial perfusion SPECT (MPS) images was conducted. To conduct this evaluation study, we followed the recently published best practices for the evaluation of AI algorithms for nuclear medicine (the RELAINCE guidelines). An anthropomorphic patient population modeling clinically relevant variability was simulated. Projection data for this patient population at normal and low-dose count levels (20%, 15%, 10%, 5%) were generated using well-validated Monte Carlo-based simulations. The images were reconstructed using a 3-D ordered-subsets expectation maximization-based approach. Next, the low-dose images were denoised using a commonly used convolutional neural network-based approach. The impact of DL-based denoising was evaluated using both fidelity-based FoMs and area under the receiver operating characteristic curve (AUC), which quantified performance on the clinical task of detecting perfusion defects in MPS images as obtained using a model observer with anthropomorphic channels. We then provide a mathematical treatment to probe the impact of post-processing operations on signal-detection tasks and use this treatment to analyze the findings of this study.
RESULTS Based on fidelity-based FoMs, denoising using the considered DL-based method led to significantly superior performance. However, based on ROC analysis, denoising did not improve, and in fact, often degraded detection-task performance. This discordance between fidelity-based FoMs and task-based evaluation was observed at all the low-dose levels and for different cardiac-defect types. Our theoretical analysis revealed that the major reason for this degraded performance was that the denoising method reduced the difference in the means of the reconstructed images and of the channel operator-extracted feature vectors between the defect-absent and defect-present cases.
CONCLUSIONS The results show the discrepancy between the evaluation of DL-based methods with fidelity-based metrics versus the evaluation on clinical tasks. This motivates the need for objective task-based evaluation of DL-based denoising approaches. Further, this study shows how VITs provide a mechanism to conduct such evaluations computationally, in a time- and resource-efficient setting, and avoid risks such as radiation dose to the patient. Finally, our theoretical treatment reveals insights into the reasons for the limited performance of the denoising approach and may be used to probe the effect of other post-processing operations on signal-detection tasks.
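The reported link between a smaller difference in class means and lower detection performance can be illustrated with the standard Hotelling signal-to-noise ratio. The following is a minimal sketch, not the authors' analysis: it assumes Gaussian channel outputs with a covariance shared by both classes, and the channel count, covariance, mean vectors, and the factor by which the mean difference shrinks are all hypothetical values chosen for illustration.

```python
import math
import numpy as np


def hotelling_snr(mean_absent, mean_present, cov):
    """Hotelling SNR: sqrt(delta^T K^{-1} delta) for mean difference delta."""
    delta = np.asarray(mean_present, dtype=float) - np.asarray(mean_absent, dtype=float)
    return float(np.sqrt(delta @ np.linalg.solve(cov, delta)))


def auc_from_snr(snr):
    """AUC of a Gaussian, equal-variance test statistic with the given SNR."""
    return 0.5 * (1.0 + math.erf(snr / 2.0))


# Hypothetical 4-channel feature statistics (not taken from the study).
cov = np.eye(4)
mean_absent = np.zeros(4)
mean_present = np.array([1.0, 0.0, 0.0, 0.0])

snr_before = hotelling_snr(mean_absent, mean_present, cov)

# Assumption: post-processing halves the mean difference while the
# covariance stays fixed. Real denoising also changes the covariance.
snr_after = hotelling_snr(mean_absent, 0.5 * mean_present, cov)

auc_before = auc_from_snr(snr_before)
auc_after = auc_from_snr(snr_after)
print(f"AUC before: {auc_before:.3f}, after: {auc_after:.3f}")
```

Under these assumptions the AUC falls whenever the mean difference shrinks faster than the noise does, which is the kind of effect a fidelity-based FoM such as RMSE does not capture.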
Affiliation(s)
- Zitong Yu
- Department of Biomedical Engineering, Washington University in St. Louis, St. Louis, Missouri, USA
- Ashequr Rahman
- Department of Biomedical Engineering, Washington University in St. Louis, St. Louis, Missouri, USA
- Richard Laforest
- Mallinckrodt Institute of Radiology, Washington University in St. Louis, St. Louis, Missouri, USA
- Thomas H. Schindler
- Mallinckrodt Institute of Radiology, Washington University in St. Louis, St. Louis, Missouri, USA
- Robert J. Gropler
- Mallinckrodt Institute of Radiology, Washington University in St. Louis, St. Louis, Missouri, USA
- Richard L. Wahl
- Mallinckrodt Institute of Radiology, Washington University in St. Louis, St. Louis, Missouri, USA
- Barry A. Siegel
- Mallinckrodt Institute of Radiology, Washington University in St. Louis, St. Louis, Missouri, USA
- Abhinav K. Jha
- Department of Biomedical Engineering, Washington University in St. Louis, St. Louis, Missouri, USA
- Mallinckrodt Institute of Radiology, Washington University in St. Louis, St. Louis, Missouri, USA
2
Li Y, O'Reilly S, Plyku D, Treves ST, Fahey F, Du Y, Cao X, Sexton-Stallone B, Brown J, Sgouros G, Bolch WE, Frey EC. Current pediatric administered activity guidelines for 99mTc-DMSA SPECT based on patient weight do not provide the same task-based image quality. Med Phys 2019; 46:4847-4856. [PMID: 31448427 DOI: 10.1002/mp.13787] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 03/18/2019] [Revised: 08/16/2019] [Accepted: 08/16/2019] [Indexed: 11/11/2022]
Abstract
PURPOSE In the current clinical practice, administered activity (AA) for pediatric molecular imaging is often based on the North American expert consensus guidelines or the European Association of Nuclear Medicine dosage card, both of which were developed based on the best clinical practice. These guidelines were not formulated using a rigorous evaluation of diagnostic image quality (IQ) relative to AA. In the guidelines, AA is determined by a weight-based scaling of the adult AA, along with minimum and maximum AA constraints. In this study, we use task-based IQ assessment methods to rigorously evaluate the efficacy of weight-based scaling in equalizing IQ using a population of pediatric patients of different ages and body weights.
METHODS A previously developed projection image database was used. We measured task-based IQ, with respect to the detection of a renal functional defect at six different AA levels (AA relative to the AA obtained from the guidelines). IQ was assessed using an anthropomorphic model observer. Receiver-operating characteristics (ROC) analysis was applied; the area under the ROC curve (AUC) served as a figure-of-merit for task performance. In addition, we investigated patient girth (circumference) as a potential improved predictor of the IQ.
RESULTS The data demonstrate a monotonic and modestly saturating increase in AUC with increasing AA, indicating that defect detectability was limited by quantum noise and the effects of object variability were modest over the range of AA levels studied. The AA for a given value of the AUC increased with increasing age. The AUC vs AA plots for all the patient ages indicate that, for the current guidelines, the newborn and 10- and 15-yr phantoms had similar IQ for the same AA suggested by the North American expert consensus guidelines, but the 5- and 1-yr phantoms had lower IQ. The results also showed that girth has a stronger correlation with the needed AA to provide a constant AUC for 99mTc-DMSA renal SPECT.
CONCLUSIONS The results suggest that (a) weight-based scaling is not sufficient to equalize task-based IQ for patients of different weights in pediatric 99mTc-DMSA renal SPECT; and (b) patient girth should be considered instead of weight in developing new administration guidelines for pediatric patients.
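Because the AUC rises monotonically with AA, the AA needed to reach a fixed AUC can be read off an AUC-versus-AA curve by interpolation. The sketch below shows one way this could be done; it is not the procedure used in the paper. The function name, the relative AA levels, and the AUC values are hypothetical, and linear interpolation between measured points is an assumption.

```python
import numpy as np


def activity_for_target_auc(relative_aa, auc, target_auc):
    """Interpolate the relative AA at which the AUC curve reaches target_auc."""
    relative_aa = np.asarray(relative_aa, dtype=float)
    auc = np.asarray(auc, dtype=float)
    # np.interp needs increasing x values, so the curve must be monotonic.
    if np.any(np.diff(auc) <= 0):
        raise ValueError("AUC must increase strictly with administered activity")
    if not auc[0] <= target_auc <= auc[-1]:
        raise ValueError("target AUC lies outside the measured range")
    return float(np.interp(target_auc, auc, relative_aa))


# Hypothetical measurements at six AA levels relative to the guideline AA.
levels = [0.25, 0.50, 0.75, 1.00, 1.25, 1.50]
auc_values = [0.70, 0.78, 0.82, 0.85, 0.87, 0.88]

needed = activity_for_target_auc(levels, auc_values, target_auc=0.80)
print(f"Relative AA needed for AUC = 0.80: {needed:.3f}")
```

Repeating this for each phantom gives one required AA per patient, which could then be compared against weight or girth as candidate predictors.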
Affiliation(s)
- Ye Li
- Department of Electrical and Computer Engineering, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, 21218, USA; The Russell H Morgan Department of Radiology and Radiological Science, School of Medicine, Johns Hopkins University, Baltimore, MD, 21287, USA
- Shannon O'Reilly
- Department of Radiation Oncology, University of Pennsylvania, Philadelphia, PA, 19104, USA
- Donika Plyku
- The Russell H Morgan Department of Radiology and Radiological Science, School of Medicine, Johns Hopkins University, Baltimore, MD, 21287, USA
- S Ted Treves
- Department of Radiology, Brigham and Women's Hospital, Boston, MA, 02115, USA; Department of Radiology, Harvard Medical School, Boston, MA, 02115, USA
- Frederic Fahey
- Department of Radiology, Harvard Medical School, Boston, MA, 02115, USA; Department of Radiology, Boston Children's Hospital, Boston, MA, 02115, USA
- Yong Du
- The Russell H Morgan Department of Radiology and Radiological Science, School of Medicine, Johns Hopkins University, Baltimore, MD, 21287, USA
- Xinhua Cao
- Department of Radiology, Harvard Medical School, Boston, MA, 02115, USA; Department of Radiology, Boston Children's Hospital, Boston, MA, 02115, USA
- Justin Brown
- J. Crayton Pruitt Family Department of Biomedical Engineering, University of Florida, Gainesville, FL, 32611, USA
- George Sgouros
- The Russell H Morgan Department of Radiology and Radiological Science, School of Medicine, Johns Hopkins University, Baltimore, MD, 21287, USA; School of Medicine, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, 21287, USA
- Wesley E Bolch
- J. Crayton Pruitt Family Department of Biomedical Engineering, University of Florida, Gainesville, FL, 32611, USA
- Eric C Frey
- Department of Electrical and Computer Engineering, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, 21218, USA; The Russell H Morgan Department of Radiology and Radiological Science, School of Medicine, Johns Hopkins University, Baltimore, MD, 21287, USA; School of Medicine, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, 21287, USA
3
Li Y, O'Reilly S, Plyku D, Treves ST, Du Y, Fahey F, Cao X, Jha AK, Sgouros G, Bolch WE, Frey EC. A projection image database to investigate factors affecting image quality in weight-based dosing: application to pediatric renal SPECT. Phys Med Biol 2018; 63:145004. [PMID: 29893291 DOI: 10.1088/1361-6560/aacbf0] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Indexed: 11/11/2022]
Abstract
Balancing the tradeoff between radiation dose, acquisition duration and diagnostic image quality is essential for medical imaging modalities involving ionizing radiation. Lower administered activities to the patient can reduce absorbed dose, but can result in reduced diagnostic image quality or require longer acquisition durations. In pediatric nuclear medicine, it is desirable to use the lowest amount of administered radiopharmaceutical activity and the shortest acquisition duration that gives sufficient image quality for clinical diagnosis. However, diagnostic image quality is a complex function of patient factors including body morphometry. In this study, we present a digital population of 90 computational anatomic phantoms that model realistic variations in body morphometry and internal anatomy. These phantoms were used to generate a large database of projection images modeling pediatric SPECT imaging using a 99mTc-DMSA tracer. We used an analytic projection code that models attenuation, spatially varying collimator-detector response, and object-dependent scatter to generate the projections. The projections for each organ were generated separately and can be subsequently scaled by parameters extracted from a pharmacokinetics model to simulate realistic tracer biodistribution, including variations in uptake, inside each relevant organ or tissue structure for a given tracer. Noise-free projection images can be obtained by summing these individual organ projections and scaling by the system sensitivity and acquisition duration. We applied this database in the context of 99mTc-DMSA renal SPECT, the most common nuclear medicine imaging procedure in pediatric patients. Organ uptake fractions based on literature values and patient studies were used. Patient SPECT images were used to verify that the sum of counts in the simulated projection images was clinically realistic. 
For each phantom, 384 uptake realizations, modeling random variations in the uptakes of organs of interest, were generated, producing 34 560 noise-free projection datasets (384 uptake realizations times 90 phantoms). Noisy images modeling various count levels (corresponding to different products of acquisition duration and administered activity) were generated by appropriately scaling these images and simulating Poisson noise. Acquisition duration was fixed; six count levels were simulated corresponding to projection images acquired using 25%, 50%, 75%, 100%, 125%, and 150% of the original weight-based administered activity as computed using the North American Guidelines (Gelfand et al 2011 J. Nucl. Med. 52 318-22). Combined, a total of 207 360 noisy projection images were generated, creating a realistic projection database for use in renal pediatric SPECT imaging research. The phantoms and projection datasets were used to calculate three surrogate indices for factors affecting image quality: renal count density, average radius of rotation, and scatter-to-primary ratio. Differences in these indices were seen across the phantoms for dosing based on current guidelines, and especially for the phantom modeling the newborn. We also performed an image quality study using an anthropomorphic model observer that demonstrates that the weight-based dose scaling does not equalize image quality as measured by the area under the receiver-operating characteristics curve. These studies suggest that a dosing procedure beyond weight-based scaling of administered activities is required to equalize image quality in pediatric renal SPECT.
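The dataset counts and the noise-generation step described in this abstract can be sketched in a few lines. This is an illustration only and not the authors' code: the function name, the toy 64 x 64 projection, its count value, and the random seed are hypothetical, while the phantom count, the number of uptake realizations, and the six activity levels come from the abstract.

```python
import numpy as np

N_PHANTOMS = 90
N_UPTAKE_REALIZATIONS = 384
# Fractions of the guideline administered activity (acquisition time fixed).
ACTIVITY_LEVELS = (0.25, 0.50, 0.75, 1.00, 1.25, 1.50)


def noisy_projections(noise_free, levels=ACTIVITY_LEVELS, seed=0):
    """Scale a noise-free projection to each count level and add Poisson noise."""
    rng = np.random.default_rng(seed)
    noise_free = np.asarray(noise_free, dtype=float)
    # Mean counts scale linearly with activity; Poisson sampling adds the noise.
    return {level: rng.poisson(level * noise_free) for level in levels}


n_noise_free = N_PHANTOMS * N_UPTAKE_REALIZATIONS   # 34 560 datasets
n_noisy = n_noise_free * len(ACTIVITY_LEVELS)       # 207 360 images

# Hypothetical toy projection with a uniform mean of 20 counts per pixel.
toy_projection = np.full((64, 64), 20.0)
noisy = noisy_projections(toy_projection)
print(n_noise_free, n_noisy, noisy[1.0].shape)
```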
Affiliation(s)
- Ye Li
- Department of Electrical and Computer Engineering, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD 21218, United States of America. The Russell H Morgan Department of Radiology and Radiological Science, School of Medicine, Johns Hopkins University, Baltimore, MD 21287, United States of America
4
Elshahaby FEA, Jha AK, Ghaly M, Frey EC. A comparison of resampling schemes for estimating model observer performance with small ensembles. Phys Med Biol 2017; 62:7300-7320. [PMID: 28829044 DOI: 10.1088/1361-6560/aa807a] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 11/11/2022]
Abstract
In objective assessment of image quality, an ensemble of images is used to compute the 1st and 2nd order statistics of the data. Often, only a finite number of images is available, leading to the issue of statistical variability in numerical observer performance. Resampling-based strategies can help overcome this issue. In this paper, we compared different combinations of resampling schemes (the leave-one-out (LOO) and the half-train/half-test (HT/HT)) and model observers (the conventional channelized Hotelling observer (CHO), channelized linear discriminant (CLD) and channelized quadratic discriminant). Observer performance was quantified by the area under the ROC curve (AUC). For a binary classification task and for each observer, the AUC value for an ensemble size of 2000 samples per class served as a gold standard for that observer. Results indicated that each observer yielded a different performance depending on the ensemble size and the resampling scheme. For a small ensemble size, the combination [CHO, HT/HT] had more accurate rankings than the combination [CHO, LOO]. Using the LOO scheme, the CLD and CHO had similar performance for large ensembles. However, the CLD outperformed the CHO and gave more accurate rankings for smaller ensembles. As the ensemble size decreased, the performance of the [CHO, LOO] combination seriously deteriorated as opposed to the [CLD, LOO] combination. Thus, it might be desirable to use the CLD with the LOO scheme when smaller ensemble size is available.
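The two resampling schemes compared here can be sketched for a generic Hotelling-type linear template applied to channel outputs. This is a hedged illustration, not the paper's implementation: it does not reproduce the distinction between the CHO and the CLD, all function names are hypothetical, and the synthetic 4-channel Gaussian data, ensemble size, and signal shift are assumptions.

```python
import numpy as np


def hotelling_template(absent, present):
    """Linear template w = K^{-1} (mean_present - mean_absent)."""
    delta = present.mean(axis=0) - absent.mean(axis=0)
    cov = 0.5 * (np.cov(absent, rowvar=False) + np.cov(present, rowvar=False))
    return np.linalg.solve(cov, delta)


def auc_mann_whitney(t_absent, t_present):
    """Nonparametric AUC: fraction of correctly ordered pairs, ties count half."""
    diff = t_present[:, None] - t_absent[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())


def auc_half_train_half_test(absent, present):
    """Train the template on the first half, test on the second half."""
    h_a, h_p = len(absent) // 2, len(present) // 2
    w = hotelling_template(absent[:h_a], present[:h_p])
    return auc_mann_whitney(absent[h_a:] @ w, present[h_p:] @ w)


def auc_leave_one_out(absent, present):
    """Test each sample with a template trained on all the other samples."""
    t_a = np.array([
        absent[i] @ hotelling_template(np.delete(absent, i, axis=0), present)
        for i in range(len(absent))
    ])
    t_p = np.array([
        present[j] @ hotelling_template(absent, np.delete(present, j, axis=0))
        for j in range(len(present))
    ])
    return auc_mann_whitney(t_a, t_p)


# Synthetic channel outputs: 40 samples per class, 4 channels (assumed values).
rng = np.random.default_rng(0)
absent = rng.normal(size=(40, 4))
present = rng.normal(size=(40, 4)) + np.array([2.0, 0.0, 0.0, 0.0])

auc_hths = auc_half_train_half_test(absent, present)
auc_loo = auc_leave_one_out(absent, present)
print(f"HT/HT AUC: {auc_hths:.3f}, LOO AUC: {auc_loo:.3f}")
```

In the leave-one-out scheme each test statistic comes from a slightly different template, which is one reason the two schemes can rank systems differently when the ensemble is small.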
Affiliation(s)
- Fatma E A Elshahaby
- Department of Electrical and Computer Engineering, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD 21218, United States of America. The Russell H Morgan Department of Radiology and Radiological Science, School of Medicine, Johns Hopkins University, Baltimore, MD 21287, United States of America. Department of Computers and Systems, Electronics Research Institute, Cairo, Egypt
5
Li X, Jha AK, Ghaly M, Elshahaby FEA, Links JM, Frey EC. Use of Sub-Ensembles and Multi-Template Observers to Evaluate Detection Task Performance for Data That are Not Multivariate Normal. IEEE Trans Med Imaging 2017; 36:917-929. [PMID: 28026757 PMCID: PMC5496770 DOI: 10.1109/tmi.2016.2643684] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Indexed: 05/14/2023]
Abstract
The Hotelling Observer (HO) is widely used to evaluate image quality in medical imaging. However, applying it to data that are not multivariate-normally (MVN) distributed is not optimal. In this paper, we apply two multi-template linear observer strategies to handle such data. First, the entire data ensemble is divided into sub-ensembles that are exactly or approximately MVN and homoscedastic. Next, a different linear observer template is estimated for and applied to each sub-ensemble. The first multi-template strategy, adapted from previous work, applies the HO to each sub-ensemble, calculates the area under the receiver operating characteristics curve (AUC) for each sub-ensemble, and averages the AUCs from all the sub-ensembles. The second strategy applies the Linear Discriminant (LD) to estimate test statistics for each sub-ensemble and calculates a single global AUC using the pooled test statistics from all the sub-ensembles. We show that this second strategy produces the maximum AUC when only shifting of the HO test statistics is allowed. We compared these strategies to the use of a single HO template for the entire data ensemble by applying them to the non-MVN data obtained from reconstructed images of a realistic simulated population of myocardial perfusion SPECT studies with the goal of optimizing the reconstruction parameters. Of the strategies investigated, the multi-template LD strategy yielded the highest AUC for any given set of reconstruction parameters. The optimal reconstruction parameters obtained by the two multi-template strategies were comparable and produced higher AUCs for each sub-ensemble than the single-template HO strategy.
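The difference between the two multi-template strategies (averaging per-sub-ensemble AUCs versus pooling test statistics into one global AUC) can be sketched as follows. This is an illustrative reading of the abstract, not the authors' code. It assumes equal class priors, so the linear discriminant statistic is the Hotelling statistic shifted by a per-sub-ensemble offset; it trains and tests on the same samples for brevity; and the function names and synthetic data are hypothetical.

```python
import numpy as np


def auc_mann_whitney(t_absent, t_present):
    """Nonparametric AUC: fraction of correctly ordered pairs, ties count half."""
    diff = t_present[:, None] - t_absent[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())


def ld_statistics(absent, present):
    """Linear discriminant statistics: Hotelling template plus a centering offset."""
    mu_a, mu_p = absent.mean(axis=0), present.mean(axis=0)
    cov = 0.5 * (np.cov(absent, rowvar=False) + np.cov(present, rowvar=False))
    w = np.linalg.solve(cov, mu_p - mu_a)
    offset = 0.5 * w @ (mu_a + mu_p)  # equal-prior threshold shift (assumption)
    return absent @ w - offset, present @ w - offset


def averaged_auc(sub_ensembles):
    """Strategy 1: one AUC per sub-ensemble, then average the AUCs."""
    return float(np.mean([auc_mann_whitney(*ld_statistics(a, p))
                          for a, p in sub_ensembles]))


def pooled_auc(sub_ensembles):
    """Strategy 2: pool the shifted statistics and compute one global AUC."""
    stats = [ld_statistics(a, p) for a, p in sub_ensembles]
    t_a = np.concatenate([s[0] for s in stats])
    t_p = np.concatenate([s[1] for s in stats])
    return auc_mann_whitney(t_a, t_p)


# Two synthetic sub-ensembles with different backgrounds (assumed values).
rng = np.random.default_rng(0)
signal = np.array([2.0, 0.0, 0.0, 0.0])
sub_ensembles = []
for background in (0.0, 5.0):
    absent = rng.normal(loc=background, size=(50, 4))
    present = rng.normal(loc=background, size=(50, 4)) + signal
    sub_ensembles.append((absent, present))

auc_avg = averaged_auc(sub_ensembles)
auc_pool = pooled_auc(sub_ensembles)
print(f"Averaged AUC: {auc_avg:.3f}, pooled AUC: {auc_pool:.3f}")
```

The offset leaves each sub-ensemble's own AUC unchanged, so it matters only for the pooled strategy, where it aligns the statistics from different sub-ensembles on a common scale before they are combined.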