1
Wu D, Smith D, VanBerlo B, Roshankar A, Lee H, Li B, Ali F, Rahman M, Basmaji J, Tschirhart J, Ford A, VanBerlo B, Durvasula A, Vannelli C, Dave C, Deglint J, Ho J, Chaudhary R, Clausdorff H, Prager R, Millington S, Shah S, Buchanan B, Arntfield R. Improving the Generalizability and Performance of an Ultrasound Deep Learning Model Using Limited Multicenter Data for Lung Sliding Artifact Identification. Diagnostics (Basel) 2024; 14:1081. [PMID: 38893608] [PMCID: PMC11172006] [DOI: 10.3390/diagnostics14111081] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/06/2024] [Revised: 05/18/2024] [Accepted: 05/20/2024] [Indexed: 06/21/2024] Open Access
Abstract
Deep learning (DL) models for medical image classification frequently struggle to generalize to data from outside institutions. Additional clinical data are also rarely collected to comprehensively assess and understand model performance amongst subgroups. Following the development of a single-center model to identify the lung sliding artifact on lung ultrasound (LUS), we pursued a validation strategy using external LUS data. As annotated LUS data are relatively scarce compared to other medical imaging data, we adopted a novel technique to optimize the use of limited external data to improve model generalizability. Externally acquired LUS data from three tertiary care centers, totaling 641 clips from 238 patients, were used to assess the baseline generalizability of our lung sliding model. We then employed our novel Threshold-Aware Accumulative Fine-Tuning (TAAFT) method to fine-tune the baseline model and determine the minimum amount of data required to achieve predefined performance goals. A subgroup analysis was also performed, and Grad-CAM++ explanations were examined. The final model was fine-tuned on one-third of the external dataset to achieve 0.917 sensitivity, 0.817 specificity, and 0.920 area under the receiver operating characteristic curve (AUC) on the external validation dataset, exceeding our predefined performance goals. Subgroup analyses identified the LUS characteristics that most challenged the model's performance. Grad-CAM++ saliency maps highlighted clinically relevant regions on M-mode images. We report a multicenter study that exploits limited available external data to improve the generalizability and performance of our lung sliding model while identifying poorly performing subgroups to inform future iterative improvements. This approach may contribute to efficiencies for DL researchers working with smaller quantities of external validation data.
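The headline metrics in this abstract (sensitivity and specificity at a decision threshold, plus AUC) can be reproduced with a few lines of code. The sketch below is illustrative only: the labels, scores, and 0.5 threshold are invented stand-ins, not data or thresholds from the study.

```python
# Threshold-based sensitivity/specificity plus a rank-based (Mann-Whitney) AUC.
# All inputs here are made-up examples, not data from the paper.

def confusion(labels, scores, threshold=0.5):
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp, fp, tn, fn

def sensitivity(labels, scores, threshold=0.5):
    tp, _, _, fn = confusion(labels, scores, threshold)
    return tp / (tp + fn)

def specificity(labels, scores, threshold=0.5):
    _, fp, tn, _ = confusion(labels, scores, threshold)
    return tn / (tn + fp)

def auc(labels, scores):
    # Probability that a random positive outranks a random negative.
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
print(f"sens={sensitivity(labels, scores):.3f} "
      f"spec={specificity(labels, scores):.3f} "
      f"auc={auc(labels, scores):.3f}")  # sens=0.667 spec=0.667 auc=0.889
```

A fine-tuning loop like TAAFT would evaluate checkpoints against predefined goals such as these; the accumulative data-selection logic itself is described in the paper, not reproduced here.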
Affiliation(s)
- Derek Wu
- Department of Medicine, Western University, London, ON N6A 5C1, Canada
- Delaney Smith
- Faculty of Mathematics, University of Waterloo, Waterloo, ON N2L 3G1, Canada
- Blake VanBerlo
- Faculty of Mathematics, University of Waterloo, Waterloo, ON N2L 3G1, Canada
- Amir Roshankar
- Faculty of Engineering, University of Waterloo, Waterloo, ON N2L 3G1, Canada
- Hoseok Lee
- Faculty of Mathematics, University of Waterloo, Waterloo, ON N2L 3G1, Canada
- Brian Li
- Faculty of Engineering, University of Waterloo, Waterloo, ON N2L 3G1, Canada
- Faraz Ali
- Faculty of Engineering, University of Waterloo, Waterloo, ON N2L 3G1, Canada
- Marwan Rahman
- Faculty of Engineering, University of Waterloo, Waterloo, ON N2L 3G1, Canada
- John Basmaji
- Division of Critical Care Medicine, Western University, London, ON N6A 5C1, Canada
- Jared Tschirhart
- Schulich School of Medicine and Dentistry, Western University, London, ON N6A 5C1, Canada
- Alex Ford
- Independent Researcher, London, ON N6A 1L8, Canada
- Bennett VanBerlo
- Faculty of Engineering, Western University, London, ON N6A 5C1, Canada
- Ashritha Durvasula
- Schulich School of Medicine and Dentistry, Western University, London, ON N6A 5C1, Canada
- Claire Vannelli
- Schulich School of Medicine and Dentistry, Western University, London, ON N6A 5C1, Canada
- Chintan Dave
- Division of Critical Care Medicine, Western University, London, ON N6A 5C1, Canada
- Jason Deglint
- Faculty of Engineering, University of Waterloo, Waterloo, ON N2L 3G1, Canada
- Jordan Ho
- Department of Family Medicine, Western University, London, ON N6A 5C1, Canada
- Rushil Chaudhary
- Department of Medicine, Western University, London, ON N6A 5C1, Canada
- Hans Clausdorff
- Departamento de Medicina de Urgencia, Pontificia Universidad Católica de Chile, Santiago 8331150, Chile
- Ross Prager
- Division of Critical Care Medicine, Western University, London, ON N6A 5C1, Canada
- Scott Millington
- Department of Critical Care Medicine, University of Ottawa, Ottawa, ON K1N 6N5, Canada
- Samveg Shah
- Department of Medicine, University of Alberta, Edmonton, AB T6G 2R3, Canada
- Brian Buchanan
- Department of Critical Care, University of Alberta, Edmonton, AB T6G 2R3, Canada
- Robert Arntfield
- Division of Critical Care Medicine, Western University, London, ON N6A 5C1, Canada
2
Rajaraman S, Zamzmi G, Yang F, Liang Z, Xue Z, Antani S. Uncovering the effects of model initialization on deep model generalization: A study with adult and pediatric chest X-ray images. PLOS Digital Health 2024; 3:e0000286. [PMID: 38232121] [DOI: 10.1371/journal.pdig.0000286] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/30/2023] [Accepted: 12/04/2023] [Indexed: 01/19/2024]
Abstract
Model initialization techniques are vital for improving the performance and reliability of deep learning models in medical computer vision applications. While much literature exists on non-medical images, the impacts on medical images, particularly chest X-rays (CXRs), are less understood. Addressing this gap, our study explores three deep model initialization techniques: Cold-start, Warm-start, and Shrink and Perturb start, focusing on adult and pediatric populations. We specifically focus on scenarios with periodically arriving data for training, thereby embracing the real-world scenario of an ongoing data influx and the need for model updates. We evaluate these models for generalizability against external adult and pediatric CXR datasets. We also propose novel ensemble methods: F-score-weighted Sequential Least-Squares Quadratic Programming (F-SLSQP) and Attention-Guided Ensembles with Learnable Fuzzy Softmax, which aggregate weight parameters from multiple models to capitalize on their collective knowledge and complementary representations. We perform statistical significance tests with 95% confidence intervals and p-values to analyze model performance. Our evaluations indicate that models initialized with ImageNet-pretrained weights demonstrate superior generalizability over randomly initialized counterparts, contradicting some findings for non-medical images. Notably, ImageNet-pretrained models exhibit consistent performance during internal and external testing across different training scenarios. Weight-level ensembles of these models show significantly higher recall (p<0.05) during testing compared to individual models. Thus, our study accentuates the benefits of ImageNet-pretrained weight initialization, especially when used with weight-level ensembles, for creating robust and generalizable deep learning solutions.
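The "Shrink and Perturb" warm-start evaluated above shrinks each existing weight toward zero and adds small Gaussian noise before training resumes on newly arrived data. A minimal sketch, with illustrative `lam`/`sigma` values (the study's hyperparameters are not given in the abstract):

```python
import random

def shrink_and_perturb(weights, lam=0.5, sigma=0.01, seed=0):
    """Shrink every weight toward zero by factor lam, then jitter with noise.

    lam and sigma are illustrative defaults, not the study's settings.
    """
    rng = random.Random(seed)
    return [lam * w + rng.gauss(0.0, sigma) for w in weights]

old = [0.8, -1.2, 0.05]          # toy flattened weight vector
new = shrink_and_perturb(old)    # each entry is ~0.5 * old, plus small noise
```

A cold start would replace `old` with fresh random values, and a plain warm start would reuse `old` unchanged; shrink-and-perturb sits between the two, retaining direction while restoring trainability.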
Affiliation(s)
- Sivaramakrishnan Rajaraman
- Computational Health Research Branch, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America
- Ghada Zamzmi
- Computational Health Research Branch, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America
- Feng Yang
- Computational Health Research Branch, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America
- Zhaohui Liang
- Computational Health Research Branch, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America
- Zhiyun Xue
- Computational Health Research Branch, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America
- Sameer Antani
- Computational Health Research Branch, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America
3
Rajaraman S, Yang F, Zamzmi G, Xue Z, Antani S. Can Deep Adult Lung Segmentation Models Generalize to the Pediatric Population? Expert Systems with Applications 2023; 229:120531. [PMID: 37397242] [PMCID: PMC10310063] [DOI: 10.1016/j.eswa.2023.120531] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/04/2023]
Abstract
Lung segmentation in chest X-rays (CXRs) is an important prerequisite for improving the specificity of diagnoses of cardiopulmonary diseases in a clinical decision support system. Current deep learning models for lung segmentation are trained and evaluated on CXR datasets in which the radiographic projections are captured predominantly from the adult population. However, the shape of the lungs is reported to differ significantly across the developmental stages from infancy to adulthood. This may result in age-related data domain shifts that adversely impact lung segmentation performance when models trained on the adult population are deployed for pediatric lung segmentation. In this work, our goal is to (i) analyze the generalizability of deep adult lung segmentation models to the pediatric population and (ii) improve performance through a stage-wise, systematic approach consisting of CXR modality-specific weight initializations, stacked ensembles, and an ensemble of stacked ensembles. To evaluate segmentation performance and generalizability, novel evaluation metrics consisting of mean lung contour distance (MLCD) and average hash score (AHS) are proposed in addition to the multi-scale structural similarity index measure (MS-SSIM), intersection over union (IoU), Dice score, 95% Hausdorff distance (HD95), and average symmetric surface distance (ASSD). Our results showed a significant improvement (p < 0.05) in cross-domain generalization through our approach. This study could serve as a paradigm to analyze the cross-domain generalizability of deep segmentation models for other medical imaging modalities and applications.
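Two of the standard overlap metrics named in this abstract, Dice and IoU, are easy to state concretely. The sketch below operates on flattened binary masks; the example masks are invented for illustration:

```python
# Dice and IoU for binary segmentation masks given as flat 0/1 lists.
# The example masks are toy data, not from the study.

def dice(a, b):
    inter = sum(x & y for x, y in zip(a, b))
    return 2 * inter / (sum(a) + sum(b))

def iou(a, b):
    inter = sum(x & y for x, y in zip(a, b))
    union = sum(x | y for x, y in zip(a, b))
    return inter / union

a = [1, 1, 0, 0]  # "ground truth" mask, flattened
b = [1, 0, 1, 0]  # "prediction" mask, flattened
# overlap = 1 pixel, so dice = 2*1/(2+2) = 0.5 and iou = 1/3
```

Boundary-sensitive metrics such as HD95 and ASSD, and the paper's proposed MLCD and AHS, require contour extraction and are not sketched here.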
Affiliation(s)
- Sivaramakrishnan Rajaraman
- Computational Health Research Branch, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Feng Yang
- Computational Health Research Branch, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Ghada Zamzmi
- Computational Health Research Branch, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Zhiyun Xue
- Computational Health Research Branch, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Sameer Antani
- Computational Health Research Branch, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
4
Krokos G, MacKewn J, Dunn J, Marsden P. A review of PET attenuation correction methods for PET-MR. EJNMMI Phys 2023; 10:52. [PMID: 37695384] [PMCID: PMC10495310] [DOI: 10.1186/s40658-023-00569-0] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 04/14/2023] [Accepted: 08/07/2023] [Indexed: 09/12/2023] Open Access
Abstract
Although it has been thirteen years since the installation of the first PET-MR system, these scanners constitute a very small proportion of the total hybrid PET systems installed. This is in stark contrast to the rapid expansion of PET-CT, which quickly established its importance in patient diagnosis within a similar timeframe. One of the main hurdles is the development of an accurate, reproducible and easy-to-use method for attenuation correction. Quantitative discrepancies in PET images between the manufacturer-provided MR methods and the more established CT- or transmission-based attenuation correction methods have led the scientific community into a continuous effort to develop a robust and accurate alternative. The approaches can be divided into four broad categories: (i) MR-based, (ii) emission-based, (iii) atlas-based and (iv) machine learning-based attenuation correction, the last of which is rapidly gaining momentum. The first is based on segmenting the MR images into various tissues and allocating a predefined attenuation coefficient to each tissue. Emission-based attenuation correction methods aim to utilise the PET emission data by simultaneously reconstructing the radioactivity distribution and the attenuation image. Atlas-based attenuation correction methods aim to predict a CT or transmission image for a new patient, given that patient's MR image, by using databases containing CT or transmission images from the general population. Finally, in machine learning methods, a model that predicts the required image from the acquired MR or non-attenuation-corrected PET image is developed by exploiting the underlying features of the images. Deep learning methods are the dominant approach in this category. Compared to more traditional machine learning, which uses structured data to build a model, deep learning makes direct use of the acquired images to identify underlying features.
This up-to-date review goes through the literature of attenuation correction approaches in PET-MR after categorising them. The various approaches in each category are described and discussed. After exploring each category separately, a general overview is given of the current status and potential future approaches along with a comparison of the four outlined categories.
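The first category described in this abstract (segmentation-based MR attenuation correction) amounts to a lookup from segmented tissue class to a predefined linear attenuation coefficient at 511 keV. The sketch below uses approximate, commonly quoted coefficient values; exact values vary between implementations:

```python
# Segmentation-based MRAC sketch: map each segmented tissue class to an
# approximate linear attenuation coefficient (cm^-1) at 511 keV.
# Coefficient values are ballpark figures for illustration only.
MU_511 = {
    "air": 0.0,
    "lung": 0.018,
    "soft_tissue": 0.096,
    "bone": 0.13,
}

def mu_map(tissue_labels):
    """Convert a 2-D grid of tissue-class labels into an attenuation map."""
    return [[MU_511[t] for t in row] for row in tissue_labels]

seg = [["air", "soft_tissue"],
       ["lung", "bone"]]       # toy 2x2 segmentation
mu = mu_map(seg)
```

The known weakness of this category, noted in the review literature, is that early MR segmentation methods could not distinguish bone from air, so both were often assigned soft-tissue or air coefficients, biasing quantification near bone.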
Affiliation(s)
- Georgios Krokos
- School of Biomedical Engineering and Imaging Sciences, The PET Centre at St Thomas' Hospital London, King's College London, 1st Floor Lambeth Wing, Westminster Bridge Road, London, SE1 7EH, UK
- Jane MacKewn
- School of Biomedical Engineering and Imaging Sciences, The PET Centre at St Thomas' Hospital London, King's College London, 1st Floor Lambeth Wing, Westminster Bridge Road, London, SE1 7EH, UK
- Joel Dunn
- School of Biomedical Engineering and Imaging Sciences, The PET Centre at St Thomas' Hospital London, King's College London, 1st Floor Lambeth Wing, Westminster Bridge Road, London, SE1 7EH, UK
- Paul Marsden
- School of Biomedical Engineering and Imaging Sciences, The PET Centre at St Thomas' Hospital London, King's College London, 1st Floor Lambeth Wing, Westminster Bridge Road, London, SE1 7EH, UK
5
Beheshtian E, Putman K, Santomartino SM, Parekh VS, Yi PH. Generalizability and Bias in a Deep Learning Pediatric Bone Age Prediction Model Using Hand Radiographs. Radiology 2023; 306:e220505. [PMID: 36165796] [DOI: 10.1148/radiol.220505] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Indexed: 01/26/2023]
Abstract
Background Although deep learning (DL) models have demonstrated expert-level ability for pediatric bone age prediction, they have shown poor generalizability and bias in other use cases. Purpose To quantify generalizability and bias in a bone age DL model measured by performance on external versus internal test sets and performance differences between different demographic groups, respectively. Materials and Methods The winning DL model of the 2017 RSNA Pediatric Bone Age Challenge was retrospectively evaluated and trained on 12 611 pediatric hand radiographs from two U.S. hospitals. The DL model was tested from September 2021 to December 2021 on an internal validation set and an external test set of pediatric hand radiographs with diverse demographic representation. Images reporting ground-truth bone age were included for study. Mean absolute difference (MAD) between ground-truth bone age and the model prediction bone age was calculated for each set. Generalizability was evaluated by comparing MAD between internal and external evaluation sets with use of t tests. Bias was evaluated by comparing MAD and clinically significant error rate (rate of errors changing the clinical diagnosis) between demographic groups with use of t tests or analysis of variance and χ2 tests, respectively (statistically significant difference defined as P < .05). Results The internal validation set had images from 1425 individuals (773 boys), and the external test set had images from 1202 individuals (mean age, 133 months ± 60 [SD]; 614 boys). The bone age model generalized well to the external test set, with no difference in MAD (6.8 months in the validation set vs 6.9 months in the external set; P = .64). Model predictions would have led to clinically significant errors in 194 of 1202 images (16%) in the external test set. 
The MAD was greater for girls than boys in the internal validation set (P = .01) and in the subcategories of age and Tanner stage in the external test set (P < .001 for both). Conclusion A deep learning (DL) bone age model generalized well to an external test set, although clinically significant sex-, age-, and sexual maturity-based biases in DL bone age prediction were identified. © RSNA, 2022. Online supplemental material is available for this article. See also the editorial by Larson in this issue.
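The study's headline metric, mean absolute difference (MAD) between ground-truth and predicted bone age, is simple to compute. The sketch below also includes a threshold-based error rate as a crude stand-in for the paper's diagnosis-change criterion; the 12-month cutoff and the example ages are illustrative assumptions, not the study's definitions:

```python
# MAD and a simple threshold-based error rate for bone-age predictions
# (ages in months). All numbers below are invented examples.

def mean_absolute_difference(truth, pred):
    """MAD between ground-truth and predicted bone ages."""
    return sum(abs(t - p) for t, p in zip(truth, pred)) / len(truth)

def significant_error_rate(truth, pred, threshold=12.0):
    """Fraction of predictions off by more than `threshold` months.

    A stand-in for the paper's 'clinically significant error' criterion,
    which is defined by change in clinical diagnosis, not a fixed cutoff.
    """
    return sum(abs(t - p) > threshold for t, p in zip(truth, pred)) / len(truth)

truth = [120, 60, 96, 140]
pred = [126, 58, 90, 120]
# absolute errors: 6, 2, 6, 20 -> MAD = 34/4 = 8.5 months; one error > 12 months
```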
Affiliation(s)
- Elham Beheshtian
- From the University of Maryland Medical Intelligent Imaging (UM2ii) Center, Department of Diagnostic Radiology and Nuclear Medicine, University of Maryland School of Medicine, 670 W Baltimore St, First Floor, Room 1172, Baltimore, MD 21201
- Kristin Putman
- From the University of Maryland Medical Intelligent Imaging (UM2ii) Center, Department of Diagnostic Radiology and Nuclear Medicine, University of Maryland School of Medicine, 670 W Baltimore St, First Floor, Room 1172, Baltimore, MD 21201
- Samantha M Santomartino
- From the University of Maryland Medical Intelligent Imaging (UM2ii) Center, Department of Diagnostic Radiology and Nuclear Medicine, University of Maryland School of Medicine, 670 W Baltimore St, First Floor, Room 1172, Baltimore, MD 21201
- Vishwa S Parekh
- From the University of Maryland Medical Intelligent Imaging (UM2ii) Center, Department of Diagnostic Radiology and Nuclear Medicine, University of Maryland School of Medicine, 670 W Baltimore St, First Floor, Room 1172, Baltimore, MD 21201
- Paul H Yi
- From the University of Maryland Medical Intelligent Imaging (UM2ii) Center, Department of Diagnostic Radiology and Nuclear Medicine, University of Maryland School of Medicine, 670 W Baltimore St, First Floor, Room 1172, Baltimore, MD 21201
6
Chua M, Kim D, Choi J, Lee NG, Deshpande V, Schwab J, Lev MH, Gonzalez RG, Gee MS, Do S. Tackling prediction uncertainty in machine learning for healthcare. Nat Biomed Eng 2022. [PMID: 36581695] [DOI: 10.1038/s41551-022-00988-x] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 04/25/2022] [Accepted: 11/17/2022] [Indexed: 12/31/2022]
Abstract
Predictive machine-learning systems often do not convey the degree of confidence in the correctness of their outputs. To prevent unsafe prediction failures from machine-learning models, the users of the systems should be aware of the general accuracy of the model and understand the degree of confidence in each individual prediction. In this Perspective, we convey the need of prediction-uncertainty metrics in healthcare applications, with a focus on radiology. We outline the sources of prediction uncertainty, discuss how to implement prediction-uncertainty metrics in applications that require zero tolerance to errors and in applications that are error-tolerant, and provide a concise framework for understanding prediction uncertainty in healthcare contexts. For machine-learning-enabled automation to substantially impact healthcare, machine-learning models with zero tolerance for false-positive or false-negative errors must be developed intentionally.
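One concrete way to convey per-prediction confidence, in the spirit of this Perspective, is to compute the predictive entropy of the model's class probabilities and defer to a human reader when it is high. A minimal sketch; the 0.3-nat cutoff is an illustrative assumption, not a value from the article:

```python
import math

def predictive_entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def triage(probs, max_entropy=0.3):
    """Route confident predictions to automation; defer uncertain ones.

    max_entropy is an invented threshold for illustration.
    """
    return "defer" if predictive_entropy(probs) > max_entropy else "auto"

confident = triage([0.95, 0.05])   # low entropy -> handled automatically
uncertain = triage([0.60, 0.40])   # high entropy -> deferred to a human
```

More elaborate sources of uncertainty (model/epistemic vs. data/aleatoric) need techniques such as ensembling or Monte Carlo dropout, but the triage decision at the end takes the same shape.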
Affiliation(s)
- Michelle Chua
- Department of Radiology, Massachusetts General Hospital, Boston, MA, USA
- Doyun Kim
- Department of Radiology, Massachusetts General Hospital, Boston, MA, USA
- Jongmun Choi
- Department of Radiology, Massachusetts General Hospital, Boston, MA, USA
- Nahyoung G Lee
- Department of Ophthalmology, Massachusetts Eye and Ear Infirmary, Boston, MA, USA
- Vikram Deshpande
- Department of Pathology, Massachusetts General Hospital, Boston, MA, USA
- Joseph Schwab
- Department of Orthopedic Surgery, Massachusetts General Hospital, Boston, MA, USA
- Michael H Lev
- Department of Radiology, Massachusetts General Hospital, Boston, MA, USA
- Ramon G Gonzalez
- Department of Radiology, Massachusetts General Hospital, Boston, MA, USA
- Michael S Gee
- Department of Radiology, Massachusetts General Hospital, Boston, MA, USA
- Synho Do
- Department of Radiology, Massachusetts General Hospital, Boston, MA, USA; Department of Pathology, Massachusetts General Hospital, Boston, MA, USA
7
Can images crowdsourced from the internet be used to train generalizable joint dislocation deep learning algorithms? Skeletal Radiol 2022; 51:2121-2128. [PMID: 35624310] [DOI: 10.1007/s00256-022-04077-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/24/2022] [Revised: 05/18/2022] [Accepted: 05/19/2022] [Indexed: 02/02/2023]
Abstract
OBJECTIVE Deep learning has the potential to automatically triage orthopedic emergencies, such as joint dislocations. However, because these injuries are rare, collecting large numbers of images to train algorithms may be infeasible for many centers. We evaluated whether the Internet could be used as a source of images to train convolutional neural networks (CNNs) for joint dislocations that would generalize well to real-world clinical cases. METHODS We collected datasets from online radiology repositories of 100 radiographs each (50 dislocated, 50 located) for four joints: native shoulder, elbow, hip, and total hip arthroplasty (THA). We trained a variety of CNN binary classifiers, using both on-the-fly and static data augmentation, to identify the various joint dislocations. The best-performing classifier for each joint was evaluated on an external test set of 100 corresponding radiographs (50 dislocations) from three hospitals. CNN performance was evaluated using the area under the ROC curve (AUROC). To determine the areas emphasized by each CNN for decision-making, class activation map (CAM) heatmaps were generated for test images. RESULTS The best-performing CNNs for elbow, hip, shoulder, and THA dislocation achieved high AUROCs on both internal and external test sets (internal/external AUROC): elbow (1.0/0.998), hip (0.993/0.880), shoulder (1.0/0.993), THA (1.0/0.950). Heatmaps demonstrated appropriate emphasis of the joints for both located and dislocated joints. CONCLUSION With modest numbers of images, radiographs from the Internet can be used to train clinically generalizable CNNs for joint dislocations. Given the rarity of joint dislocations at many centers, online repositories may be a viable source of CNN training data.
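The CAM heatmaps mentioned in this abstract are formed by weighting each final-convolutional-layer feature map by the classifier weight for the predicted class and summing. A minimal sketch with tiny invented feature maps (real CAMs are computed from a trained network's activations):

```python
# Class activation map (CAM) sketch: a weighted sum of final-layer
# feature maps, using the classifier weights for the predicted class.
# Feature maps and weights below are toy values for illustration.

def cam(feature_maps, class_weights):
    """Sum the feature maps, each scaled by its class weight."""
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    heat = [[0.0] * w for _ in range(h)]
    for fmap, cw in zip(feature_maps, class_weights):
        for i in range(h):
            for j in range(w):
                heat[i][j] += cw * fmap[i][j]
    return heat

fmaps = [[[1, 0],
          [0, 1]],
         [[0, 1],
          [1, 0]]]                # two toy 2x2 feature maps
heat = cam(fmaps, [1.0, 2.0])     # -> [[1.0, 2.0], [2.0, 1.0]]
```

The heatmap is then upsampled to the input image size and overlaid on the radiograph, which is how the authors verified that the network attended to the joint rather than to background artifacts.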