1. Tang CC, Nagesh S, Fussell DA, Glavis-Bloom J, Mishra N, Li C, Cortes G, Hill R, Zhao J, Gordon A, Wright J, Troutt H, Tarrago R, Chow DS. Generating colloquial radiology reports with large language models. J Am Med Inform Assoc 2024; 31:2660-2667. PMID: 39178375; PMCID: PMC11491646; DOI: 10.1093/jamia/ocae223.
Abstract
OBJECTIVES Patients are increasingly being given direct access to their medical records. However, radiology reports are written for clinicians and typically contain medical jargon, which can be confusing. One solution is for radiologists to provide a "colloquial" version that is accessible to the layperson. Because manually generating these colloquial translations would represent a significant burden for radiologists, a way to automatically produce accurate, accessible patient-facing reports is desired. We propose a novel method to produce colloquial translations of radiology reports by providing specialized prompts to a large language model (LLM). MATERIALS AND METHODS Our method automatically extracts and defines medical terms and includes their definitions in the LLM prompt. Using our method and a naive strategy, translations were generated at 4 different reading levels for 100 de-identified neuroradiology reports from an academic medical center. Translations were evaluated by a panel of radiologists for accuracy, likability, harm potential, and readability. RESULTS Our approach translated the Findings and Impression sections at the eighth-grade level with accuracies of 88% and 93%, respectively. Across all grade levels, our approach was 20% more accurate than the baseline method. Overall, translations were more readable than the original reports, as evaluated using standard readability indices. CONCLUSION We find that our translations at the eighth-grade level strike an optimal balance between accuracy and readability. Notably, this corresponds to nationally recognized recommendations for patient-facing health communication. We believe that using this approach to draft patient-accessible reports will benefit patients without significantly increasing the burden on radiologists.
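The readability comparison in this study relies on standard readability indices. As a rough illustration of how such an index is computed, the sketch below implements the Flesch-Kincaid grade-level formula with a crude heuristic syllable counter; the sample sentences and the syllable heuristic are illustrative assumptions, not the authors' evaluation pipeline.

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count groups of consecutive vowels; always at least 1.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    # Flesch-Kincaid grade level:
    # 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

# Hypothetical report sentence versus a colloquial rewording
original = "Mild periventricular white matter hyperintensities, likely chronic microangiopathic change."
colloquial = "There are small bright spots in the brain that are common with aging."
print(flesch_kincaid_grade(original), flesch_kincaid_grade(colloquial))
```

A lower grade-level score for the colloquial version is the kind of readability gain the abstract describes.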
Affiliation(s)
- Cynthia Crystal Tang: Department of Radiological Sciences, University of California, Irvine, Irvine, CA 92868, United States
- Supriya Nagesh: Amazon Web Services, East Palo Alto, CA 94303, United States
- David A Fussell: Department of Radiological Sciences, University of California, Irvine, Irvine, CA 92868, United States
- Justin Glavis-Bloom: Department of Radiological Sciences, University of California, Irvine, Irvine, CA 92868, United States
- Nina Mishra: Amazon Web Services, East Palo Alto, CA 94303, United States
- Charles Li: Department of Radiological Sciences, University of California, Irvine, Irvine, CA 92868, United States
- Gillean Cortes: Department of Radiological Sciences, University of California, Irvine, Irvine, CA 92868, United States
- Robert Hill: Department of Radiological Sciences, University of California, Irvine, Irvine, CA 92868, United States
- Jasmine Zhao: Department of Radiological Sciences, University of California, Irvine, Irvine, CA 92868, United States
- Angellica Gordon: Department of Radiological Sciences, University of California, Irvine, Irvine, CA 92868, United States
- Joshua Wright: Department of Radiological Sciences, University of California, Irvine, Irvine, CA 92868, United States
- Hayden Troutt: Department of Radiological Sciences, University of California, Irvine, Irvine, CA 92868, United States
- Rod Tarrago: Amazon Web Services, Seattle, WA 98121, United States
- Daniel S Chow: Department of Radiological Sciences, University of California, Irvine, Irvine, CA 92868, United States
2. Sarangi PK, Datta S, Swarup MS, Panda S, Nayak DSK, Malik A, Datta A, Mondal H. Radiologic Decision-Making for Imaging in Pulmonary Embolism: Accuracy and Reliability of Large Language Models-Bing, Claude, ChatGPT, and Perplexity. Indian J Radiol Imaging 2024; 34:653-660. PMID: 39318561; PMCID: PMC11419749; DOI: 10.1055/s-0044-1787974.
Abstract
Background Artificial intelligence chatbots have demonstrated potential to enhance clinical decision-making and streamline health care workflows, potentially alleviating administrative burdens. However, the contribution of AI chatbots to radiologic decision-making for clinical scenarios remains insufficiently explored. This study evaluates the accuracy and reliability of four prominent Large Language Models (LLMs)-Microsoft Bing, Claude, ChatGPT 3.5, and Perplexity-in offering clinical decision support for initial imaging for suspected pulmonary embolism (PE). Methods Open-ended (OE) and select-all-that-apply (SATA) questions were crafted, covering four variants of case scenarios of PE in line with the American College of Radiology Appropriateness Criteria. These questions were presented to the LLMs by three radiologists from diverse geographical regions and practice settings. The responses were evaluated based on established scoring criteria, with a maximum achievable score of 2 points for OE responses and 1 point for each correct answer in SATA questions. To enable comparative analysis, scores were normalized (score divided by the maximum achievable score). Results In OE questions, Perplexity achieved the highest accuracy (0.83), while Claude had the lowest (0.58), with Bing and ChatGPT each scoring 0.75. For SATA questions, Bing led with an accuracy of 0.96, Perplexity was the lowest at 0.56, and both Claude and ChatGPT scored 0.6. Overall, OE questions saw higher scores (0.73) compared to SATA (0.68). There was poor agreement among radiologists' scores for OE (intraclass correlation coefficient [ICC] = -0.067, p = 0.54), but strong agreement for SATA (ICC = 0.875, p < 0.001). Conclusion The study revealed variations in accuracy across LLMs for both OE and SATA questions. Perplexity showed superior performance in OE questions, while Bing excelled in SATA questions. OE queries yielded better overall results. The current inconsistencies in LLM accuracy highlight the importance of further refinement before these tools can be reliably integrated into clinical practice, with a need for additional LLM fine-tuning and judicious selection by radiologists to achieve consistent and reliable support for decision-making.
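The scoring scheme described above (up to 2 points per open-ended response, 1 point per correct select-all-that-apply option) is put on a common scale by dividing each score by its maximum achievable value. A minimal sketch of that normalization, using made-up example scores rather than the study's data, might look like this:

```python
# Illustrative normalization of OE and SATA scores (example values are hypothetical).
def normalize(score: float, max_score: float) -> float:
    """Return the score as a fraction of the maximum achievable score."""
    return score / max_score

# One rater's scores for a single PE scenario (hypothetical numbers)
oe_score, oe_max = 1.5, 2.0          # open-ended response graded out of 2
sata_score, sata_max = 3.0, 4.0      # 3 of 4 correct options selected

oe_norm = normalize(oe_score, oe_max)        # 0.75
sata_norm = normalize(sata_score, sata_max)  # 0.75

# Model-level accuracy is then the mean normalized score across scenarios and raters.
scores = [0.75, 0.9, 0.6, 1.0]
accuracy = sum(scores) / len(scores)
print(oe_norm, sata_norm, accuracy)
```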
Affiliation(s)
- Pradosh Kumar Sarangi: Department of Radiodiagnosis, All India Institute of Medical Sciences Deoghar, Deoghar, Jharkhand, India
- Suvrankar Datta: Department of Radiodiagnosis, All India Institute of Medical Sciences New Delhi, New Delhi, India
- M. Sarthak Swarup: Department of Radiodiagnosis, Vardhman Mahavir Medical College and Safdarjung Hospital New Delhi, New Delhi, India
- Swaha Panda: Department of Otorhinolaryngology and Head and Neck Surgery, All India Institute of Medical Sciences Deoghar, Deoghar, Jharkhand, India
- Debasish Swapnesh Kumar Nayak: Department of Computer Science and Engineering, SOET, Centurion University of Technology and Management, Bhubaneswar, Odisha, India
- Archana Malik: Department of Pulmonary Medicine, All India Institute of Medical Sciences Deoghar, Deoghar, Jharkhand, India
- Ananda Datta: Department of Pulmonary Medicine, All India Institute of Medical Sciences Deoghar, Deoghar, Jharkhand, India
- Himel Mondal: Department of Physiology, All India Institute of Medical Sciences Deoghar, Deoghar, Jharkhand, India
3. Lyo S, Mohan S, Hassankhani A, Noor A, Dako F, Cook T. From Revisions to Insights: Converting Radiology Report Revisions into Actionable Educational Feedback Using Generative AI Models. J Imaging Inform Med 2024 (online ahead of print). PMID: 39160366; DOI: 10.1007/s10278-024-01233-4.
Abstract
Expert feedback on trainees' preliminary reports is crucial for radiologic training, but real-time feedback can be challenging due to non-contemporaneous, remote reading and increasing imaging volumes. Trainee report revisions contain valuable educational feedback, but synthesizing data from raw revisions is challenging. Generative AI models can potentially analyze these revisions and provide structured, actionable feedback. This study used the OpenAI GPT-4 Turbo API to analyze paired synthesized and open-source analogs of preliminary and finalized reports, identify discrepancies, categorize their severity and type, and suggest review topics. Expert radiologists reviewed the output by grading discrepancies, evaluating the severity and category accuracy, and suggested review topic relevance. The reproducibility of discrepancy detection and maximal discrepancy severity was also examined. The model exhibited high sensitivity, detecting significantly more discrepancies than radiologists (W = 19.0, p < 0.001) with a strong positive correlation (r = 0.778, p < 0.001). Interrater reliability for severity and type was fair (Fleiss' kappa = 0.346 and 0.340, respectively; weighted kappa = 0.622 for severity). The LLM achieved a weighted F1 score of 0.66 for severity and 0.64 for type. Generated teaching points were considered relevant in ~85% of cases, and relevance correlated with the maximal discrepancy severity (Spearman ρ = 0.76, p < 0.001). Reproducibility was moderate to good for the number of discrepancies (ICC(2,1) = 0.690) and substantial for maximal discrepancy severity (Fleiss' kappa = 0.718; weighted kappa = 0.94). Generative AI models can effectively identify discrepancies in report revisions and generate relevant educational feedback, offering promise for enhancing radiology training.
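The agreement and classification metrics reported above (weighted kappa for severity, weighted F1 for severity and type) can be reproduced with standard scikit-learn calls. Below is a minimal sketch with fabricated severity labels, not the study's data; note that Cohen's weighted kappa covers the two-rater case, while Fleiss' kappa for more than two raters requires a different routine (e.g., statsmodels' inter_rater.fleiss_kappa).

```python
from sklearn.metrics import cohen_kappa_score, f1_score

# Hypothetical discrepancy-severity labels (0 = none, 1 = minor, 2 = major)
expert_labels = [0, 1, 2, 1, 0, 2, 1, 1, 2, 0]
model_labels  = [0, 1, 1, 1, 0, 2, 2, 1, 2, 1]

# Linearly weighted kappa penalizes larger ordinal disagreements more heavily.
kappa = cohen_kappa_score(expert_labels, model_labels, weights="linear")

# Weighted F1 averages per-class F1 scores, weighted by class support.
f1 = f1_score(expert_labels, model_labels, average="weighted")

print(f"weighted kappa = {kappa:.3f}, weighted F1 = {f1:.3f}")
```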
Affiliation(s)
- Shawn Lyo: Department of Radiology, Hospital of the University of Pennsylvania, Philadelphia, PA, USA
- Suyash Mohan: Department of Radiology, Hospital of the University of Pennsylvania, Philadelphia, PA, USA
- Alvand Hassankhani: Department of Radiology, Hospital of the University of Pennsylvania, Philadelphia, PA, USA
- Abass Noor: Department of Radiology, Hospital of the University of Pennsylvania, Philadelphia, PA, USA
- Farouk Dako: Department of Radiology, Hospital of the University of Pennsylvania, Philadelphia, PA, USA
- Tessa Cook: Department of Radiology, Hospital of the University of Pennsylvania, Philadelphia, PA, USA
4. Bala W, Li H, Moon J, Trivedi H, Gichoya J, Balthazar P. Enhancing radiology training with GPT-4: Pilot analysis of automated feedback in trainee preliminary reports. Curr Probl Diagn Radiol 2024 (online ahead of print). PMID: 39179466; DOI: 10.1067/j.cpradiol.2024.08.003.
Abstract
RATIONALE AND OBJECTIVES Radiology residents often receive limited feedback on preliminary reports issued during independent call. This study aimed to determine if Large Language Models (LLMs) can supplement traditional feedback by identifying missed diagnoses in radiology residents' preliminary reports. MATERIALS & METHODS A randomly selected subset of 500 (250 train/250 validation) paired preliminary and final reports between 12/17/2022 and 5/22/2023 were extracted and de-identified from our institutional database. The prompts and report text were input into the GPT-4 language model via the GPT-4 API (gpt-4-0314 model version). Iterative prompt tuning was used on a subset of the training/validation sets to direct the model to identify important findings in the final report that were absent in preliminary reports. For testing, a subset of 10 reports with confirmed diagnostic errors were randomly selected. Fourteen residents with on-call experience assessed the LLM-generated discrepancies and completed a survey on their experience using a 5-point Likert scale. RESULTS The model identified 24 unique missed diagnoses across 10 test reports with i% model prediction accuracy as rated by 14 residents. Five additional diagnoses were identified by users, resulting in a model sensitivity of 79.2%. Post-evaluation surveys showed a mean satisfaction rating of 3.50 and perceived accuracy rating of 3.64 out of 5 for LLM-generated feedback. Most respondents (71.4%) favored a combination of LLM-generated and traditional feedback. CONCLUSION This pilot study on the use of LLM-generated feedback for radiology resident preliminary reports demonstrated notable accuracy in identifying missed diagnoses and was positively received, highlighting LLMs' potential role in supplementing conventional feedback methods.
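The workflow described above, feeding paired preliminary and final report text to GPT-4 with a discrepancy-finding prompt, can be sketched roughly as follows. The prompt wording and helper names are illustrative assumptions rather than the authors' tuned prompt, and the call uses the current openai Python client rather than the 2023-era interface the study used.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def find_missed_diagnoses(preliminary: str, final: str) -> str:
    """Ask the model to list findings present in the final report but absent
    from the preliminary report (illustrative prompt, not the study's)."""
    prompt = (
        "Compare the resident's preliminary radiology report with the attending's "
        "final report. List any clinically important findings that appear in the "
        "final report but are missing from the preliminary report.\n\n"
        f"PRELIMINARY REPORT:\n{preliminary}\n\nFINAL REPORT:\n{final}"
    )
    response = client.chat.completions.create(
        model="gpt-4",  # the study reports using the gpt-4-0314 model version
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

# Usage (de-identified report text would be supplied here):
# print(find_missed_diagnoses(prelim_text, final_text))
```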
Affiliation(s)
- Wasif Bala: Department of Radiology and Imaging Sciences, Emory University School of Medicine, USA
- Hanzhou Li: Department of Radiology and Imaging Sciences, Emory University School of Medicine, USA
- John Moon: Department of Radiology and Imaging Sciences, Emory University School of Medicine, USA
- Hari Trivedi: Department of Radiology and Imaging Sciences, Emory University School of Medicine, USA
- Judy Gichoya: Department of Radiology and Imaging Sciences, Emory University School of Medicine, USA
- Patricia Balthazar: Department of Radiology and Imaging Sciences, Emory University School of Medicine, USA
5. Leguízamo-Isaza JM, Olarte Bermúdez LM, Campaña Perilla LA, Romero Enciso JA. ChatGPT in the Practice of Radiology: An Avant-Garde Tool With Challenges to Overcome Prior to Widespread Implementation. Can Assoc Radiol J 2024; 75:673. PMID: 38240332; DOI: 10.1177/08465371241227197.
Affiliation(s)
- Juan Martín Leguízamo-Isaza: Department of Diagnostic Imaging, Fundación Santa Fe de Bogotá University Hospital, Bogotá, Colombia; Diagnostic Radiology Residency Program, Universidad El Bosque, Bogotá, Colombia
6. Mohamed I, Bera K, Ramaiya N. The Undermined ACGME Subcompetency: A Roadmap for Radiology Residency Programs to Foster Residents-as-Educators. Acad Radiol 2024; 31:1189-1197. PMID: 38052673; DOI: 10.1016/j.acra.2023.10.034.
Abstract
Radiology residency programs in the United States use a set of six core competencies laid out by the Accreditation Council for Graduate Medical Education (ACGME) to evaluate the foundational skills of every resident. Although educational skills are included under the heading of Practice-Based Learning and Improvement in the ACGME guidelines for radiology residents, they are often underappreciated and undervalued compared with medical knowledge or patient care. In this paper, the authors lay out the important role of residents-as-educators and how it can be inculcated as part of formal training during residency. They enunciate five pillars for academic programs to build and maintain the pedagogical skills of their radiology residents: Training, Practicing, Providing Feedback, Mentoring, and Changing the Culture. The authors believe that implementing this framework will holistically benefit radiology residents as well as radiology in building future educators. The authors also delineate the challenges that programs currently face in implementation and ways to overcome them.
Affiliation(s)
- Inas Mohamed: Department of Radiology, University Hospitals Cleveland Medical Center, 11000 Euclid Avenue, Cleveland, OH 44106
- Kaustav Bera: Department of Radiology, University Hospitals Cleveland Medical Center, 11000 Euclid Avenue, Cleveland, OH 44106
- Nikhil Ramaiya: Department of Radiology, University Hospitals Cleveland Medical Center, 11000 Euclid Avenue, Cleveland, OH 44106
7. Chien A, Tang H, Jagessar B, Chang KW, Peng N, Nael K, Salamon N. AI-Assisted Summarization of Radiologic Reports: Evaluating GPT3davinci, BARTcnn, LongT5booksum, LEDbooksum, LEDlegal, and LEDclinical. AJNR Am J Neuroradiol 2024; 45:244-248. PMID: 38238092; DOI: 10.3174/ajnr.a8102.
Abstract
BACKGROUND AND PURPOSE The review of clinical reports is an essential part of monitoring disease progression. Synthesizing multiple imaging reports is also important for clinical decisions. It is critical to aggregate information quickly and accurately. Machine learning natural language processing (NLP) models hold promise to address an unmet need for report summarization. MATERIALS AND METHODS We evaluated NLP methods to summarize longitudinal aneurysm reports. A total of 137 clinical reports and 100 PubMed case reports were used in this study. Models were 1) compared against expert-generated summaries using longitudinal imaging notes collected at our institution and 2) compared using publicly accessible PubMed case reports. Five AI models were used to summarize the clinical reports, and a sixth model, the online GPT3davinci NLP large language model (LLM), was added for the summarization of PubMed case reports. We assessed the summary quality through comparison with expert summaries using quantitative metrics and quality reviews by experts. RESULTS In clinical summarization, BARTcnn had the best performance (BERTscore = 0.8371), followed by LongT5booksum and LEDlegal. In the analysis using PubMed case reports, GPT3davinci demonstrated the best performance, followed by BARTcnn and LEDbooksum (BERTscore = 0.894, 0.872, and 0.867, respectively). CONCLUSIONS AI NLP summarization models demonstrated great potential in summarizing longitudinal aneurysm reports, though none yet reached the level of quality required for clinical use. We found the online GPT LLM outperformed the others; however, the BARTcnn model is potentially more useful because it can be implemented on-site. Future work to improve summarization, address other types of neuroimaging reports, and develop structured reports may allow NLP models to ease clinical workflow.
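Summary quality in this study is quantified with BERTscore against expert-written references. A minimal sketch using the open-source bert-score package, with placeholder strings rather than real report text, might look like the following.

```python
from bert_score import score  # pip install bert-score

# Placeholder candidate (model) summaries and expert reference summaries
candidates = [
    "Stable 4 mm left MCA aneurysm, unchanged from prior imaging.",
]
references = [
    "The known 4 mm aneurysm of the left middle cerebral artery is stable compared with the previous study.",
]

# Returns precision, recall, and F1 tensors; the F1 is the figure usually reported
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1 = {F1.mean().item():.3f}")
```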
Affiliation(s)
- Aichi Chien: Department of Radiological Science, David Geffen School of Medicine at UCLA, Los Angeles, California
- Hubert Tang: Department of Radiological Science, David Geffen School of Medicine at UCLA, Los Angeles, California
- Bhavita Jagessar: Department of Radiological Science, David Geffen School of Medicine at UCLA, Los Angeles, California
- Kai-Wei Chang: Department of Computer Science, University of California, Los Angeles, Los Angeles, California
- Nanyun Peng: Department of Computer Science, University of California, Los Angeles, Los Angeles, California
- Kambiz Nael: Department of Radiological Science, David Geffen School of Medicine at UCLA, Los Angeles, California
- Noriko Salamon: Department of Radiological Science, David Geffen School of Medicine at UCLA, Los Angeles, California
8. Truhn D, Weber CD, Braun BJ, Bressem K, Kather JN, Kuhl C, Nebelung S. A pilot study on the efficacy of GPT-4 in providing orthopedic treatment recommendations from MRI reports. Sci Rep 2023; 13:20159. PMID: 37978240; PMCID: PMC10656559; DOI: 10.1038/s41598-023-47500-2.
Abstract
Large language models (LLMs) have shown potential in various applications, including clinical practice. However, their accuracy and utility in providing treatment recommendations for orthopedic conditions remain to be investigated. Thus, this pilot study aims to evaluate the validity of treatment recommendations generated by GPT-4 for common knee and shoulder orthopedic conditions using anonymized clinical MRI reports. A retrospective analysis was conducted using 20 anonymized clinical MRI reports, with varying severity and complexity. Treatment recommendations were elicited from GPT-4 and evaluated by two board-certified specialty-trained senior orthopedic surgeons. Their evaluation focused on semiquantitative gradings of accuracy and clinical utility and potential limitations of the LLM-generated recommendations. GPT-4 provided treatment recommendations for 20 patients (mean age, 50 years ± 19 [standard deviation]; 12 men) with acute and chronic knee and shoulder conditions. The LLM produced largely accurate and clinically useful recommendations. However, limited awareness of a patient's overall situation, a tendency to misjudge treatment urgency, and largely schematic and nonspecific treatment recommendations were observed and may reduce its clinical usefulness. In conclusion, LLM-based treatment recommendations are largely adequate and not prone to 'hallucinations', yet inadequate in particular situations. Critical guidance by healthcare professionals is obligatory, and independent use by patients is discouraged, given the dependency on precise data input.
Grants
- ODELIA, 101057091 European Union's Horizon Europe programme
- COMFORT, 101079894 European Union's Horizon Europe programme
- TR 1700/7-1 Deutsche Forschungsgemeinschaft
- NE 2136/3-1 Deutsche Forschungsgemeinschaft
- DEEP LIVER, ZMVI1-2520DAT111 Bundesministerium für Gesundheit
- #70113864 Max-Eder-Programme of the German Cancer Aid
- PEARL, 01KD2104C German Federal Ministry of Education and Research
- CAMINO, 01EO2101 German Federal Ministry of Education and Research
- SWAG, 01KD2215A German Federal Ministry of Education and Research
- TRANSFORM LIVER, 031L0312A German Federal Ministry of Education and Research
- TANGERINE, 01KT2302 through ERA-NET Transcan German Federal Ministry of Education and Research
- SECAI, 57616814 Deutscher Akademischer Austauschdienst
- Transplant.KI, 01VSF21048 German Federal Joint Committee
- ODELIA, 101057091 European Union's Horizon Europe and innovation programme
- GENIAL, 101096312 European Union's Horizon Europe and innovation programme
- NIHR, NIHR213331 National Institute for Health and Care Research
- European Union’s Horizon Europe programme
- European Union’s Horizon Europe and innovation programme
- RWTH Aachen University (3131)
Affiliation(s)
- Daniel Truhn: Department of Diagnostic and Interventional Radiology, University Hospital RWTH Aachen, Pauwels Street 30, 52074, Aachen, Germany
- Christian D Weber: Department of Orthopaedics and Trauma Surgery, University Hospital RWTH Aachen, Aachen, Germany
- Benedikt J Braun: University Hospital Tuebingen on Behalf of the Eberhard-Karls-University Tuebingen, BG Hospital, Schnarrenbergstr. 95, Tübingen, Germany
- Keno Bressem: Department of Radiology, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Hindenburgdamm 30, 12203, Berlin, Germany
- Jakob N Kather: Else Kroener Fresenius Center for Digital Health, Technical University Dresden, Dresden, Germany; Department of Medicine I, University Hospital Dresden, Dresden, Germany; Department of Medicine III, University Hospital RWTH Aachen, Aachen, Germany; Medical Oncology, National Center for Tumor Diseases (NCT), University Hospital Heidelberg, Heidelberg, Germany
- Christiane Kuhl: Department of Diagnostic and Interventional Radiology, University Hospital RWTH Aachen, Pauwels Street 30, 52074, Aachen, Germany
- Sven Nebelung: Department of Diagnostic and Interventional Radiology, University Hospital RWTH Aachen, Pauwels Street 30, 52074, Aachen, Germany