1
Liu W, He Z, Huang X. Time Matters: Examine Temporal Effects on Biomedical Language Models. AMIA Annu Symp Proc 2025; 2024:723-732. PMID: 40417490; PMCID: PMC12099427.
Abstract
Time is a fundamental concern when applying language models to biomedical applications: models are trained on historical data but deployed on new or future data, which may differ from the training data. While a growing number of biomedical tasks employ state-of-the-art language models, few studies have examined the temporal effects on biomedical models that arise when data shift between development and deployment. This study fills that gap by statistically probing the relations between language model performance and data shifts across three biomedical tasks. We deploy diverse metrics to evaluate model performance, distance methods to measure data drift, and statistical methods to quantify temporal effects on biomedical language models. Our study shows that time matters when deploying biomedical language models, although the degree of performance degradation varies across biomedical tasks and statistical quantification approaches. We believe this study can establish a solid benchmark for evaluating and assessing temporal effects on deployed biomedical language models.
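The abstract does not specify which distance methods the authors used to measure data drift. As a generic, hypothetical illustration of the idea, a simple vocabulary-drift measure between a training period and a deployment period can be sketched in plain Python (the token lists and the total variation distance are an assumption for illustration, not the paper's method):

```python
from collections import Counter

def freq_dist(tokens):
    """Relative token frequencies for one time period's corpus."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def total_variation(p, q):
    """Total variation distance between two frequency distributions, in [0, 1]."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Hypothetical example: vocabulary from an older training corpus vs. a newer
# deployment corpus; a larger distance signals more drift.
train_tokens = ["sars", "influenza", "influenza", "pneumonia"]
deploy_tokens = ["covid", "influenza", "pneumonia", "pneumonia"]
drift = total_variation(freq_dist(train_tokens), freq_dist(deploy_tokens))  # 0.5
```

Correlating such a drift score with per-period model performance is the kind of statistical probing the study describes.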
Affiliation(s)
- Weisi Liu
- University of Memphis, Memphis, TN, USA
- Zhe He
- Florida State University, Tallahassee, FL, USA
2
Tordjman M, Bolger I, Yuce M, Restrepo F, Liu Z, Dercle L, McGale J, Meribout AL, Liu MM, Beddok A, Lee HC, Rohren S, Yu R, Mei X, Taouli B. Large Language Models in Cancer Imaging: Applications and Future Perspectives. J Clin Med 2025; 14:3285. PMID: 40429281; PMCID: PMC12112367; DOI: 10.3390/jcm14103285.
Abstract
Recently, there has been tremendous interest in the use of large language models (LLMs) in radiology. LLMs have been employed for various applications in cancer imaging, including improving reporting speed and accuracy via the generation of standardized reports, automating the classification and staging of abnormal findings in reports, incorporating appropriate guidelines, and calculating individualized risk scores. Another use of LLMs is improving patient comprehension of imaging reports by simplifying medical terms and translating reports into multiple languages. Potential future applications of LLMs include standardizing multidisciplinary tumor boards, aiding patient management, preventing and predicting adverse events (contrast allergies, MRI contraindications), and supporting cancer imaging research. However, limitations such as hallucinations and variable performance could present obstacles to widespread clinical implementation. Herein, we present a review of the current and future applications of LLMs in cancer imaging, as well as their pitfalls and limitations.
Affiliation(s)
- Mickael Tordjman
- Biomedical Engineering & Imaging Institute, Mount Sinai Health System, New York, NY 10029, USA
- Department of Diagnostic, Molecular and Interventional Radiology, Mount Sinai Health System, New York, NY 10029, USA
- Ian Bolger
- Biomedical Engineering & Imaging Institute, Mount Sinai Health System, New York, NY 10029, USA
- Department of Diagnostic, Molecular and Interventional Radiology, Mount Sinai Health System, New York, NY 10029, USA
- Murat Yuce
- Biomedical Engineering & Imaging Institute, Mount Sinai Health System, New York, NY 10029, USA
- Department of Diagnostic, Molecular and Interventional Radiology, Mount Sinai Health System, New York, NY 10029, USA
- Francisco Restrepo
- Biomedical Engineering & Imaging Institute, Mount Sinai Health System, New York, NY 10029, USA
- Department of Diagnostic, Molecular and Interventional Radiology, Mount Sinai Health System, New York, NY 10029, USA
- Zelong Liu
- Biomedical Engineering & Imaging Institute, Mount Sinai Health System, New York, NY 10029, USA
- Department of Diagnostic, Molecular and Interventional Radiology, Mount Sinai Health System, New York, NY 10029, USA
- Laurent Dercle
- Department of Radiology, Columbia University Irving Medical Center, New York, NY 10032, USA
- Jeremy McGale
- Department of Radiology, Columbia University Irving Medical Center, New York, NY 10032, USA
- Anis L. Meribout
- Biomedical Engineering & Imaging Institute, Mount Sinai Health System, New York, NY 10029, USA
- Department of Diagnostic, Molecular and Interventional Radiology, Mount Sinai Health System, New York, NY 10029, USA
- Mira M. Liu
- Biomedical Engineering & Imaging Institute, Mount Sinai Health System, New York, NY 10029, USA
- Department of Diagnostic, Molecular and Interventional Radiology, Mount Sinai Health System, New York, NY 10029, USA
- Arnaud Beddok
- Department of Radiation Oncology, Institut Godinot, 51454 Reims, France
- Faculty of Medicine, Université de Reims Champagne-Ardenne, CRESTIC, 51100 Reims, France
- Yale PET Center, Department of Radiology & Biomedical Imaging, Yale University School of Medicine, New Haven, CT 06520, USA
- Hao-Chih Lee
- Biomedical Engineering & Imaging Institute, Mount Sinai Health System, New York, NY 10029, USA
- Department of Diagnostic, Molecular and Interventional Radiology, Mount Sinai Health System, New York, NY 10029, USA
- Scott Rohren
- Biomedical Engineering & Imaging Institute, Mount Sinai Health System, New York, NY 10029, USA
- Department of Diagnostic, Molecular and Interventional Radiology, Mount Sinai Health System, New York, NY 10029, USA
- Ryan Yu
- Biomedical Engineering & Imaging Institute, Mount Sinai Health System, New York, NY 10029, USA
- Department of Diagnostic, Molecular and Interventional Radiology, Mount Sinai Health System, New York, NY 10029, USA
- Xueyan Mei
- Biomedical Engineering & Imaging Institute, Mount Sinai Health System, New York, NY 10029, USA
- Department of Diagnostic, Molecular and Interventional Radiology, Mount Sinai Health System, New York, NY 10029, USA
- Bachir Taouli
- Biomedical Engineering & Imaging Institute, Mount Sinai Health System, New York, NY 10029, USA
- Department of Diagnostic, Molecular and Interventional Radiology, Mount Sinai Health System, New York, NY 10029, USA
3
Niyonkuru E, Gomez MS, Casarighi E, Antogiovanni S, Blau H, Reese JT, Valentini G, Robinson PN. Replacing non-biomedical concepts improves embedding of biomedical concepts. PLoS One 2025; 20:e0322498. PMID: 40324016; PMCID: PMC12052101; DOI: 10.1371/journal.pone.0322498.
Abstract
Embeddings are semantically meaningful representations of words in a vector space, commonly used to enhance downstream machine learning applications. Traditional biomedical embedding techniques often replace all synonymous words representing biological or medical concepts with a unique token, ensuring consistent representation and improving embedding quality. However, the potential impact of replacing non-biomedical concept synonyms has received less attention. Embedding approaches often employ concept replacement to replace concepts that span multiple words, such as non-small-cell lung carcinoma, with a single concept identifier (e.g., D002289). Also, all synonyms of each concept are merged into the same identifier. Here, we additionally leveraged WordNet to identify and replace sets of non-biomedical synonyms with their most common representatives. This combined approach aimed to reduce embedding noise from non-biomedical terms while preserving the integrity of biomedical concept representations. We applied this method to 1,055 biomedical concept sets representing molecular signatures or medical categories and assessed the mean pairwise distance of embeddings with and without non-biomedical synonym replacement. A smaller mean pairwise distance was interpreted as greater intra-cluster coherence and higher embedding quality. Embeddings were generated using the Word2Vec algorithm applied to a corpus of 10 million PubMed abstracts. Our results demonstrate that the addition of non-biomedical synonym replacement reduced the mean intra-cluster distance by an average of 8%, suggesting that this complementary approach enhances embedding quality. Future work will assess its applicability to other embedding techniques and downstream tasks. Python code implementing this method is provided under an open-source license.
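The evaluation described in the abstract, the mean pairwise distance within a concept set's embeddings, can be sketched in plain Python. The synonym table below is a hypothetical stand-in for the WordNet-derived replacement sets, and the distance is ordinary cosine distance; neither is claimed to match the authors' exact pipeline:

```python
import math
from itertools import combinations

# Hypothetical synonym table: every member of a non-biomedical synonym set
# maps to its most common representative (illustrative entries only).
SYNONYM_REP = {"illness": "disease", "sickness": "disease", "ailment": "disease"}

def normalize(tokens):
    """Replace non-biomedical synonyms before training the embedding model."""
    return [SYNONYM_REP.get(t, t) for t in tokens]

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def mean_pairwise_distance(vectors):
    """Intra-cluster coherence of one concept set: smaller means more coherent."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)
```

Comparing `mean_pairwise_distance` over embeddings trained with and without `normalize` applied to the corpus mirrors the paper's with/without comparison across its 1,055 concept sets.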
Affiliation(s)
- Enock Niyonkuru
- The Jackson Laboratory for Genomic Medicine, Farmington, Connecticut, United States of America
- Trinity College, Hartford, Connecticut, United States of America
- Mauricio Soto Gomez
- AnacletoLab, Computer Science Department, Dipartimento di Informatica, Università degli Studi di Milano, Milan, Italy
- Elena Casarighi
- AnacletoLab, Computer Science Department, Dipartimento di Informatica, Università degli Studi di Milano, Milan, Italy
- Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, California, United States of America
- Computer Science Department, Aalto University, Espoo, Finland
- Stephan Antogiovanni
- The Jackson Laboratory for Genomic Medicine, Farmington, Connecticut, United States of America
- Trinity College, Hartford, Connecticut, United States of America
- Hannah Blau
- The Jackson Laboratory for Genomic Medicine, Farmington, Connecticut, United States of America
- Justin T. Reese
- Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, California, United States of America
- Giorgio Valentini
- AnacletoLab, Computer Science Department, Dipartimento di Informatica, Università degli Studi di Milano, Milan, Italy
- Peter N. Robinson
- The Jackson Laboratory for Genomic Medicine, Farmington, Connecticut, United States of America
- Berlin Institute of Health at Charité – Universitätsmedizin Berlin, Berlin, Germany
4
Jabal MS, Warman P, Zhang J, Gupta K, Jain A, Mazurowski M, Wiggins W, Magudia K, Calabrese E. Open-Weight Language Models and Retrieval-Augmented Generation for Automated Structured Data Extraction from Diagnostic Reports: Assessment of Approaches and Parameters. Radiol Artif Intell 2025; 7:e240551. PMID: 40072216; DOI: 10.1148/ryai.240551.
Abstract
Purpose To develop and evaluate an automated system for extracting structured clinical information from unstructured radiology and pathology reports using open-weight language models (LMs) and retrieval-augmented generation (RAG), and to assess the effects of model configuration variables on extraction performance. Materials and Methods This retrospective study used two datasets: 7294 radiology reports annotated for Brain Tumor Reporting and Data System (BT-RADS) scores and 2154 pathology reports annotated for IDH mutation status (January 2017-July 2021). An automated pipeline was developed to benchmark the accuracy of structured data extraction across various LM and RAG configurations. The effects of model size, quantization, prompting strategies, output formatting, and inference parameters on model accuracy were systematically evaluated. Results The best-performing models achieved up to 98% accuracy in extracting BT-RADS scores from radiology reports and greater than 90% accuracy in extracting IDH mutation status from pathology reports. The best model was a medically fine-tuned Llama 3. Larger, newer, and domain fine-tuned models consistently outperformed older and smaller models (mean accuracy, 86% vs 75%; P < .001). Model quantization had minimal effect on performance. Few-shot prompting significantly improved accuracy (mean [±SD] increase, 32% ± 32; P = .02). RAG improved performance for complex pathology reports by a mean of 48% ± 11 (P = .001) but not for shorter radiology reports (-8% ± 31; P = .39). Conclusion This study demonstrates the potential of open LMs for automated extraction of structured clinical data from unstructured clinical reports in local, privacy-preserving applications. Careful model selection, prompt engineering, and semiautomated optimization using annotated data are critical for optimal performance.
Keywords: Large Language Models, Retrieval-Augmented Generation, Radiology, Pathology, Health Care Reports. Supplemental material is available for this article. © RSNA, 2025. See also the commentary by Tejani and Rauschecker in this issue.
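The few-shot extraction setup the study benchmarks can be illustrated with a minimal sketch. The prompt template and the BT-RADS score pattern below are assumptions made for illustration, not the authors' actual prompts or parser:

```python
import re

# Illustrative few-shot prompt template (hypothetical; not the study's prompts).
FEW_SHOT = """Extract the BT-RADS score from the report.
Report: ... stable findings, overall assessment BT-RADS 3a ...
Score: 3a
Report: {report}
Score:"""

def build_prompt(report):
    """Embed a new report into the few-shot template before sending it to an LM."""
    return FEW_SHOT.format(report=report)

# BT-RADS scores run 0-4, with lettered subcategories (e.g., 1a, 3b).
SCORE_RE = re.compile(r"\b([0-4][a-c]?)\b")

def parse_score(model_output):
    """Pull the first score-shaped token out of the model's free-text reply."""
    m = SCORE_RE.search(model_output)
    return m.group(1) if m else None
```

Benchmarking then reduces to comparing `parse_score` outputs against the annotated labels across model sizes, quantization levels, and prompting strategies.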
Affiliation(s)
- Mohamed Sobhi Jabal
- Department of Radiology, Duke University Hospital, 2301 Erwin Rd, Durham, NC 27710
- Jikai Zhang
- Department of Electrical and Computer Engineering, Duke University, Durham, NC
- Duke Center for Artificial Intelligence in Radiology, Duke University, Durham, NC
- Kartikeye Gupta
- Department of Radiology, Duke University Medical Center, Durham, NC
- Ayush Jain
- Department of Radiology, Duke University Medical Center, Durham, NC
- Maciej Mazurowski
- Department of Radiology, Duke University Hospital, 2301 Erwin Rd, Durham, NC 27710
- Duke University School of Medicine, Durham, NC
- Department of Electrical and Computer Engineering, Duke University, Durham, NC
- Walter Wiggins
- Department of Radiology, Duke University Hospital, 2301 Erwin Rd, Durham, NC 27710
- Kirti Magudia
- Department of Radiology, Duke University Hospital, 2301 Erwin Rd, Durham, NC 27710
- Evan Calabrese
- Department of Radiology, Duke University Hospital, 2301 Erwin Rd, Durham, NC 27710
- Department of Radiology, Duke University Medical Center, Durham, NC
5
Kunze KN. Generative Artificial Intelligence and Musculoskeletal Health Care. HSS J 2025. PMID: 40297632; PMCID: PMC12033169; DOI: 10.1177/15563316251335334.
Abstract
Generative artificial intelligence (AI) comprises a class of AI models that generate synthetic outputs based on learning acquired from the dataset that trained the model. This means that they can create entirely new outputs that resemble real-world data despite not being explicitly instructed to do so during training. With advances in technological capability, computing power, and data availability, generative AI has given rise to more advanced and versatile models, including diffusion models and large language models, that hold promise in healthcare. In musculoskeletal healthcare, generative AI applications may involve the enhancement of images, generation of audio and video, automation of clinical documentation and administrative tasks, use of surgical planning aids, augmentation of treatment decisions, and personalization of patient communication. Limitations of the use of generative AI in healthcare include hallucinations, model bias, ethical considerations during clinical use, knowledge gaps, and lack of transparency. This review introduces critical concepts of generative AI, presents clinical applications relevant to musculoskeletal healthcare that are in development, and highlights limitations preventing deployment in clinical settings.
Affiliation(s)
- Kyle N. Kunze
- Department of Orthopedic Surgery, Hospital for Special Surgery, New York, NY, USA
6
Kulyabin M, Zhdanov A, Lee IO, Skuse DH, Thompson DA, Maier A, Constable PA. Synthetic electroretinogram signal generation using a conditional generative adversarial network. Doc Ophthalmol 2025. PMID: 40240677; DOI: 10.1007/s10633-025-10019-0.
Abstract
PURPOSE The electroretinogram (ERG) records the functional response of the retina. In some neurological conditions the ERG waveform may be altered and could support biomarker discovery. In heterogeneous or rare populations, where large datasets or the availability of data may be a challenge, synthetic signals generated with artificial intelligence (AI) may help mitigate these factors and support classification models. METHODS This approach was tested using a publicly available dataset of real ERGs, n = 560 (ASD) and n = 498 (Control), recorded at 9 different flash strengths from n = 18 ASD participants (mean age 12.2 ± 2.7 years) and n = 31 Controls (mean age 11.8 ± 3.3 years), augmented with synthetic waveforms generated by a conditional generative adversarial network. Two deep learning models were used to classify the groups using either the real ERGs alone or the combined real and synthetic ERGs: a Time Series Transformer (with waveforms in their original form) and a Visual Transformer utilizing images of the wavelets derived from a continuous wavelet transform of the ERGs. Model performance in classifying the groups was evaluated with balanced accuracy (BA) as the main outcome measure. RESULTS The BA improved from 0.756 to 0.879 when synthetic ERGs were included across all recordings in the training of the Time Series Transformer. This model also achieved the best performance, with a BA of 0.89, using real and synthetic waveforms from a single flash strength of 0.95 log cd s m-2. CONCLUSIONS The improved performance of the deep learning models with synthetic waveforms supports the application of AI to improve group classification with ERG recordings.
Affiliation(s)
- Mikhail Kulyabin
- Pattern Recognition Lab, Department of Computer Science, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
- Irene O Lee
- Behavioural and Brain Sciences Unit, Population Policy and Practice Programme, UCL Great Ormond Street Institute of Child Health, University College London, London, UK
- David H Skuse
- Behavioural and Brain Sciences Unit, Population Policy and Practice Programme, UCL Great Ormond Street Institute of Child Health, University College London, London, UK
- Dorothy A Thompson
- The Tony Kriss Visual Electrophysiology Unit, Clinical and Academic, Department of Ophthalmology, Great Ormond Street Hospital for Children NHS Trust, London, UK
- UCL Great Ormond Street Institute of Child Health, University College London, London, UK
- Andreas Maier
- Pattern Recognition Lab, Department of Computer Science, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
- Paul A Constable
- College of Nursing and Health Sciences, Caring Futures Institute, Flinders University, Adelaide, 5000, Australia
7
Williams EL, Huynh D, Estai M, Sinha T, Summerscales M, Kanagasingam Y. Predicting Inpatient Admissions From Emergency Department Triage Using Machine Learning: A Systematic Review. Mayo Clin Proc Digit Health 2025; 3:100197. PMID: 40206990; PMCID: PMC11975823; DOI: 10.1016/j.mcpdig.2025.100197.
Abstract
This study aimed to evaluate the quality of evidence for using machine learning models to predict inpatient admissions from emergency department triage data, ultimately aiming to improve patient flow management. A comprehensive literature search was conducted according to the PRISMA guidelines across 5 databases (PubMed, Embase, Web of Science, Scopus, and CINAHL) on August 1, 2024, for English-language studies published between August 1, 2014, and August 1, 2024. This yielded 700 articles, of which 66 were screened in full and 31 met the inclusion and exclusion criteria. Model quality was assessed using the PROBAST appraisal tool and a modified TRIPOD+AI framework, alongside reported model performance metrics. Seven studies demonstrated rigorous methodology and promising in silico performance, with an area under the receiver operating characteristic curve ranging from 0.81 to 0.93. However, further performance analysis was limited by heterogeneity in model development and an unclear-to-high risk of bias and applicability concerns in the remaining 24 articles, as evaluated by the PROBAST tool. The current literature demonstrates a good degree of in silico accuracy in predicting inpatient admission from triage data alone. Future research should emphasize transparent model development and reporting, temporal validation, concept drift analysis, exploration of emerging artificial intelligence techniques, and analysis of real-world patient flow metrics to comprehensively assess the usefulness of these models.
Affiliation(s)
- Ethan L. Williams
- School of Medicine, The University of Notre Dame, Fremantle, Western Australia, Australia
- Emergency Department, St John of God Midland Public and Private Hospitals, Midland, Western Australia, Australia
- Daniel Huynh
- General Medicine Department, Royal North Shore Hospital, St Leonards, New South Wales, Australia
- Mohamed Estai
- School of Human Sciences, The University of Western Australia, Crawley, Western Australia, Australia
- Toshi Sinha
- School of Medicine, The University of Notre Dame, Fremantle, Western Australia, Australia
- Matthew Summerscales
- Emergency Department, St John of God Midland Public and Private Hospitals, Midland, Western Australia, Australia
- Yogesan Kanagasingam
- School of Medicine, The University of Notre Dame, Fremantle, Western Australia, Australia
- Emergency Department, St John of God Midland Public and Private Hospitals, Midland, Western Australia, Australia
8
Gencer G, Gencer K. Large Language Models in Healthcare: A Bibliometric Analysis and Examination of Research Trends. J Multidiscip Healthc 2025; 18:223-238. PMID: 39844924; PMCID: PMC11750729; DOI: 10.2147/jmdh.s502351.
Abstract
Background The integration of large language models (LLMs) in healthcare has generated significant interest due to their potential to improve diagnostic accuracy, the personalization of treatment, and the efficiency of patient care. Objective This study conducts a comprehensive bibliometric analysis to identify current research trends, main themes, and future directions for LLM applications in the healthcare sector. Methods A systematic scan of publications up to May 8, 2024 was carried out in the Web of Science database. Using bibliometric tools such as VOSviewer and CiteSpace, we analyzed publication counts, citations, co-authorship, co-occurrence of keywords, and thematic development to map the intellectual landscape and collaborative networks in this field. Results The analysis included more than 500 articles published between 2021 and 2024. The United States, Germany, and the United Kingdom were the top contributors to this field. The study highlights that publications on neural network applications in diagnostic imaging, natural language processing for clinical documentation, and patient data in general internal medicine, radiology, medical informatics, healthcare services, surgery, oncology, ophthalmology, neurology, orthopedics, and psychiatry have grown significantly over the past two years. Keyword trend analysis revealed emerging sub-themes such as clinical research, artificial intelligence, ChatGPT, education, natural language processing, clinical management, virtual reality, and chatbots, indicating a shift towards addressing the broader implications of LLM applications in healthcare. Conclusion The use of LLMs in healthcare is an expanding field with significant academic and clinical interest. This bibliometric analysis not only maps the current state of the research but also identifies important areas that require further research and development. Continued advances in this field are expected to significantly impact future healthcare applications, with a focus on increasing the accuracy and personalization of patient care through advanced data analytics.
Affiliation(s)
- Gülcan Gencer
- Department of Biostatistics and Medical Informatics, Afyonkarahisar Health Sciences University, Faculty of Medicine, Afyonkarahisar, Turkey
- Kerem Gencer
- Department of Computer Engineering, Afyon Kocatepe University, Faculty of Engineering, Afyonkarahisar, Turkey
9
Li B, Gilbert S. Artificial Intelligence awarded two Nobel Prizes for innovations that will shape the future of medicine. NPJ Digit Med 2024; 7:336. PMID: 39587223; PMCID: PMC11589127; DOI: 10.1038/s41746-024-01345-9.
Affiliation(s)
- Ben Li
- Division of Vascular Surgery, University of Toronto, Toronto, ON, Canada
- Temerty Centre for Artificial Intelligence Research and Education in Medicine, University of Toronto, Toronto, ON, Canada
- Stephen Gilbert
- Else Kröner Fresenius Center for Digital Health, TUD Dresden University of Technology, Dresden, Germany
10
Irmici G, Cozzi A, Della Pepa G, De Berardinis C, D'Ascoli E, Cellina M, Cè M, Depretto C, Scaperrotta G. How do large language models answer breast cancer quiz questions? A comparative study of GPT-3.5, GPT-4 and Google Gemini. Radiol Med 2024; 129:1463-1467. PMID: 39138732; DOI: 10.1007/s11547-024-01872-1.
Abstract
Applications of large language models (LLMs) in the healthcare field have shown promising results in processing and summarizing multidisciplinary information. This study evaluated the ability of three publicly available LLMs (GPT-3.5, GPT-4, and Google Gemini, at the time called Bard) to answer 60 multiple-choice questions (29 sourced from public databases, 31 newly formulated by experienced breast radiologists) about different aspects of breast cancer care: treatment and prognosis, diagnostic and interventional techniques, imaging interpretation, and pathology. Overall, the rate of correct answers differed significantly among LLMs (p = 0.010): the best performance was achieved by GPT-4 (95%, 57/60), followed by GPT-3.5 (90%, 54/60) and Google Gemini (80%, 48/60). Across all LLMs, no significant differences were observed in the rates of correct replies to questions sourced from public databases versus newly formulated ones (p ≥ 0.593). These results highlight the potential benefits of LLMs in breast cancer care, which will need to be further refined through in-context training.
Affiliation(s)
- Giovanni Irmici
- Breast Radiology Department, Fondazione IRCCS Istituto Nazionale dei Tumori, Via Giacomo Venezian 1, 20133, Milano, Italy
- Andrea Cozzi
- Imaging Institute of Southern Switzerland (IIMSI), Ente Ospedaliero Cantonale (EOC), Lugano, Switzerland
- Gianmarco Della Pepa
- Breast Radiology Department, Fondazione IRCCS Istituto Nazionale dei Tumori, Via Giacomo Venezian 1, 20133, Milano, Italy
- Claudia De Berardinis
- Breast Radiology Department, Fondazione IRCCS Istituto Nazionale dei Tumori, Via Giacomo Venezian 1, 20133, Milano, Italy
- Elisa D'Ascoli
- Breast Radiology Department, Fondazione IRCCS Istituto Nazionale dei Tumori, Via Giacomo Venezian 1, 20133, Milano, Italy
- Michaela Cellina
- Radiology Department, ASST Fatebenefratelli Sacco, Milano, Italy
- Maurizio Cè
- Postgraduation School in Radiodiagnostics, Università degli Studi di Milano, Milano, Italy
- Catherine Depretto
- Breast Radiology Department, Fondazione IRCCS Istituto Nazionale dei Tumori, Via Giacomo Venezian 1, 20133, Milano, Italy
- Gianfranco Scaperrotta
- Breast Radiology Department, Fondazione IRCCS Istituto Nazionale dei Tumori, Via Giacomo Venezian 1, 20133, Milano, Italy