1
Wang J, Shue K, Liu L, Hu G. Preliminary evaluation of ChatGPT model iterations in emergency department diagnostics. Sci Rep 2025;15:10426. PMID: 40140500; PMCID: PMC11947261; DOI: 10.1038/s41598-025-95233-1.
Abstract
Large language model chatbots such as ChatGPT have shown potential for assisting health professionals in emergency departments (EDs), but the diagnostic accuracy of newer ChatGPT models remains unclear. This retrospective study evaluated the diagnostic performance of several ChatGPT models (GPT-3.5, GPT-4, GPT-4o, and the o1 series) in predicting diagnoses for ED patients (n = 30) and examined the impact of explicitly invoking reasoning ("thoughts"). Earlier models such as GPT-3.5 showed high accuracy for top-three differential diagnoses (80.0%) but underperformed in identifying the leading diagnosis (47.8%) compared with newer models such as chatgpt-4o-latest (60%, p < 0.01) and o1-preview (60%, p < 0.01). Asking the model to provide its thoughts significantly improved leading-diagnosis prediction for the 4o models, such as 4o-2024-05-13 (from 45.6% to 56.7%; p = 0.03) and 4o-mini-2024-07-18 (from 54.4% to 60.0%; p = 0.04), but had minimal impact on o1-mini and o1-preview. In challenging cases, such as pneumonia without fever, all models generally failed to predict the correct diagnosis, indicating that atypical presentations are a major limitation for ED application of current ChatGPT models.
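The "thoughts" manipulation studied here is easy to picture in code. Below is a minimal sketch, assuming the OpenAI Python SDK (v1+) and an OPENAI_API_KEY in the environment; the case vignette, prompt wording, and exact model identifiers are illustrative assumptions, not the authors' protocol.

```python
# Minimal sketch (not the study's code): query two GPT-4o variants with and
# without an explicit request for step-by-step reasoning ("thoughts").
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical ED vignette, invented for illustration only
CASE = ("72-year-old with productive cough, pleuritic chest pain, "
        "and malaise; afebrile on arrival.")
PROMPTS = {
    "no_thoughts": "Give the single most likely (leading) diagnosis.",
    "with_thoughts": ("Explain your reasoning step by step, "
                      "then give the leading diagnosis."),
}

def ask(model: str, instruction: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": f"{instruction}\n\nED case: {CASE}"}],
    )
    return resp.choices[0].message.content

for model in ("gpt-4o-2024-05-13", "gpt-4o-mini-2024-07-18"):
    for label, instruction in PROMPTS.items():
        print(f"--- {model} / {label} ---")
        print(ask(model, instruction))
```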
Affiliation(s)
- Jinge Wang
- Department of Microbiology, Immunology & Cell Biology, West Virginia University, Morgantown, WV, 26506, USA
- Kenneth Shue
- Department of Microbiology, Immunology & Cell Biology, West Virginia University, Morgantown, WV, 26506, USA
- Li Liu
- College of Health Solutions, Arizona State University, Phoenix, AZ, 85004, USA
- Biodesign Institute, Arizona State University, Tempe, AZ, 85281, USA
- Gangqing Hu
- Department of Microbiology, Immunology & Cell Biology, West Virginia University, Morgantown, WV, 26506, USA
2
Lombardo R, Gallo G, Stira J, Turchi B, Santoro G, Riolo S, Romagnoli M, Cicione A, Tema G, Pastore A, Al Salhi Y, Fuschi A, Franco G, Nacchia A, Tubaro A, De Nunzio C. Quality of information and appropriateness of Open AI outputs for prostate cancer. Prostate Cancer Prostatic Dis 2025;28:229-231. PMID: 38228809; DOI: 10.1038/s41391-024-00789-0.
Abstract
ChatGPT, a natural language processing (NLP) tool created by OpenAI, can potentially serve as a quick source of information on prostate cancer. This study analyzed the quality and appropriateness of ChatGPT's responses to prostate cancer inquiries against the European Association of Urology (EAU) 2023 prostate cancer guidelines. Overall, 195 questions were prepared from the recommendations in the prostate cancer section of the EAU 2023 guideline. All questions were systematically presented to ChatGPT's August 3 version, and two expert urologists independently scored each response from 1 to 4 (1: completely correct; 2: correct but inadequate; 3: a mix of correct and misleading information; 4: completely incorrect). Sub-analyses per chapter and per grade of recommendation were performed. Of the 195 recommendations evaluated, 50 (26%) were completely correct, 51 (26%) correct but inadequate, 47 (24%) a mix of correct and misleading, and 47 (24%) completely incorrect. Across chapters, ChatGPT was particularly accurate in answering questions on follow-up and quality of life; the worst performance was recorded for the diagnosis and treatment chapters, with 19% and 30% of answers completely incorrect, respectively. No differences in accuracy were recorded between weak and strong recommendations (p > 0.05). ChatGPT has poor accuracy when answering questions on the EAU prostate cancer guideline recommendations, and future studies should assess its performance after adequate training.
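The tallying and comparison described above can be illustrated with a short sketch. The ratings below are invented, and the chi-square test is an assumed choice for the weak-versus-strong comparison (the abstract does not name the test used).

```python
# Minimal sketch (not the study's analysis): tally 1-4 scores per chapter
# and compare accuracy across recommendation strength. Data are invented.
from collections import Counter
from scipy.stats import chi2_contingency

# (chapter, recommendation strength, score); score 1 = completely correct,
# 2 = correct but inadequate, 3 = mixed/misleading, 4 = completely incorrect
ratings = [
    ("diagnosis", "strong", 4), ("diagnosis", "weak", 1),
    ("treatment", "strong", 3), ("follow-up", "strong", 1),
    ("follow-up", "weak", 2), ("treatment", "weak", 4),
]

# per-chapter distribution of scores
print(Counter((chapter, score) for chapter, _, score in ratings))

# 2x2 table: completely correct (score 1) vs not, by recommendation strength
table = [
    [sum(1 for _, st, sc in ratings if st == strength and sc == 1),
     sum(1 for _, st, sc in ratings if st == strength and sc != 1)]
    for strength in ("strong", "weak")
]
chi2, p, _, _ = chi2_contingency(table)
print(f"chi-square p-value: {p:.3f}")
```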
Affiliation(s)
- Giacomo Gallo
- Department of Urology, 'Sapienza' University of Rome, Rome, Italy
- Jordi Stira
- Department of Urology, 'Sapienza' University of Rome, Rome, Italy
- Beatrice Turchi
- Department of Urology, 'Sapienza' University of Rome, Rome, Italy
- Giuseppe Santoro
- Department of Urology, 'Sapienza' University of Rome, Rome, Italy
- Sara Riolo
- Department of Urology, 'Sapienza' University of Rome, Rome, Italy
- Matteo Romagnoli
- Department of Urology, 'Sapienza' University of Rome, Rome, Italy
- Antonio Cicione
- Department of Urology, 'Sapienza' University of Rome, Rome, Italy
- Giorgia Tema
- Department of Urology, 'Sapienza' University of Rome, Rome, Italy
- Antonio Pastore
- Department of Urology, 'Sapienza' University of Rome, Rome, Italy
- Yazan Al Salhi
- Department of Urology, 'Sapienza' University of Rome, Rome, Italy
- Andrea Fuschi
- Department of Urology, 'Sapienza' University of Rome, Rome, Italy
- Giorgio Franco
- Department of Urology, 'Sapienza' University of Rome, Rome, Italy
- Antonio Nacchia
- Department of Urology, 'Sapienza' University of Rome, Rome, Italy
- Andrea Tubaro
- Department of Urology, 'Sapienza' University of Rome, Rome, Italy
- Cosimo De Nunzio
- Department of Urology, 'Sapienza' University of Rome, Rome, Italy
3
Huo B, Boyle A, Marfo N, Tangamornsuksan W, Steen JP, McKechnie T, Lee Y, Mayol J, Antoniou SA, Thirunavukarasu AJ, Sanger S, Ramji K, Guyatt G. Large Language Models for Chatbot Health Advice Studies: A Systematic Review. JAMA Netw Open 2025;8:e2457879. PMID: 39903463; PMCID: PMC11795331; DOI: 10.1001/jamanetworkopen.2024.57879.
Abstract
Importance: There is much interest in the clinical integration of large language models (LLMs) in health care. Many studies have assessed the ability of LLMs to provide health advice, but the quality of their reporting is uncertain.
Objective: To perform a systematic review examining the reporting variability among peer-reviewed studies evaluating the performance of generative artificial intelligence (AI)-driven chatbots for summarizing evidence and providing health advice, to inform the development of the Chatbot Assessment Reporting Tool (CHART).
Evidence Review: A search of MEDLINE via Ovid, Embase via Elsevier, and Web of Science from inception to October 27, 2023, was conducted with the help of a health sciences librarian, yielding 7752 articles. Two reviewers screened articles by title and abstract, followed by full-text review, to identify primary studies evaluating the clinical accuracy of generative AI-driven chatbots in providing health advice (chatbot health advice studies). Two reviewers then performed data extraction for 137 eligible studies.
Findings: A total of 137 studies were included. Studies examined topics in surgery (55 [40.1%]), medicine (51 [37.2%]), and primary care (13 [9.5%]). Many studies focused on treatment (91 [66.4%]), diagnosis (60 [43.8%]), or disease prevention (29 [21.2%]). Most studies (136 [99.3%]) evaluated inaccessible, closed-source LLMs and did not provide enough information to identify the version of the LLM under evaluation. All studies lacked a sufficient description of LLM characteristics, including temperature, token length, fine-tuning availability, layers, and other details. Most studies (136 [99.3%]) did not describe a prompt engineering phase. The date of LLM querying was reported in 54 (39.4%) studies. Most studies (89 [65.0%]) used subjective means to define successful chatbot performance, while fewer than one-third addressed the ethical, regulatory, and patient safety implications of the clinical integration of LLMs.
Conclusions and Relevance: In this systematic review of 137 chatbot health advice studies, reporting quality was heterogeneous; the findings may inform the development of the CHART reporting standards. Ethical, regulatory, and patient safety considerations are crucial as interest grows in the clinical integration of LLMs.
Affiliation(s)
- Bright Huo
- Division of General Surgery, Department of Surgery, McMaster University, Hamilton, Ontario, Canada
- Amy Boyle
- Michael G. DeGroote School of Medicine, McMaster University, Hamilton, Ontario, Canada
- Nana Marfo
- Ross University School of Medicine, Miramar, Florida
- Wimonchat Tangamornsuksan
- Department of Health Research Methods, Evidence, and Impact, McMaster University, Hamilton, Ontario, Canada
- Jeremy P. Steen
- Institute of Health Policy, Management and Evaluation, University of Toronto, Toronto, Ontario, Canada
- Tyler McKechnie
- Division of General Surgery, Department of Surgery, McMaster University, Hamilton, Ontario, Canada
- Yung Lee
- Division of General Surgery, Department of Surgery, McMaster University, Hamilton, Ontario, Canada
- Julio Mayol
- Hospital Clinico San Carlos, IdISSC, Universidad Complutense de Madrid, Madrid, Spain
- Stephanie Sanger
- Health Science Library, Faculty of Health Sciences, McMaster University, Hamilton, Ontario, Canada
- Karim Ramji
- Division of General Surgery, Department of Surgery, McMaster University, Hamilton, Ontario, Canada
- Gordon Guyatt
- Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Ontario, Canada
4
Zhang K, Meng X, Yan X, Ji J, Liu J, Xu H, Zhang H, Liu D, Wang J, Wang X, Gao J, Wang YGS, Shao C, Wang W, Li J, Zheng MQ, Yang Y, Tang YD. Revolutionizing Health Care: The Transformative Impact of Large Language Models in Medicine. J Med Internet Res 2025;27:e59069. PMID: 39773666; PMCID: PMC11751657; DOI: 10.2196/59069.
Abstract
Large language models (LLMs) are rapidly advancing medical artificial intelligence and offer revolutionary changes in health care. These models excel in natural language processing (NLP), enhancing clinical support, diagnosis, treatment, and medical research. Breakthroughs such as GPT-4 and BERT (Bidirectional Encoder Representations from Transformers) demonstrate how LLMs have evolved with improved computing power and data, and their high hardware requirements are being addressed through technological advances. LLMs are unique in processing multimodal data, improving emergency care, elder care, and digital medical procedures. Challenges include ensuring empirical reliability, addressing ethical and societal implications (especially data privacy), and mitigating bias while maintaining accountability. The paper emphasizes the need for human-centric, bias-free LLMs for personalized medicine and advocates equitable development and access. LLMs hold promise for transformative impact in health care.
Affiliation(s)
- Kuo Zhang
- Department of Cardiology, State Key Laboratory of Cardiovascular Disease, Fuwai Hospital, National Center for Cardiovascular Diseases, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
- Xiangyu Yan
- School of Disaster and Emergency Medicine, Tianjin University, Tianjin, China
- Jiaming Ji
- Institute for Artificial Intelligence, Peking University, Beijing, China
- Hua Xu
- Division of Emerging Interdisciplinary Areas, Hong Kong University of Science and Technology, Hong Kong, China
- Heng Zhang
- Institute for Artificial Intelligence, Hefei University of Technology, Hefei, Anhui, China
- Da Liu
- Department of Cardiology, the First Hospital of Hebei Medical University, Graduate School of Hebei Medical University, Shijiazhuang, Hebei, China
- Jingjia Wang
- Department of Cardiology and Institute of Vascular Medicine, Key Laboratory of Molecular Cardiovascular Science, Ministry of Education, Peking University Third Hospital, Beijing, China
- Xuliang Wang
- Department of Cardiology and Institute of Vascular Medicine, Key Laboratory of Molecular Cardiovascular Science, Ministry of Education, Peking University Third Hospital, Beijing, China
- Jun Gao
- Department of Cardiology and Institute of Vascular Medicine, Key Laboratory of Molecular Cardiovascular Science, Ministry of Education, Peking University Third Hospital, Beijing, China
- Yuan-Geng-Shuo Wang
- Department of Cardiology and Institute of Vascular Medicine, Key Laboratory of Molecular Cardiovascular Science, Ministry of Education, Peking University Third Hospital, Beijing, China
- Chunli Shao
- Department of Cardiology and Institute of Vascular Medicine, Key Laboratory of Molecular Cardiovascular Science, Ministry of Education, Peking University Third Hospital, Beijing, China
- Wenyao Wang
- Department of Cardiology and Institute of Vascular Medicine, Key Laboratory of Molecular Cardiovascular Science, Ministry of Education, Peking University Third Hospital, Beijing, China
- Jiarong Li
- Henley Business School, University of Reading, Reading RG6 6UD, United Kingdom
- Ming-Qi Zheng
- Department of Cardiology, the First Hospital of Hebei Medical University, Graduate School of Hebei Medical University, Shijiazhuang, Hebei, China
- Yaodong Yang
- Institute for Artificial Intelligence, Peking University, Beijing, China
- Yi-Da Tang
- Department of Cardiology and Institute of Vascular Medicine, Key Laboratory of Molecular Cardiovascular Science, Ministry of Education, Peking University Third Hospital, Beijing, China
5
Collin H, Roberts MJ, Keogh K, Siriwardana A, Basto M. Improving clinical efficiency using retrieval-augmented generation in urologic oncology: a guideline-enhanced artificial intelligence approach. BJUI Compass 2025;6:e427. PMID: 39877560; PMCID: PMC11771495; DOI: 10.1002/bco2.427.
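The guideline-enhanced, retrieval-augmented generation (RAG) approach named in the title can be pictured as: embed guideline passages, retrieve those most similar to a clinical question, and ground the model's answer in the retrieved text. The sketch below assumes the OpenAI Python SDK; the embedding and chat model names, the toy guideline excerpts, and the single-query design are illustrative assumptions, not the paper's implementation.

```python
# Minimal RAG sketch (illustrative only): embed guideline snippets, retrieve
# the one most similar to a question, and ground the chat prompt in it.
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

guideline_chunks = [  # hypothetical guideline excerpts, invented here
    "Offer PSA testing to men over 50 after shared decision-making.",
    "Perform mpMRI before prostate biopsy in biopsy-naive men.",
]

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small",
                                    input=texts)
    return np.array([d.embedding for d in resp.data])

chunk_vecs = embed(guideline_chunks)

def answer(question: str, k: int = 1) -> str:
    qv = embed([question])[0]
    # cosine similarity between the question and each guideline chunk
    sims = chunk_vecs @ qv / (np.linalg.norm(chunk_vecs, axis=1)
                              * np.linalg.norm(qv))
    context = "\n".join(guideline_chunks[i]
                        for i in np.argsort(sims)[::-1][:k])
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "system",
                   "content": f"Answer using only this guideline text:\n{context}"},
                  {"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

print(answer("Should my biopsy-naive patient have an MRI first?"))
```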
Affiliation(s)
- Harry Collin
- Department of Urology, Royal Brisbane and Women's Hospital, Herston, Queensland, Australia
- Faculty of Medicine, The University of Queensland, Brisbane, Queensland, Australia
- Matthew J. Roberts
- Department of Urology, Royal Brisbane and Women's Hospital, Herston, Queensland, Australia
- Faculty of Medicine, University of Queensland Centre for Clinical Research, Brisbane, Queensland, Australia
- Kandice Keogh
- Department of Urology, Royal Brisbane and Women's Hospital, Herston, Queensland, Australia
- Faculty of Medicine, The University of Queensland, Brisbane, Queensland, Australia
- Amila Siriwardana
- Department of Urology, Royal Brisbane and Women's Hospital, Herston, Queensland, Australia
- Marnique Basto
- Department of Urology, Royal Brisbane and Women's Hospital, Herston, Queensland, Australia
6
Watson AL. Ethical considerations for artificial intelligence use in nursing informatics. Nurs Ethics 2024;31:1031-1040. PMID: 38318798; DOI: 10.1177/09697330241230515.
Abstract
Artificial intelligence is revolutionizing nursing informatics and health care by enhancing patient outcomes and access to care while streamlining nursing workflows. These advancements, while promising, have sparked debate on traditional nursing ethics concerns such as patient data handling and implicit bias. The key to unlocking the next frontier of holistic nursing care lies in nurses navigating the delicate balance between artificial intelligence and the core values of empathy and compassion. Mindful use of artificial intelligence, coupled with an unwavering ethical commitment by nurses, may transform the very essence of nursing.
7
Talyshinskii A, Naik N, Hameed BMZ, Zhanbyrbekuly U, Khairli G, Guliev B, Juliebø-Jones P, Tzelves L, Somani BK. Expanding horizons and navigating challenges for enhanced clinical workflows: ChatGPT in urology. Front Surg 2023;10:1257191. PMID: 37744723; PMCID: PMC10512827; DOI: 10.3389/fsurg.2023.1257191.
Abstract
Purpose of review: ChatGPT has emerged as a potential tool for facilitating doctors' workflows, but few studies have examined it in a urological context. Our objective was to analyze the pros and cons of ChatGPT use and how urologists can exploit it.
Recent findings: ChatGPT can facilitate clinical documentation and note-taking, patient communication and support, medical education, and research. In urology, ChatGPT has shown potential as a virtual healthcare aide for benign prostatic hyperplasia, an educational and prevention tool for prostate cancer, an educational support for urology residents, and an assistant in writing urological papers and academic work. However, several concerns about its use remain, including the lack of web crawling, the risk of accidental plagiarism, and patient data privacy.
Summary: These limitations call for further improvement of ChatGPT, such as ensuring the privacy of patient data, expanding the training data to include medical databases, and developing guidance on appropriate use. Urologists can also contribute by conducting studies of ChatGPT's effectiveness in clinical scenarios and nosologies beyond those listed above.
Affiliation(s)
- Ali Talyshinskii
- Department of Urology, Astana Medical University, Astana, Kazakhstan
- Nithesh Naik
- Department of Mechanical and Industrial Engineering, Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal, India
- Gafur Khairli
- Department of Urology, Astana Medical University, Astana, Kazakhstan
- Bakhman Guliev
- Department of Urology, Mariinsky Hospital, St Petersburg, Russia
- Lazaros Tzelves
- Department of Urology, National and Kapodistrian University of Athens, Sismanogleion Hospital, Marousi, Athens, Greece
- Bhaskar Kumar Somani
- Department of Urology, University Hospital Southampton NHS Trust, Southampton, United Kingdom