1
Balta KY, Javidan AP, Walser E, Arntfield R, Prager R. Evaluating the Appropriateness, Consistency, and Readability of ChatGPT in Critical Care Recommendations. J Intensive Care Med 2025; 40:184-190. [PMID: 39118320 PMCID: PMC11639400 DOI: 10.1177/08850666241267871]
Abstract
Background: We assessed 2 versions of the large language model (LLM) ChatGPT (versions 3.5 and 4.0) in generating appropriate, consistent, and readable recommendations on core critical care topics. Research Question: How do successive large language models compare in terms of generating appropriate, consistent, and readable recommendations on core critical care topics? Design and Methods: A set of 50 LLM-generated responses to clinical questions was evaluated by 2 independent intensivists based on a 5-point Likert scale for appropriateness, consistency, and readability. Results: ChatGPT 4.0 showed significantly higher median appropriateness scores compared to ChatGPT 3.5 (4.0 vs 3.0, P < .001). However, there was no significant difference in consistency between the 2 versions (40% vs 28%, P = 0.291). Readability, assessed by the Flesch-Kincaid Grade Level, was also not significantly different between the 2 models (14.3 vs 14.4, P = 0.93). Interpretation: Both models produced "hallucinations" (misinformation delivered with high confidence), which highlights the risk of relying on these tools without domain expertise. Despite potential for clinical application, both models lacked consistency, producing different results when asked the same question multiple times. The study underscores the need for clinicians to understand the strengths and limitations of LLMs for safe and effective implementation in critical care settings. Registration: https://osf.io/8chj7/.
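The Flesch-Kincaid Grade Level used above combines average sentence length with average syllables per word. The sketch below is a rough illustration of that calculation only; the regex tokenizer and vowel-group syllable counter are simplifying assumptions, not the validated instrument the study used.

```python
import re

def count_syllables(word: str) -> int:
    """Very rough syllable estimate: count runs of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    """FKGL = 0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / max(1, len(sentences))
            + 11.8 * syllables / max(1, len(words)) - 15.59)

sample = ("Septic shock requires early recognition, blood cultures, "
          "broad-spectrum antibiotics, and careful hemodynamic support.")
print(round(flesch_kincaid_grade(sample), 1))
```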
Affiliation(s)
- Kaan Y. Balta
- Schulich School of Medicine & Dentistry, Western University, London, Ontario, Canada
- Arshia P. Javidan
- Division of Vascular Surgery, Department of Surgery, University of Toronto, Toronto, Ontario, Canada
- Eric Walser
- Division of Critical Care, London Health Sciences Centre, Western University, London, Ontario, Canada
- Department of Surgery, Trauma Program, London Health Sciences Centre, London, Ontario, Canada
- Robert Arntfield
- Division of Critical Care, London Health Sciences Centre, Western University, London, Ontario, Canada
- Ross Prager
- Division of Critical Care, London Health Sciences Centre, Western University, London, Ontario, Canada
2
Ayoub NF, Rameau A, Brenner MJ, Bur AM, Ator GA, Briggs SE, Takashima M, Stankovic KM. American Academy of Otolaryngology-Head and Neck Surgery (AAO-HNS) Report on Artificial Intelligence. Otolaryngol Head Neck Surg 2025; 172:734-743. [PMID: 39666770 DOI: 10.1002/ohn.1080]
Abstract
This report synthesizes the American Academy of Otolaryngology-Head and Neck Surgery (AAO-HNS) Task Force's guidance on the integration of artificial intelligence (AI) in otolaryngology-head and neck surgery (OHNS). A comprehensive literature review was conducted, focusing on the applications, benefits, and challenges of AI in OHNS, alongside ethical, legal, and social implications. The Task Force, formulated by otolaryngologist experts in AI, used an iterative approach, adapted from the Delphi method, to prioritize topics for inclusion and to reach a consensus on guiding principles. The Task Force's findings highlight AI's transformative potential for OHNS, offering potential advancements in precision medicine, clinical decision support, operational efficiency, research, and education. However, challenges such as data quality, health equity, privacy concerns, transparency, regulatory gaps, and ethical dilemmas necessitate careful navigation. Incorporating AI into otolaryngology practice in a safe, equitable, and patient-centered manner requires clinician judgment, transparent AI systems, and adherence to ethical and legal standards. The Task Force principles underscore the importance of otolaryngologists' involvement in AI's ethical development, implementation, and regulation to harness benefits while mitigating risks. The proposed principles inform the integration of AI in otolaryngology, aiming to enhance patient outcomes, clinician well-being, and efficiency of health care delivery.
Affiliation(s)
- Noel F Ayoub
- Department of Otolaryngology-Head and Neck Surgery, Mass Eye & Ear, Boston, Massachusetts, USA
- Department of Otolaryngology-Head and Neck Surgery, Stanford University, Palo Alto, California, USA
- Anaïs Rameau
- Department of Otolaryngology-Head and Neck Surgery, Weill Cornell Medical College, Ithaca, New York, USA
- Michael J Brenner
- Department of Otolaryngology-Head and Neck Surgery, University of Michigan Medical School, Ann Arbor, Michigan, USA
- Andrés M Bur
- Department of Otolaryngology-Head and Neck Surgery, University of Kansas Medical Center, Kansas City, Kansas, USA
- Gregory A Ator
- Department of Otolaryngology-Head and Neck Surgery, University of Kansas Medical Center, Kansas City, Kansas, USA
- Selena E Briggs
- Department of Otolaryngology-Head and Neck Surgery, MedStar Georgetown University Hospital, Washington, District of Columbia, USA
- Masayoshi Takashima
- Department Otolaryngology-Head and Neck Surgery, Houston Methodist, Houston, Texas, USA
- Konstantina M Stankovic
- Department of Otolaryngology-Head and Neck Surgery, Stanford University, Palo Alto, California, USA
3
Rafiq K, Beery S, Palmer MS, Harchaoui Z, Abrahms B. Generative AI as a tool to accelerate the field of ecology. Nat Ecol Evol 2025:10.1038/s41559-024-02623-1. [PMID: 39880986 DOI: 10.1038/s41559-024-02623-1]
Abstract
The emergence of generative artificial intelligence (AI) models specializing in the generation of new data with the statistical patterns and properties of the data upon which the models were trained has profoundly influenced a range of academic disciplines, industry and public discourse. Combined with the vast amounts of diverse data now available to ecologists, from genetic sequences to remotely sensed animal tracks, generative AI presents enormous potential applications within ecology. Here we draw upon a range of fields to discuss unique potential applications in which generative AI could accelerate the field of ecology, including augmenting data-scarce datasets, extending observations of ecological patterns and increasing the accessibility of ecological data. We also highlight key challenges, risks and considerations when using generative AI within ecology, such as privacy risks, model biases and environmental effects. Ultimately, the future of generative AI in ecology lies in the development of robust interdisciplinary collaborations between ecologists and computer scientists. Such partnerships will be important for embedding ecological knowledge within AI, leading to more ecologically meaningful and relevant models. This will be critical for leveraging the power of generative AI to drive ecological insights into species across the globe.
Affiliation(s)
- Kasim Rafiq
- Center for Ecosystem Sentinels, Department of Biology, University of Washington, Seattle, WA, USA.
- Sara Beery
- AI and Decision Making, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA
- Meredith S Palmer
- Center for Biodiversity and Global Change, Yale University, New Haven, CT, USA
- Zaid Harchaoui
- Allen School in Computer Science and Engineering, University of Washington, Seattle, WA, USA
- Briana Abrahms
- Center for Ecosystem Sentinels, Department of Biology, University of Washington, Seattle, WA, USA
4
Khan AA, Khan AR, Munshi S, Dandapani H, Jimale M, Bogni FM, Khawaja H. Assessing the performance of ChatGPT in medical ethical decision-making: a comparative study with USMLE-based scenarios. J Med Ethics 2025:jme-2024-110240. [PMID: 39863417 DOI: 10.1136/jme-2024-110240]
Abstract
INTRODUCTION The integration of artificial intelligence (AI) into healthcare introduces innovative possibilities but raises ethical, legal and professional concerns. Assessing the performance of AI in core components of the United States Medical Licensing Examination (USMLE), such as communication skills, ethics, empathy and professionalism, is crucial. This study evaluates how well ChatGPT versions 3.5 and 4.0 handle complex medical scenarios using USMLE-Rx, AMBOSS and UWorld question banks, aiming to understand its ability to navigate patient interactions according to medical ethics and standards. METHODS We compiled 273 questions from AMBOSS, USMLE-Rx and UWorld, focusing on communication, social sciences, healthcare policy and ethics. GPT-3.5 and GPT-4 were tasked with answering and justifying their choices in new chat sessions to minimise model interference. Responses were compared against question bank rationales and average student performance to evaluate AI effectiveness in medical ethical decision-making. RESULTS GPT-3.5 answered 38.9% correctly in AMBOSS, 54.1% in USMLE-Rx and 57.4% in UWorld, with rationale accuracy rates of 83.3%, 90.0% and 87.0%, respectively. GPT-4 answered 75.9% correctly in AMBOSS, 64.9% in USMLE-Rx and 79.6% in UWorld, with rationale accuracy rates of 85.4%, 88.9%, and 98.8%, respectively. Both versions generally scored below average student performance, except GPT-4 in UWorld. CONCLUSION ChatGPT, particularly version 4.0, shows potential in navigating ethical and interpersonal medical scenarios. However, human reasoning currently surpasses AI in average performance. Continued development and training of AI systems can enhance proficiency in these critical healthcare aspects.
Affiliation(s)
- Ali A Khan
- Warren Alpert Medical School, Brown University, Providence, Rhode Island, USA
- Ali R Khan
- The University of Texas Medical Branch at Galveston, Galveston, Texas, USA
- Saminah Munshi
- Warren Alpert Medical School, Brown University, Providence, Rhode Island, USA
- Hari Dandapani
- Warren Alpert Medical School, Brown University, Providence, Rhode Island, USA
- Mohamed Jimale
- The University of Texas Medical Branch at Galveston, Galveston, Texas, USA
- Franck M Bogni
- Warren Alpert Medical School, Brown University, Providence, Rhode Island, USA
- Hussain Khawaja
- Warren Alpert Medical School, Brown University, Providence, Rhode Island, USA
- Division of General Internal Medicine, Rhode Island Hospital, Providence, Rhode Island, USA
5
Gupta N, Khatri K, Malik Y, Lakhani A, Kanwal A, Aggarwal S, Dahuja A. Exploring prospects, hurdles, and road ahead for generative artificial intelligence in orthopedic education and training. BMC Med Educ 2024; 24:1544. [PMID: 39732679 DOI: 10.1186/s12909-024-06592-8]
Abstract
Generative Artificial Intelligence (AI), characterized by its ability to generate diverse forms of content including text, images, video, and audio, has revolutionized many fields, including medical education. Generative AI leverages machine learning to create diverse content, enabling personalized learning, enhancing resource accessibility, and facilitating interactive case studies. This narrative review explores the integration of generative AI into orthopedic education and training, highlighting its potential, current challenges, and future trajectory. A review of recent literature was conducted to evaluate the current applications, identify potential benefits, and outline limitations of integrating generative AI in orthopedic education. Key findings indicate that generative AI holds substantial promise in enhancing orthopedic training through its various applications such as providing real-time explanations, adaptive learning materials tailored to individual students' specific needs, and immersive virtual simulations. However, despite its potential, the integration of generative AI into orthopedic education faces significant issues such as accuracy, bias, inconsistent outputs, ethical and regulatory concerns, and the critical need for human oversight. Although generative AI models such as ChatGPT and others have shown impressive capabilities, their current performance on orthopedic exams remains suboptimal, highlighting the need for further development to match the complexity of clinical reasoning and knowledge application. Future research should focus on addressing these challenges through ongoing research, optimizing generative AI models for medical content, exploring best practices for ethical AI usage and curriculum integration, and evaluating the long-term impact of these technologies on learning outcomes. By expanding AI's knowledge base, refining its ability to interpret clinical images, and ensuring reliable, unbiased outputs, generative AI holds the potential to revolutionize orthopedic education. This work aims to provide a framework for incorporating generative AI into orthopedic curricula to create a more effective, engaging, and adaptive learning environment for future orthopedic practitioners.
Affiliation(s)
- Nikhil Gupta
- Department of Pharmacology, All India Institute of Medical Sciences, Bathinda, Punjab, 151001, India
- Kavin Khatri
- Department of Orthopedics, Postgraduate Institute of Medical Education and Research (PGIMER) Satellite Centre, Sangrur, Punjab, 148001, India.
- Yogender Malik
- Department of Forensic Medicine and Toxicology, Bhagat Phool Singh Govt Medical College for Women, Khanpur Kalan, Sonepat, Haryana, 131305, India
- Amit Lakhani
- Department of Orthopedics, Dr B.R. Ambedkar State Institute of Medical Sciences, Mohali, Punjab, 160055, India
- Abhinav Kanwal
- Department of Pharmacology, All India Institute of Medical Sciences, Bathinda, Punjab, 151001, India.
- Sameer Aggarwal
- Department of Orthopedics, Postgraduate Institute of Medical Education and Research (PGIMER), Chandigarh, 160012, India
- Anshul Dahuja
- Department of Orthopedics, Guru Gobind Singh Medical College and Hospital, Faridkot, Punjab, 151203, India
6
Agbareia R, Omar M, Soffer S, Glicksberg BS, Nadkarni GN, Klang E. Visual-textual integration in LLMs for medical diagnosis: A preliminary quantitative analysis. Comput Struct Biotechnol J 2024; 27:184-189. [PMID: 39850658 PMCID: PMC11754970 DOI: 10.1016/j.csbj.2024.12.019]
Abstract
Background and aim Visual data from images is essential for many medical diagnoses. This study evaluates the performance of multimodal Large Language Models (LLMs) in integrating textual and visual information for diagnostic purposes. Methods We tested GPT-4o and Claude Sonnet 3.5 on 120 clinical vignettes with and without accompanying images. Each vignette included patient demographics, a chief concern, and relevant medical history. Vignettes were paired with either clinical or radiological images from two sources: 100 images from the OPENi database and 20 images from recent NEJM challenges, ensuring they were not in the LLMs' training sets. Three primary care physicians served as a human benchmark. We analyzed diagnostic accuracy and the models' explanations for a subset of cases. Results LLMs outperformed physicians in text-only scenarios (GPT-4o: 70.8 %, Claude Sonnet 3.5: 59.5 %, Physicians: 39.5 %, p < 0.001, Bonferroni-adjusted). With image integration, all improved, but physicians showed the largest gain (GPT-4o: 84.5 %, p < 0.001; Claude Sonnet 3.5: 67.3 %, p = 0.060; Physicians: 78.8 %, p < 0.001, all Bonferroni-adjusted). LLMs altered their explanatory reasoning in 45-60 % of cases when images were provided. Conclusion Multimodal LLMs showed higher diagnostic accuracy than physicians in text-only scenarios, even in cases designed to require visual interpretation, suggesting that while images can enhance diagnostic accuracy, they may not be essential in every instance. Although adding images further improved LLM performance, the magnitude of this improvement was smaller than that observed in physicians. These findings suggest that enhanced visual data processing may be needed for LLMs to achieve the degree of image-related performance gains seen in human examiners.
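The comparisons above are Bonferroni-adjusted tests of proportions. The sketch below illustrates that general pattern with invented counts loosely matching the quoted percentages; it assumes the statsmodels two-proportion z-test and is not a reproduction of the authors' analysis.

```python
from statsmodels.stats.proportion import proportions_ztest

# Invented counts for illustration: correct diagnoses out of 120 vignettes.
comparisons = {
    "text only: GPT-4o vs physicians": ([85, 47], [120, 120]),
    "text+image: GPT-4o vs physicians": ([101, 95], [120, 120]),
}
n_tests = len(comparisons)

for label, (correct, totals) in comparisons.items():
    z, p = proportions_ztest(correct, totals)
    p_adj = min(1.0, p * n_tests)  # Bonferroni: multiply raw p by the number of tests
    print(f"{label}: z = {z:.2f}, raw p = {p:.4f}, Bonferroni p = {p_adj:.4f}")
```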
Affiliation(s)
- Reem Agbareia
- Ophthalmology Department, Hadassah Medical Center, Jerusalem, Israel
- Mahmud Omar
- Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Shelly Soffer
- Institute of Hematology, Davidoff Cancer Center, Rabin Medical Center, Petah-Tikva, Israel
- Benjamin S. Glicksberg
- Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Girish N. Nadkarni
- Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Eyal Klang
- Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, USA
7
Rocha-Silva R, Rodrigues MAM, Viana RB, Nakamoto FP, Vancini RL, Andrade MS, Rosemann T, Weiss K, Knechtle B, de Lira CAB. Critical analysis of information provided by ChatGPT on lactate, exercise, fatigue, and muscle pain: current insights and future prospects for enhancement. Adv Physiol Educ 2024; 48:898-903. [PMID: 39262324 DOI: 10.1152/advan.00073.2024]
Abstract
This study aimed to critically evaluate the information provided by ChatGPT on the role of lactate in fatigue and muscle pain during physical exercise. We inserted the prompt "What is the cause of fatigue and pain during exercise?" using ChatGPT versions 3.5 and 4o. In both versions, ChatGPT associated muscle fatigue with glycogen depletion and "lactic acid" accumulation, whereas pain was linked to processes such as inflammation and microtrauma. We deepened the investigation with ChatGPT 3.5, implementing user feedback to question the accuracy of the information about lactate. The response was then reformulated, involving a scientific debate about the true role of lactate in physical exercise and debunking the idea that it is the primary cause of muscle fatigue and pain. We also created a "well-crafted prompt," which included persona identification and thematic characterization, resulting in much more accurate information in both the ChatGPT 3.5 and 4o models, presenting a range of information from the physiological process of lactate to its true role in physical exercise. The results indicated that the accuracy of the responses provided by ChatGPT can vary depending on the data available in its database and, more importantly, on how the question is formulated. Therefore, it is indispensable that educators guide their students in the processes of managing the AI tool to mitigate risks of misinformation. NEW & NOTEWORTHY Generative artificial intelligence (AI), exemplified by ChatGPT, provides immediate and easily accessible answers about lactate and exercise. However, the reliability of this information may fluctuate, contingent upon the scope and intricacy of the knowledge derived from the training process before the most recent update. Furthermore, a deep understanding of the basic principles of human physiology becomes crucial for the effective correction and safe use of this technology.
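The study's actual prompts are not reproduced here; the snippet below only sketches the general pattern of a "well-crafted prompt" that prefixes a persona and thematic framing to the question before it is sent to a chat model. All wording is invented for illustration.

```python
def build_prompt(question: str, persona: str = "", framing: str = "") -> str:
    """Assemble a chat prompt, optionally prefixed by a persona and thematic framing."""
    parts = [p for p in (f"You are {persona}." if persona else "", framing, question) if p]
    return "\n".join(parts)

naive = build_prompt("What is the cause of fatigue and pain during exercise?")
crafted = build_prompt(
    "What is the cause of fatigue and pain during exercise?",
    persona="an exercise physiologist who follows current evidence on lactate metabolism",
    framing="Distinguish peripheral from central fatigue and address common "
            "misconceptions about 'lactic acid'.",
)
print(crafted)
```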
Affiliation(s)
- Rizia Rocha-Silva
- Faculty of Physical Education and Dance, Federal University of Goiás, Goiânia, Brazil
- Ricardo Borges Viana
- Institute of Physical Education and Sports, Federal University of Ceará, Fortaleza, Brazil
- Rodrigo Luiz Vancini
- Center for Physical Education and Sports, Federal University of Espírito Santo, Vitória, Brazil
- Thomas Rosemann
- Institute of Primary Care, University of Zurich, Zurich, Switzerland
- Katja Weiss
- Institute of Primary Care, University of Zurich, Zurich, Switzerland
- Beat Knechtle
- Institute of Primary Care, University of Zurich, Zurich, Switzerland
- Medbase St. Gallen Am Vadianplatz, St. Gallen, Switzerland
8
Pillai J, Pillai K. ChatGPT as a medical education resource in cardiology: Mitigating replicability challenges and optimizing model performance. Curr Probl Cardiol 2024; 49:102879. [PMID: 39393621 DOI: 10.1016/j.cpcardiol.2024.102879]
Abstract
Given the rapid development of large language models (LLMs) such as ChatGPT and their ability to understand and generate human-like text, these technologies have inspired efforts to explore their capabilities in natural language processing tasks, especially in healthcare contexts. The performance of these tools has been evaluated thoroughly across medicine in diverse tasks, including standardized medical examinations, medical decision-making, and many others. In this journal, Anaya et al. published a study comparing the readability metrics of medical education resources formulated by ChatGPT with those of major U.S. institutions (AHA, ACC, HFSA) about heart failure. In this work, we provide a critical review of this article and further describe approaches to help mitigate challenges in the reproducibility of studies evaluating LLMs in cardiology. Additionally, we provide suggestions to optimize sampling of responses provided by LLMs for future studies. Overall, while the study by Anaya et al. provides a meaningful contribution to the literature on LLMs in cardiology, further comprehensive studies are necessary to address current limitations and further strengthen our understanding of these novel tools.
Affiliation(s)
- Joshua Pillai
- Department of Neurosciences, School of Medicine, University of California San Diego, 9375, Gilman Dr, La Jolla, CA 92161, USA.
- Kathryn Pillai
- Department of Medical Education, School of Medicine, California University of Science and Medicine, 1501 Violet St, Colton, CA, USA
9
Gupta V, Gu Y, Lustik SJ, Park W, Yin S, Rubinger D, Chang FM, Panda K, Besharat S, Sadhra H, Glance LG. Performance of a Large Language Model on the Anesthesiology Continuing Education Exam. Anesthesiology 2024; 141:1196-1199. [PMID: 39530718 DOI: 10.1097/aln.0000000000005181]
Affiliation(s)
- Vardaan Gupta
- University of Rochester School of Medicine, Rochester, New York (V.G.).
10
Seo J, Choi D, Kim T, Cha WC, Kim M, Yoo H, Oh N, Yi Y, Lee KH, Choi E. Evaluation Framework of Large Language Models in Medical Documentation: Development and Usability Study. J Med Internet Res 2024; 26:e58329. [PMID: 39566044 PMCID: PMC11618017 DOI: 10.2196/58329]
Abstract
BACKGROUND The advancement of large language models (LLMs) offers significant opportunities for health care, particularly in the generation of medical documentation. However, challenges related to ensuring the accuracy and reliability of LLM outputs, coupled with the absence of established quality standards, have raised concerns about their clinical application. OBJECTIVE This study aimed to develop and validate an evaluation framework for assessing the accuracy and clinical applicability of LLM-generated emergency department (ED) records, aiming to enhance artificial intelligence integration in health care documentation. METHODS We organized the Healthcare Prompt-a-thon, a competitive event designed to explore the capabilities of LLMs in generating accurate medical records. The event involved 52 participants who generated 33 initial ED records using HyperCLOVA X, a Korean-specialized LLM. We applied a dual evaluation approach. First, clinical evaluation: 4 medical professionals evaluated the records using a 5-point Likert scale across 5 criteria-appropriateness, accuracy, structure/format, conciseness, and clinical validity. Second, quantitative evaluation: We developed a framework to categorize and count errors in the LLM outputs, identifying 7 key error types. Statistical methods, including Pearson correlation and intraclass correlation coefficients (ICC), were used to assess consistency and agreement among evaluators. RESULTS The clinical evaluation demonstrated strong interrater reliability, with ICC values ranging from 0.653 to 0.887 (P<.001), and a test-retest reliability Pearson correlation coefficient of 0.776 (P<.001). Quantitative analysis revealed that invalid generation errors were the most common, constituting 35.38% of total errors, while structural malformation errors had the most significant negative impact on the clinical evaluation score (Pearson r=-0.654; P<.001). A strong negative correlation was found between the number of quantitative errors and clinical evaluation scores (Pearson r=-0.633; P<.001), indicating that higher error rates corresponded to lower clinical acceptability. CONCLUSIONS Our research provides robust support for the reliability and clinical acceptability of the proposed evaluation framework. It underscores the framework's potential to mitigate clinical burdens and foster the responsible integration of artificial intelligence technologies in health care, suggesting a promising direction for future research and practical applications in the field.
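The error-score relationship reported above is an ordinary Pearson correlation between per-record error counts and clinical ratings. A minimal sketch with invented numbers (the study's data are not reproduced here):

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical per-record values (illustrative only): number of framework errors
# counted in each LLM-generated ED record, and the mean clinical Likert score
# that record received from the physician raters.
error_counts   = np.array([0, 1, 1, 2, 3, 4, 5, 6, 7, 9])
clinical_score = np.array([4.8, 4.5, 4.6, 4.1, 3.9, 3.4, 3.2, 2.8, 2.5, 2.0])

r, p = pearsonr(error_counts, clinical_score)
print(f"Pearson r = {r:.3f}, p = {p:.4f}")  # expect a strong negative correlation
```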
Affiliation(s)
- Junhyuk Seo
- Department of Digital Health, Samsung Advanced Institute of Health Sciences and Technology (SAIHST), Sungkyunkwan University, Seoul, Republic of Korea
- Department of Nursing, Samsung Medical Center, Sungkyunkwan University School of Medicine, Seoul, Republic of Korea
- Dasol Choi
- Department of Digital Health, Samsung Advanced Institute of Health Sciences and Technology (SAIHST), Sungkyunkwan University, Seoul, Republic of Korea
- Taerim Kim
- Department of Digital Health, Samsung Advanced Institute of Health Sciences and Technology (SAIHST), Sungkyunkwan University, Seoul, Republic of Korea
- Department of Emergency Medicine, Samsung Medical Center, Sungkyunkwan University School of Medicine, Seoul, Republic of Korea
- Won Chul Cha
- Department of Digital Health, Samsung Advanced Institute of Health Sciences and Technology (SAIHST), Sungkyunkwan University, Seoul, Republic of Korea
- Department of Emergency Medicine, Samsung Medical Center, Sungkyunkwan University School of Medicine, Seoul, Republic of Korea
- Minha Kim
- Department of Emergency Medicine, Samsung Medical Center, Sungkyunkwan University School of Medicine, Seoul, Republic of Korea
- Haanju Yoo
- NAVER Digital Healthcare Lab, Seongnam, Republic of Korea
- Namkee Oh
- Department of Surgery, Samsung Medical Center, Sungkyunkwan University School of Medicine, Seoul, Republic of Korea
- YongJin Yi
- Department of Internal Medicine, College of Medicine, Dankook University, Cheonan, Republic of Korea
- Kye Hwa Lee
- Department of Information Medicine, Asan Medical Center and University of Ulsan College of Medicine, Seoul, Republic of Korea
- Edward Choi
- Korea Advanced Institute of Science and Technology, Daejeon, Republic of Korea
11
Wang D, Liang J, Ye J, Li J, Li J, Zhang Q, Hu Q, Pan C, Wang D, Liu Z, Shi W, Shi D, Li F, Qu B, Zheng Y. Enhancement of the Performance of Large Language Models in Diabetes Education through Retrieval-Augmented Generation: Comparative Study. J Med Internet Res 2024; 26:e58041. [PMID: 39046096 PMCID: PMC11584532 DOI: 10.2196/58041]
Abstract
BACKGROUND Large language models (LLMs) demonstrated advanced performance in processing clinical information. However, commercially available LLMs lack specialized medical knowledge and remain susceptible to generating inaccurate information. Given the need for self-management in diabetes, patients commonly seek information online. We introduce the Retrieval-augmented Information System for Enhancement (RISE) framework and evaluate its performance in enhancing LLMs to provide accurate responses to diabetes-related inquiries. OBJECTIVE This study aimed to evaluate the potential of the RISE framework, an information retrieval and augmentation tool, to improve the LLM's performance to accurately and safely respond to diabetes-related inquiries. METHODS The RISE, an innovative retrieval augmentation framework, comprises 4 steps: rewriting query, information retrieval, summarization, and execution. Using a set of 43 common diabetes-related questions, we evaluated 3 base LLMs (GPT-4, Anthropic Claude 2, Google Bard) and their RISE-enhanced versions respectively. Assessments were conducted by clinicians for accuracy and comprehensiveness and by patients for understandability. RESULTS The integration of RISE significantly improved the accuracy and comprehensiveness of responses from all 3 base LLMs. On average, the percentage of accurate responses increased by 12% (15/129) with RISE. Specifically, the rates of accurate responses increased by 7% (3/43) for GPT-4, 19% (8/43) for Claude 2, and 9% (4/43) for Google Bard. The framework also enhanced response comprehensiveness, with mean scores improving by 0.44 (SD 0.10). Understandability was also enhanced by 0.19 (SD 0.13) on average. Data collection was conducted from September 30, 2023 to February 5, 2024. CONCLUSIONS The RISE significantly improves LLMs' performance in responding to diabetes-related inquiries, enhancing accuracy, comprehensiveness, and understandability. These improvements have crucial implications for RISE's future role in patient education and chronic illness self-management, which contributes to relieving medical resource pressures and raising public awareness of medical knowledge.
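The four RISE steps map naturally onto a small pipeline. The sketch below is schematic only: the knowledge snippets are invented, the retriever is a toy word-overlap ranker, and `call_llm` is a placeholder for whichever chat model is being augmented; it is not the authors' implementation.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a real chat-model call (GPT-4, Claude, Bard, ...)."""
    return f"[model answer grounded in the supplied context ({len(prompt)} chars)]"

KNOWLEDGE_BASE = [
    "Hypoglycemia below 70 mg/dL should be treated with 15 g of fast-acting carbohydrate.",
    "HbA1c reflects average blood glucose over roughly the preceding three months.",
    "Metformin is a common first-line oral medication for type 2 diabetes.",
]

def rewrite_query(question: str) -> str:
    # Step 1 (rewriting query): in RISE this is an LLM rewrite; here we just normalize.
    return question.lower().rstrip("?")

def retrieve(query: str, k: int = 2) -> list[str]:
    # Step 2 (information retrieval): toy ranking by word overlap with the query.
    q = set(query.split())
    return sorted(KNOWLEDGE_BASE,
                  key=lambda s: len(q & set(s.lower().split())),
                  reverse=True)[:k]

def summarize(snippets: list[str]) -> str:
    # Step 3 (summarization): condense the retrieved evidence.
    return call_llm("Summarize for a patient:\n" + "\n".join(snippets))

def execute(question: str, evidence: str) -> str:
    # Step 4 (execution): answer the original question grounded in the evidence.
    return call_llm(f"Context:\n{evidence}\n\nQuestion: {question}\nAnswer:")

question = "What should I do if my blood sugar drops too low?"
print(execute(question, summarize(retrieve(rewrite_query(question)))))
```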
Affiliation(s)
- Dingqiao Wang
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, GuangZhou, China
- Jiangbo Liang
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, GuangZhou, China
- Jinguo Ye
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, GuangZhou, China
- Jingni Li
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, GuangZhou, China
- Jingpeng Li
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, GuangZhou, China
- Qikai Zhang
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, GuangZhou, China
- Qiuling Hu
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, GuangZhou, China
- Caineng Pan
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, GuangZhou, China
- Dongliang Wang
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, GuangZhou, China
- Zhong Liu
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, GuangZhou, China
- Wen Shi
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, GuangZhou, China
- Danli Shi
- Research Centre for SHARP Vision, The Hong Kong Polytechnic University, Hong Kong, China
- Fei Li
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, GuangZhou, China
- Bo Qu
- Peking University Third Hospital, Beijing, China
- Yingfeng Zheng
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, GuangZhou, China
12
Waldock WJ, Zhang J, Guni A, Nabeel A, Darzi A, Ashrafian H. The Accuracy and Capability of Artificial Intelligence Solutions in Health Care Examinations and Certificates: Systematic Review and Meta-Analysis. J Med Internet Res 2024; 26:e56532. [PMID: 39499913 PMCID: PMC11576595 DOI: 10.2196/56532]
Abstract
BACKGROUND Large language models (LLMs) have dominated public interest due to their apparent capability to accurately replicate learned knowledge in narrative text. However, there is a lack of clarity about the accuracy and capability standards of LLMs in health care examinations. OBJECTIVE We conducted a systematic review of LLM accuracy, as tested under health care examination conditions, as compared to known human performance standards. METHODS We quantified the accuracy of LLMs in responding to health care examination questions and evaluated the consistency and quality of study reporting. The search included all papers up until September 10, 2023, covering all LLMs published in English-language journals that report clear LLM accuracy standards. The exclusion criteria were as follows: the assessment was not a health care exam, there was no LLM, there was no evaluation of comparable success accuracy, and the literature was not original research. The literature search included the following Medical Subject Headings (MeSH) terms used in all possible combinations: "artificial intelligence," "ChatGPT," "GPT," "LLM," "large language model," "machine learning," "neural network," "Generative Pre-trained Transformer," "Generative Transformer," "Generative Language Model," "Generative Model," "medical exam," "healthcare exam," and "clinical exam." Sensitivity, accuracy, and precision data were extracted, including relevant CIs. RESULTS The search identified 1673 relevant citations. After removing duplicate results, 1268 (75.8%) papers were screened for titles and abstracts, and 32 (2.5%) studies were included for full-text review. Our meta-analysis suggested that LLMs are able to perform with an overall medical examination accuracy of 0.61 (CI 0.58-0.64) and a United States Medical Licensing Examination (USMLE) accuracy of 0.51 (CI 0.46-0.56), while Chat Generative Pretrained Transformer (ChatGPT) can perform with an overall medical examination accuracy of 0.64 (CI 0.6-0.67). CONCLUSIONS LLMs offer promise to remediate health care demand and staffing challenges by providing accurate and efficient context-specific information to critical decision makers. For policy and deployment decisions about LLMs to advance health care, we proposed a new framework called RUBRICC (Regulatory, Usability, Bias, Reliability [Evidence and Safety], Interoperability, Cost, and Codesign-Patient and Public Involvement and Engagement [PPIE]). This presents a valuable opportunity to direct the clinical commissioning of new LLM capabilities into health services, while respecting patient safety considerations. TRIAL REGISTRATION OSF Registries osf.io/xqzkw; https://osf.io/xqzkw.
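Pooled accuracies with CIs, as quoted above, are the kind of quantity a fixed-effect meta-analysis of proportions produces. A minimal sketch on the logit scale, with invented study counts (the review's actual model and data are not reproduced here):

```python
import math

# Hypothetical per-study results as (correct, total) - illustrative only.
studies = [(62, 100), (180, 300), (45, 90), (140, 200)]

def logit(p: float) -> float:
    return math.log(p / (1 - p))

def inv_logit(x: float) -> float:
    return 1 / (1 + math.exp(-x))

weights, estimates = [], []
for correct, total in studies:
    p = correct / total
    var = 1 / (total * p) + 1 / (total * (1 - p))  # approx. variance of logit(p)
    weights.append(1 / var)
    estimates.append(logit(p))

pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
se = math.sqrt(1 / sum(weights))
lo, hi = inv_logit(pooled - 1.96 * se), inv_logit(pooled + 1.96 * se)
print(f"Pooled accuracy {inv_logit(pooled):.2f} (95% CI {lo:.2f}-{hi:.2f})")
```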
Affiliation(s)
- Joe Zhang
- Imperial College London, London, United Kingdom
- Ahmad Guni
- Imperial College London, London, United Kingdom
- Ahmad Nabeel
- Institute of Global Health Innovation, Imperial College London, London, United Kingdom
- Ara Darzi
- Imperial College London, London, United Kingdom
- Hutan Ashrafian
- Institute of Global Health Innovation, Imperial College London, London, United Kingdom
13
Du X, Novoa-Laurentiev J, Plasek JM, Chuang YW, Wang L, Marshall GA, Mueller SK, Chang F, Datta S, Paek H, Lin B, Wei Q, Wang X, Wang J, Ding H, Manion FJ, Du J, Bates DW, Zhou L. Enhancing early detection of cognitive decline in the elderly: a comparative study utilizing large language models in clinical notes. EBioMedicine 2024; 109:105401. [PMID: 39396423 DOI: 10.1016/j.ebiom.2024.105401]
Abstract
BACKGROUND Large language models (LLMs) have shown promising performance in various healthcare domains, but their effectiveness in identifying specific clinical conditions in real medical records is less explored. This study evaluates LLMs for detecting signs of cognitive decline in real electronic health record (EHR) clinical notes, comparing their error profiles with traditional models. The insights gained will inform strategies for performance enhancement. METHODS This study, conducted at Mass General Brigham in Boston, MA, analysed clinical notes from the four years prior to a 2019 diagnosis of mild cognitive impairment in patients aged 50 and older. We developed prompts for two LLMs, Llama 2 and GPT-4, on Health Insurance Portability and Accountability Act (HIPAA)-compliant cloud-computing platforms using multiple approaches (e.g., hard prompting, retrieval augmented generation, and error analysis-based instructions) to select the optimal LLM-based method. Baseline models included a hierarchical attention-based neural network and XGBoost. Subsequently, we constructed an ensemble of the three models using a majority vote approach. Confusion-matrix-based scores were used for model evaluation. FINDINGS We used a randomly annotated sample of 4949 note sections from 1969 patients (women: 1046 [53.1%]; age: mean, 76.0 [SD, 13.3] years), filtered with keywords related to cognitive functions, for model development. For testing, a random annotated sample of 1996 note sections from 1161 patients (women: 619 [53.3%]; age: mean, 76.5 [SD, 10.2] years) without keyword filtering was utilised. GPT-4 demonstrated superior accuracy and efficiency compared to Llama 2, but did not outperform traditional models. The ensemble model outperformed the individual models in terms of all evaluation metrics with statistical significance (p < 0.01), achieving a precision of 90.2% [95% CI: 81.9%-96.8%], a recall of 94.2% [95% CI: 87.9%-98.7%], and an F1-score of 92.1% [95% CI: 86.8%-96.4%]. Notably, the ensemble model showed a significant improvement in precision, increasing from a range of 70%-79% to above 90%, compared to the best-performing single model. Error analysis revealed that 63 samples were incorrectly predicted by at least one model; however, only 2 cases (3.2%) were mutual errors across all models, indicating diverse error profiles among them. INTERPRETATION LLMs and traditional machine learning models trained using local EHR data exhibited diverse error profiles. The ensemble of these models was found to be complementary, enhancing diagnostic performance. Future research should investigate integrating LLMs with smaller, localised models and incorporating medical data and domain knowledge to enhance performance on specific tasks. FUNDING This research was supported by the National Institute on Aging grants (R44AG081006, R01AG080429) and National Library of Medicine grant (R01LM014239).
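The ensembling step described above is a plain majority vote over the three models' note-level labels. A minimal sketch with invented predictions (not the study's code or data):

```python
import numpy as np

# Hypothetical note-level predictions (1 = signs of cognitive decline) from
# three models - e.g., an LLM, an attention-based network, and XGBoost.
pred_llm     = np.array([1, 0, 1, 1, 0, 0, 1, 0])
pred_neural  = np.array([1, 0, 0, 1, 0, 1, 1, 0])
pred_xgboost = np.array([1, 1, 1, 1, 0, 0, 0, 0])

votes = pred_llm + pred_neural + pred_xgboost
ensemble = (votes >= 2).astype(int)  # majority vote among the three models
print(ensemble)
```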
Affiliation(s)
- Xinsong Du
- Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, MA, 02115, USA; Department of Medicine, Harvard Medical School, Boston, MA, 02115, USA.
- John Novoa-Laurentiev
- Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, MA, 02115, USA
- Joseph M Plasek
- Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, MA, 02115, USA; Department of Medicine, Harvard Medical School, Boston, MA, 02115, USA
- Ya-Wen Chuang
- Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, MA, 02115, USA; Department of Medicine, Harvard Medical School, Boston, MA, 02115, USA; Division of Nephrology, Taichung Veterans General Hospital, Taichung, 407219, Taiwan; Department of Post-Baccalaureate Medicine, College of Medicine, National Chung Hsing University, Taichung, 402202, Taiwan; School of Medicine, College of Medicine, China Medical University, Taichung, 406040, Taiwan
- Liqin Wang
- Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, MA, 02115, USA; Department of Medicine, Harvard Medical School, Boston, MA, 02115, USA
- Gad A Marshall
- Department of Medicine, Harvard Medical School, Boston, MA, 02115, USA; Department of Neurology, Brigham and Women's Hospital, Boston, MA, 02115, USA
- Stephanie K Mueller
- Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, MA, 02115, USA; Department of Medicine, Harvard Medical School, Boston, MA, 02115, USA
- Frank Chang
- Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, MA, 02115, USA
- Surabhi Datta
- Intelligent Medical Objects, Rosemont, Illinois, 60018, USA
- Hunki Paek
- Intelligent Medical Objects, Rosemont, Illinois, 60018, USA
- Bin Lin
- Intelligent Medical Objects, Rosemont, Illinois, 60018, USA
- Qiang Wei
- Intelligent Medical Objects, Rosemont, Illinois, 60018, USA
- Xiaoyan Wang
- Intelligent Medical Objects, Rosemont, Illinois, 60018, USA
- Jingqi Wang
- Intelligent Medical Objects, Rosemont, Illinois, 60018, USA
- Hao Ding
- Intelligent Medical Objects, Rosemont, Illinois, 60018, USA
- Frank J Manion
- Intelligent Medical Objects, Rosemont, Illinois, 60018, USA
- Jingcheng Du
- Intelligent Medical Objects, Rosemont, Illinois, 60018, USA
- David W Bates
- Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, MA, 02115, USA; Department of Medicine, Harvard Medical School, Boston, MA, 02115, USA
- Li Zhou
- Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, MA, 02115, USA; Department of Medicine, Harvard Medical School, Boston, MA, 02115, USA
14
Marshall RF, Mallem K, Xu H, Thorne J, Burkholder B, Chaon B, Liberman P, Berkenstock M. Investigating the Accuracy and Completeness of an Artificial Intelligence Large Language Model About Uveitis: An Evaluation of ChatGPT. Ocul Immunol Inflamm 2024; 32:2052-2055. [PMID: 38394625 DOI: 10.1080/09273948.2024.2317417]
Abstract
PURPOSE To assess the accuracy and completeness of ChatGPT-generated answers regarding uveitis description, prevention, treatment, and prognosis. METHODS Thirty-two uveitis-related questions were generated by a uveitis specialist and inputted into ChatGPT 3.5. Answers were compiled into a survey and were reviewed by five uveitis specialists using standardized Likert scales of accuracy and completeness. RESULTS In total, the median accuracy score for all the uveitis questions (n = 32) was 4.00 (between "more correct than incorrect" and "nearly all correct"), and the median completeness score was 2.00 ("adequate, addresses all aspects of the question and provides the minimum amount of information required to be considered complete"). The interrater variability assessment had a total kappa value of 0.0278 for accuracy and 0.0847 for completeness. CONCLUSION ChatGPT can provide relatively high accuracy responses for various questions related to uveitis; however, the answers it provides are incomplete, with some inaccuracies. Its utility in providing medical information requires further validation and development prior to serving as a source of uveitis information for patients.
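Interrater agreement of the kind reported above is often summarized with kappa statistics. The sketch below computes a mean pairwise Cohen's kappa across five raters using invented Likert ratings; it assumes scikit-learn is available and is not necessarily the same kappa variant the authors used.

```python
from itertools import combinations
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical 5-point Likert accuracy ratings from five specialists for ten
# ChatGPT answers (rows = raters, columns = answers) - illustrative only.
ratings = np.array([
    [4, 5, 3, 4, 2, 5, 4, 3, 4, 5],
    [4, 4, 3, 5, 3, 4, 4, 2, 4, 4],
    [5, 4, 2, 4, 2, 5, 3, 3, 5, 4],
    [3, 5, 3, 4, 3, 4, 4, 3, 4, 5],
    [4, 4, 4, 4, 2, 5, 4, 2, 4, 4],
])

pairwise = [cohen_kappa_score(ratings[i], ratings[j])
            for i, j in combinations(range(len(ratings)), 2)]
print(f"Mean pairwise Cohen's kappa: {np.mean(pairwise):.3f}")
```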
Affiliation(s)
- Rayna F Marshall
- The Drexel University College of Medicine, Philadelphia, Pennsylvania, USA
- Krishna Mallem
- The Drexel University College of Medicine, Philadelphia, Pennsylvania, USA
- Hannah Xu
- University of California San Diego, San Diego, California, USA
- Jennifer Thorne
- The Wilmer Eye Institute, Division of Ocular Immunology, The Johns Hopkins University School of Medicine, Baltimore, Maryland, USA
- Bryn Burkholder
- The Wilmer Eye Institute, Division of Ocular Immunology, The Johns Hopkins University School of Medicine, Baltimore, Maryland, USA
- Benjamin Chaon
- The Wilmer Eye Institute, Division of Ocular Immunology, The Johns Hopkins University School of Medicine, Baltimore, Maryland, USA
- Paulina Liberman
- The Wilmer Eye Institute, Division of Ocular Immunology, The Johns Hopkins University School of Medicine, Baltimore, Maryland, USA
- Meghan Berkenstock
- The Wilmer Eye Institute, Division of Ocular Immunology, The Johns Hopkins University School of Medicine, Baltimore, Maryland, USA
15
Crema C, Verde F, Tiraboschi P, Marra C, Arighi A, Fostinelli S, Giuffre GM, Maschio VPD, L'Abbate F, Solca F, Poletti B, Silani V, Rotondo E, Borracci V, Vimercati R, Crepaldi V, Inguscio E, Filippi M, Caso F, Rosati AM, Quaranta D, Binetti G, Pagnoni I, Morreale M, Burgio F, Maserati MS, Capellari S, Pardini M, Girtler N, Piras F, Piras F, Lalli S, Perdixi E, Lombardi G, Tella SD, Costa A, Capelli M, Fundaro C, Manera M, Muscio C, Pellencin E, Lodi R, Tagliavini F, Redolfi A. Medical Information Extraction With NLP-Powered QABots: A Real-World Scenario. IEEE J Biomed Health Inform 2024; 28:6906-6917. [PMID: 39190519 DOI: 10.1109/jbhi.2024.3450118]
Abstract
The advent of computerized medical recording systems in healthcare facilities has made data retrieval tasks easier, compared to manual recording. Nevertheless, the potential of the information contained within medical records remains largely untapped, mostly due to the time and effort required to extract data from unstructured documents. Natural Language Processing (NLP) represents a promising solution to this challenge, as it enables the use of automated text-mining tools for clinical practitioners. In this work, we present the architecture of the Virtual Dementia Institute (IVD), a consortium of sixteen Italian hospitals, using the NLP Extraction and Management Tool (NEMT), a (semi-) automated end-to-end pipeline that extracts relevant information from clinical documents and stores it in a centralized REDCap database. After defining a common Case Report Form (CRF) across the IVD hospitals, we implemented NEMT, the core of which is a Question Answering Bot (QABot) based on a modern NLP model. This QABot is fine-tuned on thousands of examples from IVD centers. Detailed descriptions of the process to define a common minimum dataset, Inter-Annotator Agreement calculated on clinical documents, and NEMT results are provided. The best QABot performance shows an Exact Match (EM) score of 78.1%, an F1-score of 84.7%, a Lenient Accuracy (LAcc) of 0.834, and a Mean Reciprocal Rank (MRR) of 0.810. The EM and F1 scores outperform the same metrics obtained with ChatGPT v3.5 (68.9% and 52.5%, respectively). With NEMT, the IVD has been able to populate a database that will contain data from thousands of Italian patients, all screened with the same procedure. NEMT represents an efficient tool that paves the way for medical information extraction and exploitation for new research studies.
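Exact Match and token-level F1 are the standard extractive question-answering scores. A minimal sketch of how they are typically computed, with toy strings (this is not the IVD/NEMT evaluation code):

```python
import re
from collections import Counter

def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation, and split into tokens."""
    return re.findall(r"\w+", text.lower())

def exact_match(prediction: str, gold: str) -> int:
    return int(normalize(prediction) == normalize(gold))

def token_f1(prediction: str, gold: str) -> float:
    pred, ref = normalize(prediction), normalize(gold)
    common = Counter(pred) & Counter(ref)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(exact_match("MMSE 24/30", "MMSE 24 / 30"), token_f1("MMSE score 24/30", "MMSE 24/30"))
```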
16
Ng Yin Ling C, Zhu X, Ang M. Artificial intelligence in myopia in children: current trends and future directions. Curr Opin Ophthalmol 2024; 35:463-471. [PMID: 39259652 DOI: 10.1097/icu.0000000000001086]
Abstract
PURPOSE OF REVIEW Myopia is one of the major causes of visual impairment globally, with myopia and its complications thus placing a heavy healthcare and economic burden. With most cases of myopia developing during childhood, interventions to slow myopia progression are most effective when implemented early. To address this public health challenge, artificial intelligence has emerged as a potential solution in childhood myopia management. RECENT FINDINGS The bulk of artificial intelligence research in childhood myopia was previously focused on traditional machine learning models for the identification of children at high risk for myopia progression. Recently, there has been a surge of literature with larger datasets, more computational power, and more complex computation models, leveraging artificial intelligence for novel approaches including large-scale myopia screening using big data, multimodal data, and advancing imaging technology for myopia progression, and deep learning models for precision treatment. SUMMARY Artificial intelligence holds significant promise in transforming the field of childhood myopia management. Novel artificial intelligence modalities including automated machine learning, large language models, and federated learning could play an important role in the future by delivering precision medicine, improving health literacy, and allowing the preservation of data privacy. However, along with these advancements in technology come practical challenges including regulation and clinical integration.
Affiliation(s)
- Xiangjia Zhu
- Eye Institute and Department of Ophthalmology, Eye & ENT Hospital, Fudan University
- NHC Key Laboratory of Myopia and Related Eye Diseases; Key Laboratory of Myopia and Related Eye Diseases, Chinese Academy of Medical Sciences
- Shanghai Key Laboratory of Visual Impairment and Restoration, Shanghai, China
- Marcus Ang
- Singapore National Eye Centre, Singapore
- Singapore Eye Research Institute
- Department of Ophthalmology and Visual Sciences, Duke-NUS Medical School, Singapore
17
Mankowski MA, Jaffe IS, Xu J, Bae S, Oermann EK, Aphinyanaphongs Y, McAdams-DeMarco MA, Lonze BE, Orandi BJ, Stewart D, Levan M, Massie A, Gentry S, Segev DL. ChatGPT Solving Complex Kidney Transplant Cases: A Comparative Study With Human Respondents. Clin Transplant 2024; 38:e15466. [PMID: 39329220 PMCID: PMC11441623 DOI: 10.1111/ctr.15466]
Abstract
INTRODUCTION ChatGPT has shown the ability to answer clinical questions in general medicine but may be constrained by the specialized nature of kidney transplantation. Thus, it is important to explore how ChatGPT can be used in kidney transplantation and how its knowledge compares to human respondents. METHODS We prompted ChatGPT versions 3.5, 4, and 4 Visual (4 V) with 12 multiple-choice questions related to six kidney transplant cases from 2013 to 2015 American Society of Nephrology (ASN) fellowship program quizzes. We compared the performance of ChatGPT with US nephrology fellowship program directors, nephrology fellows, and the audience of the ASN's annual Kidney Week meeting. RESULTS Overall, ChatGPT 4 V correctly answered 10 out of 12 questions, showing a performance level comparable to nephrology fellows (group majority correctly answered 9 of 12 questions) and training program directors (11 of 12). This surpassed ChatGPT 4 (7 of 12 correct) and 3.5 (5 of 12). All three ChatGPT versions failed to correctly answer questions where the consensus among human respondents was low. CONCLUSION Each iterative version of ChatGPT performed better than the prior version, with version 4 V achieving performance on par with nephrology fellows and training program directors. While it shows promise in understanding and answering kidney transplantation questions, ChatGPT should be seen as a complementary tool to human expertise rather than a replacement.
Affiliation(s)
- Michal A Mankowski
- Department of Surgery, NYU Grossman School of Medicine, New York, New York, USA
- Ian S Jaffe
- Department of Surgery, NYU Grossman School of Medicine, New York, New York, USA
- Jingzhi Xu
- Department of Surgery, NYU Grossman School of Medicine, New York, New York, USA
- Sunjae Bae
- Department of Surgery, NYU Grossman School of Medicine, New York, New York, USA
- Department of Population Health, NYU Grossman School of Medicine, New York, New York, USA
- Eric K Oermann
- Department of Neurosurgery, NYU Grossman School of Medicine, New York, New York, USA
- Yindalon Aphinyanaphongs
- Department of Population Health, NYU Grossman School of Medicine, New York, New York, USA
- Department of Medicine, NYU Grossman School of Medicine, New York, New York, USA
- Mara A McAdams-DeMarco
- Department of Surgery, NYU Grossman School of Medicine, New York, New York, USA
- Department of Population Health, NYU Grossman School of Medicine, New York, New York, USA
- Bonnie E Lonze
- Department of Surgery, NYU Grossman School of Medicine, New York, New York, USA
- Babak J Orandi
- Department of Surgery, NYU Grossman School of Medicine, New York, New York, USA
- Department of Medicine, NYU Grossman School of Medicine, New York, New York, USA
- Darren Stewart
- Department of Surgery, NYU Grossman School of Medicine, New York, New York, USA
- Macey Levan
- Department of Surgery, NYU Grossman School of Medicine, New York, New York, USA
- Department of Population Health, NYU Grossman School of Medicine, New York, New York, USA
- Allan Massie
- Department of Surgery, NYU Grossman School of Medicine, New York, New York, USA
- Department of Population Health, NYU Grossman School of Medicine, New York, New York, USA
- Sommer Gentry
- Department of Surgery, NYU Grossman School of Medicine, New York, New York, USA
- Department of Population Health, NYU Grossman School of Medicine, New York, New York, USA
- Dorry L Segev
- Department of Surgery, NYU Grossman School of Medicine, New York, New York, USA
- Department of Population Health, NYU Grossman School of Medicine, New York, New York, USA
Collapse
|
18
|
Parente DJ. Generative Artificial Intelligence and Large Language Models in Primary Care Medical Education. Fam Med 2024; 56:534-540. [PMID: 39207784 PMCID: PMC11493110 DOI: 10.22454/fammed.2024.775525] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 09/04/2024]
Abstract
Generative artificial intelligence and large language models are the continuation of a technological revolution in information processing that began with the invention of the transistor in 1947. These technologies, driven by transformer architectures for artificial neural networks, are poised to broadly influence society. It is already apparent that these technologies will be adapted to drive innovation in education. Medical education is a high-risk activity: Information that is incorrectly taught to a student may go unrecognized for years until a relevant clinical situation appears in which that error can lead to patient harm. In this article, I discuss the principal limitations to the use of generative artificial intelligence in medical education-hallucination, bias, cost, and security-and suggest some approaches to confronting these problems. Additionally, I identify the potential applications of generative artificial intelligence to medical education, including personalized instruction, simulation, feedback, evaluation, augmentation of qualitative research, and performance of critical assessment of the existing scientific literature.
Affiliation(s)
- Daniel J. Parente: Department of Family Medicine and Community Health, University of Kansas Medical Center, Kansas City, KS

19
Ahaley SS, Pandey A, Juneja SK, Gupta TS, Vijayakumar S. ChatGPT in medical writing: A game-changer or a gimmick? Perspect Clin Res 2024; 15:165-171. [PMID: 39583920 PMCID: PMC11584153 DOI: 10.4103/picr.picr_167_23] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/09/2023] [Revised: 08/22/2023] [Accepted: 09/06/2023] [Indexed: 11/26/2024] Open
Abstract
OpenAI's ChatGPT (Generative Pre-trained Transformer) is a chatbot that answers questions and performs writing tasks in a conversational tone. Within months of its release, multiple sectors began contemplating the varied applications of this chatbot, including medicine, education, and research, all of which are involved in medical communication and scientific publishing. Medical writers and academics use several artificial intelligence (AI) tools and software for research, literature surveys, data analyses, referencing, and writing. There are benefits to using different AI tools in medical writing. However, using chatbots for medical communications poses some major concerns, such as potential inaccuracies, data bias, security, and ethical issues. Misconceptions about these tools also limit their use. Moreover, ChatGPT can be challenging if used incorrectly or for irrelevant tasks. If used appropriately, ChatGPT will not only upgrade the knowledge of the medical writer but also save time and energy that could be directed toward more creative and analytical areas requiring expert skill sets. This review introduces chatbots, outlines the progress in ChatGPT research, elaborates on the potential uses of ChatGPT in medical communications along with its challenges and limitations, and proposes future research perspectives. It aims to provide guidance for doctors, researchers, and medical writers on the uses of ChatGPT in medical communications.
Affiliation(s)
- Shital Sarah Ahaley, Ankita Pandey, Simran Kaur Juneja, Tanvi Suhane Gupta, and Sujatha Vijayakumar: Hashtag Medical Writing Solutions Private Limited, Chennai, Tamil Nadu, India

20
Jin HK, Lee HE, Kim E. Performance of ChatGPT-3.5 and GPT-4 in national licensing examinations for medicine, pharmacy, dentistry, and nursing: a systematic review and meta-analysis. BMC MEDICAL EDUCATION 2024; 24:1013. [PMID: 39285377 PMCID: PMC11406751 DOI: 10.1186/s12909-024-05944-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 03/29/2024] [Accepted: 08/22/2024] [Indexed: 09/19/2024]
Abstract
BACKGROUND ChatGPT, a recently developed artificial intelligence (AI) chatbot, has demonstrated improved performance in examinations in the medical field. However, thus far, an overall evaluation of the potential of ChatGPT models (ChatGPT-3.5 and GPT-4) across a variety of national health licensing examinations is lacking. This study aimed to provide a comprehensive assessment of the ChatGPT models' performance in national licensing examinations for medicine, pharmacy, dentistry, and nursing through a meta-analysis. METHODS Following the PRISMA protocol, full-text articles from MEDLINE/PubMed, EMBASE, ERIC, Cochrane Library, Web of Science, and key journals were reviewed from the time of ChatGPT's introduction to February 27, 2024. Studies were eligible if they evaluated the performance of a ChatGPT model (ChatGPT-3.5 or GPT-4); related to national licensing examinations in the fields of medicine, pharmacy, dentistry, or nursing; involved multiple-choice questions; and provided data that enabled the calculation of effect size. Two reviewers independently completed data extraction, coding, and quality assessment. The JBI Critical Appraisal Tools were used to assess the quality of the selected articles. Overall effect size and 95% confidence intervals [CIs] were calculated using a random-effects model. RESULTS A total of 23 studies were considered for this review, which evaluated the accuracy of four types of national licensing examinations. The selected articles were in the fields of medicine (n = 17), pharmacy (n = 3), nursing (n = 2), and dentistry (n = 1). They reported varying accuracy levels, ranging from 36% to 77% for ChatGPT-3.5 and from 64.4% to 100% for GPT-4. The overall effect size for the percentage of accuracy was 70.1% (95% CI, 65-74.8%), which was statistically significant (p < 0.001). Subgroup analyses revealed that GPT-4 demonstrated significantly higher accuracy in providing correct responses than its earlier version, ChatGPT-3.5. Additionally, in the context of health licensing examinations, the ChatGPT models exhibited greater proficiency in the following order: pharmacy, medicine, dentistry, and nursing. However, the lack of a broader set of questions, including open-ended and scenario-based questions, and significant heterogeneity were limitations of this meta-analysis. CONCLUSIONS This study sheds light on the accuracy of ChatGPT models in four national health licensing examinations across various countries and provides a practical basis and theoretical support for future research. Further studies are needed to explore their utilization in medical and health education by including a broader and more diverse range of questions, along with more advanced versions of AI chatbots.
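For readers unfamiliar with the pooling step described above, the sketch below illustrates a DerSimonian-Laird random-effects pooling of accuracy proportions in Python. The per-study proportions and sample sizes are invented for illustration and are not the data analyzed in this meta-analysis; the published effect-size model may differ in transformation and weighting details.

```python
import numpy as np

# Hypothetical per-study accuracy data (correct answers / total questions);
# these numbers are illustrative only, not taken from the meta-analysis.
correct = np.array([180, 64, 150, 90])
total = np.array([250, 100, 200, 150])

p = correct / total
v = p * (1 - p) / total          # within-study variance of a raw proportion

# DerSimonian-Laird estimate of between-study variance (tau^2).
w_fixed = 1 / v
p_fixed = np.sum(w_fixed * p) / np.sum(w_fixed)
q = np.sum(w_fixed * (p - p_fixed) ** 2)
c = np.sum(w_fixed) - np.sum(w_fixed ** 2) / np.sum(w_fixed)
tau2 = max(0.0, (q - (len(p) - 1)) / c)

# Random-effects pooled proportion and 95% confidence interval.
w_random = 1 / (v + tau2)
p_pooled = np.sum(w_random * p) / np.sum(w_random)
se = np.sqrt(1 / np.sum(w_random))
print(f"pooled accuracy = {p_pooled:.3f}, "
      f"95% CI = ({p_pooled - 1.96 * se:.3f}, {p_pooled + 1.96 * se:.3f})")
```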
Affiliation(s)
- Hye Kyung Jin: Research Institute of Pharmaceutical Sciences, College of Pharmacy, Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul, 06974, South Korea; Data Science, Evidence-Based and Clinical Research Laboratory, Department of Health, Social, and Clinical Pharmacy, College of Pharmacy, Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul, 06974, South Korea
- Ha Eun Lee: Data Science, Evidence-Based and Clinical Research Laboratory, Department of Health, Social, and Clinical Pharmacy, College of Pharmacy, Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul, 06974, South Korea
- EunYoung Kim: Research Institute of Pharmaceutical Sciences, College of Pharmacy, Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul, 06974, South Korea; Data Science, Evidence-Based and Clinical Research Laboratory, Department of Health, Social, and Clinical Pharmacy, College of Pharmacy, Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul, 06974, South Korea; Division of Licensing of Medicines and Regulatory Science, The Graduate School of Pharmaceutical Management, and Regulatory Science Policy, The Graduate School of Pharmaceutical Regulatory Sciences, Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul, 06974, South Korea

21
Jung H, Oh J, Stephenson KAJ, Joe AW, Mammo ZN. Prompt engineering with ChatGPT3.5 and GPT4 to improve patient education on retinal diseases. CANADIAN JOURNAL OF OPHTHALMOLOGY 2024:S0008-4182(24)00258-8. [PMID: 39245293 DOI: 10.1016/j.jcjo.2024.08.010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Received: 02/12/2024] [Revised: 04/24/2024] [Accepted: 08/18/2024] [Indexed: 09/10/2024]
Abstract
OBJECTIVE To assess the effect of prompt engineering on the accuracy, comprehensiveness, readability, and empathy of large language model (LLM)-generated responses to patient questions regarding retinal disease. DESIGN Prospective qualitative study. PARTICIPANTS Retina specialists, ChatGPT3.5, and GPT4. METHODS Twenty common patient questions regarding 5 retinal conditions were input to ChatGPT3.5 and GPT4 either as stand-alone questions, preceded by an optimized prompt (prompt A), or preceded by prompt A with specified limits on length and grade reading level (prompt B). Accuracy and comprehensiveness were graded by 3 retina specialists on a Likert scale from 1 to 5 (1: very poor to 5: very good). Readability of responses was assessed using Readable.com, an online readability tool. RESULTS There were no significant differences between ChatGPT3.5 and GPT4 across any of the metrics tested. Median accuracy of responses to stand-alone, prompt A, and prompt B questions was 5.0, 5.0, and 4.0, respectively. Median comprehensiveness of responses to stand-alone, prompt A, and prompt B questions was 5.0, 5.0, and 4.0, respectively. The use of prompt B was associated with lower accuracy and comprehensiveness than responses to stand-alone or prompt A questions (p < 0.001). The average grade reading level of responses across both LLMs was 13.45, 11.5, and 10.3 for stand-alone, prompt A, and prompt B questions, respectively (p < 0.001). CONCLUSIONS Prompt engineering can significantly improve the readability of LLM-generated responses, although at the cost of reduced accuracy and comprehensiveness. Further study is needed to understand the utility and bioethical implications of LLMs as a patient educational resource.
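The grade-reading levels quoted above were produced with a commercial readability tool. As a rough illustration only, the Flesch-Kincaid Grade Level can be approximated as below; the syllable counter is a crude vowel-group heuristic and the sample sentence is invented, so scores will not exactly match a dedicated tool.

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count groups of consecutive vowels. Real readability
    # tools use dictionaries and exception rules, so treat this as approximate.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    # Standard Flesch-Kincaid Grade Level formula.
    return (0.39 * (len(words) / len(sentences))
            + 11.8 * (syllables / len(words))
            - 15.59)

sample = ("Age-related macular degeneration affects central vision. "
          "Regular monitoring with an Amsler grid can help detect changes early.")
print(round(flesch_kincaid_grade(sample), 1))
```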
Affiliation(s)
- Hoyoung Jung: Faculty of Medicine, University of British Columbia, Vancouver, BC, Canada
- Jean Oh: Faculty of Medicine, University of British Columbia, Vancouver, BC, Canada
- Kirk A J Stephenson: Department of Ophthalmology and Visual Sciences, University of British Columbia, Vancouver, BC, Canada
- Aaron W Joe: Department of Ophthalmology and Visual Sciences, University of British Columbia, Vancouver, BC, Canada
- Zaid N Mammo: Department of Ophthalmology and Visual Sciences, University of British Columbia, Vancouver, BC, Canada

22
Yiu E, Kosoy E, Gopnik A. Transmission Versus Truth, Imitation Versus Innovation: What Children Can Do That Large Language and Language-and-Vision Models Cannot (Yet). PERSPECTIVES ON PSYCHOLOGICAL SCIENCE 2024; 19:874-883. [PMID: 37883796 PMCID: PMC11373165 DOI: 10.1177/17456916231201401] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/28/2023]
Abstract
Much discussion about large language models and language-and-vision models has focused on whether these models are intelligent agents. We present an alternative perspective. First, we argue that these artificial intelligence (AI) models are cultural technologies that enhance cultural transmission and are efficient and powerful imitation engines. Second, we explore what AI models can tell us about imitation and innovation by testing whether they can be used to discover new tools and novel causal structures and contrasting their responses with those of human children. Our work serves as a first step in determining which particular representations and competences, as well as which kinds of knowledge or skill, can be derived from particular learning techniques and data. In particular, we explore which kinds of cognitive capacities can be enabled by statistical analysis of large-scale linguistic data. Critically, our findings suggest that machines may need more than large-scale language and image data to allow the kinds of innovation that a small child can produce.
Affiliation(s)
- Eunice Yiu, Eliza Kosoy, and Alison Gopnik: Department of Psychology, University of California, Berkeley

23
Kayastha A, Lakshmanan K, Valentine MJ, Nguyen A, Dholakia K, Wang D. Lumbar disc herniation with radiculopathy: a comparison of NASS guidelines and ChatGPT. NORTH AMERICAN SPINE SOCIETY JOURNAL 2024; 19:100333. [PMID: 39040948 PMCID: PMC11261487 DOI: 10.1016/j.xnsj.2024.100333] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 04/16/2024] [Revised: 05/25/2024] [Accepted: 05/27/2024] [Indexed: 07/24/2024]
Abstract
Background ChatGPT is an advanced language AI able to generate responses to clinical questions regarding lumbar disc herniation with radiculopathy. Artificial intelligence (AI) tools are increasingly being considered to assist clinicians in decision-making. This study compared ChatGPT-3.5 and ChatGPT-4.0 responses to established NASS clinical guidelines and evaluated concordance. Methods ChatGPT-3.5 and ChatGPT-4.0 were prompted with 15 questions from the 2012 NASS Clinical Guidelines for the diagnosis and treatment of lumbar disc herniation with radiculopathy. Clinical questions organized into categories were directly entered as unmodified queries into ChatGPT. Language output was assessed by two independent authors on September 26, 2023 based on operationally defined parameters of accuracy, over-conclusiveness, supplementary information, and incompleteness. ChatGPT-3.5 and ChatGPT-4.0 performance was compared via chi-square analyses. Results Among the 15 responses produced by ChatGPT-3.5, 7 (47%) were accurate, 7 (47%) were over-conclusive, 15 (100%) were supplementary, and 6 (40%) were incomplete. For ChatGPT-4.0, 10 (67%) were accurate, 5 (33%) were over-conclusive, 10 (67%) were supplementary, and 6 (40%) were incomplete. There was a statistically significant difference in supplementary information (100% vs. 67%; p=.014) between ChatGPT-3.5 and ChatGPT-4.0. Accuracy (47% vs. 67%; p=.269), over-conclusiveness (47% vs. 33%; p=.456), and incompleteness (40% vs. 40%; p=1.000) did not show significant differences between ChatGPT-3.5 and ChatGPT-4.0. ChatGPT-3.5 and ChatGPT-4.0 both yielded 100% accuracy for the definition and history and physical examination categories. Diagnostic testing yielded 0% accuracy for ChatGPT-3.5 and 100% accuracy for ChatGPT-4.0. Nonsurgical interventions had 50% accuracy for ChatGPT-3.5 and 63% accuracy for ChatGPT-4.0. Surgical interventions resulted in 0% accuracy for ChatGPT-3.5 and 33% accuracy for ChatGPT-4.0. Conclusions ChatGPT-4.0 provided less supplementary information and overall higher accuracy across question categories than ChatGPT-3.5. ChatGPT showed reasonable concordance with NASS guidelines, but clinicians should be cautious about using ChatGPT in its current state, as it fails to safeguard against misinformation.
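As a sketch of the chi-square comparison reported above, the snippet below rebuilds a 2x2 contingency table from the stated percentages (supplementary information present in 15/15 ChatGPT-3.5 vs. 10/15 ChatGPT-4.0 responses). The counts are reconstructed, not taken from the paper's dataset, and the exact p value depends on whether a continuity correction is applied; with such small cells a Fisher's exact test is also shown as a sensitivity check.

```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

# Rows: ChatGPT-3.5, ChatGPT-4.0; columns: supplementary yes, supplementary no.
table = np.array([[15, 0],
                  [10, 5]])

# Pearson chi-square without Yates correction (correction=True is scipy's default for 2x2).
chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(f"chi-square = {chi2:.2f}, p = {p:.3f}")

# Fisher's exact test as a small-sample sensitivity check.
odds_ratio, p_exact = fisher_exact(table)
print(f"Fisher exact p = {p_exact:.3f}")
```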
Affiliation(s)
- Anh Nguyen: Kansas City University, Kansas City, MO, United States
- Daniel Wang: MedStar Health, Baltimore, MD, United States; Georgetown University Medical Center, Washington, DC, United States

24
Kowalewski KF, Rodler S. [Large language models in science]. UROLOGIE (HEIDELBERG, GERMANY) 2024; 63:860-866. [PMID: 39048694 DOI: 10.1007/s00120-024-02396-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Accepted: 06/26/2024] [Indexed: 07/27/2024]
Abstract
OBJECTIVE Large language models (LLMs) are gaining popularity due to their ability to communicate in a human-like manner. Their potential for science, including urology, is increasingly recognized. However, unresolved concerns regarding transparency, accountability, and the accuracy of LLM results still exist. RESEARCH QUESTION This review examines the ethical, technical, and practical challenges as well as the potential applications of LLMs in urology and science. MATERIALS AND METHODS A selective literature review was conducted to analyze current findings and developments in the field of LLMs. The review considered studies on technical aspects, ethical considerations, and practical applications in research and practice. RESULTS LLMs, such as GPT from OpenAI and Gemini from Google, show great potential for processing and analyzing text data. Applications in urology include creating patient information and supporting administrative tasks. However, for purely clinical and scientific questions, the methods do not yet seem mature. Currently, concerns about ethical issues and the accuracy of results persist. CONCLUSION LLMs have the potential to support research and practice through efficient data processing and information provision. Despite their advantages, ethical concerns and technical challenges must be addressed to ensure responsible and trustworthy use. Increased implementation could reduce the workload of urologists and improve communication with patients.
Affiliation(s)
- Karl-Friedrich Kowalewski: Klinik für Urologie und Urochirurgie, Universitätsmedizin Mannheim, Universität Heidelberg, Theodor-Kutzer-Ufer 1-3, 68167 Mannheim, Germany
- Severin Rodler: Klinik für Urologie, Universitätsklinikum Schleswig-Holstein, Campus Kiel, Arnold-Heller-Straße 3, 24105 Kiel, Germany

25
Salvagno M, Cassai AD, Zorzi S, Zaccarelli M, Pasetto M, Sterchele ED, Chumachenko D, Gerli AG, Azamfirei R, Taccone FS. The state of artificial intelligence in medical research: A survey of corresponding authors from top medical journals. PLoS One 2024; 19:e0309208. [PMID: 39178224 PMCID: PMC11343420 DOI: 10.1371/journal.pone.0309208] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/22/2023] [Accepted: 08/08/2024] [Indexed: 08/25/2024] Open
Abstract
Natural Language Processing (NLP) is a subset of artificial intelligence that enables machines to understand and respond to human language through Large Language Models (LLMs). These models have diverse applications in fields such as medical research, scientific writing, and publishing, but concerns such as hallucination, ethical issues, bias, and cybersecurity need to be addressed. To gauge the scientific community's understanding of and perspective on the role of Artificial Intelligence (AI) in research and authorship, a survey was designed for corresponding authors in top medical journals. An online survey was conducted from July 13th, 2023, to September 1st, 2023, using the SurveyMonkey web instrument, and the population of interest comprised corresponding authors who published in 2022 in the 15 highest-impact medical journals, as ranked by the Journal Citation Report. The survey link was sent by email to all identified corresponding authors. A total of 266 authors answered, and 236 entered the final analysis. Most researchers (40.6%) reported moderate familiarity with artificial intelligence, while a minority (4.4%) had no associated knowledge. Furthermore, the vast majority (79.0%) believed that artificial intelligence will play a major role in the future of research. Of note, no correlation between academic metrics and artificial intelligence knowledge or confidence was found. The results indicate that although researchers have varying degrees of familiarity with artificial intelligence, its use in scientific research is still in its early phases. Despite lacking formal AI training, many scholars publishing in high-impact journals have started integrating such technologies into their projects for tasks including rephrasing, translation, and proofreading. Efforts should focus on providing training for their effective use, on journal editors establishing guidelines, and on creating software applications that bundle multiple integrated tools into a single platform.
Affiliation(s)
- Michele Salvagno: Department of Intensive Care, Hôpital Universitaire de Bruxelles (HUB), Brussels, Belgium
- Alessandro De Cassai: Sant'Antonio Anesthesia and Intensive Care Unit, University Hospital of Padua, Padua, Italy
- Stefano Zorzi: Department of Intensive Care, Hôpital Universitaire de Bruxelles (HUB), Brussels, Belgium
- Mario Zaccarelli: Department of Intensive Care, Hôpital Universitaire de Bruxelles (HUB), Brussels, Belgium
- Marco Pasetto: Department of Intensive Care, Hôpital Universitaire de Bruxelles (HUB), Brussels, Belgium
- Elda Diletta Sterchele: Department of Intensive Care, Hôpital Universitaire de Bruxelles (HUB), Brussels, Belgium
- Dmytro Chumachenko: Department of Mathematical Modelling and Artificial Intelligence, National Aerospace University "Kharkiv Aviation Institute", Kharkiv, Ukraine; Ubiquitous Health Technologies Lab, University of Waterloo, Waterloo, Canada
- Alberto Giovanni Gerli: Department of Clinical Sciences and Community Health, Università degli Studi di Milano, Milan, Italy
- Razvan Azamfirei: Department of Anesthesiology and Critical Care Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, United States of America
- Fabio Silvio Taccone: Department of Intensive Care, Hôpital Universitaire de Bruxelles (HUB), Brussels, Belgium

26
Palenzuela DL, Mullen JT, Phitayakorn R. AI Versus MD: Evaluating the surgical decision-making accuracy of ChatGPT-4. Surgery 2024; 176:241-245. [PMID: 38769038 DOI: 10.1016/j.surg.2024.04.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/29/2024] [Revised: 03/22/2024] [Accepted: 04/03/2024] [Indexed: 05/22/2024]
Abstract
BACKGROUND ChatGPT-4 is a large language model with possible applications to surgery education. The aim of this study was to investigate the accuracy of ChatGPT-4's surgical decision-making compared with general surgery residents and attending surgeons. METHODS Five clinical scenarios were created from actual patient data based on common general surgery diagnoses. Scripts were developed to sequentially provide clinical information and ask decision-making questions. Responses to the prompts were scored based on a standardized rubric for a total of 50 points. Each clinical scenario was run through ChatGPT-4 and sent electronically to all general surgery residents and attendings at a single institution. Scores were compared using Wilcoxon rank sum tests. RESULTS On average, ChatGPT-4 scored 39.6 points (79.2%, standard deviation ± 0.89 points). A total of five junior residents, 12 senior residents, and five attendings completed the clinical scenarios (resident response rate = 15.9%; attending response rate = 13.8%). On average, the junior residents scored a total of 33.4 (66.8%, standard deviation ± 3.29), senior residents 38.0 (76.0%, standard deviation ± 4.75), and attendings 38.8 (77.6%, standard deviation ± 5.45). ChatGPT-4 scored significantly better than junior residents (P = .009) but was not significantly different from senior residents or attendings. ChatGPT-4 was significantly better than junior residents at identifying the correct operation to perform (P = .0182) and recommending additional workup for postoperative complications (P = .012). CONCLUSION ChatGPT-4 performed better than junior residents and on par with senior residents and attendings when faced with surgical patient scenarios. Large language models, such as ChatGPT, may have the potential to be an educational resource for junior residents to develop surgical decision-making skills.
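A minimal sketch of the Wilcoxon rank sum comparison mentioned above is shown below, using invented rubric totals rather than the study's data; SciPy implements this test as the Mann-Whitney U.

```python
from scipy.stats import mannwhitneyu

# Hypothetical rubric totals out of 50 points; illustrative only, not the study's data.
junior_residents = [30, 32, 34, 35, 36]
chatgpt4_runs = [39, 39, 40, 40, 40]

# The Wilcoxon rank sum test (equivalently, the Mann-Whitney U test) compares the
# two score distributions without assuming normality.
stat, p = mannwhitneyu(chatgpt4_runs, junior_residents, alternative="two-sided")
print(f"U = {stat}, p = {p:.4f}")
```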
Affiliation(s)
- Roy Phitayakorn: Massachusetts General Hospital, Boston, MA. https://www.twitter.com/RoyPhit

27
Wachter S, Mittelstadt B, Russell C. Do large language models have a legal duty to tell the truth? ROYAL SOCIETY OPEN SCIENCE 2024; 11:240197. [PMID: 39113763 PMCID: PMC11303832 DOI: 10.1098/rsos.240197] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 02/02/2024] [Accepted: 05/17/2024] [Indexed: 08/10/2024]
Abstract
Careless speech is a new type of harm created by large language models (LLM) that poses cumulative, long-term risks to science, education and shared social truth in democratic societies. LLMs produce responses that are plausible, helpful and confident, but that contain factual inaccuracies, misleading references and biased information. These subtle mistruths are poised to cumulatively degrade and homogenize knowledge over time. This article examines the existence and feasibility of a legal duty for LLM providers to create models that 'tell the truth'. We argue that LLM providers should be required to mitigate careless speech and better align their models with truth through open, democratic processes. We define careless speech against 'ground truth' in LLMs and related risks including hallucinations, misinformation and disinformation. We assess the existence of truth-related obligations in EU human rights law and the Artificial Intelligence Act, Digital Services Act, Product Liability Directive and Artificial Intelligence Liability Directive. Current frameworks contain limited, sector-specific truth duties. Drawing on duties in science and academia, education, archives and libraries, and a German case in which Google was held liable for defamation caused by autocomplete, we propose a pathway to create a legal truth duty for providers of narrow- and general-purpose LLMs.
Affiliation(s)
- Sandra Wachter, Brent Mittelstadt, and Chris Russell: Oxford Internet Institute, University of Oxford, 1 St Giles, Oxford OX1 3JS, UK

28
Teasdale A, Mills L, Costello R. Artificial Intelligence-Powered Surgical Consent: Patient Insights. Cureus 2024; 16:e68134. [PMID: 39347259 PMCID: PMC11438496 DOI: 10.7759/cureus.68134] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 08/28/2024] [Indexed: 10/01/2024] Open
Abstract
Introduction The integration of artificial intelligence (AI) in healthcare has revolutionized patient interactions and service delivery. AI's role extends from supporting clinical diagnostics and enhancing operational efficiencies to potentially improving informed consent processes in surgical settings. This study investigates the application of AI, particularly large language models like OpenAI's ChatGPT, in facilitating surgical consent, focusing on patient understanding, satisfaction, and trust. Methods We employed a mixed-methods approach involving 86 participants, including laypeople and medical staff, who engaged in a simulated AI-driven consent process for a tonsillectomy. Participants interacted with ChatGPT-4, which provided detailed procedure explanations, risks, and benefits. Post-interaction, participants completed a survey assessing their experience through quantitative and qualitative measures. Results Participants had a cautiously optimistic response to AI in the surgical consent process. Notably, 71% felt adequately informed, 86% found the information clear, and 71% felt they could make informed decisions. Overall, 71% were satisfied, 57% felt respected and confident, and 57% would recommend it, indicating areas needing refinement. However, concerns about data privacy and the lack of personal interaction were significant, with only 42% reassured about the security of their data. The standardization of information provided by AI was appreciated for potentially reducing human error, but the absence of empathetic human interaction was noted as a drawback. Discussion While AI shows promise in enhancing the consistency and comprehensiveness of information delivered during the consent process, significant challenges remain. These include addressing data privacy concerns and bridging the gap in personal interaction. The potential for AI to misinform due to system "hallucinations" or inherent biases also needs consideration. Future research should focus on refining AI interactions to support more nuanced and empathetic engagements, ensuring that AI supplements rather than replaces human elements in healthcare. Conclusion The integration of AI into surgical consent processes could standardize and potentially improve the delivery of information but must be balanced with efforts to maintain the critical human elements of care. Collaborative efforts between developers, clinicians, and ethicists are essential to optimize AI use, ensuring it complements the traditional consent process while enhancing patient satisfaction and trust.
Affiliation(s)
- Laura Mills: General Practice, Dyfed Road Surgery, Swansea, GBR

29
Brant-Zawadzki G, Klapthor B, Ryba C, Youngquist DC, Burton B, Palatinus H, Youngquist ST. The Performance of ChatGPT-4 and Gemini Ultra 1.0 for Quality Assurance Review in Emergency Medical Services Chest Pain Calls. PREHOSP EMERG CARE 2024:1-8. [PMID: 38976859 DOI: 10.1080/10903127.2024.2376757] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/20/2024] [Accepted: 06/26/2024] [Indexed: 07/10/2024]
Abstract
OBJECTIVES This study assesses the feasibility, inter-rater reliability, and accuracy of using OpenAI's ChatGPT-4 and Google's Gemini Ultra large language models (LLMs) for Emergency Medical Services (EMS) quality assurance. The implementation of these LLMs for EMS quality assurance has the potential to significantly reduce the workload on medical directors and quality assurance staff by automating aspects of the processing and review of patient care reports. This offers the potential for more efficient and accurate identification of areas requiring improvement, thereby potentially enhancing patient care outcomes. METHODS Two expert human reviewers, ChatGPT-4, and Gemini Ultra assessed and rated 150 consecutively sampled and anonymized prehospital records from 2 large urban EMS agencies for adherence to 2020 National Association of State EMS metrics for cardiac care. We evaluated the accuracy of scoring, inter-rater reliability, and review efficiency. The inter-rater reliability for the dichotomous outcome of each EMS metric was measured using the kappa statistic. RESULTS Human reviewers showed high inter-rater reliability, with 91.2% agreement and a kappa coefficient of 0.782 (0.654-0.910). ChatGPT-4 achieved substantial agreement with human reviewers in EKG documentation and aspirin administration (76.2% agreement; kappa coefficient 0.401 [0.334-0.468]), but performance varied across other metrics. Gemini Ultra's evaluation was discontinued due to poor performance. No significant differences were observed in median review times: 1:28 min (IQR 1:12-1:51) per human chart review, 1:24 min (IQR 1:09-1:53) per ChatGPT-4 chart review (p = 0.46), and 1:50 min (IQR 1:10-3:34) per Gemini Ultra review (p = 0.06). CONCLUSIONS Large language models demonstrate potential in supporting quality assurance by effectively and objectively extracting data elements. However, their accuracy in interpreting non-standardized and time-sensitive details remains inferior to that of human evaluators. Our findings suggest that current LLMs may best offer supplemental support to human review processes, but their current value remains limited. Enhancements in LLM training and integration are recommended for improved and more reliable performance in quality assurance processes.
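For readers unfamiliar with the kappa statistic used above, the following sketch computes Cohen's kappa for two raters on invented dichotomous adherence ratings; it is not the study's data or code.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical dichotomous ratings (1 = metric met, 0 = not met) for ten charts;
# illustrative only, not the study's dataset.
reviewer_a = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
reviewer_b = [1, 1, 0, 1, 0, 0, 1, 1, 1, 0]

# Cohen's kappa corrects raw percent agreement for agreement expected by chance.
kappa = cohen_kappa_score(reviewer_a, reviewer_b)
agreement = sum(a == b for a, b in zip(reviewer_a, reviewer_b)) / len(reviewer_a)
print(f"raw agreement = {agreement:.0%}, kappa = {kappa:.3f}")
```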
Affiliation(s)
- Graham Brant-Zawadzki: Department of Emergency Medicine, University of Utah, Salt Lake City, Utah; Unified Fire Authority, Salt Lake City, Utah
- Brent Klapthor: Department of Emergency Medicine, University of Utah, Salt Lake City, Utah
- Chris Ryba: Department of Emergency Medicine, University of Utah, Salt Lake City, Utah; Salt Lake City Fire Department, Salt Lake City, Utah
- Drew C Youngquist: Department of Emergency Medicine, University of Utah, Salt Lake City, Utah
- Helen Palatinus: Department of Emergency Medicine, University of Utah, Salt Lake City, Utah
- Scott T Youngquist: Department of Emergency Medicine, University of Utah, Salt Lake City, Utah; Salt Lake City Fire Department, Salt Lake City, Utah

30
Stadler RD, Sudah SY, Moverman MA, Denard PJ, Duralde XA, Garrigues GE, Klifto CS, Levy JC, Namdari S, Sanchez-Sotelo J, Menendez ME. Identification of ChatGPT-Generated Abstracts Within Shoulder and Elbow Surgery Poses a Challenge for Reviewers. Arthroscopy 2024:S0749-8063(24)00495-X. [PMID: 38992513 DOI: 10.1016/j.arthro.2024.06.045] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 02/16/2024] [Revised: 06/21/2024] [Accepted: 06/27/2024] [Indexed: 07/13/2024]
Abstract
PURPOSE To evaluate the extent to which experienced reviewers can accurately discern between artificial intelligence (AI)-generated and original research abstracts published in the field of shoulder and elbow surgery and compare this with the performance of an AI detection tool. METHODS Twenty-five shoulder- and elbow-related articles published in high-impact journals in 2023 were randomly selected. ChatGPT was prompted with only the abstract title to create an AI-generated version of each abstract. The resulting 50 abstracts were randomly distributed to and evaluated by 8 blinded peer reviewers with at least 5 years of experience. Reviewers were tasked with distinguishing between original and AI-generated text. A Likert scale assessed reviewer confidence for each interpretation, and the primary reason guiding assessment of generated text was collected. AI output detector (0%-100%) and plagiarism (0%-100%) scores were evaluated using GPTZero. RESULTS Reviewers correctly identified 62% of AI-generated abstracts and misclassified 38% of original abstracts as being AI generated. GPTZero reported a significantly higher probability of AI output among generated abstracts (median, 56%; interquartile range [IQR], 51%-77%) compared with original abstracts (median, 10%; IQR, 4%-37%; P < .01). Generated abstracts scored significantly lower on the plagiarism detector (median, 7%; IQR, 5%-14%) relative to original abstracts (median, 82%; IQR, 72%-92%; P < .01). Correct identification of AI-generated abstracts was predominately attributed to the presence of unrealistic data/values. The primary reason for misidentifying original abstracts as AI was attributed to writing style. CONCLUSIONS Experienced reviewers faced difficulties in distinguishing between human and AI-generated research content within shoulder and elbow surgery. The presence of unrealistic data facilitated correct identification of AI abstracts, whereas misidentification of original abstracts was often ascribed to writing style. CLINICAL RELEVANCE With rapidly increasing AI advancements, it is paramount that ethical standards of scientific reporting are upheld. It is therefore helpful to understand the ability of reviewers to identify AI-generated content.
Affiliation(s)
- Ryan D Stadler: Rutgers Robert Wood Johnson Medical School, New Brunswick, New Jersey, U.S.A.
- Suleiman Y Sudah: Department of Orthopaedic Surgery, Monmouth Medical Center, Monmouth, New Jersey, U.S.A.
- Michael A Moverman: Department of Orthopaedics, University of Utah School of Medicine, Salt Lake City, Utah, U.S.A.
- Grant E Garrigues: Midwest Orthopaedics at Rush University Medical Center, Chicago, Illinois, U.S.A.
- Christopher S Klifto: Department of Orthopaedic Surgery, Duke University School of Medicine, Durham, North Carolina, U.S.A.
- Jonathan C Levy: Levy Shoulder Center at Paley Orthopedic & Spine Institute, Boca Raton, Florida, U.S.A.
- Surena Namdari: Rothman Orthopaedic Institute at Thomas Jefferson University Hospitals, Philadelphia, Pennsylvania, U.S.A.
- Mariano E Menendez: Department of Orthopaedics, University of California Davis, Sacramento, California, U.S.A.

31
Nakaura T, Ito R, Ueda D, Nozaki T, Fushimi Y, Matsui Y, Yanagawa M, Yamada A, Tsuboyama T, Fujima N, Tatsugami F, Hirata K, Fujita S, Kamagata K, Fujioka T, Kawamura M, Naganawa S. The impact of large language models on radiology: a guide for radiologists on the latest innovations in AI. Jpn J Radiol 2024; 42:685-696. [PMID: 38551772 PMCID: PMC11217134 DOI: 10.1007/s11604-024-01552-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/15/2023] [Accepted: 02/21/2024] [Indexed: 07/03/2024]
Abstract
The advent of Deep Learning (DL) has significantly propelled the field of diagnostic radiology forward by enhancing image analysis and interpretation. The introduction of the Transformer architecture, followed by the development of Large Language Models (LLMs), has further revolutionized this domain. LLMs now possess the potential to automate and refine the radiology workflow, extending from report generation to assistance in diagnostics and patient care. The integration of multimodal technology with LLMs could potentially leapfrog these applications to unprecedented levels. However, LLMs come with unresolved challenges such as information hallucinations and biases, which can affect clinical reliability. Despite these issues, the legislative and guideline frameworks have yet to catch up with technological advancements. Radiologists must acquire a thorough understanding of these technologies to leverage LLMs' potential to the fullest while maintaining medical safety and ethics. This review aims to aid in that endeavor.
Affiliation(s)
- Takeshi Nakaura: Department of Central Radiology, Kumamoto University Hospital, Honjo 1-1-1, Kumamoto, 860-8556, Japan
- Rintaro Ito: Department of Radiology, Nagoya University Graduate School of Medicine, Nagoya, Aichi, Japan
- Daiju Ueda: Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, 1-4-3 Asahi-Machi, Abeno-ku, Osaka, 545-8585, Japan
- Taiki Nozaki: Department of Radiology, Keio University School of Medicine, Shinjuku-ku, Tokyo, Japan
- Yasutaka Fushimi: Department of Diagnostic Imaging and Nuclear Medicine, Kyoto University Graduate School of Medicine, Sakyo-ku, Kyoto, Japan
- Yusuke Matsui: Department of Radiology, Faculty of Medicine, Dentistry and Pharmaceutical Sciences, Okayama University, Kita-ku, Okayama, Japan
- Masahiro Yanagawa: Department of Radiology, Osaka University Graduate School of Medicine, Suita City, Osaka, Japan
- Akira Yamada: Department of Radiology, Shinshu University School of Medicine, Matsumoto, Nagano, Japan
- Takahiro Tsuboyama: Department of Radiology, Osaka University Graduate School of Medicine, Suita City, Osaka, Japan
- Noriyuki Fujima: Department of Diagnostic and Interventional Radiology, Hokkaido University Hospital, Sapporo, Japan
- Fuminari Tatsugami: Department of Diagnostic Radiology, Hiroshima University, Minami-ku, Hiroshima, Japan
- Kenji Hirata: Department of Diagnostic Imaging, Graduate School of Medicine, Hokkaido University, Kita-ku, Sapporo, Hokkaido, Japan
- Shohei Fujita: Department of Radiology, University of Tokyo, Bunkyo-ku, Tokyo, Japan
- Koji Kamagata: Department of Radiology, Juntendo University Graduate School of Medicine, Bunkyo-ku, Tokyo, Japan
- Tomoyuki Fujioka: Department of Diagnostic Radiology, Tokyo Medical and Dental University, Bunkyo-ku, Tokyo, Japan
- Mariko Kawamura: Department of Radiology, Nagoya University Graduate School of Medicine, Nagoya, Aichi, Japan
- Shinji Naganawa: Department of Radiology, Nagoya University Graduate School of Medicine, Nagoya, Aichi, Japan

32
Baxter SL, Longhurst CA, Millen M, Sitapati AM, Tai-Seale M. Generative artificial intelligence responses to patient messages in the electronic health record: early lessons learned. JAMIA Open 2024; 7:ooae028. [PMID: 38601475 PMCID: PMC11006101 DOI: 10.1093/jamiaopen/ooae028] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/10/2024] [Revised: 03/18/2024] [Accepted: 04/03/2024] [Indexed: 04/12/2024] Open
Abstract
Background Electronic health record (EHR)-based patient messages can contribute to burnout. Messages with a negative tone are particularly challenging to address. In this perspective, we describe our initial evaluation of large language model (LLM)-generated responses to negative EHR patient messages and contend that using LLMs to generate initial drafts may be feasible, although refinement will be needed. Methods A retrospective sample (n = 50) of negative patient messages was extracted from a health system EHR, de-identified, and inputted into an LLM (ChatGPT). Qualitative analyses were conducted to compare LLM responses to actual care team responses. Results Some LLM-generated draft responses varied from human responses in relational connection, informational content, and recommendations for next steps. Occasionally, the LLM draft responses could have potentially escalated emotionally charged conversations. Conclusion Further work is needed to optimize the use of LLMs for responding to negative patient messages in the EHR.
Affiliation(s)
- Sally L Baxter: Division of Ophthalmology Informatics and Data Science, Viterbi Family Department of Ophthalmology and Shiley Eye Institute, University of California San Diego, La Jolla, CA 92093, United States; Department of Biomedical Informatics, University of California San Diego Health, La Jolla, CA 92093, United States
- Christopher A Longhurst: Department of Biomedical Informatics, University of California San Diego Health, La Jolla, CA 92093, United States
- Marlene Millen: Department of Biomedical Informatics, University of California San Diego Health, La Jolla, CA 92093, United States; Division of Internal Medicine, Department of Medicine, University of California San Diego, La Jolla, CA 92093, United States
- Amy M Sitapati: Department of Biomedical Informatics, University of California San Diego Health, La Jolla, CA 92093, United States; Division of Internal Medicine, Department of Medicine, University of California San Diego, La Jolla, CA 92093, United States
- Ming Tai-Seale: Department of Biomedical Informatics, University of California San Diego Health, La Jolla, CA 92093, United States; Department of Family Medicine, University of California San Diego, La Jolla, CA 92093, United States

33
Terwilliger E, Bcharah G, Bcharah H, Bcharah E, Richardson C, Scheffler P. Advancing Medical Education: Performance of Generative Artificial Intelligence Models on Otolaryngology Board Preparation Questions With Image Analysis Insights. Cureus 2024; 16:e64204. [PMID: 39130878 PMCID: PMC11315421 DOI: 10.7759/cureus.64204] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 07/09/2024] [Indexed: 08/13/2024] Open
Abstract
Objective To evaluate and compare the performance of Chat Generative Pre-Trained Transformer (ChatGPT), GPT-4, and Google Bard on United States otolaryngology board-style questions to assess their potential as an adjunctive study tool and resource for students and doctors. Methods A total of 1077 text-based questions and 60 image-based questions from the otolaryngology board exam preparation tool BoardVitals were input into ChatGPT, GPT-4, and Google Bard. Each question was scored as true or false depending on whether the artificial intelligence (AI) model provided the correct response. Data analysis was performed in R Studio. Results GPT-4 scored the highest at 78.7% compared to ChatGPT and Bard at 55.3% and 61.7% (p<0.001), respectively. In terms of question difficulty, all three AI models performed best on easy questions (ChatGPT: 69.7%, GPT-4: 92.5%, and Bard: 76.4%) and worst on hard questions (ChatGPT: 42.3%, GPT-4: 61.3%, and Bard: 45.6%). Across all difficulty levels, GPT-4 did better than Bard and ChatGPT (p<0.0001). GPT-4 outperformed ChatGPT and Bard in all subspecialty sections, with significantly higher scores (p<0.05) on all sections except allergy (p>0.05). On image-based questions, GPT-4 performed better than Bard (56.7% vs 46.4%, p=0.368) and had better overall image interpretation capabilities. Conclusion This study showed that the GPT-4 model performed better than both ChatGPT and Bard on the United States otolaryngology board practice questions. Although the GPT-4 results were promising, AI should still be used with caution when being implemented in medical education or patient care settings.
Affiliation(s)
- Emma Terwilliger: Otolaryngology, Mayo Clinic Alix School of Medicine, Scottsdale, USA
- George Bcharah: Otolaryngology, Mayo Clinic Alix School of Medicine, Scottsdale, USA
- Hend Bcharah: Otolaryngology, Andrew Taylor Still University School of Osteopathic Medicine, Mesa, USA

34
Poje K, Brcic M, Kovac M, Babac MB. Effect of Private Deliberation: Deception of Large Language Models in Game Play. ENTROPY (BASEL, SWITZERLAND) 2024; 26:524. [PMID: 38920532 PMCID: PMC11203171 DOI: 10.3390/e26060524] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 05/07/2024] [Revised: 06/08/2024] [Accepted: 06/17/2024] [Indexed: 06/27/2024]
Abstract
Integrating large language model (LLM) agents within game theory demonstrates their ability to replicate human-like behaviors through strategic decision making. In this paper, we introduce an augmented LLM agent, called the private agent, which engages in private deliberation and employs deception in repeated games. Utilizing the partially observable stochastic game (POSG) framework and incorporating in-context learning (ICL) and chain-of-thought (CoT) prompting, we investigated the private agent's proficiency in both competitive and cooperative scenarios. Our empirical analysis demonstrated that the private agent consistently achieved higher long-term payoffs than its baseline counterpart and performed similarly or better in various game settings. However, we also found inherent deficiencies of LLMs in certain algorithmic capabilities crucial for high-quality decision making in games. These findings highlight the potential for enhancing LLM agents' performance in multi-player games using information-theoretic approaches of deception and communication with complex environments.
Affiliation(s)
- Kristijan Poje (with co-authors M.B., M.K., and M.B.B.): Faculty of Electrical Engineering and Computing, University of Zagreb, 10000 Zagreb, Croatia

35
Collins KM, Jiang AQ, Frieder S, Wong L, Zilka M, Bhatt U, Lukasiewicz T, Wu Y, Tenenbaum JB, Hart W, Gowers T, Li W, Weller A, Jamnik M. Evaluating language models for mathematics through interactions. Proc Natl Acad Sci U S A 2024; 121:e2318124121. [PMID: 38830100 PMCID: PMC11181017 DOI: 10.1073/pnas.2318124121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/20/2023] [Accepted: 02/15/2024] [Indexed: 06/05/2024] Open
Abstract
There is much excitement about the opportunity to harness the power of large language models (LLMs) when building problem-solving assistants. However, the standard methodology of evaluating LLMs relies on static pairs of inputs and outputs; this is insufficient for making an informed decision about which LLMs are best to use in an interactive setting, and how that varies by setting. Static assessment therefore limits how we understand language model capabilities. We introduce CheckMate, an adaptable prototype platform for humans to interact with and evaluate LLMs. We conduct a study with CheckMate to evaluate three language models (InstructGPT, ChatGPT, and GPT-4) as assistants in proving undergraduate-level mathematics, with a mixed cohort of participants from undergraduate students to professors of mathematics. We release the resulting interaction and rating dataset, MathConverse. By analyzing MathConverse, we derive a taxonomy of human query behaviors and uncover that despite a generally positive correlation, there are notable instances of divergence between correctness and perceived helpfulness in LLM generations, among other findings. Further, we garner a more granular understanding of GPT-4 mathematical problem-solving through a series of case studies, contributed by experienced mathematicians. We conclude with actionable takeaways for ML practitioners and mathematicians: models that communicate uncertainty, respond well to user corrections, and can provide a concise rationale for their recommendations, may constitute better assistants. Humans should inspect LLM output carefully given their current shortcomings and potential for surprising fallibility.
Affiliation(s)
- Lionel Wong: Massachusetts Institute of Technology, Cambridge, MA 02139
- Miri Zilka: University of Cambridge, Cambridge CB2 1TN, United Kingdom
- Umang Bhatt: University of Cambridge, Cambridge CB2 1TN, United Kingdom; The Alan Turing Institute, London NW1 2DB, United Kingdom; New York University, New York, NY 10011
- Thomas Lukasiewicz: University of Oxford, Oxford OX1 4BH, United Kingdom; Vienna University of Technology, Vienna 1040, Austria
- William Hart: University of Cambridge, Cambridge CB2 1TN, United Kingdom
- Timothy Gowers: University of Cambridge, Cambridge CB2 1TN, United Kingdom; Collège de France, Paris 75001, France
- Wenda Li: University of Cambridge, Cambridge CB2 1TN, United Kingdom
- Adrian Weller: University of Cambridge, Cambridge CB2 1TN, United Kingdom; The Alan Turing Institute, London NW1 2DB, United Kingdom
- Mateja Jamnik: University of Cambridge, Cambridge CB2 1TN, United Kingdom

36
Ahsan H, McInerney DJ, Kim J, Potter C, Young G, Amir S, Wallace BC. Retrieving Evidence from EHRs with LLMs: Possibilities and Challenges. PROCEEDINGS OF MACHINE LEARNING RESEARCH 2024; 248:489-505. [PMID: 39224857 PMCID: PMC11368037] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Subscribe] [Scholar Register] [Indexed: 09/04/2024]
Abstract
Unstructured data in Electronic Health Records (EHRs) often contains critical information, complementary to imaging, that could inform radiologists' diagnoses. However, the large volume of notes often associated with patients, together with time constraints, renders manually identifying relevant evidence practically infeasible. In this work, we propose and evaluate a zero-shot strategy for using LLMs as a mechanism to efficiently retrieve and summarize unstructured evidence in a patient's EHR relevant to a given query. Our method entails tasking an LLM to infer whether a patient has, or is at risk of, a particular condition on the basis of associated notes; if so, we ask the model to summarize the supporting evidence. Under expert evaluation, we find that this LLM-based approach provides outputs consistently preferred over a pre-LLM information retrieval baseline. Manual evaluation is expensive, so we also propose and validate a method for using an LLM to evaluate (other) LLM outputs for this task, allowing us to scale up evaluation. Our findings indicate the promise of LLMs as interfaces to EHRs, but also highlight the outstanding challenge posed by "hallucinations". In this setting, however, we show that model confidence in outputs strongly correlates with faithful summaries, offering a practical means to limit confabulations.
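A minimal sketch of the kind of zero-shot prompting strategy described above is given below. The prompt wording, the toy notes, and the `call_llm` helper are assumptions for illustration only, not the authors' released prompts or code; any real deployment would also need de-identification and careful handling of protected health information.

```python
def build_prompt(query_condition: str, notes: list[str]) -> str:
    # Assemble a single zero-shot prompt over all available notes.
    joined = "\n\n".join(notes)
    return (
        f"You are assisting a radiologist. Based only on the clinical notes below, "
        f"state whether the patient has, or is at risk of, {query_condition}. "
        f"If yes, quote the supporting evidence verbatim and summarize it in two "
        f"sentences. If the notes do not contain enough information, say so "
        f"explicitly rather than guessing.\n\nNOTES:\n{joined}"
    )

def call_llm(prompt: str) -> str:
    # Placeholder: wire in whatever chat-completion client is available.
    raise NotImplementedError("Plug in your LLM client here.")

notes = [
    "Patient reports a 10 pack-year smoking history and chronic cough.",
    "CT chest ordered to evaluate persistent cough; prior imaging unremarkable.",
]
prompt = build_prompt("lung cancer", notes)
# response = call_llm(prompt)  # uncomment once a client is configured
print(prompt)
```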
Affiliation(s)
- Jisoo Kim: Brigham and Women's Hospital, Boston, MA

37
Yaghy A, Porteny JR. A Letter to the Editor Regarding "The Use of ChatGPT to Assist in Diagnosing Glaucoma Based on Clinical Case Reports". Ophthalmol Ther 2024; 13:1813-1815. [PMID: 38637437 PMCID: PMC11109063 DOI: 10.1007/s40123-024-00934-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/28/2023] [Accepted: 03/12/2024] [Indexed: 04/20/2024] Open
Affiliation(s)
- Antonio Yaghy: New England Eye Center, Tufts University Medical Center, Boston, MA, USA
38
Tomassi A, Falegnami A, Romano E. Mapping automatic social media information disorder. The role of bots and AI in spreading misleading information in society. PLoS One 2024; 19:e0303183. [PMID: 38820281 PMCID: PMC11142451 DOI: 10.1371/journal.pone.0303183] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/03/2023] [Accepted: 04/19/2024] [Indexed: 06/02/2024] Open
Abstract
This paper presents an analysis of information disorder on social media platforms. The study employed methods such as Natural Language Processing, Topic Modeling, and Knowledge Graph building to gain new insights into the phenomenon of fake news and its impact on critical thinking and knowledge management. The analysis focused on four research questions: 1) the distribution of misinformation, disinformation, and malinformation across different platforms; 2) recurring themes in fake news and their visibility; 3) the role of artificial intelligence as an authoritative agent and/or a spreading agent; and 4) strategies for combating information disorder. The role of AI was highlighted both as a tool for fact-checking and for building truthiness-identification bots, and as a potential amplifier of false narratives. Strategies proposed for combating information disorder include improving digital literacy skills and promoting critical thinking among social media users.
Affiliation(s)
- Andrea Tomassi: Engineering Faculty, Uninettuno International Telematic University, Rome, Italy
- Andrea Falegnami: Engineering Faculty, Uninettuno International Telematic University, Rome, Italy
- Elpidio Romano: Engineering Faculty, Uninettuno International Telematic University, Rome, Italy
39
Kedia N, Sanjeev S, Ong J, Chhablani J. ChatGPT and Beyond: An overview of the growing field of large language models and their use in ophthalmology. Eye (Lond) 2024; 38:1252-1261. [PMID: 38172581 PMCID: PMC11076576 DOI: 10.1038/s41433-023-02915-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/26/2023] [Revised: 11/23/2023] [Accepted: 12/20/2023] [Indexed: 01/05/2024] Open
Abstract
ChatGPT, an artificial intelligence (AI) chatbot built on large language models (LLMs), has rapidly gained popularity. The benefits and limitations of this transformative technology have been discussed across various fields, including medicine. The widespread availability of ChatGPT has enabled clinicians to study how these tools could be used for a variety of tasks such as generating differential diagnosis lists, organizing patient notes, and synthesizing literature for scientific research. LLMs have shown promising capabilities in ophthalmology by performing well on the Ophthalmic Knowledge Assessment Program, providing fairly accurate responses to questions about retinal diseases, and generating differential diagnosis lists. There are current limitations to this technology, including the propensity of LLMs to "hallucinate", or confidently generate false information; their potential role in perpetuating biases in medicine; and the challenges in incorporating LLMs into research without allowing "AI-plagiarism" or publication of false information. In this paper, we provide a balanced overview of what LLMs are and introduce some of the LLMs that have been generated in the past few years. We discuss recent literature evaluating the role of these language models in medicine with a focus on ChatGPT. The field of AI is fast-paced, and new applications based on LLMs are being generated rapidly; therefore, it is important for ophthalmologists to be aware of how this technology works and how it may impact patient care. Here, we discuss the benefits, limitations, and future advancements of LLMs in patient care and research.
Affiliation(s)
- Nikita Kedia: Department of Ophthalmology, University of Pittsburgh School of Medicine, Pittsburgh, PA, USA
- Joshua Ong: Department of Ophthalmology and Visual Sciences, University of Michigan Kellogg Eye Center, Ann Arbor, MI, USA
- Jay Chhablani: Department of Ophthalmology, University of Pittsburgh School of Medicine, Pittsburgh, PA, USA
40
Giuffrè M, You K, Shung DL. Evaluating ChatGPT in Medical Contexts: The Imperative to Guard Against Hallucinations and Partial Accuracies. Clin Gastroenterol Hepatol 2024; 22:1145-1146. [PMID: 37863408 DOI: 10.1016/j.cgh.2023.09.035] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 09/21/2023] [Accepted: 09/22/2023] [Indexed: 10/22/2023]
Affiliation(s)
- Mauro Giuffrè: Section of Digestive Diseases, Department of Internal Medicine, Yale School of Medicine, Yale University, New Haven, Connecticut
- Kisung You: Department of Mathematics, Baruch College, City University of New York, New York, New York
- Dennis L Shung: Section of Digestive Diseases, Department of Internal Medicine, Yale School of Medicine, Yale University, New Haven, Connecticut
41
Frosolini A, Catarzi L, Benedetti S, Latini L, Chisci G, Franz L, Gennaro P, Gabriele G. The Role of Large Language Models (LLMs) in Providing Triage for Maxillofacial Trauma Cases: A Preliminary Study. Diagnostics (Basel) 2024; 14:839. [PMID: 38667484 PMCID: PMC11048758 DOI: 10.3390/diagnostics14080839] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/29/2024] [Revised: 04/10/2024] [Accepted: 04/16/2024] [Indexed: 04/28/2024] Open
Abstract
BACKGROUND In the evolving field of maxillofacial surgery, integrating advanced technologies like Large Language Models (LLMs) into medical practices, especially for trauma triage, presents a promising yet largely unexplored potential. This study aimed to evaluate the feasibility of using LLMs for triaging complex maxillofacial trauma cases by comparing their performance against the expertise of a tertiary referral center. METHODS Utilizing a comprehensive review of patient records in a tertiary referral center over a year-long period, standardized prompts detailing patient demographics, injury characteristics, and medical histories were created. These prompts were used to assess the triage suggestions of ChatGPT 4.0 and Google GEMINI against the center's recommendations, supplemented by evaluating the AI's performance using the QAMAI and AIPI questionnaires. RESULTS The results in 10 cases of major maxillofacial trauma indicated moderate agreement rates between LLM recommendations and the referral center, with some variances in the suggestion of appropriate examinations (70% ChatGPT and 50% GEMINI) and treatment plans (60% ChatGPT and 45% GEMINI). Notably, the study found no statistically significant differences in several areas of the questionnaires, except in the diagnosis accuracy (GEMINI: 3.30, ChatGPT: 2.30; p = 0.032) and relevance of the recommendations (GEMINI: 2.90, ChatGPT: 3.50; p = 0.021). A Spearman correlation analysis highlighted significant correlations within the two questionnaires, specifically between the QAMAI total score and AIPI treatment scores (rho = 0.767, p = 0.010). CONCLUSIONS This exploratory investigation underscores the potential of LLMs in enhancing clinical decision making for maxillofacial trauma cases, indicating a need for further research to refine their application in healthcare settings.
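As a small, hedged illustration of the correlation step mentioned in the results (not the authors' analysis script), Spearman's rank correlation between two questionnaire scores can be computed with SciPy; the score lists below are invented.

```python
# Illustrative only: Spearman correlation between two questionnaire totals,
# analogous to comparing QAMAI and AIPI scores. The data below are made up.
from scipy.stats import spearmanr

qamai_total = [18, 22, 15, 25, 20, 17, 23, 19, 21, 16]   # hypothetical per-case totals
aipi_treatment = [3, 4, 2, 5, 4, 3, 4, 3, 4, 2]          # hypothetical per-case scores

rho, p_value = spearmanr(qamai_total, aipi_treatment)
print(f"Spearman rho = {rho:.3f}, p = {p_value:.3f}")
```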
Affiliation(s)
- Andrea Frosolini, Lisa Catarzi, Simone Benedetti, Linda Latini, Glauco Chisci, Paolo Gennaro, Guido Gabriele: Maxillofacial Surgery Unit, Department of Medical Biotechnologies, University of Siena, 53100 Siena, Italy
- Leonardo Franz: Phoniatrics and Audiology Unit, Department of Neuroscience DNS, University of Padova, 35122 Treviso, Italy; Artificial Intelligence in Medicine and Innovation in Clinical Research and Methodology (PhD Program), Department of Clinical and Experimental Sciences, University of Brescia, 25121 Brescia, Italy
42
Buehler MJ. Generative Retrieval-Augmented Ontologic Graph and Multiagent Strategies for Interpretive Large Language Model-Based Materials Design. ACS ENGINEERING AU 2024; 4:241-277. [PMID: 38646516 PMCID: PMC11027160 DOI: 10.1021/acsengineeringau.3c00058] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 09/24/2023] [Revised: 12/06/2023] [Accepted: 12/07/2023] [Indexed: 04/23/2024]
Abstract
Transformer neural networks show promising capabilities, in particular for uses in materials analysis, design, and manufacturing, including their capacity to work effectively with human language, symbols, code, and numerical data. Here, we explore the use of large language models (LLMs) as a tool that can support engineering analysis of materials, applied to retrieving key information about subject areas, developing research hypotheses, discovery of mechanistic relationships across disparate areas of knowledge, and writing and executing simulation codes for active knowledge generation based on physical ground truths. Moreover, when used as sets of AI agents with specific features, capabilities, and instructions, LLMs can provide powerful problem-solution strategies for applications in analysis and design problems. Our experiments focus on using a fine-tuned model, MechGPT, developed based on training data in the mechanics of materials domain. We first affirm how fine-tuning endows LLMs with a reasonable understanding of subject area knowledge. However, when queried outside the context of learned matter, LLMs can have difficulty recalling correct information and may hallucinate. We show how this can be addressed using retrieval-augmented Ontological Knowledge Graph strategies. The graph-based strategy helps us not only to discern how the model understands what concepts are important but also how they are related, which significantly improves generative performance and also naturally allows for injection of new and augmented data sources into generative AI algorithms. We find that the additional feature of relatedness provides advantages over regular retrieval augmentation approaches and not only improves LLM performance but also provides mechanistic insights for exploration of a material design process. Illustrated for a use case of relating distinct areas of knowledge, here, music and proteins, such strategies can also provide an interpretable graph structure with rich information at the node, edge, and subgraph level that provides specific insights into mechanisms and relationships. We discuss other approaches to improve generative qualities, including nonlinear sampling strategies and agent-based modeling that offer enhancements over single-shot generations, whereby LLMs are used to both generate content and assess content against an objective target. Examples provided include complex question answering, code generation, and execution in the context of automated force-field development from actively learned density functional theory (DFT) modeling and data analysis.
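The retrieval-augmented knowledge-graph idea can be sketched in a few lines. The snippet below is an illustrative toy using networkx, not the MechGPT pipeline; the triples, prompt wording, and concept names are invented.

```python
# Toy sketch of graph-based retrieval augmentation: store (subject, relation, object)
# triples in a directed graph, pull the neighborhood of a query concept, and use the
# rendered edges as context for an LLM prompt. Not the authors' system.
import networkx as nx

triples = [
    ("spider silk", "exhibits", "high toughness"),
    ("high toughness", "arises from", "beta-sheet nanocrystals"),
    ("beta-sheet nanocrystals", "analogous to", "repeated musical motifs"),
]

graph = nx.DiGraph()
for subj, rel, obj in triples:
    graph.add_edge(subj, obj, relation=rel)

def neighborhood_context(concept: str, hops: int = 2) -> str:
    """Collect edges within a few hops of a concept and render them as text."""
    nodes = nx.single_source_shortest_path_length(graph, concept, cutoff=hops)
    lines = [
        f"{u} --{graph[u][v]['relation']}--> {v}"
        for u, v in graph.edges()
        if u in nodes and v in nodes
    ]
    return "\n".join(lines)

context = neighborhood_context("spider silk")
prompt = f"Using only these relations:\n{context}\n\nExplain why spider silk is tough."
print(prompt)  # this prompt would then be passed to a language model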
Affiliation(s)
- Markus J. Buehler: Laboratory for Atomistic and Molecular Mechanics (LAMM); Department of Civil and Environmental Engineering; Department of Mechanical Engineering; and Center for Computational Science and Engineering, Schwarzman College of Computing, Massachusetts Institute of Technology, 77 Massachusetts Ave., Cambridge, Massachusetts 02139, United States
43
Chen X, Gao Y, Wang L, Cui W, Huang J, Du Y, Wang B. Large language model enhanced corpus of CO2 reduction electrocatalysts and synthesis procedures. Sci Data 2024; 11:347. [PMID: 38582751 PMCID: PMC10998834 DOI: 10.1038/s41597-024-03180-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/18/2023] [Accepted: 03/22/2024] [Indexed: 04/08/2024] Open
Abstract
CO2 electroreduction has garnered significant attention from both the academic and industrial communities. Extracting crucial information related to catalysts from domain literature can help scientists find new and effective electrocatalysts. Herein, we used advanced machine learning, natural language processing (NLP), and large language model (LLM) approaches to extract relevant information about the CO2 electrocatalytic reduction process from the scientific literature. By applying the extraction pipeline, we present an open-source corpus for electrocatalytic CO2 reduction. The database contains two types of corpus: (1) the benchmark corpus, a collection of 6,985 records extracted from 1,081 publications by catalysis postgraduates; and (2) the extended corpus, which consists of content extracted from 5,941 documents using traditional NLP techniques and LLM techniques. Extended Corpus I and II contain 77,016 and 30,283 records, respectively. Furthermore, several LLMs fine-tuned on domain literature were developed. Overall, this work will contribute to the exploration of new and effective electrocatalysts by leveraging information from the domain literature with cutting-edge computational techniques.
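As a hedged illustration of the kind of LLM-assisted extraction described above (not the published pipeline; the schema, model name, and example passage are assumptions), one might prompt a model to return structured records from a passage of text.

```python
# Illustrative sketch of extracting catalyst records from a literature passage.
# The schema, model name, and example text are assumptions, not the published pipeline.
import json
from openai import OpenAI

client = OpenAI()

SCHEMA_HINT = (
    'Return a JSON list of records with keys: "catalyst", "product", '
    '"faradaic_efficiency", "synthesis_step". Output JSON only.'
)

def extract_records(passage: str, model: str = "gpt-4o") -> list:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You extract electrocatalysis data from text."},
            {"role": "user", "content": f"{SCHEMA_HINT}\n\nText:\n{passage}"},
        ],
        temperature=0,
    )
    # In practice the JSON should be validated or repaired before loading.
    return json.loads(response.choices[0].message.content)

passage = ("Cu nanowires prepared by thermal annealing converted CO2 to ethylene "
           "with 45% faradaic efficiency.")
print(extract_records(passage))
```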
Affiliation(s)
- Xueqing Chen: Laboratory of Big Data Knowledge, Computer Network Information Center, Chinese Academy of Sciences, Beijing, 100083, China; University of Chinese Academy of Sciences, Beijing, 100049, China
- Yang Gao: CAS Key Laboratory of Nanosystem and Hierarchical Fabrication, National Center for Nanoscience and Technology (NCNST), Beijing, 100190, China
- Ludi Wang: Laboratory of Big Data Knowledge, Computer Network Information Center, Chinese Academy of Sciences, Beijing, 100083, China
- Wenjuan Cui: Laboratory of Big Data Knowledge, Computer Network Information Center, Chinese Academy of Sciences, Beijing, 100083, China
- Jiamin Huang: CAS Key Laboratory of Nanosystem and Hierarchical Fabrication, National Center for Nanoscience and Technology (NCNST), Beijing, 100190, China
- Yi Du: Laboratory of Big Data Knowledge, Computer Network Information Center, Chinese Academy of Sciences, Beijing, 100083, China; University of Chinese Academy of Sciences, Beijing, 100049, China; Hangzhou Institute for Advanced Study, UCAS, Hangzhou, 310000, China
- Bin Wang: CAS Key Laboratory of Nanosystem and Hierarchical Fabrication, National Center for Nanoscience and Technology (NCNST), Beijing, 100190, China
44
Sievert M, Aubreville M, Mueller SK, Eckstein M, Breininger K, Iro H, Goncalves M. Diagnosis of malignancy in oropharyngeal confocal laser endomicroscopy using GPT 4.0 with vision. Eur Arch Otorhinolaryngol 2024; 281:2115-2122. [PMID: 38329525 DOI: 10.1007/s00405-024-08476-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/27/2023] [Accepted: 01/11/2024] [Indexed: 02/09/2024]
Abstract
PURPOSE Confocal Laser Endomicroscopy (CLE) is an imaging tool that has demonstrated potential for intraoperative, real-time, non-invasive, microscopical assessment of surgical margins of oropharyngeal squamous cell carcinoma (OPSCC). However, interpreting CLE images remains challenging. This study investigates the application of OpenAI's Generative Pretrained Transformer (GPT) 4.0 with Vision capabilities for automated classification of CLE images in OPSCC. METHODS CLE images of histologically confirmed SCC or healthy mucosa from a database of 12,809 CLE images from 5 patients with OPSCC were retrieved and anonymized. Using a training data set of 16 images, a validation set of 139 images, comprising SCC (83 images, 59.7%) and healthy normal mucosa (56 images, 40.3%), was classified using the application programming interface (API) of GPT-4.0. The same set of images was also classified by CLE experts (two surgeons and one pathologist), who were blinded to the histology. Diagnostic metrics, the reliability of GPT, and inter-rater reliability were assessed. RESULTS Overall accuracy of the GPT model was 71.2%, and the intra-rater agreement was κ = 0.837, indicating an almost perfect agreement across the three runs of GPT-generated results. Human experts achieved an accuracy of 88.5% with a substantial level of agreement (κ = 0.773). CONCLUSIONS Though limited to a specific clinical framework, patient cohort, and image set, this study sheds light on some previously unexplored diagnostic capabilities of large language models using few-shot prompting. It suggests the model's ability to extrapolate information and classify CLE images with minimal example data. Whether future versions of the model can achieve clinically relevant diagnostic accuracy, especially in uncurated data sets, remains to be investigated.
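For readers unfamiliar with how images are passed to a vision-capable model, the sketch below illustrates the general pattern of few-shot image classification through a chat completions API. It is not the study's protocol; the model name, file paths, and labels are placeholders.

```python
# Illustrative sketch of few-shot image classification with a vision-capable chat model.
# Not the study's code: model name, file paths, and labels are placeholders.
import base64
from openai import OpenAI

client = OpenAI()

def image_part(path: str) -> dict:
    """Encode a local image as a base64 data URL message part."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}

def classify(query_image: str, examples: list[tuple[str, str]], model: str = "gpt-4o") -> str:
    content = [{"type": "text", "text": "Classify CLE images as 'SCC' or 'healthy mucosa'. Examples follow."}]
    for path, label in examples:
        content.append(image_part(path))
        content.append({"type": "text", "text": f"Label: {label}"})
    content.append({"type": "text", "text": "Now classify this image. Answer with the label only."})
    content.append(image_part(query_image))
    response = client.chat.completions.create(model=model, messages=[{"role": "user", "content": content}])
    return response.choices[0].message.content.strip()

examples = [("train/scc_01.png", "SCC"), ("train/healthy_01.png", "healthy mucosa")]
print(classify("validation/case_42.png", examples))
```

Running the same query several times and comparing the labels (for example with Cohen's kappa) would mirror the intra-rater reliability check reported in the abstract.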
Affiliation(s)
- Matti Sievert: Department of Otorhinolaryngology, Head and Neck Surgery, Friedrich Alexander University of Erlangen-Nuremberg, Erlangen University Hospital, Erlangen, Germany
- Sarina Katrin Mueller: Department of Otorhinolaryngology, Head and Neck Surgery, Friedrich Alexander University of Erlangen-Nuremberg, Erlangen University Hospital, Erlangen, Germany
- Markus Eckstein: Institute of Pathology, Friedrich-Alexander-Universität Erlangen-Nürnberg, University Hospital, Erlangen, Germany
- Katharina Breininger: Department of Artificial Intelligence in Biomedical Engineering, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
- Heinrich Iro: Department of Otorhinolaryngology, Head and Neck Surgery, Friedrich Alexander University of Erlangen-Nuremberg, Erlangen University Hospital, Erlangen, Germany
- Miguel Goncalves: Department of Otorhinolaryngology, Plastic and Aesthetic Operations, University Hospital Würzburg, Joseph-Schneider-Straße 11, 97080 Würzburg, Germany
45
Jacaruso L. Insights into the nutritional prevention of macular degeneration based on a comparative topic modeling approach. PeerJ Comput Sci 2024; 10:e1940. [PMID: 38660183 PMCID: PMC11042009 DOI: 10.7717/peerj-cs.1940] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/22/2023] [Accepted: 02/22/2024] [Indexed: 04/26/2024]
Abstract
Topic modeling and text mining are subsets of natural language processing (NLP) with relevance for conducting meta-analysis (MA) and systematic review (SR). For evidence synthesis, the above NLP methods are conventionally used for topic-specific literature searches or extracting values from reports to automate essential phases of SR and MA. Instead, this work proposes a comparative topic modeling approach to analyze reports of contradictory results on the same general research question. Specifically, the objective is to identify topics exhibiting distinct associations with significant results for an outcome of interest by ranking them according to their proportional occurrence in (and consistency of distribution across) reports of significant effects. Macular degeneration (MD) is a disease that affects millions of people annually, causing vision loss. Augmenting evidence synthesis to provide insight into MD prevention is therefore of central interest in this article. The proposed method was tested on broad-scope studies addressing whether supplemental nutritional compounds significantly benefit macular degeneration. Six compounds were identified as having a particular association with reports of significant results for benefiting MD. Four of these were further supported in terms of effectiveness upon conducting a follow-up literature search for validation (omega-3 fatty acids, copper, zeaxanthin, and nitrates). The two not supported by the follow-up literature search (niacin and molybdenum) also had scores in the lowest range under the proposed scoring system. Results therefore suggest that the proposed method's score for a given topic may be a viable proxy for its degree of association with the outcome of interest, and can be helpful in the systematic search for potentially causal relationships. Further, the compounds identified by the proposed method were not simultaneously captured as salient topics by state-of-the-art topic models that leverage document and word embeddings (Top2Vec) and transformer models (BERTopic). These results underpin the proposed method's potential to add specificity in understanding effects from broad-scope reports, elucidate topics of interest for future research, and guide evidence synthesis in a scalable way. All of this is accomplished while yielding valuable and actionable insights into the prevention of MD.
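A minimal sketch of the scoring idea, assuming topics have already been extracted per report, is to rank topics by how disproportionately they occur in reports of significant effects. The data and weighting below are invented simplifications, not the paper's scoring system.

```python
# Toy sketch of comparative topic scoring: rank topics by how disproportionately
# they occur in reports of significant effects. Data and weighting are invented.
from collections import Counter

significant_reports = [
    {"omega-3", "zinc", "zeaxanthin"},
    {"omega-3", "copper"},
    {"zeaxanthin", "nitrates", "omega-3"},
]
nonsignificant_reports = [
    {"zinc", "niacin"},
    {"niacin", "molybdenum"},
]

def topic_scores(sig, nonsig):
    sig_counts, nonsig_counts = Counter(), Counter()
    for topics in sig:
        sig_counts.update(topics)
    for topics in nonsig:
        nonsig_counts.update(topics)
    scores = {}
    for topic in set(sig_counts) | set(nonsig_counts):
        p_sig = sig_counts[topic] / len(sig)
        p_nonsig = nonsig_counts[topic] / len(nonsig)
        scores[topic] = p_sig - p_nonsig  # higher = more associated with significant results
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

for topic, score in topic_scores(significant_reports, nonsignificant_reports):
    print(f"{topic}: {score:+.2f}")
```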
Affiliation(s)
- Lucas Jacaruso: University of Southern California, Los Angeles, CA, United States of America
46
Yavuz YE, Kahraman F. Evaluation of the prediagnosis and management of ChatGPT-4.0 in clinical cases in cardiology. Future Cardiol 2024; 20:197-207. [PMID: 39049771 DOI: 10.1080/14796678.2024.2348898] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/08/2023] [Accepted: 04/25/2024] [Indexed: 07/27/2024] Open
Abstract
Aim: Evaluation of the performance of ChatGPT-4.0 in providing prediagnosis and treatment plans for cardiac clinical cases by expert cardiologists. Methods: Twenty cardiology clinical cases developed by experienced cardiologists were divided into two groups according to preparation methods. Cases were reviewed and analyzed by the ChatGPT-4.0 program, and the ChatGPT analyses were then sent to cardiologists. Eighteen expert cardiologists evaluated the quality of ChatGPT-4.0 responses using Likert and Global quality scales. Results: Physicians rated case difficulty (median 2.00), revealing high ChatGPT-4.0 agreement with differential diagnoses (median 5.00). Management plans received a median score of 4, indicating good quality. Regardless of the difficulty of the cases, ChatGPT-4.0 showed similar performance in differential diagnosis (p = 0.256) and treatment plans (p = 0.951). Conclusion: ChatGPT-4.0 excels at delivering accurate management and demonstrates its potential as a valuable clinical decision support tool in cardiology.
Affiliation(s)
- Yunus Emre Yavuz: Department of Cardiology, Siirt Training & Research Hospital, Siirt, 56100, Turkey
- Fatih Kahraman: Department of Cardiology, Kütahya Evliya Çelebi Training & Research Hospital, Kütahya, 43000, Turkey
47
Xu Y, Jiang Z, Ting DSW, Kow AWC, Bello F, Car J, Tham YC, Wong TY. Medical education and physician training in the era of artificial intelligence. Singapore Med J 2024; 65:159-166. [PMID: 38527300 PMCID: PMC11060639 DOI: 10.4103/singaporemedj.smj-2023-203] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/22/2023] [Accepted: 02/08/2024] [Indexed: 03/27/2024]
Abstract
With the rise of generative artificial intelligence (AI) and AI-powered chatbots, the landscape of medicine and healthcare is on the brink of significant transformation. This perspective delves into the prospective influence of AI on medical education, residency training and the continuing education of attending physicians or consultants. We begin by highlighting the constraints of the current education model: the challenges of limited faculty, uniformity amidst burgeoning medical knowledge, and the limitations of 'traditional' linear knowledge acquisition. We introduce 'AI-assisted' and 'AI-integrated' paradigms for medical education and physician training, targeting a more universal, accessible, high-quality and interconnected educational journey. We differentiate between essential knowledge for all physicians, specialised insights for clinician-scientists and mastery-level proficiency for clinician-computer scientists. Given its transformative potential in healthcare and service delivery, AI is poised to reshape the pedagogy of medical education and residency training.
Affiliation(s)
- Yueyuan Xu: Tsinghua Medicine, School of Medicine, Tsinghua University, Beijing, China
- Zehua Jiang: Tsinghua Medicine, School of Medicine, Tsinghua University, Beijing, China; School of Clinical Medicine, Beijing Tsinghua Changgung Hospital, Beijing, China
- Daniel Shu Wei Ting: Singapore Eye Research Institute, Singapore National Eye Centre, Singapore; Eye Academic Clinical Program, Duke-NUS Medical School, Singapore; Byers Eye Institute, Stanford University, Palo Alto, CA, USA
- Alfred Wei Chieh Kow: Department of Surgery, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Fernando Bello: Technology Enhanced Learning and Innovation Department, Duke-NUS Medical School, National University of Singapore, Singapore
- Josip Car: Centre for Population Health Sciences, Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore
- Yih-Chung Tham: Singapore Eye Research Institute, Singapore National Eye Centre, Singapore; Eye Academic Clinical Program, Duke-NUS Medical School, Singapore; Centre for Innovation and Precision Eye Health and Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Tien Yin Wong: Tsinghua Medicine, School of Medicine, Tsinghua University, Beijing, China; School of Clinical Medicine, Beijing Tsinghua Changgung Hospital, Beijing, China; Singapore Eye Research Institute, Singapore National Eye Centre, Singapore
48
Hake J, Crowley M, Coy A, Shanks D, Eoff A, Kirmer-Voss K, Dhanda G, Parente DJ. Quality, Accuracy, and Bias in ChatGPT-Based Summarization of Medical Abstracts. Ann Fam Med 2024; 22:113-120. [PMID: 38527823 PMCID: PMC11237196 DOI: 10.1370/afm.3075] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 04/21/2023] [Revised: 10/13/2023] [Accepted: 11/17/2023] [Indexed: 03/27/2024] Open
Abstract
PURPOSE Worldwide clinical knowledge is expanding rapidly, but physicians have sparse time to review scientific literature. Large language models (eg, Chat Generative Pretrained Transformer [ChatGPT]) might help summarize and prioritize research articles to review. However, large language models sometimes "hallucinate" incorrect information. METHODS We evaluated ChatGPT's ability to summarize 140 peer-reviewed abstracts from 14 journals. Physicians rated the quality, accuracy, and bias of the ChatGPT summaries. We also compared human ratings of relevance to various areas of medicine to ChatGPT relevance ratings. RESULTS ChatGPT produced summaries that were 70% shorter (mean abstract length decreased from 2,438 characters to 739 characters). Summaries were nevertheless rated as high quality (median score 90, interquartile range [IQR] 87.0-92.5; scale 0-100), high accuracy (median 92.5, IQR 89.0-95.0), and low bias (median 0, IQR 0-7.5). Serious inaccuracies and hallucinations were uncommon. Classification of the relevance of entire journals to various fields of medicine closely mirrored physician classifications (nonlinear standard error of the regression [SER] 8.6 on a scale of 0-100). However, relevance classification for individual articles was much more modest (SER 22.3). CONCLUSIONS Summaries generated by ChatGPT were 70% shorter than the mean abstract length and were characterized by high quality, high accuracy, and low bias. Conversely, ChatGPT had modest ability to classify the relevance of articles to medical specialties. We suggest that ChatGPT can help family physicians accelerate review of the scientific literature and have developed software (pyJournalWatch) to support this application. Life-critical medical decisions should remain based on full, critical, and thoughtful evaluation of the full text of research articles in the context of clinical guidelines.
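As an illustrative, hedged sketch of two quantities reported above, the snippet below computes the percent length reduction from the published character counts and a simplified standard error of the regression (SER) between model and physician relevance ratings; the rating values are invented, and the SER here is a plain root-mean-square residual rather than the paper's nonlinear fit.

```python
# Illustrative check of two quantities from the abstract: percent length reduction,
# and a simplified SER between model and physician relevance ratings.
# Rating values below are invented, not study data.
import numpy as np

abstract_chars, summary_chars = 2438, 739
reduction = 100 * (1 - summary_chars / abstract_chars)
print(f"Summary length reduction: {reduction:.0f}%")   # about 70%

physician = np.array([80, 65, 90, 40, 55, 75])   # hypothetical relevance ratings (0-100)
chatgpt = np.array([72, 70, 85, 55, 60, 70])     # hypothetical model ratings

# Root-mean-square residual when taking the model rating as the prediction,
# a simplification of the paper's nonlinear regression SER.
residuals = physician - chatgpt
ser = np.sqrt(np.sum(residuals**2) / (len(residuals) - 2))
print(f"Approximate SER: {ser:.1f}")
```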
Affiliation(s)
- Joel Hake, Miles Crowley, Allison Coy, Denton Shanks, Aundria Eoff, Kalee Kirmer-Voss, Gurpreet Dhanda, Daniel J Parente: Department of Family Medicine and Community Health, University of Kansas Medical Center, Kansas City, Kansas
49
Abi-Rafeh J, Xu HH, Kazan R, Tevlin R, Furnas H. Large Language Models and Artificial Intelligence: A Primer for Plastic Surgeons on the Demonstrated and Potential Applications, Promises, and Limitations of ChatGPT. Aesthet Surg J 2024; 44:329-343. [PMID: 37562022 DOI: 10.1093/asj/sjad260] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/26/2023] [Revised: 08/02/2023] [Accepted: 08/04/2023] [Indexed: 08/12/2023] Open
Abstract
BACKGROUND The rapidly evolving field of artificial intelligence (AI) holds great potential for plastic surgeons. ChatGPT, a recently released AI large language model (LLM), promises applications across many disciplines, including healthcare. OBJECTIVES The aim of this article was to provide a primer for plastic surgeons on AI, LLMs, and ChatGPT, including an analysis of current demonstrated and proposed clinical applications. METHODS A systematic review was performed identifying medical and surgical literature on ChatGPT's proposed clinical applications. Variables assessed included applications investigated, command tasks provided, user input information, AI-emulated human skills, output validation, and reported limitations. RESULTS The analysis included 175 articles reporting on 13 plastic surgery applications and 116 additional clinical applications, categorized by field and purpose. Thirty-four applications within plastic surgery are thus proposed, with relevance to different target audiences, including attending plastic surgeons (n = 17, 50%), trainees/educators (n = 8, 24.0%), researchers/scholars (n = 7, 21%), and patients (n = 2, 6%). The 15 identified limitations of ChatGPT were categorized by training data, algorithm, and ethical considerations. CONCLUSIONS Widespread use of ChatGPT in plastic surgery will depend on rigorous research of proposed applications to validate performance and address limitations. This systematic review aims to guide research, development, and regulation to safely adopt AI in plastic surgery.
50
Segal S, Saha AK, Khanna AK. Appropriateness of Answers to Common Preanesthesia Patient Questions Composed by the Large Language Model GPT-4 Compared to Human Authors. Anesthesiology 2024; 140:333-335. [PMID: 38193737 DOI: 10.1097/aln.0000000000004824] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/10/2024]
Affiliation(s)
- Scott Segal: Wake Forest University School of Medicine, Atrium Health Wake Forest Baptist Medical Center, Winston-Salem, North Carolina