1. Moulaei K, Yadegari A, Baharestani M, Farzanbakhsh S, Sabet B, Reza Afrash M. Generative artificial intelligence in healthcare: A scoping review on benefits, challenges and applications. Int J Med Inform 2024; 188:105474. [PMID: 38733640] [DOI: 10.1016/j.ijmedinf.2024.105474]
Abstract
BACKGROUND Generative artificial intelligence (GAI) is revolutionizing healthcare with solutions for complex challenges, enhancing diagnosis, treatment, and care through new data and insights. However, its integration raises questions about applications, benefits, and challenges. Our study explores these aspects, offering an overview of GAI's applications and future prospects in healthcare. METHODS This scoping review searched Web of Science, PubMed, and Scopus. The selection of studies involved screening titles, reviewing abstracts, and examining full texts, adhering to the PRISMA-ScR guidelines throughout the process. RESULTS From 1406 articles across three databases, 109 met inclusion criteria after screening and deduplication. Nine GAI models were utilized in healthcare, with ChatGPT (n = 102, 74%), Google Bard (Gemini) (n = 16, 11%), and Microsoft Bing AI (n = 10, 7%) being the most frequently employed. A total of 24 different applications of GAI in healthcare were identified, with the most common being "offering insights and information on health conditions through answering questions" (n = 41) and "diagnosis and prediction of diseases" (n = 17). In total, 606 benefits and challenges were identified, which were condensed to 48 benefits and 61 challenges after consolidation. The predominant benefits included "Providing rapid access to information and valuable insights" and "Improving prediction and diagnosis accuracy", while the primary challenges comprised "generating inaccurate or fictional content", "unknown source of information and fake references for texts", and "lower accuracy in answering questions". CONCLUSION This scoping review identified the applications, benefits, and challenges of GAI in healthcare. This synthesis offers a crucial overview of GAI's potential to revolutionize healthcare, emphasizing the imperative to address its limitations.
Affiliations
- Khadijeh Moulaei: Department of Health Information Technology, School of Paramedical, Ilam University of Medical Sciences, Ilam, Iran
- Atiye Yadegari: Department of Pediatric Dentistry, School of Dentistry, Hamadan University of Medical Sciences, Hamadan, Iran
- Mahdi Baharestani: Network of Interdisciplinarity in Neonates and Infants (NINI), Universal Scientific Education and Research Network (USERN), Tehran, Iran
- Shayan Farzanbakhsh: Network of Interdisciplinarity in Neonates and Infants (NINI), Universal Scientific Education and Research Network (USERN), Tehran, Iran
- Babak Sabet: Department of Surgery, Faculty of Medicine, Shahid Beheshti University of Medical Sciences, Tehran, Iran
- Mohammad Reza Afrash: Department of Artificial Intelligence, Smart University of Medical Sciences, Tehran, Iran
2. Lucas MM, Yang J, Pomeroy JK, Yang CC. Reasoning with large language models for medical question answering. J Am Med Inform Assoc 2024:ocae131. [PMID: 38960731] [DOI: 10.1093/jamia/ocae131]
Abstract
OBJECTIVES To investigate approaches of reasoning with large language models (LLMs) and to propose a new prompting approach, ensemble reasoning, to improve medical question answering performance with refined reasoning and reduced inconsistency. MATERIALS AND METHODS We used multiple choice questions from the USMLE Sample Exam question files on 2 closed-source commercial LLMs and 1 open-source clinical LLM to evaluate our proposed ensemble reasoning approach. RESULTS On GPT-3.5 turbo and Med42-70B, our proposed ensemble reasoning approach outperformed zero-shot chain-of-thought with self-consistency on Steps 1, 2, and 3 questions (+3.44%, +4.00%, and +2.54%) and (+2.3%, +5.00%, and +4.15%), respectively. With GPT-4 turbo, results were mixed, with ensemble reasoning again outperforming zero-shot chain-of-thought with self-consistency on Step 1 questions (+1.15%). In all cases, the results demonstrated improved consistency of responses with our approach. A qualitative analysis of the reasoning from the model demonstrated that the ensemble reasoning approach produces correct and helpful reasoning. CONCLUSION The proposed iterative ensemble reasoning has the potential to improve the performance of LLMs in medical question answering tasks, particularly with less powerful LLMs like GPT-3.5 turbo and Med42-70B, which may suggest that this is a promising approach for LLMs with lower capabilities. Additionally, the findings show that our approach helps to refine the reasoning generated by the LLM and thereby improve consistency even with the more powerful GPT-4 turbo. We also identify the potential and need for human-artificial intelligence teaming to improve the reasoning beyond the limits of the model.
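The baseline the authors compare against, zero-shot chain-of-thought with self-consistency, samples several reasoning paths at nonzero temperature and majority-votes the extracted answers. A minimal sketch of that baseline (not the paper's ensemble-reasoning code; the client, model name, and answer-extraction pattern are illustrative assumptions):

```python
# Hedged sketch: zero-shot chain-of-thought with self-consistency for
# multiple-choice medical questions. Model name and answer-extraction
# regex are assumptions, not the paper's configuration.
import re
from collections import Counter

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def answer_with_self_consistency(question: str, n_samples: int = 5) -> str:
    """Sample several reasoning paths and return the majority answer."""
    votes = []
    for _ in range(n_samples):
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",  # placeholder model name
            temperature=0.7,        # >0 so sampled reasoning paths differ
            messages=[{
                "role": "user",
                "content": f"{question}\n\nLet's think step by step, "
                           "then finish with 'Answer: <letter>'.",
            }],
        )
        text = response.choices[0].message.content or ""
        match = re.search(r"Answer:\s*([A-E])", text)
        if match:
            votes.append(match.group(1))
    return Counter(votes).most_common(1)[0][0] if votes else "no answer"
```

The reported ensemble-reasoning approach goes further by iteratively refining the sampled reasoning rather than only voting on final answers.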
Affiliations
- Mary M Lucas: College of Computing and Informatics, Drexel University, Philadelphia, PA 19104, United States
- Justin Yang: Department of Computer Science, University of Maryland, College Park, MD 20742, United States
- Jon K Pomeroy: College of Computing and Informatics, Drexel University, Philadelphia, PA 19104, United States; Penn Medicine, Philadelphia, PA 19104, United States
- Christopher C Yang: College of Computing and Informatics, Drexel University, Philadelphia, PA 19104, United States
3. Bortoli M, Fiore M, Tedeschi S, Oliveira V, Sousa R, Bruschi A, Campanacci DA, Viale P, De Paolis M, Sambri A. GPT-based chatbot tools are still unreliable in the management of prosthetic joint infections. Musculoskelet Surg 2024. [PMID: 38954323] [DOI: 10.1007/s12306-024-00846-w]
Abstract
BACKGROUND Artificial intelligence chatbot tools might discern patterns and correlations that elude human observation, leading to more accurate and timely interventions. However, their reliability in answering healthcare-related questions is still debated. This study aimed to assess the performance of three versions of GPT-based chatbots on prosthetic joint infections (PJI). METHODS Thirty questions concerning the diagnosis and treatment of hip and knee PJIs, stratified by a priori established difficulty, were generated by a team of experts and administered to ChatGPT 3.5, BingChat, and ChatGPT 4.0. Responses were rated by three orthopedic surgeons and two infectious diseases physicians using a five-point Likert-like scale with numerical values to quantify the quality of responses. Inter-rater reliability was assessed with intraclass correlation statistics. RESULTS Responses averaged "good-to-very good" for all chatbots examined, both in diagnosis and treatment, with no significant differences according to the difficulty of the questions. However, BingChat ratings were significantly lower in the treatment setting (p = 0.025), particularly in terms of accuracy (p = 0.02) and completeness (p = 0.004). Agreement in ratings among examiners appeared to be very poor. CONCLUSIONS On average, the quality of responses was rated positively by experts, but ratings frequently varied widely. This currently suggests that AI chatbot tools are still unreliable in the management of PJI.
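The inter-rater reliability analysis described here, several clinicians scoring each response on a five-point scale, can be reproduced with an intraclass correlation on long-format ratings. A minimal sketch, assuming invented column names and toy scores, using the pingouin package:

```python
# Hedged sketch: intraclass correlation for rater agreement on
# Likert-style chatbot ratings. Column names and scores are invented.
import pandas as pd
import pingouin as pg

ratings = pd.DataFrame({
    "question": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "rater":    ["A", "B", "C"] * 3,
    "score":    [4, 5, 2, 3, 4, 1, 5, 5, 3],  # five-point scale
})

icc = pg.intraclass_corr(data=ratings, targets="question",
                         raters="rater", ratings="score")
print(icc[["Type", "ICC", "CI95%"]])  # a low ICC implies poor agreement
```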
Affiliations
- M Bortoli: Orthopedic and Traumatology Unit, IRCCS Azienda Ospedaliero-Universitaria Di Bologna, 40138, Bologna, Italy
- M Fiore: Orthopedic and Traumatology Unit, IRCCS Azienda Ospedaliero-Universitaria Di Bologna, 40138, Bologna, Italy; Department of Medical and Surgical Sciences, Alma Mater Studiorum University of Bologna, 40138, Bologna, Italy
- S Tedeschi: Department of Medical and Surgical Sciences, Alma Mater Studiorum University of Bologna, 40138, Bologna, Italy; Infectious Disease Unit, Department for Integrated Infectious Risk Management, IRCCS Azienda Ospedaliero-Universitaria Di Bologna, 40138, Bologna, Italy
- V Oliveira: Department of Orthopedics, Centro Hospitalar Universitário de Santo António, 4099-001, Porto, Portugal
- R Sousa: Department of Orthopedics, Centro Hospitalar Universitário de Santo António, 4099-001, Porto, Portugal
- A Bruschi: Orthopedic and Traumatology Unit, IRCCS Azienda Ospedaliero-Universitaria Di Bologna, 40138, Bologna, Italy
- D A Campanacci: Orthopedic Oncology Unit, Azienda Ospedaliera Universitaria Careggi, 50134, Florence, Italy
- P Viale: Department of Medical and Surgical Sciences, Alma Mater Studiorum University of Bologna, 40138, Bologna, Italy; Infectious Disease Unit, Department for Integrated Infectious Risk Management, IRCCS Azienda Ospedaliero-Universitaria Di Bologna, 40138, Bologna, Italy
- M De Paolis: Orthopedic and Traumatology Unit, IRCCS Azienda Ospedaliero-Universitaria Di Bologna, 40138, Bologna, Italy
- A Sambri: Orthopedic and Traumatology Unit, IRCCS Azienda Ospedaliero-Universitaria Di Bologna, 40138, Bologna, Italy
4. Luo S, Canavese F, Aroojis A, Andreacchio A, Anticevic D, Bouchard M, Castaneda P, De Rosa V, Fiogbe MA, Frick SL, Hui JH, Johari AN, Loro A, Lyu X, Matsushita M, Omeroglu H, Roye DP, Shah MM, Yong B, Li L. Are Generative Pretrained Transformer 4 Responses to Developmental Dysplasia of the Hip Clinical Scenarios Universal? An International Review. J Pediatr Orthop 2024; 44:e504-e511. [PMID: 38597198] [DOI: 10.1097/bpo.0000000000002682]
Abstract
OBJECTIVE There is increasing interest in applying artificial intelligence chatbots like generative pretrained transformer 4 (GPT-4) in the medical field. This study aimed to explore the universality of GPT-4 responses to simulated clinical scenarios of developmental dysplasia of the hip (DDH) across diverse global settings. METHODS Seventeen international experts with more than 15 years of experience in pediatric orthopaedics were selected for the evaluation panel. Eight simulated DDH clinical scenarios were created, covering 4 key areas: (1) initial evaluation and diagnosis, (2) initial examination and treatment, (3) nursing care and follow-up, and (4) prognosis and rehabilitation planning. Each scenario was completed independently in a new GPT-4 session. Interrater reliability was assessed using Fleiss kappa, and the quality, relevance, and applicability of GPT-4 responses were analyzed using median scores and interquartile ranges. Following scoring, experts met in Zoom sessions to generate Regional Consensus Assessment Scores, which were intended to represent a consistent regional assessment of the use of GPT-4 in pediatric orthopaedic care. RESULTS GPT-4's responses to the 8 clinical DDH scenarios received performance scores ranging from 44.3% to 98.9% of the 88-point maximum. The Fleiss kappa statistic of 0.113 (P = 0.001) indicated low agreement among experts in their ratings. When assessing the responses' quality, relevance, and applicability, the median scores were 3, with interquartile ranges of 3 to 4, 3 to 4, and 2 to 3, respectively. Significant differences were noted in the prognosis and rehabilitation domain scores (P < 0.05 for all). Regional consensus scores were 75 for Africa, 74 for Asia, 73 for India, 80 for Europe, and 65 for North America, with the Kruskal-Wallis test highlighting significant disparities between these regions (P = 0.034). CONCLUSIONS This study demonstrates the promise of GPT-4 in pediatric orthopaedic care, particularly in supporting preliminary DDH assessments and guiding treatment strategies for specialist care. However, effective integration of GPT-4 into clinical practice will require adaptation to specific regional health care contexts, highlighting the importance of a nuanced approach to health technology adaptation. LEVEL OF EVIDENCE Level IV.
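Fleiss' kappa, the agreement statistic reported above, can be computed from a subjects-by-raters matrix of categorical ratings. A minimal sketch with invented ratings, using statsmodels:

```python
# Hedged sketch: Fleiss' kappa for agreement among multiple experts.
# The ratings matrix below is invented for illustration.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# rows = scenarios, columns = raters, values = assigned category (0-4)
ratings = np.array([
    [4, 3, 4, 2, 4],
    [1, 2, 1, 1, 0],
    [3, 3, 2, 4, 3],
    [0, 1, 0, 2, 1],
])

table, _ = aggregate_raters(ratings)  # per-scenario counts per category
print(f"Fleiss' kappa = {fleiss_kappa(table):.3f}")  # near 0 = weak agreement
```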
Affiliations
- Shaoting Luo: Department of Pediatric Orthopaedics, Shengjing Hospital of China Medical University, Shenyang, Liaoning
- Federico Canavese: Department of Orthopaedic Surgery, School of Medicine, Stanford University, Palo Alto, CA
- Alaric Aroojis: Department of Orthopaedic Surgery, School of Medicine, Stanford University, Palo Alto, CA
- Antonio Andreacchio: Department of Orthopaedic Surgery, School of Medicine, Stanford University, Palo Alto, CA
- Darko Anticevic: Pediatric Orthopedics Clinic of Pediatric Surgery and Orthopedics, Pediatric Institute of Southern Switzerland (IPSI), Via Athos Gallino, Bellinzona, Switzerland
- Pablo Castaneda: Department of Orthopaedic Surgery, School of Medicine, Stanford University, Palo Alto, CA
- Vincenzo De Rosa: Pediatric Orthopedics Clinic of Pediatric Surgery and Orthopedics, Pediatric Institute of Southern Switzerland (IPSI), Via Athos Gallino, Bellinzona, Switzerland
- Steven L Frick: Department of Orthopaedic Surgery, School of Medicine, Stanford University, Palo Alto, CA
- James H Hui: Department of Orthopaedic Surgery, Nagoya University Graduate School of Medicine, Nagoya, Aichi, Japan
- Ashok N Johari: Pediatric Orthopedics Clinic of Pediatric Surgery and Orthopedics, Pediatric Institute of Southern Switzerland (IPSI), Via Athos Gallino, Bellinzona, Switzerland
- Antonio Loro: Ufuk University Faculty of Medicine, Ankara, Turkey
- Xuemin Lyu: Department of Orthopaedic Surgery, School of Medicine, Stanford University, Palo Alto, CA
- Masaki Matsushita: Department of Orthopaedic Surgery, Nagoya University Graduate School of Medicine, Nagoya, Aichi, Japan
- David P Roye: Department of Orthopaedic Surgery, Nagoya University Graduate School of Medicine, Nagoya, Aichi, Japan
- Bicheng Yong: Department of Pediatric Orthopaedics, Beit CURE Children's Hospital of Malawi, Chichiri Blantyre, Malawi
- Lianyong Li: Department of Pediatric Orthopaedics, Shengjing Hospital of China Medical University, Shenyang, Liaoning
5. Marshan A, Almutairi AN, Ioannou A, Bell D, Monaghan A, Arzoky M. MedT5SQL: a transformers-based large language model for text-to-SQL conversion in the healthcare domain. Front Big Data 2024; 7:1371680. [PMID: 38988646] [PMCID: PMC11233734] [DOI: 10.3389/fdata.2024.1371680]
Abstract
Introduction In response to the increasing prevalence of electronic medical records (EMRs) stored in databases, healthcare staff are encountering difficulties retrieving these records due to their limited technical expertise in database operations. As these records are crucial for delivering appropriate medical care, there is a need for an accessible method for healthcare staff to access EMRs. Methods To address this, natural language processing (NLP) for Text-to-SQL has emerged as a solution, enabling non-technical users to generate SQL queries using natural language text. This research assesses existing work on Text-to-SQL conversion and proposes the MedT5SQL model specifically designed for EMR retrieval. The proposed model utilizes the Text-to-Text Transfer Transformer (T5) model, a Large Language Model (LLM) commonly used in various text-based NLP tasks. The model is fine-tuned on the MIMICSQL dataset, the first Text-to-SQL dataset for the healthcare domain. Performance evaluation involves benchmarking the MedT5SQL model on two optimizers, varying numbers of training epochs, and two datasets, MIMICSQL and WikiSQL. Results On the MIMICSQL dataset, the model demonstrates considerable effectiveness in generating SQL queries, achieving 80.63% exact-match accuracy, 98.937% approximate string-matching accuracy, and 90% accuracy on manual evaluation. On the WikiSQL dataset, the model achieves 44.2% exact-match accuracy and 94.26% approximate string-matching accuracy. Discussion Results indicate improved performance with increased training epochs. This work highlights the potential of a fine-tuned T5 model to convert medical questions written in natural language to Structured Query Language (SQL) in the healthcare domain, providing a foundation for future research in this area.
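As a rough illustration of this kind of setup (not the authors' configuration; the checkpoint, hyperparameters, and two-example dataset are assumptions), a T5 model can be fine-tuned on question-to-SQL pairs with Hugging Face transformers:

```python
# Hedged sketch: fine-tuning T5 on question -> SQL pairs, in the spirit
# of MedT5SQL. Checkpoint, hyperparameters, and data are illustrative.
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

pairs = [
    ("translate question to SQL: how many patients are older than 80?",
     "SELECT COUNT(*) FROM demographic WHERE age > 80"),
    ("translate question to SQL: list admission ids for patient 123",
     "SELECT hadm_id FROM admissions WHERE subject_id = 123"),
]

class SQLPairs(Dataset):
    def __len__(self):
        return len(pairs)

    def __getitem__(self, i):
        question, sql = pairs[i]
        enc = tokenizer(question, truncation=True, max_length=128,
                        padding="max_length", return_tensors="pt")
        labels = tokenizer(sql, truncation=True, max_length=128,
                           padding="max_length", return_tensors="pt").input_ids
        labels[labels == tokenizer.pad_token_id] = -100  # mask pad in loss
        return {"input_ids": enc.input_ids[0],
                "attention_mask": enc.attention_mask[0],
                "labels": labels[0]}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="medt5sql-demo", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=SQLPairs(),
)
trainer.train()  # exact-match evaluation would compare generated SQL strings
```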
Affiliations
- Alaa Marshan: School of Computer Science and Electronic Engineering, University of Surrey, Guildford, United Kingdom
- Athina Ioannou: Surrey Business School, University of Surrey, Guildford, United Kingdom
- David Bell: Department of Computer Science, Brunel University London, London, United Kingdom
- Asmat Monaghan: School of Business and Management, Royal Holloway, University of London, London, United Kingdom
- Mahir Arzoky: Department of Computer Science, Brunel University London, London, United Kingdom
6. Pividori M, Greene CS. A publishing infrastructure for Artificial Intelligence (AI)-assisted academic authoring. J Am Med Inform Assoc 2024:ocae139. [PMID: 38879443] [DOI: 10.1093/jamia/ocae139]
Abstract
OBJECTIVE Investigate the use of advanced natural language processing models to streamline the time-consuming process of writing and revising scholarly manuscripts. MATERIALS AND METHODS For this purpose, we integrate large language models into the Manubot publishing ecosystem to suggest revisions for scholarly texts. Our AI-based revision workflow employs a prompt generator that incorporates manuscript metadata into templates, generating section-specific instructions for the language model. The model then generates revised versions of each paragraph for human authors to review. We evaluated this methodology through 5 case studies of existing manuscripts, including the revision of this manuscript. RESULTS Our results indicate that these models, despite some limitations, can grasp complex academic concepts and enhance text quality. All changes to the manuscript are tracked using a version control system, ensuring transparency in distinguishing between human- and machine-generated text. CONCLUSIONS Given the significant time researchers invest in crafting prose, incorporating large language models into the scholarly writing process can significantly improve the type of knowledge work performed by academics. Our approach also enables scholars to concentrate on critical aspects of their work, such as the novelty of their ideas, while automating tedious tasks like adhering to specific writing styles. Although the use of AI-assisted tools in scientific authoring is controversial, our approach, which focuses on revising human-written text and provides change-tracking transparency, can mitigate concerns regarding AI's role in scientific writing.
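The prompt generator described above, which folds manuscript metadata into section-specific templates, can be approximated in a few lines. The template wording and metadata fields below are invented for illustration, not Manubot's actual templates:

```python
# Hedged sketch: a section-aware prompt generator for LLM-assisted
# manuscript revision. Template text and metadata fields are invented.
SECTION_HINTS = {
    "abstract": "Make it a self-contained summary under 250 words.",
    "introduction": "Motivate the problem and state the contribution.",
    "methods": "Keep technical details precise and reproducible.",
}

def build_revision_prompt(title: str, keywords: list[str],
                          section: str, paragraph: str) -> str:
    hint = SECTION_HINTS.get(section, "Improve clarity and concision.")
    return (
        f"You are revising the {section} of a manuscript titled "
        f"'{title}' (keywords: {', '.join(keywords)}). {hint}\n"
        "Revise the following paragraph, preserving its meaning:\n\n"
        f"{paragraph}"
    )

print(build_revision_prompt(
    title="AI-assisted academic authoring",
    keywords=["large language models", "publishing"],
    section="introduction",
    paragraph="Writing manuscripts take a long time and is error prone.",
))
```

Each revised paragraph can then be committed separately, so version control preserves the human/machine provenance the authors emphasize.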
Affiliations
- Milton Pividori: Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO 80045, United States; Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, United States
- Casey S Greene: Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO 80045, United States; Center for Health AI, Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO 80045, United States
7. Wu Y, Wu M, Wang C, Lin J, Liu J, Liu S. Evaluating the Prevalence of Burnout Among Health Care Professionals Related to Electronic Health Record Use: Systematic Review and Meta-Analysis. JMIR Med Inform 2024; 12:e54811. [PMID: 38865188] [PMCID: PMC11208837] [DOI: 10.2196/54811]
Abstract
BACKGROUND Burnout among health care professionals is a significant concern, with detrimental effects on health care service quality and patient outcomes. The use of the electronic health record (EHR) system has been identified as a significant contributor to burnout among health care professionals. OBJECTIVE This systematic review and meta-analysis aims to assess the prevalence of burnout among health care professionals associated with the use of the EHR system, thereby providing evidence to improve health information systems and develop strategies to measure and mitigate burnout. METHODS We conducted a comprehensive search of the PubMed, Embase, and Web of Science databases for English-language peer-reviewed articles published between January 1, 2009, and December 31, 2022. Two independent reviewers applied inclusion and exclusion criteria, and study quality was assessed using the Joanna Briggs Institute checklist and the Newcastle-Ottawa Scale. Meta-analyses were performed using R (version 4.1.3; R Foundation for Statistical Computing), with EndNote X7 (Clarivate) for reference management. RESULTS The review included 32 cross-sectional studies and 5 case-control studies with a total of 66,556 participants, mainly physicians and registered nurses. The pooled prevalence of burnout among health care professionals in cross-sectional studies was 40.4% (95% CI 37.5%-43.2%). Case-control studies indicated a higher likelihood of burnout among health care professionals who spent more time on EHR-related tasks outside work (odds ratio 2.43, 95% CI 2.31-2.57). CONCLUSIONS The findings highlight the association between the increased use of the EHR system and burnout among health care professionals. Potential solutions include optimizing EHR systems, implementing automated dictation or note-taking, employing scribes to reduce documentation burden, and leveraging artificial intelligence to enhance EHR system efficiency and reduce the risk of burnout. TRIAL REGISTRATION PROSPERO International Prospective Register of Systematic Reviews CRD42021281173; https://www.crd.york.ac.uk/prospero/display_record.php?ID=CRD42021281173.
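The pooled prevalence reported above comes from a standard random-effects meta-analysis. A minimal sketch of DerSimonian-Laird pooling of study proportions, with invented study counts:

```python
# Hedged sketch: random-effects pooling of study prevalences
# (DerSimonian-Laird on raw proportions). Study counts are invented.
import numpy as np

events = np.array([120, 340, 80])   # professionals reporting burnout
totals = np.array([300, 900, 180])  # participants per study

p = events / totals
var = p * (1 - p) / totals           # binomial variance per study

w = 1 / var                          # fixed-effect weights
p_fixed = np.sum(w * p) / np.sum(w)
q = np.sum(w * (p - p_fixed) ** 2)   # Cochran's Q heterogeneity statistic
df = len(p) - 1
c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
tau2 = max(0.0, (q - df) / c)        # between-study variance estimate

w_re = 1 / (var + tau2)              # random-effects weights
p_re = np.sum(w_re * p) / np.sum(w_re)
se = np.sqrt(1 / np.sum(w_re))
print(f"pooled prevalence = {p_re:.3f} "
      f"(95% CI {p_re - 1.96 * se:.3f} to {p_re + 1.96 * se:.3f})")
```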
Affiliations
- Yuxuan Wu: Department of Medical Informatics, West China Hospital, Sichuan University, Chengdu, China
- Mingyue Wu: Information Center, West China Hospital, Sichuan University, Chengdu, China
- Changyu Wang: West China College of Stomatology, Sichuan University, Chengdu, China
- Jie Lin: Department of Oral Implantology, West China Hospital of Stomatology, Sichuan University, Chengdu, China
- Jialin Liu: Department of Medical Informatics, West China Hospital, Sichuan University, Chengdu, China; Information Center, West China Hospital, Sichuan University, Chengdu, China
- Siru Liu: Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States
8. Rao SJ, Isath A, Krishnan P, Tangsrivimol JA, Virk HUH, Wang Z, Glicksberg BS, Krittanawong C. ChatGPT: A Conceptual Review of Applications and Utility in the Field of Medicine. J Med Syst 2024; 48:59. [PMID: 38836893] [DOI: 10.1007/s10916-024-02075-x]
Abstract
Artificial intelligence, specifically advanced language models such as ChatGPT, has the potential to revolutionize various aspects of healthcare, medical education, and research. In this narrative review, we evaluate the myriad applications of ChatGPT in diverse healthcare domains. We discuss its potential role in clinical decision-making, exploring how it can assist physicians by providing rapid, data-driven insights for diagnosis and treatment. We review the benefits of ChatGPT in personalized patient care, particularly in geriatric care, medication management, weight loss and nutrition, and physical activity guidance. We further delve into its potential to enhance medical research through the analysis of large datasets and the development of novel methodologies. In the realm of medical education, we investigate the utility of ChatGPT as an information retrieval tool and personalized learning resource for medical students and professionals. There are numerous promising applications of ChatGPT that will likely induce paradigm shifts in healthcare practice, education, and research. The use of ChatGPT may come with several benefits in areas such as clinical decision making, geriatric care, medication management, weight loss and nutrition, physical fitness, scientific research, and medical education. Nevertheless, it is important to note that issues surrounding ethics, data privacy, transparency, inaccuracy, and inadequacy persist. Prior to widespread use in medicine, it is imperative to objectively evaluate the impact of ChatGPT in a real-world setting using a risk-based approach.
Affiliations
- Shiavax J Rao: Department of Medicine, MedStar Union Memorial Hospital, Baltimore, MD, USA
- Ameesh Isath: Department of Cardiology, Westchester Medical Center and New York Medical College, Valhalla, NY, USA
- Parvathy Krishnan: Department of Pediatrics, Westchester Medical Center and New York Medical College, Valhalla, NY, USA
- Jonathan A Tangsrivimol: Division of Neurosurgery, Department of Surgery, Chulabhorn Hospital, Chulabhorn Royal Academy, Bangkok, 10210, Thailand; Department of Neurological Surgery, Weill Cornell Medicine Brain and Spine Center, New York, NY, 10022, USA
- Hafeez Ul Hassan Virk: Harrington Heart & Vascular Institute, Case Western Reserve University, University Hospitals Cleveland Medical Center, Cleveland, OH, USA
- Zhen Wang: Robert D. and Patricia E. Kern Center for the Science of Health Care Delivery, Mayo Clinic, Rochester, MN, USA; Division of Health Care Policy and Research, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Benjamin S Glicksberg: Hasso Plattner Institute for Digital Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Chayakrit Krittanawong: Cardiology Division, NYU Langone Health and NYU School of Medicine, 550 First Avenue, New York, NY, 10016, USA
9. Eltaybani S. The Transformative Role of Large Language Models in Post-Acute and Long-Term Care. J Am Med Dir Assoc 2024; 25:104982. [PMID: 38614135] [DOI: 10.1016/j.jamda.2024.03.002]
Affiliations
- Sameh Eltaybani: Global Nursing Research Center, The University of Tokyo, Tokyo, Japan
10. Mao C, Zhang T. A commentary on can ChatGPT assist urologists manage overactive bladder? Int J Surg 2024; 110:3970-3971. [PMID: 38446864] [PMCID: PMC11175744] [DOI: 10.1097/js9.0000000000001261]
Affiliations
- Changkun Mao: Department of Urology, Anhui Provincial Children’s Hospital, Hefei, Anhui, People’s Republic of China
11. Zhang A, Dimock E, Gupta R, Chen K. The new frontier: utilizing ChatGPT to expand craniofacial research. Arch Craniofac Surg 2024; 25:116-122. [PMID: 38977396] [PMCID: PMC11231409] [DOI: 10.7181/acfs.2024.00115]
Abstract
BACKGROUND Due to the importance of evidence-based research in plastic surgery, the authors of this study aimed to assess the accuracy of ChatGPT in generating novel systematic review ideas within the field of craniofacial surgery. METHODS ChatGPT was prompted to generate 20 novel systematic review ideas for 10 different subcategories within the field of craniofacial surgery. For each topic, the chatbot was told to give 10 "general" and 10 "specific" ideas related to the concept. To determine the accuracy of ChatGPT, a literature review was conducted using PubMed, CINAHL, Embase, and Cochrane. RESULTS In total, 200 systematic review research ideas were generated by ChatGPT. We found that the algorithm had an overall 57.5% accuracy at identifying novel systematic review ideas. ChatGPT was found to be 39% accurate for general topics and 76% accurate for specific topics. CONCLUSION Craniofacial surgeons should use ChatGPT as a tool. We found that ChatGPT provided more precise answers with specific research questions than with general questions and helped narrow the search scope, leading to more relevant and accurate responses. Beyond research purposes, ChatGPT can augment patient consultations, improve healthcare equity, and assist in clinical decision-making. With rapid advancements in artificial intelligence (AI), it is important for plastic surgeons to consider using AI in their clinical practice to improve patient-centered outcomes.
Affiliations
- Andi Zhang: Division of Plastic and Reconstructive Surgery, Saint Louis University School of Medicine, St. Louis, MO, USA
- Ethan Dimock: Oakland University William Beaumont School of Medicine, Rochester, MI, USA
- Rohun Gupta: Division of Plastic and Reconstructive Surgery, Saint Louis University School of Medicine, St. Louis, MO, USA
- Kevin Chen: Division of Plastic and Reconstructive Surgery, Saint Louis University School of Medicine, St. Louis, MO, USA
12. Balasanjeevi G, Surapaneni KM. Comparison of ChatGPT version 3.5 & 4 for utility in respiratory medicine education using clinical case scenarios. Respir Med Res 2024; 85:101091. [PMID: 38657295] [DOI: 10.1016/j.resmer.2024.101091]
Abstract
Integration of ChatGPT in respiratory medicine presents a promising avenue for enhancing clinical practice and pedagogical approaches. This study compares the performance of ChatGPT versions 3.5 and 4 in respiratory medicine, emphasizing their potential in clinical decision support and medical education using clinical cases. Results indicate moderate performance, highlighting limitations in handling complex case scenarios. Compared to ChatGPT 3.5, version 4 showed greater promise as a pedagogical tool, providing interactive learning experiences. While ChatGPT may serve clinically as a preliminary decision support tool, caution is advised, and ongoing validation is needed. Future research should refine its clinical capabilities for optimal integration into medical education and practice.
Affiliations
- Gayathri Balasanjeevi: Department of Tuberculosis & Respiratory Diseases, Panimalar Medical College Hospital & Research Institute, Varadharajapuram, Poonamallee, Chennai 600 123, Tamil Nadu, India
- Krishna Mohan Surapaneni: Department of Biochemistry, Panimalar Medical College Hospital & Research Institute, Varadharajapuram, Poonamallee, Chennai 600 123, Tamil Nadu, India; Department of Medical Education, Panimalar Medical College Hospital & Research Institute, Varadharajapuram, Poonamallee, Chennai 600 123, Tamil Nadu, India
13. Lorenzi A, Pugliese G, Maniaci A, Lechien JR, Allevi F, Boscolo-Rizzo P, Vaira LA, Saibene AM. Reliability of large language models for advanced head and neck malignancies management: a comparison between ChatGPT 4 and Gemini Advanced. Eur Arch Otorhinolaryngol 2024. [PMID: 38795148] [DOI: 10.1007/s00405-024-08746-2]
Abstract
PURPOSE This study evaluates the efficacy of two advanced Large Language Models (LLMs), OpenAI's ChatGPT 4 and Google's Gemini Advanced, in providing treatment recommendations for head and neck oncology cases. The aim is to assess their utility in supporting multidisciplinary oncological evaluations and decision-making processes. METHODS This comparative analysis examined the responses of ChatGPT 4 and Gemini Advanced to five hypothetical cases of head and neck cancer, each representing a different anatomical subsite. The responses were evaluated against the latest National Comprehensive Cancer Network (NCCN) guidelines by two blinded panels using the total disagreement score (TDS) and the artificial intelligence performance instrument (AIPI). Statistical assessments were performed using the Wilcoxon signed-rank test and the Friedman test. RESULTS Both LLMs produced relevant treatment recommendations with ChatGPT 4 generally outperforming Gemini Advanced regarding adherence to guidelines and comprehensive treatment planning. ChatGPT 4 showed higher AIPI scores (median 3 [2-4]) compared to Gemini Advanced (median 2 [2-3]), indicating better overall performance. Notably, inconsistencies were observed in the management of induction chemotherapy and surgical decisions, such as neck dissection. CONCLUSIONS While both LLMs demonstrated the potential to aid in the multidisciplinary management of head and neck oncology, discrepancies in certain critical areas highlight the need for further refinement. The study supports the growing role of AI in enhancing clinical decision-making but also emphasizes the necessity for continuous updates and validation against current clinical standards to integrate AI into healthcare practices fully.
Affiliations
- Andrea Lorenzi: Division of Otolaryngology, Department of Surgical Sciences, Università degli Studi di Torino, Turin, Italy
- Giorgia Pugliese: Otolaryngology Unit, Santi Paolo e Carlo Hospital, Department of Health Sciences, Università degli Studi di Milano, Milan, Italy
- Antonino Maniaci: Faculty of Medicine and Surgery, "Kore" University of Enna, Enna, Italy; International Federation of Otorhinolaryngological Societies (YO-IFOS) Head and Neck Research Group, Paris, France
- Jerome R Lechien: International Federation of Otorhinolaryngological Societies (YO-IFOS) Head and Neck Research Group, Paris, France; Department of Otolaryngology-Head and Neck Surgery, Foch Hospital, School of Medicine, University Paris Saclay, Paris, France
- Fabiana Allevi: International Federation of Otorhinolaryngological Societies (YO-IFOS) Head and Neck Research Group, Paris, France; Maxillofacial Surgery Unit, Santi Paolo e Carlo Hospital, Department of Health Sciences, Università degli Studi di Milano, Milan, Italy
- Paolo Boscolo-Rizzo: Department of Medical, Surgical and Health Sciences, Section of Otolaryngology, University of Trieste, Trieste, Italy
- Luigi Angelo Vaira: International Federation of Otorhinolaryngological Societies (YO-IFOS) Head and Neck Research Group, Paris, France; Maxillofacial Surgery Operative Unit, Department of Medicine, Surgery and Pharmacy, University of Sassari, Sassari, Italy; Biomedical Science PhD School, Biomedical Science Department, University of Sassari, Sassari, Italy
- Alberto Maria Saibene: Otolaryngology Unit, Santi Paolo e Carlo Hospital, Department of Health Sciences, Università degli Studi di Milano, Milan, Italy; International Federation of Otorhinolaryngological Societies (YO-IFOS) Head and Neck Research Group, Paris, France
14. Yang J, Walker KC, Bekar-Cesaretli AA, Hao B, Bhadelia N, Joseph-McCarthy D, Paschalidis IC. Automating biomedical literature review for rapid drug discovery: Leveraging GPT-4 to expedite pandemic response. Int J Med Inform 2024; 189:105500. [PMID: 38815316] [DOI: 10.1016/j.ijmedinf.2024.105500]
Abstract
OBJECTIVE The rapid expansion of the biomedical literature challenges traditional review methods, especially during outbreaks of emerging infectious diseases when quick action is critical. Our study aims to explore the potential of ChatGPT to automate the biomedical literature review for rapid drug discovery. MATERIALS AND METHODS We introduce a novel automated pipeline that helps identify drugs for a given virus in response to a potential future global health threat. Our approach can be used to select PubMed articles identifying a drug target for the given virus. We tested our approach on two known pathogens: SARS-CoV-2, where the literature is vast, and Nipah, where the literature is sparse. Specifically, a panel of three experts reviewed a set of PubMed articles and labeled them as either describing a drug target for the given virus or not. The same task was given to the automated pipeline, and its performance was judged by whether it labeled the articles the same way as the human experts. We applied a number of prompt engineering techniques to improve the performance of ChatGPT. RESULTS Our best configuration used GPT-4 by OpenAI and achieved an out-of-sample validation performance with accuracy/F1-score/sensitivity/specificity of 92.87%/88.43%/83.38%/97.82% for SARS-CoV-2 and 87.40%/73.90%/74.72%/91.36% for Nipah. CONCLUSION These results highlight the utility of ChatGPT in drug discovery and development and reveal its potential to enable rapid drug target identification during a pandemic-level health emergency.
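At its core, the screening step is a constrained yes/no classification over abstracts. A hedged sketch of such a labeling call; the model name and prompt wording are assumptions, not the authors' exact pipeline:

```python
# Hedged sketch: zero-shot labeling of PubMed abstracts for whether they
# identify a drug target for a given virus. Prompt and model are invented.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def describes_drug_target(abstract: str, virus: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-4",   # placeholder model name
        temperature=0,   # deterministic labels suit screening
        messages=[{
            "role": "user",
            "content": (f"Does the following abstract identify a drug or "
                        f"drug target for {virus}? Answer YES or NO only.\n\n"
                        f"{abstract}"),
        }],
    )
    reply = (response.choices[0].message.content or "").strip().upper()
    return reply.startswith("YES")

print(describes_drug_target(
    "We show that compound X inhibits the viral polymerase in vitro...",
    "SARS-CoV-2"))
```

Accuracy, F1, sensitivity, and specificity then follow by comparing these labels with the expert panel's.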
Affiliations
- Jingmei Yang: Department of Electrical & Computer Engineering and Division of Systems Engineering, Boston University, Boston, MA, United States of America
- Kenji C Walker: Department of Biomedical Engineering, Boston University, Boston, MA, United States of America
- Boran Hao: Department of Electrical & Computer Engineering and Division of Systems Engineering, Boston University, Boston, MA, United States of America
- Nahid Bhadelia: Chobanian & Avedisian School of Medicine and Center for Emerging Infectious Diseases Policy and Research, Boston University, Boston, MA, United States of America
- Diane Joseph-McCarthy: Department of Biomedical Engineering, Boston University, Boston, MA, United States of America
- Ioannis Ch Paschalidis: Department of Electrical & Computer Engineering and Division of Systems Engineering, Boston University, Boston, MA, United States of America; Department of Biomedical Engineering, Boston University, Boston, MA, United States of America; Faculty of Computing & Data Sciences, Boston University, Boston, MA, United States of America
15. Naqvi WM, Shaikh SZ, Mishra GV. Large language models in physical therapy: time to adapt and adept. Front Public Health 2024; 12:1364660. [PMID: 38887241] [PMCID: PMC11182445] [DOI: 10.3389/fpubh.2024.1364660]
Abstract
Healthcare is experiencing a transformative phase driven by artificial intelligence (AI) and machine learning (ML). Physical therapists (PTs) stand on the brink of a paradigm shift in education, practice, and research. Rather than a threat, AI presents an opportunity for revolution. This paper examines how large language models (LLMs), such as ChatGPT and BioMedLM, driven by deep ML, can offer human-like performance yet face accuracy challenges given the vast data of PT and rehabilitation practice. PTs can benefit from developing and training an LLM specifically for streamlining administrative tasks, connecting globally, and customizing treatments. However, human touch and creativity remain invaluable. This paper urges PTs to engage in learning and shaping AI models, highlighting the need for ethical use and human supervision to address potential biases. Embracing AI as contributors, and not just users, is crucial: by integrating AI and fostering collaboration, PTs can help build a future in which AI enriches the field, provided data accuracy and the challenges of feeding AI models are sensitively addressed.
Affiliations
- Waqar M. Naqvi: Department of Interdisciplinary Sciences, Datta Meghe Institute of Higher Education and Research, Wardha, India; Department of Physiotherapy, College of Health Sciences, Gulf Medical University, Ajman, United Arab Emirates; NKP Salve Institute of Medical Sciences and Research Center, Nagpur, India
- Summaiya Zareen Shaikh: Department of Neuro-Physiotherapy, The SIA College of Health Sciences, College of Physiotherapy, Thane, India
- Gaurav V. Mishra: Department of Radiodiagnosis, Datta Meghe Institute of Higher Education and Research, Wardha, India
16. Liu S, McCoy AB, Wright AP, Nelson SD, Huang SS, Ahmad HB, Carro SE, Franklin J, Brogan J, Wright A. Why do users override alerts? Utilizing large language model to summarize comments and optimize clinical decision support. J Am Med Inform Assoc 2024; 31:1388-1396. [PMID: 38452289] [PMCID: PMC11105133] [DOI: 10.1093/jamia/ocae041]
Abstract
OBJECTIVES To evaluate the capability of using generative artificial intelligence (AI) in summarizing alert comments and to determine if the AI-generated summary could be used to improve clinical decision support (CDS) alerts. MATERIALS AND METHODS We extracted user comments to alerts generated from September 1, 2022 to September 1, 2023 at Vanderbilt University Medical Center. For a subset of 8 alerts, comment summaries were generated independently by 2 physicians and then separately by GPT-4. We surveyed 5 CDS experts to rate the human-generated and AI-generated summaries on a scale from 1 (strongly disagree) to 5 (strongly agree) for the 4 metrics: clarity, completeness, accuracy, and usefulness. RESULTS Five CDS experts participated in the survey. A total of 16 human-generated summaries and 8 AI-generated summaries were assessed. Among the top 8 rated summaries, 5 were generated by GPT-4. AI-generated summaries demonstrated high levels of clarity, accuracy, and usefulness, similar to the human-generated summaries. Moreover, AI-generated summaries exhibited significantly higher completeness and usefulness compared to the human-generated summaries (AI: 3.4 ± 1.2, human: 2.7 ± 1.2, P = .001). CONCLUSION End-user comments provide clinicians' immediate feedback to CDS alerts and can serve as a direct and valuable data resource for improving CDS delivery. Traditionally, these comments may not be considered in the CDS review process due to their unstructured nature, large volume, and the presence of redundant or irrelevant content. Our study demonstrates that GPT-4 is capable of distilling these comments into summaries characterized by high clarity, accuracy, and completeness. AI-generated summaries are equivalent and potentially better than human-generated summaries. These AI-generated summaries could provide CDS experts with a novel means of reviewing user comments to rapidly optimize CDS alerts both online and offline.
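Distilling large volumes of override comments into a reviewable summary, as evaluated above, amounts to batching the free text into a single summarization prompt. A minimal sketch with invented comments; the prompt wording and model name are assumptions:

```python
# Hedged sketch: summarizing clinician override comments for one CDS
# alert with an LLM. Comments, prompt, and model name are invented.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

comments = [
    "Patient already on an equivalent medication.",
    "Alert fired for a drug that was discontinued yesterday.",
    "Dose is intentional per nephrology recommendation.",
]

prompt = ("Summarize the recurring reasons clinicians gave for overriding "
          "this alert, in at most three bullet points:\n\n- "
          + "\n- ".join(comments))

summary = client.chat.completions.create(
    model="gpt-4",  # placeholder model name
    temperature=0,
    messages=[{"role": "user", "content": prompt}],
).choices[0].message.content
print(summary)  # reviewers would rate clarity, completeness, accuracy
```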
Affiliations
- Siru Liu: Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212, United States; Department of Computer Science, Vanderbilt University, Nashville, TN 37212, United States
- Allison B McCoy: Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212, United States
- Aileen P Wright: Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212, United States; Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37212, United States
- Scott D Nelson: Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212, United States
- Sean S Huang: Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212, United States; Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37212, United States
- Hasan B Ahmad: Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA 98195, United States
- Sabrina E Carro: Department of Pediatrics, Vanderbilt University Medical Center, Nashville, TN 37212, United States
- Jacob Franklin: Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37212, United States
- James Brogan: Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37212, United States
- Adam Wright: Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212, United States; Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37212, United States
17. Iqbal U, Lee LTJ, Rahmanti AR, Celi LA, Li YCJ. Can large language models provide secondary reliable opinion on treatment options for dermatological diseases? J Am Med Inform Assoc 2024; 31:1341-1347. [PMID: 38578616] [PMCID: PMC11105123] [DOI: 10.1093/jamia/ocae067]
Abstract
OBJECTIVE To investigate the consistency and reliability of medication recommendations provided by ChatGPT for common dermatological conditions, highlighting the potential for ChatGPT to offer second opinions in patient treatment while also delineating possible limitations. MATERIALS AND METHODS In this mixed-methods study, we used survey questions in April 2023 for drug recommendations generated by ChatGPT with data from secondary databases, that is, Taiwan's National Health Insurance Research Database and a US medical center database, and validated by dermatologists. The methodology included preprocessing queries, executing them multiple times, and evaluating ChatGPT responses against the databases and dermatologists. The ChatGPT-generated responses were analyzed statistically in a disease-drug matrix, considering disease-medication associations (Q-value) and expert evaluation. RESULTS ChatGPT achieved a high 98.87% dermatologist approval rate for common dermatological medication recommendations. We evaluated its drug suggestions using the Q-value, showing that human expert validation agreement surpassed Q-value cutoff-based agreement. Varying cutoff values for disease-medication associations, a cutoff of 3 achieved 95.14% accurate prescriptions, 5 yielded 85.42%, and 10 resulted in 72.92%. While ChatGPT offered accurate drug advice, it occasionally included incorrect ATC codes, leading to issues like incorrect drug use and type, nonexistent codes, repeated errors, and incomplete medication codes. CONCLUSION ChatGPT provides medication recommendations as a second opinion in dermatology treatment, but its reliability and comprehensiveness need refinement for greater accuracy. In the future, integrating a medical domain-specific knowledge base for training and ongoing optimization will enhance the precision of ChatGPT's results.
Affiliations
- Usman Iqbal: School of Population Health, Faculty of Medicine and Health, University of New South Wales (UNSW), Sydney, NSW 2052, Australia; Department of Health, Tasmania 7000, Australia; Global Health and Health Security Department, College of Public Health, Taipei Medical University, Taipei 110, Taiwan
- Leon Tsung-Ju Lee: Graduate Institute of Clinical Medicine, Taipei Medical University, Taipei 110, Taiwan; Department of Dermatology, Taipei Medical University Hospital, Taipei Medical University, Taipei 110, Taiwan; Department of Dermatology, School of Medicine, College of Medicine, Taipei Medical University, Taipei 110, Taiwan
- Annisa Ristya Rahmanti: Graduate Institute of Biomedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei 110, Taiwan; International Center for Health Information and Technology, College of Medical Science and Technology, Taipei Medical University, Taipei 110, Taiwan; Department Health Policy and Management, Faculty of Medicine, Public Health and Nursing, Universitas Gadjah Mada, Yogyakarta 55281, Indonesia
- Leo Anthony Celi: Laboratory for Computational Physiology, Massachusetts Institute of Technology, Cambridge, MA 02139, United States; Division of Pulmonary, Critical Care and Sleep Medicine, Beth Israel Deaconess Medical Center, Boston, MA 02215, United States; Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, United States
- Yu-Chuan Jack Li: Graduate Institute of Biomedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei 110, Taiwan; International Center for Health Information and Technology, College of Medical Science and Technology, Taipei Medical University, Taipei 110, Taiwan; Department of Dermatology, Taipei Municipal Wanfang Hospital, Taipei Medical University, Taipei 116, Taiwan; The International Medical Informatics Association (IMIA), Genève CH-1204, Switzerland
18. Liu S, McCoy AB, Wright AP, Carew B, Genkins JZ, Huang SS, Peterson JF, Steitz B, Wright A. Leveraging large language models for generating responses to patient messages-a subjective analysis. J Am Med Inform Assoc 2024; 31:1367-1379. [PMID: 38497958] [PMCID: PMC11105129] [DOI: 10.1093/jamia/ocae052]
Abstract
OBJECTIVE This study aimed to develop and assess the performance of fine-tuned large language models for generating responses to patient messages sent via an electronic health record patient portal. MATERIALS AND METHODS Utilizing a dataset of messages and responses extracted from the patient portal at a large academic medical center, we developed a model (CLAIR-Short) based on a pre-trained large language model (LLaMA-65B). In addition, we used the OpenAI API to update physician responses from an open-source dataset into a format with informative paragraphs that offered patient education while emphasizing empathy and professionalism. By combining our data with this dataset, we further fine-tuned our model (CLAIR-Long). To evaluate the fine-tuned models, we used 10 representative patient portal questions in primary care to generate responses. We asked primary care physicians to review generated responses from our models and ChatGPT and rate them for empathy, responsiveness, accuracy, and usefulness. RESULTS The dataset consisted of 499 794 pairs of patient messages and corresponding responses from the patient portal, with 5000 patient messages and ChatGPT-updated responses from an online platform. Four primary care physicians participated in the survey. CLAIR-Short exhibited the ability to generate concise responses similar to providers' responses. CLAIR-Long responses provided increased patient educational content compared to CLAIR-Short and were rated similarly to ChatGPT's responses, receiving positive evaluations for responsiveness, empathy, and accuracy, while receiving a neutral rating for usefulness. CONCLUSION This subjective analysis suggests that leveraging large language models to generate responses to patient messages demonstrates significant potential in facilitating communication between patients and healthcare providers.
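Full fine-tuning of a 65B-parameter model, as described above, is resource-intensive; the same idea can be sketched with parameter-efficient LoRA adapters on a small open checkpoint. The base model, adapter settings, and example pair below are assumptions, not the CLAIR training recipe:

```python
# Hedged sketch: LoRA fine-tuning of a small causal LM on a patient
# message/response pair. Checkpoint, settings, and text are invented.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "facebook/opt-125m"  # small stand-in; the study used LLaMA-65B
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

model = get_peft_model(model, LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in OPT
))
model.print_trainable_parameters()  # only the small adapters are trained

text = ("Patient: My incision is red and warm. Should I worry?\n"
        "Provider: Some warmth is common after surgery, but spreading "
        "redness, drainage, or fever warrants a same-day call to us.")
batch = tokenizer(text, return_tensors="pt")
loss = model(**batch, labels=batch["input_ids"]).loss  # causal-LM loss
loss.backward()  # one illustrative step; optimizer and loop omitted
```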
Affiliations
- Siru Liu: Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212, United States
- Allison B McCoy: Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212, United States
- Aileen P Wright: Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212, United States; Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37212, United States
- Babatunde Carew: Department of General Internal Medicine and Public Health, Vanderbilt University Medical Center, Nashville, TN 37212, United States
- Julian Z Genkins: Department of Medicine, Stanford University, Stanford, CA 94304, United States
- Sean S Huang: Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212, United States; Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37212, United States
- Josh F Peterson: Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212, United States; Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37212, United States
- Bryan Steitz: Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212, United States
- Adam Wright: Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212, United States
19. Kim H, Park H, Kang S, Kim J, Kim J, Jung J, Taira R. Evaluating the validity of the nursing statements algorithmically generated based on the International Classifications of Nursing Practice for respiratory nursing care using large language models. J Am Med Inform Assoc 2024; 31:1397-1403. [PMID: 38630586] [PMCID: PMC11105147] [DOI: 10.1093/jamia/ocae070]
Abstract
OBJECTIVE This study aims to facilitate the creation of quality standardized nursing statements in South Korea's hospitals using algorithmic generation based on the International Classifications of Nursing Practice (ICNP) and evaluation through large language models. MATERIALS AND METHODS We algorithmically generated 15 972 statements related to acute respiratory care using 117 concepts and concept composition models of ICNP. Human reviewers, Generative Pre-trained Transformers 4.0 (GPT-4.0), and Bio_Clinical Bidirectional Encoder Representations from Transformers (BERT) evaluated the generated statements for validity. The evaluation by GPT-4.0 and Bio_ClinicalBERT was conducted with and without contextual information and training. RESULTS Of the generated statements, 2207 were deemed valid by expert reviewers. GPT-4.0 showed a zero-shot AUC of 0.857, which worsened with contextual information. Bio_ClinicalBERT, after training, improved significantly, reaching an AUC of 0.998. CONCLUSION Bio_ClinicalBERT effectively validates auto-generated nursing statements, offering a promising solution to enhance and streamline healthcare documentation processes.
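Validating generated statements with Bio_ClinicalBERT, as above, is a binary sequence-classification task scored by AUC. A hedged sketch: the checkpoint name is the public one, but the classification head below is untrained and the statements and labels are invented; the study fine-tuned on expert-labeled statements before evaluating:

```python
# Hedged sketch: scoring nursing statements for validity with a
# Bio_ClinicalBERT classifier and computing AUC. Examples are invented,
# and this classification head is untrained (the study fine-tuned first).
import torch
from sklearn.metrics import roc_auc_score
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "emilyalsentzer/Bio_ClinicalBERT"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

statements = [
    "Assessing respiratory rate and effort every four hours",
    "Administering supplemental oxygen to the wound site",  # invalid
]
labels = [1, 0]  # expert validity labels (invented)

batch = tokenizer(statements, padding=True, return_tensors="pt")
with torch.no_grad():
    probs = model(**batch).logits.softmax(-1)[:, 1]  # P(valid)
print("AUC:", roc_auc_score(labels, probs.numpy()))
```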
Affiliation(s)
- Hyeoneui Kim
- College of Nursing, Seoul National University, Seoul, 03080, Republic of Korea
- The Research Institute of Nursing Science, Seoul National University, Seoul, 03080, Republic of Korea
- Center for Human-Caring Nurse Leaders for the Future by Brain Korea 21 (BK 21) Four Project, College of Nursing, Seoul National University, Seoul, 03080, Republic of Korea
| | - Hyewon Park
- College of Nursing, Seoul National University, Seoul, 03080, Republic of Korea
- Samsung Medical Center, Seoul, 06351, Republic of Korea
| | - Sunghoon Kang
- The Department of Science Studies, Seoul National University, Seoul, 08826, Republic of Korea
| | - Jinsol Kim
- College of Nursing, Seoul National University, Seoul, 03080, Republic of Korea
- Center for Human-Caring Nurse Leaders for the Future by Brain Korea 21 (BK 21) Four Project, College of Nursing, Seoul National University, Seoul, 03080, Republic of Korea
| | - Jeongha Kim
- College of Nursing, Seoul National University, Seoul, 03080, Republic of Korea
- Asan Medical Center, Seoul, 05505, Republic of Korea
| | - Jinsun Jung
- College of Nursing, Seoul National University, Seoul, 03080, Republic of Korea
- Center for Human-Caring Nurse Leaders for the Future by Brain Korea 21 (BK 21) Four Project, College of Nursing, Seoul National University, Seoul, 03080, Republic of Korea
| | - Ricky Taira
- The Department of Radiological Science, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, CA 90095, United States
| |
|
20
|
Dos Santos FC, Johnson LG, Madandola OO, Priola KJB, Yao Y, Macieira TGR, Keenan GM. An example of leveraging AI for documentation: ChatGPT-generated nursing care plan for an older adult with lung cancer. J Am Med Inform Assoc 2024:ocae116. PMID: 38758655; DOI: 10.1093/jamia/ocae116.
Abstract
OBJECTIVE Our article demonstrates the effectiveness of using a validated framework to create a ChatGPT prompt that generates valid nursing care plan suggestions for a hypothetical older patient with lung cancer. METHOD This study describes the methodology for creating ChatGPT prompts that generate consistent care plan suggestions and its application to a lung cancer case scenario. After entering a nursing assessment of the patient's condition into ChatGPT, we asked it to generate care plan suggestions. Subsequently, we assessed the quality of the care plans produced by ChatGPT. RESULTS Although only 11 of the 16 suggested care plan terms utilized standardized nursing terminology, the ChatGPT-generated care plan closely matched the gold standard in scope and nature, correctly prioritizing oxygenation and ventilation needs. CONCLUSION Using a validated framework prompt to generate nursing care plan suggestions with ChatGPT demonstrates its potential value as a decision support tool for optimizing cancer care documentation.
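A framework-driven prompt of the kind this abstract describes can be approximated as a structured template filled from the nursing assessment. The sketch below is an assumed, simplified template, not the authors' validated framework; the field names and wording are illustrative.

```python
# Illustrative care-plan prompt builder. Template fields and wording are
# assumptions, not the validated framework used in the study.
def build_care_plan_prompt(assessment: dict) -> str:
    findings = "; ".join(f"{k}: {v}" for k, v in assessment["findings"].items())
    return (
        "You are assisting with nursing care planning.\n"
        f"Patient: {assessment['age']}-year-old with {assessment['diagnosis']}.\n"
        f"Assessment findings: {findings}\n"
        "Task: suggest a prioritized nursing care plan (diagnoses, expected "
        "outcomes, interventions), using standardized nursing terminology "
        "where possible."
    )

prompt = build_care_plan_prompt({
    "age": 72,
    "diagnosis": "stage III lung cancer",
    "findings": {"SpO2": "88% on room air", "respirations": "26/min, labored"},
})
print(prompt)
```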
Affiliation(s)
| | - Lisa G Johnson
- Department of Family, Community, and Health Systems Science, College of Nursing, University of Florida, Gainesville, FL 32610, United States
| | - Olatunde O Madandola
- Department of Family, Community, and Health Systems Science, College of Nursing, University of Florida, Gainesville, FL 32610, United States
| | - Karen J B Priola
- Department of Family, Community, and Health Systems Science, College of Nursing, University of Florida, Gainesville, FL 32610, United States
| | - Yingwei Yao
- Department of Biobehavioral Nursing Science, College of Nursing, University of Florida, Gainesville, FL 32610, United States
| | - Tamara G R Macieira
- Department of Family, Community, and Health Systems Science, College of Nursing, University of Florida, Gainesville, FL 32610, United States
| | - Gail M Keenan
- Department of Family, Community, and Health Systems Science, College of Nursing, University of Florida, Gainesville, FL 32610, United States
| |
|
21
|
Grimm DR, Lee YJ, Hu K, Liu L, Garcia O, Balakrishnan K, Ayoub NF. The utility of ChatGPT as a generative medical translator. Eur Arch Otorhinolaryngol 2024. PMID: 38705894; DOI: 10.1007/s00405-024-08708-8.
Abstract
PURPOSE Large language models continue to dramatically change the medical landscape. We aimed to explore the utility of ChatGPT in providing accurate, actionable, and understandable generative medical translations in English, Spanish, and Mandarin pertaining to otolaryngology. METHODS Responses of GPT-4 to commonly asked patient questions listed in official otolaryngology clinical practice guidelines (CPGs) were evaluated with the Patient Education Materials Assessment Tool-Printable (PEMAT-P). Additional critical elements were identified a priori to evaluate ChatGPT's accuracy and thoroughness in its responses. Multiple fluent speakers of English, Mandarin, and Spanish evaluated each response generated by ChatGPT. RESULTS Total PEMAT-P scores differed between English, Mandarin, and Spanish GPT-4-generated responses, with a moderate effect size of language (eta-squared = 0.07) and scores ranging from 73 to 77 (P-value = 0.03). Overall understandability scores did not differ between English, Mandarin, and Spanish, with a small effect size of language (eta-squared = 0.02) and scores ranging from 76 to 79 (P-value = 0.17); nor did overall actionability scores (eta-squared = 0, scores ranging from 66 to 73, P-value = 0.44). Overall a priori procedure-specific responses similarly did not differ between English, Spanish, and Mandarin (eta-squared = 0.02, scores ranging from 61 to 78, P-value = 0.22). CONCLUSION GPT-4 produces accurate, understandable, and actionable outputs in English, Spanish, and Mandarin. Responses generated by GPT-4 in Spanish and Mandarin are comparable to their English counterparts, indicating a novel use for these models within otolaryngology and implications for bridging healthcare access and literacy gaps. LEVEL OF EVIDENCE IV.
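The language effect sizes reported above are eta-squared values, i.e. the between-group share of total variance in PEMAT-P scores. A small worked sketch with invented scores:

```python
# Eta-squared for a one-way comparison of PEMAT-P scores across languages:
# SS_between / SS_total. The score arrays are invented for illustration.
import numpy as np

groups = {
    "English": np.array([77, 79, 75, 78]),
    "Spanish": np.array([74, 73, 76, 72]),
    "Mandarin": np.array([73, 75, 74, 76]),
}

all_scores = np.concatenate(list(groups.values()))
grand_mean = all_scores.mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups.values())
ss_total = ((all_scores - grand_mean) ** 2).sum()
print(f"eta-squared = {ss_between / ss_total:.3f}")
```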
Affiliation(s)
- David R Grimm
- Division of Pediatric Otolaryngology, Department of Otolaryngology-Head and Neck Surgery, Stanford University School of Medicine, Stanford, CA, 94305, USA
| | - Yu-Jin Lee
- Division of Pediatric Otolaryngology, Department of Otolaryngology-Head and Neck Surgery, Stanford University School of Medicine, Stanford, CA, 94305, USA
| | - Katherine Hu
- Division of Pediatric Otolaryngology, Department of Otolaryngology-Head and Neck Surgery, Stanford University School of Medicine, Stanford, CA, 94305, USA
| | - Longsha Liu
- Division of Pediatric Otolaryngology, Department of Otolaryngology-Head and Neck Surgery, Stanford University School of Medicine, Stanford, CA, 94305, USA
| | - Omar Garcia
- Division of Pediatric Otolaryngology, Department of Otolaryngology-Head and Neck Surgery, Stanford University School of Medicine, Stanford, CA, 94305, USA
| | - Karthik Balakrishnan
- Division of Pediatric Otolaryngology, Department of Otolaryngology-Head and Neck Surgery, Stanford University School of Medicine, Stanford, CA, 94305, USA
| | - Noel F Ayoub
- Division of Pediatric Otolaryngology, Department of Otolaryngology-Head and Neck Surgery, Stanford University School of Medicine, Stanford, CA, 94305, USA.
- Division of Rhinology and Skull Base Surgery, Department of Otolaryngology-Head and Neck Surgery, Mass Eye and Ear, 243 Charles Street, Boston, MA, 02114, USA.
| |
|
22
|
Ferdush J, Begum M, Hossain ST. ChatGPT and Clinical Decision Support: Scope, Application, and Limitations. Ann Biomed Eng 2024; 52:1119-1124. PMID: 37516680; DOI: 10.1007/s10439-023-03329-4.
Abstract
This study examines ChatGPT's role in clinical decision support by analyzing its scope, applications, and limitations. By analyzing patient data and providing evidence-based recommendations, ChatGPT, an AI language model, can help healthcare professionals make well-informed decisions. The study examines ChatGPT's use in clinical decision support, including diagnosis and treatment planning; acknowledges limitations such as biases, lack of contextual understanding, and the need for human oversight; and proposes a framework for future clinical decision support systems. Understanding these factors will allow healthcare professionals to utilize ChatGPT effectively and make accurate clinical decisions. Further research is needed to understand the implications of using ChatGPT in healthcare settings and to develop safeguards for its responsible use.
Affiliation(s)
- Jannatul Ferdush
- Department of Computer Science and Engineering, Jashore University of Science and Technology, Jashore, 7408, Bangladesh.
| | - Mahbuba Begum
- Department of Computer Science and Engineering, Mawlana Bhashani Science and Technology University, Tangail, 1902, Bangladesh
| | - Sakib Tanvir Hossain
- Department of Mechanical Engineering, Khulna University of Engineering and Technology, Khulna, 9203, Bangladesh
| |
|
23
|
Wu J, Wu X, Qiu Z, Li M, Lin S, Zhang Y, Zheng Y, Yuan C, Yang J. Large language models leverage external knowledge to extend clinical insight beyond language boundaries. J Am Med Inform Assoc 2024:ocae079. PMID: 38684792; DOI: 10.1093/jamia/ocae079.
Abstract
OBJECTIVES Large Language Models (LLMs) such as ChatGPT and Med-PaLM have excelled in various medical question-answering tasks. However, these English-centric models encounter challenges in non-English clinical settings, primarily due to limited clinical knowledge in the respective languages, a consequence of imbalanced training corpora. We systematically evaluated LLMs in the Chinese medical context and developed a novel in-context learning framework to enhance their performance. MATERIALS AND METHODS The latest China National Medical Licensing Examination (CNMLE-2022) served as the benchmark. We collected 53 medical books and 381 149 medical questions to construct the medical knowledge base and question bank. The proposed Knowledge and Few-shot Enhancement In-context Learning (KFE) framework leverages the in-context learning ability of LLMs to integrate diverse external clinical knowledge sources. We evaluated KFE with ChatGPT (GPT-3.5), GPT-4, Baichuan2-7B, Baichuan2-13B, and QWEN-72B on CNMLE-2022 and further investigated the effectiveness of different pathways for incorporating medical knowledge into LLMs from 7 distinct perspectives. RESULTS Directly applying ChatGPT failed to qualify for the CNMLE-2022, with a score of 51. Coupled with the KFE framework, LLMs of varying sizes yielded consistent and significant improvements. ChatGPT's performance surged to 70.04, and GPT-4 achieved the highest score of 82.59. This surpasses the qualification threshold (60) and exceeds the average human score of 68.70, affirming the effectiveness and robustness of the framework. KFE also enabled the smaller Baichuan2-13B to pass the examination, showcasing its great potential in low-resource settings. DISCUSSION AND CONCLUSION This study sheds light on optimal practices for enhancing the capabilities of LLMs in non-English medical scenarios. By synergizing medical knowledge through in-context learning, LLMs can extend clinical insight beyond language barriers in healthcare, significantly reducing language-related disparities in LLM applications and ensuring global benefit in this field.
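In spirit, the KFE framework retrieves relevant external knowledge and few-shot examples and packs them into the prompt. The sketch below approximates this with TF-IDF retrieval over a toy knowledge base; the corpus, example, and prompt format are invented placeholders, not the published implementation.

```python
# Toy sketch of knowledge-enhanced in-context learning: retrieve the most
# relevant knowledge passages for a question, then prepend them and a worked
# example to the prompt. All content below is invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

knowledge_base = [
    "First-line treatment for uncomplicated hypertension includes thiazides.",
    "Metformin is first-line pharmacotherapy for type 2 diabetes.",
    "Amoxicillin is commonly used for streptococcal pharyngitis.",
]
few_shot_example = "Q: First-line drug for type 2 diabetes? A: Metformin."

def build_prompt(question: str, top_k: int = 2) -> str:
    vec = TfidfVectorizer().fit(knowledge_base + [question])
    sims = cosine_similarity(vec.transform([question]),
                             vec.transform(knowledge_base))[0]
    top = [knowledge_base[i] for i in sims.argsort()[::-1][:top_k]]
    context = "\n".join(f"- {passage}" for passage in top)
    return (f"Relevant knowledge:\n{context}\n\n"
            f"Example:\n{few_shot_example}\n\nQ: {question} A:")

print(build_prompt("What is the first-line treatment for hypertension?"))
```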
Affiliation(s)
- Jiageng Wu
- School of Public Health, Zhejiang University School of Medicine, Hangzhou, 310058, China
| | - Xian Wu
- Jarvis Research Center, Tencent YouTu Lab, Beijing, 100101, China
| | - Zhaopeng Qiu
- Jarvis Research Center, Tencent YouTu Lab, Beijing, 100101, China
| | - Minghui Li
- School of Public Health, Zhejiang University School of Medicine, Hangzhou, 310058, China
| | - Shixu Lin
- School of Public Health, Zhejiang University School of Medicine, Hangzhou, 310058, China
| | - Yingying Zhang
- Jarvis Research Center, Tencent YouTu Lab, Beijing, 100101, China
| | - Yefeng Zheng
- Jarvis Research Center, Tencent YouTu Lab, Beijing, 100101, China
| | - Changzheng Yuan
- School of Public Health, Zhejiang University School of Medicine, Hangzhou, 310058, China
- Department of Nutrition, Harvard T.H. Chan School of Public Health, Boston, MA 02115, United States
| | - Jie Yang
- School of Public Health, Zhejiang University School of Medicine, Hangzhou, 310058, China
- Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA 02115, United States
| |
|
24
|
Wang L, Wan Z, Ni C, Song Q, Li Y, Clayton EW, Malin BA, Yin Z. A Systematic Review of ChatGPT and Other Conversational Large Language Models in Healthcare. medRxiv [Preprint] 2024:2024.04.26.24306390. PMID: 38712148; PMCID: PMC11071576; DOI: 10.1101/2024.04.26.24306390.
Abstract
Background The launch of the Chat Generative Pre-trained Transformer (ChatGPT) in November 2022 has attracted public attention and academic interest to large language models (LLMs), facilitating the emergence of many other innovative LLMs. These LLMs have been applied in various fields, including healthcare. Numerous studies have since been conducted regarding how to employ state-of-the-art LLMs in health-related scenarios to assist patients, doctors, and public health administrators. Objective This review aims to summarize the applications and concerns of applying conversational LLMs in healthcare and provide an agenda for future research on LLMs in healthcare. Methods We utilized PubMed, ACM, and IEEE digital libraries as primary sources for this review. We followed the guidance of Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) to screen and select peer-reviewed research articles that (1) were related to both healthcare applications and conversational LLMs and (2) were published before September 1st, 2023, the date when we started paper collection and screening. We investigated these papers and classified them according to their applications and concerns. Results Our search initially identified 820 papers according to targeted keywords, out of which 65 papers met our criteria and were included in the review. The most popular conversational LLM was ChatGPT from OpenAI (60), followed by Bard from Google (1), Large Language Model Meta AI (LLaMA) from Meta (1), and other LLMs (5). These papers were classified into four categories in terms of their applications: 1) summarization, 2) medical knowledge inquiry, 3) prediction, and 4) administration, and four categories of concerns: 1) reliability, 2) bias, 3) privacy, and 4) public acceptability. There were 49 (75%) research papers using LLMs for summarization and/or medical knowledge inquiry and 58 (89%) expressing concerns about reliability and/or bias. We found that conversational LLMs exhibit promising results in summarization and in providing medical knowledge to patients with relatively high accuracy. However, conversational LLMs like ChatGPT are not yet able to provide reliable answers to complex health-related tasks that require specialized domain expertise. Additionally, none of the reviewed papers conducted experiments to thoughtfully examine how conversational LLMs lead to bias or privacy issues in healthcare research. Conclusions Future studies should focus on improving the reliability of LLM applications in complex health-related tasks, as well as investigating the mechanisms by which LLM applications bring about bias and privacy issues. Considering the vast accessibility of LLMs, legal, social, and technical efforts are all needed to address concerns about LLMs and to promote, improve, and regulate their application in healthcare.
Affiliation(s)
- Leyao Wang
- Department of Computer Science, Vanderbilt University, Nashville, TN, USA, 37212
| | - Zhiyu Wan
- Department of Biomedical Informatics, Vanderbilt University Medical Center, TN, USA, 37203
| | - Congning Ni
- Department of Computer Science, Vanderbilt University, Nashville, TN, USA, 37212
| | - Qingyuan Song
- Department of Computer Science, Vanderbilt University, Nashville, TN, USA, 37212
| | - Yang Li
- Department of Computer Science, Vanderbilt University, Nashville, TN, USA, 37212
| | - Ellen Wright Clayton
- Department of Pediatrics, Vanderbilt University Medical Center, Nashville, Tennessee, USA, 37203
- Center for Biomedical Ethics and Society, Vanderbilt University Medical Center, Nashville, Tennessee, USA, 37203
| | - Bradley A. Malin
- Department of Computer Science, Vanderbilt University, Nashville, TN, USA, 37212
- Department of Biomedical Informatics, Vanderbilt University Medical Center, TN, USA, 37203
- Department of Biostatistics, Vanderbilt University Medical Center, TN, USA, 37203
| | - Zhijun Yin
- Department of Computer Science, Vanderbilt University, Nashville, TN, USA, 37212
- Department of Biomedical Informatics, Vanderbilt University Medical Center, TN, USA, 37203
| |
|
25
|
Pham C, Govender R, Tehami S, Chavez S, Adepoju OE, Liaw W. ChatGPT's Performance in Cardiac Arrest and Bradycardia Simulations Using the American Heart Association's Advanced Cardiovascular Life Support Guidelines: Exploratory Study. J Med Internet Res 2024; 26:e55037. PMID: 38648098; DOI: 10.2196/55037.
Abstract
BACKGROUND ChatGPT is the most advanced large language model to date, with prior iterations having passed medical licensing examinations, provided clinical decision support, and improved diagnostics. Although limited, past studies of ChatGPT's performance found that artificial intelligence could pass the American Heart Association's advanced cardiovascular life support (ACLS) examinations with modifications. ChatGPT's accuracy has not been studied in more complex clinical scenarios. As heart disease and cardiac arrest remain leading causes of morbidity and mortality in the United States, finding technologies that help increase adherence to ACLS algorithms, which improves survival outcomes, is critical. OBJECTIVE This study aims to examine the accuracy of ChatGPT in following ACLS guidelines for bradycardia and cardiac arrest. METHODS We evaluated the accuracy of ChatGPT's responses to 2 simulations based on the 2020 American Heart Association ACLS guidelines, with 3 primary outcomes of interest: the mean individual step accuracy, the accuracy score per simulation attempt, and the accuracy score for each algorithm. For each simulation step, ChatGPT was scored for correctness (1 point) or incorrectness (0 points). Each simulation was conducted 20 times. RESULTS ChatGPT's median accuracy for each step was 85% (IQR 40%-100%) for cardiac arrest and 30% (IQR 13%-81%) for bradycardia. ChatGPT's median accuracy over 20 simulation attempts was 69% (IQR 67%-74%) for cardiac arrest and 42% (IQR 33%-50%) for bradycardia. We found that ChatGPT's outputs varied despite consistent input, the same actions were persistently missed, repetitive overemphasis hindered guidance, and erroneous medication information was presented. CONCLUSIONS This study highlights the need for consistent and reliable guidance to prevent potential medical errors and optimize the application of ChatGPT to enhance its reliability and effectiveness in clinical practice.
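The scoring scheme above reduces each simulation run to a vector of 0/1 step scores, summarized by medians and interquartile ranges. A minimal sketch with random stand-in data:

```python
# Each of 20 runs yields 0/1 correctness per algorithm step; summarize
# per-step and per-run accuracy with median and IQR. The binary matrix is
# random stand-in data, not the study's transcripts.
import numpy as np

rng = np.random.default_rng(0)
runs, steps = 20, 10
scores = rng.integers(0, 2, size=(runs, steps))  # 1 = step performed correctly

step_accuracy = scores.mean(axis=0) * 100        # % correct per step
run_accuracy = scores.mean(axis=1) * 100         # % correct per run

for name, acc in (("per-step", step_accuracy), ("per-run", run_accuracy)):
    q1, med, q3 = np.percentile(acc, [25, 50, 75])
    print(f"{name}: median {med:.0f}% (IQR {q1:.0f}%-{q3:.0f}%)")
```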
Affiliation(s)
- Cecilia Pham
- Tilman J Fertitta Family College of Medicine, University of Houston, Houston, TX, United States
| | - Romi Govender
- Tilman J Fertitta Family College of Medicine, University of Houston, Houston, TX, United States
| | - Salik Tehami
- Tilman J Fertitta Family College of Medicine, University of Houston, Houston, TX, United States
| | - Summer Chavez
- Tilman J Fertitta Family College of Medicine, University of Houston, Houston, TX, United States
- Humana Integrated Health Sciences Institute, University of Houston, Houston, TX, United States
- Department of Health Systems and Population Health Sciences, Tilman J Fertitta Family College of Medicine, Houston, TX, United States
| | - Omolola E Adepoju
- Tilman J Fertitta Family College of Medicine, University of Houston, Houston, TX, United States
- Humana Integrated Health Sciences Institute, University of Houston, Houston, TX, United States
- Department of Health Systems and Population Health Sciences, Tilman J Fertitta Family College of Medicine, Houston, TX, United States
| | - Winston Liaw
- Tilman J Fertitta Family College of Medicine, University of Houston, Houston, TX, United States
- Humana Integrated Health Sciences Institute, University of Houston, Houston, TX, United States
- Department of Health Systems and Population Health Sciences, Tilman J Fertitta Family College of Medicine, Houston, TX, United States
| |
|
26
|
Deng QF, Bao YY, Yang YY, Mao CK. Re: David Musheyev, Alexander Pan, Stacy Loeb, Abdo E. Kabarriti. How Well Do Artificial Intelligence Chatbots Respond to the Top Search Queries About Urological Malignancies? Eur Urol 2023;85:13-6. Eur Urol 2024:S0302-2838(24)02312-1. PMID: 38644145; DOI: 10.1016/j.eururo.2024.02.031.
Affiliation(s)
- Qi-Fei Deng
- Department of Urology, Anhui Provincial Children's Hospital, Hefei, China
| | - Yuan-Yuan Bao
- Department of Electrocardiography, Anhui Maternal and Child Health Hospital, Hefei, China
| | - Yuan-Yuan Yang
- Department of Electrocardiography, Anhui Maternal and Child Health Hospital, Hefei, China.
| | - Chang-Kun Mao
- Department of Urology, Anhui Provincial Children's Hospital, Hefei, China.
| |
|
27
|
Wu J, Ma Y, Wang J, Xiao M. The Application of ChatGPT in Medicine: A Scoping Review and Bibliometric Analysis. J Multidiscip Healthc 2024; 17:1681-1692. PMID: 38650670; PMCID: PMC11034560; DOI: 10.2147/jmdh.s463128.
Abstract
Purpose ChatGPT has a wide range of applications in the medical field. This review therefore aims to define the key issues and provide a comprehensive view of the literature on the application of ChatGPT in medicine. Methods This scoping review follows Arksey and O'Malley's five-stage framework. A comprehensive literature search of publications (30 November 2022 to 16 August 2023) was conducted. Six databases were searched and relevant references were systematically catalogued. Attention was focused on the general characteristics of the articles, their fields of application, and the advantages and disadvantages of using ChatGPT. Descriptive statistics and narrative synthesis methods were used for data analysis. Results Of the 3426 studies, 247 met the criteria for inclusion in this review. The majority of articles (31.17%) were from the United States. Editorials (43.32%) ranked first, followed by experimental studies (11.74%). The potential applications of ChatGPT in medicine are varied, with the largest number of studies (45.75%) exploring clinical practice, including assisting with clinical decision support and providing disease information and medical advice. This was followed by medical education (27.13%) and scientific research (16.19%). In the discipline statistics, radiology, surgery, and dentistry stood out at the top of the list. However, ChatGPT in medicine also faces issues of data privacy, inaccuracy, and plagiarism. Conclusion The application of ChatGPT in medicine centers on distinct disciplines and general application scenarios. ChatGPT has a paradoxical nature: it offers significant advantages, but at the same time raises great concerns about its application in healthcare settings. Therefore, it is imperative to develop theoretical frameworks that not only address its widespread use in healthcare but also facilitate a comprehensive assessment. In addition, these frameworks should contribute to the development of strict and effective guidelines and regulatory measures.
Affiliation(s)
- Jie Wu
- Department of Nursing, the First Affiliated Hospital of Chongqing Medical University, Chongqing, People’s Republic of China
| | - Yingzhuo Ma
- Department of Nursing, the First Affiliated Hospital of Chongqing Medical University, Chongqing, People’s Republic of China
| | - Jun Wang
- Department of Nursing, the First Affiliated Hospital of Chongqing Medical University, Chongqing, People’s Republic of China
| | - Mingzhao Xiao
- Department of Urology, the First Affiliated Hospital of Chongqing Medical University, Chongqing, People’s Republic of China
| |
|
28
|
Liu S, McCoy AB, Peterson JF, Lasko TA, Sittig DF, Nelson SD, Andrews J, Patterson L, Cobb CM, Mulherin D, Morton CT, Wright A. Leveraging explainable artificial intelligence to optimize clinical decision support. J Am Med Inform Assoc 2024; 31:968-974. PMID: 38383050; PMCID: PMC10990514; DOI: 10.1093/jamia/ocae019.
Abstract
OBJECTIVE To develop and evaluate a data-driven process for generating suggestions to improve alert criteria using explainable artificial intelligence (XAI) approaches. METHODS We extracted data on alerts generated from January 1, 2019 to December 31, 2020, at Vanderbilt University Medical Center. We developed machine learning models to predict user responses to alerts. We applied XAI techniques to generate global and local explanations. We evaluated the generated suggestions by comparing them with the alerts' historical change logs and through stakeholder interviews. Suggestions that either matched (or partially matched) changes already made to the alert or were considered clinically correct were classified as helpful. RESULTS The final dataset included 2 991 823 firings with 2689 features. Among the 5 machine learning models, the LightGBM model achieved the highest area under the ROC curve: 0.919 [0.918, 0.920]. We identified 96 helpful suggestions. A total of 278 807 firings (9.3%) could have been eliminated. Some of the suggestions also revealed workflow and education issues. CONCLUSION We developed a data-driven process to generate suggestions for improving alert criteria using XAI techniques. Our approach could identify improvements regarding clinical decision support (CDS) that might be overlooked or delayed in manual reviews. It also unveils a secondary purpose for XAI: to improve quality by discovering scenarios where CDS alerts are not accepted due to workflow, education, or staffing issues.
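A hedged sketch of the pipeline this abstract describes: train a LightGBM model on alert firings, then use SHAP to produce global and local explanations from which criteria suggestions can be read off. Synthetic data and invented feature names stand in for the alert dataset; this is not the authors' code.

```python
# Sketch of the XAI pipeline under stated assumptions: LightGBM predicts the
# user response to an alert, and SHAP values flag features that drive
# non-acceptance (candidates for criteria changes). Data are synthetic.
import numpy as np
import lightgbm as lgb
import shap
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, n_features=8, random_state=0)
feature_names = [f"criterion_{i}" for i in range(X.shape[1])]  # invented names

# y = 1 could represent "alert accepted", 0 = "overridden/ignored".
model = lgb.LGBMClassifier(n_estimators=200, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
sv = explainer.shap_values(X)                # local (per-firing) explanations
sv = sv[1] if isinstance(sv, list) else sv   # binary models may return a list

# Global explanation: mean absolute SHAP value per feature.
importance = np.abs(sv).mean(axis=0)
for name, imp in sorted(zip(feature_names, importance), key=lambda t: -t[1])[:3]:
    print(f"{name}: mean |SHAP| = {imp:.3f}")
```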
Affiliation(s)
- Siru Liu
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Department of Computer Science, Vanderbilt University, Nashville, TN 37212, United States
| | - Allison B McCoy
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
| | - Josh F Peterson
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37203, United States
| | - Thomas A Lasko
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Department of Computer Science, Vanderbilt University, Nashville, TN 37212, United States
| | - Dean F Sittig
- School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX 77030, United States
| | - Scott D Nelson
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
| | - Jennifer Andrews
- Department of Pediatrics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Department of Pathology, Microbiology and Immunology, Vanderbilt University Medical Center, Nashville, TN 37203, United States
| | - Lorraine Patterson
- HealthIT, Vanderbilt University Medical Center, Nashville, TN 37203, United States
| | - Cheryl M Cobb
- Department of Psychiatry and Behavioral Sciences, Vanderbilt University Medical Center, Nashville, TN 37203, United States
| | - David Mulherin
- HealthIT, Vanderbilt University Medical Center, Nashville, TN 37203, United States
| | - Colleen T Morton
- Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37203, United States
| | - Adam Wright
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37203, United States
| |
|
29
|
Saibene AM, Allevi F, Calvo-Henriquez C, Maniaci A, Mayo-Yáñez M, Paderno A, Vaira LA, Felisati G, Craig JR. Reliability of large language models in managing odontogenic sinusitis clinical scenarios: a preliminary multidisciplinary evaluation. Eur Arch Otorhinolaryngol 2024; 281:1835-1841. PMID: 38189967; PMCID: PMC10943141; DOI: 10.1007/s00405-023-08372-4.
Abstract
PURPOSE This study aimed to evaluate the utility of large language model (LLM) artificial intelligence tools, Chat Generative Pre-Trained Transformer (ChatGPT) versions 3.5 and 4, in managing complex otolaryngological clinical scenarios, specifically the multidisciplinary management of odontogenic sinusitis (ODS). METHODS A prospective, structured multidisciplinary specialist evaluation was conducted using five ad hoc designed ODS-related clinical scenarios. LLM responses to these scenarios were critically reviewed by a multidisciplinary panel of eight specialist evaluators (2 ODS experts, 2 rhinologists, 2 general otolaryngologists, and 2 maxillofacial surgeons). Based on the level of disagreement from panel members, a Total Disagreement Score (TDS) was calculated for each LLM response, and TDS comparisons were made between ChatGPT3.5 and ChatGPT4, as well as between different evaluators. RESULTS While disagreement to some degree was demonstrated in 73/80 evaluator reviews of LLMs' responses, TDSs were significantly lower for ChatGPT4 than for ChatGPT3.5. The highest TDSs were found in the case of complicated ODS with orbital abscess, presumably due to increased case complexity, with dental, rhinologic, and orbital factors affecting diagnostic and therapeutic options. There were no statistically significant differences in TDSs between evaluators' specialties, though ODS experts and maxillofacial surgeons tended to assign higher TDSs. CONCLUSIONS LLMs like ChatGPT, especially newer versions, showed potential for complementing evidence-based clinical decision-making, but substantial disagreement was still demonstrated between LLMs and clinical specialists across most case examples, suggesting they are not yet optimal for aiding clinical management decisions. Future studies will be important to analyze LLMs' performance as they evolve over time.
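The abstract does not give the exact TDS formula, so the sketch below assumes one plausible form: each of the eight panelists rates agreement with a response on a 1-5 Likert scale, and the TDS sums the distances from full agreement. The ratings are invented.

```python
# Assumed TDS form (not the published formula): sum of (5 - rating) across
# the eight panel members, so 0 means unanimous full agreement.
def total_disagreement_score(ratings: list[int]) -> int:
    return sum(5 - r for r in ratings)

panel_gpt35 = [3, 4, 2, 3, 4, 3, 2, 4]  # hypothetical ratings, one per evaluator
panel_gpt4 = [5, 4, 4, 5, 4, 5, 4, 4]

print("TDS (ChatGPT3.5):", total_disagreement_score(panel_gpt35))
print("TDS (ChatGPT4):  ", total_disagreement_score(panel_gpt4))
```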
Affiliation(s)
- Alberto Maria Saibene
- Otolaryngology Unit, Santi Paolo E Carlo Hospital, Department of Health Sciences, Università Degli Studi Di Milano, Milan, Italy.
| | - Fabiana Allevi
- Maxillofacial Surgery Unit, Santi Paolo E Carlo Hospital, Department of Health Sciences, Università Degli Studi Di Milano, Milan, Italy
| | - Christian Calvo-Henriquez
- Service of Otolaryngology, Rhinology Unit, Hospital Complex at the University of Santiago de Compostela, Santiago de Compostela, A Coruña, Spain
| | - Antonino Maniaci
- Department of Medical, Surgical Sciences and Advanced Technologies G.F. Ingrassia, University of Catania, Catania, Italy
| | - Miguel Mayo-Yáñez
- Otorhinolaryngology, Head and Neck Surgery Department, Complexo Hospitalario Universitario A Coruña (CHUAC), A Coruña, Galicia, Spain
| | - Alberto Paderno
- Department of Otorhinolaryngology, Head and Neck Surgery, University of Brescia, Brescia, Italy
| | - Luigi Angelo Vaira
- Maxillofacial Surgery Operative Unit, Department of Medicine, Surgery and Pharmacy, University of Sassari, Sassari, Italy
- Biomedical Science PhD School, Biomedical Science Department, University of Sassari, Sassari, Italy
| | - Giovanni Felisati
- Otolaryngology Unit, Santi Paolo E Carlo Hospital, Department of Health Sciences, Università Degli Studi Di Milano, Milan, Italy
| | - John R Craig
- Department of Otolaryngology-Head and Neck Surgery, Henry Ford Health, Detroit, MI, USA
| |
|
30
|
Braun EM, Juhasz-Böss I, Solomayer EF, Truhn D, Keller C, Heinrich V, Braun BJ. Will I soon be out of my job? Quality and guideline conformity of ChatGPT therapy suggestions to patient inquiries with gynecologic symptoms in a palliative setting. Arch Gynecol Obstet 2024; 309:1543-1549. PMID: 37975899; DOI: 10.1007/s00404-023-07272-6.
Abstract
PURPOSE The market and application possibilities for artificial intelligence are currently growing at high speed, and AI is increasingly finding its way into gynecology. While the medical side is well represented in the current literature, the patient's perspective still lags behind. The aim of this study was therefore to have experts evaluate ChatGPT's recommendations in response to patient inquiries about possible therapies for leading gynecological symptoms in a palliative setting. METHODS Case vignettes were constructed for 10 common concomitant symptoms of gynecologic oncology tumors in a palliative setting, and patient queries regarding therapy of these symptoms were generated as prompts for ChatGPT. Five experts in palliative care and gynecologic oncology evaluated the responses with respect to guideline adherence and applicability and identified advantages and disadvantages. RESULTS The overall rating of ChatGPT responses averaged 4.1 (5 = strongly agree; 1 = strongly disagree). The experts rated the guideline conformity of the therapy recommendations at an average of 4.0. ChatGPT sometimes omits relevant therapies and does not provide an individual assessment of the suggested therapies, but it does indicate that a physician consultation is additionally necessary. CONCLUSIONS Language models such as ChatGPT, in their freely available and thus in principle patient-accessible versions, can provide valid and largely guideline-compliant therapy recommendations. For a complete therapy recommendation, for the evaluation and individual adjustment of the suggested therapies, and for filtering out possible wrong recommendations, a medical expert's opinion remains indispensable.
Affiliation(s)
- Eva-Marie Braun
- Center for Integrative Oncology, Die Filderklinik, Im Haberschlai 7, 70794, Filderstadt-Bonlanden, Germany.
| | - Ingolf Juhasz-Böss
- Department of Gynecology, University Medical Center Freiburg, Hugstetter Straße 55, 79106, Freiburg, Germany
| | - Erich-Franz Solomayer
- Department of Gynecology, Obstetrics and Reproductive Medicine, Saarland University Hospital, Kirrberger Straße, Building 9, 66421, Homburg, Germany
| | - Daniel Truhn
- Department of Diagnostic and Interventional Radiology, University Hospital Aachen, Pauwelsstraße 30, 52074, Aachen, Germany
| | - Christiane Keller
- Center for Palliative Medicine and Pediatric Pain Therapy, Saarland University Hospital, Kirrberger Straße, Building 69, 66421, Homburg, Germany
| | - Vanessa Heinrich
- Department of Radiation Oncology, University Hospital Tübingen, Crona Kliniken, Hoppe-Seyler-Str. 3, 72076, Tübingen, Germany
| | - Benedikt Johannes Braun
- Department of Trauma and Reconstructive Surgery at the Eberhard Karls University Tübingen, BG Unfallklinik Tübingen, Schnarrenbergstrasse 95, 72076, Tübingen, Germany
| |
|
31
|
Teixeira-Marques F, Medeiros N, Nazaré F, Alves S, Lima N, Ribeiro L, Gama R, Oliveira P. Exploring the role of ChatGPT in clinical decision-making in otorhinolaryngology: a ChatGPT designed study. Eur Arch Otorhinolaryngol 2024; 281:2023-2030. PMID: 38345613; DOI: 10.1007/s00405-024-08498-z.
Abstract
PURPOSE Since the beginning of 2023, ChatGPT has emerged as a hot topic in healthcare research. Its potential to be a valuable tool in clinical practice is compelling, particularly for improving clinical decision support by helping physicians make clinical decisions based on the best medical knowledge available. We aim to investigate ChatGPT's ability to identify, diagnose, and manage patients with otorhinolaryngology-related symptoms. METHODS A prospective, cross-sectional study was designed, based on an idea suggested by ChatGPT, to assess the level of agreement between ChatGPT and five otorhinolaryngologists (ENTs) in 20 reality-inspired clinical cases. The clinical cases were presented to the chatbot on two different occasions (ChatGPT-1 and ChatGPT-2) to assess its temporal stability. RESULTS The mean score of ChatGPT-1 was 4.4 (SD 1.2; min 1, max 5) and of ChatGPT-2 was 4.15 (SD 1.3; min 1, max 5), while the ENTs' mean score was 4.91 (SD 0.3; min 3, max 5). The Mann-Whitney U test revealed a statistically significant difference (p < 0.001) between both ChatGPT scores and the ENTs' scores. ChatGPT-1 and ChatGPT-2 gave different answers on five occasions. CONCLUSIONS Artificial intelligence will be an important instrument in clinical decision-making in the near future, and ChatGPT is the most promising chatbot so far. Although it needs further development to be used safely, there is room for improvement and potential to aid otorhinolaryngology residents and specialists in making the most correct decision for the patient.
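The comparison above is a Mann-Whitney U test between the chatbot's and the specialists' per-case scores, which SciPy provides directly. The score lists below are invented stand-ins for the 20 cases:

```python
# Mann-Whitney U test comparing chatbot scores with specialist scores.
# The 1-5 scores below are invented, not the study's data.
from scipy.stats import mannwhitneyu

chatgpt_scores = [5, 4, 5, 3, 4, 5, 2, 5, 4, 5, 5, 3, 4, 5, 5, 1, 4, 5, 5, 4]
ent_scores = [5, 5, 5, 5, 4, 5, 5, 5, 5, 5, 5, 5, 4, 5, 5, 5, 5, 3, 5, 5]

stat, p = mannwhitneyu(chatgpt_scores, ent_scores, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p:.4f}")
```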
Affiliation(s)
- Francisco Teixeira-Marques
- Department of Otorhinolaryngology, Centro Hospitalar de Vila Nova de Gaia/Espinho, Gaia (Porto), Portugal.
| | - Nuno Medeiros
- Department of Otorhinolaryngology, Centro Hospitalar de Vila Nova de Gaia/Espinho, Gaia (Porto), Portugal
| | - Francisco Nazaré
- Department of Otorhinolaryngology, Centro Hospitalar de Vila Nova de Gaia/Espinho, Gaia (Porto), Portugal
| | - Sandra Alves
- Department of Otorhinolaryngology, Centro Hospitalar de Vila Nova de Gaia/Espinho, Gaia (Porto), Portugal
| | - Nuno Lima
- Department of Otorhinolaryngology, Centro Hospitalar de Vila Nova de Gaia/Espinho, Gaia (Porto), Portugal
| | - Leandro Ribeiro
- Department of Otorhinolaryngology, Centro Hospitalar de Vila Nova de Gaia/Espinho, Gaia (Porto), Portugal
| | - Rita Gama
- Department of Otorhinolaryngology, Centro Hospitalar de Vila Nova de Gaia/Espinho, Gaia (Porto), Portugal
| | - Pedro Oliveira
- Department of Otorhinolaryngology, Centro Hospitalar de Vila Nova de Gaia/Espinho, Gaia (Porto), Portugal
| |
|
32
|
Aguero D, Nelson SD. The Potential Application of Large Language Models in Pharmaceutical Supply Chain Management. J Pediatr Pharmacol Ther 2024; 29:200-205. PMID: 38596417; PMCID: PMC11001215; DOI: 10.5863/1551-6776-29.2.200.
Affiliation(s)
- David Aguero
- Department of Pharmacy and Pharmaceutical Sciences (DA), St. Jude Children’s Research Hospital, Memphis, TN
| | - Scott D. Nelson
- Department of Biomedical Informatics (SDN), Vanderbilt University Medical Center, Nashville, TN
| |
|
33
|
Lin KC, Chen TA, Lin MH, Chen YC, Chen TJ. Integration and Assessment of ChatGPT in Medical Case Reporting: A Multifaceted Approach. Eur J Investig Health Psychol Educ 2024; 14:888-901. PMID: 38667812; PMCID: PMC11049282; DOI: 10.3390/ejihpe14040057.
Abstract
ChatGPT, a large language model, has gained significance in medical writing, particularly in case reports that document the course of an illness. This article explores how ChatGPT is being integrated into, and how it shapes, the process, product, and politics of medical writing in the real world. We conducted a bibliometric analysis of case reports utilizing ChatGPT and indexed in PubMed, encompassing publication information. Furthermore, an in-depth analysis was conducted to categorize the applications and limitations of ChatGPT and the publication trend of application categories. A total of 66 case reports utilizing ChatGPT were identified, with a predominant preference among authors for the online version and English input. The prevalent application categories were information retrieval and content generation. Notably, this trend remained consistent across different months. Within the subset of 32 articles addressing ChatGPT's limitations in case report writing, concerns related to inaccuracies and a lack of clinical context were prominently emphasized. These concerns underscore the important role of clinical thinking and professional expertise, the foundational tenets of medical education, while also accentuating the distinction between physicians and generative artificial intelligence.
Affiliation(s)
- Kuan-Chen Lin
- Department of Family Medicine, Taipei Veterans General Hospital, No. 201, Sec. 2, Shih-Pai Road, Taipei 11217, Taiwan; (K.-C.L.); (T.-A.C.); (M.-H.L.)
| | - Tsung-An Chen
- Department of Family Medicine, Taipei Veterans General Hospital, No. 201, Sec. 2, Shih-Pai Road, Taipei 11217, Taiwan; (K.-C.L.); (T.-A.C.); (M.-H.L.)
| | - Ming-Hwai Lin
- Department of Family Medicine, Taipei Veterans General Hospital, No. 201, Sec. 2, Shih-Pai Road, Taipei 11217, Taiwan; (K.-C.L.); (T.-A.C.); (M.-H.L.)
- School of Medicine, National Yang Ming Chiao Tung University, Taipei 30010, Taiwan
| | - Yu-Chun Chen
- Department of Family Medicine, Taipei Veterans General Hospital, No. 201, Sec. 2, Shih-Pai Road, Taipei 11217, Taiwan; (K.-C.L.); (T.-A.C.); (M.-H.L.)
- School of Medicine, National Yang Ming Chiao Tung University, Taipei 30010, Taiwan
- Institute of Hospital and Health Care Administration, School of Medicine, National Yang Ming Chiao Tung University, Taipei 30010, Taiwan
- Big Data Center, Taipei Veterans General Hospital, Taipei 11217, Taiwan
| | - Tzeng-Ji Chen
- Department of Family Medicine, Taipei Veterans General Hospital Hsinchu Branch, No. 81, Sec. 1, Zhongfeng Road, Zhudong Township, Hsinchu 310403, Taiwan
- Department of Post-Baccalaureate Medicine, National Chung Hsing University, No. 145, Xingda Road, South District, Taichung 402202, Taiwan
| |
|
34
|
Kim DW, Park JS, Sharma K, Velazquez A, Li L, Ostrominski JW, Tran T, Seitter Peréz RH, Shin JH. Qualitative evaluation of artificial intelligence-generated weight management diet plans. Front Nutr 2024; 11:1374834. PMID: 38577160; PMCID: PMC10991711; DOI: 10.3389/fnut.2024.1374834.
Abstract
Importance The transformative potential of artificial intelligence (AI), particularly via large language models, is increasingly being manifested in healthcare. Dietary interventions are foundational to weight management efforts, but whether AI techniques are presently capable of generating clinically applicable diet plans has not been evaluated. Objective Our study sought to evaluate the potential of personalized AI-generated weight-loss diet plans for clinical applications through a survey-based assessment conducted by experts in the fields of obesity medicine and clinical nutrition. Design, setting, and participants We utilized ChatGPT (4.0) to create weight-loss diet plans and selected two control diet plans from tertiary medical centers for comparison. Dietitians, physicians, and nurse practitioners specializing in obesity medicine or nutrition were invited to provide feedback on the AI-generated plans. Each plan was assessed blindly based on its effectiveness, balance, comprehensiveness, flexibility, and applicability. Personalized plans for hypothetical patients with specific health conditions were also evaluated. Main outcomes and measures The primary outcomes measured included the indistinguishability of the AI diet plan from human-created plans and the potential of personalized AI-generated diet plans for real-world clinical applications. Results Of 95 participants, 67 completed the survey and were included in the final analysis. No significant differences were found among the three weight-loss diet plans in any evaluation category. Among the 14 experts who believed that they could identify the AI plan, only five did so correctly. In an evaluation involving 57 experts, the AI-generated personalized weight-loss diet plan received scores above neutral for all evaluation variables. Several limitations of the AI-generated plans were highlighted, including conflicting dietary considerations, lack of affordability, and insufficient specificity in recommendations, such as exact portion sizes. These limitations suggest that refining inputs could enhance the quality and applicability of AI-generated diet plans. Conclusion Despite certain limitations, our study highlights the potential of AI-generated diet plans for clinical applications. AI-generated dietary plans were frequently indistinguishable from diet plans widely used at major tertiary medical centers. Although further refinement and prospective studies are needed, these findings illustrate the potential of AI in advancing personalized weight-centric care.
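The abstract reports no significant differences among the three plans but does not name the statistical test; assuming ordinal expert ratings across three independent groups, a Kruskal-Wallis test is one reasonable choice, sketched here with invented ratings.

```python
# Assumed analysis (the test is not named in the abstract): Kruskal-Wallis
# comparison of 1-5 expert ratings across the three diet plans. Invented data.
from scipy.stats import kruskal

ai_plan = [4, 3, 4, 5, 3, 4, 4, 3]
control_one = [4, 4, 3, 4, 3, 5, 4, 4]
control_two = [3, 4, 4, 4, 5, 3, 4, 4]

stat, p = kruskal(ai_plan, control_one, control_two)
print(f"H = {stat:.2f}, p = {p:.3f}")
```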
Affiliation(s)
- Dong Wook Kim
- Division of Endocrinology, Diabetes and Hypertension, Center for Weight Management and Wellness, Brigham and Women's Hospital, Boston, MA, United States
- Department of Medicine, Section of Endocrinology, Diabetes, Nutrition & Weight Management, Boston University Chobanian & Avedisian School of Medicine, Boston, MA, United States
| | - Ji Seok Park
- Department of Gastroenterology, Hepatology & Nutrition, Cleveland Clinic, Cleveland, OH, United States
| | - Kavita Sharma
- Department of Medicine, Section of Endocrinology, Diabetes, Nutrition & Weight Management, Boston University Chobanian & Avedisian School of Medicine, Boston, MA, United States
| | - Amanda Velazquez
- Department of Medicine, Weight Management and Metabolic Health Center, Cedars Sinai Hospital, Los Angeles, CA, United States
| | - Lu Li
- Department of Medicine, Section of Endocrinology, Diabetes, Nutrition & Weight Management, Boston University Chobanian & Avedisian School of Medicine, Boston, MA, United States
| | - John W. Ostrominski
- Division of Endocrinology, Diabetes and Hypertension, Center for Weight Management and Wellness, Brigham and Women's Hospital, Boston, MA, United States
- Division of Cardiovascular Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA, United States
| | - Tram Tran
- Division of Endocrinology, Diabetes and Hypertension, Center for Weight Management and Wellness, Brigham and Women's Hospital, Boston, MA, United States
- Department of Medicine, Section of Endocrinology, Diabetes, Nutrition & Weight Management, Boston University Chobanian & Avedisian School of Medicine, Boston, MA, United States
| | - Robert H. Seitter Peréz
- Department of Medicine, Section of Endocrinology, Diabetes, Nutrition & Weight Management, Boston University Chobanian & Avedisian School of Medicine, Boston, MA, United States
| | - Jeong-Hun Shin
- Department of Medicine, Section of Endocrinology, Diabetes, Nutrition & Weight Management, Boston University Chobanian & Avedisian School of Medicine, Boston, MA, United States
- Department of Internal Medicine, Hanyang University College of Medicine, Seoul, Republic of Korea
| |
|
35
|
Rodriguez DV, Lawrence K, Gonzalez J, Brandfield-Harvey B, Xu L, Tasneem S, Levine DL, Mann D. Leveraging Generative AI Tools to Support the Development of Digital Solutions in Health Care Research: Case Study. JMIR Hum Factors 2024; 11:e52885. PMID: 38446539; PMCID: PMC10955400; DOI: 10.2196/52885.
Abstract
BACKGROUND Generative artificial intelligence has the potential to revolutionize health technology product development by improving coding quality, efficiency, documentation, quality assessment and review, and troubleshooting. OBJECTIVE This paper explores the application of a commercially available generative artificial intelligence tool (ChatGPT) to the development of a digital health behavior change intervention designed to support patient engagement in a commercial digital diabetes prevention program. METHODS We examined the capacity, advantages, and limitations of ChatGPT in supporting digital product idea conceptualization, intervention content development, and the software engineering process, including software requirement generation, software design, and code production. In total, 11 evaluators, each with at least 10 years of experience in fields of study ranging from medicine and implementation science to computer science, participated in the output review process (ChatGPT vs human-generated output). All had familiarity with or prior exposure to the original personalized automatic messaging system intervention. The evaluators rated the ChatGPT-produced outputs in terms of understandability, usability, novelty, relevance, completeness, and efficiency. RESULTS Most metrics received positive scores. We identified that ChatGPT can (1) help developers achieve high-quality products faster and (2) facilitate nontechnical communication and system understanding between technical and nontechnical team members around the development goal of rapid, easy-to-build computational solutions for medical technologies. CONCLUSIONS ChatGPT can serve as a usable facilitator for researchers engaging in the software development life cycle, from product conceptualization to feature identification and user story development to code generation. TRIAL REGISTRATION ClinicalTrials.gov NCT04049500; https://clinicaltrials.gov/ct2/show/NCT04049500.
Affiliation(s)
- Danissa V Rodriguez
- Department of Population Health, New York University Grossman School of Medicine, New York, NY, United States
| | - Katharine Lawrence
- Department of Population Health, New York University Grossman School of Medicine, New York, NY, United States
- Medical Center Information Technology, Department of Health Informatics, New York University Langone Health, New York, NY, United States
| | - Javier Gonzalez
- Medical Center Information Technology, Department of Health Informatics, New York University Langone Health, New York, NY, United States
| | - Beatrix Brandfield-Harvey
- Department of Population Health, New York University Grossman School of Medicine, New York, NY, United States
| | - Lynn Xu
- Department of Population Health, New York University Grossman School of Medicine, New York, NY, United States
| | - Sumaiya Tasneem
- Department of Population Health, New York University Grossman School of Medicine, New York, NY, United States
| | - Defne L Levine
- Department of Population Health, New York University Grossman School of Medicine, New York, NY, United States
| | - Devin Mann
- Department of Population Health, New York University Grossman School of Medicine, New York, NY, United States
- Medical Center Information Technology, Department of Health Informatics, New York University Langone Health, New York, NY, United States
| |
|
36
|
Mu Y, He D. The Potential Applications and Challenges of ChatGPT in the Medical Field. Int J Gen Med 2024; 17:817-826. PMID: 38476626; PMCID: PMC10929156; DOI: 10.2147/ijgm.s456659.
Abstract
ChatGPT, an AI-driven conversational large language model (LLM), has garnered significant scholarly attention since its inception, owing to its manifold applications in the realm of medical science. This study primarily examines the merits, limitations, anticipated developments, and practical applications of ChatGPT in clinical practice, healthcare, medical education, and medical research. It underscores the necessity for further research and development to enhance its performance and deployment. Moreover, future research avenues encompass ongoing enhancements and standardization of ChatGPT, mitigating its limitations, and exploring its integration and applicability in translational and personalized medicine. Reflecting the narrative nature of this review, a focused literature search was performed to identify relevant publications on ChatGPT's use in medicine. This process was aimed at gathering a broad spectrum of insights to provide a comprehensive overview of the current state and future prospects of ChatGPT in the medical domain. The objective is to aid healthcare professionals in understanding the groundbreaking advancements associated with the latest artificial intelligence tools, while also acknowledging the opportunities and challenges presented by ChatGPT.
Affiliation(s)
- Yonglin Mu
- Department of Urology, Children’s Hospital of Chongqing Medical University, Chongqing, People’s Republic of China
| | - Dawei He
- Department of Urology, Children’s Hospital of Chongqing Medical University, Chongqing, People’s Republic of China
| |
|
37
|
Spotnitz M, Idnay B, Gordon ER, Shyu R, Zhang G, Liu C, Cimino JJ, Weng C. A Survey of Clinicians' Views of the Utility of Large Language Models. Appl Clin Inform 2024; 15:306-312. PMID: 38442909; PMCID: PMC11023712; DOI: 10.1055/a-2281-7092.
Abstract
OBJECTIVES Large language models (LLMs) such as the generative pre-trained transformer-based ChatGPT are powerful algorithms that have been shown to produce human-like text from input data. Several potential clinical applications of this technology have been proposed and evaluated by biomedical informatics experts. However, few have surveyed health care providers for their opinions about whether the technology is fit for use. METHODS We distributed a validated mixed-methods survey to gauge practicing clinicians' comfort with LLMs for a breadth of tasks in clinical practice, research, and education, which were selected from the literature. RESULTS A total of 30 clinicians fully completed the survey. Of the 23 tasks, 16 were rated positively by more than 50% of the respondents. Based on our qualitative analysis, health care providers considered LLMs to have excellent synthesis skills and efficiency. However, our respondents had concerns that LLMs could generate false information and propagate training data bias. Our survey respondents were most comfortable with scenarios that allow LLMs to function in an assistive role, like a physician extender or trainee. CONCLUSION In a mixed-methods survey of clinicians about LLM use, health care providers were encouraging about the use of LLMs in health care for many tasks, especially in assistive roles. There is a need for continued human-centered development of both LLMs and artificial intelligence in general.
Collapse
Affiliation(s)
- Matthew Spotnitz
- Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, New York, United States
| | - Betina Idnay
- Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, New York, United States
| | - Emily R. Gordon
- Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, New York, United States
- Department of Dermatology, Vagelos College of Physicians and Surgeons, Columbia University Irving Medical Center, New York, New York, United States
| | - Rebecca Shyu
- Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, New York, United States
| | - Gongbo Zhang
- Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, New York, United States
| | - Cong Liu
- Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, New York, United States
| | - James J. Cimino
- Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, New York, United States
- Department of Biomedical Informatics and Data Science, Informatics Institute, Heersink School of Medicine, University of Alabama at Birmingham, Birmingham, Alabama, United States
| | - Chunhua Weng
- Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, New York, United States
| |
Collapse
|
38
|
Wei Q, Yao Z, Cui Y, Wei B, Jin Z, Xu X. Evaluation of ChatGPT-generated medical responses: A systematic review and meta-analysis. J Biomed Inform 2024; 151:104620. [PMID: 38462064 DOI: 10.1016/j.jbi.2024.104620] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/23/2023] [Revised: 02/27/2024] [Accepted: 02/29/2024] [Indexed: 03/12/2024]
Abstract
OBJECTIVE Large language models (LLMs) such as ChatGPT are increasingly explored in medical domains. However, the absence of standard guidelines for performance evaluation has led to methodological inconsistencies. This study aims to summarize the available evidence on evaluating ChatGPT's performance in answering medical questions and provide direction for future research. METHODS An extensive literature search was conducted on June 15, 2023, across ten medical databases. The keyword used was "ChatGPT," without restrictions on publication type, language, or date. Studies evaluating ChatGPT's performance in answering medical questions were included. Exclusions comprised review articles, comments, patents, non-medical evaluations of ChatGPT, and preprint studies. Data were extracted on general study characteristics, question sources, conversation processes, assessment metrics, and performance of ChatGPT. An evaluation framework for LLMs in medical inquiries was proposed by integrating insights from the selected literature. This study is registered with PROSPERO, CRD42023456327. RESULTS A total of 3520 articles were identified, of which 60 were reviewed and summarized in this paper and 17 were included in the meta-analysis. ChatGPT displayed an overall integrated accuracy of 56 % (95 % CI: 51 %-60 %, I² = 87 %) in addressing medical queries. However, the studies varied in question source, question-asking process, and evaluation metrics. As per our proposed evaluation framework, many studies failed to report methodological details, such as the date of inquiry, version of ChatGPT, and inter-rater consistency. CONCLUSION This review reveals ChatGPT's potential in addressing medical inquiries, but the heterogeneity of study designs and insufficient reporting might affect the results' reliability. Our proposed evaluation framework provides insights for future study design and transparent reporting of LLMs in responding to medical questions.
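A pooled estimate of the kind quoted above (56 % accuracy, 95 % CI 51 %-60 %, I² = 87 %) is what a random-effects meta-analysis of proportions produces. The sketch below illustrates one common way to compute such a figure, DerSimonian-Laird pooling of logit-transformed accuracies; the per-study counts are hypothetical and the method is a standard stand-in, not the authors' documented procedure.

```python
import numpy as np
from scipy.special import expit  # inverse-logit back-transformation

# Hypothetical per-study results: k correct answers out of n questions.
k = np.array([55, 120, 40, 300, 75])
n = np.array([100, 200, 90, 500, 120])

p = k / n
y = np.log(p / (1 - p))        # logit-transformed proportions
v = 1 / k + 1 / (n - k)        # approximate variance of each logit

# Cochran's Q from fixed-effect weights
w = 1 / v
y_fixed = np.sum(w * y) / np.sum(w)
q = np.sum(w * (y - y_fixed) ** 2)
df = len(y) - 1

# DerSimonian-Laird between-study variance and the I^2 statistic
c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
tau2 = max(0.0, (q - df) / c)
i2 = max(0.0, (q - df) / q) * 100

# Random-effects pooled accuracy with a 95% confidence interval
w_re = 1 / (v + tau2)
mu = np.sum(w_re * y) / np.sum(w_re)
se = np.sqrt(1 / np.sum(w_re))
print(f"pooled accuracy {expit(mu):.1%} "
      f"(95% CI {expit(mu - 1.96 * se):.1%}-{expit(mu + 1.96 * se):.1%}, "
      f"I2 = {i2:.0f}%)")
```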
Collapse
Affiliation(s)
- Qiuhong Wei
- Big Data Center for Children's Medical Care, Children's Hospital of Chongqing Medical University, Chongqing, China; Children Nutrition Research Center, Children's Hospital of Chongqing Medical University, Chongqing, China; National Clinical Research Center for Child Health and Disorders, Ministry of Education Key Laboratory of Child Development and Disorders, China International Science and Technology Cooperation Base of Child Development and Critical Disorders, Chongqing Key Laboratory of Child Neurodevelopment and Cognitive Disorders, Chongqing, China
| | - Zhengxiong Yao
- Department of Neurology, Children's Hospital of Chongqing Medical University, Chongqing, China
| | - Ying Cui
- Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA, USA
| | - Bo Wei
- Department of Global Statistics and Data Science, BeiGene USA Inc., San Mateo, CA, USA
| | - Zhezhen Jin
- Department of Biostatistics, Mailman School of Public Health, Columbia University, New York, NY, USA
| | - Ximing Xu
- Big Data Center for Children's Medical Care, Children's Hospital of Chongqing Medical University, Chongqing, China
| |
Collapse
|
39
|
Bužančić I, Belec D, Držaić M, Kummer I, Brkić J, Fialová D, Ortner Hadžiabdić M. Clinical decision-making in benzodiazepine deprescribing by healthcare providers vs. AI-assisted approach. Br J Clin Pharmacol 2024; 90:662-674. [PMID: 37949663 DOI: 10.1111/bcp.15963] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/05/2023] [Revised: 10/26/2023] [Accepted: 10/29/2023] [Indexed: 11/12/2023] Open
Abstract
AIMS The aim of this study was to compare the clinical decision-making for benzodiazepine deprescribing between a healthcare provider (HCP) and an artificial intelligence (AI) chatbot GPT4 (ChatGPT-4). METHODS We analysed real-world data from a Croatian cohort of community-dwelling benzodiazepine patients (n = 154) within the EuroAgeism H2020 ESR 7 project. HCPs evaluated the data using pre-established deprescribing criteria to assess benzodiazepine discontinuation potential. The research team devised and tested AI prompts to ensure consistency with HCP judgements. An independent researcher employed ChatGPT-4 with predetermined prompts to simulate clinical decisions for each patient case. Data derived from human-HCP and ChatGPT-4 decisions were compared for agreement rates and Cohen's kappa. RESULTS Both the HCP and ChatGPT-4 identified patients eligible for benzodiazepine deprescribing (96.1% and 89.6%, respectively), showing an agreement rate of 95% (κ = .200, P = .012). Agreement on four deprescribing criteria ranged from 74.7% to 91.3% (lack of indication κ = .352, P < .001; prolonged use κ = .088, P = .280; safety concerns κ = .123, P = .006; incorrect dosage κ = .264, P = .001). Important limitations of GPT-4 responses were identified, including ambiguous outputs (22.1%), generic answers, and inaccuracies, which pose risks of inappropriate clinical decision-making. CONCLUSIONS While AI-HCP agreement is substantial, sole reliance on AI poses a risk of unsuitable clinical decision-making. This study's findings reveal both strengths and areas for enhancement of ChatGPT-4 in deprescribing recommendations within a real-world sample. Our study underscores the need for additional research on chatbot functionality in patient therapy decision-making, further fostering the advancement of AI for optimal performance.
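The agreement statistics reported here are straightforward to reproduce. A minimal sketch with invented decisions (not the study's data) is shown below; note that Cohen's kappa corrects raw agreement for chance, which is why a 95% agreement rate can coexist with a modest κ = .200 when one decision category dominates, as it does in this cohort.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical binary deprescribing decisions (1 = deprescribe) for the
# same patients, one list per rater; not the study's actual data.
hcp_decisions = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]
gpt_decisions = [1, 1, 0, 0, 1, 1, 1, 1, 1, 1]

agreement = sum(a == b for a, b in zip(hcp_decisions, gpt_decisions)) / len(hcp_decisions)
kappa = cohen_kappa_score(hcp_decisions, gpt_decisions)
print(f"raw agreement {agreement:.0%}, Cohen's kappa {kappa:.3f}")
```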
Collapse
Affiliation(s)
- Iva Bužančić
- Center for Applied Pharmacy, Faculty of Pharmacy and Biochemistry, University of Zagreb, Zagreb, Croatia
- City Pharmacy Zagreb, Zagreb, Croatia
| | - Dora Belec
- Center for Applied Pharmacy, Faculty of Pharmacy and Biochemistry, University of Zagreb, Zagreb, Croatia
| | - Margita Držaić
- Center for Applied Pharmacy, Faculty of Pharmacy and Biochemistry, University of Zagreb, Zagreb, Croatia
- City Pharmacy Zagreb, Zagreb, Croatia
| | - Ingrid Kummer
- Department of Social and Clinical Pharmacy, Faculty of Pharmacy in Hradec Králové, Charles University, Hradec Králové, Czech Republic
| | - Jovana Brkić
- Department of Social and Clinical Pharmacy, Faculty of Pharmacy in Hradec Králové, Charles University, Hradec Králové, Czech Republic
- Department of Social Pharmacy and Pharmaceutical Legislation, Faculty of Pharmacy, University of Belgrade, Belgrade, Serbia
| | - Daniela Fialová
- Department of Social and Clinical Pharmacy, Faculty of Pharmacy in Hradec Králové, Charles University, Hradec Králové, Czech Republic
- Department of Geriatrics and Gerontology, 1st Faculty of Medicine in Prague, Charles University, Prague, Czech Republic
| | - Maja Ortner Hadžiabdić
- Center for Applied Pharmacy, Faculty of Pharmacy and Biochemistry, University of Zagreb, Zagreb, Croatia
| |
Collapse
|
40
|
Ge J, Buenaventura A, Berrean B, Purvis J, Fontil V, Lai JC, Pletcher MJ. Applying human-centered design to the construction of a cirrhosis management clinical decision support system. Hepatol Commun 2024; 8:e0394. [PMID: 38407255 PMCID: PMC10898661 DOI: 10.1097/hc9.0000000000000394] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 11/17/2023] [Accepted: 12/13/2023] [Indexed: 02/27/2024] Open
Abstract
BACKGROUND Electronic health record (EHR)-based clinical decision support is a scalable way to help standardize clinical care. Clinical decision support systems have not been extensively investigated in cirrhosis management. Human-centered design (HCD) is an approach that engages with potential users in intervention development. In this study, we applied HCD to design the features and interface for a clinical decision support system for cirrhosis management, called CirrhosisRx. METHODS We conducted technical feasibility assessments to construct a visual blueprint that outlines the basic features of the interface. We then convened collaborative-design workshops with generalist and specialist clinicians. We elicited current workflows for cirrhosis management, assessed gaps in existing EHR systems, evaluated potential features, and refined the design prototype for CirrhosisRx. At the conclusion of each workshop, we analyzed recordings and transcripts. RESULTS Workshop feedback showed that the aggregation of relevant clinical data into 6 cirrhosis decompensation domains (defined as common inpatient clinical scenarios) was the most important feature. Automatic inference of clinical events from EHR data, such as gastrointestinal bleeding from hemoglobin changes, was not accepted due to accuracy concerns. Visualizations for risk stratification scores were deemed not necessary. Lastly, the HCD co-design workshops allowed us to identify the target user population (generalists). CONCLUSIONS This is one of the first applications of HCD to design the features and interface for an electronic intervention for cirrhosis management. The HCD process altered features, modified the design interface, and likely improved CirrhosisRx's overall usability. The finalized design for CirrhosisRx proceeded to development and production and will be tested for effectiveness in a pragmatic randomized controlled trial. This work provides a model for the creation of other EHR-based interventions in hepatology care.
Collapse
Affiliation(s)
- Jin Ge
- Department of Medicine, Division of Gastroenterology and Hepatology, University of California—San Francisco, San Francisco, California, USA
| | - Ana Buenaventura
- School of Medicine Technology Services, University of California—San Francisco, San Francisco, California, USA
| | - Beth Berrean
- School of Medicine Technology Services, University of California—San Francisco, San Francisco, California, USA
| | - Jory Purvis
- School of Medicine Technology Services, University of California—San Francisco, San Francisco, California, USA
| | - Valy Fontil
- Family Health Centers, NYU-Langone Medical Center, Brooklyn, New York, USA
| | - Jennifer C. Lai
- Department of Medicine, Division of Gastroenterology and Hepatology, University of California—San Francisco, San Francisco, California, USA
| | - Mark J. Pletcher
- Department of Epidemiology and Biostatistics, University of California—San Francisco, San Francisco, California, USA
| |
Collapse
|
41
|
Chandra A, Chakraborty A. Exploring the role of large language models in radiation emergency response. JOURNAL OF RADIOLOGICAL PROTECTION : OFFICIAL JOURNAL OF THE SOCIETY FOR RADIOLOGICAL PROTECTION 2024; 44:011510. [PMID: 38324900 DOI: 10.1088/1361-6498/ad270c] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 12/19/2023] [Accepted: 02/07/2024] [Indexed: 02/09/2024]
Abstract
In recent times, the field of artificial intelligence (AI) has been transformed by the introduction of large language models (LLMs). These models, popularized by OpenAI's GPT-3, have demonstrated the emergent capabilities of AI in comprehending and producing text resembling human language, which has helped them transform several industries. However, their role has yet to be explored in the nuclear industry, specifically in managing radiation emergencies. The present work explores LLMs' contextual awareness, natural language interaction, and their capacity to comprehend diverse queries in a radiation emergency response setting. In this study, we identify different user types and their specific LLM use-cases in radiation emergencies. Their possible interactions with ChatGPT, a popular LLM, have also been simulated, and preliminary results are presented. Drawing on the insights gained from this exercise and to address concerns of reliability and misinformation, this study advocates for expert-guided and domain-specific LLMs trained on radiation safety protocols and historical data. This study aims to guide radiation emergency management practitioners and decision-makers in effectively incorporating LLMs into their decision support framework.
Collapse
Affiliation(s)
- Anirudh Chandra
- Radiation Safety Systems Division, Bhabha Atomic Research Centre, Mumbai 400085, India
| | - Abinash Chakraborty
- Health Physics Division, Bhabha Atomic Research Centre, Mumbai 400085, India
| |
Collapse
|
42
|
Abi-Rafeh J, Xu HH, Kazan R, Tevlin R, Furnas H. Large Language Models and Artificial Intelligence: A Primer for Plastic Surgeons on the Demonstrated and Potential Applications, Promises, and Limitations of ChatGPT. Aesthet Surg J 2024; 44:329-343. [PMID: 37562022 DOI: 10.1093/asj/sjad260] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/26/2023] [Revised: 08/02/2023] [Accepted: 08/04/2023] [Indexed: 08/12/2023] Open
Abstract
BACKGROUND The rapidly evolving field of artificial intelligence (AI) holds great potential for plastic surgeons. ChatGPT, a recently released AI large language model (LLM), promises applications across many disciplines, including healthcare. OBJECTIVES The aim of this article was to provide a primer for plastic surgeons on AI, LLMs, and ChatGPT, including an analysis of current demonstrated and proposed clinical applications. METHODS A systematic review was performed identifying medical and surgical literature on ChatGPT's proposed clinical applications. Variables assessed included applications investigated, command tasks provided, user input information, AI-emulated human skills, output validation, and reported limitations. RESULTS The analysis included 175 articles reporting on 13 plastic surgery applications and 116 additional clinical applications, categorized by field and purpose. Thirty-four applications within plastic surgery are thus proposed, with relevance to different target audiences, including attending plastic surgeons (n = 17, 50%), trainees/educators (n = 8, 24%), researchers/scholars (n = 7, 21%), and patients (n = 2, 6%). The 15 identified limitations of ChatGPT were categorized by training data, algorithm, and ethical considerations. CONCLUSIONS Widespread use of ChatGPT in plastic surgery will depend on rigorous research of proposed applications to validate performance and address limitations. This systematic review aims to guide research, development, and regulation to safely adopt AI in plastic surgery.
Collapse
|
43
|
Stengel FC, Stienen MN, Ivanov M, Gandía-González ML, Raffa G, Ganau M, Whitfield P, Motov S. Can AI pass the written European Board Examination in Neurological Surgery? - Ethical and practical issues. BRAIN & SPINE 2024; 4:102765. [PMID: 38510593 PMCID: PMC10951784 DOI: 10.1016/j.bas.2024.102765] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 11/27/2023] [Revised: 01/28/2024] [Accepted: 02/12/2024] [Indexed: 03/22/2024]
Abstract
Introduction Artificial intelligence (AI)-based large language models (LLMs) hold enormous potential in education and training. Recent publications demonstrated that they are able to outperform participants in written medical exams. Research question We aimed to explore the accuracy of AI in the written part of the EANS board exam. Material and methods Eighty-six representative single best answer (SBA) questions, included at least ten times in prior EANS board exams, were selected by the current EANS board exam committee. The questions' content was classified as 75 text-based (TB) and 11 image-based (IB), and their structure as 50 interpretation-weighted, 30 theory-based, and 6 true-or-false. Questions were tested with ChatGPT-3.5, Bing, and Bard. The AI and participant results were statistically analyzed through ANOVA tests with Stata SE 15 (StataCorp, College Station, TX). P-values of <0.05 were considered statistically significant. Results The Bard LLM achieved the highest accuracy with 62% correct questions overall and 69% excluding IB, outperforming the human exam participants, who scored 59% (p = 0.67) and 59% (p = 0.42), respectively. All LLMs scored highest in theory-based questions, excluding IB questions (ChatGPT: 79%; Bing: 83%; Bard: 86%), and significantly better than the human exam participants (60%; p = 0.03). None of the LLMs answered any IB question correctly. Discussion and conclusion AI passed the written EANS board exam based on representative SBA questions and achieved results close to or even better than the human exam participants. Our results raise several ethical and practical issues that may impact the current concept of the written EANS board exam.
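To make the accuracy comparison above concrete, a two-proportion test can check whether an LLM's score on the 86 questions differs from the human average. The authors report ANOVA in Stata, so this Python sketch with approximate counts (53/86 ≈ 62% vs. 51/86 ≈ 59%) is only an illustration of the comparison, not their analysis.

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical counts of correct answers out of 86 representative SBA
# questions, approximating the reported 62% (Bard) vs. 59% (humans).
llm_correct, human_correct, n_questions = 53, 51, 86

stat, p = proportions_ztest([llm_correct, human_correct], [n_questions, n_questions])
print(f"z = {stat:.2f}, p = {p:.2f}")  # a 3-point gap on 86 items rarely reaches significance
```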
Collapse
Affiliation(s)
- Felix C. Stengel
- Department of Neurosurgery & Spine Center of Eastern Switzerland, Kantonsspital St. Gallen & Medical School of St.Gallen, St. Gallen, Switzerland
| | - Martin N. Stienen
- Department of Neurosurgery & Spine Center of Eastern Switzerland, Kantonsspital St. Gallen & Medical School of St.Gallen, St. Gallen, Switzerland
| | - Marcel Ivanov
- Royal Hallamshire Hospital, Sheffield, United Kingdom
| | | | - Giovanni Raffa
- Division of Neurosurgery, BIOMORF Department, University of Messina, Messina, Italy
| | - Mario Ganau
- Oxford University Hospitals NHS Foundation Trust, Oxford, United Kingdom
| | | | - Stefan Motov
- Department of Neurosurgery & Spine Center of Eastern Switzerland, Kantonsspital St. Gallen & Medical School of St.Gallen, St. Gallen, Switzerland
| |
Collapse
|
44
|
Abdullahi T, Singh R, Eickhoff C. Learning to Make Rare and Complex Diagnoses With Generative AI Assistance: Qualitative Study of Popular Large Language Models. JMIR MEDICAL EDUCATION 2024; 10:e51391. [PMID: 38349725 PMCID: PMC10900078 DOI: 10.2196/51391] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/30/2023] [Revised: 11/07/2023] [Accepted: 12/11/2023] [Indexed: 02/15/2024]
Abstract
BACKGROUND Patients with rare and complex diseases often experience delayed diagnoses and misdiagnoses because comprehensive knowledge about these diseases is limited to only a few medical experts. In this context, large language models (LLMs) have emerged as powerful knowledge aggregation tools with applications in clinical decision support and education domains. OBJECTIVE This study aims to explore the potential of 3 popular LLMs, namely Bard (Google LLC), ChatGPT-3.5 (OpenAI), and GPT-4 (OpenAI), in medical education to enhance the diagnosis of rare and complex diseases while investigating the impact of prompt engineering on their performance. METHODS We conducted experiments on publicly available complex and rare cases to achieve these objectives. We implemented various prompt strategies to evaluate the performance of these models using both open-ended and multiple-choice prompts. In addition, we used a majority voting strategy to leverage diverse reasoning paths within language models, aiming to enhance their reliability. Furthermore, we compared their performance with the performance of human respondents and MedAlpaca, a generative LLM specifically designed for medical tasks. RESULTS Notably, all LLMs outperformed the average human consensus and MedAlpaca, with a minimum margin of 5% and 13%, respectively, across all 30 cases from the diagnostic case challenge collection. On the frequently misdiagnosed cases category, Bard tied with MedAlpaca but surpassed the human average consensus by 14%, whereas GPT-4 and ChatGPT-3.5 outperformed MedAlpaca and the human respondents on the moderately often misdiagnosed cases category with minimum accuracy scores of 28% and 11%, respectively. The majority voting strategy, particularly with GPT-4, demonstrated the highest overall score across all cases from the diagnostic complex case collection, surpassing that of other LLMs. On the Medical Information Mart for Intensive Care-III data sets, Bard and GPT-4 achieved the highest diagnostic accuracy scores, with multiple-choice prompts scoring 93%, whereas ChatGPT-3.5 and MedAlpaca scored 73% and 47%, respectively. Furthermore, our results demonstrate that there is no one-size-fits-all prompting approach for improving the performance of LLMs and that a single strategy does not universally apply to all LLMs. CONCLUSIONS Our findings shed light on the diagnostic capabilities of LLMs and the challenges associated with identifying an optimal prompting strategy that aligns with each language model's characteristics and specific task requirements. The significance of prompt engineering is highlighted, providing valuable insights for researchers and practitioners who use these language models for medical training. Furthermore, this study represents a crucial step toward understanding how LLMs can enhance diagnostic reasoning in rare and complex medical cases, paving the way for developing effective educational tools and accurate diagnostic aids to improve patient care and outcomes.
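The majority-voting strategy described here is commonly implemented as self-consistency: sample several reasoning paths from the same model and keep the modal final answer. A minimal sketch follows, in which `sample_fn` is a hypothetical wrapper around whichever LLM API is being evaluated; the study's actual prompts and sampling settings are not reproduced.

```python
from collections import Counter

def majority_vote(sample_fn, prompt: str, n_samples: int = 5) -> str:
    """Query the model n_samples times and return the modal answer.

    sample_fn is any callable that sends `prompt` to an LLM with sampling
    enabled (temperature > 0) and returns the extracted final diagnosis.
    """
    answers = [sample_fn(prompt) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```

Raising `n_samples` trades API cost for a more stable modal answer, which is the mechanism the authors leverage to improve reliability.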
Collapse
Affiliation(s)
- Tassallah Abdullahi
- Department of Computer Science, Brown University, Providence, RI, United States
| | - Ritambhara Singh
- Department of Computer Science, Brown University, Providence, RI, United States
- Center for Computational Molecular Biology, Brown University, Providence, RI, United States
| | | |
Collapse
|
45
|
Wang Z, Zhang Z, Traverso A, Dekker A, Qian L, Sun P. Assessing the role of GPT-4 in thyroid ultrasound diagnosis and treatment recommendations: enhancing interpretability with a chain of thought approach. Quant Imaging Med Surg 2024; 14:1602-1615. [PMID: 38415150 PMCID: PMC10895085 DOI: 10.21037/qims-23-1180] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/20/2023] [Accepted: 11/30/2023] [Indexed: 02/29/2024]
Abstract
Background As artificial intelligence (AI) becomes increasingly prevalent in the medical field, the effectiveness of AI-generated medical reports in disease diagnosis remains to be evaluated. ChatGPT is a large language model developed by OpenAI with a notable capacity for text abstraction and comprehension. This study aimed to explore the capabilities, limitations, and potential of Generative Pre-trained Transformer (GPT)-4 in analyzing thyroid cancer ultrasound reports, providing diagnoses, and recommending treatment plans. Methods Using 109 diverse thyroid cancer cases, we evaluated GPT-4's performance by comparing its generated reports to those from doctors with various levels of experience. We also conducted a Turing Test and a consistency analysis. To enhance the interpretability of the model, we applied the Chain of Thought (CoT) method to deconstruct the decision-making chain of the GPT model. Results GPT-4 demonstrated proficiency in report structuring, professional terminology, and clarity of expression, but showed limitations in diagnostic accuracy. In addition, our consistency analysis highlighted certain discrepancies in the AI's performance. The CoT method effectively enhanced the interpretability of the AI's decision-making process. Conclusions GPT-4 exhibits potential as a supplementary tool in healthcare, especially for generating thyroid gland diagnostic reports. Our proposed online platform, "ThyroAIGuide", alongside the CoT method, underscores the potential of AI to augment diagnostic processes, elevate healthcare accessibility, and advance patient education. However, the journey towards fully integrating AI into healthcare is ongoing, requiring continuous research, development, and careful monitoring by medical professionals to ensure patient safety and quality of care.
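The Chain of Thought (CoT) method mentioned here amounts to prompting the model to externalize intermediate reasoning before committing to a diagnosis. The template below is a hypothetical reconstruction for illustration only; the study's actual prompt wording, and the internals of the "ThyroAIGuide" platform, are not reproduced.

```python
# A hypothetical chain-of-thought prompt for ultrasound report analysis;
# the prompt text and the FINAL DIAGNOSIS convention are assumptions.
COT_PROMPT = """You are assisting with thyroid ultrasound interpretation.
Report: {report_text}

Reason step by step before answering:
1. List the sonographic features described in the report.
2. Map each feature to its risk implication.
3. State the most likely diagnosis and a suggested next step.

End with: FINAL DIAGNOSIS: <diagnosis>"""

def build_prompt(report_text: str) -> str:
    """Fill the CoT template with one report's text."""
    return COT_PROMPT.format(report_text=report_text)
```

The design choice is that forcing enumerated intermediate steps makes the model's decision chain inspectable, which is what the authors use to deconstruct and audit GPT-4's reasoning.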
Collapse
Affiliation(s)
- Zhixiang Wang
- Department of Ultrasound, Beijing Friendship Hospital, Capital Medical University, Beijing, China
- Department of Radiation Oncology (Maastro), GROW-School for Oncology, Maastricht University Medical Centre+, Maastricht, The Netherlands
| | - Zhen Zhang
- Department of Radiation Oncology (Maastro), GROW-School for Oncology, Maastricht University Medical Centre+, Maastricht, The Netherlands
| | - Alberto Traverso
- Department of Radiation Oncology (Maastro), GROW-School for Oncology, Maastricht University Medical Centre+, Maastricht, The Netherlands
| | - Andre Dekker
- Department of Radiation Oncology (Maastro), GROW-School for Oncology, Maastricht University Medical Centre+, Maastricht, The Netherlands
| | - Linxue Qian
- Department of Ultrasound, Beijing Friendship Hospital, Capital Medical University, Beijing, China
| | - Pengfei Sun
- Department of Ultrasound, Beijing Friendship Hospital, Capital Medical University, Beijing, China
| |
Collapse
|
46
|
Segal S, Saha AK, Khanna AK. Appropriateness of Answers to Common Preanesthesia Patient Questions Composed by the Large Language Model GPT-4 Compared to Human Authors. Anesthesiology 2024; 140:333-335. [PMID: 38193737 DOI: 10.1097/aln.0000000000004824] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/10/2024]
Affiliation(s)
- Scott Segal
- Wake Forest University School of Medicine, Atrium Health Wake Forest Baptist Medical Center, Winston-Salem, North Carolina (S.S.).
| | | | | |
Collapse
|
47
|
Liao Z, Wang J, Shi Z, Lu L, Tabata H. Revolutionary Potential of ChatGPT in Constructing Intelligent Clinical Decision Support Systems. Ann Biomed Eng 2024; 52:125-129. [PMID: 37332008 DOI: 10.1007/s10439-023-03288-w] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/08/2023] [Accepted: 06/13/2023] [Indexed: 06/20/2023]
Abstract
Recently, the Chat Generative Pre-trained Transformer (ChatGPT) has been recognized as a promising clinical decision support system (CDSS) in the medical field owing to its advanced text analysis capabilities and interactive design. However, ChatGPT primarily focuses on learning text semantics rather than learning complex data structures and conducting real-time data analysis, which typically necessitate the development of intelligent CDSS employing specialized machine learning algorithms. Although ChatGPT cannot directly execute specific algorithms, it aids in algorithm design for intelligent CDSS at the textual level. In this study, besides discussing the types of CDSS and their relationship with ChatGPT, we mainly investigate the benefits and drawbacks of employing ChatGPT as an auxiliary design tool for intelligent CDSS. Our findings indicate that by collaborating with human expertise, ChatGPT has the potential to revolutionize the development of robust and effective intelligent CDSS.
Collapse
Affiliation(s)
- Zhiqiang Liao
- Department of Electrical Engineering and Information Systems, Graduate School of Engineering, The University of Tokyo, Tokyo, 113-8656, Japan.
| | - Jian Wang
- Department of Orthopaedics, Qilu Hospital of Shandong University, Jinan, 250012, People's Republic of China
| | - Zhuozheng Shi
- Department of Bioengineering, Graduate School of Engineering, The University of Tokyo, Tokyo, 113-8656, Japan
| | - Lintao Lu
- Department of Orthopaedics, Qilu Hospital of Shandong University, Jinan, 250012, People's Republic of China.
- Department of Orthopaedics, Qilu Hospital of Shandong University Dezhou Hospital, Dezhou, 253000, People's Republic of China.
| | - Hitoshi Tabata
- Department of Electrical Engineering and Information Systems, Graduate School of Engineering, The University of Tokyo, Tokyo, 113-8656, Japan
- Department of Bioengineering, Graduate School of Engineering, The University of Tokyo, Tokyo, 113-8656, Japan
| |
Collapse
|
48
|
Jin X, Frock A, Nagaraja S, Wallqvist A, Reifman J. AI algorithm for personalized resource allocation and treatment of hemorrhage casualties. Front Physiol 2024; 15:1327948. [PMID: 38332989 PMCID: PMC10851938 DOI: 10.3389/fphys.2024.1327948] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/25/2023] [Accepted: 01/09/2024] [Indexed: 02/10/2024] Open
Abstract
A deep neural network-based artificial intelligence (AI) model was assessed for its utility in predicting vital signs of hemorrhage patients and optimizing the management of fluid resuscitation in mass casualties. With the use of a cardio-respiratory computational model to generate synthetic data for hemorrhage casualties, an application was created where a limited data stream (the initial 10 min of vital-sign monitoring) could be used to predict the outcomes of different fluid resuscitation allocations 60 min into the future. The predicted outcomes were then used to select the optimal resuscitation allocation for various simulated mass-casualty scenarios. This allowed the assessment of the potential benefits of using an allocation method based on personalized predictions of future vital signs versus a static population-based method that only uses currently available vital-sign information. The theoretical benefits of this approach included up to 46% additional casualties restored to healthy vital signs and a 119% increase in fluid-utilization efficiency. Although the study is subject to the limitations of synthetic data generated under specific assumptions, the work demonstrated the potential for incorporating neural network-based AI technologies in hemorrhage detection and treatment. The simulated injury and treatment scenarios used delineated possible benefits and opportunities available for using AI in pre-hospital trauma care. The greatest benefit of this technology lies in its ability to provide personalized interventions that optimize clinical outcomes under resource-limited conditions, such as in civilian or military mass-casualty events, involving moderate and severe hemorrhage.
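The allocation step described here can be framed as a search over candidate fluid assignments scored by the predictive model. A minimal sketch under stated assumptions: `predict_outcome` stands in for the paper's neural network that forecasts vital signs 60 minutes ahead from the first 10 minutes of monitoring (its interface is invented here), and the exhaustive search is only tractable for small scenarios.

```python
from itertools import product

def best_allocation(predict_outcome, casualties, fluid_units: int):
    """Search fluid allocations across casualties and return the one with
    the highest total predicted benefit.

    predict_outcome(casualty, units) is a hypothetical scoring function,
    e.g. the probability that this casualty is restored to healthy vital
    signs given `units` of resuscitation fluid.
    """
    best, best_score = None, float("-inf")
    # Exhaustive enumeration: (fluid_units + 1) ** len(casualties) options,
    # so this sketch suits only small mass-casualty scenarios.
    for alloc in product(range(fluid_units + 1), repeat=len(casualties)):
        if sum(alloc) > fluid_units:
            continue  # respect the total resource limit
        score = sum(predict_outcome(c, u) for c, u in zip(casualties, alloc))
        if score > best_score:
            best, best_score = alloc, score
    return best, best_score
```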
Collapse
Affiliation(s)
- Xin Jin
- Department of Defense Biotechnology High Performance Computing Software Applications Institute, Telemedicine and Advanced Technology Research Center, United States Army Medical Research and Development Command, Fort Detrick, MD, United States
- The Henry M. Jackson Foundation for the Advancement of Military Medicine Inc., Bethesda, MD, United States
| | - Andrew Frock
- Department of Defense Biotechnology High Performance Computing Software Applications Institute, Telemedicine and Advanced Technology Research Center, United States Army Medical Research and Development Command, Fort Detrick, MD, United States
- The Henry M. Jackson Foundation for the Advancement of Military Medicine Inc., Bethesda, MD, United States
| | - Sridevi Nagaraja
- Department of Defense Biotechnology High Performance Computing Software Applications Institute, Telemedicine and Advanced Technology Research Center, United States Army Medical Research and Development Command, Fort Detrick, MD, United States
- The Henry M. Jackson Foundation for the Advancement of Military Medicine Inc., Bethesda, MD, United States
| | - Anders Wallqvist
- Department of Defense Biotechnology High Performance Computing Software Applications Institute, Telemedicine and Advanced Technology Research Center, United States Army Medical Research and Development Command, Fort Detrick, MD, United States
| | - Jaques Reifman
- Department of Defense Biotechnology High Performance Computing Software Applications Institute, Telemedicine and Advanced Technology Research Center, United States Army Medical Research and Development Command, Fort Detrick, MD, United States
| |
Collapse
|
49
|
Liu X, Wu J, Shao A, Shen W, Ye P, Wang Y, Ye J, Jin K, Yang J. Uncovering Language Disparity of ChatGPT on Retinal Vascular Disease Classification: Cross-Sectional Study. J Med Internet Res 2024; 26:e51926. [PMID: 38252483 PMCID: PMC10845019 DOI: 10.2196/51926] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/17/2023] [Revised: 10/07/2023] [Accepted: 11/30/2023] [Indexed: 01/23/2024] Open
Abstract
BACKGROUND Benefiting from rich knowledge and the exceptional ability to understand text, large language models like ChatGPT have shown great potential in English clinical environments. However, the performance of ChatGPT in non-English clinical settings, as well as its reasoning, have not been explored in depth. OBJECTIVE This study aimed to evaluate ChatGPT's diagnostic performance and inference abilities for retinal vascular diseases in a non-English clinical environment. METHODS In this cross-sectional study, we collected 1226 fundus fluorescein angiography reports and corresponding diagnoses written in Chinese and tested ChatGPT with 4 prompting strategies (direct diagnosis or diagnosis with a step-by-step reasoning process and in Chinese or English). RESULTS Compared with ChatGPT using Chinese prompts for direct diagnosis that achieved an F1-score of 70.47%, ChatGPT using English prompts for direct diagnosis achieved the best diagnostic performance (80.05%), which was inferior to ophthalmologists (89.35%) but close to ophthalmologist interns (82.69%). As for its inference abilities, although ChatGPT can derive a reasoning process with a low error rate (0.4 per report) for both Chinese and English prompts, ophthalmologists identified that the latter brought more reasoning steps with less incompleteness (44.31%), misinformation (1.96%), and hallucinations (0.59%) (all P<.001). Also, analysis of the robustness of ChatGPT with different language prompts indicated significant differences in the recall (P=.03) and F1-score (P=.04) between Chinese and English prompts. In short, when prompted in English, ChatGPT exhibited enhanced diagnostic and inference capabilities for retinal vascular disease classification based on Chinese fundus fluorescein angiography reports. CONCLUSIONS ChatGPT can serve as a helpful medical assistant to provide diagnosis in non-English clinical environments, but there are still performance gaps, language disparities, and errors compared to professionals, which demonstrate the potential limitations and the need to continually explore more robust large language models in ophthalmology practice.
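The F1-scores reported above summarize multi-class diagnostic performance against the reference diagnoses. A small sketch of that computation with invented labels (DR, RVO, and RAO as stand-in retinal vascular disease classes; the study used 1226 fundus fluorescein angiography reports):

```python
from sklearn.metrics import f1_score

# Hypothetical reference diagnoses vs. model outputs for a few reports.
y_true = ["DR", "RVO", "DR", "RAO", "RVO", "DR"]
y_pred = ["DR", "RVO", "RVO", "RAO", "RVO", "DR"]

# Macro-averaging weights each disease class equally, which matters when
# rarer diagnoses such as RAO appear far less often than DR.
print(f"macro F1 = {f1_score(y_true, y_pred, average='macro'):.2%}")
```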
Collapse
Affiliation(s)
- Xiaocong Liu
- Eye Center, The Second Affiliated Hospital, Zhejiang University, Zhejiang, China
- School of Public Health, Zhejiang University School of Medicine, Zhejiang, China
| | - Jiageng Wu
- School of Public Health, Zhejiang University School of Medicine, Zhejiang, China
| | - An Shao
- Eye Center, The Second Affiliated Hospital, Zhejiang University, Zhejiang, China
| | - Wenyue Shen
- Eye Center, The Second Affiliated Hospital, Zhejiang University, Zhejiang, China
| | - Panpan Ye
- Eye Center, The Second Affiliated Hospital, Zhejiang University, Zhejiang, China
| | - Yao Wang
- Eye Center, The Second Affiliated Hospital, Zhejiang University, Zhejiang, China
| | - Juan Ye
- Eye Center, The Second Affiliated Hospital, Zhejiang University, Zhejiang, China
| | - Kai Jin
- Eye Center, The Second Affiliated Hospital, Zhejiang University, Zhejiang, China
| | - Jie Yang
- School of Public Health, Zhejiang University School of Medicine, Zhejiang, China
| |
Collapse
|
50
|
Harrington L. ChatGPT Is Trending: Trust but Verify. AACN Adv Crit Care 2023; 34:280-286. [PMID: 37619604 DOI: 10.4037/aacnacc2023129] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 08/26/2023]
Affiliation(s)
- Linda Harrington
- Linda Harrington is an Independent Consultant, Health Informatics and Digital Strategy, and Adjunct Faculty at Texas Christian University, 2800 South University Drive, Fort Worth, TX 76109
| |
Collapse
|