1. Unadkat KD, Abdulwadood I, Hiredesai AN, Howlett CP, Geldmaker LE, Noland SS. ChatGPT 4.0's efficacy in the self-diagnosis of non-traumatic hand conditions. J Hand Microsurg 2025; 17:100217. [PMID: 40007763] [PMCID: PMC11849648] [DOI: 10.1016/j.jham.2025.100217]
Abstract
Background With advancements in artificial intelligence, patients increasingly turn to generative AI models like ChatGPT for medical advice. This study explores the utility of ChatGPT 4.0 (GPT-4.0), the most recent version of ChatGPT, as an interim diagnostician for common hand conditions. Secondarily, the study evaluates the terminology GPT-4.0 associates with each condition by assessing its ability to generate condition-specific questions from a patient's perspective. Methods Five common hand conditions were identified: trigger finger (TF), Dupuytren's contracture (DC), carpal tunnel syndrome (CTS), de Quervain's tenosynovitis (DQT), and thumb carpometacarpal osteoarthritis (CMC). GPT-4.0 was queried with author-generated questions. The frequencies of correct diagnoses, differential diagnoses, and recommendations were recorded. Chi-squared and pairwise Fisher's exact tests were used to compare response accuracy between conditions. GPT-4.0 was then prompted to produce its own questions, and common terms in its responses were recorded. Results GPT-4.0's diagnostic accuracy differed significantly between conditions (p < 0.005). While GPT-4.0 diagnosed CTS, TF, DQT, and DC with >95% accuracy, only 60% (n = 15) of CMC queries were correctly diagnosed. Additionally, there were significant differences in the provision of differential diagnoses (p < 0.005), diagnostic tests (p < 0.005), and risk factors (p < 0.05). GPT-4.0 recommended visiting a healthcare provider for 97% (n = 121) of the questions. Analysis of ChatGPT-generated questions showed that four of the ten most-used terms were shared between DQT and CMC. Conclusions The results suggest that GPT-4.0 has potential preliminary diagnostic utility. Future studies should further investigate factors that improve or worsen AI's diagnostic power and consider the implications of patient utilization.
Affiliation(s)
- Krishna D. Unadkat
- Mayo Clinic Alix School of Medicine - 13400 E. Shea Blvd., Scottsdale, AZ, 85259, USA
- Isra Abdulwadood
- Mayo Clinic Alix School of Medicine - 13400 E. Shea Blvd., Scottsdale, AZ, 85259, USA
- Annika N. Hiredesai
- Mayo Clinic Alix School of Medicine - 13400 E. Shea Blvd., Scottsdale, AZ, 85259, USA
- Carina P. Howlett
- Mayo Clinic Alix School of Medicine - 13400 E. Shea Blvd., Scottsdale, AZ, 85259, USA
- Laura E. Geldmaker
- Mayo Clinic Alix School of Medicine - 13400 E. Shea Blvd., Scottsdale, AZ, 85259, USA
- Shelley S. Noland
- Division of Hand Surgery, Department of Orthopedic Surgery, Mayo Clinic - 5777 E. Mayo Blvd, Phoenix, AZ, 85054, USA
2. Keyßer G, Pfeil A, Reuß-Borst M, Frohne I, Schultz O, Sander O. [What is the potential of ChatGPT for qualified patient information? An attempt at a structured analysis based on a survey regarding complementary and alternative medicine (CAM) in rheumatology]. Z Rheumatol 2025; 84:179-187. [PMID: 38985176] [DOI: 10.1007/s00393-024-01535-6]
Abstract
INTRODUCTION The chatbot ChatGPT represents a milestone in the interaction between humans and large databases that are accessible via the internet. It facilitates the answering of complex questions by enabling a communication in everyday language. Therefore, it is a potential source of information for those who are affected by rheumatic diseases. The aim of our investigation was to find out whether ChatGPT (version 3.5) is capable of giving qualified answers regarding the application of specific methods of complementary and alternative medicine (CAM) in three rheumatic diseases: rheumatoid arthritis (RA), systemic lupus erythematosus (SLE) and granulomatosis with polyangiitis (GPA). In addition, it was investigated how the answers of the chatbot were influenced by the wording of the question. METHODS The questioning of ChatGPT was performed in three parts. Part A consisted of an open question regarding the best way of treatment of the respective disease. In part B, the questions were directed towards possible indications for the application of CAM in general in one of the three disorders. In part C, the chatbot was asked for specific recommendations regarding one of three CAM methods: homeopathy, ayurvedic medicine and herbal medicine. Questions in parts B and C were expressed in two modifications: firstly, it was asked whether the specific CAM was applicable at all in certain rheumatic diseases. The second question asked which procedure of the respective CAM method worked best in the specific disease. The validity of the answers was checked by using the ChatGPT reliability score, a Likert scale ranging from 1 (lowest validity) to 7 (highest validity). RESULTS The answers to the open questions of part A had the highest validity. In parts B and C, ChatGPT suggested a variety of CAM applications that lacked scientific evidence. The validity of the answers depended on the wording of the questions. 
If the question suggested an inclination to apply a certain CAM, the answers often failed to mention the lack of evidence and were graded with lower score values. CONCLUSION The answers of ChatGPT (version 3.5) regarding the applicability of CAM in selected rheumatic diseases are not convincingly based on scientific evidence. In addition, the wording of the questions affects the validity of the information. Currently, an uncritical application of ChatGPT as an instrument for patient information cannot be recommended.
Affiliation(s)
- Gernot Keyßer
- Klinik und Poliklinik für Innere Medizin II, Universitätsklinikum Halle, Ernst-Grube-Str. 40, 06120, Halle (Saale), Deutschland
- Alexander Pfeil
- Klinik für Innere Medizin III, Universitätsklinikum Jena, Friedrich-Schiller-Universität Jena, Jena, Deutschland
- Inna Frohne
- Privatpraxis für Rheumatologie, Essen, Deutschland
- Olaf Schultz
- Abteilung Rheumatologie, ACURA Kliniken Baden-Baden, Baden-Baden, Deutschland
- Oliver Sander
- Klinik für Rheumatologie, Universitätsklinikum Düsseldorf, Düsseldorf, Deutschland
3. Yun HS, Bickmore T. Online Health Information-Seeking in the Era of Large Language Models: Cross-Sectional Web-Based Survey Study. J Med Internet Res 2025; 27:e68560. [PMID: 40163112] [DOI: 10.2196/68560]
Abstract
BACKGROUND As large language model (LLM)-based chatbots such as ChatGPT (OpenAI) grow in popularity, it is essential to understand their role in delivering online health information compared to other resources. These chatbots often generate inaccurate content, posing potential safety risks. This motivates the need to examine how users perceive and act on health information provided by LLM-based chatbots. OBJECTIVE This study investigates the patterns, perceptions, and actions of users seeking health information online, including LLM-based chatbots. The relationships between online health information-seeking behaviors and important sociodemographic characteristics are examined as well. METHODS A web-based survey of crowd workers was conducted via Prolific. The questionnaire covered sociodemographic information, trust in health care providers, eHealth literacy, artificial intelligence (AI) attitudes, chronic health condition status, online health information source types, perceptions, and actions, such as cross-checking or adherence. Quantitative and qualitative analyses were applied. RESULTS Most participants consulted search engines (291/297, 98%) and health-related websites (203/297, 68.4%) for their health information, while 21.2% (63/297) used LLM-based chatbots, with ChatGPT and Microsoft Copilot being the most popular. Most participants (268/297, 90.2%) sought information on health conditions, with fewer seeking advice on medication (179/297, 60.3%), treatments (137/297, 46.1%), and self-diagnosis (62/297, 23.2%). Perceived information quality and trust varied little across source types. The preferred source for validating information from the internet was consulting health care professionals (40/132, 30.3%), while only a very small percentage of participants (5/214, 2.3%) consulted AI tools to cross-check information from search engines and health-related websites. 
For information obtained from LLM-based chatbots, 19.4% (12/63) of participants cross-checked the information, while 48.4% (30/63) followed the advice. Both rates were lower than those for information from search engines, health-related websites, forums, or social media. Furthermore, use of LLM-based chatbots for health information was negatively correlated with age (ρ=-0.16, P=.006). In contrast, attitudes surrounding AI for medicine had significant positive correlations with the number of source types consulted for health advice (ρ=0.14, P=.01), use of LLM-based chatbots for health information (ρ=0.31, P<.001), and number of health topics searched (ρ=0.19, P<.001). CONCLUSIONS Although traditional online sources remain dominant, LLM-based chatbots are emerging as a resource for health information for some users, specifically those who are younger and have a higher trust in AI. The perceived quality and trustworthiness of health information varied little across source types. However, adherence to health information from LLM-based chatbots appeared more cautious than for search engines or health-related websites. As LLMs continue to evolve, enhancing their accuracy and transparency will be essential in mitigating potential risks by supporting responsible information-seeking while maximizing the potential of AI in health contexts.
Affiliation(s)
- Hye Sun Yun
- Khoury College of Computer Sciences, Northeastern University, Boston, MA, United States
- Timothy Bickmore
- Khoury College of Computer Sciences, Northeastern University, Boston, MA, United States
4. Keasler PM, Chan JCY, Sng BL. Effectiveness of artificial intelligence (AI) chatbots in providing labor epidural analgesia information: are we there yet? Int J Obstet Anesth 2025; 62:104353. [PMID: 40174425] [DOI: 10.1016/j.ijoa.2025.104353]
Abstract
Artificial intelligence (AI) chatbots have gained popularity in healthcare. Their ability to understand and respond to language queries makes them suitable for many practical applications, ranging from medical advice to counselling. However, AI chatbots' ability to provide personalized, complex medical information about labor epidural analgesia may be limited. In this Editorial, we highlight findings from four recent publications in our Journal related to the use of AI chatbots and their effectiveness in providing or enhancing patient education on labor epidural analgesia. Effectiveness can be measured by evaluating AI chatbots' accuracy, readability, completeness, sentiment, and overall quality. While AI chatbots are promising tools for patient education, studies show that they may provide incomplete or inaccurate responses. Standards grounded in existing guidelines from anesthesia societies and associations are needed to assess the medical rigor and users' comprehension of chatbot-generated responses and to ensure optimized patient education.
Affiliation(s)
- Paige M Keasler
- Department of Anesthesia, University of Washington Medical Center Montlake, Seattle, United States
- Joel Chee Yee Chan
- Department of Women's Anesthesia, KK Women's and Children's Hospital, Singapore
- Ban Leong Sng
- Department of Women's Anesthesia, KK Women's and Children's Hospital, Anesthesiology and Perioperative Sciences Academic Clinical Program, Duke-NUS Medical School, Singapore
5. Alessandro L, Bianciotti N, Salama L, Volmaro S, Navarrine V, Ameghino L, Arena J, Bestoso S, Bruno V, Castillo Torres S, Chamorro M, Couto B, De La Riestra T, Echeverria F, Genco J, Gonzalez Del Boca F, Guarnaschelli M, Giugni JC, Laffue A, Martinez Villota V, Medina Escobar A, Paez Maggio M, Rauek S, Rodriguez Quiroga S, Tela M, Villa C, Sanguinetti O, Kauffman M, Fernandez Slezak D, Farez MF, Rossi M. Artificial Intelligence-Based Virtual Assistant for the Diagnostic Approach of Chronic Ataxias. Mov Disord 2025. [PMID: 40119570] [DOI: 10.1002/mds.30168]
Abstract
BACKGROUND Chronic ataxias, a complex group of over 300 diseases, pose significant diagnostic challenges because of their clinical and genetic heterogeneity. Here, we propose that artificial intelligence (AI) can aid in the identification and understanding of these disorders through the use of a smart virtual assistant. OBJECTIVES The aim is to develop and validate an AI-powered virtual assistant for diagnosing chronic ataxias. METHODS A non-commercial virtual assistant was developed using advanced algorithms, decision trees, and large language models. In the validation process, 453 clinical cases from the literature were selected, covering 151 causes of chronic ataxia. The diagnostic accuracy was compared with that of 21 neurologists specializing in movement disorders and GPT-4. Usability, in terms of the time and number of questions needed, was also evaluated. RESULTS The virtual assistant's accuracy was 90.9%, higher than that of the neurologists (18.3%) and GPT-4 (19.4%). It also significantly outperformed both across causes of ataxia stratified by age, inheritance, frequency, associated clinical manifestations, and treatment availability. The neurologists and GPT-4 gave 110 incorrect diagnoses, 83.6% of which were made by GPT-4, which also generated seven data hallucinations. The virtual assistant required an average of 14 questions and 1.5 minutes to generate a list of differential diagnoses, significantly faster than the neurologists (mean, 19.4 minutes). CONCLUSIONS The virtual assistant proved to be accurate, fast, and easy to use for the diagnosis of chronic ataxias, potentially serving as a support tool in neurological consultation. This diagnostic approach could also be expanded to other neurological and non-neurological diseases. © 2025 International Parkinson and Movement Disorder Society.
Affiliation(s)
- Lucas Alessandro
- Departmento de Neurologia, Fleni, Buenos Aires, Argentina
- Entelai, Buenos Aires, Argentina
- Nicolas Bianciotti
- Facultad de Medicina, Universidad de Buenos Aires, Buenos Aires, Argentina
- Luciana Salama
- Facultad de Medicina, Universidad de Buenos Aires, Buenos Aires, Argentina
- Santiago Volmaro
- Facultad de Medicina, Universidad de Buenos Aires, Buenos Aires, Argentina
- Veronica Navarrine
- Facultad de Medicina, Universidad de Buenos Aires, Buenos Aires, Argentina
- Lucia Ameghino
- Sección de Movimientos Anormales, Departamento de Neurología, Fleni, Buenos Aires, Argentina
- Julieta Arena
- Sección de Movimientos Anormales, Departamento de Neurología, Fleni, Buenos Aires, Argentina
- Santiago Bestoso
- Sección Parkinson y Trastornos del Movimiento del Hospital Italiano Buenos Aires (HIBA)
- Veronica Bruno
- Department of Clinical Neurosciences, University of Calgary Hotchkiss Brain Institute, Calgary, Alberta, Canada
- Sergio Castillo Torres
- Servicio de Neurología, Hospital Universitario Dr. Jose Eleuterio Gonzalez, Universidad Autónoma de Nuevo Leon, Monterrey, Mexico
- Mauricio Chamorro
- Sanatorio Parque, Servicio de Neurología. INECO Neurociencias Oroño. Clínica de Movimientos Anormales, Unidad de DBS, Rosario, Argentina
- Blas Couto
- Instituto de Neurociencia Cognitiva y Traslacional (INECO-CONICET-Favaloro), Ciudad de Buenos Aires, Argentina
- Juan Genco
- Consultorio de Trastornos del Movimiento, Servicio de Neurología y Neurocirugía, Hospital Luis Carlos Lagomaggiore, Mendoza, Argentina
- Marlene Guarnaschelli
- Facultad de Ciencias de la Salud, Universidad Adventista del Plata, Libertador de San Martin, Entre Rios, Argentina
- Alfredo Laffue
- Departmento de Neurologia, Fleni, Buenos Aires, Argentina
- Alex Medina Escobar
- Moncton Interdisciplinary Neurodegenerative Diseases Clinic, Horizon Health Network, Moncton, New Brunswick, Canada
- Mauricio Paez Maggio
- Seccion de Movimientos Anormales, Departamento de Neurología, Hospital Britanico, Buenos Aires, Argentina
- Sergio Rodriguez Quiroga
- Departamento de Neurología Hospital J.M. Ramos Mejia, Unidad de Movimientos Anormales y Neurogenética, Buenos Aires, Argentina
- Marcela Tela
- Sección de Movimientos Anormales, Departamento de Neurología, Fleni, Buenos Aires, Argentina
- Marcelo Kauffman
- IIMT-FCB-Universidad Austral-CONICET, Buenos Aires, Argentina
- Consultorio y Laboratorio de Neurogenética Hospital J.M Ramos Mejia, Buenos Aires, Argentina
- Diego Fernandez Slezak
- Entelai, Buenos Aires, Argentina
- Departamento de Computación, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires, Buenos Aires, Argentina
- Instituto de Investigacion en Ciencias de la Computación, CONICET-UBA, Buenos Aires, Argentina
- Mauricio F Farez
- Entelai, Buenos Aires, Argentina
- Centro de Investigación en Enfermedades Neuroinmunológicas (CIEN), Buenos Aires, Argentina
- Malco Rossi
- Sección de Movimientos Anormales, Departamento de Neurología, Fleni, Buenos Aires, Argentina
- Instituto Fleni-CONICET (INEU), Buenos Aires, Argentina
6. Wang R, Situ X, Sun X, Zhan J, Liu X. Assessing AI in Various Elements of Enhanced Recovery After Surgery (ERAS)-Guided Ankle Fracture Treatment: A Comparative Analysis with Expert Agreement. J Multidiscip Healthc 2025; 18:1629-1638. [PMID: 40130076] [PMCID: PMC11930842] [DOI: 10.2147/jmdh.s508511]
Abstract
Objective This study aimed to assess and compare the performance of ChatGPT and iFlytek Spark, two AI-powered large language models (LLMs), in generating clinical recommendations aligned with expert consensus on Enhanced Recovery After Surgery (ERAS)-guided ankle fracture treatment, and to determine the applicability and reliability of AI in supporting ERAS protocols for optimized patient outcomes. Methods A qualitative comparative analysis was conducted using 35 structured clinical questions derived from the Expert Consensus on Optimizing Ankle Fracture Treatment Protocols under ERAS Principles. Questions covered preoperative preparation, intraoperative management, postoperative pain control and rehabilitation, and complication management. Responses from ChatGPT and iFlytek Spark were independently evaluated by two experienced trauma orthopedic specialists based on clinical relevance, consistency with expert consensus, and depth of reasoning. Results ChatGPT demonstrated higher alignment with expert consensus (29/35 questions, 82.9%), particularly in comprehensive perioperative recommendations, detailed medical rationales, and structured treatment plans. However, discrepancies were noted in intraoperative blood pressure management and preoperative antiemetic selection. iFlytek Spark aligned with expert consensus in 22/35 questions (62.9%), but its responses were often more generalized, less clinically detailed, and occasionally inconsistent with best practices. Agreement between ChatGPT and iFlytek Spark was observed in 23/35 questions (65.7%), with ChatGPT generally exhibiting greater specificity, timeliness, and precision in its recommendations. Conclusion AI-powered LLMs, particularly ChatGPT, show promise in supporting clinical decision-making for ERAS-guided ankle fracture management.
While ChatGPT provided more accurate and contextually relevant responses, inconsistencies with expert consensus highlight the need for further refinement, validation, and clinical integration. iFlytek Spark's lower conformity suggests potential differences in training data and underlying algorithms, underscoring the variability in AI-generated medical advice. To optimize AI's role in orthopedic care, future research should focus on enhancing AI alignment with medical guidelines, improving model transparency, and integrating physician oversight to ensure safe and effective clinical applications.
Affiliation(s)
- Rui Wang
- Department of Orthopaedic, Zhongshan City Orthopaedic Hospital, Zhongshan, Guangdong Province, People’s Republic of China
- Xuanming Situ
- Department of Orthopaedic, Zhongshan City Orthopaedic Hospital, Zhongshan, Guangdong Province, People’s Republic of China
- Xu Sun
- Department of Orthopaedic Trauma, Beijing Jishuitan Hospital, Beijing, People’s Republic of China
- Jinchang Zhan
- Department of Orthopaedic, Zhongshan City Orthopaedic Hospital, Zhongshan, Guangdong Province, People’s Republic of China
- Xi Liu
- Department of Sports, Sun Yat-sen Memorial Primary School, Zhongshan, Guangdong Province, People’s Republic of China
7. Kunze KN, Nwachukwu BU, Cote MP, Ramkumar PN. Large Language Models Applied to Health Care Tasks May Improve Clinical Efficiency, Value of Care Rendered, Research, and Medical Education. Arthroscopy 2025; 41:547-556. [PMID: 39694303] [DOI: 10.1016/j.arthro.2024.12.010]
Abstract
Large language models (LLMs) are generative artificial intelligence models that create content on the basis of the data on which they were trained. Their processing capabilities have evolved from text only to multimodal, encompassing text, image, audio, and video features. In health care settings, LLMs are being applied to several clinically important areas, including patient care and workflow efficiency, communications, hospital operations and data management, medical education, practice management, and health care research. Under the umbrella of patient care, several core use cases of LLMs include simplifying documentation tasks, enhancing patient communication (interactive language and written), conveying medical knowledge, and performing medical triage and diagnosis. However, LLMs warrant scrutiny when applied to health care tasks, as errors may have negative implications for health care outcomes, specifically in the context of perpetuating bias, ethical considerations, and cost-effectiveness. Customized LLMs developed for more narrow purposes may help overcome certain performance limitations, transparency challenges, and biases present in contemporary generalized LLMs by curating training data. Methods of customizing LLMs broadly fall under 4 categories: prompt engineering, retrieval-augmented generation, fine-tuning, and agentic augmentation, with each approach conferring different information-retrieval properties for the LLM. LEVEL OF EVIDENCE: Level V, expert opinion.
Affiliation(s)
- Kyle N Kunze
- Department of Orthopaedic Surgery, Hospital for Special Surgery, New York, New York, U.S.A.
- Benedict U Nwachukwu
- Department of Orthopaedic Surgery, Hospital for Special Surgery, New York, New York, U.S.A.
- Mark P Cote
- Department of Orthopaedic Surgery, Massachusetts General Hospital, Boston, Massachusetts, U.S.A.
8. Huo B, Boyle A, Marfo N, Tangamornsuksan W, Steen JP, McKechnie T, Lee Y, Mayol J, Antoniou SA, Thirunavukarasu AJ, Sanger S, Ramji K, Guyatt G. Large Language Models for Chatbot Health Advice Studies: A Systematic Review. JAMA Netw Open 2025; 8:e2457879. [PMID: 39903463] [PMCID: PMC11795331] [DOI: 10.1001/jamanetworkopen.2024.57879]
Abstract
Importance There is much interest in the clinical integration of large language models (LLMs) in health care. Many studies have assessed the ability of LLMs to provide health advice, but the quality of their reporting is uncertain. Objective To perform a systematic review to examine the reporting variability among peer-reviewed studies evaluating the performance of generative artificial intelligence (AI)-driven chatbots for summarizing evidence and providing health advice to inform the development of the Chatbot Assessment Reporting Tool (CHART). Evidence Review A search of MEDLINE via Ovid, Embase via Elsevier, and Web of Science from inception to October 27, 2023, was conducted with the help of a health sciences librarian to yield 7752 articles. Two reviewers screened articles by title and abstract followed by full-text review to identify primary studies evaluating the clinical accuracy of generative AI-driven chatbots in providing health advice (chatbot health advice studies). Two reviewers then performed data extraction for 137 eligible studies. Findings A total of 137 studies were included. Studies examined topics in surgery (55 [40.1%]), medicine (51 [37.2%]), and primary care (13 [9.5%]). Many studies focused on treatment (91 [66.4%]), diagnosis (60 [43.8%]), or disease prevention (29 [21.2%]). Most studies (136 [99.3%]) evaluated inaccessible, closed-source LLMs and did not provide enough information to identify the version of the LLM under evaluation. All studies lacked a sufficient description of LLM characteristics, including temperature, token length, fine-tuning availability, layers, and other details. Most studies (136 [99.3%]) did not describe a prompt engineering phase in their study. The date of LLM querying was reported in 54 (39.4%) studies. 
Most studies (89 [65.0%]) used subjective means to define the successful performance of the chatbot, while less than one-third addressed the ethical, regulatory, and patient safety implications of the clinical integration of LLMs. Conclusions and Relevance In this systematic review of 137 chatbot health advice studies, the reporting quality was heterogeneous and may inform the development of the CHART reporting standards. Ethical, regulatory, and patient safety considerations are crucial as interest grows in the clinical integration of LLMs.
Affiliation(s)
- Bright Huo
- Division of General Surgery, Department of Surgery, McMaster University, Hamilton, Ontario, Canada
- Amy Boyle
- Michael G. DeGroote School of Medicine, McMaster University, Hamilton, Ontario, Canada
- Nana Marfo
- H. Ross University School of Medicine, Miramar, Florida
- Wimonchat Tangamornsuksan
- Department of Health Research Methods, Evidence, and Impact, McMaster University, Hamilton, Ontario, Canada
- Jeremy P. Steen
- Institute of Health Policy, Management and Evaluation, University of Toronto, Toronto, Ontario, Canada
- Tyler McKechnie
- Division of General Surgery, Department of Surgery, McMaster University, Hamilton, Ontario, Canada
- Yung Lee
- Division of General Surgery, Department of Surgery, McMaster University, Hamilton, Ontario, Canada
- Julio Mayol
- Hospital Clinico San Carlos, IdISSC, Universidad Complutense de Madrid, Madrid, Spain
- Stephanie Sanger
- Health Science Library, Faculty of Health Sciences, McMaster University, Hamilton, Ontario, Canada
- Karim Ramji
- Division of General Surgery, Department of Surgery, McMaster University, Hamilton, Ontario, Canada
- Gordon Guyatt
- Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Ontario, Canada
9. Yang X, Li T, Su Q, Liu Y, Kang C, Lyu Y, Zhao L, Nie Y, Pan Y. Application of large language models in disease diagnosis and treatment. Chin Med J (Engl) 2025; 138:130-142. [PMID: 39722188] [PMCID: PMC11745858] [DOI: 10.1097/cm9.0000000000003456]
Abstract
Large language models (LLMs) such as ChatGPT, Claude, Llama, and Qwen are emerging as transformative technologies for the diagnosis and treatment of various diseases. With their exceptional long-context reasoning capabilities, LLMs are proficient in clinically relevant tasks, particularly in medical text analysis and interactive dialogue. They can enhance diagnostic accuracy by processing vast amounts of patient data and medical literature, and they have demonstrated utility in diagnosing common diseases and facilitating the identification of rare diseases by recognizing subtle patterns in symptoms and test results. Building on their image-recognition abilities, multimodal LLMs (MLLMs) show promising potential for diagnosis based on radiography, chest computed tomography (CT), electrocardiography (ECG), and common pathological images. These models can also assist in treatment planning by suggesting evidence-based interventions and improving clinical decision support systems through integrated analysis of patient records. Despite these promising developments, significant challenges persist regarding the use of LLMs in medicine, including concerns about algorithmic bias, the potential for hallucinations, and the need for rigorous clinical validation. Ethical considerations also underscore the importance of maintaining human supervision in clinical practice. This paper highlights the rapid advancements in research on the diagnostic and therapeutic applications of LLMs across different medical disciplines and emphasizes the importance of policymaking, ethical oversight, and multidisciplinary collaboration in promoting more effective and safer clinical applications of LLMs. Future directions include the integration of proprietary clinical knowledge, the investigation of open-source and customized models, and the evaluation of real-time effects in clinical diagnosis and treatment practices.
Collapse
Affiliation(s)
- Xintian Yang, Tongxin Li, Qin Su, Yaling Liu, Chenxi Kang, Yong Lyu, Yongzhan Nie, Yanglin Pan: State Key Laboratory of Holistic Integrative Management of Gastrointestinal Cancers and National Clinical Research Center for Digestive Diseases, Xijing Hospital of Digestive Diseases, Fourth Military Medical University, Xi’an, Shaanxi 710032, China
- Lina Zhao: Department of Radiotherapy, Xijing Hospital, Fourth Military Medical University, Xi’an, Shaanxi 710032, China
10
Shan K, Patel MA, McCreary M, Punnen TG, Villalobos F, Tardo LM, Horton LA, Sguigna PV, Blackburn KM, Munoz SB, Burgess KW, Moog TM, Smith AD, Okuda DT. Faster and better than a physician?: Assessing diagnostic proficiency of ChatGPT in misdiagnosed individuals with neuromyelitis optica spectrum disorder. J Neurol Sci 2025; 468:123360. [PMID: 39733714 DOI: 10.1016/j.jns.2024.123360] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/13/2024] [Revised: 11/27/2024] [Accepted: 12/15/2024] [Indexed: 12/31/2024]
Abstract
BACKGROUND Neuromyelitis optica spectrum disorder (NMOSD) is a commonly misdiagnosed condition. Driven by cost-consciousness and technological fluency, distinct generations may gravitate towards healthcare alternatives, including artificial intelligence (AI) models such as ChatGPT (Generative Pre-trained Transformer). Our objective was to evaluate the speed and accuracy of ChatGPT-3.5 (GPT-3.5) in the diagnosis of people with NMOSD (PwNMOSD) who were initially misdiagnosed. METHODS Misdiagnosed PwNMOSD were retrospectively identified, and their clinical symptoms and timeline of medically related events were processed through GPT-3.5. For each subject, seven digital derivatives representing different races, ethnicities, and sexes were created and processed identically to evaluate the impact of these variables on accuracy. Scoresheets were used to track diagnostic success and time to diagnosis. Diagnostic speed of GPT-3.5 was evaluated against physicians using a Cox proportional hazards model, clustered by subject. Logistic regression was used to estimate the diagnostic accuracy of GPT-3.5 compared with the estimated accuracy of physicians. RESULTS Clinical timelines for 68 individuals (59 female; 42 Black/African American, 13 White, 11 Hispanic, 2 Asian; mean age at first symptoms 34.4 years (y), standard deviation = 15.5y) were analyzed and 476 digital simulations created, yielding 544 conversations for analysis. The instantaneous probability of correct diagnosis was 70.65% lower for physicians relative to GPT-3.5 within 240 days of symptom onset (p < 0.0001). The estimated probability of correct diagnosis for GPT-3.5 was 80.88% [95% CI = (76.35%, 99.81%)]. CONCLUSION GPT-3.5 may be of value in recognizing NMOSD. However, the manner in which medical information is conveyed, combined with the potential for inaccuracies, may result in unnecessary psychological stress.
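For illustration only: the study estimated accuracy with logistic regression clustered by subject, which cannot be reproduced from the abstract alone. A simplified, unadjusted sketch of a proportion estimate with a Wald 95% confidence interval, using a hypothetical count chosen to match the reported 80.88% point estimate, might look like:

```python
import math

def accuracy_ci(correct: int, total: int, z: float = 1.96):
    """Point estimate and unadjusted Wald 95% CI for an accuracy proportion.

    The study itself used clustered logistic regression; this plain
    binomial interval is only a simplified illustration.
    """
    p = correct / total
    se = math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - z * se), min(1.0, p + z * se)

# Hypothetical counts: 440 correct GPT-3.5 diagnoses out of 544 conversations
p, lo, hi = accuracy_ci(440, 544)
print(f"accuracy = {p:.2%}, 95% CI ({lo:.2%}, {hi:.2%})")
```

Note that the interval reported in the abstract, (76.35%, 99.81%), comes from the clustered model and is not recoverable with this unadjusted formula.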
Affiliation(s)
- Kevin Shan: The University of Texas Southwestern Medical Center, School of Medicine, Dallas, TX, USA
- Mahi A Patel, Morgan McCreary, Tom G Punnen, Francisco Villalobos, Lauren M Tardo, Lindsay A Horton, Peter V Sguigna, Kyle M Blackburn, Shanan B Munoz, Katy W Burgess, Tatum M Moog, Darin T Okuda: The University of Texas Southwestern Medical Center, Department of Neurology, Neuroinnovation Program, Multiple Sclerosis & Neuroimmunology Imaging Program, Dallas, TX, USA; The University of Texas Southwestern Medical Center, Peter O'Donnell Jr. Brain Institute, Dallas, TX, USA
- Alexander D Smith: Texas Tech University Health Sciences Center, School of Medicine, Lubbock, TX, USA
11
Tanaka C, Kinoshita T, Okada Y, Satoh K, Homma Y, Suzuki K, Yokobori S, Oda J, Otomo Y, Tagami T. Medical validity and layperson interpretation of emergency visit recommendations by the GPT model: A cross-sectional study. Acute Med Surg 2025; 12:e70042. [PMID: 40078650 PMCID: PMC11897724 DOI: 10.1002/ams2.70042] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/21/2024] [Revised: 01/08/2025] [Accepted: 01/26/2025] [Indexed: 03/14/2025] Open
Abstract
Aim In Japan, emergency ambulance dispatches involve minor cases requiring outpatient services, emphasizing the need for improved public guidance regarding emergency care. This study evaluated both the medical plausibility of the GPT model in aiding laypersons to determine the need for emergency medical care and the laypersons' interpretations of its outputs. Methods This cross-sectional study was conducted from December 10, 2023, to March 7, 2024. We input clinical scenarios into the GPT model and evaluated the need for emergency visits based on the outputs. A total of 314 scenarios were labeled with red tags (emergency, immediate emergency department [ED] visit) and 152 with green tags (less urgent). Seven medical specialists assessed the outputs' validity, and 157 laypersons interpreted them via a web-based questionnaire. Results Experts reported that the GPT model accurately identified important information in 95.9% (301/314) of red-tagged scenarios and recommended immediate ED visits in 96.5% (303/314). However, only 43.0% (135/314) of laypersons interpreted those outputs as indicating urgent hospital visits. The model identified important information in 99.3% (151/152) of green-tagged scenarios and advised against immediate visits in 88.8% (135/152). However, only 32.2% (49/152) of laypersons considered them routine follow-ups. Conclusions Expert evaluations revealed that the GPT model could be highly accurate in advising on emergency visits. However, laypersons frequently misinterpreted its recommendations, highlighting a substantial gap in understanding AI-generated medical advice.
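The headline proportions above follow directly from the counts reported in the abstract; a minimal sketch of the arithmetic:

```python
def pct(part: int, whole: int) -> float:
    """Percentage rounded to one decimal place, as reported in the abstract."""
    return round(100 * part / whole, 1)

# Red-tagged (emergency) scenarios, n = 314
model_advised_ed_visit = pct(303, 314)   # expert-rated: immediate ED visit advised
laypersons_saw_urgency = pct(135, 314)   # laypersons read the output as urgent

# Green-tagged (less urgent) scenarios, n = 152
model_advised_against = pct(135, 152)    # advised against an immediate visit
laypersons_saw_routine = pct(49, 152)    # laypersons read the output as routine

print(model_advised_ed_visit, laypersons_saw_urgency,
      model_advised_against, laypersons_saw_routine)
```

The gap within each pair (96.5% vs. 43.0%, and 88.8% vs. 32.2%) is the expert-versus-layperson interpretation gap the authors highlight.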
Affiliation(s)
- Chie Tanaka: Department of Emergency and Critical Care Medicine, Nippon Medical School Tama Nagayama Hospital, Tokyo, Japan
- Yohei Okada: Health Services and Systems Research, Duke-NUS Medical School, Singapore, Singapore
- Kasumi Satoh: Department of Emergency and Critical Care Medicine, Akita University Graduate School of Medicine, Akita, Japan
- Yosuke Homma: Department of Emergency Medicine, Chiba Kaihin Municipal Hospital, Chiba, Japan
- Kensuke Suzuki: The Graduate School of Health and Sport Science, Nippon Sport Science University, Kanagawa, Japan
- Shoji Yokobori: Department of Emergency and Critical Care Medicine, Nippon Medical School, Tokyo, Japan
- Jun Oda: Department of Traumatology and Acute Critical Medicine, Osaka University Graduate School of Medicine, Osaka, Japan
- Yasuhiro Otomo: Department of Trauma and Critical Care Medicine, National Hospital Organization (NHO) Disaster Medical Center, Tokyo, Japan
- Takashi Tagami: Department of Emergency and Critical Care Medicine, Nippon Medical School Musashikosugi Hospital, Kanagawa, Japan
12
Su Z, Jin K, Wu H, Luo Z, Grzybowski A, Ye J. Assessment of Large Language Models in Cataract Care Information Provision: A Quantitative Comparison. Ophthalmol Ther 2025; 14:103-116. [PMID: 39516445 PMCID: PMC11724831 DOI: 10.1007/s40123-024-01066-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/27/2024] [Accepted: 10/24/2024] [Indexed: 11/16/2024] Open
Abstract
INTRODUCTION Cataracts are a significant cause of blindness. While individuals frequently turn to the Internet for medical advice, distinguishing reliable information can be challenging. Large language models (LLMs) have attracted attention for generating accurate, human-like responses that may be used for medical consultation. However, a comprehensive assessment of LLMs' accuracy within specific medical domains is still lacking. METHODS We compiled 46 commonly inquired questions related to cataract care, categorized into six domains. Each question was presented to the LLMs, and three consultant-level ophthalmologists independently assessed the accuracy of their responses on a three-point scale (poor, borderline, good) and their comprehensiveness on a five-point scale. A majority consensus approach established the final rating for each response. Responses rated as 'Poor' were prompted for self-correction and reassessed. RESULTS For accuracy, ChatGPT-4o and Google Bard both achieved average sum scores of 8.7 (out of 9), followed by ChatGPT-3.5, Bing Chat, Llama 2, and Wenxin Yiyan. In consensus-based ratings, ChatGPT-4o outperformed Google Bard in the 'Good' rating. For completeness, ChatGPT-4o had the highest average sum score of 13.22 (out of 15), followed by Google Bard, ChatGPT-3.5, Llama 2, Bing Chat, and Wenxin Yiyan. Detailed performance data reveal nuanced differences in model capabilities. In the 'Prevention' domain, all models apart from Wenxin Yiyan were rated as 'Good'. All models showed improvement on self-correction: Bard and Bing each improved their single 'Poor' response, Llama 2 improved three of four, and Wenxin Yiyan improved four of five. CONCLUSIONS Our findings emphasize the potential of LLMs, particularly ChatGPT-4o, to deliver accurate and comprehensive responses to cataract-related queries, especially in prevention, indicating potential for medical consultation. Continuous efforts to enhance LLMs' accuracy through ongoing strategies and evaluations are essential.
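The majority-consensus rating step described above can be sketched as follows. The tie-handling rule ("arbitrate") is an assumption for illustration; the abstract does not say how three-way disagreements were resolved:

```python
from collections import Counter

def final_rating(ratings: list[str]) -> str:
    """Consensus rating for one LLM response from three independent raters.

    Returns the label chosen by at least two raters; a three-way split is
    flagged for arbitration (an assumed rule, not stated in the abstract).
    """
    label, count = Counter(ratings).most_common(1)[0]
    return label if count >= 2 else "arbitrate"

print(final_rating(["good", "good", "borderline"]))  # majority label wins
print(final_rating(["poor", "borderline", "good"]))  # no majority
```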
Affiliation(s)
- Zichang Su, Kai Jin, Hongkang Wu, Ziyao Luo, Juan Ye: Eye Center, The Second Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou, 310009, China
- Ziyao Luo: Zhejiang University Chu Kochen Honors College, Hangzhou, 310009, China
- Andrzej Grzybowski: Institute for Research in Ophthalmology, Foundation for Ophthalmology Development, Poznań, Poland
13
Chang Y, Yin JM, Li JM, Liu C, Cao LY, Lin SY. Applications and Future Prospects of Medical LLMs: A Survey Based on the M-KAT Conceptual Framework. J Med Syst 2024; 48:112. [PMID: 39725770 DOI: 10.1007/s10916-024-02132-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/16/2024] [Accepted: 12/10/2024] [Indexed: 12/28/2024]
Abstract
The success of large language models (LLMs) in general domains has sparked a wave of research into their applications in the medical field. However, enhancing the medical professionalism of these models remains a major challenge. This study proposed a novel model-training theoretical framework, the M-KAT framework, which integrates domain-specific training methods for LLMs with the unique characteristics of the medical discipline. This framework aims to improve the medical professionalism of models from three perspectives: general knowledge acquisition, specialized skill development, and alignment with clinical thinking. This study summarized the outcomes of medical LLMs across four tasks: clinical diagnosis and treatment, medical question answering, medical research, and health management. Using the M-KAT framework, we analyzed how different training stages contribute to enhancing the professionalism of models. At the same time, some of the potential risks associated with medical LLMs can be addressed through pre-training, supervised fine-tuning (SFT), and model alignment built on cultivated professional capabilities. Additionally, this study identified main directions for future research on medical LLMs: advancing professional evaluation datasets and metrics tailored to the needs of medical tasks, conducting in-depth studies on medical multimodal large language models (MLLMs) capable of integrating diverse data types, and exploring the forms of medical agents and multi-agent frameworks that can interact with real healthcare environments and support clinical decision-making. It is hoped that this work can provide a reference for subsequent research.
Affiliation(s)
- Ying Chang, Jian-Ming Yin, Chang Liu, Ling-Yong Cao, Shu-Yuan Lin: School of Basic Medical Sciences, Zhejiang Chinese Medical University, 548 Binwen Road, Binjiang District, Hangzhou, 310053, China
- Jian-Min Li, Chang Liu, Shu-Yuan Lin: Gancao Doctor Chinese Medicine Artificial Intelligence Joint Engineering Center, Zhejiang Chinese Medical University, Hangzhou, China
- Chang Liu: Breast Disease Specialist Hospital of Guangdong Provincial Hospital of Chinese Medicine, Guangzhou, China
14
Kusaka S, Akitomo T, Hamada M, Asao Y, Iwamoto Y, Tachikake M, Mitsuhata C, Nomura R. Usefulness of Generative Artificial Intelligence (AI) Tools in Pediatric Dentistry. Diagnostics (Basel) 2024; 14:2818. [PMID: 39767179 PMCID: PMC11674453 DOI: 10.3390/diagnostics14242818] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/21/2024] [Revised: 12/11/2024] [Accepted: 12/12/2024] [Indexed: 01/11/2025] Open
Abstract
Background/Objectives: Generative artificial intelligence (AI) such as ChatGPT has developed rapidly in recent years, and in the medical field, its usefulness for diagnostic assistance has been reported. However, there are few reports of AI use in the dental field. Methods: We created 20 questions that we had encountered in clinical pediatric dentistry and collected the responses to these questions from three types of generative AI. The responses were evaluated on a 5-point scale by six pediatric dental specialists using the Global Quality Scale. Results: The average scores were >3 for all three generative AI tools that we tested; the overall average was 3.34. Although the responses to questions related to "consultations from guardians" or "systemic diseases" had high scores (>3.5), the score for questions related to "dental abnormalities" was 2.99, the lowest among the four categories. Conclusions: Our results show the usefulness of generative AI tools in clinical pediatric dentistry, indicating that these tools will be useful assistants in the dental field.
Affiliation(s)
- Satoru Kusaka, Meiko Tachikake: Department of Pediatric Dentistry, Hiroshima University Hospital, Hiroshima 734-8551, Japan
- Tatsuya Akitomo, Yuria Asao, Yuko Iwamoto, Chieko Mitsuhata, Ryota Nomura: Department of Pediatric Dentistry, Graduate School of Biomedical and Health Sciences, Hiroshima University, Hiroshima 734-8553, Japan
- Masakazu Hamada: Department of Oral & Maxillofacial Oncology and Surgery, Graduate School of Dentistry, The University of Osaka, Osaka 565-0871, Japan
15
Yu H, Fan L, Li L, Zhou J, Ma Z, Xian L, Hua W, He S, Jin M, Zhang Y, Gandhi A, Ma X. Large Language Models in Biomedical and Health Informatics: A Review with Bibliometric Analysis. Journal of Healthcare Informatics Research 2024; 8:658-711. [PMID: 39463859 PMCID: PMC11499577 DOI: 10.1007/s41666-024-00171-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/24/2024] [Revised: 08/16/2024] [Accepted: 08/22/2024] [Indexed: 10/29/2024]
Abstract
Large language models (LLMs) have rapidly become important tools in Biomedical and Health Informatics (BHI), potentially enabling new ways to analyze data, treat patients, and conduct research. This study aims to provide a comprehensive overview of LLM applications in BHI, highlighting their transformative potential and addressing the associated ethical and practical challenges. We reviewed 1698 research articles from January 2022 to December 2023, categorizing them by research themes and diagnostic categories. Additionally, we conducted network analysis to map scholarly collaborations and research dynamics. Our findings reveal a substantial increase in the potential applications of LLMs to a variety of BHI tasks, including clinical decision support, patient interaction, and medical document analysis. Notably, LLMs are expected to be instrumental in enhancing the accuracy of diagnostic tools and patient care protocols. The network analysis highlights dense and dynamically evolving collaborations across institutions, underscoring the interdisciplinary nature of LLM research in BHI. A significant trend was the application of LLMs in managing specific disease categories, such as mental health and neurological disorders, demonstrating their potential to influence personalized medicine and public health strategies. LLMs hold promising potential to further transform biomedical research and healthcare delivery. While promising, the ethical implications and challenges of model validation call for rigorous scrutiny to optimize their benefits in clinical settings. This survey serves as a resource for stakeholders in healthcare, including researchers, clinicians, and policymakers, to understand the current state and future potential of LLMs in BHI.
Affiliation(s)
- Huizi Yu, Lizhou Fan, Lingyao Li, Lu Xian, Sijia He: University of Michigan, Ann Arbor, MI, USA
- Zihui Ma: University of Maryland, College Park, MD, USA
- Ashvin Gandhi: University of California, Los Angeles, Los Angeles, CA, USA
- Xin Ma: Shandong University, Jinan, Shandong, China
16
Rotem R, Zamstein O, Rottenstreich M, O'Sullivan OE, O'reilly BA, Weintraub AY. The future of patient education: A study on AI-driven responses to urinary incontinence inquiries. Int J Gynaecol Obstet 2024; 167:1004-1009. [PMID: 38944693 DOI: 10.1002/ijgo.15751] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/30/2024] [Revised: 05/30/2024] [Accepted: 06/14/2024] [Indexed: 07/01/2024]
Abstract
OBJECTIVE To evaluate the effectiveness of ChatGPT in providing insights into common urinary incontinence concerns within urogynecology. By analyzing the model's responses against established benchmarks of accuracy, completeness, and safety, the study aimed to quantify its usefulness for informing patients and aiding healthcare providers. METHODS An expert-driven questionnaire was developed, inviting urogynecologists worldwide to assess ChatGPT's answers to 10 carefully selected questions on urinary incontinence (UI). These assessments focused on the accuracy of the responses, their comprehensiveness, and whether they raised any safety issues. Subsequent statistical analyses determined the average consensus among experts and identified the proportion of responses receiving favorable evaluations (a score of 4 or higher). RESULTS Of the 50 urogynecologists approached worldwide, 37 responded, offering insights into ChatGPT's responses on UI. The overall feedback averaged a score of 4.0, indicating positive acceptance. Accuracy scores averaged 3.9, with 71% rated favorably, whereas comprehensiveness scored slightly higher at 4.0, with 74% favorable ratings. Safety assessments also averaged 4.0, with 74% favorable responses. CONCLUSION This investigation underlines ChatGPT's favorable performance across the evaluated domains of accuracy, comprehensiveness, and safety within the context of UI queries. However, despite this broadly positive reception, the study also signals a clear avenue for improvement, particularly in the precision of the provided information. Refining ChatGPT's accuracy and ensuring the delivery of more pinpointed responses are essential steps toward bolstering its utility as a comprehensive educational resource for patients and a supportive tool for healthcare practitioners.
Affiliation(s)
- Reut Rotem: Department of Urogynaecology, Cork University Maternity Hospital, Cork, Ireland; Department of Obstetrics and Gynecology, Shaare Zedek Medical Center, Affiliated with the Hebrew University School of Medicine, Jerusalem, Israel
- Omri Zamstein, Adi Y Weintraub: Department of Obstetrics and Gynecology, Soroka University Medical Center, Faculty of Health Sciences, Ben-Gurion University of the Negev, Beer-Sheva, Israel
- Misgav Rottenstreich: Department of Obstetrics and Gynecology, Shaare Zedek Medical Center, Affiliated with the Hebrew University School of Medicine, Jerusalem, Israel
- Barry A O'Reilly: Department of Urogynaecology, Cork University Maternity Hospital, Cork, Ireland
17
Suárez A, Jiménez J, Llorente de Pedro M, Andreu-Vázquez C, Díaz-Flores García V, Gómez Sánchez M, Freire Y. Beyond the Scalpel: Assessing ChatGPT's potential as an auxiliary intelligent virtual assistant in oral surgery. Comput Struct Biotechnol J 2024; 24:46-52. [PMID: 38162955 PMCID: PMC10755495 DOI: 10.1016/j.csbj.2023.11.058] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/03/2023] [Revised: 11/28/2023] [Accepted: 11/28/2023] [Indexed: 01/03/2024] Open
Abstract
AI has revolutionized the way we interact with technology. Noteworthy advances in AI algorithms and large language models (LLMs) have led to the development of natural generative language (NGL) systems such as ChatGPT. Although these LLMs can simulate human conversations and generate content in real time, they face challenges related to the topicality and accuracy of the information they generate. This study aimed to assess whether ChatGPT-4 could provide accurate and reliable answers to general dentists in the field of oral surgery, and thus to explore its potential as an intelligent virtual assistant in clinical decision making in oral surgery. Thirty questions related to oral surgery were posed to ChatGPT-4, with each question repeated 30 times, yielding a total of 900 responses. Two surgeons graded the answers according to the guidelines of the Spanish Society of Oral Surgery, using a three-point Likert scale (correct, partially correct/incomplete, and incorrect). Disagreements were arbitrated by an experienced oral surgeon, who provided the final grade. Accuracy was found to be 71.7%, and the consistency of the experts' grading across iterations ranged from moderate to almost perfect. ChatGPT-4, with its potential capabilities, will inevitably be integrated into dental disciplines, including oral surgery. In the future, it could be considered an auxiliary intelligent virtual assistant, though it would never replace oral surgery experts. Proper training and expert-verified information will remain vital to the implementation of the technology. More comprehensive research is needed to ensure the safe and successful application of AI in oral surgery.
Affiliation(s)
- Ana Suárez, María Llorente de Pedro, Víctor Díaz-Flores García, Margarita Gómez Sánchez, Yolanda Freire: Department of Pre-Clinic Dentistry, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, 28670 Madrid, Spain
- Jaime Jiménez: Department of Clinic Dentistry, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, 28670 Madrid, Spain
- Cristina Andreu-Vázquez: Department of Veterinary Medicine, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, 28670 Madrid, Spain
18
Lee J, Park S, Shin J, Cho B. Analyzing evaluation methods for large language models in the medical field: a scoping review. BMC Med Inform Decis Mak 2024; 24:366. [PMID: 39614219 PMCID: PMC11606129 DOI: 10.1186/s12911-024-02709-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/19/2024] [Accepted: 10/03/2024] [Indexed: 12/01/2024] Open
Abstract
BACKGROUND Owing to the rapid growth in the popularity of Large Language Models (LLMs), various performance evaluation studies have been conducted to confirm their applicability in the medical field. However, there is still no clear framework for evaluating LLMs. OBJECTIVE This study reviews studies on LLM evaluations in the medical field and analyzes the research methods used in these studies. It aims to provide a reference for future researchers designing LLM studies. METHODS & MATERIALS We conducted a scoping review of three databases (PubMed, Embase, and MEDLINE) to identify LLM-related articles published between January 1, 2023, and September 30, 2023. We analyzed the types of methods, number of questions (queries), evaluators, repeat measurements, additional analysis methods, use of prompt engineering, and metrics other than accuracy. RESULTS A total of 142 articles met the inclusion criteria. LLM evaluation was primarily categorized as either providing test examinations (n = 53, 37.3%) or being evaluated by a medical professional (n = 80, 56.3%), with some hybrid cases (n = 5, 3.5%) or a combination of the two (n = 4, 2.8%). Most studies had 100 or fewer questions (n = 18, 29.0%), 15 (24.2%) performed repeated measurements, 18 (29.0%) performed additional analyses, and 8 (12.9%) used prompt engineering. For medical assessment, most studies used 50 or fewer queries (n = 54, 64.3%), had two evaluators (n = 43, 48.3%), and 14 (14.7%) used prompt engineering. CONCLUSIONS More research is required regarding the application of LLMs in healthcare. Although previous studies have evaluated performance, future studies will likely focus on improving performance. A well-structured methodology is required for these studies to be conducted systematically.
Collapse
Affiliation(s)
- Junbok Lee
- Institute for Innovation in Digital Healthcare, Yonsei University, Seoul, Republic of Korea
- Department of Human Systems Medicine, Seoul National University College of Medicine, Seoul, Republic of Korea
- Sungkyung Park
- Department of Bigdata AI Management Information, Seoul National University of Science and Technology, Seoul, Republic of Korea
- Jaeyong Shin
- Department of Preventive Medicine and Public Health, Yonsei University College of Medicine, 50-1, Yonsei-ro, Seodaemun-gu, Seoul, 03722, Republic of Korea
- Institute of Health Services Research, Yonsei University College of Medicine, Seoul, Korea
- Belong Cho
- Department of Human Systems Medicine, Seoul National University College of Medicine, Seoul, Republic of Korea
- Department of Family Medicine, Seoul National University Hospital, Seoul, Republic of Korea
- Department of Family Medicine, Seoul National University College of Medicine, 101 Daehak-ro, Jongno-gu, Seoul, 03080, Republic of Korea
19
Cao M, Wang Q, Zhang X, Lang Z, Qiu J, Yung PSH, Ong MTY. Large language models' performances regarding common patient questions about osteoarthritis: A comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Perplexity. J Sport Health Sci 2024:101016. [PMID: 39613294 DOI: 10.1016/j.jshs.2024.101016] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 02/27/2024] [Revised: 06/19/2024] [Accepted: 09/23/2024] [Indexed: 12/01/2024]
Abstract
BACKGROUND Large Language Models (LLMs) have gained much attention and, in part, have replaced common search engines as a popular channel for obtaining information due to their contextually relevant responses. Osteoarthritis (OA) is a common topic in musculoskeletal disorders, and patients often seek information about it online. Our study evaluated the ability of 3 LLMs (ChatGPT-3.5, ChatGPT-4.0, and Perplexity) to accurately answer common OA-related queries. METHODS We defined 6 themes (pathogenesis, risk factors, clinical presentation, diagnosis, treatment and prevention, and prognosis) based on a generalization of 25 frequently asked questions about OA. Three consultant-level orthopedic specialists independently rated the LLMs' replies on a 4-point accuracy scale. The final ratings for each response were determined using a majority consensus approach. Responses classified as "satisfactory" were evaluated for comprehensiveness on a 5-point scale. RESULTS ChatGPT-4.0 demonstrated superior accuracy, with 64% of responses rated as "excellent", compared to 40% for ChatGPT-3.5 and 28% for Perplexity (Pearson's chi-squared test with Fisher's exact test, all p < 0.001). All 3 LLM-chatbots had high mean comprehensiveness ratings (Perplexity = 3.88; ChatGPT-4.0 = 4.56; ChatGPT-3.5 = 3.96, out of a maximum score of 5). The LLM-chatbots performed reliably across domains, except for "treatment and prevention." However, ChatGPT-4.0 still outperformed ChatGPT-3.5 and Perplexity, garnering 53.8% "excellent" ratings (Pearson's chi-squared test with Fisher's exact test, all p < 0.001). CONCLUSION Our findings underscore the potential of LLMs, specifically ChatGPT-4.0 and Perplexity, to deliver accurate and thorough responses to OA-related queries. Targeted correction of specific misconceptions to improve the accuracy of LLMs remains crucial.
Collapse
Affiliation(s)
- Mingde Cao
- Department of Orthopaedics and Traumatology, Faculty of Medicine, The Chinese University of Hong Kong, Hong Kong 999077, China; Center for Neuromusculoskeletal Restorative Medicine (CNRM), The Chinese University of Hong Kong, Hong Kong 999077, China
- Qianwen Wang
- Department of Orthopaedics and Traumatology, Faculty of Medicine, The Chinese University of Hong Kong, Hong Kong 999077, China
- Xueyou Zhang
- Department of Orthopaedics and Traumatology, Faculty of Medicine, The Chinese University of Hong Kong, Hong Kong 999077, China
- Zuru Lang
- Department of Orthopaedics and Traumatology, Faculty of Medicine, The Chinese University of Hong Kong, Hong Kong 999077, China
- Jihong Qiu
- School of Exercise and Health, Shanghai University of Sport, Shanghai 200438, China
- Patrick Shu-Hang Yung
- Department of Orthopaedics and Traumatology, Faculty of Medicine, The Chinese University of Hong Kong, Hong Kong 999077, China; Center for Neuromusculoskeletal Restorative Medicine (CNRM), The Chinese University of Hong Kong, Hong Kong 999077, China
- Michael Tim-Yun Ong
- Department of Orthopaedics and Traumatology, Faculty of Medicine, The Chinese University of Hong Kong, Hong Kong 999077, China; Center for Neuromusculoskeletal Restorative Medicine (CNRM), The Chinese University of Hong Kong, Hong Kong 999077, China
20
Zhou Y, Li SJ, Tang XY, He YC, Ma HM, Wang AQ, Pei RY, Piao MH. Using ChatGPT in Nursing: Scoping Review of Current Opinions. JMIR Med Educ 2024; 10:e54297. [PMID: 39622702 PMCID: PMC11611787 DOI: 10.2196/54297] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 11/04/2023] [Revised: 07/25/2024] [Accepted: 08/19/2024] [Indexed: 12/06/2024]
Abstract
Background Since the release of ChatGPT in November 2022, this emerging technology has garnered a lot of attention in various fields, and nursing is no exception. However, to date, no study has comprehensively summarized the status and opinions of using ChatGPT across different nursing fields. Objective We aim to synthesize the status and opinions of using ChatGPT according to different nursing fields, as well as assess ChatGPT's strengths, weaknesses, and the potential impacts it may cause. Methods This scoping review was conducted following the framework of Arksey and O'Malley and guided by the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews). A comprehensive literature search was conducted in 4 web-based databases (PubMed, Embase, Web of Science, and CINAHL) to identify studies reporting the opinions of using ChatGPT in nursing fields from 2022 to September 3, 2023. The references of the included studies were screened manually to further identify relevant studies. Two authors independently conducted study screening, eligibility assessment, and data extraction. Results A total of 30 studies were included. The United States (7 studies), Canada (5 studies), and China (4 studies) were countries with the most publications. In terms of fields of concern, studies mainly focused on "ChatGPT and nursing education" (20 studies), "ChatGPT and nursing practice" (10 studies), and "ChatGPT and nursing research, writing, and examination" (6 studies). Six studies addressed the use of ChatGPT in multiple nursing fields. Conclusions As an emerging artificial intelligence technology, ChatGPT has great potential to revolutionize nursing education, nursing practice, and nursing research. However, researchers, institutions, and administrations still need to critically examine its accuracy, safety, and privacy, as well as academic misconduct and potential ethical issues that it may lead to before applying ChatGPT to practice.
Collapse
Affiliation(s)
- You Zhou
- School of Nursing, Chinese Academy of Medical Sciences, Peking Union Medical College, No. 33 Badachu Road, Shijingshan District, Beijing, 100433, China, 86 13522112889
- Si-Jia Li
- School of Nursing, Chinese Academy of Medical Sciences, Peking Union Medical College, No. 33 Badachu Road, Shijingshan District, Beijing, 100433, China, 86 13522112889
- Xing-Yi Tang
- School of Nursing, Chinese Academy of Medical Sciences, Peking Union Medical College, No. 33 Badachu Road, Shijingshan District, Beijing, 100433, China, 86 13522112889
- Yi-Chen He
- School of Nursing, Chinese Academy of Medical Sciences, Peking Union Medical College, No. 33 Badachu Road, Shijingshan District, Beijing, 100433, China, 86 13522112889
- Hao-Ming Ma
- School of Nursing, Chinese Academy of Medical Sciences, Peking Union Medical College, No. 33 Badachu Road, Shijingshan District, Beijing, 100433, China, 86 13522112889
- Ao-Qi Wang
- School of Nursing, Chinese Academy of Medical Sciences, Peking Union Medical College, No. 33 Badachu Road, Shijingshan District, Beijing, 100433, China, 86 13522112889
- Run-Yuan Pei
- School of Nursing, Chinese Academy of Medical Sciences, Peking Union Medical College, No. 33 Badachu Road, Shijingshan District, Beijing, 100433, China, 86 13522112889
- Mei-Hua Piao
- School of Nursing, Chinese Academy of Medical Sciences, Peking Union Medical College, No. 33 Badachu Road, Shijingshan District, Beijing, 100433, China, 86 13522112889
21
Zhang C, Liu S, Zhou X, Zhou S, Tian Y, Wang S, Xu N, Li W. Examining the Role of Large Language Models in Orthopedics: Systematic Review. J Med Internet Res 2024; 26:e59607. [PMID: 39546795 DOI: 10.2196/59607] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/17/2024] [Revised: 08/01/2024] [Accepted: 09/11/2024] [Indexed: 11/17/2024] Open
Abstract
BACKGROUND Large language models (LLMs) can understand natural language and generate corresponding text, images, and even videos based on prompts, which holds great potential in medical scenarios. Orthopedics is a significant branch of medicine, and orthopedic diseases contribute to a significant socioeconomic burden, which could be alleviated by the application of LLMs. Several pioneers in orthopedics have conducted research on LLMs across various subspecialties to explore their performance in addressing different issues. However, there are currently few reviews and summaries of these studies, and a systematic summary of existing research is absent. OBJECTIVE The objective of this review was to comprehensively summarize research findings on the application of LLMs in the field of orthopedics and explore the potential opportunities and challenges. METHODS PubMed, Embase, and Cochrane Library databases were searched from January 1, 2014, to February 22, 2024, with the language limited to English. The terms, which included variants of "large language model," "generative artificial intelligence," "ChatGPT," and "orthopaedics," were divided into 2 categories: large language model and orthopedics. After completing the search, the study selection process was conducted according to the inclusion and exclusion criteria. The quality of the included studies was assessed using the revised Cochrane risk-of-bias tool for randomized trials and CONSORT-AI (Consolidated Standards of Reporting Trials-Artificial Intelligence) guidance. Data extraction and synthesis were conducted after the quality assessment. RESULTS A total of 68 studies were selected. The application of LLMs in orthopedics involved the fields of clinical practice, education, research, and management. Of these 68 studies, 47 (69%) focused on clinical practice, 12 (18%) addressed orthopedic education, 8 (12%) were related to scientific research, and 1 (1%) pertained to the field of management. Of the 68 studies, only 8 (12%) recruited patients, and only 1 (1%) was a high-quality randomized controlled trial. ChatGPT was the most commonly mentioned LLM tool. There was considerable heterogeneity in the definition, measurement, and evaluation of the LLMs' performance across the different studies. For diagnostic tasks alone, the accuracy ranged from 55% to 93%. For disease classification tasks, the accuracy of ChatGPT with GPT-4 ranged from 2% to 100%. With regard to answering questions in orthopedic examinations, the scores ranged from 45% to 73.6% due to differences in models and test selections. CONCLUSIONS LLMs cannot replace orthopedic professionals in the short term. However, using LLMs as copilots could be a potential approach to effectively enhance work efficiency at present. More high-quality clinical trials are needed in the future, aiming to identify optimal applications of LLMs and advance orthopedics toward higher efficiency and precision.
Collapse
Affiliation(s)
- Cheng Zhang
- Department of Orthopaedics, Peking University Third Hospital, Beijing, China
- Engineering Research Center of Bone and Joint Precision Medicine, Ministry of Education, Beijing, China
- Beijing Key Laboratory of Spinal Disease Research, Beijing, China
- Shanshan Liu
- Department of Orthopaedics, Peking University Third Hospital, Beijing, China
- Engineering Research Center of Bone and Joint Precision Medicine, Ministry of Education, Beijing, China
- Beijing Key Laboratory of Spinal Disease Research, Beijing, China
- Xingyu Zhou
- Peking University Health Science Center, Beijing, China
- Siyu Zhou
- Department of Orthopaedics, Peking University Third Hospital, Beijing, China
- Engineering Research Center of Bone and Joint Precision Medicine, Ministry of Education, Beijing, China
- Beijing Key Laboratory of Spinal Disease Research, Beijing, China
- Yinglun Tian
- Department of Orthopaedics, Peking University Third Hospital, Beijing, China
- Engineering Research Center of Bone and Joint Precision Medicine, Ministry of Education, Beijing, China
- Beijing Key Laboratory of Spinal Disease Research, Beijing, China
- Shenglin Wang
- Department of Orthopaedics, Peking University Third Hospital, Beijing, China
- Engineering Research Center of Bone and Joint Precision Medicine, Ministry of Education, Beijing, China
- Beijing Key Laboratory of Spinal Disease Research, Beijing, China
- Nanfang Xu
- Department of Orthopaedics, Peking University Third Hospital, Beijing, China
- Engineering Research Center of Bone and Joint Precision Medicine, Ministry of Education, Beijing, China
- Beijing Key Laboratory of Spinal Disease Research, Beijing, China
- Weishi Li
- Department of Orthopaedics, Peking University Third Hospital, Beijing, China
- Engineering Research Center of Bone and Joint Precision Medicine, Ministry of Education, Beijing, China
- Beijing Key Laboratory of Spinal Disease Research, Beijing, China
22
Ezanno AC, Fougerousse AC, Pruvost-Balland C, Maccari F, Fite C. AI in Hidradenitis Suppurativa: Expert Evaluation of Patient-Facing Information. Clin Cosmet Investig Dermatol 2024; 17:2459-2464. [PMID: 39507766 PMCID: PMC11539865 DOI: 10.2147/ccid.s478309] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/14/2024] [Accepted: 10/26/2024] [Indexed: 11/08/2024]
Abstract
Purpose This study investigates the accuracy of Artificial Intelligence (AI) chatbots, ChatGPT and Bard, in providing information on Hidradenitis Suppurativa (HS), aiming to explore their potential in assisting HS patients by offering insights into symptoms, thus possibly reducing the diagnostic and treatment time gap. Patients and Methods Using questions formulated with the help of HS patient associations, both ChatGPT and Bard were assessed. Responses to these questions were evaluated by 18 HS experts. Results ChatGPT's responses were considered accurate in 86% of cases, significantly outperforming Bard, which only achieved 14% accuracy. Despite the general efficacy of ChatGPT in providing relevant information across a range of HS-related queries, both AI systems showed limitations in offering adequate advice on treatments. The study identifies a significant difference in the performance of the two AIs, emphasizing the need for improvement in AI-driven medical advice, particularly regarding treatment options. Conclusion The study highlights the potential of AI chatbots, particularly ChatGPT, in supporting HS patients by improving symptom understanding and potentially reducing the time to diagnosis and treatment. AI chatbots, while promising, cannot yet substitute for professional medical diagnosis and treatment, indicating the importance of enhancing AI capabilities for more accurate and reliable medical information dissemination.
Collapse
Affiliation(s)
- Anne-Cécile Ezanno
- Department of Digestive Surgery, Begin Military Teaching Hospital, Saint Mandé, France
- François Maccari
- Department of Dermatology, Begin Military Teaching Hospital, Saint Mandé and Medical Center, La Varenne Saint-Hilaire, France
- Charlotte Fite
- Department of Dermatology, Saint Joseph Hospital, Paris, France
- On behalf of ResoVerneuil
- Department of Digestive Surgery, Begin Military Teaching Hospital, Saint Mandé, France
- Department of Dermatology, Begin Military Teaching Hospital, Saint Mandé, France
- Department of Dermatology, University Hospital Pontchaillou, Rennes, France
- Department of Dermatology, Begin Military Teaching Hospital, Saint Mandé and Medical Center, La Varenne Saint-Hilaire, France
- Department of Dermatology, Saint Joseph Hospital, Paris, France
23
Wang K, Tan X, Nan S, Sang L, Chen H, Duan H. OLR-Net: Object Label Retrieval Network for principal diagnosis extraction. Comput Biol Med 2024; 182:109130. [PMID: 39288555 DOI: 10.1016/j.compbiomed.2024.109130] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/19/2024] [Revised: 08/26/2024] [Accepted: 09/06/2024] [Indexed: 09/19/2024]
Abstract
BACKGROUND Extracting principal diagnoses from patient discharge summaries is an essential task for the meaningful use of medical data. The extraction process, usually by medical staff, is laborious and time-consuming. Although automatic models have been proposed to retrieve principal diagnoses from medical records, many rare diagnoses and a small amount of training data per rare diagnosis present significant statistical and computational challenges. OBJECTIVE In this study, we aimed to extract principal diagnoses with limited available data. METHODS We proposed OLR-Net, the Object Label Retrieval Network, to extract principal diagnoses from discharge summaries. Our approach included semantic extraction, label localization, label retrieval, and recommendation. The semantic information of discharge summaries was mapped into the diagnoses set. Then, one-dimensional convolutional neural networks slid into the bottom-up region for diagnosis localization to enrich rare diagnoses. Finally, OLR-Net detected the principal diagnosis in the localized region. The evaluation metrics focused on the hit ratio, mean reciprocal rank, and the area under the receiver operating characteristic curve (AUROC). RESULTS 12,788 desensitized discharge summary records were collected from the oncology department at Hainan Hospital of Chinese People's Liberation Army General Hospital. We designed five distinct settings based on the number of training data per diagnosis: the full dataset, the top-50 dataset, the few-shot dataset, the one-shot dataset, and the zero-shot dataset. Our model achieved the highest HR@5 of 0.8778 and a macro-AUROC of 0.9851. On the limited-availability (few-shot and one-shot) datasets, the macro-AUROC values were 0.9833 and 0.9485, respectively. CONCLUSIONS OLR-Net has great potential for extracting principal diagnoses with limited available data through label localization and retrieval.
Collapse
Affiliation(s)
- Kai Wang
- Key Laboratory of Biomedical Engineering of Hainan Province, School of Biomedical Engineering, Hainan University, Haikou 570228, China; School of Information and Communication Engineering, Hainan University, Haikou 570228, China
- Xin Tan
- Key Laboratory of Biomedical Engineering of Hainan Province, School of Biomedical Engineering, Hainan University, Haikou 570228, China; College of Biomedical Engineering and Instrumental Science, Zhejiang University, Hangzhou 310027, China
- Shan Nan
- Key Laboratory of Biomedical Engineering of Hainan Province, School of Biomedical Engineering, Hainan University, Haikou 570228, China
- Lei Sang
- Hainan Hospital of Chinese People's Liberation Army General Hospital, Sanya 572013, China
- Han Chen
- Hainan Hospital of Chinese People's Liberation Army General Hospital, Sanya 572013, China
- Huilong Duan
- Key Laboratory of Biomedical Engineering of Hainan Province, School of Biomedical Engineering, Hainan University, Haikou 570228, China; College of Biomedical Engineering and Instrumental Science, Zhejiang University, Hangzhou 310027, China
24
Lim B, Seth I, Cuomo R, Kenney PS, Ross RJ, Sofiadellis F, Pentangelo P, Ceccaroni A, Alfano C, Rozen WM. Can AI Answer My Questions? Utilizing Artificial Intelligence in the Perioperative Assessment for Abdominoplasty Patients. Aesthetic Plast Surg 2024; 48:4712-4724. [PMID: 38898239 PMCID: PMC11645314 DOI: 10.1007/s00266-024-04157-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/23/2024] [Accepted: 05/21/2024] [Indexed: 06/21/2024]
Abstract
BACKGROUND Abdominoplasty is a common operation, used for a range of cosmetic and functional issues, often in the context of divarication of recti, significant weight loss, and after pregnancy. Despite this, patient-surgeon communication gaps can hinder informed decision-making. The integration of large language models (LLMs) in healthcare offers potential for enhancing patient information. This study evaluated the feasibility of using LLMs for answering perioperative queries. METHODS This study assessed the efficacy of four leading LLMs-OpenAI's ChatGPT-3.5, Anthropic's Claude, Google's Gemini, and Bing's CoPilot-using fifteen unique prompts. All outputs were evaluated using the Flesch-Kincaid, Flesch Reading Ease score, and Coleman-Liau index for readability assessment. The DISCERN score and a Likert scale were utilized to evaluate quality. Scores were assigned by two plastic surgical residents and then reviewed and discussed until a consensus was reached by five plastic surgeon specialists. RESULTS ChatGPT-3.5 required the highest level for comprehension, followed by Gemini, Claude, then CoPilot. Claude provided the most appropriate and actionable advice. In terms of patient-friendliness, CoPilot outperformed the rest, enhancing engagement and information comprehensiveness. ChatGPT-3.5 and Gemini offered adequate, though unremarkable, advice, employing more professional language. CoPilot uniquely included visual aids and was the only model to use hyperlinks, although these were of limited helpfulness and acceptability, and it faced limitations in responding to certain queries. CONCLUSION ChatGPT-3.5, Gemini, Claude, and Bing's CoPilot showcased differences in readability and reliability. LLMs offer unique advantages for patient care but require careful selection. Future research should integrate LLM strengths and address weaknesses for optimal patient education. LEVEL OF EVIDENCE V This journal requires that authors assign a level of evidence to each article. For a full description of these Evidence-Based Medicine ratings, please refer to the Table of Contents or the online Instructions to Authors www.springer.com/00266.
Collapse
Affiliation(s)
- Bryan Lim
- Department of Plastic Surgery, Peninsula Health, Melbourne, Victoria, 3199, Australia
- Ishith Seth
- Department of Plastic Surgery, Peninsula Health, Melbourne, Victoria, 3199, Australia
- Roberto Cuomo
- Plastic Surgery Unit, Department of Medicine, Surgery and Neuroscience, University of Siena, Siena, Italy
- Peter Sinkjær Kenney
- Department of Plastic Surgery, Vejle Hospital, Beriderbakken 4, 7100, Vejle, Denmark
- Department of Plastic and Breast Surgery, Aarhus University Hospital, Aarhus, Denmark
- Richard J Ross
- Department of Plastic Surgery, Peninsula Health, Melbourne, Victoria, 3199, Australia
- Foti Sofiadellis
- Department of Plastic Surgery, Peninsula Health, Melbourne, Victoria, 3199, Australia
- Warren Matthew Rozen
- Department of Plastic Surgery, Peninsula Health, Melbourne, Victoria, 3199, Australia
25
Küchemann S, Rau M, Schmidt A, Kuhn J. ChatGPT's quality: Reliability and validity of concept inventory items. Front Psychol 2024; 15:1426209. [PMID: 39439749 PMCID: PMC11493723 DOI: 10.3389/fpsyg.2024.1426209] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/30/2024] [Accepted: 09/19/2024] [Indexed: 10/25/2024] Open
Abstract
Introduction The recent advances of large language models (LLMs) have opened a wide range of opportunities, but at the same time, they pose numerous challenges and questions that research needs to answer. One of the main challenges is the quality and correctness of the output of LLMs, as well as students' overreliance on the output without critically reflecting on it. This poses the question of the quality of the output of LLMs in educational tasks and what students and teachers need to consider when using LLMs for creating educational items. In this work, we focus on the quality and characteristics of conceptual items developed using ChatGPT without user-generated improvements. Methods For this purpose, we optimized prompts and created 30 conceptual items in kinematics, which is a standard topic in high-school level physics. The items were rated by two independent experts. Those 15 items that received the highest rating were included in a conceptual survey. The dimensions were designed to align with the ones in the most commonly used concept inventory, the Force Concept Inventory (FCI). We administered the designed items together with the FCI to 172 first-year university students. The results show that the ChatGPT items have a medium difficulty and discriminatory index, but overall exhibit slightly lower average values than the FCI. Moreover, a confirmatory factor analysis confirmed a three-factor model that is closely aligned with a previously suggested expert model. Results and discussion In this way, after careful prompt engineering and thorough analysis and selection of fully automatically generated items by ChatGPT, we were able to create concept items of only slightly lower quality than carefully human-generated concept items. The procedures to create and select such a high-quality set of fully automatically generated items require substantial effort and point toward the cognitive demands placed on teachers when using LLMs to create items. Moreover, the results demonstrate that human oversight or student interviews are necessary when creating one-dimensional assessments and distractors that are closely aligned with students' difficulties.
Collapse
Affiliation(s)
- Stefan Küchemann
- Chair of Physics Education Research, Faculty of Physics, Ludwig-Maximilians-Universität München (LMU Munich), Munich, Germany
- Martina Rau
- Chair of Research on Learning and Instruction, Department of Humanities, Social and Political Sciences, ETH Zurich, Zurich, Switzerland
- Albrecht Schmidt
- Chair for Human-Centered Ubiquitous Media, Institute of Informatics, Ludwig-Maximilians-Universität München (LMU Munich), Munich, Germany
- Jochen Kuhn
- Chair of Physics Education Research, Faculty of Physics, Ludwig-Maximilians-Universität München (LMU Munich), Munich, Germany
26
Armbruster J, Bussmann F, Rothhaas C, Titze N, Grützner PA, Freischmidt H. "Doctor ChatGPT, Can You Help Me?" The Patient's Perspective: Cross-Sectional Study. J Med Internet Res 2024; 26:e58831. [PMID: 39352738 PMCID: PMC11480680 DOI: 10.2196/58831] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/27/2024] [Revised: 07/12/2024] [Accepted: 08/01/2024] [Indexed: 10/03/2024] Open
Abstract
BACKGROUND Artificial intelligence and the language models derived from it, such as ChatGPT, offer immense possibilities, particularly in the field of medicine. It is already evident that ChatGPT can provide adequate and, in some cases, expert-level responses to health-related queries and advice for patients. However, it is currently unknown how patients perceive these capabilities, whether they can derive benefit from them, and whether potential risks, such as harmful suggestions, are detected by patients. OBJECTIVE This study aims to clarify whether patients can get useful and safe health care advice from an artificial intelligence chatbot assistant. METHODS This cross-sectional study was conducted using 100 publicly available health-related questions from 5 medical specialties (trauma, general surgery, otolaryngology, pediatrics, and internal medicine) from a web-based platform for patients. Responses generated by ChatGPT-4.0 and by an expert panel (EP) of experienced physicians from the aforementioned web-based platform were packed into 10 sets consisting of 10 questions each. The blinded evaluation was carried out by patients regarding empathy and usefulness (assessed through the question: "Would this answer have helped you?") on a scale from 1 to 5. As a control, evaluation was also performed by 3 physicians in each respective medical specialty, who were additionally asked about the potential harm of the response and its correctness. RESULTS In total, 200 sets of questions were submitted by 64 patients (mean 45.7, SD 15.9 years; 29/64, 45.3% male), resulting in 2000 evaluated answers each for ChatGPT and the EP. ChatGPT scored higher in terms of empathy (4.18 vs 2.7; P<.001) and usefulness (4.04 vs 2.98; P<.001). Subanalysis revealed a small bias in terms of levels of empathy given by women in comparison with men (4.46 vs 4.14; P=.049). Ratings of ChatGPT were high regardless of the participant's age. The same highly significant results were observed in the evaluations by the respective specialist physicians, and ChatGPT also performed significantly better on correctness (4.51 vs 3.55; P<.001). Specialists rated the usefulness (3.93 vs 4.59) and correctness (4.62 vs 3.84) significantly lower for potentially harmful responses from ChatGPT (P<.001). This was not the case among patients. CONCLUSIONS The results indicate that ChatGPT is capable of supporting patients in health-related queries better than physicians, at least in terms of written advice through a web-based platform. In this study, ChatGPT's responses had a lower percentage of potentially harmful advice than the web-based EP. However, it is crucial to note that this finding is based on a specific study design and may not generalize to all health care settings. Alarmingly, patients are not able to independently recognize these potential dangers.
Affiliation(s)
- Jonas Armbruster, Florian Bussmann, Catharina Rothhaas, Nadine Titze, Paul Alfred Grützner, Holger Freischmidt: Department of Trauma and Orthopedic Surgery, BG Klinik Ludwigshafen, Ludwigshafen am Rhein, Germany
27
Ghilzai U, Fiedler B, Ghali A, Singh A, Cass B, Young A, Ahmed AS. ChatGPT provides acceptable responses to patient questions regarding common shoulder pathology. Shoulder Elbow 2024:17585732241283971. [PMID: 39545009 PMCID: PMC11559869 DOI: 10.1177/17585732241283971]
Abstract
Background ChatGPT is rapidly becoming a source of medical knowledge for patients. This study aims to assess the completeness and accuracy of ChatGPT's answers to the most frequently asked patients' questions about shoulder pathology. Methods ChatGPT (version 3.5) was queried to produce the five most common shoulder pathologies: biceps tendonitis, rotator cuff tears, shoulder arthritis, shoulder dislocation and adhesive capsulitis. Subsequently, it generated the five most common patient questions regarding these pathologies and was queried to respond. Responses were evaluated by three shoulder and elbow fellowship-trained orthopedic surgeons with a mean of 9 years of independent practice, on Likert scales for accuracy (1-6) and completeness (rated 1-3). Results For all questions, responses were deemed acceptable, rated at least "nearly all correct," indicated by a score of 5 or greater for accuracy, and "adequately complete," indicated by a minimum of 2 for completeness. The mean scores for accuracy and completeness, respectively, were 5.5 and 2.6 for rotator cuff tears, 5.8 and 2.7 for shoulder arthritis, 5.5 and 2.3 for shoulder dislocations, 5.1 and 2.4 for adhesive capsulitis, 5.8 and 2.9 for biceps tendonitis. Conclusion ChatGPT provides both accurate and complete responses to the most common patients' questions about shoulder pathology. These findings suggest that Large Language Models might play a role as a patient resource; however, patients should always verify online information with their physician. Level of Evidence Level V Expert Opinion.
Affiliation(s)
- Umar Ghilzai: Baylor College of Medicine, Department of Orthopedic Surgery, Houston, TX, USA
- Benjamin Fiedler: Baylor College of Medicine, Department of Orthopedic Surgery, Houston, TX, USA
- Abdullah Ghali: Baylor College of Medicine, Department of Orthopedic Surgery, Houston, TX, USA
- Aaron Singh: UT Health San Antonio, Department of Orthopaedics, San Antonio, TX, USA
- Benjamin Cass: Sydney Shoulder Research Institute, Sydney Shoulder Specialists, Greenwich, New South Wales, Australia
- Allan Young: Sydney Shoulder Research Institute, Sydney Shoulder Specialists, Greenwich, New South Wales, Australia
- Adil Shahzad Ahmed: Baylor College of Medicine, Department of Orthopedic Surgery, Houston, TX, USA
28
Quinn M, Milner JD, Schmitt P, Morrissey P, Lemme N, Marcaccio S, DeFroda S, Tabaddor R, Owens BD. Artificial Intelligence Large Language Models Address Anterior Cruciate Ligament Reconstruction: Superior Clarity and Completeness by Gemini Compared With ChatGPT-4 in Response to American Academy of Orthopaedic Surgeons Clinical Practice Guidelines. Arthroscopy 2024:S0749-8063(24)00736-9. [PMID: 39313138 DOI: 10.1016/j.arthro.2024.09.020]
Abstract
PURPOSE To assess the ability of ChatGPT-4 and Gemini to generate accurate and relevant responses to the 2022 American Academy of Orthopaedic Surgeons (AAOS) Clinical Practice Guidelines (CPG) for anterior cruciate ligament reconstruction (ACLR). METHODS Responses from ChatGPT-4 and Gemini to prompts derived from all 15 AAOS guidelines were evaluated by 7 fellowship-trained orthopaedic sports medicine surgeons using a structured questionnaire assessing 5 key characteristics on a scale from 1 to 5. The prompts were categorized into 3 areas: diagnosis and preoperative management, surgical timing and technique, and rehabilitation and prevention. Statistical analysis included mean scoring, standard deviation, and 2-sided t tests to compare the performance between the 2 large language models (LLMs). Scores were then evaluated for inter-rater reliability (IRR). RESULTS Overall, both LLMs performed well, with mean scores >4 for the 5 key characteristics. Gemini demonstrated superior performance in overall clarity (4.848 ± 0.36 vs 4.743 ± 0.481, P = .034), but all other characteristics demonstrated nonsignificant differences (P > .05). Gemini also demonstrated superior clarity in the surgical timing and technique (P = .038) as well as the prevention and rehabilitation (P = .044) subcategories. Additionally, Gemini had superior completeness scores in the rehabilitation and prevention subcategory (P = .044), but no statistically significant differences were found amongst the other subcategories. The overall IRR was found to be 0.71 (moderate). CONCLUSIONS Both Gemini and ChatGPT-4 demonstrate an overall good ability to generate accurate and relevant responses to question prompts based on the 2022 AAOS CPG for ACLR. However, Gemini demonstrated superior clarity in multiple domains in addition to superior completeness for questions pertaining to rehabilitation and prevention.
CLINICAL RELEVANCE The current study addresses a current gap in the LLM and ACLR literature by comparing the performance of ChatGPT-4 to Gemini, which is growing in popularity with more than 300 million individual uses in May 2024 alone. Moreover, the results demonstrated superior performance of Gemini in both clarity and completeness, which are critical elements of a tool being used by patients for educational purposes. Additionally, the current study uses question prompts based on the AAOS CPG, which may be used as a method of standardization for future investigations on performance of LLM platforms. Thus, the results of this study may be of interest to both the readership of Arthroscopy and patients.
Affiliation(s)
- Matthew Quinn: Department of Orthopaedics, The Warren Alpert Medical School of Brown University, Providence, Rhode Island, USA
- John D Milner: Department of Orthopaedics, The Warren Alpert Medical School of Brown University, Providence, Rhode Island, USA
- Phillip Schmitt: The Warren Alpert Medical School of Brown University, Providence, Rhode Island, USA
- Patrick Morrissey: Department of Orthopaedics, The Warren Alpert Medical School of Brown University, Providence, Rhode Island, USA
- Nicholas Lemme: Department of Orthopaedics, The Warren Alpert Medical School of Brown University, Providence, Rhode Island, USA
- Stephen Marcaccio: Department of Orthopaedic Surgery, University of Pittsburgh Medical Center, Pittsburgh, Pennsylvania, USA
- Steven DeFroda: Department of Orthopaedic Surgery, Missouri Orthopaedic Institute, University of Missouri, Columbia, Missouri, USA
- Ramin Tabaddor: Department of Orthopaedics, The Warren Alpert Medical School of Brown University, Providence, Rhode Island, USA
- Brett D Owens: Department of Orthopaedics, The Warren Alpert Medical School of Brown University, Providence, Rhode Island, USA
29
Kunze KN. Editorial Commentary: The Scope of Medical Research Concerning ChatGPT Remains Limited by Lack of Originality. Arthroscopy 2024:S0749-8063(24)00679-0. [PMID: 39278424 DOI: 10.1016/j.arthro.2024.09.013]
Abstract
There is no shortage of literature surrounding ChatGPT and whether this large language model can provide accurate and clinically relevant information in response to simulated patient queries. Unfortunately, there is a shortage of literature addressing important considerations beyond these experimental and entertaining uses. Indeed, a trend for redundancy has emerged where most of the literature has applied ChatGPT to the same tasks while simply swapping the subject matter, resulting in a failure to expand the impact and reach of this potentially transformational artificial intelligence (AI) solution. Instead, research addressing pressing health care challenges and a renewed focus on novel use cases will allow for more meaningful research initiatives, product development, and tangible changes at both the system and point-of-care levels. Current target areas of interest in medicine that remain obstacles to patient care include prior authorization, administrative burden, documentation generation, medical triage and diagnosis, and patient communication efficiency. To advance this area of research toward such meaningful applications, a structured framework is necessary. Such frameworks should include problem identification; definition of key performance indicators; multidisciplinary and multi-institutional collaboration of those with domain expertise, including AI engineers and information technology specialists; policy and strategy development driven by executive-level personnel; institutional financial support and investment from key stakeholders for AI infrastructure and maintenance; and critical assessment of AI performance, bias, and equity.
30
Liu J, Liang X, Fang D, Zheng J, Yin C, Xie H, Li Y, Sun X, Tong Y, Che H, Hu P, Yang F, Wang B, Chen Y, Cheng G, Zhang J. The Diagnostic Ability of GPT-3.5 and GPT-4.0 in Surgery: Comparative Analysis. J Med Internet Res 2024; 26:e54985. [PMID: 39255016 PMCID: PMC11422746 DOI: 10.2196/54985]
Abstract
BACKGROUND ChatGPT (OpenAI) has shown great potential in clinical diagnosis and could become an excellent auxiliary tool in clinical practice. This study investigates and evaluates the diagnostic capabilities of ChatGPT by comparing the performance of GPT-3.5 and GPT-4.0 across model iterations. OBJECTIVE This study aims to evaluate the precise diagnostic ability of GPT-3.5 and GPT-4.0 for colon cancer and its potential as an auxiliary diagnostic tool for surgeons and compare the diagnostic accuracy rates between GPT-3.5 and GPT-4.0. We precisely assess the accuracy of primary and secondary diagnoses and analyze the causes of misdiagnoses in GPT-3.5 and GPT-4.0 according to 7 categories: patient histories, symptoms, physical signs, laboratory examinations, imaging examinations, pathological examinations, and intraoperative findings. METHODS We retrieved 316 case reports for intestinal cancer from the Chinese Medical Association Publishing House database, of which 286 cases were deemed valid after data cleansing. The cases were translated from Mandarin to English and then input into GPT-3.5 and GPT-4.0 using a simple, direct prompt to elicit primary and secondary diagnoses. We conducted a comparative study to evaluate the diagnostic accuracy of GPT-4.0 and GPT-3.5. Three senior surgeons from the General Surgery Department, specializing in Colorectal Surgery, assessed the diagnostic information at the Chinese PLA (People's Liberation Army) General Hospital. The accuracy of primary and secondary diagnoses was scored based on predefined criteria. Additionally, we analyzed and compared the causes of misdiagnoses in both models according to 7 categories: patient histories, symptoms, physical signs, laboratory examinations, imaging examinations, pathological examinations, and intraoperative findings.
RESULTS Out of 286 cases, GPT-4.0 and GPT-3.5 both demonstrated high diagnostic accuracy for primary diagnoses, but the accuracy rates of GPT-4.0 were significantly higher than GPT-3.5 (mean 0.972, SD 0.137 vs mean 0.855, SD 0.335; t285=5.753; P<.001). For secondary diagnoses, the accuracy rates of GPT-4.0 were also significantly higher than GPT-3.5 (mean 0.908, SD 0.159 vs mean 0.617, SD 0.349; t285=-7.727; P<.001). GPT-3.5 showed limitations in processing patient history, symptom presentation, laboratory tests, and imaging data. While GPT-4.0 improved upon GPT-3.5, it still has limitations in identifying symptoms and laboratory test data. For both primary and secondary diagnoses, there was no significant difference in accuracy related to age, gender, or system group between GPT-4.0 and GPT-3.5. CONCLUSIONS This study demonstrates that ChatGPT, particularly GPT-4.0, possesses significant diagnostic potential, with GPT-4.0 exhibiting higher accuracy than GPT-3.5. However, GPT-4.0 still has limitations, particularly in recognizing patient symptoms and laboratory data, indicating a need for more research in real-world clinical settings to enhance its diagnostic capabilities.
Affiliation(s)
- Jiayu Liu: Department of Neurosurgery, The First Medical Centre, Chinese PLA General Hospital, Beijing, China
- Xiuting Liang: Department of Respiratory and Critical Care Medicine, The First Medical Centre, Chinese PLA General Hospital, Beijing, China
- Dandong Fang: Department of Neurosurgery, Sanmenxia Central Hospital, Sanmenxia, China
- Jiqi Zheng: School of Health Humanities, Peking University, Beijing, China
- Chengliang Yin: Medical Innovation Research Division, Chinese People's Liberation Army General Hospital, Beijing, China; National Engineering Research Center for Medical Big Data Application Technology, Chinese People's Liberation Army General Hospital, Beijing, China
- Hui Xie: Department of Urology, The First Affiliated Hospital of Fujian Medical University, Fuzhou, China
- Yanteng Li: Department of Neurosurgery, The First Medical Centre, Chinese PLA General Hospital, Beijing, China
- Xiaochun Sun: Medical Innovation Research Division, Chinese People's Liberation Army General Hospital, Beijing, China; National Engineering Research Center for Medical Big Data Application Technology, Chinese People's Liberation Army General Hospital, Beijing, China
- Yue Tong: Medical Innovation Research Division, Chinese People's Liberation Army General Hospital, Beijing, China; National Engineering Research Center for Medical Big Data Application Technology, Chinese People's Liberation Army General Hospital, Beijing, China
- Hebin Che: Medical Innovation Research Division, Chinese People's Liberation Army General Hospital, Beijing, China; National Engineering Research Center for Medical Big Data Application Technology, Chinese People's Liberation Army General Hospital, Beijing, China
- Ping Hu: Medical Innovation Research Division, Chinese People's Liberation Army General Hospital, Beijing, China; National Engineering Research Center for Medical Big Data Application Technology, Chinese People's Liberation Army General Hospital, Beijing, China
- Fan Yang: Department of Neurosurgery, The First Medical Centre, Chinese PLA General Hospital, Beijing, China
- Bingxian Wang: Department of Neurosurgery, The First Medical Centre, Chinese PLA General Hospital, Beijing, China
- Yuanyuan Chen: Medical Innovation Research Division, Chinese People's Liberation Army General Hospital, Beijing, China; National Engineering Research Center for Medical Big Data Application Technology, Chinese People's Liberation Army General Hospital, Beijing, China
- Gang Cheng: Department of Neurosurgery, The First Medical Centre, Chinese PLA General Hospital, Beijing, China
- Jianning Zhang: Department of Neurosurgery, The First Medical Centre, Chinese PLA General Hospital, Beijing, China
31
Si Y, Yang Y, Wang X, Zu J, Chen X, Fan X, An R, Gong S. Quality and Accountability of ChatGPT in Health Care in Low- and Middle-Income Countries: Simulated Patient Study. J Med Internet Res 2024; 26:e56121. [PMID: 39250188 PMCID: PMC11420570 DOI: 10.2196/56121]
Abstract
Using simulated patients to mimic 9 established noncommunicable and infectious diseases, we assessed ChatGPT's performance in treatment recommendations for common diseases in low- and middle-income countries. ChatGPT had a high level of accuracy in both correct diagnoses (20/27, 74%) and medication prescriptions (22/27, 82%) but a concerning level of unnecessary or harmful medications (23/27, 85%) even with correct diagnoses. ChatGPT performed better in managing noncommunicable diseases than infectious ones. These results highlight the need for cautious AI integration in health care systems to ensure quality and safety.
Affiliation(s)
- Yafei Si: UNSW Business School and CEPAR, The University of New South Wales, Kensington, Australia
- Yuyi Yang: Division of Computational and Data Sciences, Washington University in St Louis, St Louis, MO, United States
- Xi Wang: Brown School, Washington University in St Louis, St Louis, MO, United States
- Jiaqi Zu: Global Health Research Center, Duke Kunshan University, Kunshan, China
- Xi Chen: Department of Health Policy and Management, Yale University, New Haven, CT, United States; Department of Economics, Yale University, New Haven, CT, United States
- Xiaojing Fan: School of Public Policy and Administration, Xi'an Jiaotong University, Xi'an, China
- Ruopeng An: Brown School, Washington University in St Louis, St Louis, MO, United States; Silver School of Social Work, New York University, New York, NY, United States
- Sen Gong: Centre for International Studies on Development and Governance, Zhejiang University, Hangzhou, China
32
Warrier A, Singh R, Haleem A, Zaki H, Eloy JA. The Comparative Diagnostic Capability of Large Language Models in Otolaryngology. Laryngoscope 2024; 134:3997-4002. [PMID: 38563415 DOI: 10.1002/lary.31434]
Abstract
OBJECTIVES Evaluate and compare the ability of large language models (LLMs) to diagnose various ailments in otolaryngology. METHODS We collected all 100 clinical vignettes from the second edition of Otolaryngology Cases-The University of Cincinnati Clinical Portfolio by Pensak et al. With the addition of the prompt "Provide a diagnosis given the following history," we prompted ChatGPT-3.5, Google Bard, and Bing-GPT4 to provide a diagnosis for each vignette. These diagnoses were compared to the portfolio for accuracy and recorded. All queries were run in June 2023. RESULTS ChatGPT-3.5 was the most accurate model (89% success rate), followed by Google Bard (82%) and Bing GPT (74%). A chi-squared test revealed a significant difference between the three LLMs in providing correct diagnoses (p = 0.023). Of the 100 vignettes, seven require additional testing results (i.e., biopsy, non-contrast CT) for accurate clinical diagnosis. When omitting these vignettes, the revised success rates were 95.7% for ChatGPT-3.5, 88.17% for Google Bard, and 78.72% for Bing-GPT4 (p = 0.002). CONCLUSIONS ChatGPT-3.5 offers the most accurate diagnoses when given established clinical vignettes as compared to Google Bard and Bing-GPT4. LLMs may accurately offer assessments for common otolaryngology conditions but currently require detailed prompt information and critical supervision from clinicians. There is vast potential in the clinical applicability of LLMs; however, practitioners should be wary of possible "hallucinations" and misinformation in responses. LEVEL OF EVIDENCE 3 Laryngoscope, 134:3997-4002, 2024.
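The significance test reported above can be sanity-checked from the counts given in the abstract (89/100, 82/100, and 74/100 correct diagnoses). The following pure-stdlib Python sketch of a Pearson chi-squared test on that 2 x 3 table is illustrative only (the `chi2_2xk` helper is not from the study); it reproduces the reported p = 0.023:

```python
import math

def chi2_2xk(correct, total):
    """Pearson chi-squared statistic for a 2 x k table of correct/incorrect counts."""
    incorrect = [t - c for c, t in zip(correct, total)]
    grand = sum(total)
    p_correct = sum(correct) / grand
    stat = 0.0
    for c, i, t in zip(correct, incorrect, total):
        exp_c = t * p_correct          # expected correct under the null
        exp_i = t * (1 - p_correct)    # expected incorrect under the null
        stat += (c - exp_c) ** 2 / exp_c + (i - exp_i) ** 2 / exp_i
    return stat, len(correct) - 1

# ChatGPT-3.5, Google Bard, Bing-GPT4 correct diagnoses out of 100 vignettes each
stat, df = chi2_2xk([89, 82, 74], [100, 100, 100])
# For df = 2 the chi-squared survival function is exactly exp(-x/2)
p_value = math.exp(-stat / 2)
print(round(stat, 2), round(p_value, 3))  # 7.53 0.023
```

With SciPy available, `scipy.stats.chi2_contingency` on the same table yields the identical statistic; the closed-form `exp(-x/2)` shortcut for the p-value holds only at 2 degrees of freedom.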
Affiliation(s)
- Akshay Warrier: Department of Otolaryngology-Head and Neck Surgery, Rutgers New Jersey Medical School, Newark, New Jersey, USA
- Rohan Singh: Department of Otolaryngology-Head and Neck Surgery, Rutgers New Jersey Medical School, Newark, New Jersey, USA
- Afash Haleem: Department of Otolaryngology-Head and Neck Surgery, Rutgers New Jersey Medical School, Newark, New Jersey, USA
- Haider Zaki: Department of Otolaryngology-Head and Neck Surgery, Rutgers New Jersey Medical School, Newark, New Jersey, USA
- Jean Anderson Eloy: Department of Otolaryngology-Head and Neck Surgery, Rutgers New Jersey Medical School, Newark, New Jersey, USA; Center for Skull Base and Pituitary Surgery, Neurological Institute of New Jersey, Rutgers New Jersey Medical School, Newark, New Jersey, USA
33
Hwai H, Ho YJ, Wang CH, Huang CH. Large language model application in emergency medicine and critical care. J Formos Med Assoc 2024:S0929-6646(24)00400-5. [PMID: 39198112 DOI: 10.1016/j.jfma.2024.08.032]
Abstract
In the rapidly evolving healthcare landscape, artificial intelligence (AI), particularly the large language models (LLMs), like OpenAI's Chat Generative Pretrained Transformer (ChatGPT), has shown transformative potential in emergency medicine and critical care. This review article highlights the advancement and applications of ChatGPT, from diagnostic assistance to clinical documentation and patient communication, demonstrating its ability to perform comparably to human professionals in medical examinations. ChatGPT could assist clinical decision-making and medication selection in critical care, showcasing its potential to optimize patient care management. However, integrating LLMs into healthcare raises legal, ethical, and privacy concerns, including data protection and the necessity for informed consent. Finally, we addressed the challenges related to the accuracy of LLMs, such as the risk of providing incorrect medical advice. These concerns underscore the importance of ongoing research and regulation to ensure their ethical and practical use in healthcare.
Affiliation(s)
- Haw Hwai, Yi-Ju Ho, Chih-Hung Wang, Chien-Hua Huang: Department of Emergency Medicine, National Taiwan University Hospital, National Taiwan University Medical College, Taipei, Taiwan
34
Goumas G, Dardavesis TI, Syrigos K, Syrigos N, Simou E. Chatbots in Cancer Applications, Advantages and Disadvantages: All that Glitters Is Not Gold. J Pers Med 2024; 14:877. [PMID: 39202068 PMCID: PMC11355580 DOI: 10.3390/jpm14080877]
Abstract
The emergence of digitalization and artificial intelligence has had a profound impact on society, especially in the field of medicine. Digital health is now a reality, with an increasing number of people using chatbots for prognostic or diagnostic purposes, therapeutic planning, and monitoring, as well as for nutritional and mental health support. Initially designed for various purposes, chatbots have demonstrated significant advantages in the medical field, as indicated by multiple sources. However, there are conflicting views in the current literature, with some sources highlighting their drawbacks and limitations, particularly in their use in oncology. This state-of-the-art review article seeks to present both the benefits and the drawbacks of chatbots in the context of medicine and cancer, while also addressing the challenges in their implementation, offering expert insights on the subject.
Affiliation(s)
- Georgios Goumas: Department of Public Health Policy, School of Public Health, University of West Attica, 115 21 Athens, Greece
- Theodoros I. Dardavesis: Laboratory of Hygiene, Social & Preventive Medicine and Medical Statistics, School of Medicine, Faculty of Health Sciences, Aristotle University of Thessaloniki, 541 24 Thessaloniki, Greece
- Konstantinos Syrigos: Oncology Unit, 3rd Department of Medicine, “Sotiria” Hospital for Diseases of the Chest, National and Kapodistrian University of Athens, 115 27 Athens, Greece
- Nikolaos Syrigos: Oncology Unit, 3rd Department of Medicine, “Sotiria” Hospital for Diseases of the Chest, National and Kapodistrian University of Athens, 115 27 Athens, Greece; Dana-Farber Cancer Institute, Boston, MA 02215, USA
- Effie Simou: Department of Public Health Policy, School of Public Health, University of West Attica, 115 21 Athens, Greece
35
Langston E, Charness N, Boot W. Are Virtual Assistants Trustworthy for Medicare Information: An Examination of Accuracy and Reliability. The Gerontologist 2024; 64:gnae062. [PMID: 38832398 PMCID: PMC11258897 DOI: 10.1093/geront/gnae062]
Abstract
BACKGROUND AND OBJECTIVES Advances in artificial intelligence (AI)-based virtual assistants provide a potential opportunity for older adults to use this technology in the context of health information-seeking. Meta-analysis on trust in AI shows that users are influenced by the accuracy and reliability of the AI trustee. We evaluated these dimensions for responses to Medicare queries. RESEARCH DESIGN AND METHODS During the summer of 2023, we assessed the accuracy and reliability of Alexa, Google Assistant, Bard, and ChatGPT-4 on Medicare terminology and general content from a large, standardized question set. We compared the accuracy of these AI systems to that of a large representative sample of Medicare beneficiaries who were queried twenty years prior. RESULTS Alexa and Google Assistant were found to be highly inaccurate when compared to beneficiaries' mean accuracy of 68.4% on terminology queries and 53.0% on general Medicare content. Bard and ChatGPT-4 answered Medicare terminology queries perfectly and performed much better on general Medicare content queries (Bard = 96.3%, ChatGPT-4 = 92.6%) than the average Medicare beneficiary. About one month to a month-and-a-half later, we found that Bard and Alexa's accuracy stayed the same, whereas ChatGPT-4's performance nominally decreased, and Google Assistant's performance nominally increased. DISCUSSION AND IMPLICATIONS LLM-based assistants generate trustworthy information in response to carefully phrased queries about Medicare, in contrast to Alexa and Google Assistant. Further studies will be needed to determine what factors beyond accuracy and reliability influence the adoption and use of such technology for Medicare decision-making.
Affiliation(s)
- Emily Langston, Neil Charness, Walter Boot: Department of Psychology, Florida State University, Tallahassee, Florida, USA
36
Liu M, Okuhara T, Chang X, Shirabe R, Nishiie Y, Okada H, Kiuchi T. Performance of ChatGPT Across Different Versions in Medical Licensing Examinations Worldwide: Systematic Review and Meta-Analysis. J Med Internet Res 2024; 26:e60807. [PMID: 39052324 PMCID: PMC11310649 DOI: 10.2196/60807]
Abstract
BACKGROUND Over the past 2 years, researchers have used various medical licensing examinations to test whether ChatGPT (OpenAI) possesses accurate medical knowledge. The performance of each version of ChatGPT on the medical licensing examination in multiple environments showed remarkable differences. At this stage, there is still a lack of a comprehensive understanding of the variability in ChatGPT's performance on different medical licensing examinations. OBJECTIVE In this study, we reviewed all studies on ChatGPT performance in medical licensing examinations up to March 2024. This review aims to contribute to the evolving discourse on artificial intelligence (AI) in medical education by providing a comprehensive analysis of the performance of ChatGPT in various environments. The insights gained from this systematic review will guide educators, policymakers, and technical experts to effectively and judiciously use AI in medical education. METHODS We searched the literature published between January 1, 2022, and March 29, 2024, by searching query strings in Web of Science, PubMed, and Scopus. Two authors screened the literature according to the inclusion and exclusion criteria, extracted data, and independently assessed the quality of the literature concerning Quality Assessment of Diagnostic Accuracy Studies-2. We conducted both qualitative and quantitative analyses. RESULTS A total of 45 studies on the performance of different versions of ChatGPT in medical licensing examinations were included in this study. GPT-4 achieved an overall accuracy rate of 81% (95% CI 78-84; P<.01), significantly surpassing the 58% (95% CI 53-63; P<.01) accuracy rate of GPT-3.5. GPT-4 passed the medical examinations in 26 of 29 cases, outperforming the average scores of medical students in 13 of 17 cases. Translating the examination questions into English improved GPT-3.5's performance but did not affect GPT-4. 
GPT-3.5 showed no difference in performance between examinations from English-speaking and non-English-speaking countries (P=.72), but GPT-4 performed significantly better on examinations from English-speaking countries (P=.02). Any type of prompt significantly improved GPT-3.5's (P=.03) and GPT-4's (P<.01) performance. GPT-3.5 performed better on short-text questions than on long-text questions. The difficulty of the questions affected the performance of both GPT-3.5 and GPT-4. In image-based multiple-choice questions (MCQs), ChatGPT's accuracy rate ranged from 13.1% to 100%. ChatGPT performed significantly worse on open-ended questions than on MCQs. CONCLUSIONS GPT-4 demonstrates considerable potential for future use in medical education. However, due to its insufficient accuracy, inconsistent performance, and the challenges posed by differing medical policies and knowledge across countries, GPT-4 is not yet suitable for use in medical education. TRIAL REGISTRATION PROSPERO CRD42024506687; https://www.crd.york.ac.uk/prospero/display_record.php?RecordID=506687.
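The pooled accuracy rates above come from random-effects meta-analysis of per-study proportions. As a minimal sketch of that computation (a DerSimonian-Laird pooling with invented study counts, not the review's actual data or code):

```python
import math

def pooled_accuracy(studies):
    """DerSimonian-Laird random-effects pooling of accuracy proportions.

    `studies` is a list of (n_correct, n_total) pairs. Illustrative only:
    extreme proportions (0% or 100%) would need a continuity correction.
    """
    p = [c / n for c, n in studies]                            # per-study accuracy
    v = [pi * (1 - pi) / n for pi, (_, n) in zip(p, studies)]  # normal-approx. variance
    w = [1.0 / vi for vi in v]                                 # fixed-effect weights

    # Fixed-effect estimate and Cochran's Q heterogeneity statistic.
    p_fixed = sum(wi * pi for wi, pi in zip(w, p)) / sum(w)
    q = sum(wi * (pi - p_fixed) ** 2 for wi, pi in zip(w, p))

    # Between-study variance tau^2 (DerSimonian-Laird estimator).
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(studies) - 1)) / c)

    # Random-effects pooled estimate with a 95% Wald interval.
    w_re = [1.0 / (vi + tau2) for vi in v]
    p_re = sum(wi * pi for wi, pi in zip(w_re, p)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    return p_re, (p_re - 1.96 * se, p_re + 1.96 * se)
```

With homogeneous studies tau^2 collapses to zero and the estimate reduces to the fixed-effect mean; the review's logit-transformed pooling may differ in detail.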
Affiliation(s)
- Mingxin Liu
- Department of Health Communication, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
- Tsuyoshi Okuhara
- Department of Health Communication, School of Public Health, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
- XinYi Chang
- Department of Industrial Engineering and Economics, School of Engineering, Tokyo Institute of Technology, Tokyo, Japan
- Ritsuko Shirabe
- Department of Health Communication, School of Public Health, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
- Yuriko Nishiie
- Department of Health Communication, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
- Hiroko Okada
- Department of Health Communication, School of Public Health, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
- Takahiro Kiuchi
- Department of Health Communication, School of Public Health, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan

37
Huo W, He M, Zeng Z, Bao X, Lu Y, Tian W, Feng J, Feng R. Impact Analysis of COVID-19 Pandemic on Hospital Reviews on Dianping Website in Shanghai, China: Empirical Study. J Med Internet Res 2024; 26:e52992. PMID: 38954461; PMCID: PMC11252617; DOI: 10.2196/52992.
Abstract
BACKGROUND In the era of the internet, individuals have become increasingly accustomed to gathering information and expressing their opinions on public web-based platforms. The health care sector is no exception, as these comments, to a certain extent, influence people's health care decisions. How the care experiences of Chinese patients and their evaluations of hospitals changed during the onset of the COVID-19 pandemic remains to be studied. We therefore collected patient visit data from the internet to reflect the state of medical relationships under these specific circumstances. OBJECTIVE This study aims to explore the differences in patient comments before, during, and after the COVID-19 pandemic, as well as among different types of hospitals (children's hospitals, maternity hospitals, and tumor hospitals). Additionally, by leveraging ChatGPT (OpenAI), the study categorizes the elements of negative hospital evaluations. An analysis is conducted on the acquired data, and potential solutions that could improve patient satisfaction are proposed. This study is intended to help hospital managers provide a better experience for patients seeking care amid an emergent public health crisis. METHODS Selecting the top 50 comprehensive hospitals nationwide and the top specialized hospitals (children's hospitals, tumor hospitals, and maternity hospitals), we collected patient reviews of these hospitals from the Dianping website. Using ChatGPT, we classified the content of negative reviews. Additionally, we conducted statistical analysis using SPSS (IBM Corp) to examine the scoring and composition of negative evaluations. RESULTS A total of 30,317 valid comments were collected from January 1, 2018, to August 15, 2023, including 7696 negative comments. Manual inspection indicated that ChatGPT had an accuracy rate of 92.05%.
The F1-score was 0.914. Analysis of these data revealed a significant correlation between the comments and the ratings received by hospitals during the pandemic. Overall, average comment scores increased significantly during the outbreak (P<.001). Furthermore, the composition of negative comments differed notably among hospital types (P<.001). Feedback at children's hospitals was particularly sensitive to waiting times and treatment effectiveness, while patients at maternity hospitals were more concerned with the attitude of health care providers. Patients at tumor hospitals wanted timely examinations and treatments, especially during the pandemic period. CONCLUSIONS The COVID-19 pandemic had some association with patient comment scores. The scores and content of comments varied among different types of specialized hospitals. Using ChatGPT to analyze patient comment content represents an innovative approach for statistically assessing factors contributing to patient dissatisfaction. The findings of this study could provide valuable insights for hospital administrators seeking to foster more harmonious physician-patient relationships and enhance hospital performance during public health emergencies.
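The reported 92.05% accuracy and F1-score of 0.914 come from comparing ChatGPT's category labels against manual inspection. A minimal sketch of those two metrics, with invented labels rather than the study's data:

```python
def classification_scores(pred, gold, positive):
    """Accuracy and F1 of automated labels (e.g. ChatGPT's review
    categories) against manually assigned gold labels."""
    tp = sum(p == g == positive for p, g in zip(pred, gold))       # true positives
    fp = sum(p == positive and p != g for p, g in zip(pred, gold)) # false positives
    fn = sum(g == positive and p != g for p, g in zip(pred, gold)) # false negatives
    accuracy = sum(p == g for p, g in zip(pred, gold)) / len(gold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1
```

For the study's multi-category labels, F1 would be computed per category and averaged; the binary form above shows the core calculation.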
Affiliation(s)
- Weixue Huo
- Department of Vascular Surgery, Shanghai General Hospital, Shanghai Jiaotong University, Shanghai, China
- Mengwei He
- Department of Vascular Surgery, Shanghai General Hospital, Shanghai Jiaotong University, Shanghai, China
- Zhaoxiang Zeng
- Department of Vascular Surgery, Changhai Hospital, Navy Medical University, Shanghai, China
- Xianhao Bao
- Department of Vascular Surgery, Shanghai General Hospital, Shanghai Jiaotong University, Shanghai, China
- Ye Lu
- Department of Vascular Surgery, Shanghai General Hospital, Shanghai Jiaotong University, Shanghai, China
- Wen Tian
- Department of Vascular Surgery, Shanghai General Hospital, Shanghai Jiaotong University, Shanghai, China
- Jiaxuan Feng
- Vascular Surgery Department, Ruijin Hospital, Shanghai Jiaotong University School of Medicine, Shanghai, China
- Rui Feng
- Department of Vascular Surgery, Shanghai General Hospital, Shanghai Jiaotong University, Shanghai, China

38
Arora V, Silburt J, Phillips M, Khan M, Petrisor B, Chaudhry H, Mundi R, Bhandari M. A Blinded Comparison of Three Generative Artificial Intelligence Chatbots for Orthopaedic Surgery Therapeutic Questions. Cureus 2024; 16:e65343. PMID: 39184692; PMCID: PMC11344479; DOI: 10.7759/cureus.65343.
Abstract
Objective To compare the quality of responses from three chatbots (ChatGPT, Bing Chat, and AskOE) across a range of orthopaedic surgery therapeutic treatment questions. Design We identified a series of treatment-related questions across a range of subspecialties in orthopaedic surgery. Questions were entered identically into each of the three chatbots, and responses were reviewed using a standardized rubric. Participants Orthopaedic surgery experts associated with McMaster University and the University of Toronto blindly reviewed all responses. Outcomes The primary outcomes were scores on a five-item assessment tool covering clinical correctness, clinical completeness, safety, usefulness, and references. The secondary outcome was the reviewers' preferred response for each question. We performed a mixed-effects logistic regression to identify factors associated with selecting a preferred chatbot. Results Across all questions and answers, reviewers preferred AskOE to a significantly greater extent than both ChatGPT (P<0.001) and Bing Chat (P<0.001). AskOE also received significantly higher total evaluation scores than both ChatGPT (P<0.001) and Bing Chat (P<0.001). Further regression analysis showed that clinical correctness, clinical completeness, usefulness, and references were significantly associated with a preference for AskOE. Across all responses, four were considered to have major errors: three from ChatGPT and one from AskOE. Conclusions Reviewers significantly preferred AskOE over ChatGPT and Bing Chat across a variety of orthopaedic therapy questions. This technology has important implications in a healthcare setting as it provides access to trustworthy answers in orthopaedic surgery.
Affiliation(s)
- Vikram Arora
- Department of Surgery, McMaster University, Hamilton, CAN
- Joseph Silburt
- Department of Surgery, McMaster University, Hamilton, CAN
- Mark Phillips
- Department of Surgery, McMaster University, Hamilton, CAN
- Moin Khan
- Department of Surgery, McMaster University, Hamilton, CAN
- Brad Petrisor
- Department of Surgery, McMaster University, Hamilton, CAN
- Harman Chaudhry
- Department of Orthopaedic Surgery, University of Toronto, Toronto, CAN
- Raman Mundi
- Department of Orthopaedic Surgery, University of Toronto, Toronto, CAN
- Mohit Bhandari
- Department of Surgery, McMaster University, Hamilton, CAN

39
Ghanem D, Zhu AR, Kagabo W, Osgood G, Shafiq B. ChatGPT-4 Knows Its A B C D E but Cannot Cite Its Source. JB JS Open Access 2024; 9:e24.00099. PMID: 39238880; PMCID: PMC11368215; DOI: 10.2106/jbjs.oa.24.00099.
Abstract
Introduction The artificial intelligence language model Chat Generative Pretrained Transformer (ChatGPT) has shown potential as a reliable and accessible educational resource in orthopaedic surgery. Yet, the accuracy of the references behind the provided information remains elusive, which poses a concern for maintaining the integrity of medical content. This study examines the accuracy of the references provided by ChatGPT-4 concerning the Airway, Breathing, Circulation, Disability, Exposure (ABCDE) approach in trauma surgery. Methods Two independent reviewers critically assessed 30 ChatGPT-4-generated references supporting the well-established ABCDE approach to trauma protocol, grading them as 0 (nonexistent), 1 (inaccurate), or 2 (accurate). All discrepancies between the ChatGPT-4 and PubMed references were carefully reviewed and bolded. Cohen's kappa coefficient was used to examine the agreement between reviewers on the accuracy scores of the ChatGPT-4-generated references. Descriptive statistics were used to summarize the mean reference accuracy scores. To compare the variance of the means across the 5 categories, one-way analysis of variance was used. Results ChatGPT-4 had an average reference accuracy score of 66.7%. Of the 30 references, only 43.3% were accurate and deemed "true," while 56.7% were categorized as "false" (43.3% inaccurate and 13.3% nonexistent). The accuracy was consistent across the 5 trauma protocol categories, with no statistically significant difference (p = 0.437). Discussion With 57% of references being inaccurate or nonexistent, ChatGPT-4 has fallen short in providing reliable and reproducible references, a concerning finding for the safety of using ChatGPT-4 for professional medical decision making without thorough verification. Only if used cautiously and with cross-referencing can this language model serve as an adjunct learning tool that enhances comprehensiveness as well as knowledge rehearsal and manipulation.
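Inter-reviewer agreement on the 0/1/2 reference grades was assessed with Cohen's kappa. A minimal unweighted-kappa sketch (the grade lists in the test are invented, not the study's data):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters grading the same items
    (here: reference-accuracy grades 0, 1, or 2)."""
    n = len(rater_a)
    # Observed agreement: fraction of items graded identically.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / n ** 2
    return (observed - expected) / (1 - expected)
```

Since the grades are ordinal, a weighted kappa that penalizes 0-vs-2 disagreements more than 0-vs-1 would also be defensible; the unweighted form is the simplest baseline.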
Affiliation(s)
- Diane Ghanem
- Department of Orthopaedic Surgery, The Johns Hopkins Hospital, Baltimore, Maryland
- Alexander R Zhu
- School of Medicine, The Johns Hopkins University, Baltimore, Maryland
- Whitney Kagabo
- Department of Orthopaedic Surgery, The Johns Hopkins Hospital, Baltimore, Maryland
- Greg Osgood
- Department of Orthopaedic Surgery, The Johns Hopkins Hospital, Baltimore, Maryland
- Babar Shafiq
- Department of Orthopaedic Surgery, The Johns Hopkins Hospital, Baltimore, Maryland

40
Kumar RP, Sivan V, Bachir H, Sarwar SA, Ruzicka F, O'Malley GR, Lobo P, Morales IC, Cassimatis ND, Hundal JS, Patel NV. Can Artificial Intelligence Mitigate Missed Diagnoses by Generating Differential Diagnoses for Neurosurgeons? World Neurosurg 2024; 187:e1083-e1088. PMID: 38759788; DOI: 10.1016/j.wneu.2024.05.052.
Abstract
BACKGROUND/OBJECTIVE Neurosurgery emphasizes the criticality of accurate differential diagnoses, with diagnostic delays posing significant health and economic challenges. As large language models (LLMs) emerge as transformative tools in healthcare, this study seeks to elucidate their role in assisting neurosurgeons with the differential diagnosis process, especially during preliminary consultations. METHODS This study employed three chat-based LLM platforms, ChatGPT (versions 3.5 and 4.0), Perplexity AI, and Bard AI, to evaluate their diagnostic accuracy. Each LLM was prompted with clinical vignettes, and its responses were recorded, yielding differential diagnoses for 20 common and uncommon neurosurgical disorders. Disease-specific prompts were crafted using Dynamed, a clinical reference tool. The accuracy of the LLMs was determined by their ability to correctly identify the target disease within their top differential diagnoses. RESULTS For the initial differential, ChatGPT 3.5 achieved an accuracy of 52.63%, while ChatGPT 4.0 performed slightly better at 53.68%. Perplexity AI and Bard AI demonstrated 40.00% and 29.47% accuracy, respectively. As the number of considered differentials increased from 2 to 5, ChatGPT 3.5 reached its peak accuracy of 77.89% for the top 5 differentials. Bard AI and Perplexity AI had varied performances, with Bard AI improving to 62.11% for the top 5 differentials. On a disease-specific note, the LLMs excelled in diagnosing conditions like epilepsy and cervical spine stenosis but faced challenges with more complex diseases such as Moyamoya disease and amyotrophic lateral sclerosis. CONCLUSIONS LLMs show potential to enhance diagnostic accuracy and decrease the incidence of missed diagnoses in neurosurgery.
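The accuracy-versus-number-of-differentials analysis described above amounts to top-k accuracy: a case counts as correct if the target disease appears anywhere in the model's first k differentials. A small sketch under that assumption (the vignette data in the test is invented):

```python
def top_k_accuracy(differentials, targets, k):
    """Share of cases whose target diagnosis appears in the model's
    top-k differential list (case-insensitive exact match)."""
    hits = sum(
        target.lower() in (d.lower() for d in diff[:k])
        for diff, target in zip(differentials, targets)
    )
    return hits / len(targets)
```

Real grading would also need fuzzy matching of synonymous diagnosis names (e.g. "CTS" vs "carpal tunnel syndrome"), which exact string comparison cannot capture.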
Affiliation(s)
- Rohit Prem Kumar
- Department of Neurosurgery, Hackensack Meridian School of Medicine, Nutley, New Jersey, USA.
- Vijay Sivan
- Department of Neurosurgery, Hackensack Meridian School of Medicine, Nutley, New Jersey, USA
- Hanin Bachir
- Department of Neurosurgery, Hackensack Meridian School of Medicine, Nutley, New Jersey, USA
- Syed A Sarwar
- Department of Neurosurgery, Hackensack Meridian School of Medicine, Nutley, New Jersey, USA
- Francis Ruzicka
- Department of Neurosurgery, Hackensack Meridian School of Medicine, Nutley, New Jersey, USA
- Geoffrey R O'Malley
- Department of Neurosurgery, Hackensack Meridian School of Medicine, Nutley, New Jersey, USA
- Paulo Lobo
- Department of Neurosurgery, Hackensack Meridian School of Medicine, Nutley, New Jersey, USA
- Ilona Cazorla Morales
- Department of Neurosurgery, Hackensack Meridian School of Medicine, Nutley, New Jersey, USA
- Nicholas D Cassimatis
- Department of Neurosurgery, Hackensack Meridian School of Medicine, Nutley, New Jersey, USA
- Jasdeep S Hundal
- Department of Neurology, HMH-Jersey Shore University Medical Center, Neptune, New Jersey, USA
- Nitesh V Patel
- Department of Neurosurgery, Hackensack Meridian School of Medicine, Nutley, New Jersey, USA; Department of Neurosurgery, HMH-Jersey Shore University Medical Center, Neptune, New Jersey, USA

41
Yao JJ, Aggarwal M, Lopez RD, Namdari S. Current Concepts Review: Large Language Models in Orthopaedics: Definitions, Uses, and Limitations. J Bone Joint Surg Am 2024:00004623-990000000-01136. PMID: 38896652; DOI: 10.2106/jbjs.23.01417.
Abstract
➤ Large language models are a subset of artificial intelligence. Large language models are powerful tools that excel in natural language text processing and generation.
➤ There are many potential clinical, research, and educational applications of large language models in orthopaedics, but the development of these applications needs to be focused on patient safety and the maintenance of high standards.
➤ There are numerous methodological, ethical, and regulatory concerns with regard to the use of large language models. Orthopaedic surgeons need to be aware of the controversies and advocate for an alignment of these models with patient and caregiver priorities.
Affiliation(s)
- Jie J Yao
- Rothman Orthopaedic Institute, Thomas Jefferson University, Philadelphia, Pennsylvania
- Ryan D Lopez
- Rothman Orthopaedic Institute, Thomas Jefferson University, Philadelphia, Pennsylvania
- Surena Namdari
- Rothman Orthopaedic Institute, Thomas Jefferson University, Philadelphia, Pennsylvania

42
MohanaSundaram A, Patil B, Praticò D. ChatGPT's Inconsistency in the Diagnosis of Alzheimer's Disease. J Alzheimers Dis Rep 2024; 8:923-925. PMID: 38910941; PMCID: PMC11191643; DOI: 10.3233/adr-240069.
Abstract
A recent article by El Haj et al. provided evidence that ChatGPT could be a potential tool that complements the clinical diagnosis of the various stages of Alzheimer's disease (AD) as well as mild cognitive impairment (MCI). To reassess the accuracy and reproducibility of ChatGPT in the diagnosis of AD and MCI, we used the same prompts used by the authors. Surprisingly, we found that some of ChatGPT's responses in the diagnosis of the various stages of AD and MCI differed. In this commentary we discuss possible reasons for these differing results and propose strategies for future studies.
Affiliation(s)
- Bhushan Patil
- MannSparsh Neuropsychiatric Hospital, Kalyan, India
- Manasa Rehabilitation and De-Addiction Center, Titwala, India
- Domenico Praticò
- Alzheimer’s Center at Temple, Lewis Katz School of Medicine, Temple University, Philadelphia, PA, USA

43
Croxford E, Gao Y, Patterson B, To D, Tesch S, Dligach D, Mayampurath A, Churpek MM, Afshar M. Development of a Human Evaluation Framework and Correlation with Automated Metrics for Natural Language Generation of Medical Diagnoses. medRxiv [Preprint] 2024:2024.03.20.24304620. PMID: 38562730; PMCID: PMC10984060; DOI: 10.1101/2024.03.20.24304620.
Abstract
In the evolving landscape of clinical Natural Language Generation (NLG), assessing abstractive text quality remains challenging, as existing methods often overlook the complexities of generative tasks. This work examined the current state of automated evaluation metrics for NLG in healthcare. To have a robust and well-validated baseline against which to examine the alignment of these metrics, we created a comprehensive human evaluation framework. Using ChatGPT-3.5-turbo generative output, we correlated human judgments with each metric. None of the metrics demonstrated high alignment; however, the SapBERT score, a metric based on the Unified Medical Language System (UMLS), showed the best results. This underscores the importance of incorporating domain-specific knowledge into evaluation efforts. Our work reveals the deficiency in quality evaluations for generated text and introduces our comprehensive human evaluation framework as a baseline. Future efforts should prioritize integrating medical knowledge databases to enhance the alignment of automated metrics, particularly by refining the SapBERT score for improved assessments.
Affiliation(s)
- Emma Croxford
- Department of Medicine, School of Medicine and Public Health, University of Wisconsin Madison
- Yanjun Gao
- Department of Medicine, School of Medicine and Public Health, University of Wisconsin Madison
- Brian Patterson
- Department of Emergency Medicine, School of Medicine and Public Health, University of Wisconsin Madison
- Daniel To
- Department of Medicine, School of Medicine and Public Health, University of Wisconsin Madison
- Samuel Tesch
- Department of Medicine, School of Medicine and Public Health, University of Wisconsin Madison
- Anoop Mayampurath
- Biostatistics and Medical Informatics, School of Medicine and Public Health, University of Wisconsin Madison
- Matthew M Churpek
- Department of Medicine, School of Medicine and Public Health, University of Wisconsin Madison
- Majid Afshar
- Department of Medicine, School of Medicine and Public Health, University of Wisconsin Madison

44
Koga S. The double-edged nature of ChatGPT in self-diagnosis. Wien Klin Wochenschr 2024; 136:243-244. PMID: 38504058; DOI: 10.1007/s00508-024-02343-3.
Affiliation(s)
- Shunsuke Koga
- Department of Pathology and Laboratory Medicine, Hospital of the University of Pennsylvania, 3400 Spruce Street, 19104, Philadelphia, PA, USA.

45
Parekh AS, McCahon JAS, Nghe A, Pedowitz DI, Daniel JN, Parekh SG. Foot and Ankle Patient Education Materials and Artificial Intelligence Chatbots: A Comparative Analysis. Foot Ankle Spec 2024. PMID: 38504411; DOI: 10.1177/19386400241235834.
Abstract
BACKGROUND The purpose of this study was to perform a comparative analysis of foot and ankle patient education material generated by AI chatbots against the American Orthopaedic Foot and Ankle Society (AOFAS)-recommended patient education website, FootCareMD.org. METHODS ChatGPT, Google Bard, and Bing AI were used to generate patient education materials on 10 of the most common foot and ankle conditions. The content from these AI language model platforms was analyzed and compared with that on FootCareMD.org for accuracy of the included information. Accuracy was determined for each of the 10 conditions on the basis of included information regarding background, symptoms, causes, diagnosis, treatments, surgical options, recovery, and risks or prevention. RESULTS When compared with the reference standard, FootCareMD.org, the AI language model platforms consistently scored below 60% accuracy in all categories of the articles analyzed. ChatGPT contained an average of 46.2% of key content across all included conditions when compared with FootCareMD.org. Comparatively, Google Bard and Bing AI contained 36.5% and 28.0% of the information included on FootCareMD.org, respectively (P < .005). CONCLUSION Patient education on common foot and ankle conditions generated by AI language models provides limited content accuracy across all three AI chatbot platforms. LEVEL OF EVIDENCE Level IV.
Affiliation(s)
- Aarav S Parekh
- Rothman Orthopaedic Institute, Philadelphia, Pennsylvania
- Amy Nghe
- Rothman Orthopaedic Institute, Philadelphia, Pennsylvania

46
Li J, Dada A, Puladi B, Kleesiek J, Egger J. ChatGPT in healthcare: A taxonomy and systematic review. Comput Methods Programs Biomed 2024; 245:108013. PMID: 38262126; DOI: 10.1016/j.cmpb.2024.108013.
Abstract
The recent release of ChatGPT, a chatbot research project/product in natural language processing (NLP) by OpenAI, stirred up a sensation among both the general public and medical professionals, amassing a phenomenally large user base in a short time. This is a typical example of the 'productization' of cutting-edge technologies, which allows the general public without a technical background to gain firsthand experience of artificial intelligence (AI), similar to the AI hype created by AlphaGo (DeepMind Technologies, UK) and self-driving cars (Google, Tesla, etc.). However, it is crucial, especially for healthcare researchers, to remain prudent amidst the hype. This work provides a systematic review of existing publications on the use of ChatGPT in healthcare, elucidating the 'status quo' of ChatGPT in medical applications for general readers, healthcare professionals, and NLP scientists. The large biomedical literature database PubMed was used to retrieve published works on this topic using the keyword 'ChatGPT'. An inclusion criterion and a taxonomy are further proposed to filter the search results and categorize the selected publications, respectively. The review finds that the current release of ChatGPT has achieved only moderate or 'passing' performance in a variety of tests and is unreliable for actual clinical deployment, since it is not intended for clinical applications by design. We conclude that specialized NLP models trained on (bio)medical datasets still represent the right direction to pursue for critical clinical applications.
Affiliation(s)
- Jianning Li
- Institute for Artificial Intelligence in Medicine, University Hospital Essen (AöR), Girardetstraße 2, 45131 Essen, Germany
- Amin Dada
- Institute for Artificial Intelligence in Medicine, University Hospital Essen (AöR), Girardetstraße 2, 45131 Essen, Germany
- Behrus Puladi
- Institute of Medical Informatics, University Hospital RWTH Aachen, Pauwelsstraße 30, 52074 Aachen, Germany; Department of Oral and Maxillofacial Surgery, University Hospital RWTH Aachen, Pauwelsstraße 30, 52074 Aachen, Germany
- Jens Kleesiek
- Institute for Artificial Intelligence in Medicine, University Hospital Essen (AöR), Girardetstraße 2, 45131 Essen, Germany; TU Dortmund University, Department of Physics, Otto-Hahn-Straße 4, 44227 Dortmund, Germany
- Jan Egger
- Institute for Artificial Intelligence in Medicine, University Hospital Essen (AöR), Girardetstraße 2, 45131 Essen, Germany; Center for Virtual and Extended Reality in Medicine (ZvRM), University Hospital Essen, University Medicine Essen, Hufelandstraße 55, 45147 Essen, Germany.

47
Nacher M, Françoise U, Adenis A. ChatGPT neglects a neglected disease. Lancet Infect Dis 2024; 24:e76. PMID: 38211603; DOI: 10.1016/s1473-3099(23)00750-8.
Affiliation(s)
- Mathieu Nacher
- CIC Inserm 1424, Amazonian Institute of Population Health, Centre Hospitalier de Cayenne, 97300 Cayenne, French Guiana.
- Ugo Françoise
- CIC Inserm 1424, Amazonian Institute of Population Health, Centre Hospitalier de Cayenne, 97300 Cayenne, French Guiana
- Antoine Adenis
- CIC Inserm 1424, Amazonian Institute of Population Health, Centre Hospitalier de Cayenne, 97300 Cayenne, French Guiana

48
Thirunavukarasu AJ. How Can the Clinical Aptitude of AI Assistants Be Assayed? J Med Internet Res 2023; 25:e51603. PMID: 38051572; PMCID: PMC10731545; DOI: 10.2196/51603.
Abstract
Large language models (LLMs) are exhibiting remarkable performance in clinical contexts, with exemplar results ranging from expert-level attainment in medical examination questions to superior accuracy and relevance when responding to patient queries compared to real doctors replying to queries on social media. The deployment of LLMs in conventional health care settings is yet to be reported, and there remains an open question as to what evidence should be required before such deployment is warranted. Early validation studies use unvalidated surrogate variables to represent clinical aptitude, and it may be necessary to conduct prospective randomized controlled trials to justify the use of an LLM for clinical advice or assistance, as potential pitfalls and pain points cannot be exhaustively predicted. This viewpoint states that as LLMs continue to revolutionize the field, there is an opportunity to improve the rigor of artificial intelligence (AI) research to reward innovation, conferring real benefits to real patients.
Affiliation(s)
- Arun James Thirunavukarasu
- Oxford University Clinical Academic Graduate School, University of Oxford, Oxford, United Kingdom
- School of Clinical Medicine, University of Cambridge, Cambridge, United Kingdom

49
Sallam M, Barakat M, Sallam M. Pilot Testing of a Tool to Standardize the Assessment of the Quality of Health Information Generated by Artificial Intelligence-Based Models. Cureus 2023; 15:e49373. PMID: 38024074; PMCID: PMC10674084; DOI: 10.7759/cureus.49373.
Abstract
Background Artificial intelligence (AI)-based conversational models, such as Chat Generative Pre-trained Transformer (ChatGPT), Microsoft Bing, and Google Bard, have emerged as valuable sources of health information for lay individuals. However, the accuracy of the information provided by these AI models remains a significant concern. This pilot study aimed to test a new tool, referred to as "CLEAR", designed to assess the quality of health information delivered by AI-based models across five key themes: Completeness of content, Lack of false information in the content, Evidence supporting the content, Appropriateness of the content, and Relevance. Methods Tool development involved a literature review on health information quality, followed by the initial drafting of the CLEAR tool, which comprised five items assessing completeness, lack of false information, evidence support, appropriateness, and relevance. Each item was scored on a five-point Likert scale from excellent to poor. Content validity was checked by expert review. Pilot testing involved 32 healthcare professionals using the CLEAR tool to assess content on eight different health topics deliberately designed with varying quality. Internal consistency was checked with Cronbach's alpha (α). Feedback from the pilot test resulted in language modifications to improve the clarity of the items. The final CLEAR tool was then used to assess the quality of health information generated by four distinct AI models on five health topics. The AI models were ChatGPT 3.5, ChatGPT 4, Microsoft Bing, and Google Bard, and the generated content was scored by two independent raters, with Cohen's kappa (κ) used for inter-rater agreement.
Results The final five CLEAR items were: (1) Is the content sufficient?; (2) Is the content accurate?; (3) Is the content evidence-based?; (4) Is the content clear, concise, and easy to understand?; and (5) Is the content free from irrelevant information? Pilot testing on the eight health topics revealed acceptable internal consistency with a Cronbach's α range of 0.669-0.981. The use of the final CLEAR tool yielded the following average scores: Microsoft Bing (mean=24.4±0.42), ChatGPT-4 (mean=23.6±0.96), Google Bard (mean=21.2±1.79), and ChatGPT-3.5 (mean=20.6±5.20). The inter-rater agreement revealed the following Cohen κ values: ChatGPT-3.5 (κ=0.875, P<.001), ChatGPT-4 (κ=0.780, P<.001), Microsoft Bing (κ=0.348, P=.037), and Google Bard (κ=0.749, P<.001). Conclusions The CLEAR tool is a brief yet helpful tool that can aid in standardizing testing of the quality of health information generated by AI-based models. Future studies are recommended to validate the utility of the CLEAR tool in the quality assessment of AI-generated health-related content using a larger sample across various complex health topics.
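The internal-consistency figures above (Cronbach's α of 0.669-0.981) are computed from the raters' per-item scores. A minimal sketch using population variances (the scores in the test are invented, not the study's data):

```python
def cronbach_alpha(item_scores):
    """Cronbach's alpha for internal consistency.

    `item_scores` holds one list per item (e.g. the five CLEAR items),
    each containing that item's score from every rater."""
    def variance(xs):  # population variance
        mean = sum(xs) / len(xs)
        return sum((x - mean) ** 2 for x in xs) / len(xs)

    k = len(item_scores)
    totals = [sum(scores) for scores in zip(*item_scores)]  # per-rater total score
    item_var_sum = sum(variance(item) for item in item_scores)
    return k / (k - 1) * (1 - item_var_sum / variance(totals))
```

Alpha approaches 1 when the items move together across raters; values near the study's lower bound of 0.669 indicate weaker, though still acceptable, consistency.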
Affiliation(s)
- Malik Sallam
- Department of Pathology, Microbiology, and Forensic Medicine, School of Medicine, University of Jordan, Amman, JOR
- Department of Clinical Laboratories and Forensic Medicine, Jordan University Hospital, Amman, JOR
- Muna Barakat
- Department of Clinical Pharmacy and Therapeutics, School of Pharmacy, Applied Science Private University, Amman, JOR
- Department of Research, Middle East University, Amman, JOR
- Mohammed Sallam
- Department of Pharmacy, Mediclinic Parkview Hospital, Mediclinic Middle East, Dubai, ARE