1
Suárez A, Jiménez J, Llorente de Pedro M, Andreu-Vázquez C, Díaz-Flores García V, Gómez Sánchez M, Freire Y. Beyond the Scalpel: Assessing ChatGPT's potential as an auxiliary intelligent virtual assistant in oral surgery. Comput Struct Biotechnol J 2024; 24:46-52. PMID: 38162955; PMCID: PMC10755495; DOI: 10.1016/j.csbj.2023.11.058.
Abstract
AI has revolutionized the way we interact with technology. Noteworthy advances in AI algorithms and large language models (LLMs) have led to the development of natural generative language (NGL) systems such as ChatGPT. Although these LLMs can simulate human conversations and generate content in real time, they face challenges related to the topicality and accuracy of the information they generate. This study aimed to assess whether ChatGPT-4 could provide accurate and reliable answers to general dentists in the field of oral surgery, and thus to explore its potential as an intelligent virtual assistant in clinical decision making in oral surgery. Thirty questions related to oral surgery were posed to ChatGPT-4, each repeated 30 times, yielding a total of 900 responses. Two surgeons graded the answers according to the guidelines of the Spanish Society of Oral Surgery, using a three-point Likert scale (correct, partially correct/incomplete, and incorrect). Disagreements were arbitrated by an experienced oral surgeon, who provided the final grade. Accuracy was found to be 71.7%, and the consistency of the experts' grading across iterations ranged from moderate to almost perfect. ChatGPT-4, with its potential capabilities, will inevitably be integrated into dental disciplines, including oral surgery. In the future, it could be considered an auxiliary intelligent virtual assistant, though it would never replace oral surgery experts. Proper training and verified information by experts will remain vital to the implementation of the technology. More comprehensive research is needed to ensure the safe and successful application of AI in oral surgery.
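Interrater reliability in a grading workflow like this one is typically summarized with a chance-corrected statistic such as Cohen's kappa alongside raw accuracy. The sketch below shows one way to compute both in Python; the grade lists are invented placeholders, not the study's data, and the three-point scale is encoded as 2/1/0.

```python
# Hedged sketch: accuracy plus two-rater agreement on a three-point
# grading scale (2 = correct, 1 = partially correct, 0 = incorrect).
# The grade lists are invented placeholders, not the study's data.
from sklearn.metrics import cohen_kappa_score

rater_a = [2, 2, 1, 0, 2, 1, 2, 2, 0, 2]
rater_b = [2, 2, 1, 1, 2, 1, 2, 2, 0, 2]
kappa = cohen_kappa_score(rater_a, rater_b)

final_grades = [2, 2, 1, 1, 2, 1, 2, 2, 0, 2]  # grades after arbitration
accuracy = final_grades.count(2) / len(final_grades)
print(f"kappa={kappa:.2f}, accuracy={accuracy:.1%}")
```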
Affiliation(s)
- Ana Suárez: Department of Pre-Clinic Dentistry, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, 28670 Madrid, Spain
- Jaime Jiménez: Department of Clinic Dentistry, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, 28670 Madrid, Spain
- María Llorente de Pedro: Department of Pre-Clinic Dentistry, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, 28670 Madrid, Spain
- Cristina Andreu-Vázquez: Department of Veterinary Medicine, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, 28670 Madrid, Spain
- Víctor Díaz-Flores García: Department of Pre-Clinic Dentistry, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, 28670 Madrid, Spain
- Margarita Gómez Sánchez: Department of Pre-Clinic Dentistry, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, 28670 Madrid, Spain
- Yolanda Freire: Department of Pre-Clinic Dentistry, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, 28670 Madrid, Spain
2
Warrier A, Singh R, Haleem A, Zaki H, Eloy JA. The Comparative Diagnostic Capability of Large Language Models in Otolaryngology. Laryngoscope 2024; 134:3997-4002. PMID: 38563415; DOI: 10.1002/lary.31434.
Abstract
OBJECTIVES Evaluate and compare the ability of large language models (LLMs) to diagnose various ailments in otolaryngology. METHODS We collected all 100 clinical vignettes from the second edition of Otolaryngology Cases-The University of Cincinnati Clinical Portfolio by Pensak et al. With the addition of the prompt "Provide a diagnosis given the following history," we prompted ChatGPT-3.5, Google Bard, and Bing-GPT4 to provide a diagnosis for each vignette. These diagnoses were compared to the portfolio for accuracy and recorded. All queries were run in June 2023. RESULTS ChatGPT-3.5 was the most accurate model (89% success rate), followed by Google Bard (82%) and Bing-GPT4 (74%). A chi-squared test revealed a significant difference between the three LLMs in providing correct diagnoses (p = 0.023). Of the 100 vignettes, seven required additional testing results (e.g., biopsy, non-contrast CT) for accurate clinical diagnosis. When these vignettes were omitted, the revised success rates were 95.7% for ChatGPT-3.5, 88.17% for Google Bard, and 78.72% for Bing-GPT4 (p = 0.002). CONCLUSIONS ChatGPT-3.5 offers the most accurate diagnoses when given established clinical vignettes, as compared to Google Bard and Bing-GPT4. LLMs may accurately offer assessments for common otolaryngology conditions but currently require detailed prompt information and critical supervision from clinicians. There is vast potential in the clinical applicability of LLMs; however, practitioners should be wary of possible "hallucinations" and misinformation in responses. LEVEL OF EVIDENCE 3 Laryngoscope, 134:3997-4002, 2024.
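The reported chi-squared comparison can be reconstructed from the published success rates, since each model answered the same 100 vignettes. A minimal sketch, assuming a simple correct/incorrect contingency table:

```python
# Correct vs. incorrect diagnoses per model, 100 vignettes each,
# taken from the success rates reported in the abstract above.
from scipy.stats import chi2_contingency

table = [
    [89, 11],  # ChatGPT-3.5
    [82, 18],  # Google Bard
    [74, 26],  # Bing-GPT4
]
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2={chi2:.2f}, dof={dof}, p={p:.3f}")  # p comes out near 0.023
```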
Affiliation(s)
- Akshay Warrier: Department of Otolaryngology-Head and Neck Surgery, Rutgers New Jersey Medical School, Newark, New Jersey, USA
- Rohan Singh: Department of Otolaryngology-Head and Neck Surgery, Rutgers New Jersey Medical School, Newark, New Jersey, USA
- Afash Haleem: Department of Otolaryngology-Head and Neck Surgery, Rutgers New Jersey Medical School, Newark, New Jersey, USA
- Haider Zaki: Department of Otolaryngology-Head and Neck Surgery, Rutgers New Jersey Medical School, Newark, New Jersey, USA
- Jean Anderson Eloy: Department of Otolaryngology-Head and Neck Surgery, Rutgers New Jersey Medical School, Newark, New Jersey, USA; Center for Skull Base and Pituitary Surgery, Neurological Institute of New Jersey, Rutgers New Jersey Medical School, Newark, New Jersey, USA
3
Langston E, Charness N, Boot W. Are Virtual Assistants Trustworthy for Medicare Information: An Examination of Accuracy and Reliability. Gerontologist 2024; 64:gnae062. PMID: 38832398; PMCID: PMC11258897; DOI: 10.1093/geront/gnae062.
Abstract
BACKGROUND AND OBJECTIVES Advances in artificial intelligence (AI)-based virtual assistants provide a potential opportunity for older adults to use this technology in the context of health information-seeking. Meta-analysis on trust in AI shows that users are influenced by the accuracy and reliability of the AI trustee. We evaluated these dimensions for responses to Medicare queries. RESEARCH DESIGN AND METHODS During the summer of 2023, we assessed the accuracy and reliability of Alexa, Google Assistant, Bard, and ChatGPT-4 on Medicare terminology and general content from a large, standardized question set. We compared the accuracy of these AI systems to that of a large representative sample of Medicare beneficiaries who were queried twenty years prior. RESULTS Alexa and Google Assistant were found to be highly inaccurate when compared to beneficiaries' mean accuracy of 68.4% on terminology queries and 53.0% on general Medicare content. Bard and ChatGPT-4 answered Medicare terminology queries perfectly and performed much better on general Medicare content queries (Bard = 96.3%, ChatGPT-4 = 92.6%) than the average Medicare beneficiary. About one month to a month-and-a-half later, we found that Bard and Alexa's accuracy stayed the same, whereas ChatGPT-4's performance nominally decreased, and Google Assistant's performance nominally increased. DISCUSSION AND IMPLICATIONS LLM-based assistants generate trustworthy information in response to carefully phrased queries about Medicare, in contrast to Alexa and Google Assistant. Further studies will be needed to determine what factors beyond accuracy and reliability influence the adoption and use of such technology for Medicare decision-making.
Affiliation(s)
- Emily Langston: Department of Psychology, Florida State University, Tallahassee, Florida, USA
- Neil Charness: Department of Psychology, Florida State University, Tallahassee, Florida, USA
- Walter Boot: Department of Psychology, Florida State University, Tallahassee, Florida, USA
4
Liu M, Okuhara T, Chang X, Shirabe R, Nishiie Y, Okada H, Kiuchi T. Performance of ChatGPT Across Different Versions in Medical Licensing Examinations Worldwide: Systematic Review and Meta-Analysis. J Med Internet Res 2024; 26:e60807. PMID: 39052324; PMCID: PMC11310649; DOI: 10.2196/60807.
Abstract
BACKGROUND Over the past 2 years, researchers have used various medical licensing examinations to test whether ChatGPT (OpenAI) possesses accurate medical knowledge. The performance of each version of ChatGPT on medical licensing examinations in multiple environments has shown remarkable differences. At this stage, there is still a lack of comprehensive understanding of the variability in ChatGPT's performance on different medical licensing examinations. OBJECTIVE In this study, we reviewed all studies on ChatGPT performance in medical licensing examinations up to March 2024. This review aims to contribute to the evolving discourse on artificial intelligence (AI) in medical education by providing a comprehensive analysis of the performance of ChatGPT in various environments. The insights gained from this systematic review will guide educators, policymakers, and technical experts to effectively and judiciously use AI in medical education. METHODS We searched the literature published between January 1, 2022, and March 29, 2024, using query strings in Web of Science, PubMed, and Scopus. Two authors screened the literature according to the inclusion and exclusion criteria, extracted data, and independently assessed the quality of the literature using the Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) tool. We conducted both qualitative and quantitative analyses. RESULTS A total of 45 studies on the performance of different versions of ChatGPT in medical licensing examinations were included in this study. GPT-4 achieved an overall accuracy rate of 81% (95% CI 78-84; P<.01), significantly surpassing the 58% (95% CI 53-63; P<.01) accuracy rate of GPT-3.5. GPT-4 passed the medical examinations in 26 of 29 cases, outperforming the average scores of medical students in 13 of 17 cases. Translating the examination questions into English improved GPT-3.5's performance but did not affect GPT-4's. GPT-3.5 showed no difference in performance between examinations from English-speaking and non-English-speaking countries (P=.72), but GPT-4 performed significantly better on examinations from English-speaking countries (P=.02). Any type of prompt could significantly improve GPT-3.5's (P=.03) and GPT-4's (P<.01) performance. GPT-3.5 performed better on short-text questions than on long-text questions. The difficulty of the questions affected the performance of GPT-3.5 and GPT-4. In image-based multiple-choice questions (MCQs), ChatGPT's accuracy rate ranged from 13.1% to 100%. ChatGPT performed significantly worse on open-ended questions than on MCQs. CONCLUSIONS GPT-4 demonstrates considerable potential for future use in medical education. However, due to its insufficient accuracy, inconsistent performance, and the challenges posed by differing medical policies and knowledge across countries, GPT-4 is not yet suitable for use in medical education. TRIAL REGISTRATION PROSPERO CRD42024506687; https://www.crd.york.ac.uk/prospero/display_record.php?RecordID=506687.
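For readers unfamiliar with how a pooled accuracy such as 81% (95% CI 78-84) is typically derived, the sketch below shows a generic random-effects pooling of per-study accuracy proportions. The study counts are invented and the method (logit transform with DerSimonian-Laird between-study variance) is a standard assumption, not the authors' published code.

```python
# Hedged sketch: random-effects pooling of per-study accuracy rates
# (logit transform, DerSimonian-Laird tau^2). Study counts are invented.
import numpy as np

studies = [(160, 200), (85, 100), (240, 300), (70, 90)]  # (correct, total)

logits = np.array([np.log(k / (n - k)) for k, n in studies])
variances = np.array([1 / k + 1 / (n - k) for k, n in studies])  # var of each logit

# DerSimonian-Laird estimate of between-study variance tau^2
w = 1 / variances
mean_fixed = np.sum(w * logits) / np.sum(w)
q = np.sum(w * (logits - mean_fixed) ** 2)
c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
tau2 = max(0.0, (q - (len(studies) - 1)) / c)

# Random-effects pooled estimate with a 95% CI, back-transformed
w_star = 1 / (variances + tau2)
pooled = np.sum(w_star * logits) / np.sum(w_star)
se = np.sqrt(1 / np.sum(w_star))

def expit(x):
    return 1 / (1 + np.exp(-x))

print(f"pooled accuracy {expit(pooled):.1%} "
      f"(95% CI {expit(pooled - 1.96 * se):.1%}-{expit(pooled + 1.96 * se):.1%})")
```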
Affiliation(s)
- Mingxin Liu: Department of Health Communication, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
- Tsuyoshi Okuhara: Department of Health Communication, School of Public Health, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
- XinYi Chang: Department of Industrial Engineering and Economics, School of Engineering, Tokyo Institute of Technology, Tokyo, Japan
- Ritsuko Shirabe: Department of Health Communication, School of Public Health, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
- Yuriko Nishiie: Department of Health Communication, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
- Hiroko Okada: Department of Health Communication, School of Public Health, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
- Takahiro Kiuchi: Department of Health Communication, School of Public Health, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
5
Keyßer G, Pfeil A, Reuß-Borst M, Frohne I, Schultz O, Sander O. [What is the potential of ChatGPT for qualified patient information? An attempt at a structured analysis based on a survey regarding complementary and alternative medicine (CAM) in rheumatology]. Z Rheumatol 2024. PMID: 38985176; DOI: 10.1007/s00393-024-01535-6.
Abstract
INTRODUCTION The chatbot ChatGPT represents a milestone in the interaction between humans and large databases that are accessible via the internet. It facilitates the answering of complex questions by enabling communication in everyday language, and is therefore a potential source of information for those affected by rheumatic diseases. The aim of our investigation was to find out whether ChatGPT (version 3.5) is capable of giving qualified answers regarding the application of specific methods of complementary and alternative medicine (CAM) in three rheumatic diseases: rheumatoid arthritis (RA), systemic lupus erythematosus (SLE), and granulomatosis with polyangiitis (GPA). In addition, it was investigated how the answers of the chatbot were influenced by the wording of the question. METHODS The questioning of ChatGPT was performed in three parts. Part A consisted of an open question regarding the best way of treating the respective disease. In part B, the questions were directed towards possible indications for the application of CAM in general in one of the three disorders. In part C, the chatbot was asked for specific recommendations regarding one of three CAM methods: homeopathy, ayurvedic medicine, and herbal medicine. The questions in parts B and C were phrased in two variants: the first asked whether the specific CAM was applicable at all in the given rheumatic disease; the second asked which procedure of the respective CAM method worked best in the specific disease. The validity of the answers was checked using the ChatGPT reliability score, a Likert scale ranging from 1 (lowest validity) to 7 (highest validity). RESULTS The answers to the open questions of part A had the highest validity. In parts B and C, ChatGPT suggested a variety of CAM applications that lack scientific evidence. The validity of the answers depended on the wording of the questions. If the question suggested an inclination to apply a certain CAM, the answers often failed to mention the missing evidence and were graded with lower score values. CONCLUSION The answers of ChatGPT (version 3.5) regarding the applicability of CAM in selected rheumatic diseases are not convincingly based on scientific evidence. In addition, the wording of the questions affects the validity of the information. Currently, an uncritical application of ChatGPT as an instrument for patient information cannot be recommended.
Affiliation(s)
- Gernot Keyßer: Klinik und Poliklinik für Innere Medizin II, Universitätsklinikum Halle, Ernst-Grube-Str. 40, 06120 Halle (Saale), Germany
- Alexander Pfeil: Klinik für Innere Medizin III, Universitätsklinikum Jena, Friedrich-Schiller-Universität Jena, Jena, Germany
- Inna Frohne: Privatpraxis für Rheumatologie, Essen, Germany
- Olaf Schultz: Abteilung Rheumatologie, ACURA Kliniken Baden-Baden, Baden-Baden, Germany
- Oliver Sander: Klinik für Rheumatologie, Universitätsklinikum Düsseldorf, Düsseldorf, Germany
6
Huo W, He M, Zeng Z, Bao X, Lu Y, Tian W, Feng J, Feng R. Impact Analysis of COVID-19 Pandemic on Hospital Reviews on Dianping Website in Shanghai, China: Empirical Study. J Med Internet Res 2024; 26:e52992. PMID: 38954461; PMCID: PMC11252617; DOI: 10.2196/52992.
Abstract
BACKGROUND In the era of the internet, individuals have grown increasingly accustomed to gathering the information they need and expressing their opinions on public web-based platforms. The health care sector is no exception, as these comments, to a certain extent, influence people's health care decisions. How the medical experience of Chinese patients and their evaluations of hospitals changed during the onset of the COVID-19 pandemic remained to be studied. We therefore collected patient visit data from the internet to reflect the state of care relationships under these specific circumstances. OBJECTIVE This study aims to explore the differences in patient comments across various stages of the COVID-19 pandemic (before, during, and after), as well as among different types of hospitals (children's hospitals, maternity hospitals, and tumor hospitals). Additionally, by leveraging ChatGPT (OpenAI), the study categorizes the elements of negative hospital evaluations. An analysis is conducted on the acquired data, and potential solutions that could improve patient satisfaction are proposed. This study is intended to assist hospital managers in providing a better experience for patients who are seeking care amid an emergent public health crisis. METHODS Selecting the top 50 comprehensive hospitals nationwide and the top specialized hospitals (children's hospitals, tumor hospitals, and maternity hospitals), we collected patient reviews from these hospitals on the Dianping website. Using ChatGPT, we classified the content of negative reviews. Additionally, we conducted statistical analysis using SPSS (IBM Corp) to examine the scoring and composition of negative evaluations. RESULTS A total of 30,317 pieces of effective comment information were collected from January 1, 2018, to August 15, 2023, including 7696 pieces of negative comment information. Manual inspection indicated that ChatGPT had an accuracy rate of 92.05%, with an F1-score of 0.914. The analysis of these data revealed a significant correlation between the comments and ratings received by hospitals during the pandemic. Overall, there was a significant increase in average comment scores during the outbreak (P<.001). Furthermore, there were notable differences in the composition of negative comments among different types of hospitals (P<.001). Children's hospitals received sensitive feedback regarding waiting times and treatment effectiveness, while patients at maternity hospitals showed greater concern for the attitude of health care providers. Patients at tumor hospitals expressed a desire for timely examinations and treatments, especially during the pandemic period. CONCLUSIONS The COVID-19 pandemic had some association with patient comment scores. There were variations in the scores and content of comments among different types of specialized hospitals. Using ChatGPT to analyze patient comment content represents an innovative approach for statistically assessing factors contributing to patient dissatisfaction. The findings of this study could provide valuable insights for hospital administrators to foster more harmonious physician-patient relationships and enhance hospital performance during public health emergencies.
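The manual-validation step (92.05% accuracy, F1 = 0.914) amounts to comparing ChatGPT's assigned complaint categories against human labels over the same reviews. A minimal sketch with hypothetical category labels, not the study's data:

```python
# Hedged sketch: validate model-assigned review categories against
# human labels. The label values below are invented placeholders.
from sklearn.metrics import accuracy_score, f1_score

human =   ["wait_time", "attitude", "cost", "attitude", "wait_time", "effect"]
chatgpt = ["wait_time", "attitude", "cost", "wait_time", "wait_time", "effect"]

print("accuracy:", accuracy_score(human, chatgpt))
print("macro F1:", f1_score(human, chatgpt, average="macro"))
```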
Affiliation(s)
- Weixue Huo: Department of Vascular Surgery, Shanghai General Hospital, Shanghai Jiaotong University, Shanghai, China
- Mengwei He: Department of Vascular Surgery, Shanghai General Hospital, Shanghai Jiaotong University, Shanghai, China
- Zhaoxiang Zeng: Department of Vascular Surgery, Changhai Hospital, Navy Medical University, Shanghai, China
- Xianhao Bao: Department of Vascular Surgery, Shanghai General Hospital, Shanghai Jiaotong University, Shanghai, China
- Ye Lu: Department of Vascular Surgery, Shanghai General Hospital, Shanghai Jiaotong University, Shanghai, China
- Wen Tian: Department of Vascular Surgery, Shanghai General Hospital, Shanghai Jiaotong University, Shanghai, China
- Jiaxuan Feng: Vascular Surgery Department, Ruijin Hospital, Shanghai Jiaotong University School of Medicine, Shanghai, China
- Rui Feng: Department of Vascular Surgery, Shanghai General Hospital, Shanghai Jiaotong University, Shanghai, China
7
Kumar RP, Sivan V, Bachir H, Sarwar SA, Ruzicka F, O'Malley GR, Lobo P, Morales IC, Cassimatis ND, Hundal JS, Patel NV. Can Artificial Intelligence Mitigate Missed Diagnoses by Generating Differential Diagnoses for Neurosurgeons? World Neurosurg 2024; 187:e1083-e1088. PMID: 38759788; DOI: 10.1016/j.wneu.2024.05.052.
Abstract
BACKGROUND/OBJECTIVE Neurosurgery emphasizes the criticality of accurate differential diagnoses, with diagnostic delays posing significant health and economic challenges. As large language models (LLMs) emerge as transformative tools in healthcare, this study seeks to elucidate their role in assisting neurosurgeons with the differential diagnosis process, especially during preliminary consultations. METHODS This study employed three chat-based LLM platforms, ChatGPT (versions 3.5 and 4.0), Perplexity AI, and Bard AI, to evaluate their diagnostic accuracy. Each LLM was prompted using clinical vignettes, and their responses were recorded to generate differential diagnoses for 20 common and uncommon neurosurgical disorders. Disease-specific prompts were crafted using DynaMed, a clinical reference tool. The accuracy of the LLMs was determined by their ability to correctly identify the target disease within their top differential diagnoses. RESULTS For the initial differential, ChatGPT 3.5 achieved an accuracy of 52.63%, while ChatGPT 4.0 performed slightly better at 53.68%. Perplexity AI and Bard AI demonstrated 40.00% and 29.47% accuracy, respectively. As the number of considered differentials increased from 2 to 5, ChatGPT 3.5 reached its peak accuracy of 77.89% for the top 5 differentials. Bard AI and Perplexity AI had varied performances, with Bard AI improving in the top 5 differentials at 62.11%. On a disease-specific note, the LLMs excelled in diagnosing conditions like epilepsy and cervical spine stenosis but faced challenges with more complex diseases such as moyamoya disease and amyotrophic lateral sclerosis. CONCLUSIONS LLMs showcase the potential to enhance diagnostic accuracy and decrease the incidence of missed diagnoses in neurosurgery.
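The accuracy measure described here is top-k accuracy: a vignette counts as correct if the target disease appears among the model's first k differentials. A small sketch with invented vignettes:

```python
# Hedged sketch of top-k scoring over (target, ranked differentials)
# pairs. The case data are invented for illustration.
def top_k_accuracy(cases, k):
    hits = sum(target in differentials[:k] for target, differentials in cases)
    return hits / len(cases)

cases = [
    ("epilepsy", ["epilepsy", "syncope", "TIA"]),
    ("moyamoya disease", ["vasculitis", "atherosclerosis", "moyamoya disease"]),
    ("cervical spine stenosis", ["cervical spine stenosis", "ALS"]),
]

for k in (1, 3):
    print(f"top-{k} accuracy: {top_k_accuracy(cases, k):.0%}")
```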
Affiliation(s)
- Rohit Prem Kumar: Department of Neurosurgery, Hackensack Meridian School of Medicine, Nutley, New Jersey, USA
- Vijay Sivan: Department of Neurosurgery, Hackensack Meridian School of Medicine, Nutley, New Jersey, USA
- Hanin Bachir: Department of Neurosurgery, Hackensack Meridian School of Medicine, Nutley, New Jersey, USA
- Syed A Sarwar: Department of Neurosurgery, Hackensack Meridian School of Medicine, Nutley, New Jersey, USA
- Francis Ruzicka: Department of Neurosurgery, Hackensack Meridian School of Medicine, Nutley, New Jersey, USA
- Geoffrey R O'Malley: Department of Neurosurgery, Hackensack Meridian School of Medicine, Nutley, New Jersey, USA
- Paulo Lobo: Department of Neurosurgery, Hackensack Meridian School of Medicine, Nutley, New Jersey, USA
- Ilona Cazorla Morales: Department of Neurosurgery, Hackensack Meridian School of Medicine, Nutley, New Jersey, USA
- Nicholas D Cassimatis: Department of Neurosurgery, Hackensack Meridian School of Medicine, Nutley, New Jersey, USA
- Jasdeep S Hundal: Department of Neurology, HMH-Jersey Shore University Medical Center, Neptune, New Jersey, USA
- Nitesh V Patel: Department of Neurosurgery, Hackensack Meridian School of Medicine, Nutley, New Jersey, USA; Department of Neurosurgery, HMH-Jersey Shore University Medical Center, Neptune, New Jersey, USA
8
Rotem R, Zamstein O, Rottenstreich M, O'Sullivan OE, O'Reilly BA, Weintraub AY. The future of patient education: A study on AI-driven responses to urinary incontinence inquiries. Int J Gynaecol Obstet 2024. PMID: 38944693; DOI: 10.1002/ijgo.15751.
Abstract
OBJECTIVE To evaluate the effectiveness of ChatGPT in providing insights into common urinary incontinence concerns within urogynecology. By analyzing the model's responses against established benchmarks of accuracy, completeness, and safety, the study aimed to quantify its usefulness for informing patients and aiding healthcare providers. METHODS An expert-driven questionnaire was developed, inviting urogynecologists worldwide to assess ChatGPT's answers to 10 carefully selected questions on urinary incontinence (UI). These assessments focused on the accuracy of the responses, their comprehensiveness, and whether they raised any safety issues. Subsequent statistical analyses determined the average consensus among experts and identified the proportion of responses receiving favorable evaluations (a score of 4 or higher). RESULTS Of the 50 urogynecologists approached worldwide, 37 responded, offering insights into ChatGPT's responses on UI. The overall feedback averaged a score of 4.0, indicating positive acceptance. Accuracy scores averaged 3.9, with 71% rated favorably, whereas comprehensiveness scored slightly higher at 4.0, with 74% favorable ratings. Safety assessments also averaged 4.0, with 74% favorable responses. CONCLUSION This investigation underlines ChatGPT's favorable performance across the evaluated domains of accuracy, comprehensiveness, and safety within the context of UI queries. However, despite this broadly positive reception, the study also signals a clear avenue for improvement, particularly in the precision of the provided information. Refining ChatGPT's accuracy and ensuring the delivery of more pinpointed responses are essential steps forward, aiming to bolster its utility as a comprehensive educational resource for patients and a supportive tool for healthcare practitioners.
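The summary statistics reported here reduce to a mean rating and the share of ratings of 4 or higher. A trivial sketch on invented expert scores:

```python
# Hedged sketch: mean rating and "favorable" share (>= 4 on a 1-5
# scale). The scores are invented placeholders, not the survey data.
import statistics

scores = [5, 4, 4, 3, 5, 4, 2, 5, 4, 4]
mean = statistics.mean(scores)
favorable = sum(s >= 4 for s in scores) / len(scores)
print(f"mean {mean:.1f}, favorable {favorable:.0%}")
```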
Affiliation(s)
- Reut Rotem: Department of Urogynaecology, Cork University Maternity Hospital, Cork, Ireland; Department of Obstetrics and Gynecology, Shaare Zedek Medical Center, Affiliated with the Hebrew University School of Medicine, Jerusalem, Israel
- Omri Zamstein: Department of Obstetrics and Gynecology, Soroka University Medical Center, Faculty of Health Sciences, Ben-Gurion University of the Negev, Beer-Sheva, Israel
- Misgav Rottenstreich: Department of Obstetrics and Gynecology, Shaare Zedek Medical Center, Affiliated with the Hebrew University School of Medicine, Jerusalem, Israel
- Barry A O'Reilly: Department of Urogynaecology, Cork University Maternity Hospital, Cork, Ireland
- Adi Y Weintraub: Department of Obstetrics and Gynecology, Soroka University Medical Center, Faculty of Health Sciences, Ben-Gurion University of the Negev, Beer-Sheva, Israel
9
Yao JJ, Aggarwal M, Lopez RD, Namdari S. Current Concepts Review: Large Language Models in Orthopaedics: Definitions, Uses, and Limitations. J Bone Joint Surg Am 2024. PMID: 38896652; DOI: 10.2106/jbjs.23.01417.
Abstract
➤ Large language models are a subset of artificial intelligence. Large language models are powerful tools that excel in natural language text processing and generation.
➤ There are many potential clinical, research, and educational applications of large language models in orthopaedics, but the development of these applications needs to be focused on patient safety and the maintenance of high standards.
➤ There are numerous methodological, ethical, and regulatory concerns with regard to the use of large language models. Orthopaedic surgeons need to be aware of the controversies and advocate for an alignment of these models with patient and caregiver priorities.
Affiliation(s)
- Jie J Yao: Rothman Orthopaedic Institute, Thomas Jefferson University, Philadelphia, Pennsylvania
- Ryan D Lopez: Rothman Orthopaedic Institute, Thomas Jefferson University, Philadelphia, Pennsylvania
- Surena Namdari: Rothman Orthopaedic Institute, Thomas Jefferson University, Philadelphia, Pennsylvania
10
Lim B, Seth I, Cuomo R, Kenney PS, Ross RJ, Sofiadellis F, Pentangelo P, Ceccaroni A, Alfano C, Rozen WM. Can AI Answer My Questions? Utilizing Artificial Intelligence in the Perioperative Assessment for Abdominoplasty Patients. Aesthetic Plast Surg 2024. PMID: 38898239; DOI: 10.1007/s00266-024-04157-0.
Abstract
BACKGROUND Abdominoplasty is a common operation, used for a range of cosmetic and functional issues, often in the context of divarication of recti, significant weight loss, and after pregnancy. Despite this, patient-surgeon communication gaps can hinder informed decision-making. The integration of large language models (LLMs) in healthcare offers potential for enhancing patient information. This study evaluated the feasibility of using LLMs for answering perioperative queries. METHODS This study assessed the efficacy of four leading LLMs, OpenAI's ChatGPT-3.5, Anthropic's Claude, Google's Gemini, and Bing's CoPilot, using fifteen unique prompts. All outputs were evaluated using the Flesch-Kincaid grade, Flesch Reading Ease score, and Coleman-Liau index for readability assessment. The DISCERN score and a Likert scale were utilized to evaluate quality. Scores were assigned by two plastic surgery residents and then reviewed and discussed until a consensus was reached by five plastic surgeon specialists. RESULTS ChatGPT-3.5 required the highest reading level for comprehension, followed by Gemini, Claude, then CoPilot. Claude provided the most appropriate and actionable advice. In terms of patient-friendliness, CoPilot outperformed the rest, enhancing engagement and information comprehensiveness. ChatGPT-3.5 and Gemini offered adequate, though unremarkable, advice, employing more professional language. CoPilot uniquely included visual aids and was the only model to use hyperlinks, although these were not particularly helpful or appropriate, and it faced limitations in responding to certain queries. CONCLUSION ChatGPT-3.5, Gemini, Claude, and Bing's CoPilot showcased differences in readability and reliability. LLMs offer unique advantages for patient care but require careful selection. Future research should integrate LLM strengths and address weaknesses for optimal patient education. LEVEL OF EVIDENCE V This journal requires that authors assign a level of evidence to each article. For a full description of these Evidence-Based Medicine ratings, please refer to the Table of Contents or the online Instructions to Authors at www.springer.com/00266.
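Readability indices like those used here are mechanical formulas over sentence, word, and character counts, and the third-party textstat package implements all three. The snippet below is an assumed tooling choice with a made-up sample answer, shown only to illustrate the calls:

```python
# Hedged sketch using the `textstat` package (pip install textstat).
# The sample answer is invented, not a model output from the study.
import textstat

answer = "After an abdominoplasty you should avoid heavy lifting for six weeks."

print("Flesch Reading Ease:  ", textstat.flesch_reading_ease(answer))
print("Flesch-Kincaid grade: ", textstat.flesch_kincaid_grade(answer))
print("Coleman-Liau index:   ", textstat.coleman_liau_index(answer))
```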
Affiliation(s)
- Bryan Lim: Department of Plastic Surgery, Peninsula Health, Melbourne, Victoria, 3199, Australia
- Ishith Seth: Department of Plastic Surgery, Peninsula Health, Melbourne, Victoria, 3199, Australia
- Roberto Cuomo: Plastic Surgery Unit, Department of Medicine, Surgery and Neuroscience, University of Siena, Siena, Italy
- Peter Sinkjær Kenney: Department of Plastic Surgery, Velje Hospital, Beriderbakken 4, 7100, Vejle, Denmark; Department of Plastic and Breast Surgery, Aarhus University Hospital, Aarhus, Denmark
- Richard J Ross: Department of Plastic Surgery, Peninsula Health, Melbourne, Victoria, 3199, Australia
- Foti Sofiadellis: Department of Plastic Surgery, Peninsula Health, Melbourne, Victoria, 3199, Australia
- Warren Matthew Rozen: Department of Plastic Surgery, Peninsula Health, Melbourne, Victoria, 3199, Australia
11
MohanaSundaram A, Patil B, Praticò D. ChatGPT's Inconsistency in the Diagnosis of Alzheimer's Disease. J Alzheimers Dis Rep 2024; 8:923-925. PMID: 38910941; PMCID: PMC11191643; DOI: 10.3233/adr-240069.
Abstract
A recent article by El Haj et al. provided evidence that ChatGPT could be a potential tool to complement the clinical diagnosis of the various stages of Alzheimer's disease (AD) as well as mild cognitive impairment (MCI). To reassess the accuracy and reproducibility of ChatGPT in the diagnosis of AD and MCI, we used the same prompt used by the authors. Surprisingly, we found that some of ChatGPT's responses in diagnosing the various stages of AD and MCI differed from those originally reported. In this commentary, we discuss possible reasons for these divergent results and propose strategies for future studies.
Affiliation(s)
- Bhushan Patil: MannSparsh Neuropsychiatric Hospital, Kalyan, India; Manasa Rehabilitation and De-Addiction Center, Titwala, India
- Domenico Praticò: Alzheimer's Center at Temple, Lewis Katz School of Medicine, Temple University, Philadelphia, PA, USA
12
Croxford E, Gao Y, Patterson B, To D, Tesch S, Dligach D, Mayampurath A, Churpek MM, Afshar M. Development of a Human Evaluation Framework and Correlation with Automated Metrics for Natural Language Generation of Medical Diagnoses. medRxiv [preprint] 2024:2024.03.20.24304620. PMID: 38562730; PMCID: PMC10984060; DOI: 10.1101/2024.03.20.24304620.
Abstract
In the evolving landscape of clinical Natural Language Generation (NLG), assessing abstractive text quality remains challenging, as existing methods often overlook the complexities of generative tasks. This work aimed to examine the current state of automated evaluation metrics for NLG in healthcare. To have a robust and well-validated baseline against which to examine the alignment of these metrics, we created a comprehensive human evaluation framework. Employing ChatGPT-3.5-turbo generative output, we correlated human judgments with each metric. None of the metrics demonstrated high alignment; however, the SapBERT score, a metric grounded in the Unified Medical Language System (UMLS), showed the best results. This underscores the importance of incorporating domain-specific knowledge into evaluation efforts. Our work reveals the deficiency in quality evaluations for generated text and introduces our comprehensive human evaluation framework as a baseline. Future efforts should prioritize integrating medical knowledge databases to enhance the alignment of automated metrics, particularly focusing on refining the SapBERT score for improved assessments.
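Alignment between an automated metric and human judgment is commonly quantified with a rank correlation over the same set of generated texts. A sketch with invented scores, where the metric column merely stands in for something like a SapBERT-based score:

```python
# Hedged sketch: rank correlation between human quality judgments and
# an automated metric over the same outputs. All values are invented.
from scipy.stats import spearmanr

human_scores  = [4, 2, 5, 3, 1, 4, 2, 5]                           # human framework
metric_scores = [0.71, 0.40, 0.88, 0.55, 0.20, 0.64, 0.52, 0.90]   # automated metric

rho, p = spearmanr(human_scores, metric_scores)
print(f"Spearman rho={rho:.2f} (p={p:.3f})")
```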
Affiliation(s)
- Emma Croxford: Department of Medicine, School of Medicine and Public Health, University of Wisconsin Madison
- Yanjun Gao: Department of Medicine, School of Medicine and Public Health, University of Wisconsin Madison
- Brian Patterson: Department of Emergency Medicine, School of Medicine and Public Health, University of Wisconsin Madison
- Daniel To: Department of Medicine, School of Medicine and Public Health, University of Wisconsin Madison
- Samuel Tesch: Department of Medicine, School of Medicine and Public Health, University of Wisconsin Madison
- Anoop Mayampurath: Biostatistics and Medical Informatics, School of Medicine and Public Health, University of Wisconsin Madison
- Matthew M Churpek: Department of Medicine, School of Medicine and Public Health, University of Wisconsin Madison
- Majid Afshar: Department of Medicine, School of Medicine and Public Health, University of Wisconsin Madison
13
Koga S. The double-edged nature of ChatGPT in self-diagnosis. Wien Klin Wochenschr 2024; 136:243-244. PMID: 38504058; DOI: 10.1007/s00508-024-02343-3.
Affiliation(s)
- Shunsuke Koga: Department of Pathology and Laboratory Medicine, Hospital of the University of Pennsylvania, 3400 Spruce Street, 19104, Philadelphia, PA, USA
14
Parekh AS, McCahon JAS, Nghe A, Pedowitz DI, Daniel JN, Parekh SG. Foot and Ankle Patient Education Materials and Artificial Intelligence Chatbots: A Comparative Analysis. Foot Ankle Spec 2024. PMID: 38504411; DOI: 10.1177/19386400241235834.
Abstract
BACKGROUND The purpose of this study was to perform a comparative analysis of foot and ankle patient education material generated by AI chatbots, as it compares to the American Orthopaedic Foot and Ankle Society (AOFAS)-recommended patient education website, FootCareMD.org. METHODS ChatGPT, Google Bard, and Bing AI were used to generate patient education materials on 10 of the most common foot and ankle conditions. The content from these AI language model platforms was analyzed and compared with that on FootCareMD.org for accuracy of included information. Accuracy was determined for each of the 10 conditions on the basis of included information regarding background, symptoms, causes, diagnosis, treatments, surgical options, recovery procedures, and risks or prevention. RESULTS When compared to the reference standard of the AOFAS website FootCareMD.org, the AI language model platforms consistently scored below 60% accuracy across all categories of the articles analyzed. ChatGPT was found to contain an average of 46.2% of key content across all included conditions when compared to FootCareMD.org. Comparatively, Google Bard and Bing AI contained 36.5% and 28.0% of the information included on FootCareMD.org, respectively (P < .005). CONCLUSION Patient education regarding common foot and ankle conditions generated by AI language models provides limited content accuracy across all three AI chatbot platforms. LEVEL OF EVIDENCE Level IV.
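The accuracy scoring described here is essentially checklist coverage: the fraction of the reference site's key content elements that a chatbot answer reproduces. A toy sketch, with hypothetical element names rather than the study's actual checklist:

```python
# Hedged sketch: checklist coverage of reference key-content elements.
# Element names are hypothetical stand-ins for the study's criteria.
reference_elements = {"background", "symptoms", "causes", "diagnosis",
                      "treatments", "surgery", "recovery", "risks"}
chatbot_elements = {"background", "symptoms", "treatments", "recovery"}

coverage = len(chatbot_elements & reference_elements) / len(reference_elements)
print(f"content coverage: {coverage:.1%}")  # 50.0% for this toy input
```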
Affiliation(s)
- Aarav S Parekh: Rothman Orthopaedic Institute, Philadelphia, Pennsylvania
- Amy Nghe: Rothman Orthopaedic Institute, Philadelphia, Pennsylvania
15
Li J, Dada A, Puladi B, Kleesiek J, Egger J. ChatGPT in healthcare: A taxonomy and systematic review. Comput Methods Programs Biomed 2024; 245:108013. PMID: 38262126; DOI: 10.1016/j.cmpb.2024.108013.
Abstract
The recent release of ChatGPT, a chatbot research project/product of natural language processing (NLP) by OpenAI, has stirred up a sensation among both the general public and medical professionals, amassing a phenomenally large user base in a short time. This is a typical example of the 'productization' of cutting-edge technologies, which allows the general public without a technical background to gain firsthand experience in artificial intelligence (AI), similar to the AI hype created by AlphaGo (DeepMind Technologies, UK) and self-driving cars (Google, Tesla, etc.). However, it is crucial, especially for healthcare researchers, to remain prudent amidst the hype. This work provides a systematic review of existing publications on the use of ChatGPT in healthcare, elucidating the 'status quo' of ChatGPT in medical applications, for general readers, healthcare professionals, and NLP scientists. The large biomedical literature database PubMed is used to retrieve published works on this topic using the keyword 'ChatGPT'. An inclusion criterion and a taxonomy are further proposed to filter the search results and categorize the selected publications, respectively. It is found through the review that the current release of ChatGPT has achieved only moderate or 'passing' performance in a variety of tests and is unreliable for actual clinical deployment, since it is not intended for clinical applications by design. We conclude that specialized NLP models trained on (bio)medical datasets still represent the right direction to pursue for critical clinical applications.
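The retrieval step described here (querying PubMed with the keyword 'ChatGPT') can be scripted; the sketch below uses Biopython's Entrez utilities as one assumed way to do it, not the authors' actual pipeline. The email address is a placeholder required by NCBI.

```python
# Hedged sketch: keyword search of PubMed via Biopython's Entrez
# utilities (pip install biopython). Requires network access.
from Bio import Entrez

Entrez.email = "you@example.org"  # placeholder; NCBI requires a contact address

handle = Entrez.esearch(db="pubmed", term="ChatGPT", retmax=1000)
record = Entrez.read(handle)
handle.close()

print("hits:", record["Count"])
print("first PMIDs:", record["IdList"][:5])
```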
Affiliation(s)
- Jianning Li: Institute for Artificial Intelligence in Medicine, University Hospital Essen (AöR), Girardetstraße 2, 45131 Essen, Germany
- Amin Dada: Institute for Artificial Intelligence in Medicine, University Hospital Essen (AöR), Girardetstraße 2, 45131 Essen, Germany
- Behrus Puladi: Institute of Medical Informatics, University Hospital RWTH Aachen, Pauwelsstraße 30, 52074 Aachen, Germany; Department of Oral and Maxillofacial Surgery, University Hospital RWTH Aachen, Pauwelsstraße 30, 52074 Aachen, Germany
- Jens Kleesiek: Institute for Artificial Intelligence in Medicine, University Hospital Essen (AöR), Girardetstraße 2, 45131 Essen, Germany; Department of Physics, TU Dortmund University, Otto-Hahn-Straße 4, 44227 Dortmund, Germany
- Jan Egger: Institute for Artificial Intelligence in Medicine, University Hospital Essen (AöR), Girardetstraße 2, 45131 Essen, Germany; Center for Virtual and Extended Reality in Medicine (ZvRM), University Hospital Essen, University Medicine Essen, Hufelandstraße 55, 45147 Essen, Germany
16
Nacher M, Françoise U, Adenis A. ChatGPT neglects a neglected disease. Lancet Infect Dis 2024; 24:e76. PMID: 38211603; DOI: 10.1016/s1473-3099(23)00750-8.
Affiliation(s)
- Mathieu Nacher: CIC Inserm 1424, Amazonian Institute of Population Health, Centre Hospitalier de Cayenne, 97300 Cayenne, French Guiana
- Ugo Françoise: CIC Inserm 1424, Amazonian Institute of Population Health, Centre Hospitalier de Cayenne, 97300 Cayenne, French Guiana
- Antoine Adenis: CIC Inserm 1424, Amazonian Institute of Population Health, Centre Hospitalier de Cayenne, 97300 Cayenne, French Guiana
17
Thirunavukarasu AJ. How Can the Clinical Aptitude of AI Assistants Be Assayed? J Med Internet Res 2023; 25:e51603. PMID: 38051572; PMCID: PMC10731545; DOI: 10.2196/51603.
Abstract
Large language models (LLMs) are exhibiting remarkable performance in clinical contexts, with exemplar results ranging from expert-level attainment in medical examination questions to superior accuracy and relevance when responding to patient queries compared to real doctors replying to queries on social media. The deployment of LLMs in conventional health care settings is yet to be reported, and there remains an open question as to what evidence should be required before such deployment is warranted. Early validation studies use unvalidated surrogate variables to represent clinical aptitude, and it may be necessary to conduct prospective randomized controlled trials to justify the use of an LLM for clinical advice or assistance, as potential pitfalls and pain points cannot be exhaustively predicted. This viewpoint states that as LLMs continue to revolutionize the field, there is an opportunity to improve the rigor of artificial intelligence (AI) research to reward innovation, conferring real benefits to real patients.
Affiliation(s)
- Arun James Thirunavukarasu: Oxford University Clinical Academic Graduate School, University of Oxford, Oxford, United Kingdom; School of Clinical Medicine, University of Cambridge, Cambridge, United Kingdom
18
Sallam M, Barakat M, Sallam M. Pilot Testing of a Tool to Standardize the Assessment of the Quality of Health Information Generated by Artificial Intelligence-Based Models. Cureus 2023; 15:e49373. PMID: 38024074; PMCID: PMC10674084; DOI: 10.7759/cureus.49373.
Abstract
Background Artificial intelligence (AI)-based conversational models, such as Chat Generative Pre-trained Transformer (ChatGPT), Microsoft Bing, and Google Bard, have emerged as valuable sources of health information for lay individuals. However, the accuracy of the information provided by these AI models remains a significant concern. This pilot study aimed to test a new tool with key themes for inclusion as follows: Completeness of content, Lack of false information in the content, Evidence supporting the content, Appropriateness of the content, and Relevance, referred to as "CLEAR", designed to assess the quality of health information delivered by AI-based models. Methods Tool development involved a literature review on health information quality, followed by the initial establishment of the CLEAR tool, which comprised five items that aimed to assess the following: completeness, lack of false information, evidence support, appropriateness, and relevance. Each item was scored on a five-point Likert scale from excellent to poor. Content validity was checked by expert review. Pilot testing involved 32 healthcare professionals using the CLEAR tool to assess content on eight different health topics deliberately designed with varying qualities. The internal consistency was checked with Cronbach's alpha (α). Feedback from the pilot test resulted in language modifications to improve the clarity of the items. The final CLEAR tool was used to assess the quality of health information generated by four distinct AI models on five health topics. The AI models were ChatGPT 3.5, ChatGPT 4, Microsoft Bing, and Google Bard, and the content generated was scored by two independent raters, with Cohen's kappa (κ) used for inter-rater agreement. Results The final five CLEAR items were: (1) Is the content sufficient?; (2) Is the content accurate?; (3) Is the content evidence-based?; (4) Is the content clear, concise, and easy to understand?; and (5) Is the content free from irrelevant information? Pilot testing on the eight health topics revealed acceptable internal consistency with a Cronbach's α range of 0.669-0.981. The use of the final CLEAR tool yielded the following average scores: Microsoft Bing (mean=24.4±0.42), ChatGPT-4 (mean=23.6±0.96), Google Bard (mean=21.2±1.79), and ChatGPT-3.5 (mean=20.6±5.20). The inter-rater agreement revealed the following Cohen κ values: ChatGPT-3.5 (κ=0.875, P<.001), ChatGPT-4 (κ=0.780, P<.001), Microsoft Bing (κ=0.348, P=.037), and Google Bard (κ=0.749, P<.001). Conclusions The CLEAR tool is a brief yet helpful tool that can aid in standardizing testing of the quality of health information generated by AI-based models. Future studies are recommended to validate the utility of the CLEAR tool in the quality assessment of AI-generated health-related content using a larger sample across various complex health topics.
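Cronbach's alpha, used here for internal consistency, follows directly from item and total-score variances: α = k/(k-1) × (1 - Σσ²_item/σ²_total). A sketch with an invented raters-by-items matrix for the five CLEAR items:

```python
# Hedged sketch: Cronbach's alpha from its standard formula. The
# rating matrix (rows = raters, columns = the 5 CLEAR items) is invented.
import numpy as np

ratings = np.array([
    [5, 4, 4, 5, 4],
    [4, 4, 3, 4, 4],
    [5, 5, 4, 5, 5],
    [3, 3, 3, 4, 3],
    [4, 4, 4, 4, 5],
])

k = ratings.shape[1]
item_vars = ratings.var(axis=0, ddof=1)          # variance of each item
total_var = ratings.sum(axis=1).var(ddof=1)      # variance of total scores
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print(f"Cronbach's alpha: {alpha:.3f}")
```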
Affiliation(s)
- Malik Sallam: Department of Pathology, Microbiology, and Forensic Medicine, School of Medicine, University of Jordan, Amman, Jordan; Department of Clinical Laboratories and Forensic Medicine, Jordan University Hospital, Amman, Jordan
- Muna Barakat: Department of Clinical Pharmacy and Therapeutics, School of Pharmacy, Applied Science Private University, Amman, Jordan; Department of Research, Middle East University, Amman, Jordan
- Mohammed Sallam: Department of Pharmacy, Mediclinic Parkview Hospital, Mediclinic Middle East, Dubai, United Arab Emirates