1. Ramasubramanian S, Balaji S, Kannan T, Jeyaraman N, Sharma S, Migliorini F, Balasubramaniam S, Jeyaraman M. Comparative evaluation of artificial intelligence systems' accuracy in providing medical drug dosages: A methodological study. World J Methodol 2024; 14:92802. [DOI: 10.5662/wjm.v14.i4.92802]
Abstract
BACKGROUND Medication errors, especially in dosage calculation, pose risks in healthcare. Artificial intelligence (AI) systems like ChatGPT and Google Bard may help reduce errors, but their accuracy in providing medication information remains to be evaluated.
AIM To evaluate the accuracy of AI systems (ChatGPT 3.5, ChatGPT 4, Google Bard) in providing drug dosage information per Harrison's Principles of Internal Medicine.
METHODS A set of natural language queries mimicking real-world medical dosage inquiries was presented to the AI systems. Responses were analyzed using a 3-point Likert scale. The analysis, conducted with Python and its libraries, focused on basic statistics, overall system accuracy, and disease-specific and organ system accuracies.
RESULTS ChatGPT 4 outperformed the other systems, showing the highest rate of correct responses (83.77%) and the best overall weighted accuracy (0.6775). Disease-specific accuracy varied notably across systems, with some diseases being accurately recognized, while others demonstrated significant discrepancies. Organ system accuracy also showed variable results, underscoring system-specific strengths and weaknesses.
CONCLUSION ChatGPT 4 demonstrates superior reliability in medical dosage information, yet variations across diseases emphasize the need for ongoing improvements. These results highlight AI's potential in aiding healthcare professionals, urging continuous development for dependable accuracy in critical medical situations.
Affiliation(s)
- Swaminathan Ramasubramanian: Department of Orthopaedics, Government Medical College, Omandurar Government Estate, Chennai 600002, Tamil Nadu, India
- Sangeetha Balaji: Department of Orthopaedics, Government Medical College, Omandurar Government Estate, Chennai 600002, Tamil Nadu, India
- Tejashri Kannan: Department of Orthopaedics, Government Medical College, Omandurar Government Estate, Chennai 600002, Tamil Nadu, India
- Naveen Jeyaraman: Department of Orthopaedics, ACS Medical College and Hospital, Dr MGR Educational and Research Institute, Chennai 600077, Tamil Nadu, India
- Shilpa Sharma: Department of Paediatric Surgery, All India Institute of Medical Sciences, New Delhi 110029, India
- Filippo Migliorini: Department of Life Sciences, Health, Link Campus University, Rome 00165, Italy; Department of Orthopaedic and Trauma Surgery, Academic Hospital of Bolzano (SABES-ASDAA), Teaching Hospital of the Paracelsus Medical University, Bolzano 39100, Italy
- Suhasini Balasubramaniam: Department of Radio-Diagnosis, Government Stanley Medical College and Hospital, Chennai 600001, Tamil Nadu, India
- Madhan Jeyaraman: Department of Orthopaedics, ACS Medical College and Hospital, Dr MGR Educational and Research Institute, Chennai 600077, Tamil Nadu, India
2. Xu X, Yang Y, Tan X, Zhang Z, Wang B, Yang X, Weng C, Yu R, Zhao Q, Quan S. Hepatic encephalopathy post-TIPS: Current status and prospects in predictive assessment. Comput Struct Biotechnol J 2024; 24:493-506. [PMID: 39076168] [PMCID: PMC11284497] [DOI: 10.1016/j.csbj.2024.07.008]
Abstract
Transjugular intrahepatic portosystemic shunt (TIPS) is an essential procedure for the treatment of portal hypertension but can result in hepatic encephalopathy (HE), a serious complication that worsens patient outcomes. Investigating predictors of HE after TIPS is essential to improve prognosis. This review analyzes risk factors and compares predictive models, weighing traditional scores such as Child-Pugh, Model for End-Stage Liver Disease (MELD), and albumin-bilirubin (ALBI) against emerging artificial intelligence (AI) techniques. While traditional scores provide initial insights into HE risk, they have limitations in dealing with clinical complexity. Advances in machine learning (ML), particularly when integrated with imaging and clinical data, offer refined assessments. These innovations suggest the potential for AI to significantly improve the prediction of post-TIPS HE. The study provides clinicians with a comprehensive overview of current prediction methods, while advocating for the integration of AI to increase the accuracy of post-TIPS HE assessments. By harnessing the power of AI, clinicians can better manage the risks associated with TIPS and tailor interventions to individual patient needs. Future research should therefore prioritize the development of advanced AI frameworks that can assimilate diverse data streams to support clinical decision-making. The goal is not only to more accurately predict HE, but also to improve overall patient care and quality of life.
Affiliation(s)
- Xiaowei Xu: Department of Gastroenterology Nursing Unit, Ward 192, The First Affiliated Hospital of Wenzhou Medical University, Wenzhou 325000, China
- Yun Yang: School of Nursing, Wenzhou Medical University, Wenzhou 325001, China
- Xinru Tan: The First School of Medicine, School of Information and Engineering, Wenzhou Medical University, Wenzhou 325001, China
- Ziyang Zhang: School of Clinical Medicine, Guizhou Medical University, Guiyang 550025, China
- Boxiang Wang: The First School of Medicine, School of Information and Engineering, Wenzhou Medical University, Wenzhou 325001, China
- Xiaojie Yang: Wenzhou Medical University Renji College, Wenzhou 325000, China
- Chujun Weng: The Fourth Affiliated Hospital, Zhejiang University School of Medicine, Yiwu 322000, China
- Rongwen Yu: Wenzhou Institute, University of Chinese Academy of Sciences, Wenzhou 325000, China
- Qi Zhao: School of Computer Science and Software Engineering, University of Science and Technology Liaoning, Anshan 114051, China
- Shichao Quan: Department of Big Data in Health Science, The First Affiliated Hospital of Wenzhou Medical University, Wenzhou 325000, China
3. Hua R, Dong X, Wei Y, Shu Z, Yang P, Hu Y, Zhou S, Sun H, Yan K, Yan X, Chang K, Li X, Bai Y, Zhang R, Wang W, Zhou X. Lingdan: enhancing encoding of traditional Chinese medicine knowledge for clinical reasoning tasks with large language models. J Am Med Inform Assoc 2024; 31:2019-2029. [PMID: 39038795] [PMCID: PMC11339528] [DOI: 10.1093/jamia/ocae087]
Abstract
OBJECTIVE The recent surge in large language models (LLMs) across various fields has yet to be fully realized in traditional Chinese medicine (TCM). This study aims to bridge this gap by developing a large language model tailored to TCM knowledge, enhancing its performance and accuracy in clinical reasoning tasks such as diagnosis, treatment, and prescription recommendations. MATERIALS AND METHODS This study harnessed a wide array of TCM data resources, including TCM ancient books, textbooks, and clinical data, to create 3 key datasets: the TCM Pre-trained Dataset, the Traditional Chinese Patent Medicine (TCPM) Question Answering Dataset, and the Spleen and Stomach Herbal Prescription Recommendation Dataset. These datasets underpinned the development of the Lingdan Pre-trained LLM and 2 specialized models: the Lingdan-TCPM-Chat Model, which uses a Chain-of-Thought process for symptom analysis and TCPM recommendation, and a Lingdan Prescription Recommendation model (Lingdan-PR) that proposes herbal prescriptions based on electronic medical records. RESULTS The Lingdan-TCPM-Chat Model and the Lingdan-PR Model, fine-tuned on the Lingdan Pre-trained LLM, demonstrated state-of-the-art performance on the tasks of TCM clinical knowledge answering and herbal prescription recommendation. Notably, Lingdan-PR outperformed all state-of-the-art baseline models, achieving an improvement of 18.39% in the Top@20 F1-score compared with the best baseline. CONCLUSION This study marks a pivotal step in merging advanced LLMs with TCM, showcasing the potential of artificial intelligence to help improve clinical decision-making in medical diagnostics and treatment strategies. The success of the Lingdan Pre-trained LLM and its derivative models, Lingdan-TCPM-Chat and Lingdan-PR, not only revolutionizes TCM practices but also opens new avenues for the application of artificial intelligence in other specialized medical fields. Our project is available at https://github.com/TCMAI-BJTU/LingdanLLM.
Affiliation(s)
- Rui Hua: Institute of Medical Intelligence, Beijing Key Lab of Traffic Data Analysis and Mining, School of Computer Science & Technology, Beijing Jiaotong University, Beijing 100044, China
- Xin Dong: Institute of Medical Intelligence, Beijing Key Lab of Traffic Data Analysis and Mining, School of Computer Science & Technology, Beijing Jiaotong University, Beijing 100044, China
- Yu Wei: Innovation Center of Digital & Intelligent Chinese Medicine, Tasly Pharmaceutical Group Co., Ltd., Tianjin 300410, China
- Zixin Shu: Institute of Medical Intelligence, Beijing Key Lab of Traffic Data Analysis and Mining, School of Computer Science & Technology, Beijing Jiaotong University, Beijing 100044, China
- Pengcheng Yang: Innovation Center of Digital & Intelligent Chinese Medicine, Tasly Pharmaceutical Group Co., Ltd., Tianjin 300410, China
- Yunhui Hu: Innovation Center of Digital & Intelligent Chinese Medicine, Tasly Pharmaceutical Group Co., Ltd., Tianjin 300410, China
- Shuiping Zhou: Innovation Center of Digital & Intelligent Chinese Medicine, Tasly Pharmaceutical Group Co., Ltd., Tianjin 300410, China
- He Sun: Innovation Center of Digital & Intelligent Chinese Medicine, Tasly Pharmaceutical Group Co., Ltd., Tianjin 300410, China
- Kaijing Yan: Innovation Center of Digital & Intelligent Chinese Medicine, Tasly Pharmaceutical Group Co., Ltd., Tianjin 300410, China
- Xijun Yan: Innovation Center of Digital & Intelligent Chinese Medicine, Tasly Pharmaceutical Group Co., Ltd., Tianjin 300410, China
- Kai Chang: Institute of Medical Intelligence, Beijing Key Lab of Traffic Data Analysis and Mining, School of Computer Science & Technology, Beijing Jiaotong University, Beijing 100044, China
- Xiaodong Li: Affiliated Hospital of Hubei University of Chinese Medicine, Wuhan 430065, China; Hubei Academy of Chinese Medicine, Wuhan 430061, China; Institute of Liver Diseases, Hubei Key Laboratory of Theoretical and Applied Research of Liver and Kidney in Traditional Chinese Medicine, Hubei Provincial Hospital of Traditional Chinese Medicine, Wuhan 430061, China
- Yuning Bai: Department of Gastroenterology, Guang’anmen Hospital, China Academy of Chinese Medical Sciences, Beijing 100053, China
- Runshun Zhang: Department of Gastroenterology, Guang’anmen Hospital, China Academy of Chinese Medical Sciences, Beijing 100053, China
- Wenjia Wang: Innovation Center of Digital & Intelligent Chinese Medicine, Tasly Pharmaceutical Group Co., Ltd., Tianjin 300410, China
- Xuezhong Zhou: Institute of Medical Intelligence, Beijing Key Lab of Traffic Data Analysis and Mining, School of Computer Science & Technology, Beijing Jiaotong University, Beijing 100044, China
4. Andreadis K, Newman DR, Twan C, Shunk A, Mann DM, Stevens ER. Mixed methods assessment of the influence of demographics on medical advice of ChatGPT. J Am Med Inform Assoc 2024; 31:2002-2009. [PMID: 38679900] [PMCID: PMC11339520] [DOI: 10.1093/jamia/ocae086]
Abstract
OBJECTIVES To evaluate demographic biases in diagnostic accuracy and health advice between generative artificial intelligence (AI) (ChatGPT GPT-4) and traditional symptom checkers like WebMD. MATERIALS AND METHODS Combination symptom and demographic vignettes were developed for 27 most common symptom complaints. Standardized prompts, written from a patient perspective, with varying demographic permutations of age, sex, and race/ethnicity were entered into ChatGPT (GPT-4) between July and August 2023. In total, 3 runs of 540 ChatGPT prompts were compared to the corresponding WebMD Symptom Checker output using a mixed-methods approach. In addition to diagnostic correctness, the associated text generated by ChatGPT was analyzed for readability (using Flesch-Kincaid Grade Level) and qualitative aspects like disclaimers and demographic tailoring. RESULTS ChatGPT matched WebMD in 91% of diagnoses, with a 24% top diagnosis match rate. Diagnostic accuracy was not significantly different across demographic groups, including age, race/ethnicity, and sex. ChatGPT's urgent care recommendations and demographic tailoring were presented significantly more to 75-year-olds versus 25-year-olds (P < .01) but were not statistically different among race/ethnicity and sex groups. The GPT text was suitable for college students, with no significant demographic variability. DISCUSSION The use of non-health-tailored generative AI, like ChatGPT, for simple symptom-checking functions provides comparable diagnostic accuracy to commercially available symptom checkers and does not demonstrate significant demographic bias in this setting. The text accompanying differential diagnoses, however, suggests demographic tailoring that could potentially introduce bias. CONCLUSION These results highlight the need for continued rigorous evaluation of AI-driven medical platforms, focusing on demographic biases to ensure equitable care.
Affiliation(s)
- Katerina Andreadis: Department of Population Health, NYU Grossman School of Medicine, New York, NY 10016, United States
- Devon R Newman: Department of Population Health, NYU Grossman School of Medicine, New York, NY 10016, United States; Brown University, Providence, RI 02912, United States
- Chelsea Twan: Department of Population Health, NYU Grossman School of Medicine, New York, NY 10016, United States
- Amelia Shunk: Department of Population Health, NYU Grossman School of Medicine, New York, NY 10016, United States
- Devin M Mann: Department of Population Health, NYU Grossman School of Medicine, New York, NY 10016, United States; Medical Center Information Technology, NYU Langone Health, New York, NY 10016, United States
- Elizabeth R Stevens: Department of Population Health, NYU Grossman School of Medicine, New York, NY 10016, United States
5. Warrier A, Singh R, Haleem A, Zaki H, Eloy JA. The Comparative Diagnostic Capability of Large Language Models in Otolaryngology. Laryngoscope 2024; 134:3997-4002. [PMID: 38563415] [DOI: 10.1002/lary.31434]
Abstract
OBJECTIVES Evaluate and compare the ability of large language models (LLMs) to diagnose various ailments in otolaryngology. METHODS We collected all 100 clinical vignettes from the second edition of Otolaryngology Cases-The University of Cincinnati Clinical Portfolio by Pensak et al. With the addition of the prompt "Provide a diagnosis given the following history," we prompted ChatGPT-3.5, Google Bard, and Bing-GPT4 to provide a diagnosis for each vignette. These diagnoses were compared to the portfolio for accuracy and recorded. All queries were run in June 2023. RESULTS ChatGPT-3.5 was the most accurate model (89% success rate), followed by Google Bard (82%) and Bing GPT (74%). A chi-squared test revealed a significant difference between the three LLMs in providing correct diagnoses (p = 0.023). Of the 100 vignettes, seven require additional testing results (i.e., biopsy, non-contrast CT) for accurate clinical diagnosis. When omitting these vignettes, the revised success rates were 95.7% for ChatGPT-3.5, 88.17% for Google Bard, and 78.72% for Bing-GPT4 (p = 0.002). CONCLUSIONS ChatGPT-3.5 offers the most accurate diagnoses when given established clinical vignettes as compared to Google Bard and Bing-GPT4. LLMs may accurately offer assessments for common otolaryngology conditions but currently require detailed prompt information and critical supervision from clinicians. There is vast potential in the clinical applicability of LLMs; however, practitioners should be wary of possible "hallucinations" and misinformation in responses. LEVEL OF EVIDENCE 3 Laryngoscope, 134:3997-4002, 2024.
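The significance test reported in this abstract can be reproduced from the stated success rates alone, assuming each model was run on the same 100 vignettes and a chi-squared test was applied to the 3x2 table of correct versus incorrect counts (an assumption about the authors' exact setup, not a detail given in the abstract). A minimal sketch:

```python
from scipy.stats import chi2_contingency

# Correct vs incorrect diagnoses out of 100 vignettes per model,
# reconstructed from the reported success rates (89%, 82%, 74%).
observed = [
    [89, 11],  # ChatGPT-3.5
    [82, 18],  # Google Bard
    [74, 26],  # Bing-GPT4
]
chi2, p, dof, expected = chi2_contingency(observed)
print(round(chi2, 2), round(p, 3))  # p is approximately 0.023, matching the value in the abstract
```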
Affiliation(s)
- Akshay Warrier: Department of Otolaryngology-Head and Neck Surgery, Rutgers New Jersey Medical School, Newark, New Jersey, U.S.A.
- Rohan Singh: Department of Otolaryngology-Head and Neck Surgery, Rutgers New Jersey Medical School, Newark, New Jersey, U.S.A.
- Afash Haleem: Department of Otolaryngology-Head and Neck Surgery, Rutgers New Jersey Medical School, Newark, New Jersey, U.S.A.
- Haider Zaki: Department of Otolaryngology-Head and Neck Surgery, Rutgers New Jersey Medical School, Newark, New Jersey, U.S.A.
- Jean Anderson Eloy: Department of Otolaryngology-Head and Neck Surgery, Rutgers New Jersey Medical School, Newark, New Jersey, U.S.A.; Center for Skull Base and Pituitary Surgery, Neurological Institute of New Jersey, Rutgers New Jersey Medical School, Newark, New Jersey, U.S.A.
6. Wu G, Lee DA, Zhao W, Wong A, Jhangiani R, Kurniawan S. ChatGPT and Google Assistant as a Source of Patient Education for Patients With Amblyopia: Content Analysis. J Med Internet Res 2024; 26:e52401. [PMID: 39146013] [DOI: 10.2196/52401]
Abstract
BACKGROUND We queried ChatGPT (OpenAI) and Google Assistant about amblyopia and compared their answers with the keywords found on the American Association for Pediatric Ophthalmology and Strabismus (AAPOS) website, specifically the section on amblyopia. Out of the 26 keywords chosen from the website, ChatGPT included 11 (42%) in its responses, while Google included 8 (31%). OBJECTIVE Our study investigated the adherence of ChatGPT-3.5 and Google Assistant to the guidelines of the AAPOS for patient education on amblyopia. METHODS ChatGPT-3.5 was used. The four questions taken from the AAPOS website, specifically its glossary section for amblyopia, are as follows: (1) What is amblyopia? (2) What causes amblyopia? (3) How is amblyopia treated? (4) What happens if amblyopia is untreated? Approved and selected by ophthalmologists (GW and DL), the keywords from AAPOS were words or phrases deemed significant for the education of patients with amblyopia. The "Flesch-Kincaid Grade Level" formula, approved by the US Department of Education, was used to evaluate the reading comprehension level of the responses from ChatGPT, Google Assistant, and AAPOS. RESULTS In their responses, ChatGPT did not mention the term "ophthalmologist," whereas Google Assistant and AAPOS mentioned the term once and twice, respectively. ChatGPT did, however, use the term "eye doctors" once. According to the Flesch-Kincaid test, the average reading level of AAPOS was 11.4 (SD 2.1; the lowest level), while that of Google was 13.1 (SD 4.8; the highest required reading level), also showing the greatest variation in grade level across its responses. ChatGPT's answers, on average, scored at the 12.4 (SD 1.1) grade level; all three sources were similar in reading difficulty. For the keywords, out of the 4 responses, ChatGPT used 42% (11/26) of the keywords, whereas Google Assistant used 31% (8/26). CONCLUSIONS ChatGPT trains on texts and phrases and generates new sentences, while Google Assistant automatically copies website links. As ophthalmologists, we should consider including "see an ophthalmologist" on our websites and journals. While ChatGPT is here to stay, we, as physicians, need to monitor its answers.
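The Flesch-Kincaid Grade Level used in this study (and in the WebMD comparison in entry 4) is a fixed formula over sentence length and syllable density: FKGL = 0.39 (words per sentence) + 11.8 (syllables per word) - 15.59. A minimal sketch of that calculation follows; the syllable counter is a crude vowel-group heuristic used here only as a stand-in, not the validated counter a readability tool would use.

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count groups of consecutive vowels; real tools use dictionary-based counts.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    # FKGL = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / len(sentences)) + 11.8 * (syllables / len(words)) - 15.59

sample = "Amblyopia is reduced vision in one eye caused by abnormal visual development."
print(round(flesch_kincaid_grade(sample), 1))
```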
Affiliation(s)
- Gloria Wu: University of California, San Francisco School of Medicine, San Francisco, CA, United States
- David A Lee: McGovern Medical School, University of Texas Health Science Center at Houston, Houston, TX, United States
- Weichen Zhao: College of Biological Sciences, University of California, Davis, Davis, CA, United States
- Adrial Wong: College of Biological Sciences, University of California, Davis, Davis, CA, United States
- Rohan Jhangiani: Department of Computational Media, University of California, Santa Cruz, Santa Cruz, CA, United States
- Sri Kurniawan: Department of Computational Media, University of California, Santa Cruz, Santa Cruz, CA, United States
7. Takahashi H, Shikino K, Kondo T, Komori A, Yamada Y, Saita M, Naito T. Educational Utility of Clinical Vignettes Generated in Japanese by ChatGPT-4: Mixed Methods Study. JMIR Med Educ 2024; 10:e59133. [PMID: 39137031] [DOI: 10.2196/59133]
Abstract
BACKGROUND Evaluating the accuracy and educational utility of artificial intelligence-generated medical cases, especially those produced by large language models such as ChatGPT-4 (developed by OpenAI), is crucial yet underexplored. OBJECTIVE This study aimed to assess the educational utility of ChatGPT-4-generated clinical vignettes and their applicability in educational settings. METHODS Using a convergent mixed methods design, a web-based survey was conducted from January 8 to 28, 2024, to evaluate 18 medical cases generated by ChatGPT-4 in Japanese. In the survey, 6 main question items were used to evaluate the quality of the generated clinical vignettes and their educational utility, which are information quality, information accuracy, educational usefulness, clinical match, terminology accuracy (TA), and diagnosis difficulty. Feedback was solicited from physicians specializing in general internal medicine or general medicine and experienced in medical education. Chi-square and Mann-Whitney U tests were performed to identify differences among cases, and linear regression was used to examine trends associated with physicians' experience. Thematic analysis of qualitative feedback was performed to identify areas for improvement and confirm the educational utility of the cases. RESULTS Of the 73 invited participants, 71 (97%) responded. The respondents, primarily male (64/71, 90%), spanned a broad range of practice years (from 1976 to 2017) and represented diverse hospital sizes throughout Japan. The majority deemed the information quality (mean 0.77, 95% CI 0.75-0.79) and information accuracy (mean 0.68, 95% CI 0.65-0.71) to be satisfactory, with these responses being based on binary data. The average scores assigned were 3.55 (95% CI 3.49-3.60) for educational usefulness, 3.70 (95% CI 3.65-3.75) for clinical match, 3.49 (95% CI 3.44-3.55) for TA, and 2.34 (95% CI 2.28-2.40) for diagnosis difficulty, based on a 5-point Likert scale. Statistical analysis showed significant variability in content quality and relevance across the cases (P<.001 after Bonferroni correction). Participants suggested improvements in generating physical findings, using natural language, and enhancing medical TA. The thematic analysis highlighted the need for clearer documentation, clinical information consistency, content relevance, and patient-centered case presentations. CONCLUSIONS ChatGPT-4-generated medical cases written in Japanese possess considerable potential as resources in medical education, with recognized adequacy in quality and accuracy. Nevertheless, there is a notable need for enhancements in the precision and realism of case details. This study emphasizes ChatGPT-4's value as an adjunctive educational tool in the medical field, requiring expert oversight for optimal application.
Affiliation(s)
- Hiromizu Takahashi: Department of General Medicine, Juntendo University Faculty of Medicine, Tokyo, Japan
- Kiyoshi Shikino: Department of Community-Oriented Medical Education, Chiba University Graduate School of Medicine, Chiba, Japan
- Takeshi Kondo: Center for Postgraduate Clinical Training and Career Development, Nagoya University Hospital, Aichi, Japan
- Akira Komori: Department of General Medicine, Juntendo University Faculty of Medicine, Tokyo, Japan; Department of Emergency and Critical Care Medicine, Tsukuba Memorial Hospital, Tsukuba, Japan
- Yuji Yamada: Brookdale Department of Geriatrics and Palliative Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, United States
- Mizue Saita: Department of General Medicine, Juntendo University Faculty of Medicine, Tokyo, Japan
- Toshio Naito: Department of General Medicine, Juntendo University Faculty of Medicine, Tokyo, Japan
8. Fatima A, Shafique MA, Alam K, Fadlalla Ahmed TK, Mustafa MS. ChatGPT in medicine: A cross-disciplinary systematic review of ChatGPT's (artificial intelligence) role in research, clinical practice, education, and patient interaction. Medicine (Baltimore) 2024; 103:e39250. [PMID: 39121303] [PMCID: PMC11315549] [DOI: 10.1097/md.0000000000039250]
Abstract
BACKGROUND ChatGPT, a powerful AI language model, has gained increasing prominence in medicine, offering potential applications in healthcare, clinical decision support, patient communication, and medical research. This systematic review aims to comprehensively assess the applications of ChatGPT in healthcare education, research, writing, patient communication, and practice, while also delineating potential limitations and areas for improvement. METHODS Our comprehensive database search retrieved relevant papers from PubMed, Medline, and Scopus. After the screening process, 83 studies met the inclusion criteria. This review includes original studies comprising case reports, analytical studies, and editorials with original findings. RESULTS ChatGPT is useful for scientific research and academic writing, and assists with grammar, clarity, and coherence. This helps non-English speakers and improves accessibility by breaking down linguistic barriers. However, its limitations include probable inaccuracy and ethical issues, such as bias and plagiarism. ChatGPT streamlines workflows and offers diagnostic and educational potential in healthcare but exhibits biases and lacks emotional sensitivity. It is useful in patient communication, but requires up-to-date data and faces concerns about the accuracy of information and hallucinatory responses. CONCLUSION Given the potential for ChatGPT to transform healthcare education, research, and practice, it is essential to approach its adoption in these areas with caution due to its inherent limitations.
Affiliation(s)
- Afia Fatima: Department of Medicine, Jinnah Sindh Medical University, Karachi, Pakistan
- Khadija Alam: Department of Medicine, Liaquat National Medical College, Karachi, Pakistan
9. Zhang X, Zhang D, Zhang X, Zhang X. Artificial intelligence applications in the diagnosis and treatment of bacterial infections. Front Microbiol 2024; 15:1449844. [PMID: 39165576] [PMCID: PMC11334354] [DOI: 10.3389/fmicb.2024.1449844]
Abstract
The diagnosis and treatment of bacterial infections in the medical and public health field in the 21st century remain significantly challenging. Artificial intelligence (AI) has emerged as a powerful new tool in diagnosing and treating bacterial infections. AI is rapidly revolutionizing epidemiological studies of infectious diseases, providing effective early warning, prevention, and control of outbreaks. Machine learning models provide a highly flexible way to simulate and predict the complex mechanisms of pathogen-host interactions, which is crucial for a comprehensive understanding of the nature of diseases. Machine learning-based pathogen identification technology and antimicrobial drug susceptibility testing break through the limitations of traditional methods, significantly shorten the time from sample collection to the determination of results, and greatly improve the speed and accuracy of laboratory testing. In addition, the application of AI technology in treating bacterial infections, particularly in the research and development of drugs and vaccines and the application of innovative therapies such as bacteriophage therapy, provides new strategies for improving therapy and curbing bacterial resistance. Although AI has broad application prospects in diagnosing and treating bacterial infections, significant challenges remain in data quality and quantity, model interpretability, clinical integration, and patient privacy protection. To overcome these challenges and realize widespread application in clinical practice, interdisciplinary cooperation, technological innovation, and policy support are essential. In summary, with continuous advancements and the in-depth application of AI technology, AI will enable doctors to more effectively address the challenge of bacterial infection, promoting the development of medical practice toward precision, efficiency, and personalization; optimizing the best nursing and treatment plans for patients; and providing strong support for public health safety.
Affiliation(s)
- Xiaoyu Zhang: First Department of Infectious Diseases, The First Affiliated Hospital of China Medical University, Shenyang, China
- Deng Zhang: Department of Infectious Diseases, The First Affiliated Hospital of Xiamen University, Xiamen, China
- Xifan Zhang: First Department of Infectious Diseases, The First Affiliated Hospital of China Medical University, Shenyang, China
- Xin Zhang: First Department of Infectious Diseases, The First Affiliated Hospital of China Medical University, Shenyang, China
10. Patel MA, Villalobos F, Shan K, Tardo LM, Horton LA, Sguigna PV, Blackburn KM, Munoz SB, Moog TM, Smith AD, Burgess KW, McCreary M, Okuda DT. Generative artificial intelligence versus clinicians: Who diagnoses multiple sclerosis faster and with greater accuracy? Mult Scler Relat Disord 2024; 90:105791. [PMID: 39146892] [DOI: 10.1016/j.msard.2024.105791]
Abstract
BACKGROUND Those receiving the diagnosis of multiple sclerosis (MS) over the next ten years will predominantly be part of Generation Z (Gen Z). Recent observations within our clinic suggest that younger people with MS utilize online generative artificial intelligence (AI) platforms for personalized medical advice prior to their first visit with a specialist in neuroimmunology. The use of such platforms is anticipated to increase given the technology driven nature, desire for instant communication, and cost-conscious nature of Gen Z. Our objective was to determine if ChatGPT (Generative Pre-trained Transformer) could diagnose MS in individuals earlier than their clinical timeline, and to assess if the accuracy differed based on age, sex, and race/ethnicity. METHODS People with MS between 18 and 59 years of age were studied. The clinical timeline for people diagnosed with MS was retrospectively identified and simulated using ChatGPT-3.5 (GPT-3.5). Chats were conducted using both actual and derivatives of their age, sex, and race/ethnicity to test diagnostic accuracy. A Kaplan-Meier survival curve was estimated for time to diagnosis, clustered by subject. The p-value testing for differences in time to diagnosis was accomplished using a general Wilcoxon test. Logistic regression (subject-specific intercept) was used to capture intra-subject correlation to test the accuracy prior to and after the inclusion of MRI data. RESULTS The study cohort included 100 unique people with MS. Of those, 50 were members of Gen Z (38 female; 22 White; mean age at first symptom was 20.6 years (y) (standard deviation (SD)=2.2y)), and 50 were non-Gen Z (34 female; 27 White; mean age at first symptom was 37.0y (SD=10.4y)). In addition, a total of 529 people that represented digital simulations of the original cohort of 100 people (333 female; 166 White; 136 Black/African American; 107 Asian; 120 Hispanic, mean age at first symptom was 31.6y (SD=12.4y)) were generated allowing for 629 scripted conversations to be analyzed. The estimated median time to diagnosis in clinic was significantly longer at 0.35y (95% CI=[0.28, 0.48]) versus that by ChatGPT at 0.08y (95% CI=[0.04, 0.24]) (p<0.0001). There was no difference in the diagnostic accuracy between ages and by race/ethnicity prior to the inclusion of MRI data. However, prior to including the MRI data, males had a 47% less likely chance of a correct diagnosis relative to females (p=0.05). Post-MRI data inclusion within GPT-3.5, the odds of an accurate diagnosis was 4.0-fold greater for Gen Z participants, relative to non-Gen Z participants (p=0.01) with the diagnostic accuracy being 68% less in males relative to females (p=0.009), and 75% less for White subjects, relative to non-White subjects (p=0.0004). CONCLUSION Although generative AI platforms enable rapid information access and are not principally designed for use in healthcare, an increase in use by Gen Z is anticipated. However, the obtained responses may not be generalizable to all users and bias may exist in select groups.
Affiliation(s)
- Mahi A Patel, Francisco Villalobos, Lauren M Tardo, Lindsay A Horton, Peter V Sguigna, Kyle M Blackburn, Shanan B Munoz, Tatum M Moog, Katy W Burgess, Morgan McCreary, Darin T Okuda: The University of Texas Southwestern Medical Center, Department of Neurology, Neuroinnovation Program, Multiple Sclerosis & Neuroimmunology Imaging Program, Dallas, TX, USA; The University of Texas Southwestern Medical Center, Peter O'Donnell Jr. Brain Institute, Dallas, TX, USA
- Kevin Shan: The University of Texas Southwestern Medical Center, School of Medicine, Dallas, TX, USA
- Alexander D Smith: Texas Tech University Health Sciences Center, School of Medicine, Lubbock, TX, USA
11. Gleber C, Fear K. Diagnostic reasoning in the age of artificial intelligence: Synergy or opposition? J Hosp Med 2024; 19:749-752. [PMID: 38340350] [DOI: 10.1002/jhm.13295]
Affiliation(s)
- Conrad Gleber: University of Rochester Medical Center, Rochester, New York, USA
- Kathleen Fear: UR Health Lab, University of Rochester Medical Center, Rochester, New York, USA
12. Palenzuela DL, Mullen JT, Phitayakorn R. AI Versus MD: Evaluating the surgical decision-making accuracy of ChatGPT-4. Surgery 2024; 176:241-245. [PMID: 38769038] [DOI: 10.1016/j.surg.2024.04.003]
Abstract
BACKGROUND ChatGPT-4 is a large language model with possible applications to surgery education. The aim of this study was to investigate the accuracy of ChatGPT-4's surgical decision-making compared with general surgery residents and attending surgeons. METHODS Five clinical scenarios were created from actual patient data based on common general surgery diagnoses. Scripts were developed to sequentially provide clinical information and ask decision-making questions. Responses to the prompts were scored based on a standardized rubric for a total of 50 points. Each clinical scenario was run through ChatGPT-4 and sent electronically to all general surgery residents and attendings at a single institution. Scores were compared using Wilcoxon rank sum tests. RESULTS On average, ChatGPT-4 scored 39.6 points (79.2%, standard deviation ± 0.89 points). A total of five junior residents, 12 senior residents, and five attendings completed the clinical scenarios (resident response rate = 15.9%; attending response rate = 13.8%). On average, the junior residents scored a total of 33.4 (66.8%, standard deviation ± 3.29), senior residents 38.0 (76.0%, standard deviation ± 4.75), and attendings 38.8 (77.6%, standard deviation ± 5.45). ChatGPT-4 scored significantly better than junior residents (P = .009) but was not significantly different from senior residents or attendings. ChatGPT-4 was significantly better than junior residents at identifying the correct operation to perform (P = .0182) and recommending additional workup for postoperative complications (P = .012). CONCLUSION ChatGPT-4 performed better than junior residents and equivalently to senior residents and attendings when faced with surgical patient scenarios. Large language models, such as ChatGPT, may have the potential to be an educational resource for junior residents to develop surgical decision-making skills.
Affiliation(s)
- Roy Phitayakorn: Massachusetts General Hospital, Boston, MA. https://www.twitter.com/RoyPhit
13. Mizuta K, Hirosawa T, Harada Y, Shimizu T. Can ChatGPT-4 evaluate whether a differential diagnosis list contains the correct diagnosis as accurately as a physician? Diagnosis (Berl) 2024; 11:321-324. [PMID: 38465399] [DOI: 10.1515/dx-2024-0027]
Abstract
OBJECTIVES The potential of artificial intelligence (AI) chatbots, particularly the fourth-generation chat generative pretrained transformer (ChatGPT-4), in assisting with medical diagnosis is an emerging research area. While there has been significant emphasis on creating lists of differential diagnoses, it is not yet clear how well AI chatbots can evaluate whether the final diagnosis is included in these lists. This short communication aimed to assess the accuracy of ChatGPT-4 in evaluating lists of differential diagnosis compared to medical professionals' assessments. METHODS We used ChatGPT-4 to evaluate whether the final diagnosis was included in the top 10 differential diagnosis lists created by physicians, ChatGPT-3, and ChatGPT-4, using clinical vignettes. Eighty-two clinical vignettes were used, comprising 52 complex case reports published by the authors from the department and 30 mock cases of common diseases created by physicians from the same department. We compared the agreement between ChatGPT-4 and the physicians on whether the final diagnosis was included in the top 10 differential diagnosis lists using the kappa coefficient. RESULTS Three sets of differential diagnoses were evaluated for each of the 82 cases, resulting in a total of 246 lists. The agreement rate between ChatGPT-4 and physicians was 236 out of 246 (95.9 %), with a kappa coefficient of 0.86, indicating very good agreement. CONCLUSIONS ChatGPT-4 demonstrated very good agreement with physicians in evaluating whether the final diagnosis should be included in the differential diagnosis lists.
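The agreement reported here (raw agreement of 236/246 and a kappa of 0.86) is consistent with Cohen's kappa, kappa = (p_o - p_e) / (1 - p_e), which discounts the chance agreement p_e estimated from each rater's marginal label frequencies. The sketch below illustrates that computation for binary include/exclude judgments; the example counts are hypothetical, since the study's full contingency table is not reproduced in the abstract.

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning categorical labels to the same items."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n  # p_o
    labels = set(rater_a) | set(rater_b)
    # Chance agreement p_e from each rater's marginal label frequencies.
    expected = sum((rater_a.count(lab) / n) * (rater_b.count(lab) / n) for lab in labels)
    return (observed - expected) / (1 - expected)

# Hypothetical example: physician vs ChatGPT-4 judgments ("final diagnosis included?") on 10 lists.
physician = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
chatgpt4  = [1, 1, 1, 0, 1, 0, 1, 0, 0, 1]
print(round(cohens_kappa(physician, chatgpt4), 2))
```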
Affiliation(s)
- Kazuya Mizuta: Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga-gun, Japan
- Takanobu Hirosawa: Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga-gun, Japan
- Yukinori Harada: Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga-gun, Japan
- Taro Shimizu: Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga-gun, Japan
14. Petrella RJ. The AI Future of Emergency Medicine. Ann Emerg Med 2024; 84:139-153. [PMID: 38795081] [DOI: 10.1016/j.annemergmed.2024.01.031]
Abstract
In the coming years, artificial intelligence (AI) and machine learning will likely give rise to profound changes in the field of emergency medicine, and medicine more broadly. This article discusses these anticipated changes in terms of 3 overlapping yet distinct stages of AI development. It reviews some fundamental concepts in AI and explores their relation to clinical practice, with a focus on emergency medicine. In addition, it describes some of the applications of AI in disease diagnosis, prognosis, and treatment, as well as some of the practical issues that they raise, the barriers to their implementation, and some of the legal and regulatory challenges they create.
Affiliation(s)
- Robert J Petrella: Emergency Departments, CharterCARE Health Partners, Providence and North Providence, RI; Emergency Department, Boston VA Medical Center, Boston, MA; Emergency Departments, Steward Health Care System, Boston and Methuen, MA; Harvard Medical School, Boston, MA; Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA; Department of Medicine, Brigham and Women's Hospital, Boston, MA
15. García-Méndez S, de Arriba-Pérez F. Large Language Models and Healthcare Alliance: Potential and Challenges of Two Representative Use Cases. Ann Biomed Eng 2024; 52:1928-1931. [PMID: 38310159] [DOI: 10.1007/s10439-024-03454-8]
Abstract
Large language models (LLMs) are emerging as the most promising natural language processing approach for accelerating clinical practice (i.e., diagnosis, prevention, and treatment procedures). Similarly, intelligent conversational systems that leverage LLMs have disruptively become the future of therapy in the era of ChatGPT. Accordingly, this research addresses the application of LLMs in healthcare, paying particular attention to two relevant use cases: cognitive decline and depression, more specifically postpartum depression. In the end, the most promising opportunities they represent (e.g., clinical task augmentation, personalized healthcare, etc.) and related concerns (e.g., data privacy and quality, fairness, etc.) are discussed to contribute to the global debate on their integration into the healthcare system.
16. Scott IA, Miller T, Crock C. Using conversant artificial intelligence to improve diagnostic reasoning: ready for prime time? Med J Aust 2024. [PMID: 39086025] [DOI: 10.5694/mja2.52401]
Affiliation(s)
- Ian A Scott: University of Queensland, Brisbane, QLD; Princess Alexandra Hospital, Brisbane, QLD
- Carmel Crock: Royal Victorian Eye and Ear Hospital, Melbourne, VIC
17. Kim P, Seo B, De Silva H. Concordance of clinician, Chat-GPT4, and ORAD diagnoses against histopathology in Odontogenic Keratocysts and tumours: a 15-Year New Zealand retrospective study. Oral Maxillofac Surg 2024. [PMID: 39060850] [DOI: 10.1007/s10006-024-01284-5]
Abstract
BACKGROUND This research aimed to investigate the concordance between clinical impressions and histopathologic diagnoses made by clinicians and artificial intelligence tools for odontogenic keratocysts (OKC) and odontogenic tumours (OT) in a New Zealand population from 2008 to 2023. METHODS Histopathological records from the Oral Pathology Centre, University of Otago (2008-2023) were examined to identify OKCs and OTs. Specimen referral details, histopathologic reports, and clinician differential diagnoses, as well as those provided by ORAD and Chat-GPT4, were documented. Data were analyzed using SPSS, and concordance between provisional and histopathologic diagnoses was ascertained. RESULTS Of the 34,225 biopsies, 302 and 321 samples were identified as OTs and OKCs, respectively. Concordance rates were 43.2% for clinicians, 45.6% for ORAD, and 41.4% for Chat-GPT4. The corresponding kappa values against the histological diagnosis were 0.23, 0.13, and 0.14. Surgeons achieved a higher concordance rate (47.7%) than non-surgeons (29.82%). The odds ratios of a concordant diagnosis using Chat-GPT4 and ORAD were between 1.4 and 2.8 (p < 0.05). ROC-AUC and PR-AUC values were similar between the groups for ameloblastoma (clinician 0.62/0.42, ORAD 0.58/0.28, Chat-GPT4 0.63/0.37) and for OKC (clinician 0.64/0.78, ORAD 0.66/0.77, Chat-GPT4 0.60/0.71). CONCLUSION Clinicians with surgical training achieved a higher concordance rate for OT and OKC. Chat-GPT4 and the Bayesian approach (ORAD) have shown potential in enhancing diagnostic capabilities.
Affiliation(s)
- Paul Kim: Oral and Maxillofacial Surgery Registrar, Dunedin Hospital, Dunedin, New Zealand
- Benedict Seo: Department of Oral Diagnostic and Surgical Sciences, University of Otago, Dunedin, New Zealand
- Harsha De Silva: Department of Oral Diagnostic and Surgical Sciences, University of Otago, Dunedin, New Zealand
18. Burke HB, Hoang A, Lopreiato JO, King H, Hemmer P, Montgomery M, Gagarin V. Assessing the Ability of a Large Language Model to Score Free-Text Medical Student Clinical Notes: Quantitative Study. JMIR Med Educ 2024; 10:e56342. [PMID: 39118469] [PMCID: PMC11327632] [DOI: 10.2196/56342]
Abstract
Background Teaching medical students the skills required to acquire, interpret, apply, and communicate clinical information is an integral part of medical education. A crucial aspect of this process involves providing students with feedback regarding the quality of their free-text clinical notes. Objective The goal of this study was to assess the ability of ChatGPT 3.5, a large language model, to score medical students' free-text history and physical notes. Methods This is a single-institution, retrospective study. Standardized patients learned a prespecified clinical case and, acting as the patient, interacted with medical students. Each student wrote a free-text history and physical note of their interaction. The students' notes were scored independently by the standardized patients and ChatGPT using a prespecified scoring rubric that consisted of 85 case elements. The measure of accuracy was percent correct. Results The study population consisted of 168 first-year medical students. There were a total of 14,280 scores. The ChatGPT incorrect scoring rate was 1.0%, and the standardized patient incorrect scoring rate was 7.2%. The ChatGPT error rate was thus 86% lower than the standardized patient error rate. The ChatGPT mean incorrect scoring rate of 12 (SD 11) was significantly lower than the standardized patient mean incorrect scoring rate of 85 (SD 74; P=.002). Conclusions ChatGPT demonstrated a significantly lower error rate compared to standardized patients. This is the first study to assess the ability of a generative pretrained transformer (GPT) program to score medical students' standardized patient-based free-text clinical notes. It is expected that, in the near future, large language models will provide real-time feedback to practicing physicians regarding their free-text notes. GPT artificial intelligence programs represent an important advance in medical education and medical practice.
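The 86% figure follows from the two incorrect scoring rates quoted above as a relative reduction, (7.2 - 1.0) / 7.2 ≈ 0.86; the short check below makes that arithmetic explicit.

```python
chatgpt_error_rate = 0.010                 # 1.0% of ChatGPT scores were incorrect
standardized_patient_error_rate = 0.072    # 7.2% of standardized patient scores were incorrect

# Relative reduction in error rate: (7.2 - 1.0) / 7.2, i.e., roughly 86% lower.
relative_reduction = (standardized_patient_error_rate - chatgpt_error_rate) / standardized_patient_error_rate
print(f"{relative_reduction:.0%}")  # prints 86%
```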
Affiliation(s)
- Harry B Burke: Uniformed Services University of the Health Sciences, Bethesda, MD 20814, United States
- Albert Hoang: Uniformed Services University of the Health Sciences, Bethesda, MD 20814, United States
- Joseph O Lopreiato: Uniformed Services University of the Health Sciences, Bethesda, MD 20814, United States
- Heidi King: Defense Health Agency, Falls Church, VA, United States
- Paul Hemmer: Uniformed Services University of the Health Sciences, Bethesda, MD 20814, United States
- Michael Montgomery: Uniformed Services University of the Health Sciences, Bethesda, MD 20814, United States
- Viktoria Gagarin: Uniformed Services University of the Health Sciences, Bethesda, MD 20814, United States
19. Pesapane F, Cuocolo R, Sardanelli F. The Picasso's skepticism on computer science and the dawn of generative AI: questions after the answers to keep "machines-in-the-loop". Eur Radiol Exp 2024; 8:81. [PMID: 39046535] [PMCID: PMC11269548] [DOI: 10.1186/s41747-024-00485-7]
Abstract
Starting from Picasso's quote ("Computers are useless. They can only give you answers"), we discuss the introduction of generative artificial intelligence (AI), including generative adversarial networks (GANs) and transformer-based architectures such as large language models (LLMs) in radiology, where their potential in reporting, image synthesis, and analysis is notable. However, the need for improvements, evaluations, and regulations prior to clinical use is also clear. Integration of LLMs into clinical workflow needs cautiousness, to avoid or at least mitigate risks associated with false diagnostic suggestions. We highlight challenges in synthetic image generation, inherent biases in AI models, and privacy concerns, stressing the importance of diverse training datasets and robust data privacy measures. We examine the regulatory landscape, including the 2023 Executive Order on AI in the United States and the 2024 AI Act in the European Union, which set standards for AI applications in healthcare. This manuscript contributes to the field by emphasizing the necessity of maintaining the human element in medical procedures while leveraging generative AI, advocating for a "machines-in-the-loop" approach.
Affiliation(s)
- Filippo Pesapane: Breast Imaging Division, IEO European Institute of Oncology IRCCS, Milan, Italy
- Renato Cuocolo: Department of Medicine, Surgery and Dentistry, University of Salerno, Via Salvador Allende 43, Baronissi, 84081, Salerno, Italy
- Francesco Sardanelli: Unit of Radiology, IRCCS Policlinico San Donato, Via Morandi 30, San Donato Milanese, 20097, Milan, Italy; Lega Italiana Tumori (LILT) Milano Monza Brianza, Piazzale Gorini 22, 20133, Milan, Italy
20. Kämmer JE, Hautz WE, Krummrey G, Sauter TC, Penders D, Birrenbach T, Bienefeld N. Effects of interacting with a large language model compared with a human coach on the clinical diagnostic process and outcomes among fourth-year medical students: study protocol for a prospective, randomised experiment using patient vignettes. BMJ Open 2024; 14:e087469. [PMID: 39025818] [PMCID: PMC11261684] [DOI: 10.1136/bmjopen-2024-087469]
Abstract
INTRODUCTION Versatile large language models (LLMs) have the potential to augment diagnostic decision-making by assisting diagnosticians, thanks to their ability to engage in open-ended, natural conversations and their comprehensive knowledge access. Yet the novelty of LLMs in diagnostic decision-making introduces uncertainties regarding their impact. Clinicians unfamiliar with the use of LLMs in their professional context may rely on general attitudes towards LLMs more broadly, potentially hindering thoughtful use and critical evaluation of their input, leading to either over-reliance and lack of critical thinking or an unwillingness to use LLMs as diagnostic aids. To address these concerns, this study examines the influence on the diagnostic process and outcomes of interacting with an LLM compared with a human coach, and of prior training vs no training for interacting with either of these 'coaches'. Our findings aim to illuminate the potential benefits and risks of employing artificial intelligence (AI) in diagnostic decision-making. METHODS AND ANALYSIS We are conducting a prospective, randomised experiment with N=158 fourth-year medical students from Charité Medical School, Berlin, Germany. Participants are asked to diagnose patient vignettes after being assigned to either a human coach or ChatGPT and after either training or no training (both between-subject factors). We are specifically collecting data on the effects of using either of these 'coaches' and of additional training on information search, number of hypotheses entertained, diagnostic accuracy and confidence. Statistical methods will include linear mixed effects models. Exploratory analyses of the interaction patterns and attitudes towards AI will also generate more generalisable knowledge about the role of AI in medicine. ETHICS AND DISSEMINATION The Bern Cantonal Ethics Committee considered the study exempt from full ethical review (BASEC No: Req-2023-01396). All methods will be conducted in accordance with relevant guidelines and regulations. Participation is voluntary and informed consent will be obtained. Results will be published in peer-reviewed scientific medical journals. Authorship will be determined according to the International Committee of Medical Journal Editors guidelines.
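As a rough illustration of the planned statistical approach, the sketch below fits a linear mixed-effects model with coach type and training as between-subject factors and a random intercept per participant. It uses pandas and statsmodels; the data frame, column names, and accuracy values are invented for illustration and do not come from the study.

```python
# Hedged sketch (invented data, not the study's): a linear mixed-effects model
# of diagnostic accuracy with coach type and training as between-subject
# factors and a random intercept per participant.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "participant": list(range(1, 9)) * 3,                 # 8 participants x 3 vignettes
    "coach":    (["llm"] * 4 + ["human"] * 4) * 3,        # between-subject factor 1
    "training": (["yes", "yes", "no", "no"] * 2) * 3,     # between-subject factor 2
    "accuracy": [0.8, 0.9, 0.6, 0.7, 0.7, 0.8, 0.6, 0.6,
                 0.9, 0.8, 0.5, 0.6, 0.8, 0.9, 0.5, 0.7,
                 0.7, 1.0, 0.7, 0.8, 0.6, 0.7, 0.6, 0.5],
})

# The random intercept per participant accounts for repeated vignettes per person.
model = smf.mixedlm("accuracy ~ coach * training", df, groups=df["participant"])
result = model.fit()
print(result.summary())
```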
Collapse
Affiliation(s)
- Juliane E Kämmer
- Department of Emergency Medicine, Inselspital University Hospital Bern, University of Bern, Bern, Switzerland
| | - Wolf E Hautz
- Department of Emergency Medicine, Inselspital University Hospital Bern, University of Bern, Bern, Switzerland
| | - Gert Krummrey
- Institute for Medical Informatics (I4MI), Bern University of Applied Sciences, Bern, Switzerland
| | - Thomas C Sauter
- Department of Emergency Medicine, Inselspital University Hospital Bern, University of Bern, Bern, Switzerland
| | - Dorothea Penders
- Department of Anesthesiology and Operative Intensive Care Medicine CCM & CVK, Charité Universitätsmedizin Berlin, Berlin, Germany
- Lernzentrum (Skills Lab), Charité Universitätsmedizin Berlin, Berlin, Germany
| | - Tanja Birrenbach
- Department of Emergency Medicine, Inselspital University Hospital Bern, University of Bern, Bern, Switzerland
| | - Nadine Bienefeld
- Department of Management, Technology, and Economics, ETH Zurich, Zurich, Switzerland
| |
Collapse
|
21
|
Wada A, Akashi T, Shih G, Hagiwara A, Nishizawa M, Hayakawa Y, Kikuta J, Shimoji K, Sano K, Kamagata K, Nakanishi A, Aoki S. Optimizing GPT-4 Turbo Diagnostic Accuracy in Neuroradiology through Prompt Engineering and Confidence Thresholds. Diagnostics (Basel) 2024; 14:1541. [PMID: 39061677 PMCID: PMC11276551 DOI: 10.3390/diagnostics14141541] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/29/2024] [Revised: 07/02/2024] [Accepted: 07/10/2024] [Indexed: 07/28/2024] Open
Abstract
BACKGROUND AND OBJECTIVES Integrating large language models (LLMs) such as GPT-4 Turbo into diagnostic imaging faces a significant challenge, with current misdiagnosis rates ranging from 30% to 50%. This study evaluates how prompt engineering and confidence thresholds can improve diagnostic accuracy in neuroradiology. METHODS We analyze 751 neuroradiology cases from the American Journal of Neuroradiology using GPT-4 Turbo with customized prompts to improve diagnostic precision. RESULTS Initially, GPT-4 Turbo achieved a baseline diagnostic accuracy of 55.1%. By reformatting responses to list five diagnostic candidates and applying a 90% confidence threshold, the precision of the top diagnosis increased to 72.9%, and the candidate list contained the correct diagnosis in 85.9% of cases, reducing the misdiagnosis rate to 14.1%. However, this threshold also reduced the number of cases for which the model returned a response. CONCLUSIONS Strategic prompt engineering and high confidence thresholds significantly reduce misdiagnoses and improve the precision of LLM-based diagnosis in neuroradiology. More research is needed to optimize these approaches for broader clinical implementation, balancing accuracy and utility.
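The confidence-threshold step described above can be illustrated with a small Python sketch: keep only cases where the model reports at least 90% confidence, then compute top-1 precision and candidate-list (top-5) accuracy among the answered cases. The case records and field names below are hypothetical, not the study's data or prompts.

```python
# Illustrative sketch (not the authors' code): applying a confidence threshold
# to model-generated diagnostic candidate lists. Each case is assumed to carry
# five ranked candidates and a self-reported confidence score.

cases = [
    {"top_candidate": "glioblastoma", "confidence": 0.95,
     "candidates": ["glioblastoma", "metastasis", "lymphoma", "abscess", "tumefactive MS"],
     "reference": "glioblastoma"},
    {"top_candidate": "meningioma", "confidence": 0.60,
     "candidates": ["meningioma", "schwannoma", "metastasis", "hemangiopericytoma", "lymphoma"],
     "reference": "solitary fibrous tumor"},
]

THRESHOLD = 0.90  # only keep cases where the model reports >= 90% confidence

answered = [c for c in cases if c["confidence"] >= THRESHOLD]
top1_correct = sum(c["top_candidate"] == c["reference"] for c in answered)
top5_correct = sum(c["reference"] in c["candidates"] for c in answered)

if answered:
    print(f"Answered cases: {len(answered)} of {len(cases)}")
    print(f"Top-1 precision among answered: {top1_correct / len(answered):.1%}")
    print(f"Top-5 (candidate list) accuracy: {top5_correct / len(answered):.1%}")
```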
Collapse
Affiliation(s)
- Akihiko Wada
- Department of Radiology, Juntendo University Graduate School of Medicine, Tokyo 113-8421, Japan
| | - Toshiaki Akashi
- Department of Radiology, Juntendo University Graduate School of Medicine, Tokyo 113-8421, Japan
| | - George Shih
- Clinical Radiology, Weill Cornell Medical College, New York, NY 10065, USA
| | - Akifumi Hagiwara
- Department of Radiology, Juntendo University Graduate School of Medicine, Tokyo 113-8421, Japan
| | - Mitsuo Nishizawa
- Department of Radiology, Juntendo University Graduate School of Medicine, Tokyo 113-8421, Japan
| | - Yayoi Hayakawa
- Department of Radiology, Juntendo University Graduate School of Medicine, Tokyo 113-8421, Japan
| | - Junko Kikuta
- Department of Radiology, Juntendo University Graduate School of Medicine, Tokyo 113-8421, Japan
| | - Keigo Shimoji
- Department of Radiology, Juntendo University Graduate School of Medicine, Tokyo 113-8421, Japan
| | - Katsuhiro Sano
- Department of Radiology, Juntendo University Graduate School of Medicine, Tokyo 113-8421, Japan
| | - Koji Kamagata
- Department of Radiology, Juntendo University Graduate School of Medicine, Tokyo 113-8421, Japan
| | - Atsushi Nakanishi
- Department of Radiology, Juntendo University Graduate School of Medicine, Tokyo 113-8421, Japan
| | - Shigeki Aoki
- Department of Radiology, Juntendo University Graduate School of Medicine, Tokyo 113-8421, Japan
| |
Collapse
|
22
|
Yazaki M, Maki S, Furuya T, Inoue K, Nagai K, Nagashima Y, Maruyama J, Toki Y, Kitagawa K, Iwata S, Kitamura T, Gushiken S, Noguchi Y, Inoue M, Shiga Y, Inage K, Orita S, Nakada T, Ohtori S. Emergency Patient Triage Improvement through a Retrieval-Augmented Generation Enhanced Large-Scale Language Model. PREHOSP EMERG CARE 2024:1-7. [PMID: 38950135 DOI: 10.1080/10903127.2024.2374400] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/09/2024] [Accepted: 06/17/2024] [Indexed: 07/03/2024]
Abstract
OBJECTIVES Emergency medical triage is crucial for prioritizing patient care in emergency situations, yet its effectiveness can vary significantly based on the experience and training of the personnel involved. This study aims to evaluate the efficacy of integrating Retrieval Augmented Generation (RAG) with Large Language Models (LLMs), specifically OpenAI's GPT models, to standardize triage procedures and reduce variability in emergency care. METHODS We created 100 simulated triage scenarios based on modified cases from the Japanese National Examination for Emergency Medical Technicians. These scenarios were processed by the RAG-enhanced LLMs, and the models were given patient vital signs, symptoms, and observations from emergency medical services (EMS) teams as inputs. The primary outcome was the accuracy of triage classifications, which was used to compare the performance of the RAG-enhanced LLMs with that of emergency medical technicians and emergency physicians. Secondary outcomes included the rates of under-triage and over-triage. RESULTS The Generative Pre-trained Transformer 3.5 (GPT-3.5) with RAG model achieved a correct triage rate of 70%, significantly outperforming Emergency Medical Technicians (EMTs) with 35% and 38% correct rates, and emergency physicians with 50% and 47% correct rates (p < 0.05). Additionally, this model demonstrated a substantial reduction in under-triage rates to 8%, compared with 33% for GPT-3.5 without RAG, and 39% for GPT-4 without RAG. CONCLUSIONS The integration of RAG with LLMs shows promise in improving the accuracy and consistency of medical assessments in emergency settings. Further validation in diverse medical settings with broader datasets is necessary to confirm the effectiveness and adaptability of these technologies in live environments.
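A minimal sketch of the outcome metrics named above (correct triage, under-triage, over-triage) is shown below in Python; the triage levels are made up and encoded so that a smaller number means higher urgency, which may differ from the coding the authors used.

```python
# Minimal sketch (assumed data, not the study's code): scoring triage outputs
# against reference levels to obtain correct-, under-, and over-triage rates.
# Levels are encoded so that a smaller number means higher urgency.

reference = [1, 2, 3, 2, 1, 4, 3, 2]   # gold-standard triage levels
predicted = [1, 3, 3, 2, 2, 4, 2, 2]   # model- or rater-assigned levels

n = len(reference)
correct = sum(p == r for p, r in zip(predicted, reference))
under   = sum(p > r for p, r in zip(predicted, reference))   # judged less urgent than it was
over    = sum(p < r for p, r in zip(predicted, reference))   # judged more urgent than it was

print(f"Correct triage: {correct / n:.0%}")
print(f"Under-triage:  {under / n:.0%}")
print(f"Over-triage:   {over / n:.0%}")
```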
Collapse
Affiliation(s)
- Megumi Yazaki
- Department of Orthopaedic Surgery, Graduate School of Medicine, Chiba University, Chiba, Japan
- Tertiary Emergency Medical Center, Tokyo Metropolitan Bokutoh Hospital, Tokyo, Japan
- Department of Emergency and Critical Care Medicine, Chiba University, Chiba, Japan
| | - Satoshi Maki
- Department of Orthopaedic Surgery, Graduate School of Medicine, Chiba University, Chiba, Japan
- Center for Frontier Medical Engineering, Chiba University, Chiba, Japan
| | - Takeo Furuya
- Department of Orthopaedic Surgery, Graduate School of Medicine, Chiba University, Chiba, Japan
| | - Ken Inoue
- Tertiary Emergency Medical Center, Tokyo Metropolitan Bokutoh Hospital, Tokyo, Japan
| | - Ko Nagai
- Tertiary Emergency Medical Center, Tokyo Metropolitan Bokutoh Hospital, Tokyo, Japan
| | - Yuki Nagashima
- Department of Orthopaedic Surgery, Graduate School of Medicine, Chiba University, Chiba, Japan
| | - Juntaro Maruyama
- Department of Orthopaedic Surgery, Graduate School of Medicine, Chiba University, Chiba, Japan
| | - Yasunori Toki
- Department of Orthopaedic Surgery, Graduate School of Medicine, Chiba University, Chiba, Japan
| | - Kyota Kitagawa
- Department of Orthopaedic Surgery, Graduate School of Medicine, Chiba University, Chiba, Japan
| | - Shuhei Iwata
- Department of Orthopaedic Surgery, Graduate School of Medicine, Chiba University, Chiba, Japan
| | - Takaki Kitamura
- Department of Orthopaedic Surgery, Graduate School of Medicine, Chiba University, Chiba, Japan
| | - Sho Gushiken
- Department of Orthopaedic Surgery, Graduate School of Medicine, Chiba University, Chiba, Japan
| | - Yuji Noguchi
- Department of Orthopaedic Surgery, Graduate School of Medicine, Chiba University, Chiba, Japan
| | - Masahiro Inoue
- Department of Orthopaedic Surgery, Graduate School of Medicine, Chiba University, Chiba, Japan
| | - Yasuhiro Shiga
- Department of Orthopaedic Surgery, Graduate School of Medicine, Chiba University, Chiba, Japan
| | - Kazuhide Inage
- Department of Orthopaedic Surgery, Graduate School of Medicine, Chiba University, Chiba, Japan
| | - Sumihisa Orita
- Department of Orthopaedic Surgery, Graduate School of Medicine, Chiba University, Chiba, Japan
- Center for Frontier Medical Engineering, Chiba University, Chiba, Japan
| | - Takaaki Nakada
- Department of Emergency and Critical Care Medicine, Chiba University, Chiba, Japan
| | - Seiji Ohtori
- Department of Orthopaedic Surgery, Graduate School of Medicine, Chiba University, Chiba, Japan
| |
Collapse
|
23
|
Sheerah HA, AlSalamah S, Alsalamah SA, Lu CT, Arafa A, Zaatari E, Alhomod A, Pujari S, Labrique A. The Rise of Virtual Health Care: Transforming the Health Care Landscape in the Kingdom of Saudi Arabia: A Review Article. Telemed J E Health 2024. [PMID: 38984415 DOI: 10.1089/tmj.2024.0114] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/11/2024] Open
Abstract
BACKGROUND The rise of virtual healthcare underscores the transformative influence of digital technologies in reshaping the healthcare landscape. As technology advances and the global demand for accessible and convenient healthcare services escalates, the virtual healthcare sector is gaining unprecedented momentum. Saudi Arabia, with its ambitious Vision 2030 initiative, is actively embracing digital innovation in the healthcare sector. METHODS In this narrative review, we discussed the key drivers and prospects of virtual healthcare in Saudi Arabia, highlighting its potential to enhance healthcare accessibility, quality, and patient outcomes. We also summarized the role of the COVID-19 pandemic in the digital transformation of healthcare in the country. Healthcare services provided by Seha Virtual Hospital in Saudi Arabia, the world's largest and Middle East's first virtual hospital, were also described. Finally, we proposed a roadmap for the future development of virtual health in the country. RESULTS AND CONCLUSIONS The integration of virtual healthcare into the existing healthcare system can enhance patient experiences, improve outcomes, and contribute to the overall well-being of the population. However, careful planning, collaboration, and investment are essential to overcome the challenges and ensure the successful implementation and sustainability of virtual healthcare in the country.
Collapse
Affiliation(s)
- Haytham A Sheerah
- Ministry of Health, Office of the Vice Minister of Health, Riyadh, Saudi Arabia
| | - Shada AlSalamah
- Information Systems Department, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
- Department of Digital Health and Innovation, Science Division, World Health Organization, Geneva, Switzerland
| | - Sara A Alsalamah
- College of Computer and Information Sciences, Imam Mohammad Ibn Saud Islamic University, Riyadh, Saudi Arabia
- Department of Computer Science, Virginia Tech, Blacksburg, Virginia, USA
| | - Chang-Tien Lu
- Department of Computer Science, Virginia Tech, Blacksburg, Virginia, USA
| | - Ahmed Arafa
- Department of Preventive Cardiology, National Cerebral and Cardiovascular Center, Suita, Japan
- Department of Public Health and Community Medicine, Faculty of Medicine, Beni-Suef University, Beni-Suef, Egypt
| | - Ezzedine Zaatari
- Ministry of Health, Office of the Vice Minister of Health, Riyadh, Saudi Arabia
| | - Abdulaziz Alhomod
- Ministry of Health, SEHA Virtual Hospital, Riyadh, Saudi Arabia
- Emergency Medicine Administration, King Fahad Medical City, Riyadh, Saudi Arabia
| | - Sameer Pujari
- Department of Digital Health and Innovation, Science Division, World Health Organization, Geneva, Switzerland
| | - Alain Labrique
- Department of International Health, Johns Hopkins University Bloomberg School of Public Health, Baltimore, Maryland,United States
| |
Collapse
|
24
|
Hoppe JM, Auer MK, Strüven A, Massberg S, Stremmel C. ChatGPT With GPT-4 Outperforms Emergency Department Physicians in Diagnostic Accuracy: Retrospective Analysis. J Med Internet Res 2024; 26:e56110. [PMID: 38976865 PMCID: PMC11263899 DOI: 10.2196/56110] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/06/2024] [Revised: 04/08/2024] [Accepted: 05/08/2024] [Indexed: 07/10/2024] Open
Abstract
BACKGROUND OpenAI's ChatGPT is a pioneering artificial intelligence (AI) in the field of natural language processing, and it holds significant potential in medicine for providing treatment advice. Additionally, recent studies have demonstrated promising results using ChatGPT for emergency medicine triage. However, its diagnostic accuracy in the emergency department (ED) has not yet been evaluated. OBJECTIVE This study compares the diagnostic accuracy of ChatGPT with GPT-3.5 and GPT-4 and primary treating resident physicians in an ED setting. METHODS Among 100 adults admitted to our ED in January 2023 with internal medicine issues, the diagnostic accuracy was assessed by comparing the diagnoses made by ED resident physicians and those made by ChatGPT with GPT-3.5 or GPT-4 against the final hospital discharge diagnosis, using a point system for grading accuracy. RESULTS The study enrolled 100 patients with a median age of 72 (IQR 58.5-82.0) years who were admitted to our internal medicine ED primarily for cardiovascular, endocrine, gastrointestinal, or infectious diseases. GPT-4 outperformed both GPT-3.5 (P<.001) and ED resident physicians (P=.01) in diagnostic accuracy for internal medicine emergencies. Furthermore, across various disease subgroups, GPT-4 consistently outperformed GPT-3.5 and resident physicians. It demonstrated significant superiority in cardiovascular (GPT-4 vs ED physicians: P=.03) and endocrine or gastrointestinal diseases (GPT-4 vs GPT-3.5: P=.01). However, in other categories, the differences were not statistically significant. CONCLUSIONS In this study, which compared the diagnostic accuracy of GPT-3.5, GPT-4, and ED resident physicians against a discharge diagnosis gold standard, GPT-4 outperformed both the resident physicians and its predecessor, GPT-3.5. Despite the retrospective design of the study and its limited sample size, the results underscore the potential of AI as a supportive diagnostic tool in ED settings.
Collapse
Affiliation(s)
| | - Matthias K Auer
- Department of Medicine IV, LMU University Hospital, Munich, Germany
| | - Anna Strüven
- Department of Medicine I, LMU University Hospital, Munich, Germany
- Munich Heart Alliance Partner Site, Deutsches Zentrum für Herz-Kreislaufforschung (German Centre for Cardiovascular Research), LMU University Hospital, Munich, Germany
| | - Steffen Massberg
- Department of Medicine I, LMU University Hospital, Munich, Germany
- Munich Heart Alliance Partner Site, Deutsches Zentrum für Herz-Kreislaufforschung (German Centre for Cardiovascular Research), LMU University Hospital, Munich, Germany
| | - Christopher Stremmel
- Department of Medicine I, LMU University Hospital, Munich, Germany
- Munich Heart Alliance Partner Site, Deutsches Zentrum für Herz-Kreislaufforschung (German Centre for Cardiovascular Research), LMU University Hospital, Munich, Germany
| |
Collapse
|
25
|
Keshavarz P, Bagherieh S, Nabipoorashrafi SA, Chalian H, Rahsepar AA, Kim GHJ, Hassani C, Raman SS, Bedayat A. ChatGPT in radiology: A systematic review of performance, pitfalls, and future perspectives. Diagn Interv Imaging 2024; 105:251-265. [PMID: 38679540 DOI: 10.1016/j.diii.2024.04.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/31/2024] [Revised: 03/11/2024] [Accepted: 04/16/2024] [Indexed: 05/01/2024]
Abstract
PURPOSE The purpose of this study was to systematically review the reported performances of ChatGPT, identify potential limitations, and explore future directions for its integration, optimization, and ethical considerations in radiology applications. MATERIALS AND METHODS After a comprehensive review of PubMed, Web of Science, Embase, and Google Scholar databases, a cohort of published studies was identified up to January 1, 2024, utilizing ChatGPT for clinical radiology applications. RESULTS Out of 861 studies derived, 44 studies evaluated the performance of ChatGPT; among these, 37 (37/44; 84.1%) demonstrated high performance, and seven (7/44; 15.9%) indicated it had a lower performance in providing information on diagnosis and clinical decision support (6/44; 13.6%) and patient communication and educational content (1/44; 2.3%). Twenty-four (24/44; 54.5%) studies reported the proportion of ChatGPT's performance. Among these, 19 (19/24; 79.2%) studies recorded a median accuracy of 70.5%, and in five (5/24; 20.8%) studies, there was a median agreement of 83.6% between ChatGPT outcomes and reference standards [radiologists' decision or guidelines], generally confirming ChatGPT's high accuracy in these studies. Eleven studies compared two recent ChatGPT versions, and in ten (10/11; 90.9%), ChatGPTv4 outperformed v3.5, showing notable enhancements in addressing higher-order thinking questions, better comprehension of radiology terms, and improved accuracy in describing images. Risks and concerns about using ChatGPT included biased responses, limited originality, and the potential for inaccurate information leading to misinformation, hallucinations, improper citations and fake references, cybersecurity vulnerabilities, and patient privacy risks. CONCLUSION Although ChatGPT's effectiveness has been shown in 84.1% of radiology studies, there are still multiple pitfalls and limitations to address. It is too soon to confirm its complete proficiency and accuracy, and more extensive multicenter studies utilizing diverse datasets and pre-training techniques are required to verify ChatGPT's role in radiology.
Collapse
Affiliation(s)
- Pedram Keshavarz
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA; School of Science and Technology, The University of Georgia, Tbilisi 0171, Georgia
| | - Sara Bagherieh
- Independent Clinical Radiology Researcher, Los Angeles, CA 90024, USA
| | | | - Hamid Chalian
- Department of Radiology, Cardiothoracic Imaging, University of Washington, Seattle, WA 98195, USA
| | - Amir Ali Rahsepar
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA
| | - Grace Hyun J Kim
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA; Department of Radiological Sciences, Center for Computer Vision and Imaging Biomarkers, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA
| | - Cameron Hassani
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA
| | - Steven S Raman
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA
| | - Arash Bedayat
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA.
| |
Collapse
|
26
|
Law S, Oldfield B, Yang W. ChatGPT/GPT-4 (large language models): Opportunities and challenges of perspective in bariatric healthcare professionals. Obes Rev 2024; 25:e13746. [PMID: 38613164 DOI: 10.1111/obr.13746] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 06/10/2023] [Revised: 03/14/2024] [Accepted: 03/15/2024] [Indexed: 04/14/2024]
Abstract
ChatGPT/GPT-4 is a conversational large language model (LLM) based on artificial intelligence (AI). The potential application of LLMs as virtual assistants for bariatric healthcare professionals in education and practice may be promising if relevant and valid issues are actively examined and addressed. In general medical terms, it is possible that AI models like ChatGPT/GPT-4 will be deeply integrated into medical scenarios, improving medical efficiency and quality, and allowing doctors more time to communicate with patients and implement personalized health management. Chatbots based on AI have great potential in bariatric healthcare and may play an important role in predicting and intervening in weight loss and obesity-related complications. However, given their potential limitations, we should carefully consider the medical, legal, ethical, data security, privacy, and liability issues arising from medical errors caused by ChatGPT/GPT-4. This concern also extends to ChatGPT/GPT-4's ability to justify wrong decisions, and there is an urgent need for appropriate guidelines and regulations to ensure the safe and responsible use of ChatGPT/GPT-4.
Collapse
Affiliation(s)
- Saikam Law
- Department of Metabolic and Bariatric Surgery, The First Affiliated Hospital of Jinan University, Guangzhou, China
- School of Medicine, Jinan University, Guangzhou, China
| | - Brian Oldfield
- Department of Physiology, Monash Biomedicine Discovery Institute, Monash University, Melbourne, Australia
| | - Wah Yang
- Department of Metabolic and Bariatric Surgery, The First Affiliated Hospital of Jinan University, Guangzhou, China
| |
Collapse
|
27
|
Kumar RP, Sivan V, Bachir H, Sarwar SA, Ruzicka F, O'Malley GR, Lobo P, Morales IC, Cassimatis ND, Hundal JS, Patel NV. Can Artificial Intelligence Mitigate Missed Diagnoses by Generating Differential Diagnoses for Neurosurgeons? World Neurosurg 2024; 187:e1083-e1088. [PMID: 38759788 DOI: 10.1016/j.wneu.2024.05.052] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/18/2023] [Revised: 05/08/2024] [Accepted: 05/09/2024] [Indexed: 05/19/2024]
Abstract
BACKGROUND/OBJECTIVE Neurosurgery emphasizes the criticality of accurate differential diagnoses, with diagnostic delays posing significant health and economic challenges. As large language models (LLMs) emerge as transformative tools in healthcare, this study seeks to elucidate their role in assisting neurosurgeons with the differential diagnosis process, especially during preliminary consultations. METHODS This study employed 3 chat-based LLMs, ChatGPT (versions 3.5 and 4.0), Perplexity AI, and Bard AI, to evaluate their diagnostic accuracy. Each LLM was prompted using clinical vignettes, and their responses were recorded to generate differential diagnoses for 20 common and uncommon neurosurgical disorders. Disease-specific prompts were crafted using Dynamed, a clinical reference tool. The accuracy of the LLMs was determined based on their ability to identify the target disease within their top differential diagnoses correctly. RESULTS For the initial differential, ChatGPT 3.5 achieved an accuracy of 52.63%, while ChatGPT 4.0 performed slightly better at 53.68%. Perplexity AI and Bard AI demonstrated 40.00% and 29.47% accuracy, respectively. As the number of considered differentials increased from 2 to 5, ChatGPT 3.5 reached its peak accuracy of 77.89% for the top 5 differentials. Bard AI and Perplexity AI had varied performances, with Bard AI improving in the top 5 differentials at 62.11%. On a disease-specific note, the LLMs excelled in diagnosing conditions like epilepsy and cervical spine stenosis but faced challenges with more complex diseases such as Moyamoya disease and amyotrophic lateral sclerosis. CONCLUSIONS LLMs showcase the potential to enhance diagnostic accuracy and decrease the incidence of missed diagnoses in neurosurgery.
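The "top 1 to top 5" analysis described above amounts to a top-k accuracy computation over ranked differential lists. The Python sketch below illustrates the idea with hypothetical differentials; it is not the authors' evaluation code.

```python
# Illustrative sketch (hypothetical data): top-k accuracy of LLM-generated
# differential diagnosis lists, mirroring the "top 1 to top 5" analysis.

differentials = [
    (["epilepsy", "syncope", "TIA"], "epilepsy"),
    (["meningioma", "glioma", "metastasis", "abscess"], "glioma"),
    (["ALS", "cervical myelopathy", "MS", "myasthenia gravis", "CIDP"], "ALS"),
]

def top_k_accuracy(items, k):
    """Fraction of cases whose target diagnosis appears in the first k candidates."""
    hits = sum(target in ranked[:k] for ranked, target in items)
    return hits / len(items)

for k in range(1, 6):
    print(f"Top-{k} accuracy: {top_k_accuracy(differentials, k):.1%}")
```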
Collapse
Affiliation(s)
- Rohit Prem Kumar
- Department of Neurosurgery, Hackensack Meridian School of Medicine, Nutley, New Jersey, USA.
| | - Vijay Sivan
- Department of Neurosurgery, Hackensack Meridian School of Medicine, Nutley, New Jersey, USA
| | - Hanin Bachir
- Department of Neurosurgery, Hackensack Meridian School of Medicine, Nutley, New Jersey, USA
| | - Syed A Sarwar
- Department of Neurosurgery, Hackensack Meridian School of Medicine, Nutley, New Jersey, USA
| | - Francis Ruzicka
- Department of Neurosurgery, Hackensack Meridian School of Medicine, Nutley, New Jersey, USA
| | - Geoffrey R O'Malley
- Department of Neurosurgery, Hackensack Meridian School of Medicine, Nutley, New Jersey, USA
| | - Paulo Lobo
- Department of Neurosurgery, Hackensack Meridian School of Medicine, Nutley, New Jersey, USA
| | - Ilona Cazorla Morales
- Department of Neurosurgery, Hackensack Meridian School of Medicine, Nutley, New Jersey, USA
| | - Nicholas D Cassimatis
- Department of Neurosurgery, Hackensack Meridian School of Medicine, Nutley, New Jersey, USA
| | - Jasdeep S Hundal
- Department of Neurology, HMH-Jersey Shore University Medical Center, Neptune, New Jersey, USA
| | - Nitesh V Patel
- Department of Neurosurgery, Hackensack Meridian School of Medicine, Nutley, New Jersey, USA; Department of Neurosurgery, HMH-Jersey Shore University Medical Center, Neptune, New Jersey, USA
| |
Collapse
|
28
|
Alshutayli AAM, Asiri FM, Abutaleb YBA, Alomair BA, Almasaud AK, Almaqhawi A. Assessing Public Knowledge and Acceptance of Using Artificial Intelligence Doctors as a Partial Alternative to Human Doctors in Saudi Arabia: A Cross-Sectional Study. Cureus 2024; 16:e64461. [PMID: 39135842 PMCID: PMC11318498 DOI: 10.7759/cureus.64461] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 07/13/2024] [Indexed: 08/15/2024] Open
Abstract
Objective To assess the public acceptance of using artificial intelligence (AI) doctors to diagnose and treat patients as a partial alternative to human physicians in Saudi Arabia. Methodology An observational cross-sectional study was conducted from January to March 2024. A link to an online questionnaire was distributed through social media applications to citizens and residents aged 18 years and older across various regions in Saudi Arabia. The sample size was calculated using the Raosoft online survey size calculator, which estimated that the minimum sample size should be 385. Results Of the 386 participants surveyed, 85.8% reported being aware of AI, and 47.9% reported having some knowledge about different AI fields in daily life. However, almost one-third (32.9%) reported a lack of knowledge about the use of AI in healthcare. In terms of acceptance, 52.3% of respondents indicated they felt comfortable with the use of AI tools as partial alternatives to human doctors, and 30.8% believed AI is useful in the field of health. The most common concern (63.7%) about the use of AI tools accessible to patients was the difficulty of describing symptoms using these tools. Conclusion The findings of this study provide valuable insights into the public's knowledge and acceptance of AI in medicine within the Saudi Arabian context. Overall, this study underscores the importance of proactively addressing the public's concerns and knowledge gaps regarding AI in healthcare. By fostering greater understanding and acceptance, healthcare stakeholders can better harness the potential of AI to improve patient outcomes and enhance the efficiency of medical services in Saudi Arabia.
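The minimum sample size of 385 is what the standard Cochran formula yields under typical survey assumptions (95% confidence, 5% margin of error, p = 0.5), which calculators such as Raosoft implement. The Python sketch below reproduces that calculation; the exact settings used by the authors are an assumption.

```python
# Sketch of the standard sample-size formula behind online survey calculators;
# assumptions: 95% confidence, 5% margin of error, p = 0.5 (most conservative).
import math

z = 1.96      # z-score for 95% confidence
p = 0.5       # assumed response distribution
e = 0.05      # margin of error

n_infinite = (z ** 2) * p * (1 - p) / e ** 2          # ~384.16
n = math.ceil(n_infinite)
print(f"Minimum sample size for a large population: {n}")   # 385

# Optional finite-population correction for a known population size N.
N = 20_000_000
n_fpc = math.ceil(n_infinite / (1 + (n_infinite - 1) / N))
print(f"With finite-population correction (N={N:,}): {n_fpc}")
```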
Collapse
Affiliation(s)
| | - Faisal M Asiri
- College of Medicine, Prince Sattam Bin Abdulaziz University, Al-Kharj, SAU
| | | | | | | | | |
Collapse
|
29
|
Aden D, Zaheer S, Khan S. Possible benefits, challenges, pitfalls, and future perspective of using ChatGPT in pathology. REVISTA ESPANOLA DE PATOLOGIA : PUBLICACION OFICIAL DE LA SOCIEDAD ESPANOLA DE ANATOMIA PATOLOGICA Y DE LA SOCIEDAD ESPANOLA DE CITOLOGIA 2024; 57:198-210. [PMID: 38971620 DOI: 10.1016/j.patol.2024.04.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 01/29/2024] [Revised: 02/22/2024] [Accepted: 04/16/2024] [Indexed: 07/08/2024]
Abstract
The much-hyped artificial intelligence (AI) model called ChatGPT, developed by OpenAI, can have great benefits for physicians, especially pathologists, by saving time that can then be devoted to more significant work. Generative AI is a special class of AI model, which uses patterns and structures learned from existing data and can create new data. Utilizing ChatGPT in pathology offers a multitude of benefits, encompassing the summarization of patient records and its promising prospects in digital pathology, as well as its valuable contributions to education and research in this field. However, certain roadblocks still need to be addressed, such as integrating ChatGPT with image analysis, which could revolutionize the field of pathology by increasing diagnostic accuracy and precision. The challenges with the use of ChatGPT encompass biases from its training data, the need for ample input data, potential risks related to bias and transparency, and the potential adverse outcomes arising from inaccurate content generation. A further goal is the generation of meaningful insights from textual information alongside efficient processing of different types of image data, such as medical images and pathology slides. Due consideration should be given to ethical and legal issues, including bias.
Collapse
Affiliation(s)
- Durre Aden
- Department of Pathology, Hamdard Institute of Medical Sciences and Research, Jamia Hamdard, New Delhi, India
| | - Sufian Zaheer
- Department of Pathology, Vardhman Mahavir Medical College and Safdarjung Hospital, New Delhi, India.
| | - Sabina Khan
- Department of Pathology, Hamdard Institute of Medical Sciences and Research, Jamia Hamdard, New Delhi, India
| |
Collapse
|
30
|
Lahat A, Sharif K, Zoabi N, Shneor Patt Y, Sharif Y, Fisher L, Shani U, Arow M, Levin R, Klang E. Assessing Generative Pretrained Transformers (GPT) in Clinical Decision-Making: Comparative Analysis of GPT-3.5 and GPT-4. J Med Internet Res 2024; 26:e54571. [PMID: 38935937 PMCID: PMC11240076 DOI: 10.2196/54571] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/15/2023] [Revised: 02/02/2024] [Accepted: 04/29/2024] [Indexed: 06/29/2024] Open
Abstract
BACKGROUND Artificial intelligence, particularly chatbot systems, is becoming an instrumental tool in health care, aiding clinical decision-making and patient engagement. OBJECTIVE This study aims to analyze the performance of ChatGPT-3.5 and ChatGPT-4 in addressing complex clinical and ethical dilemmas, and to illustrate their potential role in health care decision-making while comparing seniors' and residents' ratings, and specific question types. METHODS A total of 4 specialized physicians formulated 176 real-world clinical questions. A total of 8 senior physicians and residents assessed responses from GPT-3.5 and GPT-4 on a 1-5 scale across 5 categories: accuracy, relevance, clarity, utility, and comprehensiveness. Evaluations were conducted within internal medicine, emergency medicine, and ethics. Comparisons were made globally, between seniors and residents, and across classifications. RESULTS Both GPT models received high mean scores (4.4, SD 0.8 for GPT-4 and 4.1, SD 1.0 for GPT-3.5). GPT-4 outperformed GPT-3.5 across all rating dimensions, with seniors consistently rating responses higher than residents for both models. Specifically, seniors rated GPT-4 as more beneficial and complete (mean 4.6 vs 4.0 and 4.6 vs 4.1, respectively; P<.001), and GPT-3.5 similarly (mean 4.1 vs 3.7 and 3.9 vs 3.5, respectively; P<.001). Ethical queries received the highest ratings for both models, with mean scores reflecting consistency across accuracy and completeness criteria. Distinctions among question types were significant, particularly for the GPT-4 mean scores in completeness across emergency, internal, and ethical questions (4.2, SD 1.0; 4.3, SD 0.8; and 4.5, SD 0.7, respectively; P<.001), and for GPT-3.5's accuracy, beneficial, and completeness dimensions. CONCLUSIONS ChatGPT's potential to assist physicians with medical issues is promising, with prospects to enhance diagnostics, treatments, and ethics. While integration into clinical workflows may be valuable, it must complement, not replace, human expertise. Continued research is essential to ensure safe and effective implementation in clinical environments.
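The abstract reports mean Likert scores and P values for paired model comparisons. One common way to test such paired ordinal ratings is a Wilcoxon signed-rank test, sketched below with invented scores; this is an illustration, not necessarily the authors' exact procedure.

```python
# Illustrative sketch (made-up ratings): paired comparison of 1-5 Likert scores
# for the same questions answered by two models. Requires NumPy and SciPy.
import numpy as np
from scipy.stats import wilcoxon

gpt35 = np.array([4, 4, 3, 5, 4, 3, 4, 5, 3, 4, 4, 3])
gpt4  = np.array([5, 4, 4, 5, 5, 4, 4, 5, 4, 5, 4, 4])

stat, p = wilcoxon(gpt4, gpt35)
print(f"Mean GPT-4: {gpt4.mean():.1f}, mean GPT-3.5: {gpt35.mean():.1f}")
print(f"Wilcoxon signed-rank: statistic={stat:.1f}, p={p:.3f}")
```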
Collapse
Affiliation(s)
- Adi Lahat
- Department of Gastroenterology, Chaim Sheba Medical Center, Affiliated with Tel Aviv University, Ramat Gan, Israel
- Department of Gastroenterology, Samson Assuta Ashdod Medical Center, Affiliated with Ben Gurion University of the Negev, Be'er Sheva, Israel
| | - Kassem Sharif
- Department of Gastroenterology, Chaim Sheba Medical Center, Affiliated with Tel Aviv University, Ramat Gan, Israel
- Department of Internal Medicine B, Sheba Medical Centre, Tel Aviv, Israel
| | - Narmin Zoabi
- Department of Gastroenterology, Chaim Sheba Medical Center, Affiliated with Tel Aviv University, Ramat Gan, Israel
| | | | - Yousra Sharif
- Department of Internal Medicine C, Hadassah Medical Center, Jerusalem, Israel
| | - Lior Fisher
- Department of Internal Medicine B, Sheba Medical Centre, Tel Aviv, Israel
| | - Uria Shani
- Department of Internal Medicine B, Sheba Medical Centre, Tel Aviv, Israel
| | - Mohamad Arow
- Department of Internal Medicine B, Sheba Medical Centre, Tel Aviv, Israel
| | - Roni Levin
- Department of Internal Medicine B, Sheba Medical Centre, Tel Aviv, Israel
| | - Eyal Klang
- Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, United States
| |
Collapse
|
31
|
Ríos-Hoyo A, Shan NL, Li A, Pearson AT, Pusztai L, Howard FM. Evaluation of large language models as a diagnostic aid for complex medical cases. Front Med (Lausanne) 2024; 11:1380148. [PMID: 38966538 PMCID: PMC11222590 DOI: 10.3389/fmed.2024.1380148] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/01/2024] [Accepted: 06/10/2024] [Indexed: 07/06/2024] Open
Abstract
Background The use of large language models (LLMs) has recently gained popularity in diverse areas, including answering questions posted by patients as well as medical professionals. Objective To evaluate the performance and limitations of LLMs in providing the correct diagnosis for a complex clinical case. Design Seventy-five consecutive clinical cases were selected from the Massachusetts General Hospital Case Records, and differential diagnoses were generated by OpenAI's GPT3.5 and 4 models. Results The mean number of diagnoses provided by the Massachusetts General Hospital case discussants was 16.77, by GPT3.5 30, and by GPT4 15.45 (p < 0.0001). GPT4 was more frequently able to list the correct diagnosis first (22% versus 20% with GPT3.5, p = 0.86) and to provide the correct diagnosis among the top three generated diagnoses (42% versus 24%, p = 0.075). GPT4 was also better at providing the correct diagnosis when the generated diagnoses were classified into groups according to medical specialty, and at including the correct diagnosis at any point in the differential list (68% versus 48%, p = 0.0063). GPT4 provided a differential list that was more similar to the list provided by the case discussants than GPT3.5 (Jaccard Similarity Index 0.22 versus 0.12, p = 0.001). Inclusion of the correct diagnosis in the generated differential was correlated with PubMed articles matching the diagnosis (OR 1.40, 95% CI 1.25-1.56 for GPT3.5; OR 1.25, 95% CI 1.13-1.40 for GPT4), but not with disease incidence. Conclusions and relevance The GPT4 model was able to generate a differential diagnosis list with the correct diagnosis in approximately two thirds of cases, but the most likely diagnosis was often incorrect for both models. In its current state, this tool can at most be used as an aid to expand on potential diagnostic considerations for a case, and future LLMs should be trained to account for the discrepancy between disease incidence and availability in the literature.
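The Jaccard Similarity Index used above to compare differential lists is simply the size of the intersection divided by the size of the union of the two sets of diagnoses. A minimal Python sketch with hypothetical lists:

```python
# Minimal sketch (hypothetical diagnosis lists): Jaccard similarity between
# an LLM-generated differential and the case discussants' differential.

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

gpt_list = ["sarcoidosis", "lymphoma", "tuberculosis", "IgG4-related disease"]
discussant_list = ["lymphoma", "tuberculosis", "histoplasmosis", "sarcoidosis", "metastasis"]

print(f"Jaccard similarity: {jaccard(gpt_list, discussant_list):.2f}")
```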
Collapse
Affiliation(s)
| | - Naing Lin Shan
- Yale Cancer Center, Yale School of Medicine, New Haven, CT, United States
| | - Anran Li
- Department of Medicine, University of Chicago, Chicago, IL, United States
| | | | - Lajos Pusztai
- Yale Cancer Center, Yale School of Medicine, New Haven, CT, United States
| | | |
Collapse
|
32
|
Born C, Schwarz R, Böttcher TP, Hein A, Krcmar H. The role of information systems in emergency department decision-making-a literature review. J Am Med Inform Assoc 2024; 31:1608-1621. [PMID: 38781289 PMCID: PMC11187435 DOI: 10.1093/jamia/ocae096] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/21/2023] [Revised: 04/11/2024] [Accepted: 04/15/2024] [Indexed: 05/25/2024] Open
Abstract
OBJECTIVES Healthcare providers employ heuristic and analytical decision-making to navigate the high-stakes environment of the emergency department (ED). Despite the increasing integration of information systems (ISs), research on their efficacy is conflicting. Drawing on related fields, we investigate how timing and mode of delivery influence IS effectiveness. Our objective is to reconcile previous contradictory findings, shedding light on optimal IS design in the ED. MATERIALS AND METHODS We conducted a systematic review following PRISMA across PubMed, Scopus, and Web of Science. We coded the ISs' timing as heuristic or analytical, their mode of delivery as active for automatic alerts and passive when requiring user-initiated information retrieval, and their effect on process, economic, and clinical outcomes. RESULTS Our analysis included 83 studies. During early heuristic decision-making, most active interventions were ineffective, while passive interventions generally improved outcomes. In the analytical phase, the effects were reversed. Passive interventions that facilitate information extraction consistently improved outcomes. DISCUSSION Our findings suggest that the effectiveness of active interventions negatively correlates with the amount of information received during delivery. During early heuristic decision-making, when information overload is high, physicians are unresponsive to alerts and proactively consult passive resources. In the later analytical phases, physicians show increased receptivity to alerts due to decreased diagnostic uncertainty and information quantity. Interventions that limit information lead to positive outcomes, supporting our interpretation. CONCLUSION We synthesize our findings into an integrated model that reveals the underlying reasons for conflicting findings from previous reviews and can guide practitioners in designing ISs in the ED.
Collapse
Affiliation(s)
- Cornelius Born
- School of Computation, Information and Technology, Technical University of Munich, 85748 Garching bei München, Germany
| | - Romy Schwarz
- School of Computation, Information and Technology, Technical University of Munich, 85748 Garching bei München, Germany
| | - Timo Phillip Böttcher
- School of Computation, Information and Technology, Technical University of Munich, 85748 Garching bei München, Germany
| | - Andreas Hein
- Institute of Information Systems and Digital Business, University of St. Gallen, 9000 St. Gallen, Switzerland
| | - Helmut Krcmar
- School of Computation, Information and Technology, Technical University of Munich, 85748 Garching bei München, Germany
| |
Collapse
|
33
|
Masanneck L, Schmidt L, Seifert A, Kölsche T, Huntemann N, Jansen R, Mehsin M, Bernhard M, Meuth SG, Böhm L, Pawlitzki M. Triage Performance Across Large Language Models, ChatGPT, and Untrained Doctors in Emergency Medicine: Comparative Study. J Med Internet Res 2024; 26:e53297. [PMID: 38875696 PMCID: PMC11214027 DOI: 10.2196/53297] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/02/2023] [Revised: 04/17/2024] [Accepted: 05/14/2024] [Indexed: 06/16/2024] Open
Abstract
BACKGROUND Large language models (LLMs) have demonstrated impressive performances in various medical domains, prompting an exploration of their potential utility within the high-demand setting of emergency department (ED) triage. This study evaluated the triage proficiency of different LLMs and ChatGPT, an LLM-based chatbot, compared to professionally trained ED staff and untrained personnel. We further explored whether LLM responses could guide untrained staff in effective triage. OBJECTIVE This study aimed to assess the efficacy of LLMs and the associated product ChatGPT in ED triage compared to personnel of varying training status and to investigate if the models' responses can enhance the triage proficiency of untrained personnel. METHODS A total of 124 anonymized case vignettes were triaged by untrained doctors; different versions of currently available LLMs; ChatGPT; and professionally trained raters, who subsequently agreed on a consensus set according to the Manchester Triage System (MTS). The prototypical vignettes were adapted from cases at a tertiary ED in Germany. The main outcome was the level of agreement between raters' MTS level assignments, measured via quadratic-weighted Cohen κ. The extent of over- and undertriage was also determined. Notably, instances of ChatGPT were prompted using zero-shot approaches without extensive background information on the MTS. The tested LLMs included raw GPT-4, Llama 3 70B, Gemini 1.5, and Mixtral 8x7b. RESULTS GPT-4-based ChatGPT and untrained doctors showed substantial agreement with the consensus triage of professional raters (κ=mean 0.67, SD 0.037 and κ=mean 0.68, SD 0.056, respectively), significantly exceeding the performance of GPT-3.5-based ChatGPT (κ=mean 0.54, SD 0.024; P<.001). When untrained doctors used this LLM for second-opinion triage, there was a slight but statistically insignificant performance increase (κ=mean 0.70, SD 0.047; P=.97). Other tested LLMs performed similar to or worse than GPT-4-based ChatGPT or showed odd triaging behavior with the used parameters. LLMs and ChatGPT models tended toward overtriage, whereas untrained doctors undertriaged. CONCLUSIONS While LLMs and the LLM-based product ChatGPT do not yet match professionally trained raters, their best models' triage proficiency equals that of untrained ED doctors. In its current form, LLMs or ChatGPT thus did not demonstrate gold-standard performance in ED triage and, in the setting of this study, failed to significantly improve untrained doctors' triage when used as decision support. Notable performance enhancements in newer LLM versions over older ones hint at future improvements with further technological development and specific training.
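The agreement metric named above, quadratic-weighted Cohen's κ, can be computed as in the short Python sketch below (assuming scikit-learn is available; the MTS level assignments are invented for illustration).

```python
# Illustrative sketch (made-up ratings): quadratic-weighted Cohen's kappa
# between a rater's Manchester Triage System levels and a consensus set.
from sklearn.metrics import cohen_kappa_score

consensus = [1, 2, 2, 3, 4, 5, 3, 2, 1, 4]   # MTS levels agreed by trained raters
rater     = [1, 2, 3, 3, 4, 4, 3, 2, 2, 4]   # e.g. an LLM or an untrained doctor

kappa = cohen_kappa_score(consensus, rater, weights="quadratic")
print(f"Quadratic-weighted kappa: {kappa:.2f}")
```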
Collapse
Affiliation(s)
- Lars Masanneck
- Department of Neurology, Medical Faculty and University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
- Digital Health Center, Hasso Plattner Institute, University of Potsdam, Potsdam, Germany
| | - Linea Schmidt
- Digital Health Center, Hasso Plattner Institute, University of Potsdam, Potsdam, Germany
| | - Antonia Seifert
- Emergency Department, Medical Faculty and University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Tristan Kölsche
- Department of Neurology, Medical Faculty and University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Niklas Huntemann
- Department of Neurology, Medical Faculty and University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Robin Jansen
- Department of Neurology, Medical Faculty and University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Mohammed Mehsin
- Department of Neurology, Medical Faculty and University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Michael Bernhard
- Emergency Department, Medical Faculty and University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Sven G Meuth
- Department of Neurology, Medical Faculty and University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Lennert Böhm
- Emergency Department, Medical Faculty and University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Marc Pawlitzki
- Department of Neurology, Medical Faculty and University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| |
Collapse
|
34
|
Harada Y, Suzuki T, Harada T, Sakamoto T, Ishizuka K, Miyagami T, Kawamura R, Kunitomo K, Nagano H, Shimizu T, Watari T. Performance evaluation of ChatGPT in detecting diagnostic errors and their contributing factors: an analysis of 545 case reports of diagnostic errors. BMJ Open Qual 2024; 13:e002654. [PMID: 38830730 PMCID: PMC11149143 DOI: 10.1136/bmjoq-2023-002654] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/17/2023] [Accepted: 05/28/2024] [Indexed: 06/05/2024] Open
Abstract
BACKGROUND Manual chart review using validated assessment tools is a standardised methodology for detecting diagnostic errors. However, this requires considerable human resources and time. ChatGPT, a recently developed artificial intelligence chatbot based on a large language model, can effectively classify text based on suitable prompts. Therefore, ChatGPT can assist manual chart reviews in detecting diagnostic errors. OBJECTIVE This study aimed to clarify whether ChatGPT could correctly detect diagnostic errors and possible factors contributing to them based on case presentations. METHODS We analysed 545 published case reports that included diagnostic errors. We input the texts of case presentations and the final diagnoses, together with a set of original prompts, into ChatGPT (GPT-4) to generate responses, including the judgement of diagnostic errors and contributing factors of diagnostic errors. Factors contributing to diagnostic errors were coded according to the following three taxonomies: Diagnosis Error Evaluation and Research (DEER), Reliable Diagnosis Challenges (RDC) and Generic Diagnostic Pitfalls (GDP). The responses on the contributing factors from ChatGPT were compared with those from physicians. RESULTS ChatGPT correctly detected diagnostic errors in 519/545 cases (95%) and coded significantly more factors contributing to diagnostic errors per case than physicians did: DEER (median 5 vs 1, p<0.001), RDC (median 4 vs 2, p<0.001) and GDP (median 4 vs 1, p<0.001). The most important contributing factors of diagnostic errors coded by ChatGPT were 'failure/delay in considering the diagnosis' (315, 57.8%) in DEER, 'atypical presentation' (365, 67.0%) in RDC, and 'atypical presentation' (264, 48.4%) in GDP. CONCLUSION ChatGPT accurately detects diagnostic errors from case presentations. ChatGPT may be more sensitive than manual review in detecting factors contributing to diagnostic errors, especially for 'atypical presentation'.
Collapse
Affiliation(s)
- Yukinori Harada
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga-gun, Tochigi, Japan
| | | | - Taku Harada
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga-gun, Tochigi, Japan
- Nerima Hikarigaoka Hospital, Nerima-ku, Tokyo, Japan
| | - Tetsu Sakamoto
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga-gun, Tochigi, Japan
| | - Kosuke Ishizuka
- Yokohama City University School of Medicine Graduate School of Medicine, Yokohama, Kanagawa, Japan
| | - Taiju Miyagami
- Department of General Medicine, Faculty of Medicine, Juntendo University, Bunkyo-ku, Tokyo, Japan
| | - Ren Kawamura
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga-gun, Tochigi, Japan
| | | | - Hiroyuki Nagano
- Department of General Internal Medicine, Tenri Hospital, Tenri, Nara, Japan
| | - Taro Shimizu
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga-gun, Tochigi, Japan
| | - Takashi Watari
- Integrated Clinical Education Center, Kyoto University Hospital, Kyoto, Kyoto, Japan
| |
Collapse
|
35
|
Barclay KS, You JY, Coleman MJ, Mathews PM, Ray VL, Riaz KM, De Rojas JO, Wang AS, Watson SH, Koo EH, Eghrari AO. Quality and Agreement With Scientific Consensus of ChatGPT Information Regarding Corneal Transplantation and Fuchs Dystrophy. Cornea 2024; 43:746-750. [PMID: 38016014 DOI: 10.1097/ico.0000000000003439] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/07/2023] [Accepted: 10/30/2023] [Indexed: 11/30/2023]
Abstract
PURPOSE ChatGPT is a commonly used source of information by patients and clinicians. However, it can be prone to error and requires validation. We sought to assess the quality and accuracy of information regarding corneal transplantation and Fuchs dystrophy from 2 iterations of ChatGPT, and whether its answers improve over time. METHODS A total of 10 corneal specialists collaborated to assess responses of the algorithm to 10 commonly asked questions related to endothelial keratoplasty and Fuchs dystrophy. These questions were asked from both ChatGPT-3.5 and its newer generation, GPT-4. Assessments tested quality, safety, accuracy, and bias of information. Chi-squared, Fisher exact tests, and regression analyses were conducted. RESULTS We analyzed 180 valid responses. On a 1 (A+) to 5 (F) scale, the average score given by all specialists across questions was 2.5 for ChatGPT-3.5 and 1.4 for GPT-4, a significant improvement ( P < 0.0001). Most responses by both ChatGPT-3.5 (61%) and GPT-4 (89%) used correct facts, a proportion that significantly improved across iterations ( P < 0.00001). Approximately a third (35%) of responses from ChatGPT-3.5 were considered against the scientific consensus, a notable rate of error that decreased to only 5% of answers from GPT-4 ( P < 0.00001). CONCLUSIONS The quality of responses in ChatGPT significantly improved between versions 3.5 and 4, and the odds of providing information against the scientific consensus decreased. However, the technology is still capable of producing inaccurate statements. Corneal specialists are uniquely positioned to assist users to discern the veracity and application of such information.
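The chi-squared and Fisher exact analyses mentioned above compare response categories between model versions. The Python sketch below shows how such a comparison could be run with SciPy on a 2 x 2 table; the counts are illustrative approximations of the reported proportions, not the study's exact data.

```python
# Sketch (illustrative counts): comparing the proportion of answers judged
# against scientific consensus between two model versions. Requires SciPy.
from scipy.stats import fisher_exact, chi2_contingency

#        against consensus, consistent with consensus
table = [[31, 59],    # e.g. ChatGPT-3.5 responses
         [ 5, 85]]    # e.g. GPT-4 responses

odds_ratio, p_fisher = fisher_exact(table)
chi2, p_chi2, dof, _ = chi2_contingency(table)

print(f"Fisher exact: OR={odds_ratio:.2f}, p={p_fisher:.4g}")
print(f"Chi-squared: chi2={chi2:.2f}, dof={dof}, p={p_chi2:.4g}")
```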
Collapse
Affiliation(s)
| | | | | | | | | | | | | | | | - Shelly H Watson
- Northern Virginia Ophthalmology Associates, Falls Church, VA
| | | | | |
Collapse
|
36
|
Mousavi M, Shafiee S, Harley JM, Cheung JCK, Abbasgholizadeh Rahimi S. Performance of generative pre-trained transformers (GPTs) in Certification Examination of the College of Family Physicians of Canada. Fam Med Community Health 2024; 12:e002626. [PMID: 38806403 PMCID: PMC11138270 DOI: 10.1136/fmch-2023-002626] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/30/2024] Open
Abstract
INTRODUCTION The application of large language models such as generative pre-trained transformers (GPTs) has been promising in medical education, and their performance has been tested for different medical exams. This study aims to assess the performance of GPTs in responding to a set of sample questions of short-answer management problems (SAMPs) from the certification exam of the College of Family Physicians of Canada (CFPC). METHOD Between August 8th and 25th, 2023, we used GPT-3.5 and GPT-4 in five rounds to answer a sample of 77 SAMPs questions from the CFPC website. Two independent certified family physician reviewers scored AI-generated responses twice: first, according to the CFPC answer key (ie, CFPC score), and second, based on their knowledge and other references (ie, Reviewers' score). An ordinal logistic generalised estimating equations (GEE) model was applied to analyse repeated measures across the five rounds. RESULT According to the CFPC answer key, 607 (73.6%) lines of answers by GPT-3.5 and 691 (81%) by GPT-4 were deemed accurate. Reviewers' scoring suggested that about 84% of the lines of answers provided by GPT-3.5 and 93% of GPT-4 were correct. The GEE analysis confirmed that over five rounds, the likelihood of achieving a higher CFPC score percentage was 2.31 times greater for GPT-4 than for GPT-3.5 (OR: 2.31; 95% CI: 1.53 to 3.47; p<0.001). Similarly, the Reviewers' score percentages for responses provided by GPT-4 over the 5 rounds were 2.23 times more likely to exceed those of GPT-3.5 (OR: 2.23; 95% CI: 1.22 to 4.06; p=0.009). Rerunning the GPTs after a one-week interval, regenerating the prompt, or using versus not using the prompt did not significantly change the CFPC score percentage. CONCLUSION In our study, we used GPT-3.5 and GPT-4 to answer complex, open-ended sample questions of the CFPC exam and showed that more than 70% of the answers were accurate, and GPT-4 outperformed GPT-3.5 in responding to the questions. Large language models such as GPTs seem promising for assisting candidates of the CFPC exam by providing potential answers. However, their use for family medicine education and exam preparation needs further study.
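The repeated-measures analysis described above uses an ordinal logistic GEE model. The sketch below shows a simplified GEE of the same flavour in Python (statsmodels), reduced to a binomial outcome (answer line correct or not) clustered by question rather than the authors' ordinal specification; all data and column names are invented.

```python
# Hedged sketch: a generalized estimating equations (GEE) model for repeated
# measures across rounds, simplified to a binomial outcome instead of the
# authors' ordinal specification. Requires pandas and statsmodels.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "question": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5] * 2,   # cluster: same question, repeated rounds
    "model":    ["gpt35"] * 10 + ["gpt4"] * 10,
    "round":    [1, 2] * 10,
    "correct":  [1, 0, 1, 1, 0, 1, 1, 1, 0, 0,
                 1, 1, 1, 1, 1, 0, 1, 1, 1, 0],
})

gee = smf.gee("correct ~ model + round", groups="question", data=df,
              family=sm.families.Binomial(),
              cov_struct=sm.cov_struct.Exchangeable())
result = gee.fit()
print(result.summary())
```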
Affiliation(s)
- Mehdi Mousavi
- Department of Family Medicine, Faculty of Medicine, University of Saskatchewan, Nipawin, Saskatchewan, Canada
- Shabnam Shafiee
- Department of Family Medicine, Saskatchewan Health Authority, Riverside Health Complex, Turtleford, Saskatchewan, Canada
- Jason M Harley
- Department of Surgery, Faculty of Medicine and Health Sciences, McGill University, Montreal, Quebec, Canada
- Research Institute of the McGill University Health Centre, Montreal, Quebec, Canada
- Institute for Health Sciences Education, Faculty of Medicine and Health Sciences, McGill University, Montreal, Quebec, Canada
- Jackie Chi Kit Cheung
- McGill University School of Computer Science, Montreal, Quebec, Canada
- CIFAR AI Chair, Mila-Quebec AI Institute, Montreal, Quebec, Canada
- Samira Abbasgholizadeh Rahimi
- Department of Family Medicine, McGill University, Montreal, Quebec, Canada
- Mila Quebec AI-Institute, Montreal, Quebec, Canada
- Faculty of Dentistry Medicine and Oral Health Sciences, McGill University, Montreal, Quebec, Canada
- Lady Davis Institute for Medical Research, Jewish General Hospital, Montreal, Quebec, Canada
37
Pardos ZA, Bhandari S. ChatGPT-generated help produces learning gains equivalent to human tutor-authored help on mathematics skills. PLoS One 2024; 19:e0304013. [PMID: 38787823 PMCID: PMC11125466 DOI: 10.1371/journal.pone.0304013] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/15/2023] [Accepted: 05/03/2024] [Indexed: 05/26/2024] Open
Abstract
Authoring of help content within educational technologies is labor intensive, requiring many iterations of content creation, refining, and proofreading. In this paper, we conduct an efficacy evaluation of ChatGPT-generated help using a 3 x 4 study design (N = 274) to compare the learning gains of ChatGPT to human tutor-authored help across four mathematics problem subject areas. Participants are randomly assigned to one of three hint conditions (control, human tutor, or ChatGPT) paired with one of four randomly assigned subject areas (Elementary Algebra, Intermediate Algebra, College Algebra, or Statistics). We find that only the ChatGPT condition produces statistically significant learning gains compared to a no-help control, with no statistically significant differences in gains or time-on-task observed between learners receiving ChatGPT vs human tutor help. Notably, ChatGPT-generated help failed quality checks on 32% of problems. This was, however, reducible to nearly 0% for algebra problems and 13% for statistics problems after applying self-consistency, a "hallucination" mitigation technique for Large Language Models.
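The "self-consistency" mitigation mentioned above amounts to sampling several answers and keeping the one most often produced. A minimal sketch follows; generate_answer() is a hypothetical placeholder for whatever LLM call is used, and the 0.6 agreement threshold is an arbitrary choice.

# Sketch: majority voting over repeated samples, flagging low-agreement items.
from collections import Counter

def generate_answer(problem: str) -> str:
    raise NotImplementedError("placeholder for an LLM call")

def self_consistent_answer(problem: str, n_samples: int = 5, min_agreement: float = 0.6):
    samples = [generate_answer(problem) for _ in range(n_samples)]
    answer, count = Counter(samples).most_common(1)[0]
    agreement = count / n_samples
    # items below the agreement threshold would be routed to a human reviewer
    return answer, agreement >= min_agreement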
Affiliation(s)
- Zachary A. Pardos
- Berkeley School of Education, University of California, Berkeley, California, United States of America
- Shreya Bhandari
- Electrical Engineering and Computer Science, University of California, Berkeley, California, United States of America
38
Jindal JA, Lungren MP, Shah NH. Ensuring useful adoption of generative artificial intelligence in healthcare. J Am Med Inform Assoc 2024; 31:1441-1444. [PMID: 38452298 PMCID: PMC11105148 DOI: 10.1093/jamia/ocae043] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/22/2023] [Revised: 02/01/2024] [Accepted: 02/22/2024] [Indexed: 03/09/2024] Open
Abstract
OBJECTIVES This article aims to examine how generative artificial intelligence (AI) can be adopted with the most value in health systems, in response to the Executive Order on AI. MATERIALS AND METHODS We reviewed how technology has historically been deployed in healthcare, and evaluated recent examples of deployments of both traditional AI and generative AI (GenAI) with a lens on value. RESULTS Traditional AI and GenAI are different technologies in terms of their capability and modes of current deployment, which have implications on value in health systems. DISCUSSION Traditional AI when applied with a framework top-down can realize value in healthcare. GenAI in the short term when applied top-down has unclear value, but encouraging more bottom-up adoption has the potential to provide more benefit to health systems and patients. CONCLUSION GenAI in healthcare can provide the most value for patients when health systems adapt culturally to grow with this new technology and its adoption patterns.
Affiliation(s)
- Jenelle A Jindal
- Center for Biomedical Informatics Research, Stanford University, Stanford, CA 94305, United States
- Matthew P Lungren
- Health and Life Sciences, Microsoft Corporation, Redmond, WA 98052, United States
- Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA 94305, United States
- Department of Biomedical Imaging, University of California San Francisco, San Francisco, CA 94143, United States
- Nigam H Shah
- Department of Medicine, Stanford School of Medicine, Stanford, CA 94304, United States
- Clinical Excellence Research Center, Stanford School of Medicine, Stanford, CA 94304, United States
- Technology and Digital Solutions, Stanford Health Care, Palo Alto, CA 94304, United States
39
Harada Y, Sakamoto T, Sugimoto S, Shimizu T. Longitudinal Changes in Diagnostic Accuracy of a Differential Diagnosis List Developed by an AI-Based Symptom Checker: Retrospective Observational Study. JMIR Form Res 2024; 8:e53985. [PMID: 38758588 PMCID: PMC11143391 DOI: 10.2196/53985] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/26/2023] [Revised: 03/23/2024] [Accepted: 04/24/2024] [Indexed: 05/18/2024] Open
Abstract
BACKGROUND Artificial intelligence (AI) symptom checker models should be trained using real-world patient data to improve their diagnostic accuracy. Given that AI-based symptom checkers are currently used in clinical practice, their performance should improve over time. However, longitudinal evaluations of the diagnostic accuracy of these symptom checkers are limited. OBJECTIVE This study aimed to assess the longitudinal changes in the accuracy of differential diagnosis lists created by an AI-based symptom checker used in the real world. METHODS This was a single-center, retrospective, observational study. Patients who visited an outpatient clinic without an appointment between May 1, 2019, and April 30, 2022, and who were admitted to a community hospital in Japan within 30 days of their index visit were considered eligible. We only included patients who underwent an AI-based symptom checkup at the index visit, and the diagnosis was finally confirmed during follow-up. Final diagnoses were categorized as common or uncommon, and all cases were categorized as typical or atypical. The primary outcome measure was the accuracy of the differential diagnosis list created by the AI-based symptom checker, defined as the final diagnosis in a list of 10 differential diagnoses created by the symptom checker. To assess the change in the symptom checker's diagnostic accuracy over 3 years, we used a chi-square test to compare the primary outcome over 3 periods: from May 1, 2019, to April 30, 2020 (first year); from May 1, 2020, to April 30, 2021 (second year); and from May 1, 2021, to April 30, 2022 (third year). RESULTS A total of 381 patients were included. Common diseases comprised 257 (67.5%) cases, and typical presentations were observed in 298 (78.2%) cases. Overall, the accuracy of the differential diagnosis list created by the AI-based symptom checker was 172 (45.1%), which did not differ across the 3 years (first year: 97/219, 44.3%; second year: 32/72, 44.4%; and third year: 43/90, 47.7%; P=.85). The accuracy of the differential diagnosis list created by the symptom checker was low in those with uncommon diseases (30/124, 24.2%) and atypical presentations (12/83, 14.5%). In the multivariate logistic regression model, common disease (P<.001; odds ratio 4.13, 95% CI 2.50-6.98) and typical presentation (P<.001; odds ratio 6.92, 95% CI 3.62-14.2) were significantly associated with the accuracy of the differential diagnosis list created by the symptom checker. CONCLUSIONS A 3-year longitudinal survey of the diagnostic accuracy of differential diagnosis lists developed by an AI-based symptom checker, which has been implemented in real-world clinical practice settings, showed no improvement over time. Uncommon diseases and atypical presentations were independently associated with a lower diagnostic accuracy. In the future, symptom checkers should be trained to recognize uncommon conditions.
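The year-over-year comparison reported above can be reproduced in outline with a chi-square test on the published counts (97/219, 32/72, and 43/90). The short Python sketch below assumes scipy is available and is not the authors' code; the adjusted odds ratios in the abstract would come from exponentiating the coefficients of a separate logistic model fitted to per-patient data.

# Sketch: chi-square test of top-10 list accuracy across the three study years.
from scipy.stats import chi2_contingency

hits   = [97, 32, 43]                      # correct lists per year, from the abstract
misses = [219 - 97, 72 - 32, 90 - 43]      # incorrect lists per year
chi2, p, dof, expected = chi2_contingency([hits, misses])
print(f"chi-square = {chi2:.2f}, dof = {dof}, p = {p:.2f}")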
Affiliation(s)
- Yukinori Harada
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga, Japan
- Department of General Medicine, Nagano Chuo Hospital, Nagano, Japan
- Tetsu Sakamoto
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga, Japan
- Shu Sugimoto
- Department of Medicine (Neurology and Rheumatology), Shinshu University School of Medicine, Matsumoto, Japan
- Taro Shimizu
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga, Japan
40
Yanagita Y, Yokokawa D, Fukuzawa F, Uchida S, Uehara T, Ikusaka M. Expert assessment of ChatGPT's ability to generate illness scripts: an evaluative study. BMC MEDICAL EDUCATION 2024; 24:536. [PMID: 38750546 PMCID: PMC11095028 DOI: 10.1186/s12909-024-05534-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 01/08/2024] [Accepted: 05/08/2024] [Indexed: 05/19/2024]
Abstract
BACKGROUND An illness script is a specific script format geared to represent patient-oriented clinical knowledge organized around enabling conditions, faults (i.e., pathophysiological process), and consequences. Generative artificial intelligence (AI) stands out as an educational aid in continuing medical education. The effortless creation of a typical illness script by generative AI could help the comprehension of key features of diseases and increase diagnostic accuracy. No systematic summary of specific examples of illness scripts has been reported since illness scripts are unique to each physician. OBJECTIVE This study investigated whether generative AI can generate illness scripts. METHODS We utilized ChatGPT-4, a generative AI, to create illness scripts for 184 diseases based on the diseases and conditions integral to the National Model Core Curriculum in Japan for undergraduate medical education (2022 revised edition) and primary care specialist training in Japan. Three physicians applied a three-tier grading scale: "A" denotes that the content of each disease's illness script proves sufficient for training medical students, "B" denotes that it is partially lacking but acceptable, and "C" denotes that it is deficient in multiple respects. RESULTS By leveraging ChatGPT-4, we successfully generated each component of the illness script for 184 diseases without any omission. The illness scripts received "A," "B," and "C" ratings of 56.0% (103/184), 28.3% (52/184), and 15.8% (29/184), respectively. CONCLUSION Useful illness scripts were seamlessly and instantaneously created using ChatGPT-4 by employing prompts appropriate for medical students. The technology-driven illness script is a valuable tool for introducing medical students to key features of diseases.
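Generating a structured illness script for each disease, as described above, is essentially a prompting loop. The sketch below shows one way to do it with the OpenAI Python client; the prompt wording, model name, and disease list are illustrative assumptions rather than the authors' protocol.

# Sketch: requesting a three-part illness script per disease.
# Assumes the OpenAI Python SDK (>=1.0) and an API key in the environment.
from openai import OpenAI

client = OpenAI()
diseases = ["acute appendicitis", "community-acquired pneumonia"]  # illustrative subset

prompt_template = (
    "For {disease}, write an illness script for medical students with three parts: "
    "enabling conditions, fault (pathophysiological process), and consequences "
    "(symptoms and signs). Keep each part to two or three sentences."
)

scripts = {}
for disease in diseases:
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder model name
        messages=[{"role": "user", "content": prompt_template.format(disease=disease)}],
    )
    scripts[disease] = response.choices[0].message.content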
Affiliation(s)
- Yasutaka Yanagita
- Department of General Medicine, Chiba University Hospital, 1-8-1, Inohana, Chuo-Ku, Chiba, Chiba Pref, Japan.
- Daiki Yokokawa
- Department of General Medicine, Chiba University Hospital, 1-8-1, Inohana, Chuo-Ku, Chiba, Chiba Pref, Japan
- Fumitoshi Fukuzawa
- Department of General Medicine, Chiba University Hospital, 1-8-1, Inohana, Chuo-Ku, Chiba, Chiba Pref, Japan
- Shun Uchida
- Uchida Internal Medicine Clinic, Saitama, Japan
- Takanori Uehara
- Department of General Medicine, Chiba University Hospital, 1-8-1, Inohana, Chuo-Ku, Chiba, Chiba Pref, Japan
- Masatomi Ikusaka
- Department of General Medicine, Chiba University Hospital, 1-8-1, Inohana, Chuo-Ku, Chiba, Chiba Pref, Japan
41
Makhoul M, Melkane AE, Khoury PE, Hadi CE, Matar N. A cross-sectional comparative study: ChatGPT 3.5 versus diverse levels of medical experts in the diagnosis of ENT diseases. Eur Arch Otorhinolaryngol 2024; 281:2717-2721. [PMID: 38365990 DOI: 10.1007/s00405-024-08509-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/01/2024] [Accepted: 01/24/2024] [Indexed: 02/18/2024]
Abstract
PURPOSE With recent advances in artificial intelligence (AI), it has become crucial to thoroughly evaluate its applicability in healthcare. This study aimed to assess the accuracy of ChatGPT in diagnosing ear, nose, and throat (ENT) pathology, and comparing its performance to that of medical experts. METHODS We conducted a cross-sectional comparative study where 32 ENT cases were presented to ChatGPT 3.5, ENT physicians, ENT residents, family medicine (FM) specialists, second-year medical students (Med2), and third-year medical students (Med3). Each participant provided three differential diagnoses. The study analyzed diagnostic accuracy rates and inter-rater agreement within and between participant groups and ChatGPT. RESULTS The accuracy rate of ChatGPT was 70.8%, being not significantly different from ENT physicians or ENT residents. However, a significant difference in correctness rate existed between ChatGPT and FM specialists (49.8%, p < 0.001), and between ChatGPT and medical students (Med2 47.5%, p < 0.001; Med3 47%, p < 0.001). Inter-rater agreement for the differential diagnosis between ChatGPT and each participant group was either poor or fair. In 68.75% of cases, ChatGPT failed to mention the most critical diagnosis. CONCLUSIONS ChatGPT demonstrated accuracy comparable to that of ENT physicians and ENT residents in diagnosing ENT pathology, outperforming FM specialists, Med2 and Med3. However, it showed limitations in identifying the most critical diagnosis.
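Inter-rater agreement of the kind summarized above ("poor or fair") is commonly quantified with Cohen's kappa. A small Python sketch follows; the diagnosis labels are invented examples and scikit-learn is assumed to be available.

# Sketch: agreement between ChatGPT's first-choice diagnosis and one rater group.
from sklearn.metrics import cohen_kappa_score

chatgpt_dx  = ["otitis media", "BPPV", "sinusitis", "laryngitis", "otitis media"]
resident_dx = ["otitis media", "vestibular neuritis", "sinusitis", "reflux", "otitis externa"]

kappa = cohen_kappa_score(chatgpt_dx, resident_dx)
print(f"Cohen's kappa = {kappa:.2f}")  # <0.20 poor, 0.21-0.40 fair, and so on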
Affiliation(s)
- Mikhael Makhoul
- Department of Otolaryngology-Head and Neck Surgery, Hotel Dieu de France Hospital, Saint Joseph University, Alfred Naccache Boulevard, Ashrafieh, PO Box: 166830, Beirut, Lebanon.
- Antoine E Melkane
- Department of Otolaryngology-Head and Neck Surgery, Hotel Dieu de France Hospital, Saint Joseph University, Alfred Naccache Boulevard, Ashrafieh, PO Box: 166830, Beirut, Lebanon
- Patrick El Khoury
- Department of Otolaryngology-Head and Neck Surgery, Hotel Dieu de France Hospital, Saint Joseph University, Alfred Naccache Boulevard, Ashrafieh, PO Box: 166830, Beirut, Lebanon
- Christopher El Hadi
- Department of Otolaryngology-Head and Neck Surgery, Hotel Dieu de France Hospital, Saint Joseph University, Alfred Naccache Boulevard, Ashrafieh, PO Box: 166830, Beirut, Lebanon
- Nayla Matar
- Department of Otolaryngology-Head and Neck Surgery, Hotel Dieu de France Hospital, Saint Joseph University, Alfred Naccache Boulevard, Ashrafieh, PO Box: 166830, Beirut, Lebanon
42
Guimaraes GR, Figueiredo RG, Silva CS, Arata V, Contreras JCZ, Gomes CM, Tiraboschi RB, Bessa Junior J. Diagnosis in Bytes: Comparing the Diagnostic Accuracy of Google and ChatGPT 3.5 as an Educational Support Tool. INTERNATIONAL JOURNAL OF ENVIRONMENTAL RESEARCH AND PUBLIC HEALTH 2024; 21:580. [PMID: 38791794 PMCID: PMC11120721 DOI: 10.3390/ijerph21050580] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 03/19/2024] [Revised: 04/27/2024] [Accepted: 04/29/2024] [Indexed: 05/26/2024]
Abstract
BACKGROUND Adopting advanced digital technologies as diagnostic support tools in healthcare is an unquestionable trend accelerated by the COVID-19 pandemic. However, their accuracy in suggesting diagnoses remains controversial and needs to be explored. We aimed to evaluate and compare the diagnostic accuracy of two free accessible internet search tools: Google and ChatGPT 3.5. METHODS To assess the effectiveness of both medical platforms, we conducted evaluations using a sample of 60 clinical cases related to urological pathologies. We organized the urological cases into two distinct categories for our analysis: (i) prevalent conditions, which were compiled using the most common symptoms, as outlined by EAU and UpToDate guidelines, and (ii) unusual disorders, identified through case reports published in the 'Urology Case Reports' journal from 2022 to 2023. The outcomes were meticulously classified into three categories to determine the accuracy of each platform: "correct diagnosis", "likely differential diagnosis", and "incorrect diagnosis". A group of experts evaluated the responses blindly and randomly. RESULTS For commonly encountered urological conditions, Google's accuracy was 53.3%, with an additional 23.3% of its results falling within a plausible range of differential diagnoses, and the remaining outcomes were incorrect. ChatGPT 3.5 outperformed Google with an accuracy of 86.6%, provided a likely differential diagnosis in 13.3% of cases, and made no unsuitable diagnosis. In evaluating unusual disorders, Google failed to deliver any correct diagnoses but proposed a likely differential diagnosis in 20% of cases. ChatGPT 3.5 identified the proper diagnosis in 16.6% of rare cases and offered a reasonable differential diagnosis in half of the cases. CONCLUSION ChatGPT 3.5 demonstrated higher diagnostic accuracy than Google in both contexts. The platform showed satisfactory accuracy when diagnosing common cases, yet its performance in identifying rare conditions remains limited.
Affiliation(s)
- Guilherme R. Guimaraes
- Programa de Pós-Graduação em Saúde Coletiva, Universidade Estadual de Feira de Santana (UEFS), Feira de Santana 44.036-900, Brazil; (G.R.G.); (C.S.S.); (V.A.); (J.C.Z.C.); (R.B.T.); (J.B.J.)
- Ricardo G. Figueiredo
- Programa de Pós-Graduação em Saúde Coletiva, Universidade Estadual de Feira de Santana (UEFS), Feira de Santana 44.036-900, Brazil; (G.R.G.); (C.S.S.); (V.A.); (J.C.Z.C.); (R.B.T.); (J.B.J.)
- Caroline Santos Silva
- Programa de Pós-Graduação em Saúde Coletiva, Universidade Estadual de Feira de Santana (UEFS), Feira de Santana 44.036-900, Brazil; (G.R.G.); (C.S.S.); (V.A.); (J.C.Z.C.); (R.B.T.); (J.B.J.)
- Vanessa Arata
- Programa de Pós-Graduação em Saúde Coletiva, Universidade Estadual de Feira de Santana (UEFS), Feira de Santana 44.036-900, Brazil; (G.R.G.); (C.S.S.); (V.A.); (J.C.Z.C.); (R.B.T.); (J.B.J.)
- Jean Carlos Z. Contreras
- Programa de Pós-Graduação em Saúde Coletiva, Universidade Estadual de Feira de Santana (UEFS), Feira de Santana 44.036-900, Brazil; (G.R.G.); (C.S.S.); (V.A.); (J.C.Z.C.); (R.B.T.); (J.B.J.)
- Cristiano M. Gomes
- Faculty of Medicine, Universidade de São Paulo (USP), São Paulo 01.246-904, Brazil;
- Ricardo B. Tiraboschi
- Programa de Pós-Graduação em Saúde Coletiva, Universidade Estadual de Feira de Santana (UEFS), Feira de Santana 44.036-900, Brazil; (G.R.G.); (C.S.S.); (V.A.); (J.C.Z.C.); (R.B.T.); (J.B.J.)
- José Bessa Junior
- Programa de Pós-Graduação em Saúde Coletiva, Universidade Estadual de Feira de Santana (UEFS), Feira de Santana 44.036-900, Brazil; (G.R.G.); (C.S.S.); (V.A.); (J.C.Z.C.); (R.B.T.); (J.B.J.)
43
Kedia N, Sanjeev S, Ong J, Chhablani J. ChatGPT and Beyond: An overview of the growing field of large language models and their use in ophthalmology. Eye (Lond) 2024; 38:1252-1261. [PMID: 38172581 PMCID: PMC11076576 DOI: 10.1038/s41433-023-02915-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/26/2023] [Revised: 11/23/2023] [Accepted: 12/20/2023] [Indexed: 01/05/2024] Open
Abstract
ChatGPT, an artificial intelligence (AI) chatbot built on large language models (LLMs), has rapidly gained popularity. The benefits and limitations of this transformative technology have been discussed across various fields, including medicine. The widespread availability of ChatGPT has enabled clinicians to study how these tools could be used for a variety of tasks such as generating differential diagnosis lists, organizing patient notes, and synthesizing literature for scientific research. LLMs have shown promising capabilities in ophthalmology by performing well on the Ophthalmic Knowledge Assessment Program, providing fairly accurate responses to questions about retinal diseases, and in generating differential diagnoses list. There are current limitations to this technology, including the propensity of LLMs to "hallucinate", or confidently generate false information; their potential role in perpetuating biases in medicine; and the challenges in incorporating LLMs into research without allowing "AI-plagiarism" or publication of false information. In this paper, we provide a balanced overview of what LLMs are and introduce some of the LLMs that have been generated in the past few years. We discuss recent literature evaluating the role of these language models in medicine with a focus on ChatGPT. The field of AI is fast-paced, and new applications based on LLMs are being generated rapidly; therefore, it is important for ophthalmologists to be aware of how this technology works and how it may impact patient care. Here, we discuss the benefits, limitations, and future advancements of LLMs in patient care and research.
Affiliation(s)
- Nikita Kedia
- Department of Ophthalmology, University of Pittsburgh School of Medicine, Pittsburgh, PA, USA
- Joshua Ong
- Department of Ophthalmology and Visual Sciences, University of Michigan Kellogg Eye Center, Ann Arbor, MI, USA
- Jay Chhablani
- Department of Ophthalmology, University of Pittsburgh School of Medicine, Pittsburgh, PA, USA.
44
Farhat F. ChatGPT as a Complementary Mental Health Resource: A Boon or a Bane. Ann Biomed Eng 2024; 52:1111-1114. [PMID: 37477707 DOI: 10.1007/s10439-023-03326-7] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/14/2023] [Accepted: 07/17/2023] [Indexed: 07/22/2023]
Abstract
The launch of Open AI's chatbot, ChatGPT, has generated a lot of attention and discussion among professionals in several fields. Many concerns and challenges have been brought up by researchers from various fields, particularly in relation to the harm that using these tools for medical diagnosis and treatment recommendations can cause. In addition, it has been debated if ChatGPT is dependable, efficient, and helpful for clinicians and medical professionals. Therefore, in this study, we assess ChatGPT's effectiveness in providing mental health support, particularly for issues related to anxiety and depression, based on the chatbot's responses and cross-questioning. The findings indicate that there are significant inconsistencies and that ChatGPT's reliability is low in this specific domain. As a result, care must be used when using ChatGPT as a complementary mental health resource.
Affiliation(s)
- Faiza Farhat
- Section of Parasitology, Department of Zoology, Aligarh Muslim University, Aligarh, UP, 202002, India.
45
Rokhshad R, Zhang P, Mohammad-Rahimi H, Pitchika V, Entezari N, Schwendicke F. Accuracy and consistency of chatbots versus clinicians for answering pediatric dentistry questions: A pilot study. J Dent 2024; 144:104938. [PMID: 38499280 DOI: 10.1016/j.jdent.2024.104938] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/11/2023] [Revised: 03/06/2024] [Accepted: 03/11/2024] [Indexed: 03/20/2024] Open
Abstract
OBJECTIVES Artificial Intelligence has applications such as Large Language Models (LLMs), which simulate human-like conversations. The potential of LLMs in healthcare is not fully evaluated. This pilot study assessed the accuracy and consistency of chatbots and clinicians in answering common questions in pediatric dentistry. METHODS Two expert pediatric dentists developed thirty true or false questions involving different aspects of pediatric dentistry. Publicly accessible chatbots (Google Bard, ChatGPT 4, ChatGPT 3.5, Llama, Sage, Claude 2 100k, Claude-instant, Claude-instant-100k, and Google Palm) were employed to answer the questions (3 independent new conversations). Three groups of clinicians (general dentists, pediatric specialists, and students; n = 20/group) also answered. Responses were graded by two pediatric dentistry faculty members, along with a third independent pediatric dentist. Resulting accuracies (percentage of correct responses) were compared using analysis of variance (ANOVA), and post-hoc pairwise group comparisons were corrected using Tukey's HSD method. Cronbach's alpha was calculated to determine consistency. RESULTS Pediatric dentists were significantly more accurate (mean ± SD: 96.67% ± 4.3%) than other clinicians and chatbots (p < 0.001). General dentists (88.0% ± 6.1%) also demonstrated significantly higher accuracy than chatbots (p < 0.001), followed by students (80.8% ± 6.9%). ChatGPT showed the highest accuracy (78% ± 3%) among chatbots. All chatbots except ChatGPT 3.5 showed acceptable consistency (Cronbach's alpha > 0.7). CLINICAL SIGNIFICANCE Based on this pilot study, chatbots may be valuable adjuncts for educational purposes and for distributing information to patients. However, they are not yet ready to serve as substitutes for human clinicians in diagnostic decision-making. CONCLUSION In this pilot study, chatbots showed lower accuracy than dentists. Chatbots may not yet be recommended for clinical pediatric dentistry.
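A rough Python sketch of the analyses named above (one-way ANOVA with Tukey's HSD on per-respondent accuracy, and Cronbach's alpha across repeated chatbot conversations) follows. The accuracy values are made-up placeholders, and the alpha helper implements the standard formula rather than the authors' exact procedure.

# Sketch: group comparison of accuracies and a consistency coefficient (toy data).
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

pediatric = np.array([96, 100, 93, 97])   # accuracy (%) per pediatric dentist (made up)
general   = np.array([88, 90, 85, 89])
chatbot   = np.array([77, 78, 80, 76])

print(f_oneway(pediatric, general, chatbot))             # one-way ANOVA
scores = np.concatenate([pediatric, general, chatbot])
groups = ["pediatric"] * 4 + ["general"] * 4 + ["chatbot"] * 4
print(pairwise_tukeyhsd(scores, groups))                 # Tukey HSD post hoc

def cronbach_alpha(runs: np.ndarray) -> float:
    """runs: rows = questions, columns = repeated conversations (1 = correct)."""
    k = runs.shape[1]
    item_var = runs.var(axis=0, ddof=1).sum()
    total_var = runs.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var / total_var)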
Affiliation(s)
- Rata Rokhshad
- Department of Pediatric Dentistry, University of Alabama at Birmingham, Birmingham, AL, USA.
- Ping Zhang
- Department of Pediatric Dentistry, University of Alabama at Birmingham, Birmingham, AL, USA
- Hossein Mohammad-Rahimi
- Topic Group Dental Diagnostics and Digital Dentistry, ITU/WHO Focus Group AI on Health, Berlin, Germany
- Vinay Pitchika
- Department of Conservative Dentistry and Periodontology, LMU Klinikum Munich, Germany
- Niloufar Entezari
- Department of pediatric dentistry, School of Dentistry, Qom University of Medical Sciences, Qom, Iran
- Falk Schwendicke
- Topic Group Dental Diagnostics and Digital Dentistry, ITU/WHO Focus Group AI on Health, Berlin, Germany; Department of Conservative Dentistry and Periodontology, LMU Klinikum Munich, Germany
46
Li H, Hayward J, Aguilar LS, Franc JM. Desired clinical applications of artificial intelligence in emergency medicine: A Delphi study. Am J Emerg Med 2024; 79:217-220. [PMID: 38458952 DOI: 10.1016/j.ajem.2024.02.031] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/21/2024] [Revised: 02/01/2024] [Accepted: 02/08/2024] [Indexed: 03/10/2024] Open
Affiliation(s)
- Henry Li
- University of Alberta, Faculty of Medicine and Dentistry, Department of Emergency Medicine, 750 University Terrace Building, 8303-112 Street NW, Edmonton T6G 2T4, Canada.
- Jake Hayward
- University of Alberta, Faculty of Medicine and Dentistry, Department of Emergency Medicine, 750 University Terrace Building, 8303-112 Street NW, Edmonton T6G 2T4, Canada
- Leandro Solis Aguilar
- University of Alberta, Faculty of Medicine and Dentistry, Department of Biochemistry, 474 Medical Sciences Building, Edmonton T6G 2H7, Canada
- Jeffrey Michael Franc
- University of Alberta, Faculty of Medicine and Dentistry, Department of Emergency Medicine, 750 University Terrace Building, 8303-112 Street NW, Edmonton T6G 2T4, Canada; Università del Piemonte Orientale, Center for Research and Training in Disaster Medicine, Humanitarian Aid, and Global Health, Via Lanino 1, Novara 28100, Italy
47
Fisher AD, Fisher G. Evaluating performance of custom GPT in anesthesia practice. J Clin Anesth 2024; 93:111371. [PMID: 38154443 DOI: 10.1016/j.jclinane.2023.111371] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/29/2023] [Accepted: 12/21/2023] [Indexed: 12/30/2023]
Affiliation(s)
- Andrew D Fisher
- Medical University of South Carolina, Department of Anesthesia and Perioperative Medicine, 167 Ashley Avenue, Suite 301, Charleston, SC 29464, United States of America.
- Gabrielle Fisher
- Medical University of South Carolina, Department of Anesthesia and Perioperative Medicine, 167 Ashley Avenue, Suite 301, Charleston, SC 29464, United States of America
48
Scott IA, Zuccon G. The new paradigm in machine learning - foundation models, large language models and beyond: a primer for physicians. Intern Med J 2024; 54:705-715. [PMID: 38715436 DOI: 10.1111/imj.16393] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/09/2024] [Accepted: 03/26/2024] [Indexed: 05/18/2024]
Abstract
Foundation machine learning models are deep learning models capable of performing many different tasks using different data modalities such as text, audio, images and video. They represent a major shift from traditional task-specific machine learning prediction models. Large language models (LLMs), brought to wide public prominence in the form of ChatGPT, are text-based foundation models that have the potential to transform medicine by enabling automation of a range of tasks, including writing discharge summaries, answering patients' questions and assisting in clinical decision-making. However, such models are not without risk and can potentially cause harm if their development, evaluation and use are devoid of proper scrutiny. This narrative review describes the different types of LLMs, their emerging applications, potential limitations and biases, and their likely future translation into clinical practice.
Affiliation(s)
- Ian A Scott
- Centre for Health Services Research, University of Queensland, Woolloongabba, Australia
- Guido Zuccon
- School of Electrical Engineering and Computer Sciences, The University of Queensland, St Lucia, Queensland, Australia
49
Knebel D, Priglinger S, Scherer N, Klaas J, Siedlecki J, Schworm B. Assessment of ChatGPT in the Prehospital Management of Ophthalmological Emergencies - An Analysis of 10 Fictional Case Vignettes. Klin Monbl Augenheilkd 2024; 241:675-681. [PMID: 37890504 DOI: 10.1055/a-2149-0447] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/29/2023]
Abstract
BACKGROUND The artificial intelligence (AI)-based platform ChatGPT (Chat Generative Pre-Trained Transformer, OpenAI LP, San Francisco, CA, USA) has gained impressive popularity in recent months. Its performance on case vignettes of general medical (non-ophthalmological) emergencies has been assessed - with very encouraging results. The purpose of this study was to assess the performance of ChatGPT on ophthalmological emergency case vignettes in terms of the main outcome measures triage accuracy, appropriateness of recommended prehospital measures, and overall potential to inflict harm to the user/patient. METHODS We wrote ten short, fictional case vignettes describing different acute ophthalmological symptoms. Each vignette was entered into ChatGPT five times with the same wording and following a standardized interaction pathway. The answers were analyzed following a systematic approach. RESULTS We observed a triage accuracy of 93.6%. Most answers contained only appropriate recommendations for prehospital measures. However, an overall potential to inflict harm to users/patients was present in 32% of answers. CONCLUSION ChatGPT should presently not be used as a stand-alone primary source of information about acute ophthalmological symptoms. As AI continues to evolve, its safety and efficacy in the prehospital management of ophthalmological emergencies has to be reassessed regularly.
Affiliation(s)
- Dominik Knebel
- Department of Ophthalmology, University Hospital, Ludwigs-Maximilians-Universität München, München, Germany
- Siegfried Priglinger
- Department of Ophthalmology, University Hospital, Ludwigs-Maximilians-Universität München, München, Germany
- Nicolas Scherer
- Department of Ophthalmology, University Hospital, Ludwigs-Maximilians-Universität München, München, Germany
- Julian Klaas
- Department of Ophthalmology, University Hospital, Ludwigs-Maximilians-Universität München, München, Germany
- Jakob Siedlecki
- Department of Ophthalmology, University Hospital, Ludwigs-Maximilians-Universität München, München, Germany
- Benedikt Schworm
- Department of Ophthalmology, University Hospital, Ludwigs-Maximilians-Universität München, München, Germany
50
Safrai M, Azaria A. Does small talk with a medical provider affect ChatGPT's medical counsel? Performance of ChatGPT on USMLE with and without distractions. PLoS One 2024; 19:e0302217. [PMID: 38687696 PMCID: PMC11060598 DOI: 10.1371/journal.pone.0302217] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/18/2023] [Accepted: 03/28/2024] [Indexed: 05/02/2024] Open
Abstract
Efforts are being made to improve the time effectiveness of healthcare providers. Artificial intelligence tools can help transcribe and summarize physician-patient encounters and produce medical notes and medical recommendations. However, in addition to medical information, discussion between healthcare providers and patients includes small talk and other information irrelevant to medical concerns. As Large Language Models (LLMs) are predictive models that build their responses from the words in the prompt, there is a risk that small talk and irrelevant information may alter the response and the suggestion given. Therefore, this study aims to investigate the impact of medical data mixed with small talk on the accuracy of medical advice provided by ChatGPT. USMLE Step 3 questions were used as a model for relevant medical data. We used both multiple-choice and open-ended questions. First, we gathered small talk sentences from human participants using the Mechanical Turk platform. Second, both sets of USMLE questions were arranged in a pattern where each sentence from the original questions was followed by a small talk sentence. ChatGPT 3.5 and 4 were asked to answer both sets of questions with and without the small talk sentences. Finally, a board-certified physician analyzed the answers by ChatGPT and compared them to the formal correct answer. The analysis results demonstrate that the ability of ChatGPT-3.5 to answer correctly was impaired when small talk was added to medical data (66.8% vs. 56.6%; p = 0.025); the drop was not significant for multiple-choice questions (72.1% vs. 68.9%; p = 0.67) but was significant for open-ended questions (61.5% vs. 44.3%; p = 0.01). In contrast, small talk phrases did not impair ChatGPT-4's performance on either type of question (83.6% and 66.2%, respectively). According to these results, ChatGPT-4 seems more accurate than the earlier 3.5 version, and it appears that small talk does not impair its capability to provide medical recommendations. Our results are an important first step in understanding the potential and limitations of utilizing ChatGPT and other LLMs for physician-patient interactions, which include casual conversations.
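The prompt construction described above, in which each sentence of a question is followed by an unrelated small-talk sentence, can be sketched in a few lines of Python. The question and small-talk sentences below are invented examples, not items from the study.

# Sketch: interleaving small-talk sentences between the sentences of a question.
import re

def interleave_small_talk(question: str, small_talk: list[str]) -> str:
    sentences = re.split(r"(?<=[.!?])\s+", question.strip())
    mixed = []
    for i, sentence in enumerate(sentences):
        mixed.append(sentence)
        mixed.append(small_talk[i % len(small_talk)])  # recycle small talk if needed
    return " ".join(mixed)

question = ("A 62-year-old man presents with chest pain. "
            "His blood pressure is 150/90 mm Hg. "
            "What is the next best step in management?")
chatter = ["My neighbor just got a new puppy.",
           "I really need a coffee this morning.",
           "The traffic was terrible today."]
print(interleave_small_talk(question, chatter))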
Affiliation(s)
- Myriam Safrai
- Department of Obstetrics and Gynecology, Chaim Sheba Medical Center (Tel Hashomer), Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
- Department of Obstetrics, Gynecology and Reproductive Sciences, Magee-Womens Research Institute, University of Pittsburgh School of Medicine, Pittsburgh, PA, United States of America
- Amos Azaria
- School of Computer Science, Ariel University, Ari’el, Israel