1. Ramasubramanian S, Balaji S, Kannan T, Jeyaraman N, Sharma S, Migliorini F, Balasubramaniam S, Jeyaraman M. Comparative evaluation of artificial intelligence systems' accuracy in providing medical drug dosages: A methodological study. World J Methodol 2024; 14:92802. [DOI: 10.5662/wjm.v14.i4.92802]
Abstract
BACKGROUND Medication errors, especially in dosage calculation, pose risks in healthcare. Artificial intelligence (AI) systems like ChatGPT and Google Bard may help reduce errors, but their accuracy in providing medication information remains to be evaluated.
AIM To evaluate the accuracy of AI systems (ChatGPT 3.5, ChatGPT 4, Google Bard) in providing drug dosage information per Harrison's Principles of Internal Medicine.
METHODS A set of natural language queries mimicking real-world medical dosage inquiries was presented to the AI systems. Responses were analyzed using a 3-point Likert scale. The analysis, conducted with Python and its libraries, focused on basic statistics, overall system accuracy, and disease-specific and organ system accuracies.
RESULTS ChatGPT 4 outperformed the other systems, showing the highest rate of correct responses (83.77%) and the best overall weighted accuracy (0.6775). Disease-specific accuracy varied notably across systems, with some diseases being accurately recognized, while others demonstrated significant discrepancies. Organ system accuracy also showed variable results, underscoring system-specific strengths and weaknesses.
CONCLUSION ChatGPT 4 demonstrates superior reliability in medical dosage information, yet variations across diseases emphasize the need for ongoing improvements. These results highlight AI's potential in aiding healthcare professionals, urging continuous development for dependable accuracy in critical medical situations.
Affiliation(s)
- Swaminathan Ramasubramanian: Department of Orthopaedics, Government Medical College, Omandurar Government Estate, Chennai 600002, Tamil Nadu, India
- Sangeetha Balaji: Department of Orthopaedics, Government Medical College, Omandurar Government Estate, Chennai 600002, Tamil Nadu, India
- Tejashri Kannan: Department of Orthopaedics, Government Medical College, Omandurar Government Estate, Chennai 600002, Tamil Nadu, India
- Naveen Jeyaraman: Department of Orthopaedics, ACS Medical College and Hospital, Dr MGR Educational and Research Institute, Chennai 600077, Tamil Nadu, India
- Shilpa Sharma: Department of Paediatric Surgery, All India Institute of Medical Sciences, New Delhi 110029, India
- Filippo Migliorini: Department of Life Sciences, Health, Link Campus University, Rome 00165, Italy; Department of Orthopaedic and Trauma Surgery, Academic Hospital of Bolzano (SABES-ASDAA), Teaching Hospital of the Paracelsus Medical University, Bolzano 39100, Italy
- Suhasini Balasubramaniam: Department of Radio-Diagnosis, Government Stanley Medical College and Hospital, Chennai 600001, Tamil Nadu, India
- Madhan Jeyaraman: Department of Orthopaedics, ACS Medical College and Hospital, Dr MGR Educational and Research Institute, Chennai 600077, Tamil Nadu, India
2. Xu X, Yang Y, Tan X, Zhang Z, Wang B, Yang X, Weng C, Yu R, Zhao Q, Quan S. Hepatic encephalopathy post-TIPS: Current status and prospects in predictive assessment. Comput Struct Biotechnol J 2024; 24:493-506. [PMID: 39076168] [PMCID: PMC11284497] [DOI: 10.1016/j.csbj.2024.07.008]
Abstract
Transjugular intrahepatic portosystemic shunt (TIPS) is an essential procedure for the treatment of portal hypertension but can result in hepatic encephalopathy (HE), a serious complication that worsens patient outcomes. Investigating predictors of HE after TIPS is essential to improve prognosis. This review analyzes risk factors and compares predictive models, weighing traditional scores such as Child-Pugh, Model for End-Stage Liver Disease (MELD), and albumin-bilirubin (ALBI) against emerging artificial intelligence (AI) techniques. While traditional scores provide initial insights into HE risk, they have limitations in dealing with clinical complexity. Advances in machine learning (ML), particularly when integrated with imaging and clinical data, offer refined assessments. These innovations suggest the potential for AI to significantly improve the prediction of post-TIPS HE. The study provides clinicians with a comprehensive overview of current prediction methods, while advocating for the integration of AI to increase the accuracy of post-TIPS HE assessments. By harnessing the power of AI, clinicians can better manage the risks associated with TIPS and tailor interventions to individual patient needs. Future research should therefore prioritize the development of advanced AI frameworks that can assimilate diverse data streams to support clinical decision-making. The goal is not only to more accurately predict HE, but also to improve overall patient care and quality of life.
Affiliation(s)
- Xiaowei Xu: Department of Gastroenterology Nursing Unit, Ward 192, The First Affiliated Hospital of Wenzhou Medical University, Wenzhou 325000, China
- Yun Yang: School of Nursing, Wenzhou Medical University, Wenzhou 325001, China
- Xinru Tan: The First School of Medicine, School of Information and Engineering, Wenzhou Medical University, Wenzhou 325001, China
- Ziyang Zhang: School of Clinical Medicine, Guizhou Medical University, Guiyang 550025, China
- Boxiang Wang: The First School of Medicine, School of Information and Engineering, Wenzhou Medical University, Wenzhou 325001, China
- Xiaojie Yang: Wenzhou Medical University Renji College, Wenzhou 325000, China
- Chujun Weng: The Fourth Affiliated Hospital, Zhejiang University School of Medicine, Yiwu 322000, China
- Rongwen Yu: Wenzhou Institute, University of Chinese Academy of Sciences, Wenzhou 325000, China
- Qi Zhao: School of Computer Science and Software Engineering, University of Science and Technology Liaoning, Anshan 114051, China
- Shichao Quan: Department of Big Data in Health Science, The First Affiliated Hospital of Wenzhou Medical University, Wenzhou 325000, China
3. Hua R, Dong X, Wei Y, Shu Z, Yang P, Hu Y, Zhou S, Sun H, Yan K, Yan X, Chang K, Li X, Bai Y, Zhang R, Wang W, Zhou X. Lingdan: enhancing encoding of traditional Chinese medicine knowledge for clinical reasoning tasks with large language models. J Am Med Inform Assoc 2024; 31:2019-2029. [PMID: 39038795] [PMCID: PMC11339528] [DOI: 10.1093/jamia/ocae087]
Abstract
OBJECTIVE The recent surge in large language models (LLMs) across various fields has yet to be fully realized in traditional Chinese medicine (TCM). This study aims to bridge this gap by developing a large language model tailored to TCM knowledge, enhancing its performance and accuracy in clinical reasoning tasks such as diagnosis, treatment, and prescription recommendations. MATERIALS AND METHODS This study harnessed a wide array of TCM data resources, including TCM ancient books, textbooks, and clinical data, to create 3 key datasets: the TCM Pre-trained Dataset, the Traditional Chinese Patent Medicine (TCPM) Question Answering Dataset, and the Spleen and Stomach Herbal Prescription Recommendation Dataset. These datasets underpinned the development of the Lingdan Pre-trained LLM and 2 specialized models: the Lingdan-TCPM-Chat Model, which uses a Chain-of-Thought process for symptom analysis and TCPM recommendation, and a Lingdan Prescription Recommendation model (Lingdan-PR) that proposes herbal prescriptions based on electronic medical records. RESULTS The Lingdan-TCPM-Chat Model and the Lingdan-PR Model, fine-tuned on the Lingdan Pre-trained LLM, demonstrated state-of-the-art performance on the tasks of TCM clinical knowledge answering and herbal prescription recommendation. Notably, Lingdan-PR outperformed all state-of-the-art baseline models, achieving an improvement of 18.39% in the Top@20 F1-score compared with the best baseline. CONCLUSION This study marks a pivotal step in merging advanced LLMs with TCM, showcasing the potential of artificial intelligence to help improve clinical decision-making in medical diagnostics and treatment strategies. The success of the Lingdan Pre-trained LLM and its derivative models, Lingdan-TCPM-Chat and Lingdan-PR, not only revolutionizes TCM practices but also opens new avenues for the application of artificial intelligence in other specialized medical fields. Our project is available at https://github.com/TCMAI-BJTU/LingdanLLM.
Affiliation(s)
- Rui Hua: Institute of Medical Intelligence, Beijing Key Lab of Traffic Data Analysis and Mining, School of Computer Science & Technology, Beijing Jiaotong University, Beijing 100044, China
- Xin Dong: Institute of Medical Intelligence, Beijing Key Lab of Traffic Data Analysis and Mining, School of Computer Science & Technology, Beijing Jiaotong University, Beijing 100044, China
- Yu Wei: Innovation Center of Digital & Intelligent Chinese Medicine, Tasly Pharmaceutical Group Co., Ltd., Tianjin 300410, China
- Zixin Shu: Institute of Medical Intelligence, Beijing Key Lab of Traffic Data Analysis and Mining, School of Computer Science & Technology, Beijing Jiaotong University, Beijing 100044, China
- Pengcheng Yang: Innovation Center of Digital & Intelligent Chinese Medicine, Tasly Pharmaceutical Group Co., Ltd., Tianjin 300410, China
- Yunhui Hu: Innovation Center of Digital & Intelligent Chinese Medicine, Tasly Pharmaceutical Group Co., Ltd., Tianjin 300410, China
- Shuiping Zhou: Innovation Center of Digital & Intelligent Chinese Medicine, Tasly Pharmaceutical Group Co., Ltd., Tianjin 300410, China
- He Sun: Innovation Center of Digital & Intelligent Chinese Medicine, Tasly Pharmaceutical Group Co., Ltd., Tianjin 300410, China
- Kaijing Yan: Innovation Center of Digital & Intelligent Chinese Medicine, Tasly Pharmaceutical Group Co., Ltd., Tianjin 300410, China
- Xijun Yan: Innovation Center of Digital & Intelligent Chinese Medicine, Tasly Pharmaceutical Group Co., Ltd., Tianjin 300410, China
- Kai Chang: Institute of Medical Intelligence, Beijing Key Lab of Traffic Data Analysis and Mining, School of Computer Science & Technology, Beijing Jiaotong University, Beijing 100044, China
- Xiaodong Li: Affiliated Hospital of Hubei University of Chinese Medicine, Wuhan 430065, China; Hubei Academy of Chinese Medicine, Wuhan 430061, China; Institute of Liver Diseases, Hubei Key Laboratory of Theoretical and Applied Research of Liver and Kidney in Traditional Chinese Medicine, Hubei Provincial Hospital of Traditional Chinese Medicine, Wuhan 430061, China
- Yuning Bai: Department of Gastroenterology, Guang’anmen Hospital, China Academy of Chinese Medical Sciences, Beijing 100053, China
- Runshun Zhang: Department of Gastroenterology, Guang’anmen Hospital, China Academy of Chinese Medical Sciences, Beijing 100053, China
- Wenjia Wang: Innovation Center of Digital & Intelligent Chinese Medicine, Tasly Pharmaceutical Group Co., Ltd., Tianjin 300410, China
- Xuezhong Zhou: Institute of Medical Intelligence, Beijing Key Lab of Traffic Data Analysis and Mining, School of Computer Science & Technology, Beijing Jiaotong University, Beijing 100044, China
4. Andreadis K, Newman DR, Twan C, Shunk A, Mann DM, Stevens ER. Mixed methods assessment of the influence of demographics on medical advice of ChatGPT. J Am Med Inform Assoc 2024; 31:2002-2009. [PMID: 38679900] [PMCID: PMC11339520] [DOI: 10.1093/jamia/ocae086]
Abstract
OBJECTIVES To evaluate demographic biases in diagnostic accuracy and health advice between generative artificial intelligence (AI) (ChatGPT GPT-4) and traditional symptom checkers like WebMD. MATERIALS AND METHODS Combination symptom and demographic vignettes were developed for 27 most common symptom complaints. Standardized prompts, written from a patient perspective, with varying demographic permutations of age, sex, and race/ethnicity were entered into ChatGPT (GPT-4) between July and August 2023. In total, 3 runs of 540 ChatGPT prompts were compared to the corresponding WebMD Symptom Checker output using a mixed-methods approach. In addition to diagnostic correctness, the associated text generated by ChatGPT was analyzed for readability (using Flesch-Kincaid Grade Level) and qualitative aspects like disclaimers and demographic tailoring. RESULTS ChatGPT matched WebMD in 91% of diagnoses, with a 24% top diagnosis match rate. Diagnostic accuracy was not significantly different across demographic groups, including age, race/ethnicity, and sex. ChatGPT's urgent care recommendations and demographic tailoring were presented significantly more to 75-year-olds versus 25-year-olds (P < .01) but were not statistically different among race/ethnicity and sex groups. The GPT text was suitable for college students, with no significant demographic variability. DISCUSSION The use of non-health-tailored generative AI, like ChatGPT, for simple symptom-checking functions provides comparable diagnostic accuracy to commercially available symptom checkers and does not demonstrate significant demographic bias in this setting. The text accompanying differential diagnoses, however, suggests demographic tailoring that could potentially introduce bias. CONCLUSION These results highlight the need for continued rigorous evaluation of AI-driven medical platforms, focusing on demographic biases to ensure equitable care.
Affiliation(s)
- Katerina Andreadis: Department of Population Health, NYU Grossman School of Medicine, New York, NY 10016, United States
- Devon R Newman: Department of Population Health, NYU Grossman School of Medicine, New York, NY 10016, United States; Brown University, Providence, RI 02912, United States
- Chelsea Twan: Department of Population Health, NYU Grossman School of Medicine, New York, NY 10016, United States
- Amelia Shunk: Department of Population Health, NYU Grossman School of Medicine, New York, NY 10016, United States
- Devin M Mann: Department of Population Health, NYU Grossman School of Medicine, New York, NY 10016, United States; Medical Center Information Technology, NYU Langone Health, New York, NY 10016, United States
- Elizabeth R Stevens: Department of Population Health, NYU Grossman School of Medicine, New York, NY 10016, United States
5. Warrier A, Singh R, Haleem A, Zaki H, Eloy JA. The Comparative Diagnostic Capability of Large Language Models in Otolaryngology. Laryngoscope 2024; 134:3997-4002. [PMID: 38563415] [DOI: 10.1002/lary.31434]
Abstract
OBJECTIVES Evaluate and compare the ability of large language models (LLMs) to diagnose various ailments in otolaryngology. METHODS We collected all 100 clinical vignettes from the second edition of Otolaryngology Cases-The University of Cincinnati Clinical Portfolio by Pensak et al. With the addition of the prompt "Provide a diagnosis given the following history," we prompted ChatGPT-3.5, Google Bard, and Bing-GPT4 to provide a diagnosis for each vignette. These diagnoses were compared to the portfolio for accuracy and recorded. All queries were run in June 2023. RESULTS ChatGPT-3.5 was the most accurate model (89% success rate), followed by Google Bard (82%) and Bing GPT (74%). A chi-squared test revealed a significant difference between the three LLMs in providing correct diagnoses (p = 0.023). Of the 100 vignettes, seven require additional testing results (i.e., biopsy, non-contrast CT) for accurate clinical diagnosis. When omitting these vignettes, the revised success rates were 95.7% for ChatGPT-3.5, 88.17% for Google Bard, and 78.72% for Bing-GPT4 (p = 0.002). CONCLUSIONS ChatGPT-3.5 offers the most accurate diagnoses when given established clinical vignettes as compared to Google Bard and Bing-GPT4. LLMs may accurately offer assessments for common otolaryngology conditions but currently require detailed prompt information and critical supervision from clinicians. There is vast potential in the clinical applicability of LLMs; however, practitioners should be wary of possible "hallucinations" and misinformation in responses. LEVEL OF EVIDENCE 3 Laryngoscope, 134:3997-4002, 2024.
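The significance test reported in this abstract can be reproduced from the stated success rates alone, assuming each model was run on the same 100 vignettes and a chi-squared test was applied to the 3x2 table of correct versus incorrect counts (an assumption about the authors' exact setup, not a detail given in the abstract). A minimal sketch:

```python
from scipy.stats import chi2_contingency

# Correct vs incorrect diagnoses out of 100 vignettes per model,
# reconstructed from the reported success rates (89%, 82%, 74%).
observed = [
    [89, 11],  # ChatGPT-3.5
    [82, 18],  # Google Bard
    [74, 26],  # Bing-GPT4
]
chi2, p, dof, expected = chi2_contingency(observed)
print(round(chi2, 2), round(p, 3))  # p is approximately 0.023, matching the value in the abstract
```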
Affiliation(s)
- Akshay Warrier: Department of Otolaryngology-Head and Neck Surgery, Rutgers New Jersey Medical School, Newark, New Jersey, U.S.A.
- Rohan Singh: Department of Otolaryngology-Head and Neck Surgery, Rutgers New Jersey Medical School, Newark, New Jersey, U.S.A.
- Afash Haleem: Department of Otolaryngology-Head and Neck Surgery, Rutgers New Jersey Medical School, Newark, New Jersey, U.S.A.
- Haider Zaki: Department of Otolaryngology-Head and Neck Surgery, Rutgers New Jersey Medical School, Newark, New Jersey, U.S.A.
- Jean Anderson Eloy: Department of Otolaryngology-Head and Neck Surgery, Rutgers New Jersey Medical School, Newark, New Jersey, U.S.A.; Center for Skull Base and Pituitary Surgery, Neurological Institute of New Jersey, Rutgers New Jersey Medical School, Newark, New Jersey, U.S.A.
6. Wu G, Lee DA, Zhao W, Wong A, Jhangiani R, Kurniawan S. ChatGPT and Google Assistant as a Source of Patient Education for Patients With Amblyopia: Content Analysis. J Med Internet Res 2024; 26:e52401. [PMID: 39146013] [DOI: 10.2196/52401]
Abstract
BACKGROUND We queried ChatGPT (OpenAI) and Google Assistant about amblyopia and compared their answers with the keywords found on the American Association for Pediatric Ophthalmology and Strabismus (AAPOS) website, specifically the section on amblyopia. Out of the 26 keywords chosen from the website, ChatGPT included 11 (42%) in its responses, while Google included 8 (31%). OBJECTIVE Our study investigated the adherence of ChatGPT-3.5 and Google Assistant to the guidelines of the AAPOS for patient education on amblyopia. METHODS ChatGPT-3.5 was used. The four questions taken from the AAPOS website, specifically its glossary section for amblyopia, are as follows: (1) What is amblyopia? (2) What causes amblyopia? (3) How is amblyopia treated? (4) What happens if amblyopia is untreated? Approved and selected by ophthalmologists (GW and DL), the keywords from AAPOS were words or phrases deemed significant for the education of patients with amblyopia. The "Flesch-Kincaid Grade Level" formula, approved by the US Department of Education, was used to evaluate the reading comprehension level of the responses from ChatGPT, Google Assistant, and AAPOS. RESULTS In their responses, ChatGPT did not mention the term "ophthalmologist," whereas Google Assistant and AAPOS mentioned the term once and twice, respectively. ChatGPT did, however, use the term "eye doctors" once. According to the Flesch-Kincaid test, the average reading level of AAPOS was 11.4 (SD 2.1; the lowest level), while that of Google was 13.1 (SD 4.8; the highest required reading level), also showing the greatest variation in grade level across its responses. ChatGPT's answers, on average, scored at the 12.4 (SD 1.1) grade level; all three sources were similar in reading difficulty. For the keywords, out of the 4 responses, ChatGPT used 42% (11/26) of the keywords, whereas Google Assistant used 31% (8/26). CONCLUSIONS ChatGPT trains on texts and phrases and generates new sentences, while Google Assistant automatically copies website links. As ophthalmologists, we should consider including "see an ophthalmologist" on our websites and journals. While ChatGPT is here to stay, we, as physicians, need to monitor its answers.
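The Flesch-Kincaid Grade Level used in this study (and in the WebMD comparison in entry 4) is a fixed formula over sentence length and syllable density: FKGL = 0.39 (words per sentence) + 11.8 (syllables per word) - 15.59. A minimal sketch of that calculation follows; the syllable counter is a crude vowel-group heuristic used here only as a stand-in, not the validated counter a readability tool would use.

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count groups of consecutive vowels; real tools use dictionary-based counts.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    # FKGL = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / len(sentences)) + 11.8 * (syllables / len(words)) - 15.59

sample = "Amblyopia is reduced vision in one eye caused by abnormal visual development."
print(round(flesch_kincaid_grade(sample), 1))
```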
Affiliation(s)
- Gloria Wu: University of California, San Francisco School of Medicine, San Francisco, CA, United States
- David A Lee: McGovern Medical School, University of Texas Health Science Center at Houston, Houston, TX, United States
- Weichen Zhao: College of Biological Sciences, University of California, Davis, Davis, CA, United States
- Adrial Wong: College of Biological Sciences, University of California, Davis, Davis, CA, United States
- Rohan Jhangiani: Department of Computational Media, University of California, Santa Cruz, Santa Cruz, CA, United States
- Sri Kurniawan: Department of Computational Media, University of California, Santa Cruz, Santa Cruz, CA, United States
7. Takahashi H, Shikino K, Kondo T, Komori A, Yamada Y, Saita M, Naito T. Educational Utility of Clinical Vignettes Generated in Japanese by ChatGPT-4: Mixed Methods Study. JMIR Med Educ 2024; 10:e59133. [PMID: 39137031] [DOI: 10.2196/59133]
Abstract
BACKGROUND Evaluating the accuracy and educational utility of artificial intelligence-generated medical cases, especially those produced by large language models such as ChatGPT-4 (developed by OpenAI), is crucial yet underexplored. OBJECTIVE This study aimed to assess the educational utility of ChatGPT-4-generated clinical vignettes and their applicability in educational settings. METHODS Using a convergent mixed methods design, a web-based survey was conducted from January 8 to 28, 2024, to evaluate 18 medical cases generated by ChatGPT-4 in Japanese. In the survey, 6 main question items were used to evaluate the quality of the generated clinical vignettes and their educational utility, which are information quality, information accuracy, educational usefulness, clinical match, terminology accuracy (TA), and diagnosis difficulty. Feedback was solicited from physicians specializing in general internal medicine or general medicine and experienced in medical education. Chi-square and Mann-Whitney U tests were performed to identify differences among cases, and linear regression was used to examine trends associated with physicians' experience. Thematic analysis of qualitative feedback was performed to identify areas for improvement and confirm the educational utility of the cases. RESULTS Of the 73 invited participants, 71 (97%) responded. The respondents, primarily male (64/71, 90%), spanned a broad range of practice years (from 1976 to 2017) and represented diverse hospital sizes throughout Japan. The majority deemed the information quality (mean 0.77, 95% CI 0.75-0.79) and information accuracy (mean 0.68, 95% CI 0.65-0.71) to be satisfactory, with these responses being based on binary data. The average scores assigned were 3.55 (95% CI 3.49-3.60) for educational usefulness, 3.70 (95% CI 3.65-3.75) for clinical match, 3.49 (95% CI 3.44-3.55) for TA, and 2.34 (95% CI 2.28-2.40) for diagnosis difficulty, based on a 5-point Likert scale. Statistical analysis showed significant variability in content quality and relevance across the cases (P<.001 after Bonferroni correction). Participants suggested improvements in generating physical findings, using natural language, and enhancing medical TA. The thematic analysis highlighted the need for clearer documentation, clinical information consistency, content relevance, and patient-centered case presentations. CONCLUSIONS ChatGPT-4-generated medical cases written in Japanese possess considerable potential as resources in medical education, with recognized adequacy in quality and accuracy. Nevertheless, there is a notable need for enhancements in the precision and realism of case details. This study emphasizes ChatGPT-4's value as an adjunctive educational tool in the medical field, requiring expert oversight for optimal application.
Affiliation(s)
- Hiromizu Takahashi: Department of General Medicine, Juntendo University Faculty of Medicine, Tokyo, Japan
- Kiyoshi Shikino: Department of Community-Oriented Medical Education, Chiba University Graduate School of Medicine, Chiba, Japan
- Takeshi Kondo: Center for Postgraduate Clinical Training and Career Development, Nagoya University Hospital, Aichi, Japan
- Akira Komori: Department of General Medicine, Juntendo University Faculty of Medicine, Tokyo, Japan; Department of Emergency and Critical Care Medicine, Tsukuba Memorial Hospital, Tsukuba, Japan
- Yuji Yamada: Brookdale Department of Geriatrics and Palliative Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, United States
- Mizue Saita: Department of General Medicine, Juntendo University Faculty of Medicine, Tokyo, Japan
- Toshio Naito: Department of General Medicine, Juntendo University Faculty of Medicine, Tokyo, Japan
8. Fatima A, Shafique MA, Alam K, Fadlalla Ahmed TK, Mustafa MS. ChatGPT in medicine: A cross-disciplinary systematic review of ChatGPT's (artificial intelligence) role in research, clinical practice, education, and patient interaction. Medicine (Baltimore) 2024; 103:e39250. [PMID: 39121303] [PMCID: PMC11315549] [DOI: 10.1097/md.0000000000039250]
Abstract
BACKGROUND ChatGPT, a powerful AI language model, has gained increasing prominence in medicine, offering potential applications in healthcare, clinical decision support, patient communication, and medical research. This systematic review aims to comprehensively assess the applications of ChatGPT in healthcare education, research, writing, patient communication, and practice, while also delineating potential limitations and areas for improvement. METHODS Our comprehensive database search retrieved relevant papers from PubMed, Medline, and Scopus. After the screening process, 83 studies met the inclusion criteria. This review includes original studies comprising case reports, analytical studies, and editorials with original findings. RESULTS ChatGPT is useful for scientific research and academic writing, and assists with grammar, clarity, and coherence. This helps non-English speakers and improves accessibility by breaking down linguistic barriers. However, its limitations include probable inaccuracy and ethical issues, such as bias and plagiarism. ChatGPT streamlines workflows and offers diagnostic and educational potential in healthcare but exhibits biases and lacks emotional sensitivity. It is useful in patient communication, but requires up-to-date data and faces concerns about the accuracy of information and hallucinatory responses. CONCLUSION Given the potential for ChatGPT to transform healthcare education, research, and practice, it is essential to approach its adoption in these areas with caution due to its inherent limitations.
Affiliation(s)
- Afia Fatima: Department of Medicine, Jinnah Sindh Medical University, Karachi, Pakistan
- Khadija Alam: Department of Medicine, Liaquat National Medical College, Karachi, Pakistan
9. Zhang X, Zhang D, Zhang X, Zhang X. Artificial intelligence applications in the diagnosis and treatment of bacterial infections. Front Microbiol 2024; 15:1449844. [PMID: 39165576] [PMCID: PMC11334354] [DOI: 10.3389/fmicb.2024.1449844]
Abstract
The diagnosis and treatment of bacterial infections in the medical and public health field in the 21st century remain significantly challenging. Artificial intelligence (AI) has emerged as a powerful new tool in diagnosing and treating bacterial infections. AI is rapidly revolutionizing epidemiological studies of infectious diseases, providing effective early warning, prevention, and control of outbreaks. Machine learning models provide a highly flexible way to simulate and predict the complex mechanisms of pathogen-host interactions, which is crucial for a comprehensive understanding of the nature of diseases. Machine learning-based pathogen identification technology and antimicrobial drug susceptibility testing break through the limitations of traditional methods, significantly shorten the time from sample collection to the determination of results, and greatly improve the speed and accuracy of laboratory testing. In addition, the application of AI technology in treating bacterial infections, particularly in the research and development of drugs and vaccines and the application of innovative therapies such as bacteriophage therapy, provides new strategies for improving therapy and curbing bacterial resistance. Although AI has broad application prospects in diagnosing and treating bacterial infections, significant challenges remain in data quality and quantity, model interpretability, clinical integration, and patient privacy protection. To overcome these challenges and realize widespread application in clinical practice, interdisciplinary cooperation, technological innovation, and policy support are essential. In summary, with continuous advancements and the in-depth application of AI technology, AI will enable doctors to more effectively address the challenge of bacterial infection, promoting the development of medical practice toward precision, efficiency, and personalization; optimizing the best nursing and treatment plans for patients; and providing strong support for public health safety.
Affiliation(s)
- Xiaoyu Zhang: First Department of Infectious Diseases, The First Affiliated Hospital of China Medical University, Shenyang, China
- Deng Zhang: Department of Infectious Diseases, The First Affiliated Hospital of Xiamen University, Xiamen, China
- Xifan Zhang: First Department of Infectious Diseases, The First Affiliated Hospital of China Medical University, Shenyang, China
- Xin Zhang: First Department of Infectious Diseases, The First Affiliated Hospital of China Medical University, Shenyang, China
10. Patel MA, Villalobos F, Shan K, Tardo LM, Horton LA, Sguigna PV, Blackburn KM, Munoz SB, Moog TM, Smith AD, Burgess KW, McCreary M, Okuda DT. Generative artificial intelligence versus clinicians: Who diagnoses multiple sclerosis faster and with greater accuracy? Mult Scler Relat Disord 2024; 90:105791. [PMID: 39146892] [DOI: 10.1016/j.msard.2024.105791]
Abstract
BACKGROUND Those receiving the diagnosis of multiple sclerosis (MS) over the next ten years will predominantly be part of Generation Z (Gen Z). Recent observations within our clinic suggest that younger people with MS utilize online generative artificial intelligence (AI) platforms for personalized medical advice prior to their first visit with a specialist in neuroimmunology. The use of such platforms is anticipated to increase given the technology driven nature, desire for instant communication, and cost-conscious nature of Gen Z. Our objective was to determine if ChatGPT (Generative Pre-trained Transformer) could diagnose MS in individuals earlier than their clinical timeline, and to assess if the accuracy differed based on age, sex, and race/ethnicity. METHODS People with MS between 18 and 59 years of age were studied. The clinical timeline for people diagnosed with MS was retrospectively identified and simulated using ChatGPT-3.5 (GPT-3.5). Chats were conducted using both actual and derivatives of their age, sex, and race/ethnicity to test diagnostic accuracy. A Kaplan-Meier survival curve was estimated for time to diagnosis, clustered by subject. The p-value testing for differences in time to diagnosis was accomplished using a general Wilcoxon test. Logistic regression (subject-specific intercept) was used to capture intra-subject correlation to test the accuracy prior to and after the inclusion of MRI data. RESULTS The study cohort included 100 unique people with MS. Of those, 50 were members of Gen Z (38 female; 22 White; mean age at first symptom was 20.6 years (y) (standard deviation (SD)=2.2y)), and 50 were non-Gen Z (34 female; 27 White; mean age at first symptom was 37.0y (SD=10.4y)). In addition, a total of 529 people that represented digital simulations of the original cohort of 100 people (333 female; 166 White; 136 Black/African American; 107 Asian; 120 Hispanic, mean age at first symptom was 31.6y (SD=12.4y)) were generated allowing for 629 scripted conversations to be analyzed. The estimated median time to diagnosis in clinic was significantly longer at 0.35y (95% CI=[0.28, 0.48]) versus that by ChatGPT at 0.08y (95% CI=[0.04, 0.24]) (p<0.0001). There was no difference in the diagnostic accuracy between ages and by race/ethnicity prior to the inclusion of MRI data. However, prior to including the MRI data, males had a 47% less likely chance of a correct diagnosis relative to females (p=0.05). Post-MRI data inclusion within GPT-3.5, the odds of an accurate diagnosis was 4.0-fold greater for Gen Z participants, relative to non-Gen Z participants (p=0.01) with the diagnostic accuracy being 68% less in males relative to females (p=0.009), and 75% less for White subjects, relative to non-White subjects (p=0.0004). CONCLUSION Although generative AI platforms enable rapid information access and are not principally designed for use in healthcare, an increase in use by Gen Z is anticipated. However, the obtained responses may not be generalizable to all users and bias may exist in select groups.
Affiliation(s)
- Mahi A Patel, Francisco Villalobos, Lauren M Tardo, Lindsay A Horton, Peter V Sguigna, Kyle M Blackburn, Shanan B Munoz, Tatum M Moog, Katy W Burgess, Morgan McCreary, Darin T Okuda: The University of Texas Southwestern Medical Center, Department of Neurology, Neuroinnovation Program, Multiple Sclerosis & Neuroimmunology Imaging Program, Dallas, TX, USA; The University of Texas Southwestern Medical Center, Peter O'Donnell Jr. Brain Institute, Dallas, TX, USA
- Kevin Shan: The University of Texas Southwestern Medical Center, School of Medicine, Dallas, TX, USA
- Alexander D Smith: Texas Tech University Health Sciences Center, School of Medicine, Lubbock, TX, USA
11. Gleber C, Fear K. Diagnostic reasoning in the age of artificial intelligence: Synergy or opposition? J Hosp Med 2024; 19:749-752. [PMID: 38340350] [DOI: 10.1002/jhm.13295]
Affiliation(s)
- Conrad Gleber: University of Rochester Medical Center, Rochester, New York, USA
- Kathleen Fear: UR Health Lab, University of Rochester Medical Center, Rochester, New York, USA
12. Palenzuela DL, Mullen JT, Phitayakorn R. AI Versus MD: Evaluating the surgical decision-making accuracy of ChatGPT-4. Surgery 2024; 176:241-245. [PMID: 38769038] [DOI: 10.1016/j.surg.2024.04.003]
Abstract
BACKGROUND ChatGPT-4 is a large language model with possible applications to surgery education. The aim of this study was to investigate the accuracy of ChatGPT-4's surgical decision-making compared with general surgery residents and attending surgeons. METHODS Five clinical scenarios were created from actual patient data based on common general surgery diagnoses. Scripts were developed to sequentially provide clinical information and ask decision-making questions. Responses to the prompts were scored based on a standardized rubric for a total of 50 points. Each clinical scenario was run through ChatGPT-4 and sent electronically to all general surgery residents and attendings at a single institution. Scores were compared using Wilcoxon rank sum tests. RESULTS On average, ChatGPT-4 scored 39.6 points (79.2%, standard deviation ± 0.89 points). A total of five junior residents, 12 senior residents, and five attendings completed the clinical scenarios (resident response rate = 15.9%; attending response rate = 13.8%). On average, the junior residents scored a total of 33.4 (66.8%, standard deviation ± 3.29), senior residents 38.0 (76.0%, standard deviation ± 4.75), and attendings 38.8 (77.6%, standard deviation ± 5.45). ChatGPT-4 scored significantly better than junior residents (P = .009) but was not significantly different from senior residents or attendings. ChatGPT-4 was significantly better than junior residents at identifying the correct operation to perform (P = .0182) and recommending additional workup for postoperative complications (P = .012). CONCLUSION ChatGPT-4 performed better than junior residents and equivalently to senior residents and attendings when faced with surgical patient scenarios. Large language models, such as ChatGPT, may have the potential to be an educational resource for junior residents to develop surgical decision-making skills.
Affiliation(s)
- Roy Phitayakorn: Massachusetts General Hospital, Boston, MA. https://www.twitter.com/RoyPhit
13. Mizuta K, Hirosawa T, Harada Y, Shimizu T. Can ChatGPT-4 evaluate whether a differential diagnosis list contains the correct diagnosis as accurately as a physician? Diagnosis (Berl) 2024; 11:321-324. [PMID: 38465399] [DOI: 10.1515/dx-2024-0027]
Abstract
OBJECTIVES The potential of artificial intelligence (AI) chatbots, particularly the fourth-generation chat generative pretrained transformer (ChatGPT-4), in assisting with medical diagnosis is an emerging research area. While there has been significant emphasis on creating lists of differential diagnoses, it is not yet clear how well AI chatbots can evaluate whether the final diagnosis is included in these lists. This short communication aimed to assess the accuracy of ChatGPT-4 in evaluating lists of differential diagnosis compared to medical professionals' assessments. METHODS We used ChatGPT-4 to evaluate whether the final diagnosis was included in the top 10 differential diagnosis lists created by physicians, ChatGPT-3, and ChatGPT-4, using clinical vignettes. Eighty-two clinical vignettes were used, comprising 52 complex case reports published by the authors from the department and 30 mock cases of common diseases created by physicians from the same department. We compared the agreement between ChatGPT-4 and the physicians on whether the final diagnosis was included in the top 10 differential diagnosis lists using the kappa coefficient. RESULTS Three sets of differential diagnoses were evaluated for each of the 82 cases, resulting in a total of 246 lists. The agreement rate between ChatGPT-4 and physicians was 236 out of 246 (95.9 %), with a kappa coefficient of 0.86, indicating very good agreement. CONCLUSIONS ChatGPT-4 demonstrated very good agreement with physicians in evaluating whether the final diagnosis should be included in the differential diagnosis lists.
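The agreement reported here (raw agreement of 236/246 and a kappa of 0.86) is consistent with Cohen's kappa, kappa = (p_o - p_e) / (1 - p_e), which discounts the chance agreement p_e estimated from each rater's marginal label frequencies. The sketch below illustrates that computation for binary include/exclude judgments; the example counts are hypothetical, since the study's full contingency table is not reproduced in the abstract.

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning categorical labels to the same items."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n  # p_o
    labels = set(rater_a) | set(rater_b)
    # Chance agreement p_e from each rater's marginal label frequencies.
    expected = sum((rater_a.count(lab) / n) * (rater_b.count(lab) / n) for lab in labels)
    return (observed - expected) / (1 - expected)

# Hypothetical example: physician vs ChatGPT-4 judgments ("final diagnosis included?") on 10 lists.
physician = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
chatgpt4  = [1, 1, 1, 0, 1, 0, 1, 0, 0, 1]
print(round(cohens_kappa(physician, chatgpt4), 2))
```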
Affiliation(s)
- Kazuya Mizuta: Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga-gun, Japan
- Takanobu Hirosawa: Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga-gun, Japan
- Yukinori Harada: Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga-gun, Japan
- Taro Shimizu: Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga-gun, Japan
14. Petrella RJ. The AI Future of Emergency Medicine. Ann Emerg Med 2024; 84:139-153. [PMID: 38795081] [DOI: 10.1016/j.annemergmed.2024.01.031]
Abstract
In the coming years, artificial intelligence (AI) and machine learning will likely give rise to profound changes in the field of emergency medicine, and medicine more broadly. This article discusses these anticipated changes in terms of 3 overlapping yet distinct stages of AI development. It reviews some fundamental concepts in AI and explores their relation to clinical practice, with a focus on emergency medicine. In addition, it describes some of the applications of AI in disease diagnosis, prognosis, and treatment, as well as some of the practical issues that they raise, the barriers to their implementation, and some of the legal and regulatory challenges they create.
Affiliation(s)
- Robert J Petrella: Emergency Departments, CharterCARE Health Partners, Providence and North Providence, RI; Emergency Department, Boston VA Medical Center, Boston, MA; Emergency Departments, Steward Health Care System, Boston and Methuen, MA; Harvard Medical School, Boston, MA; Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA; Department of Medicine, Brigham and Women's Hospital, Boston, MA
15. García-Méndez S, de Arriba-Pérez F. Large Language Models and Healthcare Alliance: Potential and Challenges of Two Representative Use Cases. Ann Biomed Eng 2024; 52:1928-1931. [PMID: 38310159] [DOI: 10.1007/s10439-024-03454-8]
Abstract
Large language models (LLMs) are emerging as the most promising natural language processing approach for accelerating clinical practice (i.e., diagnosis, prevention, and treatment procedures). Similarly, intelligent conversational systems that leverage LLMs have disruptively become the future of therapy in the era of ChatGPT. Accordingly, this research addresses the application of LLMs in healthcare, paying particular attention to two relevant use cases: cognitive decline and depression, more specifically postpartum depression. In the end, the most promising opportunities they represent (e.g., clinical task augmentation, personalized healthcare, etc.) and related concerns (e.g., data privacy and quality, fairness, etc.) are discussed to contribute to the global debate on their integration into the healthcare system.
16. Scott IA, Miller T, Crock C. Using conversant artificial intelligence to improve diagnostic reasoning: ready for prime time? Med J Aust 2024. [PMID: 39086025] [DOI: 10.5694/mja2.52401]
Affiliation(s)
- Ian A Scott: University of Queensland, Brisbane, QLD; Princess Alexandra Hospital, Brisbane, QLD
- Carmel Crock: Royal Victorian Eye and Ear Hospital, Melbourne, VIC
17. Kim P, Seo B, De Silva H. Concordance of clinician, Chat-GPT4, and ORAD diagnoses against histopathology in Odontogenic Keratocysts and tumours: a 15-Year New Zealand retrospective study. Oral Maxillofac Surg 2024. [PMID: 39060850] [DOI: 10.1007/s10006-024-01284-5]
Abstract
BACKGROUND This research aimed to investigate the concordance between clinical impressions and histopathologic diagnoses made by clinicians and artificial intelligence tools for odontogenic keratocysts (OKC) and odontogenic tumours (OT) in a New Zealand population from 2008 to 2023. METHODS Histopathological records from the Oral Pathology Centre, University of Otago (2008-2023) were examined to identify OKCs and OTs. Specimen referral details, histopathologic reports, and clinician differential diagnoses, as well as those provided by ORAD and Chat-GPT4, were documented. Data were analyzed using SPSS, and concordance between provisional and histopathologic diagnoses was ascertained. RESULTS Of the 34,225 biopsies, 302 and 321 samples were identified as OTs and OKCs, respectively. Concordance rates were 43.2% for clinicians, 45.6% for ORAD, and 41.4% for Chat-GPT4. The corresponding kappa values against the histological diagnosis were 0.23, 0.13, and 0.14. Surgeons achieved a higher concordance rate (47.7%) than non-surgeons (29.82%). The odds ratios of a concordant diagnosis using Chat-GPT4 and ORAD were between 1.4 and 2.8 (p < 0.05). ROC-AUC and PR-AUC values were similar between the groups for ameloblastoma (clinician 0.62/0.42, ORAD 0.58/0.28, Chat-GPT4 0.63/0.37) and for OKC (clinician 0.64/0.78, ORAD 0.66/0.77, Chat-GPT4 0.60/0.71). CONCLUSION Clinicians with surgical training achieved a higher concordance rate for OT and OKC. Chat-GPT4 and the Bayesian approach (ORAD) have shown potential in enhancing diagnostic capabilities.
Affiliation(s)
- Paul Kim: Oral and Maxillofacial Surgery Registrar, Dunedin Hospital, Dunedin, New Zealand
- Benedict Seo: Department of Oral Diagnostic and Surgical Sciences, University of Otago, Dunedin, New Zealand
- Harsha De Silva: Department of Oral Diagnostic and Surgical Sciences, University of Otago, Dunedin, New Zealand
18. Burke HB, Hoang A, Lopreiato JO, King H, Hemmer P, Montgomery M, Gagarin V. Assessing the Ability of a Large Language Model to Score Free-Text Medical Student Clinical Notes: Quantitative Study. JMIR Med Educ 2024; 10:e56342. [PMID: 39118469] [PMCID: PMC11327632] [DOI: 10.2196/56342]
Abstract
Background Teaching medical students the skills required to acquire, interpret, apply, and communicate clinical information is an integral part of medical education. A crucial aspect of this process involves providing students with feedback regarding the quality of their free-text clinical notes. Objective The goal of this study was to assess the ability of ChatGPT 3.5, a large language model, to score medical students' free-text history and physical notes. Methods This is a single-institution, retrospective study. Standardized patients learned a prespecified clinical case and, acting as the patient, interacted with medical students. Each student wrote a free-text history and physical note of their interaction. The students' notes were scored independently by the standardized patients and ChatGPT using a prespecified scoring rubric that consisted of 85 case elements. The measure of accuracy was percent correct. Results The study population consisted of 168 first-year medical students. There were a total of 14,280 scores. The ChatGPT incorrect scoring rate was 1.0%, and the standardized patient incorrect scoring rate was 7.2%. The ChatGPT error rate was thus 86% lower than the standardized patient error rate. The ChatGPT mean incorrect scoring rate of 12 (SD 11) was significantly lower than the standardized patient mean incorrect scoring rate of 85 (SD 74; P=.002). Conclusions ChatGPT demonstrated a significantly lower error rate compared to standardized patients. This is the first study to assess the ability of a generative pretrained transformer (GPT) program to score medical students' standardized patient-based free-text clinical notes. It is expected that, in the near future, large language models will provide real-time feedback to practicing physicians regarding their free-text notes. GPT artificial intelligence programs represent an important advance in medical education and medical practice.
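The 86% figure follows from the two incorrect scoring rates quoted above as a relative reduction, (7.2 - 1.0) / 7.2 ≈ 0.86; the short check below makes that arithmetic explicit.

```python
chatgpt_error_rate = 0.010                 # 1.0% of ChatGPT scores were incorrect
standardized_patient_error_rate = 0.072    # 7.2% of standardized patient scores were incorrect

# Relative reduction in error rate: (7.2 - 1.0) / 7.2, i.e., roughly 86% lower.
relative_reduction = (standardized_patient_error_rate - chatgpt_error_rate) / standardized_patient_error_rate
print(f"{relative_reduction:.0%}")  # prints 86%
```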
Affiliation(s)
- Harry B Burke: Uniformed Services University of the Health Sciences, Bethesda, MD 20814, United States
- Albert Hoang: Uniformed Services University of the Health Sciences, Bethesda, MD 20814, United States
- Joseph O Lopreiato: Uniformed Services University of the Health Sciences, Bethesda, MD 20814, United States
- Heidi King: Defense Health Agency, Falls Church, VA, United States
- Paul Hemmer: Uniformed Services University of the Health Sciences, Bethesda, MD 20814, United States
- Michael Montgomery: Uniformed Services University of the Health Sciences, Bethesda, MD 20814, United States
- Viktoria Gagarin: Uniformed Services University of the Health Sciences, Bethesda, MD 20814, United States
19. Pesapane F, Cuocolo R, Sardanelli F. The Picasso's skepticism on computer science and the dawn of generative AI: questions after the answers to keep "machines-in-the-loop". Eur Radiol Exp 2024; 8:81. [PMID: 39046535] [PMCID: PMC11269548] [DOI: 10.1186/s41747-024-00485-7]
Abstract
Starting from Picasso's quote ("Computers are useless. They can only give you answers"), we discuss the introduction of generative artificial intelligence (AI), including generative adversarial networks (GANs) and transformer-based architectures such as large language models (LLMs) in radiology, where their potential in reporting, image synthesis, and analysis is notable. However, the need for improvements, evaluations, and regulations prior to clinical use is also clear. Integration of LLMs into clinical workflow needs cautiousness, to avoid or at least mitigate risks associated with false diagnostic suggestions. We highlight challenges in synthetic image generation, inherent biases in AI models, and privacy concerns, stressing the importance of diverse training datasets and robust data privacy measures. We examine the regulatory landscape, including the 2023 Executive Order on AI in the United States and the 2024 AI Act in the European Union, which set standards for AI applications in healthcare. This manuscript contributes to the field by emphasizing the necessity of maintaining the human element in medical procedures while leveraging generative AI, advocating for a "machines-in-the-loop" approach.
Affiliation(s)
- Filippo Pesapane: Breast Imaging Division, IEO European Institute of Oncology IRCCS, Milan, Italy
- Renato Cuocolo: Department of Medicine, Surgery and Dentistry, University of Salerno, Via Salvador Allende 43, Baronissi, 84081, Salerno, Italy
- Francesco Sardanelli: Unit of Radiology, IRCCS Policlinico San Donato, Via Morandi 30, San Donato Milanese, 20097, Milan, Italy; Lega Italiana Tumori (LILT) Milano Monza Brianza, Piazzale Gorini 22, 20133, Milan, Italy
20. Kämmer JE, Hautz WE, Krummrey G, Sauter TC, Penders D, Birrenbach T, Bienefeld N. Effects of interacting with a large language model compared with a human coach on the clinical diagnostic process and outcomes among fourth-year medical students: study protocol for a prospective, randomised experiment using patient vignettes. BMJ Open 2024; 14:e087469. [PMID: 39025818] [PMCID: PMC11261684] [DOI: 10.1136/bmjopen-2024-087469]
Abstract
INTRODUCTION Versatile large language models (LLMs) have the potential to augment diagnostic decision-making by assisting diagnosticians, thanks to their ability to engage in open-ended, natural conversations and their comprehensive knowledge access. Yet the novelty of LLMs in diagnostic decision-making introduces uncertainties regarding their impact. Clinicians unfamiliar with the use of LLMs in their professional context may rely on general attitudes towards LLMs more broadly, potentially hindering thoughtful use and critical evaluation of their input, leading to either over-reliance and lack of critical thinking or an unwillingness to use LLMs as diagnostic aids. To address these concerns, this study examines the influence on the diagnostic process and outcomes of interacting with an LLM compared with a human coach, and of prior training vs no training for interacting with either of these 'coaches'. Our findings aim to illuminate the potential benefits and risks of employing artificial intelligence (AI) in diagnostic decision-making. METHODS AND ANALYSIS We are conducting a prospective, randomised experiment with N=158 fourth-year medical students from Charité Medical School, Berlin, Germany. Participants are asked to diagnose patient vignettes after being assigned to either a human coach or ChatGPT and after either training or no training (both between-subject factors). We are specifically collecting data on the effects of using either of these 'coaches' and of additional training on information search, number of hypotheses entertained, diagnostic accuracy and confidence. Statistical methods will include linear mixed effects models. Exploratory analyses of the interaction patterns and attitudes towards AI will also generate more generalisable knowledge about the role of AI in medicine. ETHICS AND DISSEMINATION The Bern Cantonal Ethics Committee considered the study exempt from full ethical review (BASEC No: Req-2023-01396). All methods will be conducted in accordance with relevant guidelines and regulations. Participation is voluntary and informed consent will be obtained. Results will be published in peer-reviewed scientific medical journals. Authorship will be determined according to the International Committee of Medical Journal Editors guidelines.
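As a rough illustration of the planned statistical approach, the sketch below fits a linear mixed-effects model with coach type and training as between-subject factors and a random intercept per participant. It uses pandas and statsmodels; the data frame, column names, and accuracy values are invented for illustration and do not come from the study.

```python
# Hedged sketch (invented data, not the study's): a linear mixed-effects model
# of diagnostic accuracy with coach type and training as between-subject
# factors and a random intercept per participant.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "participant": list(range(1, 9)) * 3,                 # 8 participants x 3 vignettes
    "coach":    (["llm"] * 4 + ["human"] * 4) * 3,        # between-subject factor 1
    "training": (["yes", "yes", "no", "no"] * 2) * 3,     # between-subject factor 2
    "accuracy": [0.8, 0.9, 0.6, 0.7, 0.7, 0.8, 0.6, 0.6,
                 0.9, 0.8, 0.5, 0.6, 0.8, 0.9, 0.5, 0.7,
                 0.7, 1.0, 0.7, 0.8, 0.6, 0.7, 0.6, 0.5],
})

# The random intercept per participant accounts for repeated vignettes per person.
model = smf.mixedlm("accuracy ~ coach * training", df, groups=df["participant"])
result = model.fit()
print(result.summary())
```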
Collapse
Affiliation(s)
- Juliane E Kämmer
- Department of Emergency Medicine, Inselspital University Hospital Bern, University of Bern, Bern, Switzerland
| | - Wolf E Hautz
- Department of Emergency Medicine, Inselspital University Hospital Bern, University of Bern, Bern, Switzerland
| | - Gert Krummrey
- Institute for Medical Informatics (I4MI), Bern University of Applied Sciences, Bern, Switzerland
| | - Thomas C Sauter
- Department of Emergency Medicine, Inselspital University Hospital Bern, University of Bern, Bern, Switzerland
| | - Dorothea Penders
- Department of Anesthesiology and Operative Intensive Care Medicine CCM & CVK, Charité Universitätsmedizin Berlin, Berlin, Germany
- Lernzentrum (Skills Lab), Charité Universitätsmedizin Berlin, Berlin, Germany
| | - Tanja Birrenbach
- Department of Emergency Medicine, Inselspital University Hospital Bern, University of Bern, Bern, Switzerland
| | - Nadine Bienefeld
- Department of Management, Technology, and Economics, ETH Zurich, Zurich, Switzerland
| |
Collapse
|
21
|
Wada A, Akashi T, Shih G, Hagiwara A, Nishizawa M, Hayakawa Y, Kikuta J, Shimoji K, Sano K, Kamagata K, Nakanishi A, Aoki S. Optimizing GPT-4 Turbo Diagnostic Accuracy in Neuroradiology through Prompt Engineering and Confidence Thresholds. Diagnostics (Basel) 2024; 14:1541. [PMID: 39061677 PMCID: PMC11276551 DOI: 10.3390/diagnostics14141541] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/29/2024] [Revised: 07/02/2024] [Accepted: 07/10/2024] [Indexed: 07/28/2024] Open
Abstract
BACKGROUND AND OBJECTIVES Integrating large language models (LLMs) such as GPT-4 Turbo into diagnostic imaging faces a significant challenge, with current misdiagnosis rates ranging from 30% to 50%. This study evaluates how prompt engineering and confidence thresholds can improve diagnostic accuracy in neuroradiology. METHODS We analyze 751 neuroradiology cases from the American Journal of Neuroradiology using GPT-4 Turbo with customized prompts to improve diagnostic precision. RESULTS Initially, GPT-4 Turbo achieved a baseline diagnostic accuracy of 55.1%. By reformatting responses to list five diagnostic candidates and applying a 90% confidence threshold, the precision of the top diagnosis increased to 72.9%, and the candidate list contained the correct diagnosis in 85.9% of cases, reducing the misdiagnosis rate to 14.1%. However, this threshold also reduced the number of cases for which the model returned a response. CONCLUSIONS Strategic prompt engineering and high confidence thresholds significantly reduce misdiagnoses and improve the precision of LLM-based diagnosis in neuroradiology. More research is needed to optimize these approaches for broader clinical implementation, balancing accuracy and utility.
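The confidence-threshold step described above can be illustrated with a small Python sketch: keep only cases where the model reports at least 90% confidence, then compute top-1 precision and candidate-list (top-5) accuracy among the answered cases. The case records and field names below are hypothetical, not the study's data or prompts.

```python
# Illustrative sketch (not the authors' code): applying a confidence threshold
# to model-generated diagnostic candidate lists. Each case is assumed to carry
# five ranked candidates and a self-reported confidence score.

cases = [
    {"top_candidate": "glioblastoma", "confidence": 0.95,
     "candidates": ["glioblastoma", "metastasis", "lymphoma", "abscess", "tumefactive MS"],
     "reference": "glioblastoma"},
    {"top_candidate": "meningioma", "confidence": 0.60,
     "candidates": ["meningioma", "schwannoma", "metastasis", "hemangiopericytoma", "lymphoma"],
     "reference": "solitary fibrous tumor"},
]

THRESHOLD = 0.90  # only keep cases where the model reports >= 90% confidence

answered = [c for c in cases if c["confidence"] >= THRESHOLD]
top1_correct = sum(c["top_candidate"] == c["reference"] for c in answered)
top5_correct = sum(c["reference"] in c["candidates"] for c in answered)

if answered:
    print(f"Answered cases: {len(answered)} of {len(cases)}")
    print(f"Top-1 precision among answered: {top1_correct / len(answered):.1%}")
    print(f"Top-5 (candidate list) accuracy: {top5_correct / len(answered):.1%}")
```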
Collapse
Affiliation(s)
- Akihiko Wada
- Department of Radiology, Juntendo University Graduate School of Medicine, Tokyo 113-8421, Japan
| | - Toshiaki Akashi
- Department of Radiology, Juntendo University Graduate School of Medicine, Tokyo 113-8421, Japan
| | - George Shih
- Clinical Radiology, Weill Cornell Medical College, New York, NY 10065, USA
| | - Akifumi Hagiwara
- Department of Radiology, Juntendo University Graduate School of Medicine, Tokyo 113-8421, Japan
| | - Mitsuo Nishizawa
- Department of Radiology, Juntendo University Graduate School of Medicine, Tokyo 113-8421, Japan
| | - Yayoi Hayakawa
- Department of Radiology, Juntendo University Graduate School of Medicine, Tokyo 113-8421, Japan
| | - Junko Kikuta
- Department of Radiology, Juntendo University Graduate School of Medicine, Tokyo 113-8421, Japan
| | - Keigo Shimoji
- Department of Radiology, Juntendo University Graduate School of Medicine, Tokyo 113-8421, Japan
| | - Katsuhiro Sano
- Department of Radiology, Juntendo University Graduate School of Medicine, Tokyo 113-8421, Japan
| | - Koji Kamagata
- Department of Radiology, Juntendo University Graduate School of Medicine, Tokyo 113-8421, Japan
| | - Atsushi Nakanishi
- Department of Radiology, Juntendo University Graduate School of Medicine, Tokyo 113-8421, Japan
| | - Shigeki Aoki
- Department of Radiology, Juntendo University Graduate School of Medicine, Tokyo 113-8421, Japan
| |
Collapse
|
22
|
Yazaki M, Maki S, Furuya T, Inoue K, Nagai K, Nagashima Y, Maruyama J, Toki Y, Kitagawa K, Iwata S, Kitamura T, Gushiken S, Noguchi Y, Inoue M, Shiga Y, Inage K, Orita S, Nakada T, Ohtori S. Emergency Patient Triage Improvement through a Retrieval-Augmented Generation Enhanced Large-Scale Language Model. PREHOSP EMERG CARE 2024:1-7. [PMID: 38950135 DOI: 10.1080/10903127.2024.2374400] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/09/2024] [Accepted: 06/17/2024] [Indexed: 07/03/2024]
Abstract
OBJECTIVES Emergency medical triage is crucial for prioritizing patient care in emergency situations, yet its effectiveness can vary significantly based on the experience and training of the personnel involved. This study aims to evaluate the efficacy of integrating Retrieval Augmented Generation (RAG) with Large Language Models (LLMs), specifically OpenAI's GPT models, to standardize triage procedures and reduce variability in emergency care. METHODS We created 100 simulated triage scenarios based on modified cases from the Japanese National Examination for Emergency Medical Technicians. These scenarios were processed by the RAG-enhanced LLMs, and the models were given patient vital signs, symptoms, and observations from emergency medical services (EMS) teams as inputs. The primary outcome was the accuracy of triage classifications, which was used to compare the performance of the RAG-enhanced LLMs with that of emergency medical technicians and emergency physicians. Secondary outcomes included the rates of under-triage and over-triage. RESULTS The Generative Pre-trained Transformer 3.5 (GPT-3.5) with RAG model achieved a correct triage rate of 70%, significantly outperforming Emergency Medical Technicians (EMTs) with 35% and 38% correct rates, and emergency physicians with 50% and 47% correct rates (p < 0.05). Additionally, this model demonstrated a substantial reduction in under-triage rates to 8%, compared with 33% for GPT-3.5 without RAG, and 39% for GPT-4 without RAG. CONCLUSIONS The integration of RAG with LLMs shows promise in improving the accuracy and consistency of medical assessments in emergency settings. Further validation in diverse medical settings with broader datasets is necessary to confirm the effectiveness and adaptability of these technologies in live environments.
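A minimal sketch of the outcome metrics named above (correct triage, under-triage, over-triage) is shown below in Python; the triage levels are made up and encoded so that a smaller number means higher urgency, which may differ from the coding the authors used.

```python
# Minimal sketch (assumed data, not the study's code): scoring triage outputs
# against reference levels to obtain correct-, under-, and over-triage rates.
# Levels are encoded so that a smaller number means higher urgency.

reference = [1, 2, 3, 2, 1, 4, 3, 2]   # gold-standard triage levels
predicted = [1, 3, 3, 2, 2, 4, 2, 2]   # model- or rater-assigned levels

n = len(reference)
correct = sum(p == r for p, r in zip(predicted, reference))
under   = sum(p > r for p, r in zip(predicted, reference))   # judged less urgent than it was
over    = sum(p < r for p, r in zip(predicted, reference))   # judged more urgent than it was

print(f"Correct triage: {correct / n:.0%}")
print(f"Under-triage:  {under / n:.0%}")
print(f"Over-triage:   {over / n:.0%}")
```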
Collapse
Affiliation(s)
- Megumi Yazaki
- Department of Orthopaedic Surgery, Graduate School of Medicine, Chiba University, Chiba, Japan
- Tertiary Emergency Medical Center, Tokyo Metropolitan Bokutoh Hospital, Tokyo, Japan
- Department of Emergency and Critical Care Medicine, Chiba University, Chiba, Japan
| | - Satoshi Maki
- Department of Orthopaedic Surgery, Graduate School of Medicine, Chiba University, Chiba, Japan
- Center for Frontier Medical Engineering, Chiba University, Chiba, Japan
| | - Takeo Furuya
- Department of Orthopaedic Surgery, Graduate School of Medicine, Chiba University, Chiba, Japan
| | - Ken Inoue
- Tertiary Emergency Medical Center, Tokyo Metropolitan Bokutoh Hospital, Tokyo, Japan
| | - Ko Nagai
- Tertiary Emergency Medical Center, Tokyo Metropolitan Bokutoh Hospital, Tokyo, Japan
| | - Yuki Nagashima
- Department of Orthopaedic Surgery, Graduate School of Medicine, Chiba University, Chiba, Japan
| | - Juntaro Maruyama
- Department of Orthopaedic Surgery, Graduate School of Medicine, Chiba University, Chiba, Japan
| | - Yasunori Toki
- Department of Orthopaedic Surgery, Graduate School of Medicine, Chiba University, Chiba, Japan
| | - Kyota Kitagawa
- Department of Orthopaedic Surgery, Graduate School of Medicine, Chiba University, Chiba, Japan
| | - Shuhei Iwata
- Department of Orthopaedic Surgery, Graduate School of Medicine, Chiba University, Chiba, Japan
| | - Takaki Kitamura
- Department of Orthopaedic Surgery, Graduate School of Medicine, Chiba University, Chiba, Japan
| | - Sho Gushiken
- Department of Orthopaedic Surgery, Graduate School of Medicine, Chiba University, Chiba, Japan
| | - Yuji Noguchi
- Department of Orthopaedic Surgery, Graduate School of Medicine, Chiba University, Chiba, Japan
| | - Masahiro Inoue
- Department of Orthopaedic Surgery, Graduate School of Medicine, Chiba University, Chiba, Japan
| | - Yasuhiro Shiga
- Department of Orthopaedic Surgery, Graduate School of Medicine, Chiba University, Chiba, Japan
| | - Kazuhide Inage
- Department of Orthopaedic Surgery, Graduate School of Medicine, Chiba University, Chiba, Japan
| | - Sumihisa Orita
- Department of Orthopaedic Surgery, Graduate School of Medicine, Chiba University, Chiba, Japan
- Center for Frontier Medical Engineering, Chiba University, Chiba, Japan
| | - Takaaki Nakada
- Department of Emergency and Critical Care Medicine, Chiba University, Chiba, Japan
| | - Seiji Ohtori
- Department of Orthopaedic Surgery, Graduate School of Medicine, Chiba University, Chiba, Japan
| |
Collapse
|
23
|
Sheerah HA, AlSalamah S, Alsalamah SA, Lu CT, Arafa A, Zaatari E, Alhomod A, Pujari S, Labrique A. The Rise of Virtual Health Care: Transforming the Health Care Landscape in the Kingdom of Saudi Arabia: A Review Article. Telemed J E Health 2024. [PMID: 38984415 DOI: 10.1089/tmj.2024.0114] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/11/2024] Open
Abstract
BACKGROUND The rise of virtual healthcare underscores the transformative influence of digital technologies in reshaping the healthcare landscape. As technology advances and the global demand for accessible and convenient healthcare services escalates, the virtual healthcare sector is gaining unprecedented momentum. Saudi Arabia, with its ambitious Vision 2030 initiative, is actively embracing digital innovation in the healthcare sector. METHODS In this narrative review, we discussed the key drivers and prospects of virtual healthcare in Saudi Arabia, highlighting its potential to enhance healthcare accessibility, quality, and patient outcomes. We also summarized the role of the COVID-19 pandemic in the digital transformation of healthcare in the country. Healthcare services provided by Seha Virtual Hospital in Saudi Arabia, the world's largest and Middle East's first virtual hospital, were also described. Finally, we proposed a roadmap for the future development of virtual health in the country. RESULTS AND CONCLUSIONS The integration of virtual healthcare into the existing healthcare system can enhance patient experiences, improve outcomes, and contribute to the overall well-being of the population. However, careful planning, collaboration, and investment are essential to overcome the challenges and ensure the successful implementation and sustainability of virtual healthcare in the country.
Collapse
Affiliation(s)
- Haytham A Sheerah
- Ministry of Health, Office of the Vice Minister of Health, Riyadh, Saudi Arabia
| | - Shada AlSalamah
- Information Systems Department, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
- Department of Digital Health and Innovation, Science Division, World Health Organization, Geneva, Switzerland
| | - Sara A Alsalamah
- College of Computer and Information Sciences, Imam Mohammad Ibn Saud Islamic University, Riyadh, Saudi Arabia
- Department of Computer Science, Virginia Tech, Blacksburg, Virginia, USA
| | - Chang-Tien Lu
- Department of Computer Science, Virginia Tech, Blacksburg, Virginia, USA
| | - Ahmed Arafa
- Department of Preventive Cardiology, National Cerebral and Cardiovascular Center, Suita, Japan
- Department of Public Health and Community Medicine, Faculty of Medicine, Beni-Suef University, Beni-Suef, Egypt
| | - Ezzedine Zaatari
- Ministry of Health, Office of the Vice Minister of Health, Riyadh, Saudi Arabia
| | - Abdulaziz Alhomod
- Ministry of Health, SEHA Virtual Hospital, Riyadh, Saudi Arabia
- Emergency Medicine Administration, King Fahad Medical City, Riyadh, Saudi Arabia
| | - Sameer Pujari
- Department of Digital Health and Innovation, Science Division, World Health Organization, Geneva, Switzerland
| | - Alain Labrique
- Department of International Health, Johns Hopkins University Bloomberg School of Public Health, Baltimore, Maryland,United States
| |
Collapse
|
24
|
Hoppe JM, Auer MK, Strüven A, Massberg S, Stremmel C. ChatGPT With GPT-4 Outperforms Emergency Department Physicians in Diagnostic Accuracy: Retrospective Analysis. J Med Internet Res 2024; 26:e56110. [PMID: 38976865 PMCID: PMC11263899 DOI: 10.2196/56110] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/06/2024] [Revised: 04/08/2024] [Accepted: 05/08/2024] [Indexed: 07/10/2024] Open
Abstract
BACKGROUND OpenAI's ChatGPT is a pioneering artificial intelligence (AI) in the field of natural language processing, and it holds significant potential in medicine for providing treatment advice. Additionally, recent studies have demonstrated promising results using ChatGPT for emergency medicine triage. However, its diagnostic accuracy in the emergency department (ED) has not yet been evaluated. OBJECTIVE This study compares the diagnostic accuracy of ChatGPT with GPT-3.5 and GPT-4 and primary treating resident physicians in an ED setting. METHODS Among 100 adults admitted to our ED in January 2023 with internal medicine issues, the diagnostic accuracy was assessed by comparing the diagnoses made by ED resident physicians and those made by ChatGPT with GPT-3.5 or GPT-4 against the final hospital discharge diagnosis, using a point system for grading accuracy. RESULTS The study enrolled 100 patients with a median age of 72 (IQR 58.5-82.0) years who were admitted to our internal medicine ED primarily for cardiovascular, endocrine, gastrointestinal, or infectious diseases. GPT-4 outperformed both GPT-3.5 (P<.001) and ED resident physicians (P=.01) in diagnostic accuracy for internal medicine emergencies. Furthermore, across various disease subgroups, GPT-4 consistently outperformed GPT-3.5 and resident physicians. It demonstrated significant superiority in cardiovascular (GPT-4 vs ED physicians: P=.03) and endocrine or gastrointestinal diseases (GPT-4 vs GPT-3.5: P=.01). However, in other categories, the differences were not statistically significant. CONCLUSIONS In this study, which compared the diagnostic accuracy of GPT-3.5, GPT-4, and ED resident physicians against a discharge diagnosis gold standard, GPT-4 outperformed both the resident physicians and its predecessor, GPT-3.5. Despite the retrospective design of the study and its limited sample size, the results underscore the potential of AI as a supportive diagnostic tool in ED settings.
Collapse
Affiliation(s)
| | - Matthias K Auer
- Department of Medicine IV, LMU University Hospital, Munich, Germany
| | - Anna Strüven
- Department of Medicine I, LMU University Hospital, Munich, Germany
- Munich Heart Alliance Partner Site, Deutsches Zentrum für Herz-Kreislaufforschung (German Centre for Cardiovascular Research), LMU University Hospital, Munich, Germany
| | - Steffen Massberg
- Department of Medicine I, LMU University Hospital, Munich, Germany
- Munich Heart Alliance Partner Site, Deutsches Zentrum für Herz-Kreislaufforschung (German Centre for Cardiovascular Research), LMU University Hospital, Munich, Germany
| | - Christopher Stremmel
- Department of Medicine I, LMU University Hospital, Munich, Germany
- Munich Heart Alliance Partner Site, Deutsches Zentrum für Herz-Kreislaufforschung (German Centre for Cardiovascular Research), LMU University Hospital, Munich, Germany
| |
Collapse
|
25
|
Keshavarz P, Bagherieh S, Nabipoorashrafi SA, Chalian H, Rahsepar AA, Kim GHJ, Hassani C, Raman SS, Bedayat A. ChatGPT in radiology: A systematic review of performance, pitfalls, and future perspectives. Diagn Interv Imaging 2024; 105:251-265. [PMID: 38679540 DOI: 10.1016/j.diii.2024.04.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/31/2024] [Revised: 03/11/2024] [Accepted: 04/16/2024] [Indexed: 05/01/2024]
Abstract
PURPOSE The purpose of this study was to systematically review the reported performances of ChatGPT, identify potential limitations, and explore future directions for its integration, optimization, and ethical considerations in radiology applications. MATERIALS AND METHODS After a comprehensive review of PubMed, Web of Science, Embase, and Google Scholar databases, a cohort of published studies was identified up to January 1, 2024, utilizing ChatGPT for clinical radiology applications. RESULTS Out of 861 studies derived, 44 studies evaluated the performance of ChatGPT; among these, 37 (37/44; 84.1%) demonstrated high performance, and seven (7/44; 15.9%) indicated it had a lower performance in providing information on diagnosis and clinical decision support (6/44; 13.6%) and patient communication and educational content (1/44; 2.3%). Twenty-four (24/44; 54.5%) studies reported the proportion of ChatGPT's performance. Among these, 19 (19/24; 79.2%) studies recorded a median accuracy of 70.5%, and in five (5/24; 20.8%) studies, there was a median agreement of 83.6% between ChatGPT outcomes and reference standards [radiologists' decision or guidelines], generally confirming ChatGPT's high accuracy in these studies. Eleven studies compared two recent ChatGPT versions, and in ten (10/11; 90.9%), ChatGPTv4 outperformed v3.5, showing notable enhancements in addressing higher-order thinking questions, better comprehension of radiology terms, and improved accuracy in describing images. Risks and concerns about using ChatGPT included biased responses, limited originality, and the potential for inaccurate information leading to misinformation, hallucinations, improper citations and fake references, cybersecurity vulnerabilities, and patient privacy risks. CONCLUSION Although ChatGPT's effectiveness has been shown in 84.1% of radiology studies, there are still multiple pitfalls and limitations to address. It is too soon to confirm its complete proficiency and accuracy, and more extensive multicenter studies utilizing diverse datasets and pre-training techniques are required to verify ChatGPT's role in radiology.
Collapse
Affiliation(s)
- Pedram Keshavarz
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA; School of Science and Technology, The University of Georgia, Tbilisi 0171, Georgia
| | - Sara Bagherieh
- Independent Clinical Radiology Researcher, Los Angeles, CA 90024, USA
| | | | - Hamid Chalian
- Department of Radiology, Cardiothoracic Imaging, University of Washington, Seattle, WA 98195, USA
| | - Amir Ali Rahsepar
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA
| | - Grace Hyun J Kim
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA; Department of Radiological Sciences, Center for Computer Vision and Imaging Biomarkers, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA
| | - Cameron Hassani
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA
| | - Steven S Raman
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA
| | - Arash Bedayat
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA.
| |
Collapse
|
26
|
Law S, Oldfield B, Yang W. ChatGPT/GPT-4 (large language models): Opportunities and challenges of perspective in bariatric healthcare professionals. Obes Rev 2024; 25:e13746. [PMID: 38613164 DOI: 10.1111/obr.13746] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 06/10/2023] [Revised: 03/14/2024] [Accepted: 03/15/2024] [Indexed: 04/14/2024]
Abstract
ChatGPT/GPT-4 is a conversational large language model (LLM) based on artificial intelligence (AI). The potential application of LLMs as virtual assistants for bariatric healthcare professionals in education and practice may be promising if relevant and valid issues are actively examined and addressed. In general medical terms, it is possible that AI models like ChatGPT/GPT-4 will be deeply integrated into medical scenarios, improving medical efficiency and quality, and allowing doctors more time to communicate with patients and implement personalized health management. Chatbots based on AI have great potential in bariatric healthcare and may play an important role in predicting and intervening in weight loss and obesity-related complications. However, given their potential limitations, we should carefully consider the medical, legal, ethical, data security, privacy, and liability issues arising from medical errors caused by ChatGPT/GPT-4. This concern also extends to ChatGPT/GPT-4's ability to justify wrong decisions, and there is an urgent need for appropriate guidelines and regulations to ensure the safe and responsible use of ChatGPT/GPT-4.
Collapse
Affiliation(s)
- Saikam Law
- Department of Metabolic and Bariatric Surgery, The First Affiliated Hospital of Jinan University, Guangzhou, China
- School of Medicine, Jinan University, Guangzhou, China
| | - Brian Oldfield
- Department of Physiology, Monash Biomedicine Discovery Institute, Monash University, Melbourne, Australia
| | - Wah Yang
- Department of Metabolic and Bariatric Surgery, The First Affiliated Hospital of Jinan University, Guangzhou, China
| |
Collapse
|
27
|
Kumar RP, Sivan V, Bachir H, Sarwar SA, Ruzicka F, O'Malley GR, Lobo P, Morales IC, Cassimatis ND, Hundal JS, Patel NV. Can Artificial Intelligence Mitigate Missed Diagnoses by Generating Differential Diagnoses for Neurosurgeons? World Neurosurg 2024; 187:e1083-e1088. [PMID: 38759788 DOI: 10.1016/j.wneu.2024.05.052] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/18/2023] [Revised: 05/08/2024] [Accepted: 05/09/2024] [Indexed: 05/19/2024]
Abstract
BACKGROUND/OBJECTIVE Neurosurgery emphasizes the criticality of accurate differential diagnoses, with diagnostic delays posing significant health and economic challenges. As large language models (LLMs) emerge as transformative tools in healthcare, this study seeks to elucidate their role in assisting neurosurgeons with the differential diagnosis process, especially during preliminary consultations. METHODS This study employed 3 chat-based LLMs, ChatGPT (versions 3.5 and 4.0), Perplexity AI, and Bard AI, to evaluate their diagnostic accuracy. Each LLM was prompted using clinical vignettes, and their responses were recorded to generate differential diagnoses for 20 common and uncommon neurosurgical disorders. Disease-specific prompts were crafted using Dynamed, a clinical reference tool. The accuracy of the LLMs was determined based on their ability to identify the target disease within their top differential diagnoses correctly. RESULTS For the initial differential, ChatGPT 3.5 achieved an accuracy of 52.63%, while ChatGPT 4.0 performed slightly better at 53.68%. Perplexity AI and Bard AI demonstrated 40.00% and 29.47% accuracy, respectively. As the number of considered differentials increased from 2 to 5, ChatGPT 3.5 reached its peak accuracy of 77.89% for the top 5 differentials. Bard AI and Perplexity AI had varied performances, with Bard AI improving in the top 5 differentials at 62.11%. On a disease-specific note, the LLMs excelled in diagnosing conditions like epilepsy and cervical spine stenosis but faced challenges with more complex diseases such as Moyamoya disease and amyotrophic lateral sclerosis. CONCLUSIONS LLMs showcase the potential to enhance diagnostic accuracy and decrease the incidence of missed diagnoses in neurosurgery.
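The "top 1 to top 5" analysis described above amounts to a top-k accuracy computation over ranked differential lists. The Python sketch below illustrates the idea with hypothetical differentials; it is not the authors' evaluation code.

```python
# Illustrative sketch (hypothetical data): top-k accuracy of LLM-generated
# differential diagnosis lists, mirroring the "top 1 to top 5" analysis.

differentials = [
    (["epilepsy", "syncope", "TIA"], "epilepsy"),
    (["meningioma", "glioma", "metastasis", "abscess"], "glioma"),
    (["ALS", "cervical myelopathy", "MS", "myasthenia gravis", "CIDP"], "ALS"),
]

def top_k_accuracy(items, k):
    """Fraction of cases whose target diagnosis appears in the first k candidates."""
    hits = sum(target in ranked[:k] for ranked, target in items)
    return hits / len(items)

for k in range(1, 6):
    print(f"Top-{k} accuracy: {top_k_accuracy(differentials, k):.1%}")
```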
Collapse
Affiliation(s)
- Rohit Prem Kumar
- Department of Neurosurgery, Hackensack Meridian School of Medicine, Nutley, New Jersey, USA.
| | - Vijay Sivan
- Department of Neurosurgery, Hackensack Meridian School of Medicine, Nutley, New Jersey, USA
| | - Hanin Bachir
- Department of Neurosurgery, Hackensack Meridian School of Medicine, Nutley, New Jersey, USA
| | - Syed A Sarwar
- Department of Neurosurgery, Hackensack Meridian School of Medicine, Nutley, New Jersey, USA
| | - Francis Ruzicka
- Department of Neurosurgery, Hackensack Meridian School of Medicine, Nutley, New Jersey, USA
| | - Geoffrey R O'Malley
- Department of Neurosurgery, Hackensack Meridian School of Medicine, Nutley, New Jersey, USA
| | - Paulo Lobo
- Department of Neurosurgery, Hackensack Meridian School of Medicine, Nutley, New Jersey, USA
| | - Ilona Cazorla Morales
- Department of Neurosurgery, Hackensack Meridian School of Medicine, Nutley, New Jersey, USA
| | - Nicholas D Cassimatis
- Department of Neurosurgery, Hackensack Meridian School of Medicine, Nutley, New Jersey, USA
| | - Jasdeep S Hundal
- Department of Neurology, HMH-Jersey Shore University Medical Center, Neptune, New Jersey, USA
| | - Nitesh V Patel
- Department of Neurosurgery, Hackensack Meridian School of Medicine, Nutley, New Jersey, USA; Department of Neurosurgery, HMH-Jersey Shore University Medical Center, Neptune, New Jersey, USA
| |
Collapse
|
28
|
Alshutayli AAM, Asiri FM, Abutaleb YBA, Alomair BA, Almasaud AK, Almaqhawi A. Assessing Public Knowledge and Acceptance of Using Artificial Intelligence Doctors as a Partial Alternative to Human Doctors in Saudi Arabia: A Cross-Sectional Study. Cureus 2024; 16:e64461. [PMID: 39135842 PMCID: PMC11318498 DOI: 10.7759/cureus.64461] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 07/13/2024] [Indexed: 08/15/2024] Open
Abstract
Objective To assess the public acceptance of using artificial intelligence (AI) doctors to diagnose and treat patients as a partial alternative to human physicians in Saudi Arabia. Methodology An observational cross-sectional study was conducted from January to March 2024. A link to an online questionnaire was distributed through social media applications to citizens and residents aged 18 years and older across various regions in Saudi Arabia. The sample size was calculated using the Raosoft online survey size calculator, which estimated that the minimum sample size should be 385. Results Of the 386 participants surveyed, 85.8% reported being aware of AI, and 47.9% reported having some knowledge about different AI fields in daily life. However, almost one-third (32.9%) reported a lack of knowledge about the use of AI in healthcare. In terms of acceptance, 52.3% of respondents indicated they felt comfortable with the use of AI tools as partial alternatives to human doctors, and 30.8% believed AI is useful in the field of health. The most common concern (63.7%) about the use of AI tools accessible to patients was the difficulty of describing symptoms using these tools. Conclusion The findings of this study provide valuable insights into the public's knowledge and acceptance of AI in medicine within the Saudi Arabian context. Overall, this study underscores the importance of proactively addressing the public's concerns and knowledge gaps regarding AI in healthcare. By fostering greater understanding and acceptance, healthcare stakeholders can better harness the potential of AI to improve patient outcomes and enhance the efficiency of medical services in Saudi Arabia.
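The minimum sample size of 385 is what the standard Cochran formula yields under typical survey assumptions (95% confidence, 5% margin of error, p = 0.5), which calculators such as Raosoft implement. The Python sketch below reproduces that calculation; the exact settings used by the authors are an assumption.

```python
# Sketch of the standard sample-size formula behind online survey calculators;
# assumptions: 95% confidence, 5% margin of error, p = 0.5 (most conservative).
import math

z = 1.96      # z-score for 95% confidence
p = 0.5       # assumed response distribution
e = 0.05      # margin of error

n_infinite = (z ** 2) * p * (1 - p) / e ** 2          # ~384.16
n = math.ceil(n_infinite)
print(f"Minimum sample size for a large population: {n}")   # 385

# Optional finite-population correction for a known population size N.
N = 20_000_000
n_fpc = math.ceil(n_infinite / (1 + (n_infinite - 1) / N))
print(f"With finite-population correction (N={N:,}): {n_fpc}")
```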
Collapse
Affiliation(s)
| | - Faisal M Asiri
- College of Medicine, Prince Sattam Bin Abdulaziz University, Al-Kharj, SAU
| | | | | | | | | |
Collapse
|
29
|
Aden D, Zaheer S, Khan S. Possible benefits, challenges, pitfalls, and future perspective of using ChatGPT in pathology. REVISTA ESPANOLA DE PATOLOGIA : PUBLICACION OFICIAL DE LA SOCIEDAD ESPANOLA DE ANATOMIA PATOLOGICA Y DE LA SOCIEDAD ESPANOLA DE CITOLOGIA 2024; 57:198-210. [PMID: 38971620 DOI: 10.1016/j.patol.2024.04.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 01/29/2024] [Revised: 02/22/2024] [Accepted: 04/16/2024] [Indexed: 07/08/2024]
Abstract
The much-hyped artificial intelligence (AI) model called ChatGPT, developed by OpenAI, can have great benefits for physicians, especially pathologists, by saving time that can then be devoted to more significant work. Generative AI is a special class of AI model, which uses patterns and structures learned from existing data and can create new data. Utilizing ChatGPT in pathology offers a multitude of benefits, encompassing the summarization of patient records and its promising prospects in digital pathology, as well as its valuable contributions to education and research in this field. However, certain roadblocks still need to be addressed, such as integrating ChatGPT with image analysis, which could revolutionize the field of pathology by increasing diagnostic accuracy and precision. The challenges with the use of ChatGPT encompass biases from its training data, the need for ample input data, potential risks related to bias and transparency, and the potential adverse outcomes arising from inaccurate content generation. A further goal is the generation of meaningful insights from textual information alongside efficient processing of different types of image data, such as medical images and pathology slides. Due consideration should be given to ethical and legal issues, including bias.
Collapse
Affiliation(s)
- Durre Aden
- Department of Pathology, Hamdard Institute of Medical Sciences and Research, Jamia Hamdard, New Delhi, India
| | - Sufian Zaheer
- Department of Pathology, Vardhman Mahavir Medical College and Safdarjung Hospital, New Delhi, India.
| | - Sabina Khan
- Department of Pathology, Hamdard Institute of Medical Sciences and Research, Jamia Hamdard, New Delhi, India
| |
Collapse
|
30
|
Lahat A, Sharif K, Zoabi N, Shneor Patt Y, Sharif Y, Fisher L, Shani U, Arow M, Levin R, Klang E. Assessing Generative Pretrained Transformers (GPT) in Clinical Decision-Making: Comparative Analysis of GPT-3.5 and GPT-4. J Med Internet Res 2024; 26:e54571. [PMID: 38935937 PMCID: PMC11240076 DOI: 10.2196/54571] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/15/2023] [Revised: 02/02/2024] [Accepted: 04/29/2024] [Indexed: 06/29/2024] Open
Abstract
BACKGROUND Artificial intelligence, particularly chatbot systems, is becoming an instrumental tool in health care, aiding clinical decision-making and patient engagement. OBJECTIVE This study aims to analyze the performance of ChatGPT-3.5 and ChatGPT-4 in addressing complex clinical and ethical dilemmas, and to illustrate their potential role in health care decision-making while comparing seniors' and residents' ratings, and specific question types. METHODS A total of 4 specialized physicians formulated 176 real-world clinical questions. A total of 8 senior physicians and residents assessed responses from GPT-3.5 and GPT-4 on a 1-5 scale across 5 categories: accuracy, relevance, clarity, utility, and comprehensiveness. Evaluations were conducted within internal medicine, emergency medicine, and ethics. Comparisons were made globally, between seniors and residents, and across classifications. RESULTS Both GPT models received high mean scores (4.4, SD 0.8 for GPT-4 and 4.1, SD 1.0 for GPT-3.5). GPT-4 outperformed GPT-3.5 across all rating dimensions, with seniors consistently rating responses higher than residents for both models. Specifically, seniors rated GPT-4 as more beneficial and complete (mean 4.6 vs 4.0 and 4.6 vs 4.1, respectively; P<.001), and GPT-3.5 similarly (mean 4.1 vs 3.7 and 3.9 vs 3.5, respectively; P<.001). Ethical queries received the highest ratings for both models, with mean scores reflecting consistency across accuracy and completeness criteria. Distinctions among question types were significant, particularly for the GPT-4 mean scores in completeness across emergency, internal, and ethical questions (4.2, SD 1.0; 4.3, SD 0.8; and 4.5, SD 0.7, respectively; P<.001), and for GPT-3.5's accuracy, beneficial, and completeness dimensions. CONCLUSIONS ChatGPT's potential to assist physicians with medical issues is promising, with prospects to enhance diagnostics, treatments, and ethics. While integration into clinical workflows may be valuable, it must complement, not replace, human expertise. Continued research is essential to ensure safe and effective implementation in clinical environments.
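The abstract reports mean Likert scores and P values for paired model comparisons. One common way to test such paired ordinal ratings is a Wilcoxon signed-rank test, sketched below with invented scores; this is an illustration, not necessarily the authors' exact procedure.

```python
# Illustrative sketch (made-up ratings): paired comparison of 1-5 Likert scores
# for the same questions answered by two models. Requires NumPy and SciPy.
import numpy as np
from scipy.stats import wilcoxon

gpt35 = np.array([4, 4, 3, 5, 4, 3, 4, 5, 3, 4, 4, 3])
gpt4  = np.array([5, 4, 4, 5, 5, 4, 4, 5, 4, 5, 4, 4])

stat, p = wilcoxon(gpt4, gpt35)
print(f"Mean GPT-4: {gpt4.mean():.1f}, mean GPT-3.5: {gpt35.mean():.1f}")
print(f"Wilcoxon signed-rank: statistic={stat:.1f}, p={p:.3f}")
```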
Collapse
Affiliation(s)
- Adi Lahat
- Department of Gastroenterology, Chaim Sheba Medical Center, Affiliated with Tel Aviv University, Ramat Gan, Israel
- Department of Gastroenterology, Samson Assuta Ashdod Medical Center, Affiliated with Ben Gurion University of the Negev, Be'er Sheva, Israel
| | - Kassem Sharif
- Department of Gastroenterology, Chaim Sheba Medical Center, Affiliated with Tel Aviv University, Ramat Gan, Israel
- Department of Internal Medicine B, Sheba Medical Centre, Tel Aviv, Israel
| | - Narmin Zoabi
- Department of Gastroenterology, Chaim Sheba Medical Center, Affiliated with Tel Aviv University, Ramat Gan, Israel
| | | | - Yousra Sharif
- Department of Internal Medicine C, Hadassah Medical Center, Jerusalem, Israel
| | - Lior Fisher
- Department of Internal Medicine B, Sheba Medical Centre, Tel Aviv, Israel
| | - Uria Shani
- Department of Internal Medicine B, Sheba Medical Centre, Tel Aviv, Israel
| | - Mohamad Arow
- Department of Internal Medicine B, Sheba Medical Centre, Tel Aviv, Israel
| | - Roni Levin
- Department of Internal Medicine B, Sheba Medical Centre, Tel Aviv, Israel
| | - Eyal Klang
- Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, United States
| |
Collapse
|
31
|
Ríos-Hoyo A, Shan NL, Li A, Pearson AT, Pusztai L, Howard FM. Evaluation of large language models as a diagnostic aid for complex medical cases. Front Med (Lausanne) 2024; 11:1380148. [PMID: 38966538 PMCID: PMC11222590 DOI: 10.3389/fmed.2024.1380148] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/01/2024] [Accepted: 06/10/2024] [Indexed: 07/06/2024] Open
Abstract
Background The use of large language models (LLMs) has recently gained popularity in diverse areas, including answering questions posted by patients as well as medical professionals. Objective To evaluate the performance and limitations of LLMs in providing the correct diagnosis for a complex clinical case. Design Seventy-five consecutive clinical cases were selected from the Massachusetts General Hospital Case Records, and differential diagnoses were generated by OpenAI's GPT3.5 and 4 models. Results The mean number of diagnoses provided by the Massachusetts General Hospital case discussants was 16.77, by GPT3.5 30, and by GPT4 15.45 (p < 0.0001). GPT4 was more frequently able to list the correct diagnosis first (22% versus 20% with GPT3.5, p = 0.86) and to provide the correct diagnosis among the top three generated diagnoses (42% versus 24%, p = 0.075). GPT4 was also better at providing the correct diagnosis when the generated diagnoses were classified into groups according to medical specialty, and at including the correct diagnosis at any point in the differential list (68% versus 48%, p = 0.0063). GPT4 provided a differential list that was more similar to the list provided by the case discussants than GPT3.5 (Jaccard Similarity Index 0.22 versus 0.12, p = 0.001). Inclusion of the correct diagnosis in the generated differential was correlated with PubMed articles matching the diagnosis (OR 1.40, 95% CI 1.25-1.56 for GPT3.5; OR 1.25, 95% CI 1.13-1.40 for GPT4), but not with disease incidence. Conclusions and relevance The GPT4 model was able to generate a differential diagnosis list with the correct diagnosis in approximately two thirds of cases, but the most likely diagnosis was often incorrect for both models. In its current state, this tool can at most be used as an aid to expand on potential diagnostic considerations for a case, and future LLMs should be trained to account for the discrepancy between disease incidence and availability in the literature.
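The Jaccard Similarity Index used above to compare differential lists is simply the size of the intersection divided by the size of the union of the two sets of diagnoses. A minimal Python sketch with hypothetical lists:

```python
# Minimal sketch (hypothetical diagnosis lists): Jaccard similarity between
# an LLM-generated differential and the case discussants' differential.

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

gpt_list = ["sarcoidosis", "lymphoma", "tuberculosis", "IgG4-related disease"]
discussant_list = ["lymphoma", "tuberculosis", "histoplasmosis", "sarcoidosis", "metastasis"]

print(f"Jaccard similarity: {jaccard(gpt_list, discussant_list):.2f}")
```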
Collapse
Affiliation(s)
| | - Naing Lin Shan
- Yale Cancer Center, Yale School of Medicine, New Haven, CT, United States
| | - Anran Li
- Department of Medicine, University of Chicago, Chicago, IL, United States
| | | | - Lajos Pusztai
- Yale Cancer Center, Yale School of Medicine, New Haven, CT, United States
| | | |
Collapse
|
32
|
Born C, Schwarz R, Böttcher TP, Hein A, Krcmar H. The role of information systems in emergency department decision-making-a literature review. J Am Med Inform Assoc 2024; 31:1608-1621. [PMID: 38781289 PMCID: PMC11187435 DOI: 10.1093/jamia/ocae096] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/21/2023] [Revised: 04/11/2024] [Accepted: 04/15/2024] [Indexed: 05/25/2024] Open
Abstract
OBJECTIVES Healthcare providers employ heuristic and analytical decision-making to navigate the high-stakes environment of the emergency department (ED). Despite the increasing integration of information systems (ISs), research on their efficacy is conflicting. Drawing on related fields, we investigate how timing and mode of delivery influence IS effectiveness. Our objective is to reconcile previous contradictory findings, shedding light on optimal IS design in the ED. MATERIALS AND METHODS We conducted a systematic review following PRISMA across PubMed, Scopus, and Web of Science. We coded the ISs' timing as heuristic or analytical, their mode of delivery as active for automatic alerts and passive when requiring user-initiated information retrieval, and their effect on process, economic, and clinical outcomes. RESULTS Our analysis included 83 studies. During early heuristic decision-making, most active interventions were ineffective, while passive interventions generally improved outcomes. In the analytical phase, the effects were reversed. Passive interventions that facilitate information extraction consistently improved outcomes. DISCUSSION Our findings suggest that the effectiveness of active interventions negatively correlates with the amount of information received during delivery. During early heuristic decision-making, when information overload is high, physicians are unresponsive to alerts and proactively consult passive resources. In the later analytical phases, physicians show increased receptivity to alerts due to decreased diagnostic uncertainty and information quantity. Interventions that limit information lead to positive outcomes, supporting our interpretation. CONCLUSION We synthesize our findings into an integrated model that reveals the underlying reasons for conflicting findings from previous reviews and can guide practitioners in designing ISs in the ED.
Collapse
Affiliation(s)
- Cornelius Born
- School of Computation, Information and Technology, Technical University of Munich, 85748 Garching bei München, Germany
| | - Romy Schwarz
- School of Computation, Information and Technology, Technical University of Munich, 85748 Garching bei München, Germany
| | - Timo Phillip Böttcher
- School of Computation, Information and Technology, Technical University of Munich, 85748 Garching bei München, Germany
| | - Andreas Hein
- Institute of Information Systems and Digital Business, University of St. Gallen, 9000 St. Gallen, Switzerland
| | - Helmut Krcmar
- School of Computation, Information and Technology, Technical University of Munich, 85748 Garching bei München, Germany
| |
Collapse
|
33
|
Masanneck L, Schmidt L, Seifert A, Kölsche T, Huntemann N, Jansen R, Mehsin M, Bernhard M, Meuth SG, Böhm L, Pawlitzki M. Triage Performance Across Large Language Models, ChatGPT, and Untrained Doctors in Emergency Medicine: Comparative Study. J Med Internet Res 2024; 26:e53297. [PMID: 38875696 PMCID: PMC11214027 DOI: 10.2196/53297] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/02/2023] [Revised: 04/17/2024] [Accepted: 05/14/2024] [Indexed: 06/16/2024] Open
Abstract
BACKGROUND Large language models (LLMs) have demonstrated impressive performances in various medical domains, prompting an exploration of their potential utility within the high-demand setting of emergency department (ED) triage. This study evaluated the triage proficiency of different LLMs and ChatGPT, an LLM-based chatbot, compared to professionally trained ED staff and untrained personnel. We further explored whether LLM responses could guide untrained staff in effective triage. OBJECTIVE This study aimed to assess the efficacy of LLMs and the associated product ChatGPT in ED triage compared to personnel of varying training status and to investigate if the models' responses can enhance the triage proficiency of untrained personnel. METHODS A total of 124 anonymized case vignettes were triaged by untrained doctors; different versions of currently available LLMs; ChatGPT; and professionally trained raters, who subsequently agreed on a consensus set according to the Manchester Triage System (MTS). The prototypical vignettes were adapted from cases at a tertiary ED in Germany. The main outcome was the level of agreement between raters' MTS level assignments, measured via quadratic-weighted Cohen κ. The extent of over- and undertriage was also determined. Notably, instances of ChatGPT were prompted using zero-shot approaches without extensive background information on the MTS. The tested LLMs included raw GPT-4, Llama 3 70B, Gemini 1.5, and Mixtral 8x7b. RESULTS GPT-4-based ChatGPT and untrained doctors showed substantial agreement with the consensus triage of professional raters (κ=mean 0.67, SD 0.037 and κ=mean 0.68, SD 0.056, respectively), significantly exceeding the performance of GPT-3.5-based ChatGPT (κ=mean 0.54, SD 0.024; P<.001). When untrained doctors used this LLM for second-opinion triage, there was a slight but statistically insignificant performance increase (κ=mean 0.70, SD 0.047; P=.97). Other tested LLMs performed similar to or worse than GPT-4-based ChatGPT or showed odd triaging behavior with the used parameters. LLMs and ChatGPT models tended toward overtriage, whereas untrained doctors undertriaged. CONCLUSIONS While LLMs and the LLM-based product ChatGPT do not yet match professionally trained raters, their best models' triage proficiency equals that of untrained ED doctors. In its current form, LLMs or ChatGPT thus did not demonstrate gold-standard performance in ED triage and, in the setting of this study, failed to significantly improve untrained doctors' triage when used as decision support. Notable performance enhancements in newer LLM versions over older ones hint at future improvements with further technological development and specific training.
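The agreement metric named above, quadratic-weighted Cohen's κ, can be computed as in the short Python sketch below (assuming scikit-learn is available; the MTS level assignments are invented for illustration).

```python
# Illustrative sketch (made-up ratings): quadratic-weighted Cohen's kappa
# between a rater's Manchester Triage System levels and a consensus set.
from sklearn.metrics import cohen_kappa_score

consensus = [1, 2, 2, 3, 4, 5, 3, 2, 1, 4]   # MTS levels agreed by trained raters
rater     = [1, 2, 3, 3, 4, 4, 3, 2, 2, 4]   # e.g. an LLM or an untrained doctor

kappa = cohen_kappa_score(consensus, rater, weights="quadratic")
print(f"Quadratic-weighted kappa: {kappa:.2f}")
```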
Collapse
Affiliation(s)
- Lars Masanneck
- Department of Neurology, Medical Faculty and University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
- Digital Health Center, Hasso Plattner Institute, University of Potsdam, Potsdam, Germany
| | - Linea Schmidt
- Digital Health Center, Hasso Plattner Institute, University of Potsdam, Potsdam, Germany
| | - Antonia Seifert
- Emergency Department, Medical Faculty and University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Tristan Kölsche
- Department of Neurology, Medical Faculty and University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Niklas Huntemann
- Department of Neurology, Medical Faculty and University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Robin Jansen
- Department of Neurology, Medical Faculty and University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Mohammed Mehsin
- Department of Neurology, Medical Faculty and University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Michael Bernhard
- Emergency Department, Medical Faculty and University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Sven G Meuth
- Department of Neurology, Medical Faculty and University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Lennert Böhm
- Emergency Department, Medical Faculty and University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Marc Pawlitzki
- Department of Neurology, Medical Faculty and University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| |
Collapse
|
34
|
Harada Y, Suzuki T, Harada T, Sakamoto T, Ishizuka K, Miyagami T, Kawamura R, Kunitomo K, Nagano H, Shimizu T, Watari T. Performance evaluation of ChatGPT in detecting diagnostic errors and their contributing factors: an analysis of 545 case reports of diagnostic errors. BMJ Open Qual 2024; 13:e002654. [PMID: 38830730 PMCID: PMC11149143 DOI: 10.1136/bmjoq-2023-002654] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/17/2023] [Accepted: 05/28/2024] [Indexed: 06/05/2024] Open
Abstract
BACKGROUND Manual chart review using validated assessment tools is a standardised methodology for detecting diagnostic errors. However, this requires considerable human resources and time. ChatGPT, a recently developed artificial intelligence chatbot based on a large language model, can effectively classify text based on suitable prompts. Therefore, ChatGPT can assist manual chart reviews in detecting diagnostic errors. OBJECTIVE This study aimed to clarify whether ChatGPT could correctly detect diagnostic errors and possible factors contributing to them based on case presentations. METHODS We analysed 545 published case reports that included diagnostic errors. We input the texts of case presentations and the final diagnoses, together with a set of original prompts, into ChatGPT (GPT-4) to generate responses, including the judgement of diagnostic errors and contributing factors of diagnostic errors. Factors contributing to diagnostic errors were coded according to the following three taxonomies: Diagnosis Error Evaluation and Research (DEER), Reliable Diagnosis Challenges (RDC) and Generic Diagnostic Pitfalls (GDP). The responses on the contributing factors from ChatGPT were compared with those from physicians. RESULTS ChatGPT correctly detected diagnostic errors in 519/545 cases (95%) and coded significantly more factors contributing to diagnostic errors per case than physicians did: DEER (median 5 vs 1, p<0.001), RDC (median 4 vs 2, p<0.001) and GDP (median 4 vs 1, p<0.001). The most important contributing factors of diagnostic errors coded by ChatGPT were 'failure/delay in considering the diagnosis' (315, 57.8%) in DEER, 'atypical presentation' (365, 67.0%) in RDC, and 'atypical presentation' (264, 48.4%) in GDP. CONCLUSION ChatGPT accurately detects diagnostic errors from case presentations. ChatGPT may be more sensitive than manual review in detecting factors contributing to diagnostic errors, especially for 'atypical presentation'.
Collapse
Affiliation(s)
- Yukinori Harada
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga-gun, Tochigi, Japan
| | | | - Taku Harada
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga-gun, Tochigi, Japan
- Nerima Hikarigaoka Hospital, Nerima-ku, Tokyo, Japan
| | - Tetsu Sakamoto
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga-gun, Tochigi, Japan
| | - Kosuke Ishizuka
- Yokohama City University School of Medicine Graduate School of Medicine, Yokohama, Kanagawa, Japan
| | - Taiju Miyagami
- Department of General Medicine, Faculty of Medicine, Juntendo University, Bunkyo-ku, Tokyo, Japan
| | - Ren Kawamura
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga-gun, Tochigi, Japan
| | | | - Hiroyuki Nagano
- Department of General Internal Medicine, Tenri Hospital, Tenri, Nara, Japan
| | - Taro Shimizu
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga-gun, Tochigi, Japan
| | - Takashi Watari
- Integrated Clinical Education Center, Kyoto University Hospital, Kyoto, Kyoto, Japan
| |
Collapse
|
35
|
Barclay KS, You JY, Coleman MJ, Mathews PM, Ray VL, Riaz KM, De Rojas JO, Wang AS, Watson SH, Koo EH, Eghrari AO. Quality and Agreement With Scientific Consensus of ChatGPT Information Regarding Corneal Transplantation and Fuchs Dystrophy. Cornea 2024; 43:746-750. [PMID: 38016014 DOI: 10.1097/ico.0000000000003439] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/07/2023] [Accepted: 10/30/2023] [Indexed: 11/30/2023]
Abstract
PURPOSE ChatGPT is a commonly used source of information by patients and clinicians. However, it can be prone to error and requires validation. We sought to assess the quality and accuracy of information regarding corneal transplantation and Fuchs dystrophy from 2 iterations of ChatGPT, and whether its answers improve over time. METHODS A total of 10 corneal specialists collaborated to assess responses of the algorithm to 10 commonly asked questions related to endothelial keratoplasty and Fuchs dystrophy. These questions were asked from both ChatGPT-3.5 and its newer generation, GPT-4. Assessments tested quality, safety, accuracy, and bias of information. Chi-squared, Fisher exact tests, and regression analyses were conducted. RESULTS We analyzed 180 valid responses. On a 1 (A+) to 5 (F) scale, the average score given by all specialists across questions was 2.5 for ChatGPT-3.5 and 1.4 for GPT-4, a significant improvement ( P < 0.0001). Most responses by both ChatGPT-3.5 (61%) and GPT-4 (89%) used correct facts, a proportion that significantly improved across iterations ( P < 0.00001). Approximately a third (35%) of responses from ChatGPT-3.5 were considered against the scientific consensus, a notable rate of error that decreased to only 5% of answers from GPT-4 ( P < 0.00001). CONCLUSIONS The quality of responses in ChatGPT significantly improved between versions 3.5 and 4, and the odds of providing information against the scientific consensus decreased. However, the technology is still capable of producing inaccurate statements. Corneal specialists are uniquely positioned to assist users to discern the veracity and application of such information.
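The chi-squared and Fisher exact analyses mentioned above compare response categories between model versions. The Python sketch below shows how such a comparison could be run with SciPy on a 2 x 2 table; the counts are illustrative approximations of the reported proportions, not the study's exact data.

```python
# Sketch (illustrative counts): comparing the proportion of answers judged
# against scientific consensus between two model versions. Requires SciPy.
from scipy.stats import fisher_exact, chi2_contingency

#        against consensus, consistent with consensus
table = [[31, 59],    # e.g. ChatGPT-3.5 responses
         [ 5, 85]]    # e.g. GPT-4 responses

odds_ratio, p_fisher = fisher_exact(table)
chi2, p_chi2, dof, _ = chi2_contingency(table)

print(f"Fisher exact: OR={odds_ratio:.2f}, p={p_fisher:.4g}")
print(f"Chi-squared: chi2={chi2:.2f}, dof={dof}, p={p_chi2:.4g}")
```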
Collapse
Affiliation(s)
| | | | | | | | | | | | | | | | - Shelly H Watson
- Northern Virginia Ophthalmology Associates, Falls Church, VA
| | | | | |
Collapse
|
36
|
Mousavi M, Shafiee S, Harley JM, Cheung JCK, Abbasgholizadeh Rahimi S. Performance of generative pre-trained transformers (GPTs) in Certification Examination of the College of Family Physicians of Canada. Fam Med Community Health 2024; 12:e002626. [PMID: 38806403 PMCID: PMC11138270 DOI: 10.1136/fmch-2023-002626] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/30/2024] Open
Abstract
INTRODUCTION The application of large language models such as generative pre-trained transformers (GPTs) has been promising in medical education, and their performance has been tested for different medical exams. This study aims to assess the performance of GPTs in responding to a set of sample questions of short-answer management problems (SAMPs) from the certification exam of the College of Family Physicians of Canada (CFPC). METHOD Between August 8th and 25th, 2023, we used GPT-3.5 and GPT-4 in five rounds to answer a sample of 77 SAMPs questions from the CFPC website. Two independent certified family physician reviewers scored AI-generated responses twice: first, according to the CFPC answer key (ie, CFPC score), and second, based on their knowledge and other references (ie, Reviewers' score). An ordinal logistic generalised estimating equations (GEE) model was applied to analyse repeated measures across the five rounds. RESULT According to the CFPC answer key, 607 (73.6%) lines of answers by GPT-3.5 and 691 (81%) by GPT-4 were deemed accurate. Reviewers' scoring suggested that about 84% of the lines of answers provided by GPT-3.5 and 93% of GPT-4 were correct. The GEE analysis confirmed that over five rounds, the likelihood of achieving a higher CFPC score percentage was 2.31 times greater for GPT-4 than for GPT-3.5 (OR: 2.31; 95% CI: 1.53 to 3.47; p<0.001). Similarly, the Reviewers' score percentages for responses provided by GPT-4 over the 5 rounds were 2.23 times more likely to exceed those of GPT-3.5 (OR: 2.23; 95% CI: 1.22 to 4.06; p=0.009). Rerunning the GPTs after a one-week interval, regenerating the prompt, or using versus not using the prompt did not significantly change the CFPC score percentage. CONCLUSION In our study, we used GPT-3.5 and GPT-4 to answer complex, open-ended sample questions of the CFPC exam and showed that more than 70% of the answers were accurate, and GPT-4 outperformed GPT-3.5 in responding to the questions. Large language models such as GPTs seem promising for assisting candidates of the CFPC exam by providing potential answers. However, their use for family medicine education and exam preparation needs further study.
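The repeated-measures analysis described above uses an ordinal logistic GEE model. The sketch below shows a simplified GEE of the same flavour in Python (statsmodels), reduced to a binomial outcome (answer line correct or not) clustered by question rather than the authors' ordinal specification; all data and column names are invented.

```python
# Hedged sketch: a generalized estimating equations (GEE) model for repeated
# measures across rounds, simplified to a binomial outcome instead of the
# authors' ordinal specification. Requires pandas and statsmodels.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "question": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5] * 2,   # cluster: same question, repeated rounds
    "model":    ["gpt35"] * 10 + ["gpt4"] * 10,
    "round":    [1, 2] * 10,
    "correct":  [1, 0, 1, 1, 0, 1, 1, 1, 0, 0,
                 1, 1, 1, 1, 1, 0, 1, 1, 1, 0],
})

gee = smf.gee("correct ~ model + round", groups="question", data=df,
              family=sm.families.Binomial(),
              cov_struct=sm.cov_struct.Exchangeable())
result = gee.fit()
print(result.summary())
```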
Affiliation(s)
- Mehdi Mousavi
- Department of Family Medicine, Faculty of Medicine, University of Saskatchewan, Nipawin, Saskatchewan, Canada
- Shabnam Shafiee
- Department of Family Medicine, Saskatchewan Health Authority, Riverside Health Complex, Turtleford, Saskatchewan, Canada
- Jason M Harley
- Department of Surgery, Faculty of Medicine and Health Sciences, McGill University, Montreal, Quebec, Canada
- Research Institute of the McGill University Health Centre, Montreal, Quebec, Canada
- Institute for Health Sciences Education, Faculty of Medicine and Health Sciences, McGill University, Montreal, Quebec, Canada
- Jackie Chi Kit Cheung
- McGill University School of Computer Science, Montreal, Quebec, Canada
- CIFAR AI Chair, Mila-Quebec AI Institute, Montreal, Quebec, Canada
- Samira Abbasgholizadeh Rahimi
- Department of Family Medicine, McGill University, Montreal, Quebec, Canada
- Mila Quebec AI-Institute, Montreal, Quebec, Canada
- Faculty of Dentistry Medicine and Oral Health Sciences, McGill University, Montreal, Quebec, Canada
- Lady Davis Institute for Medical Research, Jewish General Hospital, Montreal, Quebec, Canada
37
Pardos ZA, Bhandari S. ChatGPT-generated help produces learning gains equivalent to human tutor-authored help on mathematics skills. PLoS One 2024; 19:e0304013. [PMID: 38787823 PMCID: PMC11125466 DOI: 10.1371/journal.pone.0304013] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/15/2023] [Accepted: 05/03/2024] [Indexed: 05/26/2024] Open
Abstract
Authoring of help content within educational technologies is labor intensive, requiring many iterations of content creation, refining, and proofreading. In this paper, we conduct an efficacy evaluation of ChatGPT-generated help using a 3 x 4 study design (N = 274) to compare the learning gains of ChatGPT to human tutor-authored help across four mathematics problem subject areas. Participants are randomly assigned to one of three hint conditions (control, human tutor, or ChatGPT) paired with one of four randomly assigned subject areas (Elementary Algebra, Intermediate Algebra, College Algebra, or Statistics). We find that only the ChatGPT condition produces statistically significant learning gains compared to a no-help control, with no statistically significant differences in gains or time-on-task observed between learners receiving ChatGPT vs human tutor help. Notably, ChatGPT-generated help failed quality checks on 32% of problems. This was, however, reducible to nearly 0% for algebra problems and 13% for statistics problems after applying self-consistency, a "hallucination" mitigation technique for Large Language Models.
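The "self-consistency" mitigation mentioned above amounts to sampling several answers and keeping the one most often produced. A minimal sketch follows; generate_answer() is a hypothetical placeholder for whatever LLM call is used, and the 0.6 agreement threshold is an arbitrary choice.

# Sketch: majority voting over repeated samples, flagging low-agreement items.
from collections import Counter

def generate_answer(problem: str) -> str:
    raise NotImplementedError("placeholder for an LLM call")

def self_consistent_answer(problem: str, n_samples: int = 5, min_agreement: float = 0.6):
    samples = [generate_answer(problem) for _ in range(n_samples)]
    answer, count = Counter(samples).most_common(1)[0]
    agreement = count / n_samples
    # items below the agreement threshold would be routed to a human reviewer
    return answer, agreement >= min_agreement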
Affiliation(s)
- Zachary A. Pardos
- Berkeley School of Education, University of California, Berkeley, California, United States of America
- Shreya Bhandari
- Electrical Engineering and Computer Science, University of California, Berkeley, California, United States of America
38
Jindal JA, Lungren MP, Shah NH. Ensuring useful adoption of generative artificial intelligence in healthcare. J Am Med Inform Assoc 2024; 31:1441-1444. [PMID: 38452298 PMCID: PMC11105148 DOI: 10.1093/jamia/ocae043] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/22/2023] [Revised: 02/01/2024] [Accepted: 02/22/2024] [Indexed: 03/09/2024] Open
Abstract
OBJECTIVES This article aims to examine how generative artificial intelligence (AI) can be adopted with the most value in health systems, in response to the Executive Order on AI. MATERIALS AND METHODS We reviewed how technology has historically been deployed in healthcare, and evaluated recent examples of deployments of both traditional AI and generative AI (GenAI) with a lens on value. RESULTS Traditional AI and GenAI are different technologies in terms of their capability and modes of current deployment, which have implications on value in health systems. DISCUSSION Traditional AI when applied with a framework top-down can realize value in healthcare. GenAI in the short term when applied top-down has unclear value, but encouraging more bottom-up adoption has the potential to provide more benefit to health systems and patients. CONCLUSION GenAI in healthcare can provide the most value for patients when health systems adapt culturally to grow with this new technology and its adoption patterns.
Affiliation(s)
- Jenelle A Jindal
- Center for Biomedical Informatics Research, Stanford University, Stanford, CA 94305, United States
- Matthew P Lungren
- Health and Life Sciences, Microsoft Corporation, Redmond, WA 98052, United States
- Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA 94305, United States
- Department of Biomedical Imaging, University of California San Francisco, San Francisco, CA 94143, United States
- Nigam H Shah
- Department of Medicine, Stanford School of Medicine, Stanford, CA 94304, United States
- Clinical Excellence Research Center, Stanford School of Medicine, Stanford, CA 94304, United States
- Technology and Digital Solutions, Stanford Health Care, Palo Alto, CA 94304, United States
39
Harada Y, Sakamoto T, Sugimoto S, Shimizu T. Longitudinal Changes in Diagnostic Accuracy of a Differential Diagnosis List Developed by an AI-Based Symptom Checker: Retrospective Observational Study. JMIR Form Res 2024; 8:e53985. [PMID: 38758588 PMCID: PMC11143391 DOI: 10.2196/53985] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/26/2023] [Revised: 03/23/2024] [Accepted: 04/24/2024] [Indexed: 05/18/2024] Open
Abstract
BACKGROUND Artificial intelligence (AI) symptom checker models should be trained using real-world patient data to improve their diagnostic accuracy. Given that AI-based symptom checkers are currently used in clinical practice, their performance should improve over time. However, longitudinal evaluations of the diagnostic accuracy of these symptom checkers are limited. OBJECTIVE This study aimed to assess the longitudinal changes in the accuracy of differential diagnosis lists created by an AI-based symptom checker used in the real world. METHODS This was a single-center, retrospective, observational study. Patients who visited an outpatient clinic without an appointment between May 1, 2019, and April 30, 2022, and who were admitted to a community hospital in Japan within 30 days of their index visit were considered eligible. We only included patients who underwent an AI-based symptom checkup at the index visit, and the diagnosis was finally confirmed during follow-up. Final diagnoses were categorized as common or uncommon, and all cases were categorized as typical or atypical. The primary outcome measure was the accuracy of the differential diagnosis list created by the AI-based symptom checker, defined as the final diagnosis in a list of 10 differential diagnoses created by the symptom checker. To assess the change in the symptom checker's diagnostic accuracy over 3 years, we used a chi-square test to compare the primary outcome over 3 periods: from May 1, 2019, to April 30, 2020 (first year); from May 1, 2020, to April 30, 2021 (second year); and from May 1, 2021, to April 30, 2022 (third year). RESULTS A total of 381 patients were included. Common diseases comprised 257 (67.5%) cases, and typical presentations were observed in 298 (78.2%) cases. Overall, the accuracy of the differential diagnosis list created by the AI-based symptom checker was 172 (45.1%), which did not differ across the 3 years (first year: 97/219, 44.3%; second year: 32/72, 44.4%; and third year: 43/90, 47.7%; P=.85). The accuracy of the differential diagnosis list created by the symptom checker was low in those with uncommon diseases (30/124, 24.2%) and atypical presentations (12/83, 14.5%). In the multivariate logistic regression model, common disease (P<.001; odds ratio 4.13, 95% CI 2.50-6.98) and typical presentation (P<.001; odds ratio 6.92, 95% CI 3.62-14.2) were significantly associated with the accuracy of the differential diagnosis list created by the symptom checker. CONCLUSIONS A 3-year longitudinal survey of the diagnostic accuracy of differential diagnosis lists developed by an AI-based symptom checker, which has been implemented in real-world clinical practice settings, showed no improvement over time. Uncommon diseases and atypical presentations were independently associated with a lower diagnostic accuracy. In the future, symptom checkers should be trained to recognize uncommon conditions.
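The year-over-year comparison reported above can be reproduced in outline with a chi-square test on the published counts (97/219, 32/72, and 43/90). The short Python sketch below assumes scipy is available and is not the authors' code; the adjusted odds ratios in the abstract would come from exponentiating the coefficients of a separate logistic model fitted to per-patient data.

# Sketch: chi-square test of top-10 list accuracy across the three study years.
from scipy.stats import chi2_contingency

hits   = [97, 32, 43]                      # correct lists per year, from the abstract
misses = [219 - 97, 72 - 32, 90 - 43]      # incorrect lists per year
chi2, p, dof, expected = chi2_contingency([hits, misses])
print(f"chi-square = {chi2:.2f}, dof = {dof}, p = {p:.2f}")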
Affiliation(s)
- Yukinori Harada
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga, Japan
- Department of General Medicine, Nagano Chuo Hospital, Nagano, Japan
- Tetsu Sakamoto
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga, Japan
- Shu Sugimoto
- Department of Medicine (Neurology and Rheumatology), Shinshu University School of Medicine, Matsumoto, Japan
- Taro Shimizu
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga, Japan
40
Yanagita Y, Yokokawa D, Fukuzawa F, Uchida S, Uehara T, Ikusaka M. Expert assessment of ChatGPT's ability to generate illness scripts: an evaluative study. BMC MEDICAL EDUCATION 2024; 24:536. [PMID: 38750546 PMCID: PMC11095028 DOI: 10.1186/s12909-024-05534-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 01/08/2024] [Accepted: 05/08/2024] [Indexed: 05/19/2024]
Abstract
BACKGROUND An illness script is a specific script format geared to represent patient-oriented clinical knowledge organized around enabling conditions, faults (i.e., pathophysiological process), and consequences. Generative artificial intelligence (AI) stands out as an educational aid in continuing medical education. The effortless creation of a typical illness script by generative AI could help the comprehension of key features of diseases and increase diagnostic accuracy. No systematic summary of specific examples of illness scripts has been reported since illness scripts are unique to each physician. OBJECTIVE This study investigated whether generative AI can generate illness scripts. METHODS We utilized ChatGPT-4, a generative AI, to create illness scripts for 184 diseases based on the diseases and conditions integral to the National Model Core Curriculum in Japan for undergraduate medical education (2022 revised edition) and primary care specialist training in Japan. Three physicians applied a three-tier grading scale: "A" denotes that the content of each disease's illness script proves sufficient for training medical students, "B" denotes that it is partially lacking but acceptable, and "C" denotes that it is deficient in multiple respects. RESULTS By leveraging ChatGPT-4, we successfully generated each component of the illness script for 184 diseases without any omission. The illness scripts received "A," "B," and "C" ratings of 56.0% (103/184), 28.3% (52/184), and 15.8% (29/184), respectively. CONCLUSION Useful illness scripts were seamlessly and instantaneously created using ChatGPT-4 by employing prompts appropriate for medical students. The technology-driven illness script is a valuable tool for introducing medical students to key features of diseases.
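Generating a structured illness script for each disease, as described above, is essentially a prompting loop. The sketch below shows one way to do it with the OpenAI Python client; the prompt wording, model name, and disease list are illustrative assumptions rather than the authors' protocol.

# Sketch: requesting a three-part illness script per disease.
# Assumes the OpenAI Python SDK (>=1.0) and an API key in the environment.
from openai import OpenAI

client = OpenAI()
diseases = ["acute appendicitis", "community-acquired pneumonia"]  # illustrative subset

prompt_template = (
    "For {disease}, write an illness script for medical students with three parts: "
    "enabling conditions, fault (pathophysiological process), and consequences "
    "(symptoms and signs). Keep each part to two or three sentences."
)

scripts = {}
for disease in diseases:
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder model name
        messages=[{"role": "user", "content": prompt_template.format(disease=disease)}],
    )
    scripts[disease] = response.choices[0].message.content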
Affiliation(s)
- Yasutaka Yanagita
- Department of General Medicine, Chiba University Hospital, 1-8-1, Inohana, Chuo-Ku, Chiba, Chiba Pref, Japan.
- Daiki Yokokawa
- Department of General Medicine, Chiba University Hospital, 1-8-1, Inohana, Chuo-Ku, Chiba, Chiba Pref, Japan
- Fumitoshi Fukuzawa
- Department of General Medicine, Chiba University Hospital, 1-8-1, Inohana, Chuo-Ku, Chiba, Chiba Pref, Japan
- Shun Uchida
- Uchida Internal Medicine Clinic, Saitama, Japan
- Takanori Uehara
- Department of General Medicine, Chiba University Hospital, 1-8-1, Inohana, Chuo-Ku, Chiba, Chiba Pref, Japan
- Masatomi Ikusaka
- Department of General Medicine, Chiba University Hospital, 1-8-1, Inohana, Chuo-Ku, Chiba, Chiba Pref, Japan
41
Makhoul M, Melkane AE, Khoury PE, Hadi CE, Matar N. A cross-sectional comparative study: ChatGPT 3.5 versus diverse levels of medical experts in the diagnosis of ENT diseases. Eur Arch Otorhinolaryngol 2024; 281:2717-2721. [PMID: 38365990 DOI: 10.1007/s00405-024-08509-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/01/2024] [Accepted: 01/24/2024] [Indexed: 02/18/2024]
Abstract
PURPOSE With recent advances in artificial intelligence (AI), it has become crucial to thoroughly evaluate its applicability in healthcare. This study aimed to assess the accuracy of ChatGPT in diagnosing ear, nose, and throat (ENT) pathology, and comparing its performance to that of medical experts. METHODS We conducted a cross-sectional comparative study where 32 ENT cases were presented to ChatGPT 3.5, ENT physicians, ENT residents, family medicine (FM) specialists, second-year medical students (Med2), and third-year medical students (Med3). Each participant provided three differential diagnoses. The study analyzed diagnostic accuracy rates and inter-rater agreement within and between participant groups and ChatGPT. RESULTS The accuracy rate of ChatGPT was 70.8%, being not significantly different from ENT physicians or ENT residents. However, a significant difference in correctness rate existed between ChatGPT and FM specialists (49.8%, p < 0.001), and between ChatGPT and medical students (Med2 47.5%, p < 0.001; Med3 47%, p < 0.001). Inter-rater agreement for the differential diagnosis between ChatGPT and each participant group was either poor or fair. In 68.75% of cases, ChatGPT failed to mention the most critical diagnosis. CONCLUSIONS ChatGPT demonstrated accuracy comparable to that of ENT physicians and ENT residents in diagnosing ENT pathology, outperforming FM specialists, Med2 and Med3. However, it showed limitations in identifying the most critical diagnosis.
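Inter-rater agreement of the kind summarized above ("poor or fair") is commonly quantified with Cohen's kappa. A small Python sketch follows; the diagnosis labels are invented examples and scikit-learn is assumed to be available.

# Sketch: agreement between ChatGPT's first-choice diagnosis and one rater group.
from sklearn.metrics import cohen_kappa_score

chatgpt_dx  = ["otitis media", "BPPV", "sinusitis", "laryngitis", "otitis media"]
resident_dx = ["otitis media", "vestibular neuritis", "sinusitis", "reflux", "otitis externa"]

kappa = cohen_kappa_score(chatgpt_dx, resident_dx)
print(f"Cohen's kappa = {kappa:.2f}")  # <0.20 poor, 0.21-0.40 fair, and so on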
Affiliation(s)
- Mikhael Makhoul
- Department of Otolaryngology-Head and Neck Surgery, Hotel Dieu de France Hospital, Saint Joseph University, Alfred Naccache Boulevard, Ashrafieh, PO Box: 166830, Beirut, Lebanon.
- Antoine E Melkane
- Department of Otolaryngology-Head and Neck Surgery, Hotel Dieu de France Hospital, Saint Joseph University, Alfred Naccache Boulevard, Ashrafieh, PO Box: 166830, Beirut, Lebanon
- Patrick El Khoury
- Department of Otolaryngology-Head and Neck Surgery, Hotel Dieu de France Hospital, Saint Joseph University, Alfred Naccache Boulevard, Ashrafieh, PO Box: 166830, Beirut, Lebanon
- Christopher El Hadi
- Department of Otolaryngology-Head and Neck Surgery, Hotel Dieu de France Hospital, Saint Joseph University, Alfred Naccache Boulevard, Ashrafieh, PO Box: 166830, Beirut, Lebanon
- Nayla Matar
- Department of Otolaryngology-Head and Neck Surgery, Hotel Dieu de France Hospital, Saint Joseph University, Alfred Naccache Boulevard, Ashrafieh, PO Box: 166830, Beirut, Lebanon
42
Guimaraes GR, Figueiredo RG, Silva CS, Arata V, Contreras JCZ, Gomes CM, Tiraboschi RB, Bessa Junior J. Diagnosis in Bytes: Comparing the Diagnostic Accuracy of Google and ChatGPT 3.5 as an Educational Support Tool. INTERNATIONAL JOURNAL OF ENVIRONMENTAL RESEARCH AND PUBLIC HEALTH 2024; 21:580. [PMID: 38791794 PMCID: PMC11120721 DOI: 10.3390/ijerph21050580] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 03/19/2024] [Revised: 04/27/2024] [Accepted: 04/29/2024] [Indexed: 05/26/2024]
Abstract
BACKGROUND Adopting advanced digital technologies as diagnostic support tools in healthcare is an unquestionable trend accelerated by the COVID-19 pandemic. However, their accuracy in suggesting diagnoses remains controversial and needs to be explored. We aimed to evaluate and compare the diagnostic accuracy of two free accessible internet search tools: Google and ChatGPT 3.5. METHODS To assess the effectiveness of both medical platforms, we conducted evaluations using a sample of 60 clinical cases related to urological pathologies. We organized the urological cases into two distinct categories for our analysis: (i) prevalent conditions, which were compiled using the most common symptoms, as outlined by EAU and UpToDate guidelines, and (ii) unusual disorders, identified through case reports published in the 'Urology Case Reports' journal from 2022 to 2023. The outcomes were meticulously classified into three categories to determine the accuracy of each platform: "correct diagnosis", "likely differential diagnosis", and "incorrect diagnosis". A group of experts evaluated the responses blindly and randomly. RESULTS For commonly encountered urological conditions, Google's accuracy was 53.3%, with an additional 23.3% of its results falling within a plausible range of differential diagnoses, and the remaining outcomes were incorrect. ChatGPT 3.5 outperformed Google with an accuracy of 86.6%, provided a likely differential diagnosis in 13.3% of cases, and made no unsuitable diagnosis. In evaluating unusual disorders, Google failed to deliver any correct diagnoses but proposed a likely differential diagnosis in 20% of cases. ChatGPT 3.5 identified the proper diagnosis in 16.6% of rare cases and offered a reasonable differential diagnosis in half of the cases. CONCLUSION ChatGPT 3.5 demonstrated higher diagnostic accuracy than Google in both contexts. The platform showed satisfactory accuracy when diagnosing common cases, yet its performance in identifying rare conditions remains limited.
Affiliation(s)
- Guilherme R. Guimaraes
- Programa de Pós-Graduação em Saúde Coletiva, Universidade Estadual de Feira de Santana (UEFS), Feira de Santana 44.036-900, Brazil; (G.R.G.); (C.S.S.); (V.A.); (J.C.Z.C.); (R.B.T.); (J.B.J.)
- Ricardo G. Figueiredo
- Programa de Pós-Graduação em Saúde Coletiva, Universidade Estadual de Feira de Santana (UEFS), Feira de Santana 44.036-900, Brazil; (G.R.G.); (C.S.S.); (V.A.); (J.C.Z.C.); (R.B.T.); (J.B.J.)
- Caroline Santos Silva
- Programa de Pós-Graduação em Saúde Coletiva, Universidade Estadual de Feira de Santana (UEFS), Feira de Santana 44.036-900, Brazil; (G.R.G.); (C.S.S.); (V.A.); (J.C.Z.C.); (R.B.T.); (J.B.J.)
- Vanessa Arata
- Programa de Pós-Graduação em Saúde Coletiva, Universidade Estadual de Feira de Santana (UEFS), Feira de Santana 44.036-900, Brazil; (G.R.G.); (C.S.S.); (V.A.); (J.C.Z.C.); (R.B.T.); (J.B.J.)
- Jean Carlos Z. Contreras
- Programa de Pós-Graduação em Saúde Coletiva, Universidade Estadual de Feira de Santana (UEFS), Feira de Santana 44.036-900, Brazil; (G.R.G.); (C.S.S.); (V.A.); (J.C.Z.C.); (R.B.T.); (J.B.J.)
- Cristiano M. Gomes
- Faculty of Medicine, Universidade de São Paulo (USP), São Paulo 01.246-904, Brazil;
- Ricardo B. Tiraboschi
- Programa de Pós-Graduação em Saúde Coletiva, Universidade Estadual de Feira de Santana (UEFS), Feira de Santana 44.036-900, Brazil; (G.R.G.); (C.S.S.); (V.A.); (J.C.Z.C.); (R.B.T.); (J.B.J.)
- José Bessa Junior
- Programa de Pós-Graduação em Saúde Coletiva, Universidade Estadual de Feira de Santana (UEFS), Feira de Santana 44.036-900, Brazil; (G.R.G.); (C.S.S.); (V.A.); (J.C.Z.C.); (R.B.T.); (J.B.J.)
43
Kedia N, Sanjeev S, Ong J, Chhablani J. ChatGPT and Beyond: An overview of the growing field of large language models and their use in ophthalmology. Eye (Lond) 2024; 38:1252-1261. [PMID: 38172581 PMCID: PMC11076576 DOI: 10.1038/s41433-023-02915-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/26/2023] [Revised: 11/23/2023] [Accepted: 12/20/2023] [Indexed: 01/05/2024] Open
Abstract
ChatGPT, an artificial intelligence (AI) chatbot built on large language models (LLMs), has rapidly gained popularity. The benefits and limitations of this transformative technology have been discussed across various fields, including medicine. The widespread availability of ChatGPT has enabled clinicians to study how these tools could be used for a variety of tasks such as generating differential diagnosis lists, organizing patient notes, and synthesizing literature for scientific research. LLMs have shown promising capabilities in ophthalmology by performing well on the Ophthalmic Knowledge Assessment Program, providing fairly accurate responses to questions about retinal diseases, and in generating differential diagnoses list. There are current limitations to this technology, including the propensity of LLMs to "hallucinate", or confidently generate false information; their potential role in perpetuating biases in medicine; and the challenges in incorporating LLMs into research without allowing "AI-plagiarism" or publication of false information. In this paper, we provide a balanced overview of what LLMs are and introduce some of the LLMs that have been generated in the past few years. We discuss recent literature evaluating the role of these language models in medicine with a focus on ChatGPT. The field of AI is fast-paced, and new applications based on LLMs are being generated rapidly; therefore, it is important for ophthalmologists to be aware of how this technology works and how it may impact patient care. Here, we discuss the benefits, limitations, and future advancements of LLMs in patient care and research.
Affiliation(s)
- Nikita Kedia
- Department of Ophthalmology, University of Pittsburgh School of Medicine, Pittsburgh, PA, USA
- Joshua Ong
- Department of Ophthalmology and Visual Sciences, University of Michigan Kellogg Eye Center, Ann Arbor, MI, USA
- Jay Chhablani
- Department of Ophthalmology, University of Pittsburgh School of Medicine, Pittsburgh, PA, USA.
44
Farhat F. ChatGPT as a Complementary Mental Health Resource: A Boon or a Bane. Ann Biomed Eng 2024; 52:1111-1114. [PMID: 37477707 DOI: 10.1007/s10439-023-03326-7] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/14/2023] [Accepted: 07/17/2023] [Indexed: 07/22/2023]
Abstract
The launch of Open AI's chatbot, ChatGPT, has generated a lot of attention and discussion among professionals in several fields. Many concerns and challenges have been brought up by researchers from various fields, particularly in relation to the harm that using these tools for medical diagnosis and treatment recommendations can cause. In addition, it has been debated if ChatGPT is dependable, efficient, and helpful for clinicians and medical professionals. Therefore, in this study, we assess ChatGPT's effectiveness in providing mental health support, particularly for issues related to anxiety and depression, based on the chatbot's responses and cross-questioning. The findings indicate that there are significant inconsistencies and that ChatGPT's reliability is low in this specific domain. As a result, care must be used when using ChatGPT as a complementary mental health resource.
Affiliation(s)
- Faiza Farhat
- Section of Parasitology, Department of Zoology, Aligarh Muslim University, Aligarh, UP, 202002, India.
45
Rokhshad R, Zhang P, Mohammad-Rahimi H, Pitchika V, Entezari N, Schwendicke F. Accuracy and consistency of chatbots versus clinicians for answering pediatric dentistry questions: A pilot study. J Dent 2024; 144:104938. [PMID: 38499280 DOI: 10.1016/j.jdent.2024.104938] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/11/2023] [Revised: 03/06/2024] [Accepted: 03/11/2024] [Indexed: 03/20/2024] Open
Abstract
OBJECTIVES Artificial Intelligence has applications such as Large Language Models (LLMs), which simulate human-like conversations. The potential of LLMs in healthcare is not fully evaluated. This pilot study assessed the accuracy and consistency of chatbots and clinicians in answering common questions in pediatric dentistry. METHODS Two expert pediatric dentists developed thirty true or false questions involving different aspects of pediatric dentistry. Publicly accessible chatbots (Google Bard, ChatGPT 4, ChatGPT 3.5, Llama, Sage, Claude 2 100k, Claude-instant, Claude-instant-100k, and Google Palm) were employed to answer the questions (3 independent new conversations). Three groups of clinicians (general dentists, pediatric specialists, and students; n = 20/group) also answered. Responses were graded by two pediatric dentistry faculty members, along with a third independent pediatric dentist. Resulting accuracies (percentage of correct responses) were compared using analysis of variance (ANOVA), and post-hoc pairwise group comparisons were corrected using Tukey's HSD method. Cronbach's alpha was calculated to determine consistency. RESULTS Pediatric dentists were significantly more accurate (mean ± SD: 96.67% ± 4.3%) than other clinicians and chatbots (p < 0.001). General dentists (88.0% ± 6.1%) also demonstrated significantly higher accuracy than chatbots (p < 0.001), followed by students (80.8% ± 6.9%). ChatGPT showed the highest accuracy (78% ± 3%) among chatbots. All chatbots except ChatGPT 3.5 showed acceptable consistency (Cronbach's alpha > 0.7). CLINICAL SIGNIFICANCE Based on this pilot study, chatbots may be valuable adjuncts for educational purposes and for distributing information to patients. However, they are not yet ready to serve as substitutes for human clinicians in diagnostic decision-making. CONCLUSION In this pilot study, chatbots showed lower accuracy than dentists. Chatbots may not yet be recommended for clinical pediatric dentistry.
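A rough Python sketch of the analyses named above (one-way ANOVA with Tukey's HSD on per-respondent accuracy, and Cronbach's alpha across repeated chatbot conversations) follows. The accuracy values are made-up placeholders, and the alpha helper implements the standard formula rather than the authors' exact procedure.

# Sketch: group comparison of accuracies and a consistency coefficient (toy data).
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

pediatric = np.array([96, 100, 93, 97])   # accuracy (%) per pediatric dentist (made up)
general   = np.array([88, 90, 85, 89])
chatbot   = np.array([77, 78, 80, 76])

print(f_oneway(pediatric, general, chatbot))             # one-way ANOVA
scores = np.concatenate([pediatric, general, chatbot])
groups = ["pediatric"] * 4 + ["general"] * 4 + ["chatbot"] * 4
print(pairwise_tukeyhsd(scores, groups))                 # Tukey HSD post hoc

def cronbach_alpha(runs: np.ndarray) -> float:
    """runs: rows = questions, columns = repeated conversations (1 = correct)."""
    k = runs.shape[1]
    item_var = runs.var(axis=0, ddof=1).sum()
    total_var = runs.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var / total_var)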
Affiliation(s)
- Rata Rokhshad
- Department of Pediatric Dentistry, University of Alabama at Birmingham, Birmingham, AL, USA.
- Ping Zhang
- Department of Pediatric Dentistry, University of Alabama at Birmingham, Birmingham, AL, USA
- Hossein Mohammad-Rahimi
- Topic Group Dental Diagnostics and Digital Dentistry, ITU/WHO Focus Group AI on Health, Berlin, Germany
- Vinay Pitchika
- Department of Conservative Dentistry and Periodontology, LMU Klinikum Munich, Germany
- Niloufar Entezari
- Department of pediatric dentistry, School of Dentistry, Qom University of Medical Sciences, Qom, Iran
- Falk Schwendicke
- Topic Group Dental Diagnostics and Digital Dentistry, ITU/WHO Focus Group AI on Health, Berlin, Germany; Department of Conservative Dentistry and Periodontology, LMU Klinikum Munich, Germany
46
Li H, Hayward J, Aguilar LS, Franc JM. Desired clinical applications of artificial intelligence in emergency medicine: A Delphi study. Am J Emerg Med 2024; 79:217-220. [PMID: 38458952 DOI: 10.1016/j.ajem.2024.02.031] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/21/2024] [Revised: 02/01/2024] [Accepted: 02/08/2024] [Indexed: 03/10/2024] Open
Affiliation(s)
- Henry Li
- University of Alberta, Faculty of Medicine and Dentistry, Department of Emergency Medicine, 750 University Terrace Building, 8303-112 Street NW, Edmonton T6G 2T4, Canada.
- Jake Hayward
- University of Alberta, Faculty of Medicine and Dentistry, Department of Emergency Medicine, 750 University Terrace Building, 8303-112 Street NW, Edmonton T6G 2T4, Canada
- Leandro Solis Aguilar
- University of Alberta, Faculty of Medicine and Dentistry, Department of Biochemistry, 474 Medical Sciences Building, Edmonton T6G 2H7, Canada
- Jeffrey Michael Franc
- University of Alberta, Faculty of Medicine and Dentistry, Department of Emergency Medicine, 750 University Terrace Building, 8303-112 Street NW, Edmonton T6G 2T4, Canada; Università del Piemonte Orientale, Center for Research and Training in Disaster Medicine, Humanitarian Aid, and Global Health, Via Lanino 1, Novara 28100, Italy
47
Fisher AD, Fisher G. Evaluating performance of custom GPT in anesthesia practice. J Clin Anesth 2024; 93:111371. [PMID: 38154443 DOI: 10.1016/j.jclinane.2023.111371] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/29/2023] [Accepted: 12/21/2023] [Indexed: 12/30/2023]
Affiliation(s)
- Andrew D Fisher
- Medical University of South Carolina, Department of Anesthesia and Perioperative Medicine, 167 Ashley Avenue, Suite 301, Charleston, SC 29464, United States of America.
- Gabrielle Fisher
- Medical University of South Carolina, Department of Anesthesia and Perioperative Medicine, 167 Ashley Avenue, Suite 301, Charleston, SC 29464, United States of America
48
Scott IA, Zuccon G. The new paradigm in machine learning - foundation models, large language models and beyond: a primer for physicians. Intern Med J 2024; 54:705-715. [PMID: 38715436 DOI: 10.1111/imj.16393] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/09/2024] [Accepted: 03/26/2024] [Indexed: 05/18/2024]
Abstract
Foundation machine learning models are deep learning models capable of performing many different tasks using different data modalities such as text, audio, images and video. They represent a major shift from traditional task-specific machine learning prediction models. Large language models (LLMs), brought to wide public prominence in the form of ChatGPT, are text-based foundation models that have the potential to transform medicine by enabling automation of a range of tasks, including writing discharge summaries, answering patients' questions and assisting in clinical decision-making. However, such models are not without risk and can potentially cause harm if their development, evaluation and use are devoid of proper scrutiny. This narrative review describes the different types of LLMs, their emerging applications, potential limitations and biases, and their likely future translation into clinical practice.
Affiliation(s)
- Ian A Scott
- Centre for Health Services Research, University of Queensland, Woolloongabba, Australia
- Guido Zuccon
- School of Electrical Engineering and Computer Sciences, The University of Queensland, St Lucia, Queensland, Australia
49
Knebel D, Priglinger S, Scherer N, Klaas J, Siedlecki J, Schworm B. Assessment of ChatGPT in the Prehospital Management of Ophthalmological Emergencies - An Analysis of 10 Fictional Case Vignettes. Klin Monbl Augenheilkd 2024; 241:675-681. [PMID: 37890504 DOI: 10.1055/a-2149-0447] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/29/2023]
Abstract
BACKGROUND The artificial intelligence (AI)-based platform ChatGPT (Chat Generative Pre-Trained Transformer, OpenAI LP, San Francisco, CA, USA) has gained impressive popularity in recent months. Its performance on case vignettes of general medical (non-ophthalmological) emergencies has been assessed - with very encouraging results. The purpose of this study was to assess the performance of ChatGPT on ophthalmological emergency case vignettes in terms of the main outcome measures triage accuracy, appropriateness of recommended prehospital measures, and overall potential to inflict harm to the user/patient. METHODS We wrote ten short, fictional case vignettes describing different acute ophthalmological symptoms. Each vignette was entered into ChatGPT five times with the same wording and following a standardized interaction pathway. The answers were analyzed following a systematic approach. RESULTS We observed a triage accuracy of 93.6%. Most answers contained only appropriate recommendations for prehospital measures. However, an overall potential to inflict harm to users/patients was present in 32% of answers. CONCLUSION ChatGPT should presently not be used as a stand-alone primary source of information about acute ophthalmological symptoms. As AI continues to evolve, its safety and efficacy in the prehospital management of ophthalmological emergencies has to be reassessed regularly.
Affiliation(s)
- Dominik Knebel
- Department of Ophthalmology, University Hospital, Ludwigs-Maximilians-Universität München, München, Germany
- Siegfried Priglinger
- Department of Ophthalmology, University Hospital, Ludwigs-Maximilians-Universität München, München, Germany
- Nicolas Scherer
- Department of Ophthalmology, University Hospital, Ludwigs-Maximilians-Universität München, München, Germany
- Julian Klaas
- Department of Ophthalmology, University Hospital, Ludwigs-Maximilians-Universität München, München, Germany
- Jakob Siedlecki
- Department of Ophthalmology, University Hospital, Ludwigs-Maximilians-Universität München, München, Germany
- Benedikt Schworm
- Department of Ophthalmology, University Hospital, Ludwigs-Maximilians-Universität München, München, Germany
50
Safrai M, Azaria A. Does small talk with a medical provider affect ChatGPT's medical counsel? Performance of ChatGPT on USMLE with and without distractions. PLoS One 2024; 19:e0302217. [PMID: 38687696 PMCID: PMC11060598 DOI: 10.1371/journal.pone.0302217] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/18/2023] [Accepted: 03/28/2024] [Indexed: 05/02/2024] Open
Abstract
Efforts are being made to improve the time effectiveness of healthcare providers. Artificial intelligence tools can help transcribe and summarize physician-patient encounters and produce medical notes and medical recommendations. However, in addition to medical information, discussion between healthcare providers and patients includes small talk and other information irrelevant to medical concerns. As Large Language Models (LLMs) are predictive models that build their responses from the words in the prompt, there is a risk that small talk and irrelevant information may alter the response and the suggestion given. Therefore, this study aims to investigate the impact of medical data mixed with small talk on the accuracy of medical advice provided by ChatGPT. USMLE Step 3 questions were used as a model for relevant medical data. We used both multiple-choice and open-ended questions. First, we gathered small talk sentences from human participants using the Mechanical Turk platform. Second, both sets of USMLE questions were arranged in a pattern where each sentence from the original questions was followed by a small talk sentence. ChatGPT 3.5 and 4 were asked to answer both sets of questions with and without the small talk sentences. Finally, a board-certified physician analyzed the answers by ChatGPT and compared them to the formal correct answer. The analysis results demonstrate that the ability of ChatGPT-3.5 to answer correctly was impaired when small talk was added to medical data (66.8% vs. 56.6%; p = 0.025); the drop was not significant for multiple-choice questions (72.1% vs. 68.9%; p = 0.67) but was significant for open-ended questions (61.5% vs. 44.3%; p = 0.01). In contrast, small talk phrases did not impair ChatGPT-4's performance on either type of question (83.6% and 66.2%, respectively). According to these results, ChatGPT-4 seems more accurate than the earlier 3.5 version, and it appears that small talk does not impair its capability to provide medical recommendations. Our results are an important first step in understanding the potential and limitations of utilizing ChatGPT and other LLMs for physician-patient interactions, which include casual conversations.
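The prompt construction described above, in which each sentence of a question is followed by an unrelated small-talk sentence, can be sketched in a few lines of Python. The question and small-talk sentences below are invented examples, not items from the study.

# Sketch: interleaving small-talk sentences between the sentences of a question.
import re

def interleave_small_talk(question: str, small_talk: list[str]) -> str:
    sentences = re.split(r"(?<=[.!?])\s+", question.strip())
    mixed = []
    for i, sentence in enumerate(sentences):
        mixed.append(sentence)
        mixed.append(small_talk[i % len(small_talk)])  # recycle small talk if needed
    return " ".join(mixed)

question = ("A 62-year-old man presents with chest pain. "
            "His blood pressure is 150/90 mm Hg. "
            "What is the next best step in management?")
chatter = ["My neighbor just got a new puppy.",
           "I really need a coffee this morning.",
           "The traffic was terrible today."]
print(interleave_small_talk(question, chatter))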
Affiliation(s)
- Myriam Safrai
- Department of Obstetrics and Gynecology, Chaim Sheba Medical Center (Tel Hashomer), Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
- Department of Obstetrics, Gynecology and Reproductive Sciences, Magee-Womens Research Institute, University of Pittsburgh School of Medicine, Pittsburgh, PA, United States of America
- Amos Azaria
- School of Computer Science, Ariel University, Ari’el, Israel