1
Merlino DJ, Brufau SR, Saieed G, Van Abel KM, Price DL, Archibald DJ, Ator GA, Carlson ML. Comparative Assessment of Otolaryngology Knowledge Among Large Language Models. Laryngoscope 2025; 135:629-634. [PMID: 39305216] [DOI: 10.1002/lary.31781]
Abstract
OBJECTIVE The purpose of this study was to evaluate the performance of advanced large language models from OpenAI (GPT-3.5 and GPT-4), Google (PaLM2 and MedPaLM), and an open-source model from Meta (Llama3:70b) in answering clinical multiple-choice questions in the field of otolaryngology-head and neck surgery. METHODS A dataset of 4566 otolaryngology questions was used; each model was provided a standardized prompt followed by a question. One hundred questions that were answered incorrectly by all models were further interrogated to gain insight into the causes of incorrect answers. RESULTS GPT-4 was the most accurate, correctly answering 3520 of 4566 questions (77.1%). MedPaLM correctly answered 3223 of 4566 (70.6%) questions, while Llama3:70b, GPT-3.5, and PaLM2 were correct on 3052 of 4566 (66.8%), 2672 of 4566 (58.5%), and 2583 of 4566 (56.5%) questions, respectively. Three hundred and sixty-nine questions were answered incorrectly by all models. Prompts to provide reasoning improved accuracy in all models: GPT-4 changed from an incorrect to a correct answer 31% of the time, while GPT-3.5, Llama3, PaLM2, and MedPaLM corrected their responses 25%, 18%, 19%, and 17% of the time, respectively. CONCLUSION Large language models vary in their understanding of otolaryngology-specific clinical knowledge. OpenAI's GPT-4 has a strong understanding of core concepts as well as detailed information in the field of otolaryngology. Its baseline understanding in this field makes it well suited to serve in roles related to head and neck surgery education, provided that the appropriate precautions are taken and potential limitations are understood. LEVEL OF EVIDENCE NA Laryngoscope, 135:629-634, 2025.
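As a rough illustration of the evaluation workflow this abstract describes (a standardized prompt followed by each question, with accuracy computed over the question bank), the sketch below shows one way such a loop could be written against an OpenAI-style chat API. The prompt wording, model handling, and single-letter scoring are assumptions for illustration, not the authors' protocol.

```python
# Minimal sketch of a multiple-choice evaluation loop (illustrative only;
# the prompt text and scoring rule are assumptions, not the study's protocol).
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

SYSTEM_PROMPT = ("You are taking an otolaryngology board-style exam. "
                 "Answer with the single letter of the best option.")

def ask(model: str, stem: str, options: dict[str, str]) -> str:
    """Send one standardized prompt plus question and return the chosen letter."""
    text = stem + "\n" + "\n".join(f"{k}. {v}" for k, v in options.items())
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": SYSTEM_PROMPT},
                  {"role": "user", "content": text}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()[:1].upper()

def accuracy(model: str, bank: list[dict]) -> float:
    """Fraction of questions answered with the keyed letter."""
    correct = sum(ask(model, q["stem"], q["options"]) == q["answer"] for q in bank)
    return correct / len(bank)
```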
Affiliation(s)
- Dante J Merlino
- Department of Otolaryngology-Head and Neck Surgery, Mayo Clinic, Rochester, Minnesota, U.S.A
- Santiago R Brufau
- Department of Otolaryngology-Head and Neck Surgery, Mayo Clinic, Rochester, Minnesota, U.S.A
- George Saieed
- Department of Otolaryngology-Head and Neck Surgery, Mayo Clinic, Rochester, Minnesota, U.S.A
- Kathryn M Van Abel
- Department of Otolaryngology-Head and Neck Surgery, Mayo Clinic, Rochester, Minnesota, U.S.A
- Daniel L Price
- Department of Otolaryngology-Head and Neck Surgery, Mayo Clinic, Rochester, Minnesota, U.S.A
- David J Archibald
- The Center for Plastic Surgery at Castle Rock, Castle Rock, Colorado, U.S.A
- Gregory A Ator
- Department of Otolaryngology-Head and Neck Surgery, University of Kansas Medical Center, Kansas City, Kansas, U.S.A
- Matthew L Carlson
- Department of Otolaryngology-Head and Neck Surgery, Mayo Clinic, Rochester, Minnesota, U.S.A
- Department of Neurologic Surgery, Mayo Clinic, Rochester, Minnesota, U.S.A
2
Du W, Jin X, Harris JC, Brunetti A, Johnson E, Leung O, Li X, Walle S, Yu Q, Zhou X, Bian F, McKenzie K, Kanathanavanich M, Ozcelik Y, El-Sharkawy F, Koga S. Large language models in pathology: A comparative study of ChatGPT and Bard with pathology trainees on multiple-choice questions. Ann Diagn Pathol 2024; 73:152392. [PMID: 39515029] [DOI: 10.1016/j.anndiagpath.2024.152392]
Abstract
Large language models (LLMs), such as ChatGPT and Bard, have shown potential in various medical applications. This study aimed to evaluate the performance of LLMs, specifically ChatGPT and Bard, in pathology by comparing their performance with that of pathology trainees, and to assess the consistency of their responses. We selected 150 multiple-choice questions from 15 subspecialties, excluding those with images. Both ChatGPT and Bard were tested on these questions across three separate sessions between June 2023 and January 2024, and their responses were compared with those of 16 pathology trainees (8 junior and 8 senior) from two hospitals. Questions were categorized as easy, intermediate, or difficult based on trainee performance. Consistency and variability in LLM responses were analyzed across the three evaluation sessions. ChatGPT significantly outperformed Bard and the trainees, achieving an average total score of 82.2% compared with Bard's 49.5%, junior trainees' 45.1%, and senior trainees' 56.0%. ChatGPT's performance was notably stronger on difficult questions (63.4%-68.3%) compared with Bard (31.7%-34.1%) and trainees (4.9%-48.8%). For easy questions, ChatGPT (83.1%-91.5%) and trainees (73.7%-100.0%) showed similarly high scores. Consistency analysis revealed that ChatGPT showed a high consistency rate of 80%-85% across the three tests, whereas Bard exhibited greater variability, with consistency rates of 54%-61%. While LLMs show significant promise in pathology education and practice, continued development and human oversight are crucial for reliable clinical application.
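The sketch below illustrates, under assumed data structures, the two analyses this abstract describes: accuracy stratified by question difficulty and the consistency of an LLM's answers across three sessions. It is not the authors' code; the field names and inputs are hypothetical.

```python
# Illustrative sketch: per-difficulty accuracy and across-session consistency
# for a model answering the same MCQs in three separate sessions.
from collections import defaultdict

def accuracy_by_difficulty(results):
    """results: list of dicts like {"difficulty": "easy", "correct": True}."""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["difficulty"]] += 1
        hits[r["difficulty"]] += r["correct"]
    return {d: hits[d] / totals[d] for d in totals}

def consistency_rate(sessions):
    """sessions: three lists of answer letters, aligned question by question."""
    same = sum(len(set(answers)) == 1 for answers in zip(*sessions))
    return same / len(sessions[0])
```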
Affiliation(s)
- Wei Du
- Department of Pathology and Laboratory Medicine, Hospital of the University of Pennsylvania, Philadelphia, PA, United States of America
- Xueting Jin
- Department of Pathology and Laboratory Medicine, Hospital of the University of Pennsylvania, Philadelphia, PA, United States of America
- Jaryse Carol Harris
- Department of Pathology and Laboratory Medicine, Hospital of the University of Pennsylvania, Philadelphia, PA, United States of America
- Alessandro Brunetti
- Department of Pathology and Laboratory Medicine, Hospital of the University of Pennsylvania, Philadelphia, PA, United States of America
- Erika Johnson
- Department of Pathology and Laboratory Medicine, Hospital of the University of Pennsylvania, Philadelphia, PA, United States of America
- Olivia Leung
- Department of Pathology and Laboratory Medicine, Hospital of the University of Pennsylvania, Philadelphia, PA, United States of America
- Xingchen Li
- Department of Pathology and Laboratory Medicine, Pennsylvania Hospital, Philadelphia, PA, United States of America
- Selemon Walle
- Department of Pathology and Laboratory Medicine, Hospital of the University of Pennsylvania, Philadelphia, PA, United States of America
- Qing Yu
- Department of Pathology and Laboratory Medicine, Pennsylvania Hospital, Philadelphia, PA, United States of America
- Xiao Zhou
- Department of Pathology and Laboratory Medicine, Pennsylvania Hospital, Philadelphia, PA, United States of America
- Fang Bian
- Department of Pathology and Laboratory Medicine, Pennsylvania Hospital, Philadelphia, PA, United States of America
- Kajanna McKenzie
- Department of Pathology and Laboratory Medicine, Hospital of the University of Pennsylvania, Philadelphia, PA, United States of America
- Manita Kanathanavanich
- Department of Pathology and Laboratory Medicine, Hospital of the University of Pennsylvania, Philadelphia, PA, United States of America
- Yusuf Ozcelik
- Department of Pathology and Laboratory Medicine, Hospital of the University of Pennsylvania, Philadelphia, PA, United States of America
- Farah El-Sharkawy
- Department of Pathology and Laboratory Medicine, Hospital of the University of Pennsylvania, Philadelphia, PA, United States of America
- Shunsuke Koga
- Department of Pathology and Laboratory Medicine, Hospital of the University of Pennsylvania, Philadelphia, PA, United States of America
3
Pavone M, Palmieri L, Bizzarri N, Rosati A, Campolo F, Innocenzi C, Taliento C, Restaino S, Catena U, Vizzielli G, Akladios C, Ianieri MM, Marescaux J, Campo R, Fanfani F, Scambia G. Artificial Intelligence, the ChatGPT Large Language Model: Assessing the Accuracy of Responses to the Gynaecological Endoscopic Surgical Education and Assessment (GESEA) Level 1-2 knowledge tests. Facts Views Vis Obgyn 2024; 16:449-456. [PMID: 39718328] [DOI: 10.52054/fvvo.16.4.052]
Abstract
Background In 2022, OpenAI launched ChatGPT-3.5, which is now widely used in medical education, training, and research. Despite its valuable use for the generation of information, concerns persist about its authenticity and accuracy. Its undisclosed information sources and outdated training data pose risks of misinformation. Although it is widely used, inaccuracies in AI-generated text raise doubts about its reliability. The ethical use of such technologies is crucial to uphold scientific accuracy in research. Objective This study aimed to assess the accuracy of ChatGPT on the GESEA Level 1 and Level 2 knowledge tests. Materials and Methods The 100 multiple-choice theoretical questions from GESEA certifications 1 and 2 were presented to ChatGPT, which was asked to select the correct answer and provide an explanation. Expert gynaecologists evaluated and graded the explanations for accuracy. Main outcome measures ChatGPT answered 59% of questions correctly, with 64% of responses providing comprehensive explanations. It performed better on GESEA Level 1 questions (64% accuracy) than on GESEA Level 2 questions (54% accuracy). Conclusions ChatGPT is a versatile tool in medicine and research, offering knowledge and information and promoting evidence-based practice. Despite its widespread use, its accuracy has not yet been validated. This study found a 59% correct response rate, highlighting the need for accuracy validation and ethical use considerations. Future research should investigate ChatGPT's truthfulness in subspecialty fields such as gynaecologic oncology and compare different versions of the chatbot for continuous improvement. What is new? Artificial intelligence (AI) has great potential in scientific research, but the validity of its outputs remains unverified. This study evaluates the accuracy of responses generated by ChatGPT to encourage critical use of this tool.
4
Lee J, Park S, Shin J, Cho B. Analyzing evaluation methods for large language models in the medical field: a scoping review. BMC Med Inform Decis Mak 2024; 24:366. [PMID: 39614219] [PMCID: PMC11606129] [DOI: 10.1186/s12911-024-02709-7]
Abstract
BACKGROUND Owing to the rapid growth in the popularity of Large Language Models (LLMs), various performance evaluation studies have been conducted to confirm their applicability in the medical field. However, there is still no clear framework for evaluating LLMs. OBJECTIVE This study reviews studies on LLM evaluations in the medical field and analyzes the research methods used in these studies. It aims to provide a reference for future researchers designing LLM studies. METHODS & MATERIALS We conducted a scoping review of three databases (PubMed, Embase, and MEDLINE) to identify LLM-related articles published between January 1, 2023, and September 30, 2023. We analyzed the types of methods, number of questions (queries), evaluators, repeat measurements, additional analysis methods, use of prompt engineering, and metrics other than accuracy. RESULTS A total of 142 articles met the inclusion criteria. LLM evaluation was primarily categorized as either providing test examinations (n = 53, 37.3%) or being evaluated by a medical professional (n = 80, 56.3%), with some hybrid cases (n = 5, 3.5%) or a combination of the two (n = 4, 2.8%). Most studies had 100 or fewer questions (n = 18, 29.0%), 15 (24.2%) performed repeated measurements, 18 (29.0%) performed additional analyses, and 8 (12.9%) used prompt engineering. For medical assessment, most studies used 50 or fewer queries (n = 54, 64.3%), had two evaluators (n = 43, 48.3%), and 14 (14.7%) used prompt engineering. CONCLUSIONS More research is required regarding the application of LLMs in healthcare. Although previous studies have evaluated performance, future studies will likely focus on improving performance. A well-structured methodology is required for these studies to be conducted systematically.
Affiliation(s)
- Junbok Lee
- Institute for Innovation in Digital Healthcare, Yonsei University, Seoul, Republic of Korea
- Department of Human Systems Medicine, Seoul National University College of Medicine, Seoul, Republic of Korea
- Sungkyung Park
- Department of Bigdata AI Management Information, Seoul National University of Science and Technology, Seoul, Republic of Korea
- Jaeyong Shin
- Department of Preventive Medicine and Public Health, Yonsei University College of Medicine, 50-1, Yonsei-ro, Seodaemun-gu, Seoul, 03722, Republic of Korea
- Institute of Health Services Research, Yonsei University College of Medicine, Seoul, Republic of Korea
- Belong Cho
- Department of Human Systems Medicine, Seoul National University College of Medicine, Seoul, Republic of Korea
- Department of Family Medicine, Seoul National University Hospital, Seoul, Republic of Korea
- Department of Family Medicine, Seoul National University College of Medicine, 101 Daehak-ro, Jongno-gu, Seoul, 03080, Republic of Korea
5
Waldock WJ, Zhang J, Guni A, Nabeel A, Darzi A, Ashrafian H. The Accuracy and Capability of Artificial Intelligence Solutions in Health Care Examinations and Certificates: Systematic Review and Meta-Analysis. J Med Internet Res 2024; 26:e56532. [PMID: 39499913] [PMCID: PMC11576595] [DOI: 10.2196/56532]
Abstract
BACKGROUND Large language models (LLMs) have dominated public interest due to their apparent capability to accurately replicate learned knowledge in narrative text. However, there is a lack of clarity about the accuracy and capability standards of LLMs in health care examinations. OBJECTIVE We conducted a systematic review of LLM accuracy, as tested under health care examination conditions, compared with known human performance standards. METHODS We quantified the accuracy of LLMs in responding to health care examination questions and evaluated the consistency and quality of study reporting. The search included all papers published up to September 10, 2023, covering any LLM reported in English-language journals with clear accuracy standards. The exclusion criteria were as follows: the assessment was not a health care exam, no LLM was used, there was no evaluation of comparable success accuracy, or the literature was not original research. The literature search included the following Medical Subject Headings (MeSH) terms used in all possible combinations: "artificial intelligence," "ChatGPT," "GPT," "LLM," "large language model," "machine learning," "neural network," "Generative Pre-trained Transformer," "Generative Transformer," "Generative Language Model," "Generative Model," "medical exam," "healthcare exam," and "clinical exam." Sensitivity, accuracy, and precision data were extracted, including relevant CIs. RESULTS The search identified 1673 relevant citations. After removing duplicates, 1268 (75.8%) papers were screened by title and abstract, and 32 (2.5%) studies were included for full-text review. Our meta-analysis suggested that LLMs are able to perform with an overall medical examination accuracy of 0.61 (CI 0.58-0.64) and a United States Medical Licensing Examination (USMLE) accuracy of 0.51 (CI 0.46-0.56), while Chat Generative Pretrained Transformer (ChatGPT) can perform with an overall medical examination accuracy of 0.64 (CI 0.6-0.67). CONCLUSIONS LLMs offer promise for easing health care demand and staffing challenges by providing accurate and efficient context-specific information to critical decision makers. For policy and deployment decisions about LLMs to advance health care, we proposed a new framework called RUBRICC (Regulatory, Usability, Bias, Reliability [Evidence and Safety], Interoperability, Cost, and Codesign-Patient and Public Involvement and Engagement [PPIE]). This presents a valuable opportunity to direct the clinical commissioning of new LLM capabilities into health services, while respecting patient safety considerations. TRIAL REGISTRATION OSF Registries osf.io/xqzkw; https://osf.io/xqzkw.
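As a worked illustration of pooling per-study accuracies of the kind reported here, the sketch below applies a simple inverse-variance (fixed-effect) pool on the logit scale with a 95% CI. The review's actual meta-analytic model is not specified in the abstract, so this is an assumption-laden example with made-up study counts.

```python
# Illustrative fixed-effect pooling of accuracy proportions on the logit scale
# (the review's actual meta-analytic model may differ; counts are hypothetical).
import numpy as np

def pool_accuracy(correct: np.ndarray, total: np.ndarray):
    p = correct / total
    logit = np.log(p / (1 - p))
    var = 1 / correct + 1 / (total - correct)   # approximate variance of the logit
    w = 1 / var
    pooled = np.sum(w * logit) / np.sum(w)
    se = np.sqrt(1 / np.sum(w))
    lo, hi = pooled - 1.96 * se, pooled + 1.96 * se
    inv = lambda x: 1 / (1 + np.exp(-x))        # back-transform to a proportion
    return inv(pooled), (inv(lo), inv(hi))

# Example with three hypothetical studies
acc, ci = pool_accuracy(np.array([120, 80, 200]), np.array([200, 150, 300]))
print(acc, ci)
```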
Affiliation(s)
- Joe Zhang
- Imperial College London, London, United Kingdom
- Ahmad Guni
- Imperial College London, London, United Kingdom
- Ahmad Nabeel
- Institute of Global Health Innovation, Imperial College London, London, United Kingdom
- Ara Darzi
- Imperial College London, London, United Kingdom
- Hutan Ashrafian
- Institute of Global Health Innovation, Imperial College London, London, United Kingdom
6
Moulaei K, Yadegari A, Baharestani M, Farzanbakhsh S, Sabet B, Reza Afrash M. Generative artificial intelligence in healthcare: A scoping review on benefits, challenges and applications. Int J Med Inform 2024; 188:105474. [PMID: 38733640] [DOI: 10.1016/j.ijmedinf.2024.105474]
Abstract
BACKGROUND Generative artificial intelligence (GAI) is revolutionizing healthcare with solutions for complex challenges, enhancing diagnosis, treatment, and care through new data and insights. However, its integration raises questions about applications, benefits, and challenges. Our study explores these aspects, offering an overview of GAI's applications and future prospects in healthcare. METHODS This scoping review searched Web of Science, PubMed, and Scopus. The selection of studies involved screening titles, reviewing abstracts, and examining full texts, adhering to the PRISMA-ScR guidelines throughout the process. RESULTS From 1406 articles across three databases, 109 met the inclusion criteria after screening and deduplication. Nine GAI models were utilized in healthcare, with ChatGPT (n = 102, 74%), Google Bard (Gemini) (n = 16, 11%), and Microsoft Bing AI (n = 10, 7%) being the most frequently employed. A total of 24 different applications of GAI in healthcare were identified, with the most common being "offering insights and information on health conditions through answering questions" (n = 41) and "diagnosis and prediction of diseases" (n = 17). In total, 606 benefits and challenges were identified, which were condensed to 48 benefits and 61 challenges after consolidation. The predominant benefits included "Providing rapid access to information and valuable insights" and "Improving prediction and diagnosis accuracy", while the primary challenges comprised "generating inaccurate or fictional content", "unknown source of information and fake references for texts", and "lower accuracy in answering questions". CONCLUSION This scoping review identified the applications, benefits, and challenges of GAI in healthcare. This synthesis offers a crucial overview of GAI's potential to revolutionize healthcare, emphasizing the imperative to address its limitations.
Affiliation(s)
- Khadijeh Moulaei
- Department of Health Information Technology, School of Paramedical, Ilam University of Medical Sciences, Ilam, Iran
- Atiye Yadegari
- Department of Pediatric Dentistry, School of Dentistry, Hamadan University of Medical Sciences, Hamadan, Iran
- Mahdi Baharestani
- Network of Interdisciplinarity in Neonates and Infants (NINI), Universal Scientific Education and Research Network (USERN), Tehran, Iran
- Shayan Farzanbakhsh
- Network of Interdisciplinarity in Neonates and Infants (NINI), Universal Scientific Education and Research Network (USERN), Tehran, Iran
- Babak Sabet
- Department of Surgery, Faculty of Medicine, Shahid Beheshti University of Medical Sciences, Tehran, Iran
- Mohammad Reza Afrash
- Department of Artificial Intelligence, Smart University of Medical Sciences, Tehran, Iran
7
Nedbal C, Naik N, Castellani D, Gauhar V, Geraghty R, Somani BK. ChatGPT in urology practice: revolutionizing efficiency and patient care with generative artificial intelligence. Curr Opin Urol 2024; 34:98-104. [PMID: 37962176] [DOI: 10.1097/mou.0000000000001151]
Abstract
PURPOSE OF REVIEW ChatGPT has emerged as a potentially useful tool for healthcare. Its role in urology is in its infancy, with much potential for research, clinical practice, and patient assistance. With this narrative review, we aim to draw a picture of what is known about ChatGPT's integration in urology, alongside future promises and challenges. RECENT FINDINGS ChatGPT can ease administrative work, helping urologists with note-taking and clinical documentation such as discharge summaries and clinical notes. It can improve patient engagement by increasing awareness and facilitating communication, and it has been investigated especially for uro-oncological diseases. Its ability to understand human emotions makes ChatGPT an empathic and thoughtful interactive tool or information source for urological patients and their relatives. Currently, its role in clinical diagnosis and treatment decisions is uncertain, as concerns have been raised about misinterpretation, hallucination, and out-of-date information. Moreover, a mandatory regulatory process for ChatGPT in urology has yet to be established. SUMMARY ChatGPT has the potential to contribute to precision medicine and tailored practice through its quick, structured responses. However, this will depend on how well information can be obtained by asking the pertinent questions and seeking appropriate responses. The key lies in validating the responses, regulating the information shared, and avoiding misuse so that data and patient privacy are protected. Its successful integration into mainstream urology will require educational bodies to provide guidelines or best-practice recommendations.
Affiliation(s)
- Carlotta Nedbal
- Department of Urology, University Hospitals Southampton, NHS Trust, Southampton, UK
- Urology Unit, Azienda Ospedaliero-Universitaria delle Marche, Polytechnic University of Marche, Ancona, Italy
- Nitesh Naik
- Department of Mechanical and Industrial Engineering, Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal, Karnataka, India
- Daniele Castellani
- Urology Unit, Azienda Ospedaliero-Universitaria delle Marche, Polytechnic University of Marche, Ancona, Italy
- Vineet Gauhar
- Department of Urology, Ng Teng Fong General Hospital, NUHS, Singapore
- Robert Geraghty
- Department of Urology, Freeman Hospital, Newcastle-upon-Tyne, UK
- Bhaskar Kumar Somani
- Department of Urology, University Hospitals Southampton, NHS Trust, Southampton, UK
8
Lee KH, Lee RW. ChatGPT's Accuracy on Magnetic Resonance Imaging Basics: Characteristics and Limitations Depending on the Question Type. Diagnostics (Basel) 2024; 14:171. [PMID: 38248048] [PMCID: PMC10814518] [DOI: 10.3390/diagnostics14020171]
Abstract
Our study aimed to assess the accuracy and limitations of ChatGPT in the domain of MRI, focusing on its performance in answering simple knowledge questions and specialized multiple-choice questions related to MRI. A two-step approach was used to evaluate ChatGPT. In the first step, 50 simple MRI-related questions were asked, and ChatGPT's answers were categorized as correct, partially correct, or incorrect by independent researchers. In the second step, 75 multiple-choice questions covering various MRI topics were posed, and the answers were similarly categorized. The study utilized Cohen's kappa coefficient to assess interobserver agreement. ChatGPT demonstrated high accuracy in answering straightforward MRI questions, with over 85% classified as correct. However, its performance varied significantly across multiple-choice questions, with accuracy rates ranging from 40% to 66.7% depending on the topic. This indicated a notable gap in its ability to handle more complex, specialized questions requiring deeper understanding and context. In conclusion, this study critically evaluates the accuracy of ChatGPT in addressing MRI-related questions, highlighting its potential and limitations in the healthcare sector, particularly in radiology. Our findings demonstrate that ChatGPT, while proficient in responding to straightforward MRI-related questions, exhibits variability in its ability to accurately answer complex multiple-choice questions that require more profound, specialized knowledge of MRI. This discrepancy underscores the nuanced role AI can play in medical education and healthcare decision-making, necessitating a balanced approach to its application.
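The interobserver agreement reported here can be computed with Cohen's kappa; the sketch below shows a minimal example using scikit-learn, with hypothetical rater labels (correct / partially correct / incorrect) standing in for the study's actual gradings.

```python
# Minimal sketch of interobserver agreement on graded LLM answers
# (labels are hypothetical, not the study's data).
from sklearn.metrics import cohen_kappa_score

rater_a = ["correct", "partial", "incorrect", "correct", "correct"]
rater_b = ["correct", "partial", "correct", "correct", "incorrect"]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")
```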
Affiliation(s)
- Ro-Woon Lee
- Department of Radiology, Inha University College of Medicine, Incheon 22212, Republic of Korea
9
Kollitsch L, Eredics K, Marszalek M, Rauchenwald M, Brookman-May SD, Burger M, Körner-Riffard K, May M. How does artificial intelligence master urological board examinations? A comparative analysis of different Large Language Models' accuracy and reliability in the 2022 In-Service Assessment of the European Board of Urology. World J Urol 2024; 42:20. [PMID: 38197996] [DOI: 10.1007/s00345-023-04749-6]
Abstract
PURPOSE This study is a comparative analysis of three Large Language Models (LLMs), evaluating their rate of correct answers (RoCA) and the reliability of generated answers on a set of urological knowledge-based questions spanning different levels of complexity. METHODS ChatGPT-3.5, ChatGPT-4, and Bing AI underwent two testing rounds, with a 48-h gap in between, using the 100 multiple-choice questions from the 2022 European Board of Urology (EBU) In-Service Assessment (ISA). For conflicting responses, an additional consensus round was conducted to establish conclusive answers. RoCA was compared across various question complexities. Ten weeks after the consensus round, a subsequent testing round was conducted to assess potential knowledge gain and improvement in RoCA. RESULTS Over three testing rounds, ChatGPT-3.5 achieved RoCA scores of 58%, 62%, and 59%. In contrast, ChatGPT-4 achieved RoCA scores of 63%, 77%, and 77%, while Bing AI yielded scores of 81%, 73%, and 77%, respectively. Agreement rates between rounds 1 and 2 were 84% (κ = 0.67, p < 0.001) for ChatGPT-3.5, 74% (κ = 0.40, p < 0.001) for ChatGPT-4, and 76% (κ = 0.33, p < 0.001) for Bing AI. In the consensus round, ChatGPT-4 and Bing AI significantly outperformed ChatGPT-3.5 (77% and 77% vs. 59%, both p = 0.010). All LLMs demonstrated decreasing RoCA scores with increasing question complexity (p < 0.001). In the fourth round, no significant improvement in RoCA was observed across all three LLMs. CONCLUSIONS The performance of the tested LLMs in addressing urological specialist inquiries warrants further refinement. Moreover, the deficiency in response reliability contributes to existing challenges related to their current utility for educational purposes.
Affiliation(s)
- Lisa Kollitsch
- Department of Urology and Andrology, Klinik Donaustadt, Vienna, Austria
- Klaus Eredics
- Department of Urology and Andrology, Klinik Donaustadt, Vienna, Austria
- Department of Urology, Paracelsus Medical University, Salzburg, Austria
- Martin Marszalek
- Department of Urology and Andrology, Klinik Donaustadt, Vienna, Austria
- Michael Rauchenwald
- Department of Urology and Andrology, Klinik Donaustadt, Vienna, Austria
- European Board of Urology, Arnhem, The Netherlands
- Sabine D Brookman-May
- Department of Urology, University of Munich, LMU, Munich, Germany
- Johnson and Johnson Innovative Medicine, Research and Development, Spring House, PA, USA
- Maximilian Burger
- Department of Urology, Caritas St. Josef Medical Centre, University of Regensburg, Regensburg, Germany
- Katharina Körner-Riffard
- Department of Urology, Caritas St. Josef Medical Centre, University of Regensburg, Regensburg, Germany
- Matthias May
- Department of Urology, St. Elisabeth Hospital Straubing, Brothers of Mercy Hospital, Straubing, Germany
10
Lee KH, Lee RW, Kwon YE. Validation of a Deep Learning Chest X-ray Interpretation Model: Integrating Large-Scale AI and Large Language Models for Comparative Analysis with ChatGPT. Diagnostics (Basel) 2023; 14:90. [PMID: 38201398] [PMCID: PMC10795741] [DOI: 10.3390/diagnostics14010090]
Abstract
This study evaluates the diagnostic accuracy and clinical utility of two artificial intelligence (AI) techniques: Kakao Brain Artificial Neural Network for Chest X-ray Reading (KARA-CXR), an assistive technology developed using large-scale AI and large language models (LLMs), and ChatGPT, a well-known LLM. The study was conducted to validate the performance of the two technologies in chest X-ray reading and to explore their potential applications in the medical imaging diagnosis domain. Two thousand chest X-ray images were randomly selected from a single institution's patient database, and two radiologists evaluated the readings provided by KARA-CXR and ChatGPT. Five qualitative factors were used to evaluate the readings generated by each model: accuracy, false findings, location inaccuracies, count inaccuracies, and hallucinations. Statistical analysis showed that KARA-CXR achieved significantly higher diagnostic accuracy than ChatGPT. In the 'Acceptable' accuracy category, KARA-CXR was rated at 70.50% and 68.00% by the two observers, while ChatGPT achieved 40.50% and 47.00%. Interobserver agreement was moderate for both systems, with KARA-CXR at 0.74 and ChatGPT at 0.73. For 'False Findings', KARA-CXR scored 68.00% and 68.50%, while ChatGPT scored 37.00% for both observers, with high interobserver agreement of 0.96 for KARA-CXR and 0.97 for ChatGPT. In 'Location Inaccuracy' and 'Hallucinations', KARA-CXR outperformed ChatGPT by significant margins. KARA-CXR demonstrated a non-hallucination rate of 75%, significantly higher than ChatGPT's 38%. Interobserver agreement in the hallucination category was high for KARA-CXR (0.91) and moderate to high for ChatGPT (0.85). In conclusion, this study demonstrates the potential of AI and large-scale language models in medical imaging and diagnostics. It also shows that, in the chest X-ray domain, KARA-CXR has relatively higher accuracy than ChatGPT.
Affiliation(s)
- Ro Woon Lee
- Department of Radiology, College of Medicine, Inha University, Incheon 22212, Republic of Korea
11
Spitale G, Schneider G, Germani F, Biller-Andorno N. Exploring the role of AI in classifying, analyzing, and generating case reports on assisted suicide cases: feasibility and ethical implications. Front Artif Intell 2023; 6:1328865. [PMID: 38164497] [PMCID: PMC10757918] [DOI: 10.3389/frai.2023.1328865]
Abstract
This paper presents a study on the use of AI models for the classification of case reports on assisted suicide procedures. The database of the five Dutch regional bioethics committees was scraped to collect the 72 case reports available in English. We trained several AI models to classify reports according to the categories defined by the Dutch Termination of Life on Request and Assisted Suicide (Review Procedures) Act. We also conducted a related project to fine-tune an OpenAI GPT-3.5-turbo large language model for generating new fictional but plausible cases. As AI is increasingly being used for judgement, it is possible to imagine its application in decision-making regarding assisted suicide. Here we explore two questions that arise, feasibility and ethics, with the aim of contributing to a critical assessment of the potential role of AI in decision-making in highly sensitive areas.
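The abstract does not disclose which classifiers were trained, so the sketch below is only a generic example of fitting a text classifier to labelled case reports; the texts and category labels are placeholders, and the fine-tuning of GPT-3.5-turbo mentioned in the study is not shown.

```python
# Generic sketch of training a text classifier on labelled case reports
# (illustrative only; texts and labels are placeholders, not study data).
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

reports = ["...case report text...", "...another case report..."]   # placeholder texts
labels = ["category A", "category B"]                                # placeholder labels

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
clf.fit(reports, labels)                        # train on the labelled corpus
pred = clf.predict(["...unseen case report..."])  # classify a new report
```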
Affiliation(s)
- Giovanni Spitale
- Institute of Biomedical Ethics and History of Medicine, University of Zurich, Zürich, Switzerland
- Gerold Schneider
- Department of Computational Linguistics, University of Zurich, Zürich, Switzerland
- Federico Germani
- Institute of Biomedical Ethics and History of Medicine, University of Zurich, Zürich, Switzerland
- Nikola Biller-Andorno
- Institute of Biomedical Ethics and History of Medicine, University of Zurich, Zürich, Switzerland
12
Ohta K, Ohta S. The Performance of GPT-3.5, GPT-4, and Bard on the Japanese National Dentist Examination: A Comparison Study. Cureus 2023; 15:e50369. [PMID: 38213361] [PMCID: PMC10782219] [DOI: 10.7759/cureus.50369]
Abstract
Purpose This study aims to evaluate the performance of three large language models (LLMs), the Generative Pre-trained Transformer (GPT)-3.5, GPT-4, and Google Bard, on the 2023 Japanese National Dentist Examination (JNDE) and assess their potential clinical applications in Japan. Methods A total of 185 questions from the 2023 JNDE were used. These questions were classified by question type and category. McNemar's test compared correct response rates between pairs of LLMs, while Fisher's exact test evaluated the performance of the LLMs within each question category. Results The overall correct response rates were 73.5% for GPT-4, 66.5% for Bard, and 51.9% for GPT-3.5. GPT-4 showed a significantly higher correct response rate than Bard and GPT-3.5. In the category of essential questions, Bard achieved a correct response rate of 80.5%, surpassing the passing criterion of 80%. In contrast, both GPT-4 and GPT-3.5 fell short of this benchmark, with GPT-4 attaining 77.6% and GPT-3.5 only 52.5%. The scores of GPT-4 and Bard were significantly higher than that of GPT-3.5 (p<0.01). For general questions, the correct response rates were 71.2% for GPT-4, 58.5% for Bard, and 52.5% for GPT-3.5. GPT-4 outperformed GPT-3.5 and Bard (p<0.01). The correct response rates for professional dental questions were 51.6% for GPT-4, 45.3% for Bard, and 35.9% for GPT-3.5; the differences among the models were not statistically significant. All LLMs demonstrated significantly lower accuracy on dentistry questions than on other types of questions (p<0.01). Conclusions GPT-4 achieved the highest overall score on the JNDE, followed by Bard and GPT-3.5. However, only Bard surpassed the passing score for essential questions. To further understand the application of LLMs in clinical dentistry worldwide, more research on their performance in dental examinations across different languages is required.
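The statistical comparisons described here can be reproduced in outline with standard libraries; the sketch below runs McNemar's test on a hypothetical paired 2x2 table for two LLMs and Fisher's exact test on a hypothetical category-level table. The counts are illustrative, not the study's data.

```python
# Illustrative sketch of the paired and categorical comparisons described
# (all 2x2 counts below are hypothetical).
import numpy as np
from scipy.stats import fisher_exact
from statsmodels.stats.contingency_tables import mcnemar

# Rows: model A correct/incorrect; columns: model B correct/incorrect.
paired = np.array([[90, 46],
                   [10, 39]])
print(mcnemar(paired, exact=True).pvalue)   # paired comparison of two LLMs

# Rows: correct/incorrect; columns: two question categories.
table = np.array([[96, 62],
                  [39, 88]])
print(fisher_exact(table)[1])               # p-value for the category comparison
```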
Affiliation(s)
- Satomi Ohta
- Dentistry, Dentist of Mama and Kodomo, Kobe, JPN
13
Koga S. Exploring the pitfalls of large language models: Inconsistency and inaccuracy in answering pathology board examination-style questions. Pathol Int 2023; 73:618-620. [PMID: 37818818] [DOI: 10.1111/pin.13382]
Affiliation(s)
- Shunsuke Koga
- Department of Pathology and Laboratory Medicine, Hospital of the University of Pennsylvania, Philadelphia, Pennsylvania, USA
14
Meo SA, Al-Khlaiwi T, AbuKhalaf AA, Meo AS, Klonoff DC. The Scientific Knowledge of Bard and ChatGPT in Endocrinology, Diabetes, and Diabetes Technology: Multiple-Choice Questions Examination-Based Performance. J Diabetes Sci Technol 2023:19322968231203987. [PMID: 37798960] [DOI: 10.1177/19322968231203987]
Abstract
BACKGROUND The present study aimed to investigate the knowledge level of Bard and ChatGPT in the areas of endocrinology, diabetes, and diabetes technology through a multiple-choice question (MCQ) examination format. METHODS A bank of 100 MCQs in endocrinology, diabetes, and diabetes technology was established, drawing on physiology and medical textbooks and academic examination pools in these areas. The study team members analyzed the MCQ contents to ensure that they were related to endocrinology, diabetes, and diabetes technology. The number of MCQs from endocrinology was 50, and that from diabetes and diabetes technology was also 50. The knowledge level of Google's Bard and ChatGPT was assessed with an MCQ-based examination. RESULTS In the endocrinology section, ChatGPT obtained 29 marks (correct responses) of 50 (58%), and Bard obtained the same score of 29 of 50 (58%). In the diabetes technology section, ChatGPT obtained 23 marks of 50 (46%), and Bard obtained 20 marks of 50 (40%). Overall, across the entire examination, ChatGPT obtained 52 marks of 100 (52%), and Bard obtained 49 marks of 100 (49%). ChatGPT obtained slightly more marks than Bard; however, neither ChatGPT nor Bard achieved a satisfactory score of at least 60% in endocrinology or diabetes/diabetes technology. CONCLUSIONS The overall MCQ-based performance of ChatGPT was slightly better than that of Google's Bard. However, neither ChatGPT nor Bard achieved appropriate scores in endocrinology and diabetes/diabetes technology. The study indicates that Bard and ChatGPT have the potential to support medical students and faculty in academic medical education settings, but both artificial intelligence tools need more updated information in the fields of endocrinology, diabetes, and diabetes technology.
Affiliation(s)
- Sultan Ayoub Meo
- Department of Physiology, College of Medicine, King Saud University, Riyadh, Saudi Arabia
- Thamir Al-Khlaiwi
- Department of Physiology, College of Medicine, King Saud University, Riyadh, Saudi Arabia
- Anusha Sultan Meo
- The School of Medicine, Medical Sciences and Nutrition, University of Aberdeen, Aberdeen, UK
- David C Klonoff
- Diabetes Research Institute, Mills-Peninsula Medical Center, San Mateo, CA, USA