1. Bülbül O, Bülbül HM, Kaba E. Assessing ChatGPT's summarization of 68Ga PSMA PET/CT reports for patients. Abdom Radiol (NY) 2024. PMID: 39347975. DOI: 10.1007/s00261-024-04619-8.
Abstract
PURPOSE ChatGPT has recently been the subject of many studies, and its responses to medical questions have been successful. We examined ChatGPT-4's evaluation of structured 68Ga prostate-specific membrane antigen (PSMA) PET/CT reports of newly diagnosed prostate cancer patients. METHODS 68Ga PSMA PET/CT reports of 164 patients were entered into ChatGPT-4. ChatGPT-4 was asked to respond to the following questions according to the PET/CT reports: 1-Has the cancer in the prostate extended to organs adjacent to the prostate? 2-Has the cancer in the prostate spread to neighboring lymph nodes? 3-Has the cancer in the prostate spread to lymph nodes in distant areas? 4-Has the cancer in the prostate spread to the bones? 5-Has the cancer in the prostate spread to other organs? ChatGPT-4's responses were scored on a Likert-type scale for clarity and accuracy. RESULTS The mean scores for clarity were 4.93 ± 0.32, 4.95 ± 0.25, 4.96 ± 0.19, 4.99 ± 0.11, and 4.96 ± 0.30, respectively. The mean scores for accuracy were 4.87 ± 0.61, 4.87 ± 0.62, 4.79 ± 0.83, 4.96 ± 0.25, and 4.93 ± 0.45, respectively. Patients with distant lymphatic metastases had a lower mean accuracy score than those without (4.28 ± 1.45 vs. 4.94 ± 0.39; p < 0.001). ChatGPT-4's responses for 13 patients (8%) contained potentially harmful information. CONCLUSION ChatGPT-4 successfully interprets structured 68Ga PSMA PET/CT reports of newly diagnosed prostate cancer patients. However, it is unlikely that ChatGPT-4 evaluations will replace physicians' evaluations today, especially since it can produce fabricated information.
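For illustration only, the sketch below shows the kind of analysis this abstract describes: summarizing 1-5 Likert accuracy ratings as mean ± SD and comparing the two patient groups. The scores are made up, and the Mann-Whitney U test is an assumption, since the abstract does not state which test produced the p-value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical 1-5 Likert accuracy ratings for the two patient groups (164 patients total)
scores_distant_ln = rng.integers(1, 6, size=20)      # patients with distant lymphatic metastases
scores_no_distant_ln = rng.integers(3, 6, size=144)  # patients without

for label, scores in [("with distant LN metastases", scores_distant_ln),
                      ("without distant LN metastases", scores_no_distant_ln)]:
    print(f"{label}: mean = {scores.mean():.2f}, SD = {scores.std(ddof=1):.2f}")

# Assumed non-parametric comparison of ordinal Likert ratings between the groups
u_stat, p_value = stats.mannwhitneyu(scores_distant_ln, scores_no_distant_ln)
print(f"Mann-Whitney U = {u_stat:.1f}, p = {p_value:.4f}")
```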
Affiliation(s)
- Ogün Bülbül: Recep Tayyip Erdogan University, Faculty of Medicine, Department of Nuclear Medicine, Rize, Turkey
- Hande Melike Bülbül: Recep Tayyip Erdogan University, Faculty of Medicine, Department of Radiology, Rize, Turkey
- Esat Kaba: Recep Tayyip Erdogan University, Faculty of Medicine, Department of Radiology, Rize, Turkey
2. Is EE, Menekseoglu AK. Comparative performance of artificial intelligence models in rheumatology board-level questions: evaluating Google Gemini and ChatGPT-4o. Clin Rheumatol 2024. PMID: 39340572. DOI: 10.1007/s10067-024-07154-5.
Abstract
OBJECTIVES This study evaluates the performance of AI models, ChatGPT-4o and Google Gemini, in answering rheumatology board-level questions, comparing their effectiveness, reliability, and applicability in clinical practice. METHOD A cross-sectional study was conducted using 420 rheumatology questions from the BoardVitals question bank, excluding 27 visual data questions. Both artificial intelligence models categorized the questions according to difficulty (easy, medium, hard) and answered them. In addition, the reliability of the answers was assessed by asking the questions a second time. The accuracy, reliability, and difficulty categorization of the AI models' responses to the questions were analyzed. RESULTS ChatGPT-4o answered 86.9% of the questions correctly, significantly outperforming Google Gemini's 60.2% accuracy (p < 0.001). When the questions were asked a second time, the success rate was 86.7% for ChatGPT-4o and 60.5% for Google Gemini. Both models mainly categorized questions as medium difficulty. ChatGPT-4o showed higher accuracy in various rheumatology subfields, notably in Basic and Clinical Science (p = 0.028), Osteoarthritis (p = 0.023), and Rheumatoid Arthritis (p < 0.001). CONCLUSIONS ChatGPT-4o significantly outperformed Google Gemini in rheumatology board-level questions. This demonstrates the success of ChatGPT-4o in situations requiring complex and specialized knowledge related to rheumatological diseases. The performance of both AI models decreased as the question difficulty increased. This study demonstrates the potential of AI in clinical applications and suggests that its use as a tool to assist clinicians may improve healthcare efficiency in the future. Future studies using real clinical scenarios and real board questions are recommended.
Key Points
- ChatGPT-4o significantly outperformed Google Gemini in answering rheumatology board-level questions, achieving 86.9% accuracy compared to Google Gemini's 60.2%.
- For both AI models, the correct answer rate decreased as the question difficulty increased.
- The study demonstrates the potential for AI models to be used in clinical practice as a tool to assist clinicians and improve healthcare efficiency.
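Because both models answered the same 420 questions, a paired comparison is the natural analysis. The abstract reports only the accuracies and p < 0.001, not the test used, so the sketch below assumes a McNemar test with illustrative, made-up cell counts chosen to reproduce the reported accuracies.

```python
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical question-level agreement table for the 420 shared questions:
# rows = ChatGPT-4o (correct, incorrect), columns = Google Gemini (correct, incorrect)
table = [[240, 125],   # both correct | only ChatGPT-4o correct  (365/420 = 86.9% for ChatGPT-4o)
         [ 13,  42]]   # only Gemini correct | both incorrect    (253/420 = 60.2% for Gemini)

result = mcnemar(table, exact=False, correction=True)
print(f"McNemar chi2 = {result.statistic:.2f}, p = {result.pvalue:.2e}")
```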
Affiliation(s)
- Enes Efe Is: Department of Physical Medicine and Rehabilitation, Sisli Hamidiye Etfal Training and Research Hospital, University of Health Sciences, Seyrantepe Campus, Cumhuriyet ve Demokrasi Avenue, Istanbul, Turkey
- Ahmet Kivanc Menekseoglu: Department of Physical Medicine and Rehabilitation, Kanuni Sultan Süleyman Training and Research Hospital, University of Health Sciences, Istanbul, Turkey
3. Bicknell BT, Butler D, Whalen S, Ricks J, Dixon CJ, Clark AB, Spaedy O, Skelton A, Edupuganti N, Dzubinski L, Tate H, Dyess G, Lindeman B, Lehmann LS. Critical Analysis of ChatGPT 4 Omni in USMLE Disciplines, Clinical Clerkships, and Clinical Skills. JMIR Med Educ 2024. PMID: 39276063. DOI: 10.2196/63430.
Abstract
BACKGROUND Recent studies, including those by the National Board of Medical Examiners (NBME), have highlighted the remarkable capabilities of large language models (LLMs) such as ChatGPT in passing the United States Medical Licensing Examination (USMLE). However, there is a gap in detailed analysis of these models' performance in specific medical content areas, thus limiting an assessment of their potential utility for medical education. OBJECTIVE To assess and compare the accuracy of successive ChatGPT versions (GPT-3.5, GPT-4, and GPT-4 Omni) in USMLE disciplines, clinical clerkships, and the clinical skills of diagnostics and management. METHODS This study used 750 clinical vignette-based multiple-choice questions (MCQs) to characterize the performance of successive ChatGPT versions [ChatGPT 3.5 (GPT-3.5), ChatGPT 4 (GPT-4), and ChatGPT 4 Omni (GPT-4o)] across USMLE disciplines, clinical clerkships, and in clinical skills (diagnostics and management). Accuracy was assessed using a standardized protocol, with statistical analyses conducted to compare the models' performances. RESULTS GPT-4o achieved the highest accuracy across 750 MCQs at 90.4%, outperforming GPT-4 and GPT-3.5, which scored 81.1% and 60.0% respectively. GPT-4o's highest performances were in social sciences (95.5%), behavioral and neuroscience (94.2%), and pharmacology (93.2%). In clinical skills, GPT-4o's diagnostic accuracy was 92.7% and management accuracy 88.8%, significantly higher than its predecessors. Notably, both GPT-4o and GPT-4 significantly outperformed the medical student average accuracy of 59.3% (95% CI: 58.3-60.3). CONCLUSIONS ChatGPT 4 Omni's performance in USMLE preclinical content areas as well as clinical skills indicates substantial improvements over its predecessors, suggesting significant potential for the use of this technology as an educational aid for medical students. These findings underscore the necessity of careful consideration of LLMs' integration into medical education, emphasizing the importance of structured curricula to guide their appropriate use and the need for ongoing critical analyses to ensure their reliability and effectiveness.
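As an illustrative sketch only (the exact statistical procedures are not given in the abstract), the snippet below shows two calculations of the kind reported: a Wilson 95% confidence interval for an accuracy expressed as a proportion, and an omnibus chi-square comparison of the three models' accuracies. The counts are reconstructed from the reported percentages on the 750-question set.

```python
from statsmodels.stats.proportion import proportion_confint, proportions_chisquare

n_questions = 750
# Correct-answer counts reconstructed from the reported accuracies (~60.0%, ~81.1%, ~90.4%)
correct = {"GPT-3.5": 450, "GPT-4": 608, "GPT-4o": 678}

for model, k in correct.items():
    low, high = proportion_confint(k, n_questions, alpha=0.05, method="wilson")
    print(f"{model}: {k / n_questions:.1%} (95% CI {low:.1%}-{high:.1%})")

# Omnibus comparison of the three accuracies
chi2, p, _ = proportions_chisquare(list(correct.values()), nobs=[n_questions] * 3)
print(f"chi2 = {chi2:.1f}, p = {p:.2e}")
```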
Affiliation(s)
- Brenton T Bicknell: University of Alabama at Birmingham Heersink School of Medicine, 1670 University Blvd, Birmingham, AL 35233, US
- Danner Butler: University of South Alabama Whiddon College of Medicine, Mobile, US
- Sydney Whalen: University of Illinois College of Medicine, Chicago, US
- Cory J Dixon: Alabama College of Osteopathic Medicine, Dothan, US
- Olivia Spaedy: Saint Louis University School of Medicine, St. Louis, US
- Adam Skelton: University of Alabama at Birmingham Heersink School of Medicine, 1670 University Blvd, Birmingham, AL 35233, US
- Lance Dzubinski: University of Colorado Anschutz Medical Campus School of Medicine, Aurora, US
- Hudson Tate: University of Alabama at Birmingham Heersink School of Medicine, 1670 University Blvd, Birmingham, AL 35233, US
- Garrett Dyess: University of South Alabama Whiddon College of Medicine, Mobile, US
- Brenessa Lindeman: University of Alabama at Birmingham Heersink School of Medicine, 1670 University Blvd, Birmingham, AL 35233, US
4. Erdogan M. Evaluation of responses of the large language model GPT to the neurology question of the week. Neurol Sci 2024; 45:4605-4606. PMID: 38717580. DOI: 10.1007/s10072-024-07580-y.
Affiliation(s)
- Mucahid Erdogan: Neurology Department, Kartal Dr. Lütfi Kirdar City Hospital, Istanbul, Turkey
5. Kenney RC, Requarth TW, Jack AI, Hyman SW, Galetta SL, Grossman SN. AI in Neuro-Ophthalmology: Current Practice and Future Opportunities. J Neuroophthalmol 2024; 44:308-318. PMID: 38965655. DOI: 10.1097/wno.0000000000002205.
Abstract
BACKGROUND Neuro-ophthalmology frequently requires a complex and multi-faceted clinical assessment supported by sophisticated imaging techniques in order to assess disease status. The current approach to diagnosis requires substantial expertise and time. The emergence of AI has brought forth innovative solutions to streamline and enhance this diagnostic process, which is especially valuable given the shortage of neuro-ophthalmologists. Machine learning algorithms, in particular, have demonstrated significant potential in interpreting imaging data, identifying subtle patterns, and aiding clinicians in making more accurate and timely diagnoses, while also supplementing nonspecialist evaluations of neuro-ophthalmic disease. EVIDENCE ACQUISITION Electronic searches of published literature were conducted using PubMed and Google Scholar. A comprehensive search of the following terms was conducted within the Journal of Neuro-Ophthalmology: AI, artificial intelligence, machine learning, deep learning, natural language processing, computer vision, large language models, and generative AI. RESULTS This review aims to provide a comprehensive overview of the evolving landscape of AI applications in neuro-ophthalmology. It delves into the diverse applications of AI, from the interpretation of optical coherence tomography (OCT) and fundus photography to the development of predictive models for disease progression. Additionally, the review explores the integration of generative AI into neuro-ophthalmic education and clinical practice. CONCLUSIONS We review the current state of AI in neuro-ophthalmology and its potentially transformative impact. The inclusion of AI in neuro-ophthalmic practice and research not only holds promise for improving diagnostic accuracy but also opens avenues for novel therapeutic interventions. We emphasize its potential to improve access to scarce subspecialty resources while examining the current challenges associated with the integration of AI into clinical practice and research.
Affiliation(s)
- Rachel C Kenney: Departments of Neurology (RCK, AJ, SH, SG, SNG), Population Health (RCK), and Ophthalmology (SG), New York University Grossman School of Medicine, New York, New York; and Vilcek Institute of Graduate Biomedical Sciences (TR), New York University Grossman School of Medicine, New York, New York
6. Al-Naser Y, Halka F, Ng B, Mountford D, Sharma S, Niure K, Yong-Hing C, Khosa F, Van der Pol C. Evaluating Artificial Intelligence Competency in Education: Performance of ChatGPT-4 in the American Registry of Radiologic Technologists (ARRT) Radiography Certification Exam. Acad Radiol 2024. PMID: 39153961. DOI: 10.1016/j.acra.2024.08.009.
Abstract
RATIONALE AND OBJECTIVES The American Registry of Radiologic Technologists (ARRT) leads the certification process with an exam comprising 200 multiple-choice questions. This study aims to evaluate ChatGPT-4's performance in responding to practice questions similar to those found in the ARRT board examination. MATERIALS AND METHODS We used a dataset of 200 practice multiple-choice questions for the ARRT certification exam from BoardVitals. Each question was fed to ChatGPT-4 fifteen times, resulting in 3000 observations to account for response variability. RESULTS ChatGPT's overall performance was 80.56%, with higher accuracy on text-based questions (86.3%) compared to image-based questions (45.6%). Response times were longer for image-based questions (18.01 s) than for text-based questions (13.27 s). Performance varied by domain: 72.6% for Safety, 70.6% for Image Production, 67.3% for Patient Care, and 53.4% for Procedures. As anticipated, performance was best on easy questions (78.5%). CONCLUSION ChatGPT demonstrated effective performance on the BoardVitals question bank for ARRT certification. Future studies could benefit from analyzing the correlation between BoardVitals scores and actual exam outcomes. Further development in AI, particularly in image processing and interpretation, is necessary to enhance its utility in educational settings.
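A minimal sketch of the repeated-query protocol described (each of the 200 questions submitted 15 times and per-question accuracy aggregated). The `ask_chatgpt` function is a hypothetical placeholder, not the authors' code or a real API wrapper; it only marks where a call to the model would go.

```python
N_REPEATS = 15  # each question is asked 15 times, as in the study protocol

def ask_chatgpt(question_text: str) -> str:
    """Hypothetical placeholder for a call to the model; should return the chosen option letter."""
    raise NotImplementedError("Replace with an actual model/API call.")

def per_question_accuracy(questions):
    """questions: iterable of (question_id, question_text, correct_option) tuples."""
    results = {}
    for qid, text, correct in questions:
        hits = sum(1 for _ in range(N_REPEATS) if ask_chatgpt(text) == correct)
        results[qid] = hits / N_REPEATS  # fraction of the 15 attempts answered correctly
    return results

# Overall accuracy would then be the mean of the per-question fractions
# (200 questions x 15 repeats = 3000 observations).
```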
Affiliation(s)
- Yousif Al-Naser: Medical Radiation Sciences, McMaster University, Hamilton, ON, Canada; Department of Diagnostic Imaging, Trillium Health Partners, Mississauga, ON, Canada
- Felobater Halka: Department of Pathology and Laboratory Medicine, Schulich School of Medicine & Dentistry, Western University, Canada
- Boris Ng: Department of Mechanical and Industrial Engineering, University of Toronto, ON, Canada
- Dwight Mountford: Medical Radiation Sciences, McMaster University, Hamilton, ON, Canada
- Sonali Sharma: Department of Radiology, Faculty of Medicine, University of British Columbia, Vancouver, British Columbia, Canada
- Ken Niure: Department of Diagnostic Imaging, Trillium Health Partners, Mississauga, ON, Canada
- Charlotte Yong-Hing: Department of Radiology, Faculty of Medicine, University of British Columbia, Vancouver, British Columbia, Canada
- Faisal Khosa: Department of Radiology, Faculty of Medicine, University of British Columbia, Vancouver, British Columbia, Canada
- Christian Van der Pol: Department of Diagnostic Imaging, Juravinski Hospital and Cancer Centre, Hamilton Health Sciences, McMaster University, Hamilton, Ontario, Canada
7. Samman L, Akuffo-Addo E, Rao B. The Performance of Artificial Intelligence Chatbot (GPT-4) on Image-Based Dermatology Certification Board Exam Questions. J Cutan Med Surg 2024. PMID: 39056427. DOI: 10.1177/12034754241266166.
Affiliation(s)
- Luna Samman: Department of Dermatology, Rowan School of Osteopathic Medicine, Stratford, NJ, USA
- Edgar Akuffo-Addo: Division of Dermatology, Department of Medicine, University of Toronto, Toronto, ON, Canada
- Babar Rao: Department of Dermatology, Rutgers Robert Wood, Somerset, NJ, USA
8. Landais R, Sultan M, Thomas RH. The promise of AI Large Language Models for Epilepsy care. Epilepsy Behav 2024; 154:109747. PMID: 38518673. DOI: 10.1016/j.yebeh.2024.109747.
Abstract
Artificial intelligence (AI) has been supporting our digital life for decades, but public interest in it has exploded with the recognition of large language models, such as GPT-4. We examine and evaluate the potential uses of generative AI technologies in epilepsy and neurological services. Generative AI could not only improve patient care and safety, by refining communication and removing certain barriers to healthcare, but could also streamline a doctor's practice through strategies such as automating paperwork. Challenges with the integration of generative AI in epilepsy services are also explored and include the risk of producing inaccurate and biased information. The impact generative AI could have on the provision of healthcare, both positive and negative, should be understood and considered carefully when deciding on the steps that need to be taken before AI is ready for use in hospitals and epilepsy services.
Affiliation(s)
- Raphaëlle Landais: Faculty of Medical Sciences, Newcastle University, Newcastle-Upon-Tyne NE1 7RU, United Kingdom
- Mustafa Sultan: Manchester University NHS Foundation Trust, Manchester M13 9PT, United Kingdom
- Rhys H Thomas: Department of Neurology, Royal Victoria Infirmary, Queen Victoria Rd, Newcastle-Upon-Tyne NE1 4LP, United Kingdom; Translational and Clinical Research Institute, Henry Wellcome Building, Framlington Place, Newcastle-Upon-Tyne NE2 4HH, United Kingdom
9. Dabbas WF, Odeibat YM, Alhazaimeh M, Hiasat MY, Alomari AA, Marji A, Samara QA, Ibrahim B, Al Arabiyat RM, Momani G. Accuracy of ChatGPT in Neurolocalization. Cureus 2024; 16:e59143. PMID: 38803743. PMCID: PMC11129669. DOI: 10.7759/cureus.59143.
Abstract
Introduction ChatGPT (OpenAI Incorporated, Mission District, San Francisco, United States) is an artificial intelligence (AI) chatbot with advanced communication skills and a massive knowledge database. However, its application in medicine, specifically in neurolocalization, necessitates clinical reasoning in addition to deep neuroanatomical knowledge. This article examines ChatGPT's capabilities in neurolocalization. Methods Forty-six text-based neurolocalization case scenarios were presented to ChatGPT-3.5 from November 6, 2023, to November 16, 2023. Seven neurosurgeons evaluated ChatGPT's responses to these cases, using a 5-point scoring system recommended by ChatGPT to rate the accuracy of the responses. Results ChatGPT-3.5 achieved an accuracy score of 84.8% in generating "completely correct" and "mostly correct" responses. ANOVA suggested a consistent scoring approach across the different evaluators. The mean length of the case text was 69.8 tokens (SD 20.8). Conclusion While this accuracy score is promising, it is not yet reliable enough for routine patient care. We recommend keeping interactions with ChatGPT concise, precise, and simple to improve response accuracy. As AI continues to evolve, it is likely to bring significant and innovative breakthroughs to medicine.
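A minimal sketch of the consistency check named in the abstract: a one-way ANOVA across the seven raters' 5-point scores for the same set of responses. The scores below are illustrative, not the study data, and the exact ANOVA design the authors used is an assumption.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical 5-point accuracy scores given by 7 neurosurgeon raters
# to the same 46 ChatGPT responses (one array per rater)
rater_scores = [rng.integers(3, 6, size=46) for _ in range(7)]

f_stat, p_value = stats.f_oneway(*rater_scores)
# A non-significant result would be consistent with raters scoring similarly on average.
print(f"One-way ANOVA: F = {f_stat:.2f}, p = {p_value:.3f}")
```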
Affiliation(s)
- Waleed F Dabbas: Division of Neurosurgery, Department of Special Surgery, Faculty of Medicine, Al-Balqa Applied University, Al-Salt, JOR
- Mohammad Alhazaimeh: Division of Neurosurgery, Department of Clinical Sciences, Faculty of Medicine, Yarmouk University, Irbid, JOR
- Amer A Alomari: Department of Neurosurgery, San Filippo Neri Hospital/Azienda Sanitaria Locale (ASL) Roma 1, Rome, ITA; Division of Neurosurgery, Department of Special Surgery, Faculty of Medicine, Mutah University, Al-Karak, JOR
- Ala Marji: Department of Neurosurgery, King Hussein Cancer Center, Amman, JOR; Department of Neurosurgery, San Filippo Neri Hospital/Azienda Sanitaria Locale (ASL) Roma 1, Rome, ITA
- Qais A Samara: Division of Neurosurgery, Department of Special Surgery, Faculty of Medicine, Al-Balqa Applied University, Al-Salt, JOR
- Bilal Ibrahim: Division of Neurosurgery, Department of Special Surgery, Faculty of Medicine, Al-Balqa Applied University, Al-Salt, JOR
- Rashed M Al Arabiyat: Department of General Practice, Al-Hussein Salt New Hospital, Ministry of Health, Al-Salt, JOR
- Ghena Momani: Faculty of Medicine, The Hashemite University, Zarqa, JOR
10. Sallam M, Barakat M, Sallam M. A Preliminary Checklist (METRICS) to Standardize the Design and Reporting of Studies on Generative Artificial Intelligence-Based Models in Health Care Education and Practice: Development Study Involving a Literature Review. Interact J Med Res 2024; 13:e54704. PMID: 38276872. PMCID: PMC10905357. DOI: 10.2196/54704.
Abstract
BACKGROUND Adherence to evidence-based practice is indispensable in health care. Recently, the utility of generative artificial intelligence (AI) models in health care has been evaluated extensively. However, the lack of consensus guidelines on the design and reporting of findings of these studies poses a challenge for the interpretation and synthesis of evidence. OBJECTIVE This study aimed to develop a preliminary checklist to standardize the reporting of generative AI-based studies in health care education and practice. METHODS A literature review was conducted in Scopus, PubMed, and Google Scholar. Published records with "ChatGPT," "Bing," or "Bard" in the title were retrieved. Careful examination of the methodologies employed in the included records was conducted to identify the common pertinent themes and the possible gaps in reporting. A panel discussion was held to establish a unified and thorough checklist for the reporting of AI studies in health care. The finalized checklist was used to evaluate the included records by 2 independent raters. Cohen κ was used as the method to evaluate the interrater reliability. RESULTS The final data set that formed the basis for pertinent theme identification and analysis comprised a total of 34 records. The finalized checklist included 9 pertinent themes collectively referred to as METRICS (Model, Evaluation, Timing and Transparency, Range and Randomization, Individual factors, Count, and Specificity of prompts and language). Their details are as follows: (1) Model used and its exact settings; (2) Evaluation approach for the generated content; (3) Timing of testing the model; (4) Transparency of the data source; (5) Range of tested topics; (6) Randomization of selecting the queries; (7) Individual factors in selecting the queries and interrater reliability; (8) Count of queries executed to test the model; and (9) Specificity of the prompts and language used. The overall mean METRICS score was 3.0 (SD 0.58). The interrater reliability of the METRICS scoring was acceptable, with Cohen κ ranging from 0.558 to 0.962 (P<.001 for the 9 tested items). With classification per item, the highest average METRICS score was recorded for the "Model" item, followed by the "Specificity" item, while the lowest scores were recorded for the "Randomization" item (classified as suboptimal) and "Individual factors" item (classified as satisfactory). CONCLUSIONS The METRICS checklist can facilitate the design of studies, guiding researchers toward best practices in reporting results. The findings highlight the need for standardized reporting algorithms for generative AI-based studies in health care, considering the variability observed in methodologies and reporting. The proposed METRICS checklist could be a helpful preliminary basis for establishing a universally accepted approach to standardize the design and reporting of generative AI-based studies in health care, which is a swiftly evolving research topic.
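A minimal sketch of the interrater reliability measure named in the abstract (Cohen κ between two independent raters). The rating vectors below are illustrative placeholders, not the study data.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical checklist-item ratings assigned by two independent raters to the same records
rater_1 = [5, 4, 4, 3, 5, 2, 4, 4, 5, 3, 4, 4]
rater_2 = [5, 4, 3, 3, 5, 2, 4, 5, 5, 3, 4, 4]

kappa = cohen_kappa_score(rater_1, rater_2)
print(f"Cohen kappa = {kappa:.3f}")
```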
Affiliation(s)
- Malik Sallam: Department of Pathology, Microbiology and Forensic Medicine, School of Medicine, The University of Jordan, Amman, Jordan; Department of Clinical Laboratories and Forensic Medicine, Jordan University Hospital, Amman, Jordan; Department of Translational Medicine, Faculty of Medicine, Lund University, Malmo, Sweden
- Muna Barakat: Department of Clinical Pharmacy and Therapeutics, Faculty of Pharmacy, Applied Science Private University, Amman, Jordan
- Mohammed Sallam: Department of Pharmacy, Mediclinic Parkview Hospital, Mediclinic Middle East, Dubai, United Arab Emirates
11. Sallam M, Al-Salahat K. Below average ChatGPT performance in medical microbiology exam compared to university students. Front Educ 2023; 8. DOI: 10.3389/feduc.2023.1333415.
Abstract
Background The transformative potential of artificial intelligence (AI) in higher education is evident, with conversational models like ChatGPT poised to reshape teaching and assessment methods. The rapid evolution of AI models requires continuous evaluation. AI-based models can offer personalized learning experiences but raise accuracy concerns. Multiple-choice questions (MCQs) are widely used for competency assessment. The aim of this study was to evaluate ChatGPT performance in medical microbiology MCQs compared to the students' performance. Methods The study employed an 80-MCQ dataset from a 2021 medical microbiology exam in the University of Jordan Doctor of Dental Surgery (DDS) Medical Microbiology 2 course. The exam contained 40 midterm and 40 final MCQs, authored by a single instructor without copyright issues. The MCQs were categorized based on the revised Bloom's Taxonomy into four categories: Remember, Understand, Analyze, or Evaluate. Metrics, including facility index and discriminative efficiency, were derived from the performance of 153 DDS students in the midterm exam and 154 in the final exam. ChatGPT 3.5 was used to answer the questions, and responses were assessed for correctness and clarity by two independent raters. Results ChatGPT 3.5 correctly answered 64 out of 80 medical microbiology MCQs (80%) but scored below the student average (80.5/100 vs. 86.21/100). Incorrect ChatGPT responses were more common in MCQs with longer choices (p = 0.025). ChatGPT 3.5 performance varied across cognitive domains: Remember (88.5% correct), Understand (82.4% correct), Analyze (75% correct), and Evaluate (72% correct), with no statistically significant differences (p = 0.492). Correct ChatGPT responses received significantly higher average clarity and correctness scores compared to incorrect responses. Conclusion The study findings emphasize the need for ongoing refinement and evaluation of ChatGPT performance. ChatGPT 3.5 showed the potential to correctly and clearly answer medical microbiology MCQs; nevertheless, its performance was below par compared to that of the students. Variability in ChatGPT performance across cognitive domains should be considered in future studies. The study insights could contribute to the ongoing evaluation of the role of AI-based models in educational assessment and help augment traditional methods in higher education.
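A minimal sketch of the two classical item-analysis metrics mentioned (facility index and a discrimination measure), computed from made-up data. The upper-minus-lower-group discrimination index used here is an assumption; the study's "discriminative efficiency" may have been computed differently (e.g., by the exam platform).

```python
import numpy as np

def facility_index(item_correct: np.ndarray) -> float:
    """Proportion of students answering the item correctly (0/1 vector)."""
    return item_correct.mean()

def discrimination_index(item_correct: np.ndarray, total_scores: np.ndarray, frac: float = 0.27) -> float:
    """Difference in item facility between the top and bottom total-score groups."""
    n = len(total_scores)
    k = max(1, int(frac * n))
    order = np.argsort(total_scores)          # ascending by total exam score
    lower, upper = order[:k], order[-k:]
    return item_correct[upper].mean() - item_correct[lower].mean()

rng = np.random.default_rng(2)
item = rng.integers(0, 2, size=153)           # hypothetical 0/1 responses to one MCQ
totals = rng.normal(86, 8, size=153)          # hypothetical total exam scores
print(f"Facility index: {facility_index(item):.2f}")
print(f"Discrimination index: {discrimination_index(item, totals):.2f}")
```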