1. Bülbül O, Bülbül HM, Kaba E. Assessing ChatGPT's summarization of 68Ga PSMA PET/CT reports for patients. Abdom Radiol (NY) 2024. PMID: 39347975. DOI: 10.1007/s00261-024-04619-8.
Abstract
PURPOSE ChatGPT has recently been the subject of many studies, and its responses to medical questions have been successful. We examined ChatGPT-4's evaluation of structured 68Ga prostate-specific membrane antigen (PSMA) PET/CT reports of newly diagnosed prostate cancer patients. METHODS 68Ga PSMA PET/CT reports of 164 patients were entered into ChatGPT-4. ChatGPT-4 was asked to respond to the following questions according to the PET/CT reports: 1-Has the cancer in the prostate extended to organs adjacent to the prostate? 2-Has the cancer in the prostate spread to neighboring lymph nodes? 3-Has the cancer in the prostate spread to lymph nodes in distant areas? 4-Has the cancer in the prostate spread to the bones? 5-Has the cancer in the prostate spread to other organs? ChatGPT-4's responses were scored on a Likert-type scale for clarity and accuracy. RESULTS The mean scores for clarity were 4.93 ± 0.32, 4.95 ± 0.25, 4.96 ± 0.19, 4.99 ± 0.11, and 4.96 ± 0.30, respectively. The mean scores for accuracy were 4.87 ± 0.61, 4.87 ± 0.62, 4.79 ± 0.83, 4.96 ± 0.25, and 4.93 ± 0.45, respectively. Patients with distant lymphatic metastases had a lower mean accuracy score than those without (4.28 ± 1.45 vs. 4.94 ± 0.39; p < 0.001). ChatGPT-4's responses for 13 patients (8%) contained potentially harmful information. CONCLUSION ChatGPT-4 successfully interprets structured 68Ga PSMA PET/CT reports of newly diagnosed prostate cancer patients. However, it is unlikely that ChatGPT-4 evaluations will replace physicians' evaluations today, especially since it can produce fabricated information.
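For illustration only, the sketch below shows the kind of analysis this abstract describes: summarizing 1-5 Likert accuracy ratings as mean ± SD and comparing the two patient groups. The scores are made up, and the Mann-Whitney U test is an assumption, since the abstract does not state which test produced the p-value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical 1-5 Likert accuracy ratings for the two patient groups (164 patients total)
scores_distant_ln = rng.integers(1, 6, size=20)      # patients with distant lymphatic metastases
scores_no_distant_ln = rng.integers(3, 6, size=144)  # patients without

for label, scores in [("with distant LN metastases", scores_distant_ln),
                      ("without distant LN metastases", scores_no_distant_ln)]:
    print(f"{label}: mean = {scores.mean():.2f}, SD = {scores.std(ddof=1):.2f}")

# Assumed non-parametric comparison of ordinal Likert ratings between the groups
u_stat, p_value = stats.mannwhitneyu(scores_distant_ln, scores_no_distant_ln)
print(f"Mann-Whitney U = {u_stat:.1f}, p = {p_value:.4f}")
```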
Affiliation(s)
- Ogün Bülbül: Recep Tayyip Erdogan University, Faculty of Medicine, Department of Nuclear Medicine, Rize, Turkey
- Hande Melike Bülbül: Recep Tayyip Erdogan University, Faculty of Medicine, Department of Radiology, Rize, Turkey
- Esat Kaba: Recep Tayyip Erdogan University, Faculty of Medicine, Department of Radiology, Rize, Turkey
2. Is EE, Menekseoglu AK. Comparative performance of artificial intelligence models in rheumatology board-level questions: evaluating Google Gemini and ChatGPT-4o. Clin Rheumatol 2024. PMID: 39340572. DOI: 10.1007/s10067-024-07154-5.
Abstract
OBJECTIVES This study evaluates the performance of AI models, ChatGPT-4o and Google Gemini, in answering rheumatology board-level questions, comparing their effectiveness, reliability, and applicability in clinical practice. METHOD A cross-sectional study was conducted using 420 rheumatology questions from the BoardVitals question bank, excluding 27 visual data questions. Both artificial intelligence models categorized the questions according to difficulty (easy, medium, hard) and answered them. In addition, the reliability of the answers was assessed by asking the questions a second time. The accuracy, reliability, and difficulty categorization of the AI models' responses to the questions were analyzed. RESULTS ChatGPT-4o answered 86.9% of the questions correctly, significantly outperforming Google Gemini's 60.2% accuracy (p < 0.001). When the questions were asked a second time, the success rate was 86.7% for ChatGPT-4o and 60.5% for Google Gemini. Both models mainly categorized questions as medium difficulty. ChatGPT-4o showed higher accuracy in various rheumatology subfields, notably in Basic and Clinical Science (p = 0.028), Osteoarthritis (p = 0.023), and Rheumatoid Arthritis (p < 0.001). CONCLUSIONS ChatGPT-4o significantly outperformed Google Gemini in rheumatology board-level questions. This demonstrates the success of ChatGPT-4o in situations requiring complex and specialized knowledge related to rheumatological diseases. The performance of both AI models decreased as the question difficulty increased. This study demonstrates the potential of AI in clinical applications and suggests that its use as a tool to assist clinicians may improve healthcare efficiency in the future. Future studies using real clinical scenarios and real board questions are recommended.
Key Points
- ChatGPT-4o significantly outperformed Google Gemini in answering rheumatology board-level questions, achieving 86.9% accuracy compared to Google Gemini's 60.2%.
- For both AI models, the correct answer rate decreased as the question difficulty increased.
- The study demonstrates the potential for AI models to be used in clinical practice as a tool to assist clinicians and improve healthcare efficiency.
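Because both models answered the same 420 questions, a paired comparison is the natural analysis. The abstract reports only the accuracies and p < 0.001, not the test used, so the sketch below assumes a McNemar test with illustrative, made-up cell counts chosen to reproduce the reported accuracies.

```python
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical question-level agreement table for the 420 shared questions:
# rows = ChatGPT-4o (correct, incorrect), columns = Google Gemini (correct, incorrect)
table = [[240, 125],   # both correct | only ChatGPT-4o correct  (365/420 = 86.9% for ChatGPT-4o)
         [ 13,  42]]   # only Gemini correct | both incorrect    (253/420 = 60.2% for Gemini)

result = mcnemar(table, exact=False, correction=True)
print(f"McNemar chi2 = {result.statistic:.2f}, p = {result.pvalue:.2e}")
```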
Affiliation(s)
- Enes Efe Is: Department of Physical Medicine and Rehabilitation, Sisli Hamidiye Etfal Training and Research Hospital, University of Health Sciences, Seyrantepe Campus, Cumhuriyet ve Demokrasi Avenue, Istanbul, Turkey
- Ahmet Kivanc Menekseoglu: Department of Physical Medicine and Rehabilitation, Kanuni Sultan Süleyman Training and Research Hospital, University of Health Sciences, Istanbul, Turkey
3. Bicknell BT, Butler D, Whalen S, Ricks J, Dixon CJ, Clark AB, Spaedy O, Skelton A, Edupuganti N, Dzubinski L, Tate H, Dyess G, Lindeman B, Lehmann LS. Critical Analysis of ChatGPT 4 Omni in USMLE Disciplines, Clinical Clerkships, and Clinical Skills. JMIR Med Educ 2024. PMID: 39276063. DOI: 10.2196/63430.
Abstract
BACKGROUND Recent studies, including those by the National Board of Medical Examiners (NBME), have highlighted the remarkable capabilities of large language models (LLMs) such as ChatGPT in passing the United States Medical Licensing Examination (USMLE). However, there is a gap in detailed analysis of these models' performance in specific medical content areas, thus limiting an assessment of their potential utility for medical education. OBJECTIVE To assess and compare the accuracy of successive ChatGPT versions (GPT-3.5, GPT-4, and GPT-4 Omni) in USMLE disciplines, clinical clerkships, and the clinical skills of diagnostics and management. METHODS This study used 750 clinical vignette-based multiple-choice questions (MCQs) to characterize the performance of successive ChatGPT versions [ChatGPT 3.5 (GPT-3.5), ChatGPT 4 (GPT-4), and ChatGPT 4 Omni (GPT-4o)] across USMLE disciplines, clinical clerkships, and in clinical skills (diagnostics and management). Accuracy was assessed using a standardized protocol, with statistical analyses conducted to compare the models' performances. RESULTS GPT-4o achieved the highest accuracy across 750 MCQs at 90.4%, outperforming GPT-4 and GPT-3.5, which scored 81.1% and 60.0% respectively. GPT-4o's highest performances were in social sciences (95.5%), behavioral and neuroscience (94.2%), and pharmacology (93.2%). In clinical skills, GPT-4o's diagnostic accuracy was 92.7% and management accuracy 88.8%, significantly higher than its predecessors. Notably, both GPT-4o and GPT-4 significantly outperformed the medical student average accuracy of 59.3% (95% CI: 58.3-60.3). CONCLUSIONS ChatGPT 4 Omni's performance in USMLE preclinical content areas as well as clinical skills indicates substantial improvements over its predecessors, suggesting significant potential for the use of this technology as an educational aid for medical students. These findings underscore the necessity of careful consideration of LLMs' integration into medical education, emphasizing the importance of structured curricula to guide their appropriate use and the need for ongoing critical analyses to ensure their reliability and effectiveness.
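As an illustrative sketch only (the exact statistical procedures are not given in the abstract), the snippet below shows two calculations of the kind reported: a Wilson 95% confidence interval for an accuracy expressed as a proportion, and an omnibus chi-square comparison of the three models' accuracies. The counts are reconstructed from the reported percentages on the 750-question set.

```python
from statsmodels.stats.proportion import proportion_confint, proportions_chisquare

n_questions = 750
# Correct-answer counts reconstructed from the reported accuracies (~60.0%, ~81.1%, ~90.4%)
correct = {"GPT-3.5": 450, "GPT-4": 608, "GPT-4o": 678}

for model, k in correct.items():
    low, high = proportion_confint(k, n_questions, alpha=0.05, method="wilson")
    print(f"{model}: {k / n_questions:.1%} (95% CI {low:.1%}-{high:.1%})")

# Omnibus comparison of the three accuracies
chi2, p, _ = proportions_chisquare(list(correct.values()), nobs=[n_questions] * 3)
print(f"chi2 = {chi2:.1f}, p = {p:.2e}")
```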
Affiliation(s)
- Brenton T Bicknell: University of Alabama at Birmingham Heersink School of Medicine, 1670 University Blvd, Birmingham, AL 35233, US
- Danner Butler: University of South Alabama Whiddon College of Medicine, Mobile, US
- Sydney Whalen: University of Illinois College of Medicine, Chicago, US
- Cory J Dixon: Alabama College of Osteopathic Medicine, Dothan, US
- Olivia Spaedy: Saint Louis University School of Medicine, St. Louis, US
- Adam Skelton: University of Alabama at Birmingham Heersink School of Medicine, 1670 University Blvd, Birmingham, AL 35233, US
- Lance Dzubinski: University of Colorado Anschutz Medical Campus School of Medicine, Aurora, US
- Hudson Tate: University of Alabama at Birmingham Heersink School of Medicine, 1670 University Blvd, Birmingham, AL 35233, US
- Garrett Dyess: University of South Alabama Whiddon College of Medicine, Mobile, US
- Brenessa Lindeman: University of Alabama at Birmingham Heersink School of Medicine, 1670 University Blvd, Birmingham, AL 35233, US
4. Erdogan M. Evaluation of responses of the large language model GPT to the neurology question of the week. Neurol Sci 2024; 45:4605-4606. PMID: 38717580. DOI: 10.1007/s10072-024-07580-y.
Affiliation(s)
- Mucahid Erdogan: Neurology Department, Kartal Dr. Lütfi Kirdar City Hospital, Istanbul, Turkey
5. Kenney RC, Requarth TW, Jack AI, Hyman SW, Galetta SL, Grossman SN. AI in Neuro-Ophthalmology: Current Practice and Future Opportunities. J Neuroophthalmol 2024; 44:308-318. PMID: 38965655. DOI: 10.1097/wno.0000000000002205.
Abstract
BACKGROUND Neuro-ophthalmology frequently requires a complex and multi-faceted clinical assessment supported by sophisticated imaging techniques in order to assess disease status. The current approach to diagnosis requires substantial expertise and time. The emergence of AI has brought forth innovative solutions to streamline and enhance this diagnostic process, which is especially valuable given the shortage of neuro-ophthalmologists. Machine learning algorithms, in particular, have demonstrated significant potential in interpreting imaging data, identifying subtle patterns, and aiding clinicians in making more accurate and timely diagnoses, while also supplementing nonspecialist evaluations of neuro-ophthalmic disease. EVIDENCE ACQUISITION Electronic searches of published literature were conducted using PubMed and Google Scholar. A comprehensive search of the following terms was conducted within the Journal of Neuro-Ophthalmology: AI, artificial intelligence, machine learning, deep learning, natural language processing, computer vision, large language models, and generative AI. RESULTS This review aims to provide a comprehensive overview of the evolving landscape of AI applications in neuro-ophthalmology. It delves into the diverse applications of AI, from the interpretation of optical coherence tomography (OCT) and fundus photography to the development of predictive models for disease progression. Additionally, the review explores the integration of generative AI into neuro-ophthalmic education and clinical practice. CONCLUSIONS We review the current state of AI in neuro-ophthalmology and its potentially transformative impact. The inclusion of AI in neuro-ophthalmic practice and research not only holds promise for improving diagnostic accuracy but also opens avenues for novel therapeutic interventions. We emphasize its potential to improve access to scarce subspecialty resources while examining the current challenges associated with the integration of AI into clinical practice and research.
Affiliation(s)
- Rachel C Kenney: Departments of Neurology (RCK, AJ, SH, SG, SNG), Population Health (RCK), and Ophthalmology (SG), New York University Grossman School of Medicine, New York, New York; and Vilcek Institute of Graduate Biomedical Sciences (TR), New York University Grossman School of Medicine, New York, New York
6. Al-Naser Y, Halka F, Ng B, Mountford D, Sharma S, Niure K, Yong-Hing C, Khosa F, Van der Pol C. Evaluating Artificial Intelligence Competency in Education: Performance of ChatGPT-4 in the American Registry of Radiologic Technologists (ARRT) Radiography Certification Exam. Acad Radiol 2024. PMID: 39153961. DOI: 10.1016/j.acra.2024.08.009.
Abstract
RATIONALE AND OBJECTIVES The American Registry of Radiologic Technologists (ARRT) leads the certification process with an exam comprising 200 multiple-choice questions. This study aims to evaluate ChatGPT-4's performance in responding to practice questions similar to those found in the ARRT board examination. MATERIALS AND METHODS We used a dataset of 200 practice multiple-choice questions for the ARRT certification exam from BoardVitals. Each question was fed to ChatGPT-4 fifteen times, resulting in 3000 observations to account for response variability. RESULTS ChatGPT's overall performance was 80.56%, with higher accuracy on text-based questions (86.3%) compared to image-based questions (45.6%). Response times were longer for image-based questions (18.01 s) than for text-based questions (13.27 s). Performance varied by domain: 72.6% for Safety, 70.6% for Image Production, 67.3% for Patient Care, and 53.4% for Procedures. As anticipated, performance was best on easy questions (78.5%). CONCLUSION ChatGPT demonstrated effective performance on the BoardVitals question bank for ARRT certification. Future studies could benefit from analyzing the correlation between BoardVitals scores and actual exam outcomes. Further development in AI, particularly in image processing and interpretation, is necessary to enhance its utility in educational settings.
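A minimal sketch of the repeated-query protocol described (each of the 200 questions submitted 15 times and per-question accuracy aggregated). The `ask_chatgpt` function is a hypothetical placeholder, not the authors' code or a real API wrapper; it only marks where a call to the model would go.

```python
N_REPEATS = 15  # each question is asked 15 times, as in the study protocol

def ask_chatgpt(question_text: str) -> str:
    """Hypothetical placeholder for a call to the model; should return the chosen option letter."""
    raise NotImplementedError("Replace with an actual model/API call.")

def per_question_accuracy(questions):
    """questions: iterable of (question_id, question_text, correct_option) tuples."""
    results = {}
    for qid, text, correct in questions:
        hits = sum(1 for _ in range(N_REPEATS) if ask_chatgpt(text) == correct)
        results[qid] = hits / N_REPEATS  # fraction of the 15 attempts answered correctly
    return results

# Overall accuracy would then be the mean of the per-question fractions
# (200 questions x 15 repeats = 3000 observations).
```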
Affiliation(s)
- Yousif Al-Naser: Medical Radiation Sciences, McMaster University, Hamilton, ON, Canada; Department of Diagnostic Imaging, Trillium Health Partners, Mississauga, ON, Canada
- Felobater Halka: Department of Pathology and Laboratory Medicine, Schulich School of Medicine & Dentistry, Western University, Canada
- Boris Ng: Department of Mechanical and Industrial Engineering, University of Toronto, ON, Canada
- Dwight Mountford: Medical Radiation Sciences, McMaster University, Hamilton, ON, Canada
- Sonali Sharma: Department of Radiology, Faculty of Medicine, University of British Columbia, Vancouver, British Columbia, Canada
- Ken Niure: Department of Diagnostic Imaging, Trillium Health Partners, Mississauga, ON, Canada
- Charlotte Yong-Hing: Department of Radiology, Faculty of Medicine, University of British Columbia, Vancouver, British Columbia, Canada
- Faisal Khosa: Department of Radiology, Faculty of Medicine, University of British Columbia, Vancouver, British Columbia, Canada
- Christian Van der Pol: Department of Diagnostic Imaging, Juravinski Hospital and Cancer Centre, Hamilton Health Sciences, McMaster University, Hamilton, Ontario, Canada
7. Samman L, Akuffo-Addo E, Rao B. The Performance of Artificial Intelligence Chatbot (GPT-4) on Image-Based Dermatology Certification Board Exam Questions. J Cutan Med Surg 2024. PMID: 39056427. DOI: 10.1177/12034754241266166.
Affiliation(s)
- Luna Samman: Department of Dermatology, Rowan School of Osteopathic Medicine, Stratford, NJ, USA
- Edgar Akuffo-Addo: Division of Dermatology, Department of Medicine, University of Toronto, Toronto, ON, Canada
- Babar Rao: Department of Dermatology, Rutgers Robert Wood, Somerset, NJ, USA
8. Landais R, Sultan M, Thomas RH. The promise of AI Large Language Models for Epilepsy care. Epilepsy Behav 2024; 154:109747. PMID: 38518673. DOI: 10.1016/j.yebeh.2024.109747.
Abstract
Artificial intelligence (AI) has been supporting our digital life for decades, but public interest in it has exploded with the recognition of large language models, such as GPT-4. We examine and evaluate the potential uses of generative AI technologies in epilepsy and neurological services. Generative AI could not only improve patient care and safety, by refining communication and removing certain barriers to healthcare, but could also streamline a doctor's practice through strategies such as automating paperwork. Challenges with the integration of generative AI in epilepsy services are also explored and include the risk of producing inaccurate and biased information. The impact generative AI could have on the provision of healthcare, both positive and negative, should be understood and considered carefully when deciding on the steps that need to be taken before AI is ready for use in hospitals and epilepsy services.
Affiliation(s)
- Raphaëlle Landais: Faculty of Medical Sciences, Newcastle University, Newcastle-Upon-Tyne NE1 7RU, United Kingdom
- Mustafa Sultan: Manchester University NHS Foundation Trust, Manchester M13 9PT, United Kingdom
- Rhys H Thomas: Department of Neurology, Royal Victoria Infirmary, Queen Victoria Rd, Newcastle-Upon-Tyne NE1 4LP, United Kingdom; Translational and Clinical Research Institute, Henry Wellcome Building, Framlington Place, Newcastle-Upon-Tyne NE2 4HH, United Kingdom
9. Dabbas WF, Odeibat YM, Alhazaimeh M, Hiasat MY, Alomari AA, Marji A, Samara QA, Ibrahim B, Al Arabiyat RM, Momani G. Accuracy of ChatGPT in Neurolocalization. Cureus 2024; 16:e59143. PMID: 38803743. PMCID: PMC11129669. DOI: 10.7759/cureus.59143.
Abstract
Introduction ChatGPT (OpenAI Incorporated, Mission District, San Francisco, United States) is an artificial intelligence (AI) chatbot with advanced communication skills and a massive knowledge database. However, its application in medicine, specifically in neurolocalization, necessitates clinical reasoning in addition to deep neuroanatomical knowledge. This article examines ChatGPT's capabilities in neurolocalization. Methods Forty-six text-based neurolocalization case scenarios were presented to ChatGPT-3.5 from November 6, 2023, to November 16, 2023. Seven neurosurgeons evaluated ChatGPT's responses to these cases, using a 5-point scoring system recommended by ChatGPT to rate the accuracy of the responses. Results ChatGPT-3.5 achieved an accuracy score of 84.8% in generating "completely correct" and "mostly correct" responses. ANOVA suggested a consistent scoring approach across the different evaluators. The mean length of the case text was 69.8 tokens (SD 20.8). Conclusion While this accuracy score is promising, it is not yet reliable enough for routine patient care. We recommend keeping interactions with ChatGPT concise, precise, and simple to improve response accuracy. As AI continues to evolve, it is likely to bring significant and innovative breakthroughs to medicine.
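A minimal sketch of the consistency check named in the abstract: a one-way ANOVA across the seven raters' 5-point scores for the same set of responses. The scores below are illustrative, not the study data, and the exact ANOVA design the authors used is an assumption.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical 5-point accuracy scores given by 7 neurosurgeon raters
# to the same 46 ChatGPT responses (one array per rater)
rater_scores = [rng.integers(3, 6, size=46) for _ in range(7)]

f_stat, p_value = stats.f_oneway(*rater_scores)
# A non-significant result would be consistent with raters scoring similarly on average.
print(f"One-way ANOVA: F = {f_stat:.2f}, p = {p_value:.3f}")
```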
Affiliation(s)
- Waleed F Dabbas: Division of Neurosurgery, Department of Special Surgery, Faculty of Medicine, Al-Balqa Applied University, Al-Salt, JOR
- Mohammad Alhazaimeh: Division of Neurosurgery, Department of Clinical Sciences, Faculty of Medicine, Yarmouk University, Irbid, JOR
- Amer A Alomari: Department of Neurosurgery, San Filippo Neri Hospital/Azienda Sanitaria Locale (ASL) Roma 1, Rome, ITA; Division of Neurosurgery, Department of Special Surgery, Faculty of Medicine, Mutah University, Al-Karak, JOR
- Ala Marji: Department of Neurosurgery, King Hussein Cancer Center, Amman, JOR; Department of Neurosurgery, San Filippo Neri Hospital/Azienda Sanitaria Locale (ASL) Roma 1, Rome, ITA
- Qais A Samara: Division of Neurosurgery, Department of Special Surgery, Faculty of Medicine, Al-Balqa Applied University, Al-Salt, JOR
- Bilal Ibrahim: Division of Neurosurgery, Department of Special Surgery, Faculty of Medicine, Al-Balqa Applied University, Al-Salt, JOR
- Rashed M Al Arabiyat: Department of General Practice, Al-Hussein Salt New Hospital, Ministry of Health, Al-Salt, JOR
- Ghena Momani: Faculty of Medicine, The Hashemite University, Zarqa, JOR
10. Sallam M, Barakat M, Sallam M. A Preliminary Checklist (METRICS) to Standardize the Design and Reporting of Studies on Generative Artificial Intelligence-Based Models in Health Care Education and Practice: Development Study Involving a Literature Review. Interact J Med Res 2024; 13:e54704. PMID: 38276872. PMCID: PMC10905357. DOI: 10.2196/54704.
Abstract
BACKGROUND Adherence to evidence-based practice is indispensable in health care. Recently, the utility of generative artificial intelligence (AI) models in health care has been evaluated extensively. However, the lack of consensus guidelines on the design and reporting of findings of these studies poses a challenge for the interpretation and synthesis of evidence. OBJECTIVE This study aimed to develop a preliminary checklist to standardize the reporting of generative AI-based studies in health care education and practice. METHODS A literature review was conducted in Scopus, PubMed, and Google Scholar. Published records with "ChatGPT," "Bing," or "Bard" in the title were retrieved. Careful examination of the methodologies employed in the included records was conducted to identify the common pertinent themes and the possible gaps in reporting. A panel discussion was held to establish a unified and thorough checklist for the reporting of AI studies in health care. The finalized checklist was used to evaluate the included records by 2 independent raters. Cohen κ was used as the method to evaluate the interrater reliability. RESULTS The final data set that formed the basis for pertinent theme identification and analysis comprised a total of 34 records. The finalized checklist included 9 pertinent themes collectively referred to as METRICS (Model, Evaluation, Timing and Transparency, Range and Randomization, Individual factors, Count, and Specificity of prompts and language). Their details are as follows: (1) Model used and its exact settings; (2) Evaluation approach for the generated content; (3) Timing of testing the model; (4) Transparency of the data source; (5) Range of tested topics; (6) Randomization of selecting the queries; (7) Individual factors in selecting the queries and interrater reliability; (8) Count of queries executed to test the model; and (9) Specificity of the prompts and language used. The overall mean METRICS score was 3.0 (SD 0.58). The interrater reliability of the METRICS scoring was acceptable, with Cohen κ ranging from 0.558 to 0.962 (P<.001 for the 9 tested items). With classification per item, the highest average METRICS score was recorded for the "Model" item, followed by the "Specificity" item, while the lowest scores were recorded for the "Randomization" item (classified as suboptimal) and "Individual factors" item (classified as satisfactory). CONCLUSIONS The METRICS checklist can facilitate the design of studies, guiding researchers toward best practices in reporting results. The findings highlight the need for standardized reporting algorithms for generative AI-based studies in health care, considering the variability observed in methodologies and reporting. The proposed METRICS checklist could be a helpful preliminary basis for establishing a universally accepted approach to standardize the design and reporting of generative AI-based studies in health care, which is a swiftly evolving research topic.
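A minimal sketch of the interrater reliability measure named in the abstract (Cohen κ between two independent raters). The rating vectors below are illustrative placeholders, not the study data.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical checklist-item ratings assigned by two independent raters to the same records
rater_1 = [5, 4, 4, 3, 5, 2, 4, 4, 5, 3, 4, 4]
rater_2 = [5, 4, 3, 3, 5, 2, 4, 5, 5, 3, 4, 4]

kappa = cohen_kappa_score(rater_1, rater_2)
print(f"Cohen kappa = {kappa:.3f}")
```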
Affiliation(s)
- Malik Sallam: Department of Pathology, Microbiology and Forensic Medicine, School of Medicine, The University of Jordan, Amman, Jordan; Department of Clinical Laboratories and Forensic Medicine, Jordan University Hospital, Amman, Jordan; Department of Translational Medicine, Faculty of Medicine, Lund University, Malmo, Sweden
- Muna Barakat: Department of Clinical Pharmacy and Therapeutics, Faculty of Pharmacy, Applied Science Private University, Amman, Jordan
- Mohammed Sallam: Department of Pharmacy, Mediclinic Parkview Hospital, Mediclinic Middle East, Dubai, United Arab Emirates
11. Sallam M, Al-Salahat K. Below average ChatGPT performance in medical microbiology exam compared to university students. Front Educ 2023; 8. DOI: 10.3389/feduc.2023.1333415.
Abstract
Background The transformative potential of artificial intelligence (AI) in higher education is evident, with conversational models like ChatGPT poised to reshape teaching and assessment methods. The rapid evolution of AI models requires continuous evaluation. AI-based models can offer personalized learning experiences but raise accuracy concerns. Multiple-choice questions (MCQs) are widely used for competency assessment. The aim of this study was to evaluate ChatGPT performance in medical microbiology MCQs compared to the students' performance. Methods The study employed an 80-MCQ dataset from a 2021 medical microbiology exam in the University of Jordan Doctor of Dental Surgery (DDS) Medical Microbiology 2 course. The exam contained 40 midterm and 40 final MCQs, authored by a single instructor without copyright issues. The MCQs were categorized based on the revised Bloom's Taxonomy into four categories: Remember, Understand, Analyze, or Evaluate. Metrics, including facility index and discriminative efficiency, were derived from the performance of 153 DDS students in the midterm exam and 154 in the final exam. ChatGPT 3.5 was used to answer the questions, and responses were assessed for correctness and clarity by two independent raters. Results ChatGPT 3.5 correctly answered 64 out of 80 medical microbiology MCQs (80%) but scored below the student average (80.5/100 vs. 86.21/100). Incorrect ChatGPT responses were more common in MCQs with longer choices (p = 0.025). ChatGPT 3.5 performance varied across cognitive domains: Remember (88.5% correct), Understand (82.4% correct), Analyze (75% correct), and Evaluate (72% correct), with no statistically significant differences (p = 0.492). Correct ChatGPT responses received significantly higher average clarity and correctness scores compared to incorrect responses. Conclusion The study findings emphasize the need for ongoing refinement and evaluation of ChatGPT performance. ChatGPT 3.5 showed the potential to correctly and clearly answer medical microbiology MCQs; nevertheless, its performance was below par compared to that of the students. Variability in ChatGPT performance across cognitive domains should be considered in future studies. The study insights could contribute to the ongoing evaluation of the role of AI-based models in educational assessment and help augment traditional methods in higher education.
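A minimal sketch of the two classical item-analysis metrics mentioned (facility index and a discrimination measure), computed from made-up data. The upper-minus-lower-group discrimination index used here is an assumption; the study's "discriminative efficiency" may have been computed differently (e.g., by the exam platform).

```python
import numpy as np

def facility_index(item_correct: np.ndarray) -> float:
    """Proportion of students answering the item correctly (0/1 vector)."""
    return item_correct.mean()

def discrimination_index(item_correct: np.ndarray, total_scores: np.ndarray, frac: float = 0.27) -> float:
    """Difference in item facility between the top and bottom total-score groups."""
    n = len(total_scores)
    k = max(1, int(frac * n))
    order = np.argsort(total_scores)          # ascending by total exam score
    lower, upper = order[:k], order[-k:]
    return item_correct[upper].mean() - item_correct[lower].mean()

rng = np.random.default_rng(2)
item = rng.integers(0, 2, size=153)           # hypothetical 0/1 responses to one MCQ
totals = rng.normal(86, 8, size=153)          # hypothetical total exam scores
print(f"Facility index: {facility_index(item):.2f}")
print(f"Discrimination index: {discrimination_index(item, totals):.2f}")
```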