1
Su JM, Hsu SY, Fang TY, Wang PC. Developing and validating a knowledge-based AI assessment system for learning clinical core medical knowledge in otolaryngology. Comput Biol Med 2024;178:108765. PMID: 38897143. DOI: 10.1016/j.compbiomed.2024.108765.
Abstract
BACKGROUND Clinical core medical knowledge (CCMK) learning is essential for medical trainees. Adaptive assessment systems can facilitate self-learning, but extracting experts' CCMK is challenging, especially using modern data-driven artificial intelligence (AI) approaches (e.g., deep learning). OBJECTIVES This study aims to develop a multi-expert knowledge-aggregated adaptive assessment scheme (MEKAS) using knowledge-based AI approaches to facilitate the learning of CCMK in otolaryngology (CCMK-OTO) and to validate its effectiveness through a one-month training program for CCMK-OTO education at a tertiary referral hospital. METHODS MEKAS used the repertory grid technique and case-based reasoning to aggregate experts' knowledge into a representative CCMK base, thereby enabling adaptive assessment for CCMK-OTO training. The effects of longitudinal training were compared between the experimental group (EG) and the control group (CG). Both groups received a normal training program (routine meetings, outpatient/operating room teaching, and classroom teaching), while the EG additionally received MEKAS for self-learning. The EG comprised 22 UPGY trainees (6 postgraduate [PGY] and 16 undergraduate [UGY] trainees) and 8 otolaryngology residents (ENT-R); the CG comprised 24 UPGY trainees (8 PGY and 16 UGY trainees). Training effectiveness was compared through pre- and post-test CCMK-OTO scores, and user experience was evaluated using a technology acceptance model-based questionnaire. RESULTS Both the UPGY (z = -3.976, P < 0.001) and ENT-R (z = -2.038, P = 0.042) groups in the EG exhibited significant improvements in their CCMK-OTO scores, while the UPGY group in the CG did not (z = -1.204, P = 0.228). The UPGY group in the EG also demonstrated a substantial improvement compared to the UPGY group in the CG (z = -4.943, P < 0.001). EG participants were highly satisfied with the MEKAS system concerning self-learning assistance, adaptive testing, perceived satisfaction, intention to use, perceived usefulness, perceived ease of use, and perceived enjoyment, with overall average ratings between 3.8 and 4.1 out of 5.0 across all scales. CONCLUSIONS The MEKAS system facilitates CCMK-OTO learning and provides an efficient knowledge aggregation scheme that can be applied to other medical subjects to efficiently build adaptive assessment systems for CCMK learning. Larger-scale validation across diverse institutions and settings is warranted to further assess MEKAS's scalability, generalizability, and long-term impact.
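The within-group pre/post comparisons above are reported as z statistics, which is consistent with nonparametric paired testing. As a minimal sketch of that kind of analysis (the abstract does not name the test, so a Wilcoxon signed-rank test on invented pre/post scores is assumed here):

```python
# Hedged sketch: Wilcoxon signed-rank test on hypothetical pre/post scores.
# The values below are illustrative placeholders, not data from the study.
from scipy.stats import wilcoxon

pre_scores = [52, 61, 58, 47, 66, 55, 60, 49, 63, 57]    # hypothetical pre-test
post_scores = [68, 72, 65, 59, 74, 63, 71, 58, 70, 66]   # hypothetical post-test

statistic, p_value = wilcoxon(pre_scores, post_scores)
print(f"Wilcoxon statistic = {statistic}, p = {p_value:.4f}")
```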
Affiliation(s)
- Jun-Ming Su: Department of Information and Learning Technology, National University of Tainan, Tainan, Taiwan.
- Su-Yi Hsu: Department of Otolaryngology, Cathay General Hospital, Taipei, Taiwan; School of Medicine, Fu Jen Catholic University, New Taipei City, Taiwan; School of Medicine, National Tsing Hua University, Hsinchu, Taiwan.
- Te-Yung Fang: Department of Otolaryngology, Cathay General Hospital, Taipei, Taiwan; School of Medicine, Fu Jen Catholic University, New Taipei City, Taiwan; Department of Otolaryngology, Sijhih Cathay General Hospital, New Taipei City, Taiwan.
- Pa-Chun Wang: Department of Otolaryngology, Cathay General Hospital, Taipei, Taiwan; School of Medicine, Fu Jen Catholic University, New Taipei City, Taiwan; Department of Medical Research, China Medical University Hospital, China Medical University, Taichung, Taiwan.
2
Moulaei K, Yadegari A, Baharestani M, Farzanbakhsh S, Sabet B, Afrash MR. Generative artificial intelligence in healthcare: A scoping review on benefits, challenges and applications. Int J Med Inform 2024;188:105474. PMID: 38733640. DOI: 10.1016/j.ijmedinf.2024.105474.
Abstract
BACKGROUND Generative artificial intelligence (GAI) is revolutionizing healthcare with solutions for complex challenges, enhancing diagnosis, treatment, and care through new data and insights. However, its integration raises questions about applications, benefits, and challenges. Our study explores these aspects, offering an overview of GAI's applications and future prospects in healthcare. METHODS This scoping review searched Web of Science, PubMed, and Scopus. The selection of studies involved screening titles, reviewing abstracts, and examining full texts, adhering to the PRISMA-ScR guidelines throughout the process. RESULTS From 1406 articles across the three databases, 109 met the inclusion criteria after screening and deduplication. Nine GAI models were utilized in healthcare, with ChatGPT (n = 102, 74%), Google Bard (Gemini) (n = 16, 11%), and Microsoft Bing AI (n = 10, 7%) being the most frequently employed. A total of 24 different applications of GAI in healthcare were identified, with the most common being "offering insights and information on health conditions through answering questions" (n = 41) and "diagnosis and prediction of diseases" (n = 17). In total, 606 benefits and challenges were identified, which were condensed to 48 benefits and 61 challenges after consolidation. The predominant benefits included "Providing rapid access to information and valuable insights" and "Improving prediction and diagnosis accuracy", while the primary challenges comprised "generating inaccurate or fictional content", "unknown source of information and fake references for texts", and "lower accuracy in answering questions". CONCLUSION This scoping review identified the applications, benefits, and challenges of GAI in healthcare. This synthesis offers a crucial overview of GAI's potential to revolutionize healthcare, emphasizing the imperative to address its limitations.
Affiliation(s)
- Khadijeh Moulaei: Department of Health Information Technology, School of Paramedical, Ilam University of Medical Sciences, Ilam, Iran.
- Atiye Yadegari: Department of Pediatric Dentistry, School of Dentistry, Hamadan University of Medical Sciences, Hamadan, Iran.
- Mahdi Baharestani: Network of Interdisciplinarity in Neonates and Infants (NINI), Universal Scientific Education and Research Network (USERN), Tehran, Iran.
- Shayan Farzanbakhsh: Network of Interdisciplinarity in Neonates and Infants (NINI), Universal Scientific Education and Research Network (USERN), Tehran, Iran.
- Babak Sabet: Department of Surgery, Faculty of Medicine, Shahid Beheshti University of Medical Sciences, Tehran, Iran.
- Mohammad Reza Afrash: Department of Artificial Intelligence, Smart University of Medical Sciences, Tehran, Iran.
3
Zhui L, Yhap N, Liping L, Zhengjie W, Zhonghao X, Xiaoshu Y, Hong C, Xuexiu L, Wei R. Impact of Large Language Models on Medical Education and Teaching Adaptations. JMIR Med Inform 2024;12:e55933. PMID: 39087590. PMCID: PMC11294775. DOI: 10.2196/55933.
Abstract
This viewpoint article explores the transformative role of large language models (LLMs) in the field of medical education, highlighting their potential to enhance teaching quality, promote personalized learning paths, strengthen clinical skills training, optimize teaching assessment processes, boost the efficiency of medical research, and support continuing medical education. However, the use of LLMs entails certain challenges, such as questions regarding the accuracy of information, the risk of overreliance on technology, a lack of emotional recognition capabilities, and concerns related to ethics, privacy, and data security. This article emphasizes that to maximize the potential of LLMs and overcome these challenges, educators must exhibit leadership in medical education, adjust their teaching strategies flexibly, cultivate students' critical thinking, and emphasize the importance of practical experience, thus ensuring that students can use LLMs correctly and effectively. By adopting such a comprehensive and balanced approach, educators can train health care professionals who are proficient in the use of advanced technologies and who exhibit solid professional ethics and practical skills, thus laying a strong foundation for these professionals to overcome future challenges in the health care sector.
Affiliation(s)
- Li Zhui: Department of Vascular Surgery, The First Affiliated Hospital of Chongqing Medical University, Chongqing, China.
- Nina Yhap: Department of General Surgery, Queen Elizabeth Hospital, St Michael, Barbados.
- Liu Liping: Department of Ultrasound, The First Affiliated Hospital of Chongqing Medical University, Chongqing, China.
- Wang Zhengjie: Department of Nuclear Medicine, The First Affiliated Hospital of Chongqing Medical University, Chongqing, China.
- Xiong Zhonghao: Department of Acupuncture and Moxibustion, Chongqing Traditional Chinese Medicine Hospital, Chongqing, China.
- Yuan Xiaoshu: Department of Anesthesia, The First Affiliated Hospital of Chongqing Medical University, Chongqing, China.
- Cui Hong: Department of Anesthesia, The First Affiliated Hospital of Chongqing Medical University, Chongqing, China.
- Liu Xuexiu: Department of Neonatology, Children's Hospital of Chongqing Medical University, Chongqing, China.
- Ren Wei: Department of Vascular Surgery, The First Affiliated Hospital of Chongqing Medical University, Chongqing, China.
4
Mistry NP, Saeed H, Rafique S, Le T, Obaid H, Adams SJ. Large Language Models as Tools to Generate Radiology Board-Style Multiple-Choice Questions. Acad Radiol 2024:S1076-6332(24)00432-X. PMID: 39013736. DOI: 10.1016/j.acra.2024.06.046.
Abstract
RATIONALE AND OBJECTIVES To determine the potential of large language models (LLMs) to be used as tools by radiology educators to create radiology board-style multiple-choice questions (MCQs), answers, and rationales. METHODS Two LLMs (Llama 2 and GPT-4) were used to develop 104 MCQs based on the American Board of Radiology exam blueprint. Two board-certified radiologists assessed each MCQ using a 10-point Likert scale across five criteria: clarity, relevance, suitability for a board exam based on level of difficulty, quality of distractors, and adequacy of rationale. For comparison, MCQs from prior American College of Radiology (ACR) Diagnostic Radiology In-Training (DXIT) exams were also assessed using these criteria, with the radiologists blinded to the question source. RESULTS Mean scores (± standard deviation) for clarity, relevance, suitability, quality of distractors, and adequacy of rationale were 8.7 (±1.4), 9.2 (±1.3), 9.0 (±1.2), 8.4 (±1.9), and 7.2 (±2.2), respectively, for Llama 2; 9.9 (±0.4), 9.9 (±0.5), 9.9 (±0.4), 9.8 (±0.5), and 9.9 (±0.3), respectively, for GPT-4; and 9.9 (±0.3), 9.9 (±0.2), 9.9 (±0.2), 9.9 (±0.4), and 9.8 (±0.6), respectively, for ACR DXIT items (p < 0.001 for Llama 2 vs. ACR DXIT across all criteria; no statistically significant difference for GPT-4 vs. ACR DXIT). The accuracy of model-generated answers was 69% for Llama 2 and 100% for GPT-4. CONCLUSION A state-of-the-art LLM such as GPT-4 may be used to develop radiology board-style MCQs and rationales to enhance exam preparation materials and expand exam banks, and may allow radiology educators to further use MCQs as teaching and learning tools.
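The blinded comparison above rates LLM-generated and ACR DXIT items on the same Likert criteria. The abstract does not name the statistical test behind the reported p-values; the sketch below assumes a two-sided Mann-Whitney U test on invented 10-point ratings:

```python
# Hedged sketch: comparing hypothetical 10-point Likert ratings of
# LLM-generated vs. exam-bank MCQs with a Mann-Whitney U test.
from scipy.stats import mannwhitneyu

llm_ratings = [9, 8, 7, 10, 9, 8, 6, 9, 8, 7]          # hypothetical LLM-item ratings
dxit_ratings = [10, 10, 9, 10, 10, 9, 10, 10, 9, 10]   # hypothetical DXIT ratings

u_stat, p_value = mannwhitneyu(llm_ratings, dxit_ratings, alternative="two-sided")
print(f"U = {u_stat}, p = {p_value:.4f}")
```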
Affiliation(s)
- Neel P Mistry: College of Medicine, University of Saskatchewan, Saskatoon, Saskatchewan, Canada; Department of Medical Imaging, Royal University Hospital, Saskatoon, Saskatchewan, Canada.
- Huzaifa Saeed: College of Medicine, University of Saskatchewan, Saskatoon, Saskatchewan, Canada.
- Sidra Rafique: Department of Computer Science, University of Saskatchewan, Saskatoon, Saskatchewan, Canada.
- Thuy Le: Department of Community Health and Epidemiology, University of Saskatchewan, Saskatoon, Saskatchewan, Canada.
- Haron Obaid: College of Medicine, University of Saskatchewan, Saskatoon, Saskatchewan, Canada; Department of Medical Imaging, Royal University Hospital, Saskatoon, Saskatchewan, Canada.
- Scott J Adams: College of Medicine, University of Saskatchewan, Saskatoon, Saskatchewan, Canada; Department of Medical Imaging, Royal University Hospital, Saskatoon, Saskatchewan, Canada.
5
Kıyak YS, Emekli E. ChatGPT prompts for generating multiple-choice questions in medical education and evidence on their validity: a literature review. Postgrad Med J 2024:qgae065. PMID: 38840505. DOI: 10.1093/postmj/qgae065.
Abstract
ChatGPT's role in creating multiple-choice questions (MCQs) is growing, but the validity of these artificial-intelligence-generated questions is unclear. This literature review was conducted to address the urgent need for understanding the application of ChatGPT in generating MCQs for medical education. Following the database search and the screening of 1920 studies, we found 23 relevant studies. We extracted the prompts used for MCQ generation and assessed the validity evidence of the MCQs. The findings showed that prompts varied, including referencing specific exam styles and adopting specific personas, which aligns with recommended prompt-engineering tactics. The validity evidence covered various domains, showing mixed accuracy rates, with some studies indicating quality comparable to human-written questions and others highlighting differences in difficulty and discrimination levels, alongside a significant reduction in question-creation time. Despite this efficiency, we highlight the necessity of careful review and suggest a need for further research to optimize the use of ChatGPT in question generation. Main messages: (1) Ensure high-quality outputs by using well-designed prompts; medical educators should prioritize detailed, clear ChatGPT prompts when generating MCQs. (2) Avoid using ChatGPT-generated MCQs directly in examinations without thorough review, to prevent inaccuracies and ensure relevance. (3) Leverage ChatGPT's potential to streamline the test-development process, enhancing efficiency without compromising quality.
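The two prompt tactics highlighted above (referencing a specific exam style and adopting a persona) can be combined in a single request. A minimal sketch follows, assuming the openai Python client; the model name and prompt wording are illustrative placeholders, not taken from any reviewed study:

```python
# Hedged sketch: an exam-style, persona-based prompt for MCQ generation.
# Model name and prompt wording are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = (
    "You are an experienced medical educator writing USMLE Step 1-style items. "
    "Write one single-best-answer MCQ on beta-blocker pharmacology with five "
    "options (A-E), indicate the correct answer, and give a short rationale."
)

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; substitute the model under evaluation
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```

As the review stresses, any item produced this way still requires expert review before examination use.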
Affiliation(s)
- Yavuz Selim Kıyak: Department of Medical Education and Informatics, Faculty of Medicine, Gazi University, Ankara 06500, Turkey.
- Emre Emekli: Department of Radiology, Faculty of Medicine, Eskişehir Osmangazi University, Eskişehir 26040, Turkey.
6
Bharatha A, Ojeh N, Fazle Rabbi AM, Campbell MH, Krishnamurthy K, Layne-Yarde RNA, Kumar A, Springer DCR, Connell KL, Majumder MAA. Comparing the Performance of ChatGPT-4 and Medical Students on MCQs at Varied Levels of Bloom's Taxonomy. Adv Med Educ Pract 2024;15:393-400. PMID: 38751805. PMCID: PMC11094742. DOI: 10.2147/amep.s457408.
Abstract
Introduction This research investigated the capabilities of ChatGPT-4 compared to medical students in answering MCQs, using the revised Bloom's Taxonomy as a benchmark. Methods A cross-sectional study was conducted at The University of the West Indies, Barbados. ChatGPT-4 and medical students were assessed on MCQs from various medical courses using computer-based testing. Results The study included 304 MCQs. Students demonstrated good knowledge, with 78% correctly answering at least 90% of the questions. However, ChatGPT-4 achieved a higher overall score (73.7%) than the students (66.7%). Course type significantly affected ChatGPT-4's performance, but revised Bloom's Taxonomy level did not. A test of the association between program level and Bloom's Taxonomy level for ChatGPT-4's correct answers was highly significant (p < 0.001), reflecting a concentration of "remember"-level questions in preclinical courses and "evaluate"-level questions in clinical courses. Discussion The study highlights ChatGPT-4's proficiency on standardized tests but indicates limitations in clinical reasoning and practical skills. This performance discrepancy suggests that the effectiveness of artificial intelligence (AI) varies with course content. Conclusion While ChatGPT-4 shows promise as an educational tool, its role should be supplementary, with strategic integration into medical education to leverage its strengths and address its limitations. Further research is needed to explore AI's impact on medical education and student performance across educational levels and courses.
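The "highly significant" association above is the kind of result a chi-square test of independence on a contingency table yields. A minimal sketch, assuming invented counts of ChatGPT-4's correct answers by program level and Bloom's level (not the paper's actual table):

```python
# Hedged sketch: chi-square test of independence between program level and
# Bloom's taxonomy level, on hypothetical counts of correct ChatGPT-4 answers.
from scipy.stats import chi2_contingency

#         remember  understand  apply  evaluate
table = [
    [40, 25, 10, 5],   # preclinical courses (hypothetical)
    [10, 20, 25, 35],  # clinical courses (hypothetical)
]

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4g}")
```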
Affiliation(s)
- Ambadasu Bharatha: Faculty of Medical Sciences, The University of the West Indies, Bridgetown, Barbados.
- Nkemcho Ojeh: Faculty of Medical Sciences, The University of the West Indies, Bridgetown, Barbados.
- Michael H Campbell: Faculty of Medical Sciences, The University of the West Indies, Bridgetown, Barbados.
- Alok Kumar: Faculty of Medical Sciences, The University of the West Indies, Bridgetown, Barbados.
- Dale C R Springer: Faculty of Medical Sciences, The University of the West Indies, Bridgetown, Barbados.
- Kenneth L Connell: Faculty of Medical Sciences, The University of the West Indies, Bridgetown, Barbados.
7
Stadler M, Horrer A, Fischer MR. Crafting medical MCQs with generative AI: A how-to guide on leveraging ChatGPT. GMS J Med Educ 2024;41:Doc20. PMID: 38779693. PMCID: PMC11106576. DOI: 10.3205/zma001675.
Abstract
As medical educators grapple with the consistent demand for high-quality assessments, the integration of artificial intelligence presents a novel solution. This how-to article delves into the mechanics of employing ChatGPT to generate multiple-choice questions (MCQs) within the medical curriculum. Focusing on the intricacies of prompt engineering, we elucidate the steps and considerations imperative for achieving targeted, high-fidelity results. The article presents varying outcomes based on different prompt structures, highlighting the AI's adaptability in producing questions of distinct complexities. While emphasizing the transformative potential of ChatGPT, we also spotlight challenges, including the AI's occasional "hallucinations", underscoring the importance of rigorous review. This guide aims to furnish educators with the know-how to integrate AI into their assessment-creation process, heralding a new era in medical education tools.
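To illustrate the article's point that prompt structure drives output complexity, here is a sketch contrasting a terse prompt with a heavily constrained one; both prompt texts are invented for illustration and are not taken from the article:

```python
# Hedged sketch: two prompt structures of different specificity for the same
# MCQ-writing task. Both prompt texts are illustrative assumptions.

basic_prompt = "Write a multiple-choice question about myocardial infarction."

engineered_prompt = (
    "Act as an examiner for a final-year undergraduate medicine course. "
    "Write one clinical-vignette MCQ on the early management of ST-elevation "
    "myocardial infarction at the 'apply' level of Bloom's taxonomy, with five "
    "homogeneous, plausible options, the correct answer keyed, and a brief "
    "rationale. Do not invent references."
)

# The engineered prompt pins down audience, topic, cognitive level, option
# count, distractor quality, and output format; even so, the output still
# requires expert review for hallucinated content.
```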
Affiliation(s)
- Matthias Stadler: LMU University Hospital, LMU Munich, Institute for Medical Education, Munich, Germany.
- Anna Horrer: LMU University Hospital, LMU Munich, Institute for Medical Education, Munich, Germany.
- Martin R. Fischer: LMU University Hospital, LMU Munich, Institute for Medical Education, Munich, Germany.
8
Gordon M, Daniel M, Ajiboye A, Uraiby H, Xu NY, Bartlett R, Hanson J, Haas M, Spadafore M, Grafton-Clarke C, Gasiea RY, Michie C, Corral J, Kwan B, Dolmans D, Thammasitboon S. A scoping review of artificial intelligence in medical education: BEME Guide No. 84. Med Teach 2024;46:446-470. PMID: 38423127. DOI: 10.1080/0142159x.2024.2314198.
Abstract
BACKGROUND Artificial intelligence (AI) is rapidly transforming healthcare, and there is a critical need for a nuanced understanding of how AI is reshaping teaching, learning, and educational practice in medical education. This review aimed to map the literature on AI applications in medical education, identify core areas of findings and potential candidates for formal systematic review, and expose gaps for future research. METHODS This rapid scoping review, conducted over 16 weeks, employed Arksey and O'Malley's framework and adhered to the STORIES and BEME guidelines. A systematic and comprehensive search across PubMed/MEDLINE, EMBASE, and MedEdPublish was conducted without date or language restrictions. Publications included in the review spanned undergraduate, graduate, and continuing medical education, encompassing both original studies and perspective pieces. Data were charted by multiple author pairs and synthesized into thematic maps and charts, ensuring a broad and detailed representation of the current landscape. RESULTS The review synthesized 278 publications, the majority (68%) from North America and Europe. The studies covered diverse AI applications in medical education, such as AI for admissions, teaching, assessment, and clinical reasoning. The review highlighted AI's varied roles, from augmenting traditional educational methods to introducing innovative practices, and underscores the urgent need for ethical guidelines on AI's application in medical education. CONCLUSION The current literature has been charted. The findings underscore the need for ongoing research to explore uncharted areas and address potential risks associated with AI use in medical education. This work serves as a foundational resource for educators, policymakers, and researchers navigating AI's evolving role in medical education. A framework to support high-utility reporting in future work, the FACETS framework, is proposed.
Affiliation(s)
- Morris Gordon: School of Medicine and Dentistry, University of Central Lancashire, Preston, UK; Blackpool Hospitals NHS Foundation Trust, Blackpool, UK.
- Michelle Daniel: School of Medicine, University of California, San Diego, San Diego, CA, USA.
- Aderonke Ajiboye: School of Medicine and Dentistry, University of Central Lancashire, Preston, UK.
- Hussein Uraiby: Department of Cellular Pathology, University Hospitals of Leicester NHS Trust, Leicester, UK.
- Nicole Y Xu: School of Medicine, University of California, San Diego, San Diego, CA, USA.
- Rangana Bartlett: Department of Cognitive Science, University of California, San Diego, CA, USA.
- Janice Hanson: Department of Medicine and Office of Education, School of Medicine, Washington University in Saint Louis, Saint Louis, MO, USA.
- Mary Haas: Department of Emergency Medicine, University of Michigan Medical School, Ann Arbor, MI, USA.
- Maxwell Spadafore: Department of Emergency Medicine, University of Michigan Medical School, Ann Arbor, MI, USA.
- Colin Michie: School of Medicine and Dentistry, University of Central Lancashire, Preston, UK.
- Janet Corral: Department of Medicine, University of Nevada Reno, School of Medicine, Reno, NV, USA.
- Brian Kwan: School of Medicine, University of California, San Diego, San Diego, CA, USA.
- Diana Dolmans: School of Health Professions Education, Faculty of Health, Maastricht University, Maastricht, the Netherlands.
- Satid Thammasitboon: Center for Research, Innovation and Scholarship in Health Professions Education, Baylor College of Medicine, Houston, TX, USA.
9
Artsi Y, Sorin V, Konen E, Glicksberg BS, Nadkarni G, Klang E. Large language models for generating medical examinations: systematic review. BMC Med Educ 2024;24:354. PMID: 38553693. PMCID: PMC10981304. DOI: 10.1186/s12909-024-05239-y.
Abstract
BACKGROUND Writing multiple-choice questions (MCQs) for medical exams is challenging. It requires extensive medical knowledge, time, and effort from medical educators. This systematic review focuses on the application of large language models (LLMs) in generating medical MCQs. METHODS The authors searched for studies published up to November 2023. Search terms focused on LLM-generated MCQs for medical examinations. Non-English studies, studies outside the year range, and studies not focusing on AI-generated multiple-choice questions were excluded. MEDLINE was used as the search database. Risk of bias was evaluated using a tailored QUADAS-2 tool. The review followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. RESULTS Overall, eight studies published between April 2023 and October 2023 were included. Six studies used ChatGPT-3.5, while two employed GPT-4. Five studies showed that LLMs can produce competent questions valid for medical exams. Three studies used LLMs to write medical questions but did not evaluate the validity of the questions. One study conducted a comparative analysis of different models. One other study compared LLM-generated questions with those written by humans. All studies presented faulty questions that were deemed inappropriate for medical exams, and some questions required additional modification to qualify. Two studies were at high risk of bias. CONCLUSIONS LLMs can be used to write MCQs for medical examinations, but their limitations cannot be ignored. Further study in this field is essential, and more conclusive evidence is needed. Until then, LLMs may serve as a supplementary tool for writing medical examinations.
Affiliation(s)
- Yaara Artsi: Azrieli Faculty of Medicine, Bar-Ilan University, Ha'Hadas St. 1, Rishon Le Zion, Zefat, 7550598, Israel.
- Vera Sorin: Department of Diagnostic Imaging, Chaim Sheba Medical Center, Ramat Gan, Israel; Tel-Aviv University School of Medicine, Tel Aviv, Israel; DeepVision Lab, Chaim Sheba Medical Center, Ramat Gan, Israel.
- Eli Konen: Department of Diagnostic Imaging, Chaim Sheba Medical Center, Ramat Gan, Israel; Tel-Aviv University School of Medicine, Tel Aviv, Israel.
- Benjamin S Glicksberg: Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, USA.
- Girish Nadkarni: Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, USA; The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA.
- Eyal Klang: Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, USA; The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA.
10
Sahin MC, Sozer A, Kuzucu P, Turkmen T, Sahin MB, Sozer E, Tufek OY, Nernekli K, Emmez H, Celtikci E. Beyond human in neurosurgical exams: ChatGPT's success in the Turkish neurosurgical society proficiency board exams. Comput Biol Med 2024;169:107807. PMID: 38091727. DOI: 10.1016/j.compbiomed.2023.107807.
Abstract
Chat Generative Pre-Trained Transformer (ChatGPT) is a sophisticated natural language model that employs advanced deep learning techniques and is trained on extensive datasets to produce human-like conversational responses to user inputs. In this study, ChatGPT's success on the Turkish Neurosurgical Society Proficiency Board Exams (TNSPBE) is compared with that of the actual candidates who took the exams, along with an analysis of the types of questions it answered incorrectly, the quality of its responses, and its performance by question difficulty. For the ranking purposes of this study, the scores of all 260 candidates were recalculated according to the exams they took and the questions those exams included. The candidates' average score across a total of 523 questions was 62.02 ± 0.61, compared with 78.77 for ChatGPT. We conclude that, in addition to ChatGPT's higher rate of correct responses, its accuracy increased with question clarity (clarity levels 1.5, 2.0, 2.5, and 3.0) regardless of question difficulty, whereas the candidates showed no such increase with increasing clarity.
Affiliation(s)
- Mustafa Caglar Sahin: Gazi University Faculty of Medicine, Department of Neurosurgery, Ankara, Turkey.
- Alperen Sozer: Gazi University Faculty of Medicine, Department of Neurosurgery, Ankara, Turkey.
- Pelin Kuzucu: Gazi University Faculty of Medicine, Department of Neurosurgery, Ankara, Turkey.
- Tolga Turkmen: Ministry of Health Dortyol State Hospital, Department of Neurosurgery, Hatay, Turkey.
- Merve Buke Sahin: Ministry of Health Etimesgut District Health Directorate, Department of Public Health, Ankara, Turkey.
- Ekin Sozer: Gazi University, Directorate of Health Culture and Sports, Ankara, Turkey.
- Ozan Yavuz Tufek: Gazi University Faculty of Medicine, Department of Neurosurgery, Ankara, Turkey.
- Kerem Nernekli: Stanford University Medical School, Department of Radiology, Stanford, CA, USA.
- Hakan Emmez: Gazi University Faculty of Medicine, Department of Neurosurgery, Ankara, Turkey.
- Emrah Celtikci: Gazi University Faculty of Medicine, Department of Neurosurgery, Ankara, Turkey; Gazi University Artificial Intelligence Center, Ankara, Turkey.
11
Sallam M, Al-Salahat K. Below average ChatGPT performance in medical microbiology exam compared to university students. Front Educ 2023;8:1333415. DOI: 10.3389/feduc.2023.1333415.
Abstract
BACKGROUND The transformative potential of artificial intelligence (AI) in higher education is evident, with conversational models like ChatGPT poised to reshape teaching and assessment methods. The rapid evolution of AI models requires continuous evaluation. AI-based models can offer personalized learning experiences but raise accuracy concerns. Multiple-choice questions (MCQs) are widely used for competency assessment. The aim of this study was to evaluate ChatGPT performance on medical microbiology MCQs compared with students' performance. METHODS The study employed an 80-MCQ dataset from a 2021 medical microbiology exam in the University of Jordan Doctor of Dental Surgery (DDS) Medical Microbiology 2 course. The exam comprised 40 midterm and 40 final MCQs, authored by a single instructor without copyright issues. The MCQs were categorized based on the revised Bloom's Taxonomy into four categories: Remember, Understand, Analyze, or Evaluate. Metrics, including the facility index and discriminative efficiency, were derived from the performance of 153 DDS students on the midterm exam and 154 on the final exam. ChatGPT 3.5 was used to answer the questions, and responses were assessed for correctness and clarity by two independent raters. RESULTS ChatGPT 3.5 correctly answered 64 of the 80 medical microbiology MCQs (80%) but scored below the student average (80.5/100 vs. 86.21/100). Incorrect ChatGPT responses were more common in MCQs with longer choices (p = 0.025). ChatGPT 3.5 performance varied across cognitive domains: Remember (88.5% correct), Understand (82.4% correct), Analyze (75% correct), and Evaluate (72% correct), with no statistically significant differences (p = 0.492). Correct ChatGPT responses received significantly higher average clarity and correctness scores than incorrect responses. CONCLUSION The study findings emphasize the need for ongoing refinement and evaluation of ChatGPT performance. ChatGPT 3.5 showed the potential to answer medical microbiology MCQs correctly and clearly; nevertheless, its performance was below par compared with that of the students. Variability in ChatGPT performance across cognitive domains should be considered in future studies. These insights could contribute to the ongoing evaluation of AI-based models' role in educational assessment and help augment traditional methods in higher education.
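The item metrics named in the methods (facility index, discriminative efficiency) come from classical test theory. A minimal sketch, assuming the facility index is the proportion of students answering an item correctly and using a simple upper-minus-lower discrimination index; the study's exact formulas may differ, and the data below are invented:

```python
# Hedged sketch: facility index and a simple discrimination index for one MCQ.
# item_correct: 1 = correct, 0 = incorrect, one entry per student, aligned
# with total_scores (each student's total exam score). Data are hypothetical.
def item_metrics(item_correct, total_scores, tail=0.27):
    n = len(item_correct)
    facility = sum(item_correct) / n  # proportion answering the item correctly

    # Rank students by total score, then compare the top and bottom tails.
    ranked = [c for _, c in sorted(zip(total_scores, item_correct))]
    k = max(1, int(n * tail))
    lower, upper = ranked[:k], ranked[-k:]
    discrimination = (sum(upper) - sum(lower)) / k
    return facility, discrimination

item_correct = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]            # hypothetical responses
total_scores = [55, 40, 70, 80, 35, 75, 60, 85, 45, 65]  # hypothetical totals
print(item_metrics(item_correct, total_scores))
```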
12
Ignjatović A, Stevanović L. Efficacy and limitations of ChatGPT as a biostatistical problem-solving tool in medical education in Serbia: a descriptive study. J Educ Eval Health Prof 2023;20:28. PMID: 37840252. PMCID: PMC10646144. DOI: 10.3352/jeehp.2023.20.28.
Abstract
PURPOSE This study aimed to assess the performance of ChatGPT (GPT-3.5 and GPT-4) as a study tool for solving biostatistical problems and to identify potential drawbacks of using ChatGPT in medical education, particularly for solving practical biostatistical problems. METHODS In this descriptive study, ChatGPT was tested on its ability to solve biostatistical problems from the Handbook of Medical Statistics by Peacock and Peacock. Tables from the problems were transformed into textual questions. Ten biostatistical problems were randomly chosen and used as text-based input for conversation with ChatGPT (versions 3.5 and 4). RESULTS GPT-3.5 solved 5 practical problems on the first attempt, related to categorical data, a cross-sectional study, measuring reliability, probability properties, and the t-test. GPT-3.5 failed to provide correct answers regarding analysis of variance, the chi-square test, and sample size within 3 attempts. GPT-4 also solved a task related to the confidence interval on the first attempt and, with precise guidance and monitoring, solved all questions within 3 attempts. CONCLUSION The assessment of both versions of ChatGPT on 10 biostatistical problems revealed below-average performance, with correct first-attempt response rates of 5 and 6 out of 10 for GPT-3.5 and GPT-4, respectively. GPT-4 succeeded in providing all correct answers within 3 attempts. These findings indicate that this tool can be wrong even when it performs and reports statistical calculations; students should be aware of ChatGPT's limitations and be careful when incorporating this model into medical education.
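Given the finding that ChatGPT can miscalculate, one practical safeguard is to re-run its statistical answers in software. A minimal sketch, assuming scipy and invented two-group data for the kind of t-test problem the abstract mentions:

```python
# Hedged sketch: independently verifying a ChatGPT-derived t-test result.
# The group values are hypothetical placeholders, not data from the handbook.
from scipy.stats import ttest_ind

group_a = [5.1, 4.8, 5.6, 5.0, 4.7, 5.3]
group_b = [5.9, 6.1, 5.7, 6.3, 5.8, 6.0]

t_stat, p_value = ttest_ind(group_a, group_b)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
# If ChatGPT's reported t or p differs materially, trust the software output.
```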
Affiliation(s)
- Aleksandra Ignjatović: Department of Medical Statistics and Informatics, Faculty of Medicine, University of Niš, Niš, Serbia.