1. AlSaad R, Abd-Alrazaq A, Boughorbel S, Ahmed A, Renault MA, Damseh R, Sheikh J. Multimodal Large Language Models in Health Care: Applications, Challenges, and Future Outlook. J Med Internet Res 2024;26:e59505. [PMID: 39321458] [PMCID: PMC11464944] [DOI: 10.2196/59505]
Abstract
In the complex and multidimensional field of medicine, multimodal data are prevalent and crucial for informed clinical decisions. Multimodal data span a broad spectrum of data types, including medical images (eg, MRI and CT scans), time-series data (eg, sensor data from wearable devices and electronic health records), audio recordings (eg, heart and respiratory sounds and patient interviews), text (eg, clinical notes and research articles), videos (eg, surgical procedures), and omics data (eg, genomics and proteomics). While advancements in large language models (LLMs) have enabled new applications for knowledge retrieval and processing in the medical field, most LLMs remain limited to processing unimodal data, typically text-based content, and often overlook the importance of integrating the diverse data modalities encountered in clinical practice. This paper aims to present a detailed, practical, and solution-oriented perspective on the use of multimodal LLMs (M-LLMs) in the medical field. Our investigation spanned M-LLM foundational principles, current and potential applications, technical and ethical challenges, and future research directions. By connecting these elements, we aimed to provide a comprehensive framework that links diverse aspects of M-LLMs, offering a unified vision for their future in health care. This approach aims to guide both future research and practical implementations of M-LLMs in health care, positioning them as a paradigm shift toward integrated, multimodal data-driven medical practice. We anticipate that this work will spark further discussion and inspire the development of innovative approaches in the next generation of medical M-LLM systems.
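As a concrete point of reference for the multimodal prompting surveyed above: current vision-capable chat APIs already accept mixed image-and-text input. A minimal sketch using the OpenAI Python SDK follows; the model choice, prompt, and image URL are illustrative assumptions, not taken from the paper.

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

# One user turn mixing text and an image -- prompt and URL are placeholders.
response = client.chat.completions.create(
    model="gpt-4o",  # an illustrative vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Summarize the key radiological finding in this image."},
            {"type": "image_url",
             "image_url": {"url": "https://example.org/chest-xray.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```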
Affiliation(s)
- Rawan AlSaad
- Weill Cornell Medicine-Qatar, Education City, Doha, Qatar
- Sabri Boughorbel
- Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar
- Arfan Ahmed
- Weill Cornell Medicine-Qatar, Education City, Doha, Qatar
- Rafat Damseh
- Department of Computer Science and Software Engineering, United Arab Emirates University, Al Ain, United Arab Emirates
- Javaid Sheikh
- Weill Cornell Medicine-Qatar, Education City, Doha, Qatar
2. Jin HK, Lee HE, Kim E. Performance of ChatGPT-3.5 and GPT-4 in national licensing examinations for medicine, pharmacy, dentistry, and nursing: a systematic review and meta-analysis. BMC Med Educ 2024;24:1013. [PMID: 39285377] [PMCID: PMC11406751] [DOI: 10.1186/s12909-024-05944-8]
Abstract
BACKGROUND ChatGPT, a recently developed artificial intelligence (AI) chatbot, has demonstrated improved performance in examinations in the medical field. However, thus far, an overall evaluation of the potential of ChatGPT models (ChatGPT-3.5 and GPT-4) in a variety of national health licensing examinations is lacking. This study aimed to provide a comprehensive assessment of the ChatGPT models' performance in national licensing examinations in medicine, pharmacy, dentistry, and nursing through a meta-analysis. METHODS Following the PRISMA protocol, full-text articles from MEDLINE/PubMed, EMBASE, ERIC, Cochrane Library, Web of Science, and key journals were reviewed from the time of ChatGPT's introduction to February 27, 2024. Studies were eligible if they evaluated the performance of a ChatGPT model (ChatGPT-3.5 or GPT-4); related to national licensing examinations in the fields of medicine, pharmacy, dentistry, or nursing; involved multiple-choice questions; and provided data that enabled the calculation of effect size. Two reviewers independently completed data extraction, coding, and quality assessment. The JBI Critical Appraisal Tools were used to assess the quality of the selected articles. Overall effect size and 95% confidence intervals (CIs) were calculated using a random-effects model. RESULTS A total of 23 studies were considered for this review, which evaluated the accuracy of four types of national licensing examinations. The selected articles were in the fields of medicine (n = 17), pharmacy (n = 3), nursing (n = 2), and dentistry (n = 1). They reported varying accuracy levels, ranging from 36% to 77% for ChatGPT-3.5 and from 64.4% to 100% for GPT-4. The overall effect size for the percentage of accuracy was 70.1% (95% CI 65%-74.8%), which was statistically significant (p < 0.001). Subgroup analyses revealed that GPT-4 demonstrated significantly higher accuracy in providing correct responses than its earlier version, ChatGPT-3.5. Additionally, in the context of health licensing examinations, the ChatGPT models exhibited greater proficiency in the following order: pharmacy, medicine, dentistry, and nursing. However, the lack of a broader set of questions, including open-ended and scenario-based questions, and significant heterogeneity were limitations of this meta-analysis. CONCLUSIONS This study sheds light on the accuracy of ChatGPT models in four national health licensing examinations across various countries and provides a practical basis and theoretical support for future research. Further studies are needed to explore their utilization in medical and health education by including a broader and more diverse range of questions, along with more advanced versions of AI chatbots.
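The pooled accuracy and its CI reported above come from a random-effects model. As a minimal sketch of how such a pooled proportion can be computed, the following implements DerSimonian-Laird pooling of logit-transformed accuracies; the per-study correct/total counts are hypothetical placeholders, not data from this review.

```python
import numpy as np

# Hypothetical (correct, total) counts per study -- not the review's data.
studies = [(140, 200), (90, 180), (260, 350), (75, 120)]

k, n = np.array(studies, dtype=float).T
p = k / n
y = np.log(p / (1 - p))   # logit-transformed accuracy per study
v = 1 / k + 1 / (n - k)   # approximate variance of the logit

w = 1 / v                 # fixed-effect weights
y_fixed = np.sum(w * y) / np.sum(w)
Q = np.sum(w * (y - y_fixed) ** 2)        # Cochran's heterogeneity statistic
c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
tau2 = max(0.0, (Q - (len(y) - 1)) / c)   # between-study variance (DL estimator)

w_re = 1 / (v + tau2)     # random-effects weights
y_re = np.sum(w_re * y) / np.sum(w_re)
se = np.sqrt(1 / np.sum(w_re))
lo, hi = y_re - 1.96 * se, y_re + 1.96 * se

def expit(x):
    # back-transform a logit to a proportion
    return 1.0 / (1.0 + np.exp(-x))

print(f"pooled accuracy: {expit(y_re):.3f} "
      f"(95% CI {expit(lo):.3f}-{expit(hi):.3f})")
```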
Affiliation(s)
- Hye Kyung Jin
- Research Institute of Pharmaceutical Sciences, College of Pharmacy, Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul, 06974, South Korea
- Data Science, Evidence-Based and Clinical Research Laboratory, Department of Health, Social, and Clinical Pharmacy, College of Pharmacy, Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul, 06974, South Korea
- Ha Eun Lee
- Data Science, Evidence-Based and Clinical Research Laboratory, Department of Health, Social, and Clinical Pharmacy, College of Pharmacy, Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul, 06974, South Korea
- EunYoung Kim
- Research Institute of Pharmaceutical Sciences, College of Pharmacy, Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul, 06974, South Korea
- Data Science, Evidence-Based and Clinical Research Laboratory, Department of Health, Social, and Clinical Pharmacy, College of Pharmacy, Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul, 06974, South Korea
- Division of Licensing of Medicines and Regulatory Science, The Graduate School of Pharmaceutical Management, and Regulatory Science Policy, The Graduate School of Pharmaceutical Regulatory Sciences, Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul, 06974, South Korea
3. Ishida K, Arisaka N, Fujii K. Analysis of Responses of GPT-4V to the Japanese National Clinical Engineer Licensing Examination. J Med Syst 2024;48:83. [PMID: 39259341] [DOI: 10.1007/s10916-024-02103-w]
Abstract
Chat Generative Pretrained Transformer (ChatGPT; OpenAI) is a state-of-the-art large language model that can simulate human-like conversations based on user input. We evaluated the performance of GPT-4V on the Japanese National Clinical Engineer Licensing Examination using 2,155 questions from 2012 to 2023. The average correct answer rate for all questions was 86.0%. In particular, clinical medicine, basic medicine, medical materials, biological properties, and mechanical engineering achieved correct response rates of ≥90%. Conversely, medical device safety management, electrical and electronic engineering, and extracorporeal circulation obtained low correct answer rates, ranging from 64.8% to 76.5%. The correct answer rates for questions that included figures/tables, required numerical calculation, combined figures/tables with calculation, and required knowledge of Japanese Industrial Standards were 55.2%, 85.8%, 64.2%, and 31.0%, respectively. These low rates reflect ChatGPT's limited image recognition and its lack of knowledge of the relevant standards and laws. This study concludes that careful attention is required when using ChatGPT because several of its explanations are inaccurate.
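Category-level accuracy breakdowns like those above reduce to grouped tallies over a per-question grading log. A minimal sketch with pandas on a hypothetical log; the categories, attributes, and correctness flags are illustrative, not the study's data.

```python
import pandas as pd

# Hypothetical grading log: one row per exam question, not the study's data.
df = pd.DataFrame({
    "category":   ["clinical medicine", "clinical medicine",
                   "safety management", "safety management",
                   "electrical engineering"],
    "has_figure": [False, True, False, True, False],
    "correct":    [True, False, True, False, True],
})

# Mean of the boolean "correct" column gives the correct-answer rate per group.
print(df.groupby("category")["correct"].mean())
print(df.groupby("has_figure")["correct"].mean())
```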
Affiliation(s)
- Kai Ishida
- Department of Materials and Human Environmental Sciences, Faculty of Engineering, Shonan Institute of Technology, Fujisawa, Japan
- Naoya Arisaka
- Department of Medical Informatics, School of Allied Health Science, Kitasato University, Sagamihara, Japan
- Kiyotaka Fujii
- Department of Clinical Engineering, School of Allied Health Science, Kitasato University, Sagamihara, Japan
4. Arfaie S, Sadegh Mashayekhi M, Mofatteh M, Ma C, Ruan R, MacLean MA, Far R, Saini J, Harmsen IE, Duda T, Gomez A, Rebchuk AD, Pingbei Wang A, Rasiah N, Guo E, Fazlollahi AM, Rose Swan E, Amin P, Mohammed S, Atkinson JD, Del Maestro RF, Girgis F, Kumar A, Das S. ChatGPT and neurosurgical education: A crossroads of innovation and opportunity. J Clin Neurosci 2024;129:110815. [PMID: 39236407] [DOI: 10.1016/j.jocn.2024.110815]
Abstract
Large language models (LLMs) have recently shown promise in the medical field, with numerous applications in clinical neuroscience. OpenAI's launch of Generative Pre-trained Transformer 3.5 (GPT-3.5) in November 2022 and its successor, Generative Pre-trained Transformer 4 (GPT-4), in March 2023 garnered widespread attention and debate surrounding natural language processing (NLP) and LLM advancements. Transformer models are trained on natural language datasets to predict and generate sequences of characters. Using internal weights learned during training, they produce tokens that align with their understanding of the initial input. This paper delves into ChatGPT's potential as a learning tool in neurosurgery while contextualizing its abilities for passing medical licensing exams and neurosurgery written boards. Additionally, possibilities for creating personalized case presentations and study material are discussed, alongside ChatGPT's capacity to optimize the research workflow and perform a concise literature review. However, such tools need to be used with caution, given the possibility of artificial intelligence hallucinations and other concerns such as user overreliance and complacency. Overall, this opinion paper raises key points surrounding ChatGPT's role in neurosurgical education.
Affiliation(s)
- Saman Arfaie
- Division of Neurosurgery, Department of Clinical Neurological Sciences, Schulich School of Medicine and Dentistry, University of Western Ontario, London, ON, Canada; Department of Neurosurgery and Neurology, McGill University Faculty of Medicine, Montreal, QC, Canada
- Mohammad Mofatteh
- School of Medicine, Dentistry and Biomedical Sciences, Queen's University Belfast, Belfast, UK
- Crystal Ma
- Faculty of Medicine, University of British Columbia, Vancouver, BC, Canada
- Richard Ruan
- Eli and Edythe L. Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Mark A MacLean
- Division of Neurosurgery, Dalhousie University, Halifax, NS, Canada
- Rena Far
- Division of Neurosurgery, Department of Clinical Neurosciences, University of Calgary, AB, Canada
- Jasleen Saini
- Department of Neurosurgery, University of Toronto Faculty of Medicine, Toronto, ON, Canada
- Irene E Harmsen
- Division of Neurosurgery, University of Alberta Faculty of Medicine, Edmonton, AB, Canada
- Taylor Duda
- Division of Neurosurgery, McMaster University, Hamilton, ON, Canada
- Alwyn Gomez
- Section of Neurosurgery, Department of Surgery, Rady Faculty of Health Sciences, University of Manitoba, Winnipeg, MB, Canada
- Alexander D Rebchuk
- Division of Neurosurgery, University of British Columbia, Vancouver, BC, Canada
- Alick Pingbei Wang
- University of Ottawa Faculty of Medicine, Division of Neurosurgery, Ottawa, ON, Canada
- Neilen Rasiah
- Section of Neurosurgery, Department of Surgery, Rady Faculty of Health Sciences, University of Manitoba, Winnipeg, MB, Canada
- Eddie Guo
- Division of Neurosurgery, Department of Clinical Neurosciences, University of Calgary, AB, Canada
- Ali M Fazlollahi
- Division of Neurosurgery, Department of Clinical Neurological Sciences, Schulich School of Medicine and Dentistry, University of Western Ontario, London, ON, Canada
- Emma Rose Swan
- Faculty of Medicine, University of British Columbia, Vancouver, BC, Canada
- Pouya Amin
- University of California Irvine School of Medicine, California, CA, USA
- Safraz Mohammed
- University of Ottawa Faculty of Medicine, Division of Neurosurgery, Ottawa, ON, Canada
- Jeffrey D Atkinson
- Department of Neurosurgery and Neurology, McGill University Faculty of Medicine, Montreal, QC, Canada
- Rolando F Del Maestro
- Department of Neurosurgery and Neurology, McGill University Faculty of Medicine, Montreal, QC, Canada
- Fady Girgis
- Division of Neurosurgery, Department of Clinical Neurosciences, University of Calgary, AB, Canada
- Ashish Kumar
- Division of Neurosurgery, Department of Surgery, Sunnybrook Health Sciences Centre, University of Toronto, Toronto, ON, Canada
- Sunit Das
- Division of Neurosurgery, Department of Surgery, St. Michael's Hospital, University of Toronto, Toronto, ON, Canada
5. Moulaei K, Yadegari A, Baharestani M, Farzanbakhsh S, Sabet B, Reza Afrash M. Generative artificial intelligence in healthcare: A scoping review on benefits, challenges and applications. Int J Med Inform 2024;188:105474. [PMID: 38733640] [DOI: 10.1016/j.ijmedinf.2024.105474]
Abstract
BACKGROUND Generative artificial intelligence (GAI) is revolutionizing healthcare with solutions for complex challenges, enhancing diagnosis, treatment, and care through new data and insights. However, its integration raises questions about applications, benefits, and challenges. Our study explores these aspects, offering an overview of GAI's applications and future prospects in healthcare. METHODS This scoping review searched Web of Science, PubMed, and Scopus. The selection of studies involved screening titles, reviewing abstracts, and examining full texts, adhering to the PRISMA-ScR guidelines throughout the process. RESULTS From 1406 articles across the three databases, 109 met the inclusion criteria after screening and deduplication. Nine GAI models were utilized in healthcare, with ChatGPT (n = 102, 74%), Google Bard (Gemini) (n = 16, 11%), and Microsoft Bing AI (n = 10, 7%) being the most frequently employed. A total of 24 different applications of GAI in healthcare were identified, with the most common being "offering insights and information on health conditions through answering questions" (n = 41) and "diagnosis and prediction of diseases" (n = 17). In total, 606 benefits and challenges were identified, which were condensed to 48 benefits and 61 challenges after consolidation. The predominant benefits included "Providing rapid access to information and valuable insights" and "Improving prediction and diagnosis accuracy", while the primary challenges comprised "generating inaccurate or fictional content", "unknown source of information and fake references for texts", and "lower accuracy in answering questions". CONCLUSION This scoping review identified the applications, benefits, and challenges of GAI in healthcare. This synthesis offers a crucial overview of GAI's potential to revolutionize healthcare, emphasizing the imperative to address its limitations.
Affiliation(s)
- Khadijeh Moulaei
- Department of Health Information Technology, School of Paramedical, Ilam University of Medical Sciences, Ilam, Iran
- Atiye Yadegari
- Department of Pediatric Dentistry, School of Dentistry, Hamadan University of Medical Sciences, Hamadan, Iran
- Mahdi Baharestani
- Network of Interdisciplinarity in Neonates and Infants (NINI), Universal Scientific Education and Research Network (USERN), Tehran, Iran
- Shayan Farzanbakhsh
- Network of Interdisciplinarity in Neonates and Infants (NINI), Universal Scientific Education and Research Network (USERN), Tehran, Iran
- Babak Sabet
- Department of Surgery, Faculty of Medicine, Shahid Beheshti University of Medical Sciences, Tehran, Iran
- Mohammad Reza Afrash
- Department of Artificial Intelligence, Smart University of Medical Sciences, Tehran, Iran
6. Ishida K, Hanada E. Potential of ChatGPT to Pass the Japanese Medical and Healthcare Professional National Licenses: A Literature Review. Cureus 2024;16:e66324. [PMID: 39247019] [PMCID: PMC11377128] [DOI: 10.7759/cureus.66324]
Abstract
This systematic review aimed to assess the academic potential of ChatGPT (GPT-3.5, GPT-4, and GPT-4V) on Japanese national medical and healthcare licensing examinations, taking into account its strengths and limitations. Electronic databases, including PubMed/Medline, Google Scholar, and ICHUSHI (a Japanese medical article database), were systematically searched for relevant articles, particularly those published between January 1, 2022, and April 30, 2024. A formal narrative analysis was conducted by systematically arranging the similarities and differences between individual research findings. After rigorous screening, we reviewed 22 articles. With one exception, all articles that evaluated GPT-4 found that it could pass each examination consisting of text-only questions. However, some studies also reported that, although GPT-4 could pass, it scored worse than the actual examinees. Moreover, the newest model, GPT-4V, recognized images insufficiently and therefore gave inadequate answers to questions involving images and figures/tables. Its precision needs to be improved to obtain better results.
Affiliation(s)
- Kai Ishida
- Faculty of Engineering, Shonan Institute of Technology, Fujisawa, JPN
- Eisuke Hanada
- Faculty of Science and Engineering, Saga University, Saga, JPN
7. Vij O, Calver H, Myall N, Dey M, Kouranloo K. Evaluating the competency of ChatGPT in MRCP Part 1 and a systematic literature review of its capabilities in postgraduate medical assessments. PLoS One 2024;19:e0307372. [PMID: 39083455] [PMCID: PMC11290618] [DOI: 10.1371/journal.pone.0307372]
Abstract
OBJECTIVES As a large language model (LLM) trained on a large data set, ChatGPT can perform a wide array of tasks without additional training. We evaluated the performance of ChatGPT on postgraduate UK medical examinations through a systematic literature review of ChatGPT's performance in UK postgraduate medical assessments, together with our own evaluation of its performance on the Membership of the Royal College of Physicians (MRCP) Part 1 examination. METHODS Medline, Embase, and Cochrane databases were searched. Articles discussing the performance of ChatGPT in UK postgraduate medical examinations were included in the systematic review. Information was extracted on exam performance, including percentage scores and pass/fail rates. MRCP UK Part 1 sample paper questions were entered into ChatGPT-3.5 and -4 four times each, and the responses were marked against the correct answers provided. RESULTS 12 studies were ultimately included in the systematic literature review. ChatGPT-3.5 scored 66.4% and ChatGPT-4 scored 84.8% on the MRCP Part 1 sample paper, 4.4% and 22.8% above the historical pass mark, respectively. Both ChatGPT-3.5 and -4 performed significantly above the historical pass mark for MRCP Part 1, indicating that they would likely pass this examination. ChatGPT-3.5 failed eight of the nine postgraduate exams it attempted, scoring on average 5.0% below the pass mark. ChatGPT-4 passed nine of the eleven postgraduate exams it attempted, scoring on average 13.56% above the pass mark. ChatGPT-4's performance was significantly better than ChatGPT-3.5's in all examinations on which both models were tested. CONCLUSION ChatGPT-4 performed above passing level for the majority of the UK postgraduate medical examinations it was tested on. ChatGPT is prone to hallucinations, fabrications, and reduced explanation accuracy, which could limit its potential as a learning tool. The potential for these errors is an inherent part of LLMs and may always be a limitation for medical applications of ChatGPT.
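Judging whether a score sits significantly above a historical pass mark is a one-sample proportion test. A minimal sketch with statsmodels; the 170/200 counts and the 62% pass mark are hypothetical placeholders, not this paper's data.

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical: 170 of 200 sample-paper questions correct, tested against an
# assumed historical pass mark of 62%. One-sided: is accuracy above the mark?
stat, p_value = proportions_ztest(count=170, nobs=200, value=0.62,
                                  alternative="larger")
print(f"z = {stat:.2f}, one-sided p = {p_value:.4f}")
```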
Affiliation(s)
- Oliver Vij
- Guy’s Hospital, Guy’s and St Thomas’ NHS Foundation Trust, Great Maze Pond, London, United Kingdom
- Nikki Myall
- British Medical Association Library, BMA House, Tavistock Square, London, United Kingdom
- Mrinalini Dey
- Centre for Rheumatic Diseases, Denmark Hill Campus, King’s College London, London, United Kingdom
- Koushan Kouranloo
- Department of Rheumatology, University Hospital Lewisham, London, United Kingdom
- School of Medicine, Cedar House, University of Liverpool, Liverpool, United Kingdom
8. Liu M, Okuhara T, Chang X, Shirabe R, Nishiie Y, Okada H, Kiuchi T. Performance of ChatGPT Across Different Versions in Medical Licensing Examinations Worldwide: Systematic Review and Meta-Analysis. J Med Internet Res 2024;26:e60807. [PMID: 39052324] [PMCID: PMC11310649] [DOI: 10.2196/60807]
Abstract
BACKGROUND Over the past 2 years, researchers have used various medical licensing examinations to test whether ChatGPT (OpenAI) possesses accurate medical knowledge. The performance of each version of ChatGPT on medical licensing examinations in different settings has varied remarkably. At this stage, there is still a lack of a comprehensive understanding of the variability in ChatGPT's performance on different medical licensing examinations. OBJECTIVE In this study, we reviewed all studies on ChatGPT's performance in medical licensing examinations up to March 2024. This review aims to contribute to the evolving discourse on artificial intelligence (AI) in medical education by providing a comprehensive analysis of ChatGPT's performance in various environments. The insights gained from this systematic review will guide educators, policymakers, and technical experts to use AI in medical education effectively and judiciously. METHODS We searched the literature published between January 1, 2022, and March 29, 2024, using query strings in Web of Science, PubMed, and Scopus. Two authors screened the literature according to the inclusion and exclusion criteria, extracted data, and independently assessed study quality using the Quality Assessment of Diagnostic Accuracy Studies 2 (QUADAS-2) tool. We conducted both qualitative and quantitative analyses. RESULTS A total of 45 studies on the performance of different versions of ChatGPT in medical licensing examinations were included in this study. GPT-4 achieved an overall accuracy rate of 81% (95% CI 78-84; P<.01), significantly surpassing the 58% (95% CI 53-63; P<.01) accuracy rate of GPT-3.5. GPT-4 passed the medical examinations in 26 of 29 cases, outperforming the average scores of medical students in 13 of 17 cases. Translating the examination questions into English improved GPT-3.5's performance but did not affect GPT-4's. GPT-3.5 showed no difference in performance between examinations from English-speaking and non-English-speaking countries (P=.72), but GPT-4 performed significantly better on examinations from English-speaking countries (P=.02). Any type of prompt could significantly improve GPT-3.5's (P=.03) and GPT-4's (P<.01) performance. GPT-3.5 performed better on short-text questions than on long-text questions. The difficulty of the questions affected the performance of both GPT-3.5 and GPT-4. In image-based multiple-choice questions (MCQs), ChatGPT's accuracy rate ranged from 13.1% to 100%. ChatGPT performed significantly worse on open-ended questions than on MCQs. CONCLUSIONS GPT-4 demonstrates considerable potential for future use in medical education. However, due to its insufficient accuracy, inconsistent performance, and the challenges posed by differing medical policies and knowledge across countries, GPT-4 is not yet suitable for use in medical education. TRIAL REGISTRATION PROSPERO CRD42024506687; https://www.crd.york.ac.uk/prospero/display_record.php?RecordID=506687.
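Point estimates with intervals such as "81% (95% CI 78-84)" can be recomputed from raw counts with a binomial interval. A minimal sketch using the Wilson interval from statsmodels; the counts are hypothetical, not the review's data.

```python
from statsmodels.stats.proportion import proportion_confint

# Hypothetical: 810 correct answers out of 1000 pooled questions.
low, high = proportion_confint(count=810, nobs=1000, alpha=0.05, method="wilson")
print(f"accuracy 0.810 (95% CI {low:.3f}-{high:.3f})")
```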
Affiliation(s)
- Mingxin Liu
- Department of Health Communication, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
- Tsuyoshi Okuhara
- Department of Health Communication, School of Public Health, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
- XinYi Chang
- Department of Industrial Engineering and Economics, School of Engineering, Tokyo Institute of Technology, Tokyo, Japan
- Ritsuko Shirabe
- Department of Health Communication, School of Public Health, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
- Yuriko Nishiie
- Department of Health Communication, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
- Hiroko Okada
- Department of Health Communication, School of Public Health, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
- Takahiro Kiuchi
- Department of Health Communication, School of Public Health, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
9. Paul S, Govindaraj S, Jk J. ChatGPT Versus National Eligibility cum Entrance Test for Postgraduate (NEET PG). Cureus 2024;16:e63048. [PMID: 39050297] [PMCID: PMC11268980] [DOI: 10.7759/cureus.63048]
Abstract
Introduction With both suspicion and excitement, artificial intelligence tools are being integrated into nearly every aspect of human existence, including the medical sciences and medical education. The newest large language model (LLM) in the class of autoregressive language models is ChatGPT. While ChatGPT's potential to revolutionize clinical practice and medical education is under investigation, further research is necessary to understand its strengths and limitations in this field comprehensively. Methods Two hundred National Eligibility cum Entrance Test for Postgraduate (NEET PG) 2023 questions were gathered from various public education websites and individually entered into Microsoft Bing (GPT-4 Version 2.2.1). Microsoft Bing Chatbot is currently the only platform incorporating all of GPT-4's multimodal features, including image recognition. The results were subsequently analyzed. Results Out of 200 questions, ChatGPT-4 answered 129 (64.5%) correctly. The most frequently tested specialties were medicine (15%), obstetrics and gynecology (15%), general surgery (14%), and pathology (10%). Conclusion This study sheds light on how well GPT-4 performs on the NEET-PG entrance test. ChatGPT has potential as an adjunctive instrument within medical education and clinical settings. Its capacity to react intelligently and accurately in complicated clinical settings demonstrates its versatility.
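The 129/200 result above can be checked against the 25% guessing baseline of four-option MCQs with an exact binomial test; this comparison is our illustration, not an analysis performed in the paper.

```python
from scipy.stats import binomtest

# 129 of 200 MCQs correct (figures from the abstract above); p = 0.25 is the
# four-option guessing baseline -- the comparison is illustrative only.
result = binomtest(k=129, n=200, p=0.25, alternative="greater")
ci = result.proportion_ci(confidence_level=0.95, method="wilson")
print(f"accuracy = {129/200:.3f}, p = {result.pvalue:.2e}, "
      f"95% CI {ci.low:.3f}-{ci.high:.3f}")
```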
Affiliation(s)
- Sam Paul
- General Surgery, St John's Medical College Hospital, Bengaluru, IND
- Sridar Govindaraj
- Surgical Gastroenterology and Laparoscopy, St John's Medical College Hospital, Bengaluru, IND
- Jerisha Jk
- Pediatrics and Neonatology, Christian Medical College Ludhiana, Ludhiana, IND
10. Huang L, Hu J, Cai Q, Fu G, Bai Z, Liu Y, Zheng J, Meng Z. The performance evaluation of artificial intelligence ERNIE bot in Chinese National Medical Licensing Examination. Postgrad Med J 2024:qgae062. [PMID: 38813794] [DOI: 10.1093/postmj/qgae062]
Affiliation(s)
- Leiyun Huang
- Medical College, Kunming University of Science and Technology, Kunming, 650500, China
- Department of Orthopedics, The First People's Hospital of Yunnan Province, Kunming, 650032, China
- Key Laboratory of Digital Orthopedics of Yunnan Province, Kunming, 650032, China
- Clinical Medicine Research Center of Orthopedics and Sports Rehabilitation in Yunnan Province, Kunming, 650032, China
- Clinical Medical Center for Spinal Cord Diseases in Yunnan Province, Kunming, 650032, China
- Jinghan Hu
- People's Hospital of Wenshan Prefecture, the Affiliated Hospital of Kunming University of Science and Technology, Wenshan, 663000, China
- Qingjin Cai
- Department of Urology, Urologic Surgery Center, Xinqiao Hospital, Third Military Medical University (Army Medical University), Chongqing, 400037, China
- Guangjie Fu
- Medical College, Kunming University of Science and Technology, Kunming, 650500, China
- Department of Orthopedics, The First People's Hospital of Yunnan Province, Kunming, 650032, China
- Key Laboratory of Digital Orthopedics of Yunnan Province, Kunming, 650032, China
- Clinical Medicine Research Center of Orthopedics and Sports Rehabilitation in Yunnan Province, Kunming, 650032, China
- Clinical Medical Center for Spinal Cord Diseases in Yunnan Province, Kunming, 650032, China
- Zhenglin Bai
- Medical College, Kunming University of Science and Technology, Kunming, 650500, China
- Department of Orthopedics, The First People's Hospital of Yunnan Province, Kunming, 650032, China
- Key Laboratory of Digital Orthopedics of Yunnan Province, Kunming, 650032, China
- Clinical Medicine Research Center of Orthopedics and Sports Rehabilitation in Yunnan Province, Kunming, 650032, China
- Clinical Medical Center for Spinal Cord Diseases in Yunnan Province, Kunming, 650032, China
- Yongzhen Liu
- People's Hospital of Wenshan Prefecture, the Affiliated Hospital of Kunming University of Science and Technology, Wenshan, 663000, China
- Ji Zheng
- Department of Urology, Urologic Surgery Center, Xinqiao Hospital, Third Military Medical University (Army Medical University), Chongqing, 400037, China
- Zengdong Meng
- Medical College, Kunming University of Science and Technology, Kunming, 650500, China
- Department of Orthopedics, The First People's Hospital of Yunnan Province, Kunming, 650032, China
- Key Laboratory of Digital Orthopedics of Yunnan Province, Kunming, 650032, China
- Clinical Medicine Research Center of Orthopedics and Sports Rehabilitation in Yunnan Province, Kunming, 650032, China
- Clinical Medical Center for Spinal Cord Diseases in Yunnan Province, Kunming, 650032, China
11. Bharatha A, Ojeh N, Fazle Rabbi AM, Campbell MH, Krishnamurthy K, Layne-Yarde RNA, Kumar A, Springer DCR, Connell KL, Majumder MAA. Comparing the Performance of ChatGPT-4 and Medical Students on MCQs at Varied Levels of Bloom's Taxonomy. Adv Med Educ Pract 2024;15:393-400. [PMID: 38751805] [PMCID: PMC11094742] [DOI: 10.2147/amep.s457408]
Abstract
Introduction This research investigated the capabilities of ChatGPT-4 compared with medical students in answering multiple-choice questions (MCQs), using the revised Bloom's Taxonomy as a benchmark. Methods A cross-sectional study was conducted at The University of the West Indies, Barbados. ChatGPT-4 and medical students were assessed on MCQs from various medical courses using computer-based testing. Results The study included 304 MCQs. Students demonstrated good knowledge, with 78% correctly answering at least 90% of the questions. However, ChatGPT-4 achieved a higher overall score (73.7%) than the students (66.7%). Course type significantly affected ChatGPT-4's performance, but revised Bloom's Taxonomy levels did not. A detailed check of the association between program levels and Bloom's Taxonomy levels for ChatGPT-4's correct answers showed a highly significant correlation (p<0.001), reflecting a concentration of "remember-level" questions in preclinical courses and "evaluate-level" questions in clinical courses. Discussion The study highlights ChatGPT-4's proficiency in standardized tests but indicates limitations in clinical reasoning and practical skills. This performance discrepancy suggests that the effectiveness of artificial intelligence (AI) varies with course content. Conclusion While ChatGPT-4 shows promise as an educational tool, its role should be supplementary, with strategic integration into medical education to leverage its strengths and address its limitations. Further research is needed to explore AI's impact on medical education and student performance across educational levels and courses.
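The association check described above amounts to a chi-square test of independence on a contingency table of correct answers cross-classified by program level and Bloom's level. A minimal sketch with scipy; the counts are hypothetical, not the study's data.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts of ChatGPT-4's correct answers:
# rows = program level (preclinical, clinical);
# cols = Bloom's level (remember, understand, apply, evaluate).
table = np.array([[40, 25, 10, 5],
                  [10, 20, 25, 35]])

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.2e}")
```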
Affiliation(s)
- Ambadasu Bharatha
- Faculty of Medical Sciences, The University of the West Indies, Bridgetown, Barbados
- Nkemcho Ojeh
- Faculty of Medical Sciences, The University of the West Indies, Bridgetown, Barbados
- Michael H Campbell
- Faculty of Medical Sciences, The University of the West Indies, Bridgetown, Barbados
- Alok Kumar
- Faculty of Medical Sciences, The University of the West Indies, Bridgetown, Barbados
- Dale C R Springer
- Faculty of Medical Sciences, The University of the West Indies, Bridgetown, Barbados
- Kenneth L Connell
- Faculty of Medical Sciences, The University of the West Indies, Bridgetown, Barbados
12. Sallam M, Barakat M, Sallam M. A Preliminary Checklist (METRICS) to Standardize the Design and Reporting of Studies on Generative Artificial Intelligence-Based Models in Health Care Education and Practice: Development Study Involving a Literature Review. Interact J Med Res 2024;13:e54704. [PMID: 38276872] [PMCID: PMC10905357] [DOI: 10.2196/54704]
Abstract
BACKGROUND Adherence to evidence-based practice is indispensable in health care. Recently, the utility of generative artificial intelligence (AI) models in health care has been evaluated extensively. However, the lack of consensus guidelines on the design and reporting of findings of these studies poses a challenge for the interpretation and synthesis of evidence. OBJECTIVE This study aimed to develop a preliminary checklist to standardize the reporting of generative AI-based studies in health care education and practice. METHODS A literature review was conducted in Scopus, PubMed, and Google Scholar. Published records with "ChatGPT," "Bing," or "Bard" in the title were retrieved. Careful examination of the methodologies employed in the included records was conducted to identify the common pertinent themes and the possible gaps in reporting. A panel discussion was held to establish a unified and thorough checklist for the reporting of AI studies in health care. The finalized checklist was used to evaluate the included records by 2 independent raters. Cohen κ was used as the method to evaluate the interrater reliability. RESULTS The final data set that formed the basis for pertinent theme identification and analysis comprised a total of 34 records. The finalized checklist included 9 pertinent themes collectively referred to as METRICS (Model, Evaluation, Timing, Range/Randomization, Individual factors, Count, and Specificity of prompts and language). Their details are as follows: (1) Model used and its exact settings; (2) Evaluation approach for the generated content; (3) Timing of testing the model; (4) Transparency of the data source; (5) Range of tested topics; (6) Randomization of selecting the queries; (7) Individual factors in selecting the queries and interrater reliability; (8) Count of queries executed to test the model; and (9) Specificity of the prompts and language used. The overall mean METRICS score was 3.0 (SD 0.58). The interrater reliability was acceptable, with Cohen κ ranging from 0.558 to 0.962 (P<.001 for the 9 tested items). With classification per item, the highest average METRICS score was recorded for the "Model" item, followed by the "Specificity" item, while the lowest scores were recorded for the "Randomization" item (classified as suboptimal) and the "Individual factors" item (classified as satisfactory). CONCLUSIONS The METRICS checklist can facilitate the design of studies, guiding researchers toward best practices in reporting results. The findings highlight the need for standardized reporting algorithms for generative AI-based studies in health care, considering the variability observed in methodologies and reporting. The proposed METRICS checklist could be a helpful preliminary base for establishing a universally accepted approach to standardize the design and reporting of generative AI-based studies in health care, which is a swiftly evolving research topic.
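Interrater reliability of checklist scores, as used above, is conventionally computed with Cohen κ. A minimal sketch with scikit-learn; the two raters' item scores are hypothetical, not the study's data.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical scores given by two independent raters to one METRICS item
# across ten records (5-point scale).
rater_a = [5, 4, 4, 3, 5, 2, 4, 3, 5, 4]
rater_b = [5, 4, 3, 3, 5, 2, 4, 4, 5, 4]

print(f"Cohen kappa = {cohen_kappa_score(rater_a, rater_b):.3f}")
```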
Affiliation(s)
- Malik Sallam
- Department of Pathology, Microbiology and Forensic Medicine, School of Medicine, The University of Jordan, Amman, Jordan
- Department of Clinical Laboratories and Forensic Medicine, Jordan University Hospital, Amman, Jordan
- Department of Translational Medicine, Faculty of Medicine, Lund University, Malmo, Sweden
- Muna Barakat
- Department of Clinical Pharmacy and Therapeutics, Faculty of Pharmacy, Applied Science Private University, Amman, Jordan
- Mohammed Sallam
- Department of Pharmacy, Mediclinic Parkview Hospital, Mediclinic Middle East, Dubai, United Arab Emirates
13. Sallam M, Al-Salahat K. Below average ChatGPT performance in medical microbiology exam compared to university students. Front Educ 2023;8. [DOI: 10.3389/feduc.2023.1333415]
Abstract
BACKGROUND The transformative potential of artificial intelligence (AI) in higher education is evident, with conversational models like ChatGPT poised to reshape teaching and assessment methods. The rapid evolution of AI models requires continuous evaluation. AI-based models can offer personalized learning experiences but raise accuracy concerns. Multiple-choice questions (MCQs) are widely used for competency assessment. The aim of this study was to evaluate ChatGPT's performance on medical microbiology MCQs compared with the students' performance. METHODS The study employed an 80-MCQ dataset from a 2021 medical microbiology exam in the University of Jordan Doctor of Dental Surgery (DDS) Medical Microbiology 2 course. The exam contained 40 midterm and 40 final MCQs, authored by a single instructor without copyright issues. The MCQs were categorized based on the revised Bloom's Taxonomy into four categories: Remember, Understand, Analyze, or Evaluate. Metrics, including the facility index and discriminative efficiency, were derived from the performance of 153 midterm and 154 final exam DDS students. ChatGPT 3.5 was used to answer the questions, and its responses were assessed for correctness and clarity by two independent raters. RESULTS ChatGPT 3.5 correctly answered 64 of 80 medical microbiology MCQs (80%) but scored below the student average (80.5/100 vs. 86.21/100). Incorrect ChatGPT responses were more common in MCQs with longer choices (p = 0.025). ChatGPT 3.5 performance varied across cognitive domains: Remember (88.5% correct), Understand (82.4% correct), Analyze (75% correct), and Evaluate (72% correct), with no statistically significant differences (p = 0.492). Correct ChatGPT responses received significantly higher average clarity and correctness scores than incorrect responses. CONCLUSION The study findings emphasize the need for ongoing refinement and evaluation of ChatGPT's performance. ChatGPT 3.5 showed the potential to answer medical microbiology MCQs correctly and clearly; nevertheless, its performance was below par compared with the students'. Variability in ChatGPT performance across cognitive domains should be considered in future studies. These insights could contribute to the ongoing evaluation of AI-based models' role in educational assessment and help augment traditional methods in higher education.
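The facility index used above is the proportion of examinees answering an item correctly; item discrimination is commonly summarized by the item-rest point-biserial correlation, shown here as a stand-in for the discriminative-efficiency metric the study names. A minimal sketch on a hypothetical 0/1 response matrix.

```python
import numpy as np

# Hypothetical 0/1 response matrix: rows = students, columns = MCQ items.
rng = np.random.default_rng(0)
responses = (rng.random((150, 8)) < 0.7).astype(float)

facility = responses.mean(axis=0)  # facility index: proportion correct per item
totals = responses.sum(axis=1)
discrimination = np.array([
    # correlation of each item with the rest-of-test score
    np.corrcoef(responses[:, j], totals - responses[:, j])[0, 1]
    for j in range(responses.shape[1])
])
print("facility:", np.round(facility, 2))
print("discrimination:", np.round(discrimination, 2))
```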
14. Ohta K, Ohta S. The Performance of GPT-3.5, GPT-4, and Bard on the Japanese National Dentist Examination: A Comparison Study. Cureus 2023;15:e50369. [PMID: 38213361] [PMCID: PMC10782219] [DOI: 10.7759/cureus.50369]
Abstract
Purpose This study aims to evaluate the performance of three large language models (LLMs), Generative Pre-trained Transformer (GPT)-3.5, GPT-4, and Google Bard, on the 2023 Japanese National Dentist Examination (JNDE) and to assess their potential clinical applications in Japan. Methods A total of 185 questions from the 2023 JNDE were used. These questions were categorized by question type and category. McNemar's test compared the correct response rates between pairs of LLMs, while Fisher's exact test evaluated the performance of the LLMs in each question category. Results The overall correct response rates were 73.5% for GPT-4, 66.5% for Bard, and 51.9% for GPT-3.5. GPT-4 showed a significantly higher correct response rate than Bard and GPT-3.5. In the category of essential questions, Bard achieved a correct response rate of 80.5%, surpassing the passing criterion of 80%. In contrast, both GPT-4 and GPT-3.5 fell short of this benchmark, with GPT-4 attaining 77.6% and GPT-3.5 only 52.5%. The scores of GPT-4 and Bard were significantly higher than those of GPT-3.5 (p<0.01). For general questions, the correct response rates were 71.2% for GPT-4, 58.5% for Bard, and 52.5% for GPT-3.5. GPT-4 outperformed GPT-3.5 and Bard (p<0.01). The correct response rates for professional dental questions were 51.6% for GPT-4, 45.3% for Bard, and 35.9% for GPT-3.5; the differences among the models were not statistically significant. All LLMs demonstrated significantly lower accuracy on dentistry questions than on other types of questions (p<0.01). Conclusions GPT-4 achieved the highest overall score on the JNDE, followed by Bard and GPT-3.5. However, only Bard surpassed the passing score for essential questions. To further understand the application of LLMs in clinical dentistry worldwide, more research on their performance in dental examinations across different languages is required.
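McNemar's test, as used above, compares two models answering the same questions via the discordant pairs of a paired 2x2 table. A minimal sketch with statsmodels; the cell counts are hypothetical (though they sum to the 185 questions mentioned).

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical paired outcomes on the same 185 questions:
# rows = GPT-4 (correct, incorrect); cols = Bard (correct, incorrect).
table = np.array([[110, 26],
                  [13, 36]])

result = mcnemar(table, exact=True)  # exact binomial test on discordant pairs
print(f"statistic = {result.statistic}, p = {result.pvalue:.4f}")
```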
Affiliation(s)
- Satomi Ohta
- Dentistry, Dentist of Mama and Kodomo, Kobe, JPN