1. Yanagita Y, Yokokawa D, Uchida S, Li Y, Uehara T, Ikusaka M. Can AI-Generated Clinical Vignettes in Japanese Be Used Medically and Linguistically? J Gen Intern Med 2024. PMID: 39313665. DOI: 10.1007/s11606-024-09031-y.
Abstract
BACKGROUND Creating clinical vignettes requires considerable effort. Recent developments in generative artificial intelligence (AI) for natural language processing have been remarkable and may allow for the easy and immediate creation of diverse clinical vignettes. OBJECTIVE In this study, we evaluated the medical accuracy and grammatical correctness of AI-generated clinical vignettes in Japanese and verified their usefulness. METHODS Clinical vignettes were created using the generative AI model GPT-4-0613. The input prompts for the clinical vignettes specified the following seven elements: (1) age, (2) sex, (3) chief complaint and time course since onset, (4) physical findings, (5) examination results, (6) diagnosis, and (7) treatment course. The list of diseases integrated into the vignettes was based on the 202 cases covered in the management of diseases and symptoms in Japan's Primary Care Physicians Training Program. Three physicians rated each vignette for medical accuracy and Japanese-language accuracy on a five-point scale. A total score of 13 points or above was defined as "sufficiently beneficial and immediately usable with minor revisions," a score between 10 and 12 points as "partly insufficient and in need of modifications," and a score of 9 points or below as "insufficient." RESULTS Regarding medical accuracy, of the 202 clinical vignettes, 118 scored 13 points or above, 78 scored between 10 and 12 points, and 6 scored 9 points or below. Regarding Japanese-language accuracy, 142 vignettes scored 13 points or above, 56 scored between 10 and 12 points, and 4 scored 9 points or below. Overall, 97% (196/202) of the vignettes were usable, either immediately or after modification. CONCLUSION Overall, 97% of the clinical vignettes proved practically useful after confirmation and revision by Japanese physicians. Given the considerable effort required by physicians to create vignettes without AI, using GPT is expected to greatly streamline this process.
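The usability thresholds defined above are straightforward to operationalize. A minimal sketch (the helper function below is hypothetical, not from the paper) maps a summed rating from three raters on a five-point scale (range 3-15) to the study's categories:

```python
def categorize_vignette(total_score: int) -> str:
    """Map a summed rating (3 raters x 5-point scale, range 3-15)
    to the usability categories defined in the study."""
    if total_score >= 13:
        return "sufficiently beneficial; usable with minor revisions"
    if total_score >= 10:
        return "partly insufficient; needs modifications"
    return "insufficient"

print(categorize_vignette(14))  # -> "sufficiently beneficial; usable with minor revisions"
```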
Affiliation(s)
- Yasutaka Yanagita
- Department of General Medicine, Chiba University Hospital, Chiba, Japan
- Daiki Yokokawa
- Department of General Medicine, Chiba University Hospital, Chiba, Japan
- Shun Uchida
- Uchida Internal Medicine Clinic, Saitama, Japan
- Yu Li
- Department of General Medicine, Chiba University Hospital, Chiba, Japan
- Takanori Uehara
- Department of General Medicine, Chiba University Hospital, Chiba, Japan
- Masatomi Ikusaka
- Department of General Medicine, Chiba University Hospital, Chiba, Japan
2. Sallam M, Al-Salahat K, Eid H, Egger J, Puladi B. Human versus Artificial Intelligence: ChatGPT-4 Outperforming Bing, Bard, ChatGPT-3.5 and Humans in Clinical Chemistry Multiple-Choice Questions. Adv Med Educ Pract 2024;15:857-871. PMID: 39319062. PMCID: PMC11421444. DOI: 10.2147/amep.s479801.
Abstract
Introduction Artificial intelligence (AI) chatbots excel in language understanding and generation, and these models could transform healthcare education and practice. However, it is important to assess the performance of such AI models across various topics to highlight their strengths and possible limitations. This study aimed to evaluate the performance of ChatGPT (GPT-3.5 and GPT-4), Bing, and Bard against human students at a postgraduate master's level in Medical Laboratory Sciences. Methods The study design was based on the METRICS checklist for the design and reporting of AI-based studies in healthcare. The study utilized a dataset of 60 Clinical Chemistry multiple-choice questions (MCQs) originally conceived for assessing 20 MSc students. The revised Bloom's taxonomy was used as the framework for classifying the MCQs into four cognitive categories: Remember, Understand, Analyze, and Apply. A modified version of the CLEAR tool was used to assess the quality of AI-generated content, with Cohen's κ for inter-rater agreement. Results Compared with the students' mean score of 0.68±0.23, GPT-4 scored 0.90±0.30, followed by Bing (0.77±0.43), GPT-3.5 (0.73±0.45), and Bard (0.67±0.48). Significantly better performance was noted in the lower cognitive domains (Remember and Understand) than in the higher cognitive domains (Apply and Analyze) for GPT-3.5 (P=0.041), GPT-4 (P=0.003), and Bard (P=0.017). The CLEAR scores indicated that ChatGPT-4's performance was "Excellent" compared with the "Above average" performance of ChatGPT-3.5, Bing, and Bard. Discussion The findings indicated that ChatGPT-4 excelled in the Clinical Chemistry exam, while ChatGPT-3.5, Bing, and Bard were above average. Given that the MCQs were directed at postgraduate students with a high degree of specialization, the performance of these AI chatbots was remarkable. Due to the risk of academic dishonesty and possible dependence on these AI models, the appropriateness of MCQs as an assessment tool in higher education should be re-evaluated.
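Cohen's κ, used above for inter-rater agreement, corrects observed agreement for agreement expected by chance. A minimal sketch with invented ratings (illustrative data, not the study's):

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical CLEAR-tool ratings of the same AI responses by two raters
rater_a = [5, 4, 4, 3, 5, 2, 4, 5]
rater_b = [5, 4, 3, 3, 5, 2, 4, 4]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1 = perfect agreement, 0 = chance level
```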
Affiliation(s)
- Malik Sallam
- Department of Pathology, Microbiology and Forensic Medicine, School of Medicine, The University of Jordan, Amman, Jordan
- Department of Clinical Laboratories and Forensic Medicine, Jordan University Hospital, Amman, Jordan
- Scientific Approaches to Fight Epidemics of Infectious Diseases (SAFE-ID) Research Group, The University of Jordan, Amman, Jordan
- Khaled Al-Salahat
- Department of Pathology, Microbiology and Forensic Medicine, School of Medicine, The University of Jordan, Amman, Jordan
- Scientific Approaches to Fight Epidemics of Infectious Diseases (SAFE-ID) Research Group, The University of Jordan, Amman, Jordan
- Huda Eid
- Scientific Approaches to Fight Epidemics of Infectious Diseases (SAFE-ID) Research Group, The University of Jordan, Amman, Jordan
- Jan Egger
- Institute for AI in Medicine (IKIM), University Medicine Essen (AöR), Essen, Germany
- Behrus Puladi
- Institute of Medical Informatics, University Hospital RWTH Aachen, Aachen, Germany
3. Jin HK, Lee HE, Kim E. Performance of ChatGPT-3.5 and GPT-4 in national licensing examinations for medicine, pharmacy, dentistry, and nursing: a systematic review and meta-analysis. BMC Med Educ 2024;24:1013. PMID: 39285377. PMCID: PMC11406751. DOI: 10.1186/s12909-024-05944-8.
Abstract
BACKGROUND ChatGPT, a recently developed artificial intelligence (AI) chatbot, has demonstrated improved performance on examinations in the medical field. However, an overall evaluation of the potential of ChatGPT models (ChatGPT-3.5 and GPT-4) across a variety of national health licensing examinations is still lacking. This study aimed to provide a comprehensive assessment of the ChatGPT models' performance in national licensing examinations for medicine, pharmacy, dentistry, and nursing through a meta-analysis. METHODS Following the PRISMA protocol, full-text articles from MEDLINE/PubMed, EMBASE, ERIC, Cochrane Library, Web of Science, and key journals were reviewed from the time of ChatGPT's introduction to February 27, 2024. Studies were eligible if they evaluated the performance of a ChatGPT model (ChatGPT-3.5 or GPT-4); related to national licensing examinations in the fields of medicine, pharmacy, dentistry, or nursing; involved multiple-choice questions; and provided data that enabled the calculation of effect size. Two reviewers independently completed data extraction, coding, and quality assessment. The JBI Critical Appraisal Tools were used to assess the quality of the selected articles. Overall effect size and 95% confidence intervals (CIs) were calculated using a random-effects model. RESULTS A total of 23 studies were considered for this review, which evaluated the accuracy of four types of national licensing examinations. The selected articles were in the fields of medicine (n = 17), pharmacy (n = 3), nursing (n = 2), and dentistry (n = 1). They reported varying accuracy levels, ranging from 36% to 77% for ChatGPT-3.5 and from 64.4% to 100% for GPT-4. The overall effect size for the percentage of accuracy was 70.1% (95% CI, 65-74.8%), which was statistically significant (p < 0.001). Subgroup analyses revealed that GPT-4 demonstrated significantly higher accuracy in providing correct responses than its earlier version, ChatGPT-3.5. Additionally, in the context of health licensing examinations, the ChatGPT models exhibited proficiency in the following descending order: pharmacy, medicine, dentistry, and nursing. However, the lack of a broader set of questions, including open-ended and scenario-based questions, and significant heterogeneity were limitations of this meta-analysis. CONCLUSIONS This study sheds light on the accuracy of ChatGPT models in four national health licensing examinations across various countries and provides a practical basis and theoretical support for future research. Further studies are needed to explore their utilization in medical and health education by including a broader and more diverse range of questions, along with more advanced versions of AI chatbots.
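The pooled accuracy and 95% CI reported above come from a random-effects model. A compact sketch of DerSimonian-Laird pooling of per-exam accuracy proportions (toy inputs; the review's exact estimator and data may differ):

```python
import numpy as np

def random_effects_pool(p, n):
    """DerSimonian-Laird random-effects pooling of proportions.
    p: observed accuracy per study; n: number of questions per study."""
    p, n = np.asarray(p, float), np.asarray(n, float)
    var = p * (1 - p) / n                 # within-study variance
    w = 1 / var                           # fixed-effect weights
    p_fixed = np.sum(w * p) / np.sum(w)
    q = np.sum(w * (p - p_fixed) ** 2)    # Cochran's Q statistic
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (len(p) - 1)) / c)  # between-study variance
    w_re = 1 / (var + tau2)               # random-effects weights
    pooled = np.sum(w_re * p) / np.sum(w_re)
    se = np.sqrt(1 / np.sum(w_re))
    return pooled, (pooled - 1.96 * se, pooled + 1.96 * se)

pooled, ci = random_effects_pool([0.64, 0.75, 0.81], [300, 250, 400])
print(f"pooled accuracy {pooled:.3f}, 95% CI ({ci[0]:.3f}, {ci[1]:.3f})")
```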
Affiliation(s)
- Hye Kyung Jin
- Research Institute of Pharmaceutical Sciences, College of Pharmacy, Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul, 06974, South Korea
- Data Science, Evidence-Based and Clinical Research Laboratory, Department of Health, Social, and Clinical Pharmacy, College of Pharmacy, Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul, 06974, South Korea
- Ha Eun Lee
- Data Science, Evidence-Based and Clinical Research Laboratory, Department of Health, Social, and Clinical Pharmacy, College of Pharmacy, Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul, 06974, South Korea
- EunYoung Kim
- Research Institute of Pharmaceutical Sciences, College of Pharmacy, Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul, 06974, South Korea
- Data Science, Evidence-Based and Clinical Research Laboratory, Department of Health, Social, and Clinical Pharmacy, College of Pharmacy, Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul, 06974, South Korea
- Division of Licensing of Medicines and Regulatory Science, The Graduate School of Pharmaceutical Management and Regulatory Science Policy, The Graduate School of Pharmaceutical Regulatory Sciences, Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul, 06974, South Korea
4. Ishida K, Arisaka N, Fujii K. Analysis of Responses of GPT-4V to the Japanese National Clinical Engineer Licensing Examination. J Med Syst 2024;48:83. PMID: 39259341. DOI: 10.1007/s10916-024-02103-w.
Abstract
Chat Generative Pretrained Transformer (ChatGPT; OpenAI) is a state-of-the-art large language model that can simulate human-like conversations based on user input. We evaluated the performance of GPT-4V on the Japanese National Clinical Engineer Licensing Examination using 2,155 questions from 2012 to 2023. The average correct answer rate across all questions was 86.0%. In particular, clinical medicine, basic medicine, medical materials, biological properties, and mechanical engineering achieved correct response rates of ≥90%. Conversely, medical device safety management, electrical and electronic engineering, and extracorporeal circulation obtained low correct answer rates, ranging from 64.8% to 76.5%. The correct answer rates for questions that included figures/tables, required numerical calculation, combined figures/tables with calculation, or required knowledge of Japanese Industrial Standards were 55.2%, 85.8%, 64.2%, and 31.0%, respectively. These low rates reflect ChatGPT's inability to recognize the images and its lack of knowledge of the relevant standards and laws. This study concludes that careful attention is required when using ChatGPT because several of its explanations are inaccurate.
Affiliation(s)
- Kai Ishida
- Department of Materials and Human Environmental Sciences, Faculty of Engineering, Shonan Institute of Technology, Fujisawa, Japan
- Naoya Arisaka
- Department of Medical Informatics, School of Allied Health Science, Kitasato University, Sagamihara, Japan
- Kiyotaka Fujii
- Department of Clinical Engineering, School of Allied Health Science, Kitasato University, Sagamihara, Japan
5. Chau RCW, Thu KM, Yu OY, Lo ECM, Hsung RTC, Lam WYH. Response to Generative AI in Dental Licensing Examinations: Comment. Int Dent J 2024;74:897-898. PMID: 38403499. PMCID: PMC11287190. DOI: 10.1016/j.identj.2024.02.002.
Affiliation(s)
- Khaing Myat Thu
- Faculty of Dentistry, The University of Hong Kong, Hong Kong, China
- Ollie Yiru Yu
- Faculty of Dentistry, The University of Hong Kong, Hong Kong, China
- Richard Tai-Chiu Hsung
- Faculty of Dentistry, The University of Hong Kong, Hong Kong, China; Department of Computer Science, Hong Kong Chu Hai College, Hong Kong, China
- Walter Yu Hang Lam
- Faculty of Dentistry, The University of Hong Kong, Hong Kong, China; Musketeers Foundation Institute of Data Science, The University of Hong Kong, Hong Kong, China
6. Yokokawa D, Yanagita Y, Li Y, Yamashita S, Shikino K, Noda K, Tsukamoto T, Uehara T, Ikusaka M. For any disease a human can imagine, ChatGPT can generate a fake report. Diagnosis (Berl) 2024;11:329-332. PMID: 38386808. DOI: 10.1515/dx-2024-0007.
Affiliation(s)
- Daiki Yokokawa
- Department of General Medicine, Chiba University Hospital, Chiba, Japan
- Yasutaka Yanagita
- Department of General Medicine, Chiba University Hospital, Chiba, Japan
- Yu Li
- Department of General Medicine, Chiba University Hospital, Chiba, Japan
- Shiho Yamashita
- Department of General Medicine, Chiba University Hospital, Chiba, Japan
- Kiyoshi Shikino
- Department of General Medicine, Chiba University Hospital, Chiba, Japan
- Department of Community-oriented Medical Education, Chiba University Graduate School of Medicine, Chiba, Japan
- Kazutaka Noda
- Department of General Medicine, Chiba University Hospital, Chiba, Japan
- Tomoko Tsukamoto
- Department of General Medicine, Chiba University Hospital, Chiba, Japan
- Takanori Uehara
- Department of General Medicine, Chiba University Hospital, Chiba, Japan
- Masatomi Ikusaka
- Department of General Medicine, Chiba University Hospital, Chiba, Japan
7. Hirano Y, Hanaoka S, Nakao T, Miki S, Kikuchi T, Nakamura Y, Nomura Y, Yoshikawa T, Abe O. GPT-4 Turbo with Vision fails to outperform text-only GPT-4 Turbo in the Japan Diagnostic Radiology Board Examination. Jpn J Radiol 2024;42:918-926. PMID: 38733472. PMCID: PMC11286662. DOI: 10.1007/s11604-024-01561-z.
Abstract
PURPOSE To assess the performance of GPT-4 Turbo with Vision (GPT-4TV), OpenAI's latest multimodal large language model, by comparing its ability to process both text and image inputs with that of the text-only GPT-4 Turbo (GPT-4T) in the context of the Japan Diagnostic Radiology Board Examination (JDRBE). MATERIALS AND METHODS The dataset comprised questions from JDRBE 2021 and 2023. A total of six board-certified diagnostic radiologists discussed the questions and provided ground-truth answers by consulting relevant literature as necessary. The following questions were excluded: those lacking associated images, those with no unanimous agreement on answers, and those including images rejected by the OpenAI application programming interface. The inputs for GPT-4TV included both text and images, whereas those for GPT-4T were text only. Both models were run on the dataset, and their performance was compared using McNemar's exact test. The radiological credibility of the responses was assessed by two diagnostic radiologists, who assigned legitimacy scores on a five-point Likert scale; these scores were then compared between models using Wilcoxon's signed-rank test. RESULTS The dataset comprised 139 questions. GPT-4TV correctly answered 62 questions (45%), whereas GPT-4T correctly answered 57 (41%). Statistical analysis found no significant performance difference between the two models (P = 0.44). The GPT-4TV responses received significantly lower legitimacy scores from both radiologists than the GPT-4T responses. CONCLUSION No significant gain in accuracy was observed when GPT-4TV was given image input compared with text-only GPT-4T on JDRBE questions.
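Both tests above operate on paired, per-question outcomes from the two models. A minimal sketch with invented data (the 2x2 table and Likert scores are illustrative, not the study's):

```python
from scipy.stats import wilcoxon
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical paired correctness on the same questions.
# Rows: GPT-4TV correct/incorrect; columns: GPT-4T correct/incorrect.
table = [[50, 12],  # both correct | only GPT-4TV correct
         [7, 70]]   # only GPT-4T correct | both incorrect
print(mcnemar(table, exact=True))  # exact test on the discordant counts (12 vs 7)

# Hypothetical five-point legitimacy scores from one radiologist
scores_4tv = [3, 2, 4, 3, 2, 3, 4, 2, 3, 3]
scores_4t = [4, 3, 4, 4, 3, 3, 5, 3, 4, 3]
print(wilcoxon(scores_4tv, scores_4t))  # paired signed-rank test
```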
Affiliation(s)
- Yuichiro Hirano
- Department of Radiology, The International University of Health and Welfare Narita Hospital, 852 Hatakeda, Narita, Chiba, Japan
- Department of Radiology, The University of Tokyo Hospital, 7-3-1 Hongo, Bunkyo-ku, Tokyo, Japan
- Shouhei Hanaoka
- Department of Radiology, The University of Tokyo Hospital, 7-3-1 Hongo, Bunkyo-ku, Tokyo, Japan
- Takahiro Nakao
- Department of Computational Diagnostic Radiology and Preventive Medicine, The University of Tokyo Hospital, 7-3-1 Hongo, Bunkyo-ku, Tokyo, Japan
- Soichiro Miki
- Department of Computational Diagnostic Radiology and Preventive Medicine, The University of Tokyo Hospital, 7-3-1 Hongo, Bunkyo-ku, Tokyo, Japan
- Tomohiro Kikuchi
- Department of Computational Diagnostic Radiology and Preventive Medicine, The University of Tokyo Hospital, 7-3-1 Hongo, Bunkyo-ku, Tokyo, Japan
- Department of Radiology, School of Medicine, Jichi Medical University, 3311-1 Yakushiji, Shimotsuke, Tochigi, Japan
- Yuta Nakamura
- Department of Computational Diagnostic Radiology and Preventive Medicine, The University of Tokyo Hospital, 7-3-1 Hongo, Bunkyo-ku, Tokyo, Japan
- Yukihiro Nomura
- Department of Computational Diagnostic Radiology and Preventive Medicine, The University of Tokyo Hospital, 7-3-1 Hongo, Bunkyo-ku, Tokyo, Japan
- Center for Frontier Medical Engineering, Chiba University, 1-33 Yayoicho, Inage-ku, Chiba, Japan
- Takeharu Yoshikawa
- Department of Computational Diagnostic Radiology and Preventive Medicine, The University of Tokyo Hospital, 7-3-1 Hongo, Bunkyo-ku, Tokyo, Japan
- Osamu Abe
- Department of Radiology, The University of Tokyo Hospital, 7-3-1 Hongo, Bunkyo-ku, Tokyo, Japan
8. Ishida K, Hanada E. Potential of ChatGPT to Pass the Japanese Medical and Healthcare Professional National Licenses: A Literature Review. Cureus 2024;16:e66324. PMID: 39247019. PMCID: PMC11377128. DOI: 10.7759/cureus.66324.
Abstract
This systematic review aimed to assess the academic potential of ChatGPT (GPT-3.5, 4, and 4V) for Japanese national medical and healthcare licensing examinations, taking into account its strengths and limitations. Electronic databases including PubMed/MEDLINE, Google Scholar, and ICHUSHI (a Japanese medical article database) were systematically searched for relevant articles, particularly those published between January 1, 2022, and April 30, 2024. A formal narrative analysis was conducted by systematically comparing similarities and differences across individual research findings. After rigorous screening, we reviewed 22 articles. With one exception, every article that evaluated GPT-4 showed that this tool could pass each exam containing text only. However, some studies also reported that, despite passing, GPT-4 scored worse than the actual examinees. Moreover, the newest model, GPT-4V, recognized images poorly, giving inadequate answers to questions that involved images and figures/tables. Its precision therefore needs to be improved to obtain better results.
Affiliation(s)
- Kai Ishida
- Faculty of Engineering, Shonan Institute of Technology, Fujisawa, JPN
- Eisuke Hanada
- Faculty of Science and Engineering, Saga University, Saga, JPN
9. Hsieh CH, Hsieh HY, Lin HP. Evaluating the performance of ChatGPT-3.5 and ChatGPT-4 on the Taiwan plastic surgery board examination. Heliyon 2024;10:e34851. PMID: 39149010. PMCID: PMC11324965. DOI: 10.1016/j.heliyon.2024.e34851.
Abstract
Background Chat Generative Pre-Trained Transformer (ChatGPT) is a state-of-the-art large language model that has been evaluated across various medical fields, with mixed performance on licensing examinations. This study aimed to assess the performance of ChatGPT-3.5 and ChatGPT-4 in answering questions from the Taiwan Plastic Surgery Board Examination. Methods The study evaluated the performance of ChatGPT-3.5 and ChatGPT-4 on 1375 questions from the past 8 years of the Taiwan Plastic Surgery Board Examination, comprising 985 single-choice and 390 multiple-choice questions. We obtained the responses between June and July 2023, launching a new chat session for each question to eliminate memory retention bias. Results Overall, ChatGPT-4 outperformed ChatGPT-3.5, achieving a 59% correct answer rate compared with 41% for ChatGPT-3.5. ChatGPT-4 passed five of the eight yearly exams, whereas ChatGPT-3.5 failed all of them. On single-choice questions, ChatGPT-4 scored 66% correct, compared with 48% for ChatGPT-3.5. On multiple-choice questions, ChatGPT-4 achieved a 43% correct rate, nearly double ChatGPT-3.5's 23%. Conclusion As ChatGPT evolves, its performance on the Taiwan Plastic Surgery Board Examination is expected to improve further. The study suggests potential reforms, such as incorporating more problem-based scenarios, leveraging ChatGPT to refine exam questions, and integrating AI-assisted learning into candidate preparation. These advancements could enhance the assessment of candidates' critical thinking and problem-solving abilities in the field of plastic surgery.
Affiliation(s)
- Ching-Hua Hsieh
- Department of Plastic Surgery, Kaohsiung Chang Gung Memorial Hospital, Chang Gung University and College of Medicine, Kaohsiung, 83301, Taiwan
- Hsiao-Yun Hsieh
- Department of Plastic Surgery, Kaohsiung Chang Gung Memorial Hospital, Chang Gung University and College of Medicine, Kaohsiung, 83301, Taiwan
- Hui-Ping Lin
- Department of Plastic Surgery, Kaohsiung Chang Gung Memorial Hospital, Chang Gung University and College of Medicine, Kaohsiung, 83301, Taiwan
10. Liu M, Okuhara T, Chang X, Shirabe R, Nishiie Y, Okada H, Kiuchi T. Performance of ChatGPT Across Different Versions in Medical Licensing Examinations Worldwide: Systematic Review and Meta-Analysis. J Med Internet Res 2024;26:e60807. PMID: 39052324. PMCID: PMC11310649. DOI: 10.2196/60807.
Abstract
BACKGROUND Over the past 2 years, researchers have used various medical licensing examinations to test whether ChatGPT (OpenAI) possesses accurate medical knowledge. Each version of ChatGPT has shown remarkably variable performance on medical licensing examinations across environments, and a comprehensive understanding of this variability is still lacking. OBJECTIVE In this study, we reviewed all studies on ChatGPT's performance in medical licensing examinations up to March 2024. This review aims to contribute to the evolving discourse on artificial intelligence (AI) in medical education by providing a comprehensive analysis of ChatGPT's performance in various environments. The insights gained from this systematic review will guide educators, policymakers, and technical experts to use AI in medical education effectively and judiciously. METHODS We searched the literature published between January 1, 2022, and March 29, 2024, using query strings in Web of Science, PubMed, and Scopus. Two authors screened the literature according to the inclusion and exclusion criteria, extracted data, and independently assessed the quality of the literature using the Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) tool. We conducted both qualitative and quantitative analyses. RESULTS A total of 45 studies on the performance of different versions of ChatGPT in medical licensing examinations were included in this study. GPT-4 achieved an overall accuracy rate of 81% (95% CI 78-84; P<.01), significantly surpassing the 58% (95% CI 53-63; P<.01) accuracy rate of GPT-3.5. GPT-4 passed the medical examinations in 26 of 29 cases, outperforming the average scores of medical students in 13 of 17 cases. Translating the examination questions into English improved GPT-3.5's performance but did not affect GPT-4's. GPT-3.5 showed no difference in performance between examinations from English-speaking and non-English-speaking countries (P=.72), but GPT-4 performed significantly better on examinations from English-speaking countries (P=.02). Any type of prompt could significantly improve GPT-3.5's (P=.03) and GPT-4's (P<.01) performance. GPT-3.5 performed better on short-text questions than on long-text questions. The difficulty of the questions affected the performance of both GPT-3.5 and GPT-4. In image-based multiple-choice questions (MCQs), ChatGPT's accuracy rate ranged from 13.1% to 100%. ChatGPT performed significantly worse on open-ended questions than on MCQs. CONCLUSIONS GPT-4 demonstrates considerable potential for future use in medical education. However, due to its insufficient accuracy, inconsistent performance, and the challenges posed by differing medical policies and knowledge across countries, GPT-4 is not yet suitable for use in medical education. TRIAL REGISTRATION PROSPERO CRD42024506687; https://www.crd.york.ac.uk/prospero/display_record.php?RecordID=506687.
Affiliation(s)
- Mingxin Liu
- Department of Health Communication, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
- Tsuyoshi Okuhara
- Department of Health Communication, School of Public Health, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
- XinYi Chang
- Department of Industrial Engineering and Economics, School of Engineering, Tokyo Institute of Technology, Tokyo, Japan
- Ritsuko Shirabe
- Department of Health Communication, School of Public Health, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
- Yuriko Nishiie
- Department of Health Communication, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
- Hiroko Okada
- Department of Health Communication, School of Public Health, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
- Takahiro Kiuchi
- Department of Health Communication, School of Public Health, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
11. Rossettini G, Rodeghiero L, Corradi F, Cook C, Pillastrini P, Turolla A, Castellini G, Chiappinotto S, Gianola S, Palese A. Comparative accuracy of ChatGPT-4, Microsoft Copilot and Google Gemini in the Italian entrance test for healthcare sciences degrees: a cross-sectional study. BMC Med Educ 2024;24:694. PMID: 38926809. PMCID: PMC11210096. DOI: 10.1186/s12909-024-05630-9.
Abstract
BACKGROUND Artificial intelligence (AI) chatbots are emerging educational tools for students in healthcare science. However, assessing their accuracy is essential prior to adoption in educational settings. This study aimed to assess the accuracy of three AI chatbots (ChatGPT-4, Microsoft Copilot, and Google Gemini) in predicting the correct answers on the Italian entrance standardized examination test for healthcare science degrees (CINECA test). Secondarily, we assessed the narrative coherence of the AI chatbots' responses (i.e., text output) based on three qualitative metrics: the logical rationale behind the chosen answer, the presence of information internal to the question, and the presence of information external to the question. METHODS An observational cross-sectional design was used in September 2023. The accuracy of the three chatbots was evaluated on the CINECA test, whose questions use a multiple-choice structure with a single best answer. The outcome is binary (correct or incorrect). A chi-squared test followed by post hoc pairwise comparisons with Bonferroni correction assessed differences among the chatbots' accuracy. A p-value of <0.05 was considered statistically significant. A sensitivity analysis was performed, excluding answers that were not applicable (e.g., those involving images). Narrative coherence was analyzed using absolute and relative frequencies of correct answers and errors. RESULTS Of the 820 CINECA multiple-choice questions inputted into all chatbots, 20 could not be imported into ChatGPT-4 (n = 808) and Google Gemini (n = 808) due to technical limitations. We found statistically significant differences in the ChatGPT-4 vs Google Gemini and Microsoft Copilot vs Google Gemini comparisons (p-value < 0.001). The narrative coherence analysis revealed "Logical reasoning" as the prevalent pattern behind correct answers (n = 622, 81.5%) and "Logical error" as the prevalent pattern behind incorrect answers (n = 40, 88.9%). CONCLUSIONS Our main findings reveal that: (A) the AI chatbots performed well; (B) ChatGPT-4 and Microsoft Copilot performed better than Google Gemini; and (C) their narrative coherence is primarily logical. Although the AI chatbots showed promising accuracy in predicting the correct answers on the Italian entrance university standardized examination test, we encourage candidates to cautiously incorporate this new technology to supplement their learning rather than as a primary resource. TRIAL REGISTRATION Not required.
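The omnibus chi-squared test with Bonferroni-corrected post hoc comparisons described above can be sketched as follows (toy correct/incorrect counts, not the study's data):

```python
from itertools import combinations
from scipy.stats import chi2_contingency

# Hypothetical (correct, incorrect) counts per chatbot
counts = {"ChatGPT-4": (700, 108), "Copilot": (690, 130), "Gemini": (610, 198)}

# Omnibus test across all three chatbots
chi2, p, dof, _ = chi2_contingency([list(v) for v in counts.values()])
print(f"omnibus: chi2={chi2:.2f}, p={p:.4f}, dof={dof}")

# Post hoc pairwise tests with Bonferroni correction
pairs = list(combinations(counts, 2))
for a, b in pairs:
    _, p_pair, _, _ = chi2_contingency([counts[a], counts[b]])
    print(f"{a} vs {b}: adjusted p = {min(1.0, p_pair * len(pairs)):.4f}")
```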
Affiliation(s)
- Giacomo Rossettini
- School of Physiotherapy, University of Verona, Verona, Italy
- Department of Physiotherapy, Faculty of Sport Sciences, Universidad Europea de Madrid, Villaviciosa de Odón, 28670, Spain
- Lia Rodeghiero
- Department of Rehabilitation, Hospital of Merano (SABES-ASDAA), Teaching Hospital of Paracelsus Medical University (PMU), Merano-Meran, Italy
- Chad Cook
- Department of Orthopaedics, Duke University, Durham, NC, USA
- Duke Clinical Research Institute, Duke University, Durham, NC, USA
- Department of Population Health Sciences, Duke University, Durham, NC, USA
- Paolo Pillastrini
- Department of Biomedical and Neuromotor Sciences (DIBINEM), Alma Mater University of Bologna, Bologna, Italy
- Unit of Occupational Medicine, IRCCS Azienda Ospedaliero-Universitaria Di Bologna, Bologna, Italy
- Andrea Turolla
- Department of Biomedical and Neuromotor Sciences (DIBINEM), Alma Mater University of Bologna, Bologna, Italy
- Unit of Occupational Medicine, IRCCS Azienda Ospedaliero-Universitaria Di Bologna, Bologna, Italy
- Greta Castellini
- Unit of Clinical Epidemiology, IRCCS Istituto Ortopedico Galeazzi, Milan, Italy
- Silvia Gianola
- Unit of Clinical Epidemiology, IRCCS Istituto Ortopedico Galeazzi, Milan, Italy
- Alvisa Palese
- Department of Medical Sciences, University of Udine, Udine, Italy
12. Yanagita Y, Yokokawa D, Fukuzawa F, Uchida S, Uehara T, Ikusaka M. Expert assessment of ChatGPT's ability to generate illness scripts: an evaluative study. BMC Med Educ 2024;24:536. PMID: 38750546. PMCID: PMC11095028. DOI: 10.1186/s12909-024-05534-8.
Abstract
BACKGROUND An illness script is a specific script format geared to represent patient-oriented clinical knowledge, organized around enabling conditions, faults (i.e., the pathophysiological process), and consequences. Generative artificial intelligence (AI) stands out as an educational aid in continuing medical education. The effortless creation of a typical illness script by generative AI could aid the comprehension of key features of diseases and increase diagnostic accuracy. Because illness scripts are unique to each physician, no systematic summary of specific examples has been reported. OBJECTIVE This study investigated whether generative AI can generate illness scripts. METHODS We utilized ChatGPT-4, a generative AI, to create illness scripts for 184 diseases based on the diseases and conditions integral to the National Model Core Curriculum in Japan for undergraduate medical education (2022 revised edition) and to primary care specialist training in Japan. Three physicians applied a three-tier grading scale: "A" denotes that the content of a disease's illness script is sufficient for training medical students, "B" that it is partially lacking but acceptable, and "C" that it is deficient in multiple respects. RESULTS Leveraging ChatGPT-4, we generated every component of the illness script for all 184 diseases without omission. The illness scripts received "A," "B," and "C" ratings of 56.0% (103/184), 28.3% (52/184), and 15.8% (29/184), respectively. CONCLUSION Useful illness scripts were seamlessly and instantaneously created using ChatGPT-4 with prompts appropriate for medical students. The technology-driven illness script is a valuable tool for introducing medical students to the key features of diseases.
Affiliation(s)
- Yasutaka Yanagita
- Department of General Medicine, Chiba University Hospital, 1-8-1, Inohana, Chuo-Ku, Chiba, Chiba Pref, Japan
- Daiki Yokokawa
- Department of General Medicine, Chiba University Hospital, 1-8-1, Inohana, Chuo-Ku, Chiba, Chiba Pref, Japan
- Fumitoshi Fukuzawa
- Department of General Medicine, Chiba University Hospital, 1-8-1, Inohana, Chuo-Ku, Chiba, Chiba Pref, Japan
- Shun Uchida
- Uchida Internal Medicine Clinic, Saitama, Japan
- Takanori Uehara
- Department of General Medicine, Chiba University Hospital, 1-8-1, Inohana, Chuo-Ku, Chiba, Chiba Pref, Japan
- Masatomi Ikusaka
- Department of General Medicine, Chiba University Hospital, 1-8-1, Inohana, Chuo-Ku, Chiba, Chiba Pref, Japan
13. Bharatha A, Ojeh N, Fazle Rabbi AM, Campbell MH, Krishnamurthy K, Layne-Yarde RNA, Kumar A, Springer DCR, Connell KL, Majumder MAA. Comparing the Performance of ChatGPT-4 and Medical Students on MCQs at Varied Levels of Bloom's Taxonomy. Adv Med Educ Pract 2024;15:393-400. PMID: 38751805. PMCID: PMC11094742. DOI: 10.2147/amep.s457408.
Abstract
Introduction This research investigated the capabilities of ChatGPT-4 compared with medical students in answering multiple-choice questions (MCQs), using the revised Bloom's Taxonomy as a benchmark. Methods A cross-sectional study was conducted at The University of the West Indies, Barbados. ChatGPT-4 and medical students were assessed on MCQs from various medical courses using computer-based testing. Results The study included 304 MCQs. Students demonstrated good knowledge, with 78% correctly answering at least 90% of the questions. However, ChatGPT-4 achieved a higher overall score (73.7%) than the students (66.7%). Course type significantly affected ChatGPT-4's performance, but revised Bloom's Taxonomy levels did not. A detailed association check between program levels and Bloom's Taxonomy levels for ChatGPT-4's correct answers showed a highly significant correlation (p<0.001), reflecting a concentration of "remember-level" questions in preclinical courses and "evaluate-level" questions in clinical courses. Discussion The study highlights ChatGPT-4's proficiency on standardized tests but indicates limitations in clinical reasoning and practical skills. This performance discrepancy suggests that the effectiveness of artificial intelligence (AI) varies with course content. Conclusion While ChatGPT-4 shows promise as an educational tool, its role should be supplementary, with strategic integration into medical education to leverage its strengths and address its limitations. Further research is needed to explore AI's impact on medical education and student performance across educational levels and courses.
Affiliation(s)
- Ambadasu Bharatha
- Faculty of Medical Sciences, The University of the West Indies, Bridgetown, Barbados
- Nkemcho Ojeh
- Faculty of Medical Sciences, The University of the West Indies, Bridgetown, Barbados
- Michael H Campbell
- Faculty of Medical Sciences, The University of the West Indies, Bridgetown, Barbados
- Alok Kumar
- Faculty of Medical Sciences, The University of the West Indies, Bridgetown, Barbados
- Dale C R Springer
- Faculty of Medical Sciences, The University of the West Indies, Bridgetown, Barbados
- Kenneth L Connell
- Faculty of Medical Sciences, The University of the West Indies, Bridgetown, Barbados
14. Jedrzejczak WW, Skarzynski PH, Raj-Koziak D, Sanfins MD, Hatzopoulos S, Kochanek K. ChatGPT for Tinnitus Information and Support: Response Accuracy and Retest after Three and Six Months. Brain Sci 2024;14:465. PMID: 38790444. PMCID: PMC11118795. DOI: 10.3390/brainsci14050465.
Abstract
Testing of ChatGPT has recently been performed over a diverse range of topics. However, most of these assessments have been based on broad domains of knowledge. Here, we test ChatGPT's knowledge of tinnitus, an important but specialized aspect of audiology and otolaryngology. Testing involved evaluating ChatGPT's answers to a defined set of 10 questions on tinnitus. Furthermore, given that the technology is advancing quickly, we re-evaluated the responses to the same 10 questions 3 and 6 months later. The accuracy of the responses was rated by 6 experts (the authors) using a Likert scale ranging from 1 to 5. Most of ChatGPT's responses were rated as satisfactory or better. However, we did detect a few instances where the responses were not accurate and might be considered somewhat misleading. Over the first 3 months, the ratings generally improved, but there was no further significant improvement at 6 months. In our judgment, ChatGPT provided unexpectedly good responses, given that the questions were quite specific. Although no potentially harmful errors were identified, some mistakes could be seen as somewhat misleading. ChatGPT shows great potential if further developed by experts in specific areas, but for now, it is not yet ready for serious application.
Affiliation(s)
- W. Wiktor Jedrzejczak
- Department of Experimental Audiology, World Hearing Center, Institute of Physiology and Pathology of Hearing, 05-830 Kajetany, Poland
- Piotr H. Skarzynski
- Department of Teleaudiology and Screening, World Hearing Center, Institute of Physiology and Pathology of Hearing, 05-830 Kajetany, Poland
- Institute of Sensory Organs, 05-830 Kajetany, Poland
- Heart Failure and Cardiac Rehabilitation Department, Faculty of Medicine, Medical University of Warsaw, 03-242 Warsaw, Poland
- Danuta Raj-Koziak
- Tinnitus Department, World Hearing Center, Institute of Physiology and Pathology of Hearing, 05-830 Kajetany, Poland
- Milaine Dominici Sanfins
- Department of Teleaudiology and Screening, World Hearing Center, Institute of Physiology and Pathology of Hearing, 05-830 Kajetany, Poland
- Speech-Hearing-Language Department, Audiology Discipline, Universidade Federal de São Paulo, São Paulo 04023062, Brazil
- Stavros Hatzopoulos
- ENT and Audiology Unit, Department of Neurosciences and Rehabilitation, University of Ferrara, 44121 Ferrara, Italy
- Krzysztof Kochanek
- Department of Experimental Audiology, World Hearing Center, Institute of Physiology and Pathology of Hearing, 05-830 Kajetany, Poland
15. Zhu L, Mou W, Hong C, Yang T, Lai Y, Qi C, Lin A, Zhang J, Luo P. The Evaluation of Generative AI Should Include Repetition to Assess Stability. JMIR Mhealth Uhealth 2024;12:e57978. PMID: 38688841. PMCID: PMC11106698. DOI: 10.2196/57978.
Abstract
The increasing interest in the potential applications of generative artificial intelligence (AI) models like ChatGPT in health care has prompted numerous studies to explore its performance in various medical contexts. However, evaluating ChatGPT poses unique challenges due to the inherent randomness in its responses. Unlike traditional AI models, ChatGPT generates different responses for the same input, making it imperative to assess its stability through repetition. This commentary highlights the importance of including repetition in the evaluation of ChatGPT to ensure the reliability of conclusions drawn from its performance. Just as biological experiments often require multiple repetitions for validity, we argue that assessing generative AI models like ChatGPT demands the same approach. Failure to acknowledge the impact of repetition can lead to biased conclusions and undermine the credibility of research findings. We urge researchers to incorporate appropriate repetition in their studies from the outset and to transparently report their methods to enhance the robustness and reproducibility of findings in this rapidly evolving field.
Affiliation(s)
- Lingxuan Zhu
- Department of Oncology, Zhujiang Hospital, Southern Medical University, Guangzhou, China
- Weiming Mou
- Department of Urology, Shanghai General Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- Chenglin Hong
- Department of Oncology, Zhujiang Hospital, Southern Medical University, Guangzhou, China
- Tao Yang
- Department of Medical Oncology, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
- Yancheng Lai
- Department of Oncology, Zhujiang Hospital, Southern Medical University, Guangzhou, China
- Chang Qi
- Institute of Logic and Computation, TU Wien, Vienna, Austria
- Anqi Lin
- Department of Oncology, Zhujiang Hospital, Southern Medical University, Guangzhou, China
- Jian Zhang
- Department of Oncology, Zhujiang Hospital, Southern Medical University, Guangzhou, China
- Peng Luo
- Department of Oncology, Zhujiang Hospital, Southern Medical University, Guangzhou, China
16. Kawahara T, Sumi Y. GPT-4/4V's performance on the Japanese National Medical Licensing Examination. Med Teach 2024:1-8. PMID: 38648547. DOI: 10.1080/0142159x.2024.2342545.
Abstract
BACKGROUND Recent advances in Artificial Intelligence (AI) are changing the medical world, and AI will likely replace many tasks performed by medical professionals. AI's overall clinical ability has so far been evaluated through its ability to answer text-based national medical examinations. This study uniquely assesses the performance of OpenAI's ChatGPT on the full Japanese National Medical Licensing Examination (NMLE), including questions with images, illustrations, and pictures. METHODS We obtained the questions of the past six years of the NMLE (112th to 117th) from the Japanese Ministry of Health, Labour and Welfare website and converted them to JavaScript Object Notation (JSON) format. We used an application programming interface (API) to output answers, using GPT-4 for questions without images and GPT-4V(ision) or the GPT-4 console for questions with images. RESULTS Image questions accounted for 723/2400 (30.1%) of the questions over the past six years. In all years, GPT-4/4V exceeded the minimum passing score. In total, over the six years, the percentage of correct answers was 665/905 (73.5%) for basic medical knowledge questions, 1143/1531 (74.7%) for clinical knowledge questions, and 497/723 (68.7%) for image questions. CONCLUSIONS Regarding medical knowledge, GPT-4/4V met the minimum criteria regardless of whether the questions included images, illustrations, or pictures. Our study sheds light on the potential utility of AI in medical education.
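The question-answering pipeline described here (exam items serialized to JSON, then sent programmatically to the model) can be sketched as below; the model name, JSON fields, and prompt wording are illustrative assumptions, not the authors' actual code:

```python
import json
from openai import OpenAI  # assumes the official openai Python package (v1+)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A hypothetical NMLE item serialized to JSON, as described in the abstract
item = json.loads("""{"id": "117A-1",
                      "question": "A 65-year-old man presents with ...",
                      "choices": {"a": "...", "b": "...", "c": "...",
                                  "d": "...", "e": "..."}}""")

prompt = ("Answer the following exam question with the letter of the single "
          "best choice.\n" + item["question"] + "\n"
          + "\n".join(f"{k}: {v}" for k, v in item["choices"].items()))

response = client.chat.completions.create(
    model="gpt-4",  # the study used GPT-4V for questions with images
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```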
Affiliation(s)
- Tomoki Kawahara
- Department of Clinical Information Applied Sciences, Tokyo Medical and Dental University, Tokyo, Japan
- Yuki Sumi
- Department of Clinical Information Applied Sciences, Tokyo Medical and Dental University, Tokyo, Japan
17. Pinto VBP, de Azevedo MF, Wroclawski ML, Gentile G, Jesus VLM, de Bessa Junior J, Nahas WC, Sacomani CAR, Sandhu JS, Gomes CM. Conformity of ChatGPT recommendations with the AUA/SUFU guideline on postprostatectomy urinary incontinence. Neurourol Urodyn 2024;43:935-941. PMID: 38451040. DOI: 10.1002/nau.25442.
Abstract
INTRODUCTION Artificial intelligence (AI) shows immense potential in medicine, and Chat generative pretrained transformer (ChatGPT) has been used for various purposes in the field. However, it may not match the complexity and nuance of certain medical scenarios. This study evaluates the accuracy of ChatGPT 3.5 and 4 in providing recommendations on the management of postprostatectomy urinary incontinence (PPUI), considering the Incontinence After Prostate Treatment AUA/SUFU Guideline as the best-practice benchmark. MATERIALS AND METHODS A set of questions based on the AUA/SUFU Guideline was prepared, comprising 10 conceptual questions and 10 case-based questions. All questions were open-ended and entered into ChatGPT with a recommendation to limit each answer to 200 words, for greater objectivity. Responses were graded as correct (1 point), partially correct (0.5 point), or incorrect (0 points). The performance of versions 3.5 and 4 of ChatGPT was analyzed overall and separately for the conceptual and case-based questions. RESULTS ChatGPT 3.5 scored 11.5 of 20 points (57.5% accuracy), while ChatGPT 4 scored 18 (90.0%; p = 0.031). On the conceptual questions, ChatGPT 3.5 provided accurate answers to six questions, one partially correct response, and three incorrect answers, for a final score of 6.5; ChatGPT 4 answered eight questions correctly and two partially correctly, scoring 9.0. On the case-based questions, ChatGPT 3.5 scored 5.0 and ChatGPT 4 scored 9.0. The domains where ChatGPT performed worst were evaluation, treatment options, surgical complications, and special situations. CONCLUSION ChatGPT 4 demonstrated superior performance to ChatGPT 3.5 in providing recommendations for the management of PPUI, using the AUA/SUFU Guideline as a benchmark. Continuous monitoring is essential for evaluating the development and precision of AI-generated medical information.
Affiliation(s)
- Vicktor B P Pinto
- Division of Urology, University of Sao Paulo School of Medicine, Sao Paulo, Brazil
- Matheus F de Azevedo
- Division of Urology, University of Sao Paulo School of Medicine, Sao Paulo, Brazil
- Marcelo L Wroclawski
- Division of Urology, ABC Medical School, Sao Paulo, Brazil
- Department of Urology, Albert Einstein Jewish Hospital, Sao Paulo, Brazil
- Department of Urologic Oncology, BP-a Beneficência Portuguesa de São Paulo, Sao Paulo, Brazil
- Guilherme Gentile
- Division of Urology, University of Sao Paulo School of Medicine, Sao Paulo, Brazil
- Vinicius L M Jesus
- Division of Urology, University of Sao Paulo School of Medicine, Sao Paulo, Brazil
- William C Nahas
- Division of Urology, University of Sao Paulo School of Medicine, Sao Paulo, Brazil
- Carlos A R Sacomani
- Innovation and Information Technology Sector, AC Camargo Cancer Hospital, Sao Paulo, Brazil
- Jaspreet S Sandhu
- Department of Surgery/Urology, Memorial Sloan Kettering Cancer Center, New York, New York, USA
- Cristiano M Gomes
- Division of Urology, University of Sao Paulo School of Medicine, Sao Paulo, Brazil
18. Noda M, Ueno T, Koshu R, Takaso Y, Shimada MD, Saito C, Sugimoto H, Fushiki H, Ito M, Nomura A, Yoshizaki T. Performance of GPT-4V in Answering the Japanese Otolaryngology Board Certification Examination Questions: Evaluation Study. JMIR Med Educ 2024;10:e57054. PMID: 38546736. PMCID: PMC11009855. DOI: 10.2196/57054.
Abstract
BACKGROUND Artificial intelligence models can learn from medical literature and clinical cases and generate answers that rival human experts. However, challenges remain in the analysis of complex data containing images and diagrams. OBJECTIVE This study aims to assess the answering capability and accuracy of ChatGPT-4 Vision (GPT-4V) on a set of 100 questions, including image-based questions, from the 2023 otolaryngology board certification examination. METHODS Answers to 100 questions from the 2023 otolaryngology board certification examination, including image-based questions, were generated using GPT-4V. The accuracy rate was evaluated using different prompts, and the effects of the presence of images, the clinical area of the questions, and variations in answer content were examined. RESULTS The accuracy rate for text-only input was, on average, 24.7% but improved to 47.3% with the addition of English translation and prompts (P<.001). The average nonresponse rate for text-only input was 46.3%; this decreased to 2.7% with the addition of English translation and prompts (P<.001). The accuracy rate was lower for image-based questions than for text-only questions across all types of input, with a relatively high nonresponse rate. General questions and questions from the fields of head and neck allergies and nasal allergies had relatively high accuracy rates, which increased with the addition of translation and prompts. In terms of content, questions related to anatomy had the highest accuracy rate; for all content types, the addition of translation and prompts increased the accuracy rate. For image-based questions, the average correct answer rate with text-only input was 30.4%, versus 41.3% with text-plus-image input (P=.02). CONCLUSIONS Examining the answering capability of artificial intelligence on the otolaryngology board certification examination improves our understanding of its potential and limitations in this field. Although accuracy improved with the addition of translation and prompts, the accuracy rate for image-based questions remained lower than that for text-based questions, suggesting room for improvement in GPT-4V at this stage. Furthermore, text-plus-image input achieved a higher correct answer rate on image-based questions. Our findings imply the usefulness and potential of GPT-4V in medicine; however, safe methods of use require future consideration.
Affiliation(s)
- Masao Noda
- Department of Otolaryngology and Head and Neck Surgery, Jichi Medical University, Shimotsuke, Japan
- Department of Otolaryngology and Head and Neck Surgery, Kanazawa University, Kanazawa, Japan
- Takayoshi Ueno
- Department of Otolaryngology and Head and Neck Surgery, Kanazawa University, Kanazawa, Japan
- Ryota Koshu
- Department of Otolaryngology and Head and Neck Surgery, Jichi Medical University, Shimotsuke, Japan
- Yuji Takaso
- Department of Otolaryngology and Head and Neck Surgery, Jichi Medical University, Shimotsuke, Japan
- Department of Otolaryngology and Head and Neck Surgery, Kanazawa University, Kanazawa, Japan
- Mari Dias Shimada
- Department of Otolaryngology and Head and Neck Surgery, Jichi Medical University, Shimotsuke, Japan
- Chizu Saito
- Department of Otolaryngology and Head and Neck Surgery, Jichi Medical University, Shimotsuke, Japan
- Hisashi Sugimoto
- Department of Otolaryngology and Head and Neck Surgery, Kanazawa University, Kanazawa, Japan
- Hiroaki Fushiki
- Department of Otolaryngology, Mejiro University Ear Institute Clinic, Saitama, Japan
- Makoto Ito
- Department of Otolaryngology and Head and Neck Surgery, Jichi Medical University, Shimotsuke, Japan
- Akihiro Nomura
- College of Transdisciplinary Sciences for Innovation, Kanazawa University, Kanazawa, Japan
- Tomokazu Yoshizaki
- Department of Otolaryngology and Head and Neck Surgery, Kanazawa University, Kanazawa, Japan
19. Nakao T, Miki S, Nakamura Y, Kikuchi T, Nomura Y, Hanaoka S, Yoshikawa T, Abe O. Capability of GPT-4V(ision) in the Japanese National Medical Licensing Examination: Evaluation Study. JMIR Med Educ 2024;10:e54393. PMID: 38470459. DOI: 10.2196/54393.
Abstract
BACKGROUND Previous research applying large language models (LLMs) to medicine focused on text-based information. Recently, multimodal variants of LLMs have acquired the capability to recognize images. OBJECTIVE We aimed to evaluate the image recognition capability of Generative Pretrained Transformer (GPT)-4V, a recent multimodal LLM developed by OpenAI, in the medical field by testing how visual information affects its performance in answering questions from the 117th Japanese National Medical Licensing Examination. METHODS We focused on 108 questions that included 1 or more images and presented GPT-4V with the same questions under two conditions: (1) with both the question text and associated images and (2) with the question text only. We then compared the difference in accuracy between the 2 conditions using the exact McNemar test. RESULTS Among the 108 questions with images, GPT-4V's accuracy was 68% (73/108) when presented with images and 72% (78/108) when presented without images (P=.36). For the 2 question categories, clinical and general, the accuracies with and without images were 71% (70/98) versus 78% (76/98; P=.21) and 30% (3/10) versus 20% (2/10; P≥.99), respectively. CONCLUSIONS The additional information from the images did not significantly improve the performance of GPT-4V on the Japanese National Medical Licensing Examination.
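Because each question is answered under both conditions, the with-image versus without-image comparison is a paired design, which is exactly what the exact McNemar test handles. The sketch below is a minimal illustration using statsmodels; the 2x2 cell counts are made up so that only the marginals match the reported 73/108 and 78/108, so the computed p-value will differ from the published P=.36.

```python
# Minimal sketch of the exact McNemar test for paired with/without-image accuracy.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# 2x2 table of paired per-question outcomes:
#                 without-image correct   without-image wrong
# with correct            a                       b
# with wrong              c                       d
# Illustrative values only; chosen so marginals match 73/108 (with) and 78/108 (without).
table = np.array([[68, 5],
                  [10, 25]])

result = mcnemar(table, exact=True)  # exact binomial test on the discordant pairs (b, c)
print(f"statistic={result.statistic}, p={result.pvalue:.3f}")
```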
Affiliation(s)
- Takahiro Nakao: Department of Computational Diagnostic Radiology and Preventive Medicine, The University of Tokyo Hospital, Bunkyo-ku, Tokyo, Japan
- Soichiro Miki: Department of Computational Diagnostic Radiology and Preventive Medicine, The University of Tokyo Hospital, Bunkyo-ku, Tokyo, Japan
- Yuta Nakamura: Department of Computational Diagnostic Radiology and Preventive Medicine, The University of Tokyo Hospital, Bunkyo-ku, Tokyo, Japan
- Tomohiro Kikuchi: Department of Computational Diagnostic Radiology and Preventive Medicine, The University of Tokyo Hospital, Bunkyo-ku, Tokyo, Japan; Department of Radiology, School of Medicine, Jichi Medical University, Shimotsuke, Tochigi, Japan
- Yukihiro Nomura: Department of Computational Diagnostic Radiology and Preventive Medicine, The University of Tokyo Hospital, Bunkyo-ku, Tokyo, Japan; Center for Frontier Medical Engineering, Chiba University, Inage-ku, Chiba, Japan
- Shouhei Hanaoka: Department of Radiology, The University of Tokyo Hospital, Bunkyo-ku, Tokyo, Japan
- Takeharu Yoshikawa: Department of Computational Diagnostic Radiology and Preventive Medicine, The University of Tokyo Hospital, Bunkyo-ku, Tokyo, Japan
- Osamu Abe: Department of Radiology, The University of Tokyo Hospital, Bunkyo-ku, Tokyo, Japan
20
Sato H, Ogasawara K. ChatGPT (GPT-4) passed the Japanese National License Examination for Pharmacists in 2022, answering all items including those with diagrams: a descriptive study. JOURNAL OF EDUCATIONAL EVALUATION FOR HEALTH PROFESSIONS 2024; 21:4. [PMID: 38413129 PMCID: PMC10948916 DOI: 10.3352/jeehp.2024.21.4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 12/29/2023] [Accepted: 02/28/2024] [Indexed: 02/29/2024]
Abstract
PURPOSE The objective of this study was to assess the performance of ChatGPT (GPT-4) on all items, including those with diagrams, in the Japanese National License Examination for Pharmacists (JNLEP) and to compare it with the performance of the previous GPT-3.5 model. METHODS The 107th JNLEP, conducted in 2022, was targeted in this study; all 344 items were input into the GPT-4 model. Separately, 284 items, excluding those with diagrams, were entered into the GPT-3.5 model. The answers were categorized and analyzed to determine accuracy rates by category, subject, and the presence or absence of diagrams. The accuracy rates were compared with the main passing criterion (overall accuracy rate ≥62.9%). RESULTS The overall accuracy rate of GPT-4 for all items in the 107th JNLEP was 72.5%, successfully meeting all the passing criteria. For the set of items without diagrams, the accuracy rate was 80.0%, which was significantly higher than that of the GPT-3.5 model (43.5%). The GPT-4 model demonstrated an accuracy rate of 36.1% for items that included diagrams. CONCLUSION Advancements that allow GPT-4 to process images have made it possible for large language models (LLMs) to answer all items in medical-related license examinations. This study's findings confirm that ChatGPT (GPT-4) possesses sufficient knowledge to meet the passing criteria.
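The pass/fail logic described above reduces to comparing per-category accuracy rates against the 62.9% criterion. The sketch below illustrates that arithmetic with stand-in per-item data chosen only to approximate the reported rates; the study's actual item-level results are not available from the abstract.

```python
# Minimal sketch: per-category accuracy rates versus the JNLEP passing criterion.
# Assumption: results is a list of (has_diagram, is_correct) pairs; values are illustrative.
PASSING_CRITERION = 0.629  # overall accuracy rate >= 62.9%

def accuracy(items):
    """Fraction of correct answers in a list of (has_diagram, is_correct) pairs."""
    return sum(correct for _, correct in items) / len(items)

# Stand-in data: 284 items without diagrams, 60 with diagrams (344 total),
# with correct counts chosen to approximate the reported 80.0% and 36.1%.
results = [(False, True)] * 227 + [(False, False)] * 57 + \
          [(True, True)] * 22 + [(True, False)] * 38

overall = accuracy(results)
no_diagram = accuracy([r for r in results if not r[0]])
with_diagram = accuracy([r for r in results if r[0]])

print(f"overall: {overall:.1%} (pass: {overall >= PASSING_CRITERION})")
print(f"without diagrams: {no_diagram:.1%}, with diagrams: {with_diagram:.1%}")
```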
Affiliation(s)
- Hiroyasu Sato: Department of Pharmacy, Abashiri Kosei General Hospital, Abashiri, Japan
- Katsuhiko Ogasawara: Graduate School of Health Sciences, Hokkaido University, Sapporo, Japan; Graduate School of Engineering, Muroran Institute of Technology, Muroran, Japan
21
Sallam M, Barakat M, Sallam M. A Preliminary Checklist (METRICS) to Standardize the Design and Reporting of Studies on Generative Artificial Intelligence-Based Models in Health Care Education and Practice: Development Study Involving a Literature Review. Interact J Med Res 2024; 13:e54704. [PMID: 38276872 PMCID: PMC10905357 DOI: 10.2196/54704] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/19/2023] [Revised: 12/18/2023] [Accepted: 01/26/2024] [Indexed: 01/27/2024] Open
Abstract
BACKGROUND Adherence to evidence-based practice is indispensable in health care. Recently, the utility of generative artificial intelligence (AI) models in health care has been evaluated extensively. However, the lack of consensus guidelines on the design and reporting of findings of these studies poses a challenge for the interpretation and synthesis of evidence. OBJECTIVE This study aimed to develop a preliminary checklist to standardize the reporting of generative AI-based studies in health care education and practice. METHODS A literature review was conducted in Scopus, PubMed, and Google Scholar. Published records with "ChatGPT," "Bing," or "Bard" in the title were retrieved. Careful examination of the methodologies employed in the included records was conducted to identify the common pertinent themes and the possible gaps in reporting. A panel discussion was held to establish a unified and thorough checklist for the reporting of AI studies in health care. The finalized checklist was used to evaluate the included records by 2 independent raters. Cohen κ was used to evaluate the interrater reliability. RESULTS The final data set that formed the basis for pertinent theme identification and analysis comprised a total of 34 records. The finalized checklist included 9 pertinent themes collectively referred to as METRICS (Model, Evaluation, Timing/Transparency, Range/Randomization, Individual factors, Count, and Specificity of prompts and language). Their details are as follows: (1) Model used and its exact settings; (2) Evaluation approach for the generated content; (3) Timing of testing the model; (4) Transparency of the data source; (5) Range of tested topics; (6) Randomization of selecting the queries; (7) Individual factors in selecting the queries and interrater reliability; (8) Count of queries executed to test the model; and (9) Specificity of the prompts and language used. The overall mean METRICS score was 3.0 (SD 0.58). Interrater reliability was acceptable, with Cohen κ values ranging from 0.558 to 0.962 (P<.001 for the 9 tested items). With classification per item, the highest average METRICS score was recorded for the "Model" item, followed by the "Specificity" item, while the lowest scores were recorded for the "Randomization" item (classified as suboptimal) and the "Individual factors" item (classified as satisfactory). CONCLUSIONS The METRICS checklist can facilitate the design of studies and guide researchers toward best practices in reporting results. The findings highlight the need for standardized reporting algorithms for generative AI-based studies in health care, considering the variability observed in methodologies and reporting. The proposed METRICS checklist could be a helpful preliminary base for establishing a universally accepted approach to standardize the design and reporting of generative AI-based studies in health care, which is a swiftly evolving research topic.
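Interrater reliability for each METRICS item was summarized with Cohen κ, which compares two raters' agreement against the agreement expected by chance. The sketch below is a minimal illustration of that computation, assuming two raters scored the same 34 records on a 5-point scale; the score vectors are made up.

```python
# Minimal sketch: interrater reliability for one checklist item via Cohen's kappa.
# Assumption: two raters scored the same 34 records on a 5-point scale (made-up values).
from sklearn.metrics import cohen_kappa_score

rater_a = [5, 4, 4, 3, 5, 2, 4, 3, 5, 4, 3, 3, 4, 5, 2, 4, 3,
           4, 5, 3, 4, 4, 2, 5, 3, 4, 3, 5, 4, 3, 4, 2, 5, 4]
rater_b = [5, 4, 3, 3, 5, 2, 4, 3, 5, 4, 3, 4, 4, 5, 2, 4, 3,
           4, 5, 3, 4, 4, 2, 5, 3, 4, 3, 5, 4, 3, 4, 3, 5, 4]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen kappa for this item: {kappa:.3f}")
```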
Affiliation(s)
- Malik Sallam: Department of Pathology, Microbiology and Forensic Medicine, School of Medicine, The University of Jordan, Amman, Jordan; Department of Clinical Laboratories and Forensic Medicine, Jordan University Hospital, Amman, Jordan; Department of Translational Medicine, Faculty of Medicine, Lund University, Malmo, Sweden
- Muna Barakat: Department of Clinical Pharmacy and Therapeutics, Faculty of Pharmacy, Applied Science Private University, Amman, Jordan
- Mohammed Sallam: Department of Pharmacy, Mediclinic Parkview Hospital, Mediclinic Middle East, Dubai, United Arab Emirates
22
Kim JH, Kim SK, Choi J, Lee Y. Reliability of ChatGPT for performing triage task in the emergency department using the Korean Triage and Acuity Scale. Digit Health 2024; 10:20552076241227132. [PMID: 38250148 PMCID: PMC10798071 DOI: 10.1177/20552076241227132] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/14/2023] [Accepted: 12/28/2023] [Indexed: 01/23/2024] Open
Abstract
Background Artificial intelligence (AI) technology can enable more efficient decision-making in healthcare settings. There is growing interest in improving the speed and accuracy of AI systems in responding to given tasks in these settings. Objective This study aimed to assess the reliability of ChatGPT in performing emergency department (ED) triage using the Korean Triage and Acuity Scale (KTAS). Methods Two hundred and two virtual patient cases were built. The gold standard triage classification for each case was established by an experienced ED physician. Three other human raters (ED paramedics) rated the virtual cases individually. The virtual cases were also rated by two versions of the Chat Generative Pre-trained Transformer (ChatGPT, versions 3.5 and 4.0). Inter-rater reliability was examined using Fleiss' kappa and the intra-class correlation coefficient (ICC). Results The kappa values for agreement between the four human raters and ChatGPT were .523 (version 4.0) and .320 (version 3.5). Of the five levels, performance was poor when rating patients at levels 1 and 5, as well as for case scenarios with additional text descriptions. Accuracy also differed between the GPT versions: the ICC between version 3.5 and the gold standard was .520, and that between version 4.0 and the gold standard was .802. Conclusions A substantial level of inter-rater reliability was observed when GPTs were used as KTAS raters. This study demonstrates the potential of using GPT in emergency healthcare settings. Considering the shortage of experienced manpower, this AI method may help improve triage accuracy.
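Agreement among multiple raters assigning ordinal categories such as KTAS levels is what Fleiss' kappa measures. The sketch below is a minimal illustration using statsmodels, with made-up triage levels for a handful of virtual cases (pingouin's intraclass_corr could be used analogously for the ICC against the gold standard).

```python
# Minimal sketch: Fleiss' kappa across multiple raters assigning KTAS levels 1-5.
# Assumption: ratings is an (n_cases x n_raters) array of made-up triage levels.
import numpy as np
from statsmodels.stats import inter_rater as irr

# Rows = virtual cases, columns = raters (e.g., 3 paramedics + ChatGPT); toy data.
ratings = np.array([
    [3, 3, 3, 3],
    [2, 2, 3, 2],
    [4, 4, 4, 5],
    [1, 1, 2, 1],
    [5, 5, 5, 5],
    [3, 2, 3, 3],
])

# aggregate_raters converts subject-by-rater labels into subject-by-category counts.
counts, _categories = irr.aggregate_raters(ratings)
kappa = irr.fleiss_kappa(counts, method="fleiss")
print(f"Fleiss kappa: {kappa:.3f}")
```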
Affiliation(s)
- Jae Hyuk Kim: Department of Emergency Medicine, Mokpo Hankook Hospital, Jeonnam, South Korea
- Sun Kyung Kim: Department of Nursing, Mokpo National University, Jeonnam, South Korea; Department of Biomedicine, Health & Life Convergence Sciences, Biomedical and Healthcare Research Institute, Jeonnam, South Korea
- Jongmyung Choi: Department of Computer Engineering, Mokpo National University, Jeonnam, South Korea
- Youngho Lee: Department of Computer Engineering, Mokpo National University, Jeonnam, South Korea