1. Wu J, Wu X, Qiu Z, Li M, Lin S, Zhang Y, Zheng Y, Yuan C, Yang J. Large language models leverage external knowledge to extend clinical insight beyond language boundaries. J Am Med Inform Assoc 2024;31:2054-2064. [PMID: 38684792; PMCID: PMC11339525; DOI: 10.1093/jamia/ocae079]
Abstract
OBJECTIVES Large Language Models (LLMs) such as ChatGPT and Med-PaLM have excelled in various medical question-answering tasks. However, these English-centric models encounter challenges in non-English clinical settings, primarily due to limited clinical knowledge in the respective languages, a consequence of imbalanced training corpora. We systematically evaluate LLMs in the Chinese medical context and develop a novel in-context learning framework to enhance their performance. MATERIALS AND METHODS The latest China National Medical Licensing Examination (CNMLE-2022) served as the benchmark. We collected 53 medical books and 381 149 medical questions to construct the medical knowledge base and question bank. The proposed Knowledge and Few-shot Enhancement In-context Learning (KFE) framework leverages the in-context learning ability of LLMs to integrate diverse external clinical knowledge sources. We evaluated KFE with ChatGPT (GPT-3.5), GPT-4, Baichuan2-7B, Baichuan2-13B, and QWEN-72B on CNMLE-2022 and further investigated the effectiveness of different pathways for incorporating medical knowledge into LLMs from 7 distinct perspectives. RESULTS Applied directly, ChatGPT failed to qualify for the CNMLE-2022, with a score of 51. When combined with the KFE framework, LLMs of varying sizes yielded consistent and significant improvements. ChatGPT's performance surged to 70.04 and GPT-4 achieved the highest score of 82.59. This surpasses the qualification threshold (60) and exceeds the average human score of 68.70, affirming the effectiveness and robustness of the framework. It also enabled a smaller Baichuan2-13B to pass the examination, showcasing the great potential of the approach in low-resource settings. DISCUSSION AND CONCLUSION This study sheds light on optimal practices for enhancing the capabilities of LLMs in non-English medical scenarios. By synergizing medical knowledge through in-context learning, LLMs can extend clinical insight beyond language barriers in healthcare, significantly reducing language-related disparities of LLM applications and ensuring global benefit in this field.
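The abstract does not reproduce the KFE prompting pipeline itself. As a rough orientation, the minimal Python sketch below illustrates the general pattern of knowledge- and few-shot-enhanced in-context learning it describes: retrieve related knowledge passages and similar solved questions, then assemble them into one prompt. The tiny in-memory corpora and the TF-IDF retriever are illustrative placeholders, not the authors' implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-ins for the medical knowledge base and solved question bank
knowledge_base = [
    "Aspirin irreversibly inhibits cyclooxygenase-1 in platelets.",
    "Warfarin antagonizes vitamin K-dependent clotting factor synthesis.",
]
solved_examples = [
    ("Which drug irreversibly inhibits platelet COX-1?", "Aspirin"),
]

def retrieve(query: str, docs: list, k: int = 1) -> list:
    """Rank documents by TF-IDF cosine similarity to the query and keep the top k."""
    vectorizer = TfidfVectorizer().fit(docs + [query])
    sims = cosine_similarity(vectorizer.transform([query]), vectorizer.transform(docs))[0]
    return [docs[i] for i in sims.argsort()[::-1][:k]]

def build_prompt(question: str) -> str:
    """Assemble retrieved knowledge and few-shot examples around the exam question."""
    knowledge = "\n".join(retrieve(question, knowledge_base, k=1))
    shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in solved_examples)
    return (f"Reference knowledge:\n{knowledge}\n\n"
            f"Worked examples:\n{shots}\n\n"
            f"Q: {question}\nA:")

print(build_prompt("Which anticoagulant blocks vitamin K-dependent factors?"))
```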
Affiliation(s)
- Jiageng Wu
- School of Public Health, Zhejiang University School of Medicine, Hangzhou, 310058, China
- Xian Wu
- Jarvis Research Center, Tencent YouTu Lab, Beijing, 100101, China
- Zhaopeng Qiu
- Jarvis Research Center, Tencent YouTu Lab, Beijing, 100101, China
- Minghui Li
- School of Public Health, Zhejiang University School of Medicine, Hangzhou, 310058, China
- Shixu Lin
- School of Public Health, Zhejiang University School of Medicine, Hangzhou, 310058, China
- Yingying Zhang
- Jarvis Research Center, Tencent YouTu Lab, Beijing, 100101, China
- Yefeng Zheng
- Jarvis Research Center, Tencent YouTu Lab, Beijing, 100101, China
- Changzheng Yuan
- School of Public Health, Zhejiang University School of Medicine, Hangzhou, 310058, China
- Department of Nutrition, Harvard T.H. Chan School of Public Health, Boston, MA 02115, United States
- Jie Yang
- School of Public Health, Zhejiang University School of Medicine, Hangzhou, 310058, China
- Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA 02115, United States

2. Zhao Y, Coppola A, Karamchandani U, Amiras D, Gupte CM. Artificial intelligence applied to magnetic resonance imaging reliably detects the presence, but not the location, of meniscus tears: a systematic review and meta-analysis. Eur Radiol 2024;34:5954-5964. [PMID: 38386028; PMCID: PMC11364796; DOI: 10.1007/s00330-024-10625-7]
Abstract
OBJECTIVES To review and compare the accuracy of convolutional neural networks (CNN) for the diagnosis of meniscal tears in the current literature and analyze the decision-making processes utilized by these CNN algorithms. MATERIALS AND METHODS PubMed, MEDLINE, EMBASE, and Cochrane databases up to December 2022 were searched in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-analysis (PRISMA) statement. Risk of bias analysis was performed for all identified articles. Predictive performance values, including sensitivity and specificity, were extracted for quantitative analysis. The meta-analysis was divided between AI prediction models identifying the presence of meniscus tears and those identifying the location of meniscus tears. RESULTS Eleven articles were included in the final review, with a total of 13,467 patients and 57,551 images. Heterogeneity was statistically significant and large for the sensitivity of the tear identification analysis (I2 = 79%). A higher level of accuracy was observed in identifying the presence of a meniscal tear than in locating tears in specific regions of the meniscus (AUC, 0.939 vs 0.905). Pooled sensitivity and specificity were 0.87 (95% confidence interval (CI) 0.80-0.91) and 0.89 (95% CI 0.83-0.93) for meniscus tear identification and 0.88 (95% CI 0.82-0.91) and 0.84 (95% CI 0.81-0.85) for locating the tears. CONCLUSIONS AI prediction models achieved favorable performance in the diagnosis, but not location, of meniscus tears. Further studies on the clinical utilities of deep learning should include standardized reporting, external validation, and full reports of the predictive performances of these models, with a view to localizing tears more accurately. CLINICAL RELEVANCE STATEMENT Meniscus tears are hard to diagnose on knee magnetic resonance images. AI prediction models may play an important role in improving the diagnostic accuracy of clinicians and radiologists. KEY POINTS • Artificial intelligence (AI) provides great potential in improving the diagnosis of meniscus tears. • The pooled diagnostic performance of artificial intelligence (AI) in identifying meniscus tears (sensitivity 87%, specificity 89%) was better than in locating the tears (sensitivity 88%, specificity 84%). • AI is good at confirming the diagnosis of meniscus tears, but future work is required to guide the management of the disease.
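For readers less familiar with the pooled estimates and the I² statistic quoted above, a schematic Python sketch of random-effects (DerSimonian-Laird) pooling of study-level sensitivities follows. The (TP, FN) counts are invented for illustration; the published analysis used a fuller bivariate model of sensitivity and specificity.

```python
import numpy as np

# Hypothetical (true positive, false negative) counts for four studies
studies = [(80, 12), (45, 9), (150, 30), (60, 5)]

y = np.array([np.log(tp / fn) for tp, fn in studies])    # logit(sensitivity) per study
v = np.array([1 / tp + 1 / fn for tp, fn in studies])    # variance of each logit

w = 1 / v                                                 # fixed-effect weights
q = float(np.sum(w * (y - np.sum(w * y) / np.sum(w)) ** 2))  # Cochran's Q
df = len(studies) - 1
i2 = max(0.0, (q - df) / q) * 100                         # I^2 heterogeneity (%)

tau2 = max(0.0, (q - df) / (np.sum(w) - np.sum(w**2) / np.sum(w)))  # DL between-study variance
w_re = 1 / (v + tau2)                                     # random-effects weights
pooled_logit = np.sum(w_re * y) / np.sum(w_re)
se = np.sqrt(1 / np.sum(w_re))

def expit(x):
    return 1 / (1 + np.exp(-x))

print(f"pooled sensitivity = {expit(pooled_logit):.2f} "
      f"(95% CI {expit(pooled_logit - 1.96 * se):.2f}-{expit(pooled_logit + 1.96 * se):.2f}), "
      f"I^2 = {i2:.0f}%")
```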
Affiliation(s)
- Yi Zhao
- Imperial College London School of Medicine, Exhibition Rd, South Kensington, London, SW7 2BU, UK.
- Andrew Coppola
- Imperial College London School of Medicine, Exhibition Rd, South Kensington, London, SW7 2BU, UK
- Dimitri Amiras
- Imperial College London School of Medicine, Exhibition Rd, South Kensington, London, SW7 2BU, UK
- Imperial College London NHS Trust, London, UK
- Chinmay M Gupte
- Imperial College London School of Medicine, Exhibition Rd, South Kensington, London, SW7 2BU, UK
- Imperial College London NHS Trust, London, UK

3. Ray PP. Integrating AI in radiology: insights from GPT-generated reports and multimodal LLM performance on European Board of Radiology examinations. Jpn J Radiol 2024;42:1083-1084. [PMID: 38647884; DOI: 10.1007/s11604-024-01576-6]

4. Hayden N, Gilbert S, Poisson LM, Griffith B, Klochko C. Performance of GPT-4 with Vision on Text- and Image-based ACR Diagnostic Radiology In-Training Examination Questions. Radiology 2024;312:e240153. [PMID: 39225605; DOI: 10.1148/radiol.240153]
Abstract
Background Recent advancements, including image processing capabilities, present new potential applications of large language models such as ChatGPT (OpenAI), a generative pretrained transformer, in radiology. However, baseline performance of ChatGPT in radiology-related tasks is understudied. Purpose To evaluate the performance of GPT-4 with vision (GPT-4V) on radiology in-training examination questions, including those with images, to gauge the model's baseline knowledge in radiology. Materials and Methods In this prospective study, conducted between September 2023 and March 2024, the September 2023 release of GPT-4V was assessed using 386 retired questions (189 image-based and 197 text-only questions) from the American College of Radiology Diagnostic Radiology In-Training Examinations. Nine question pairs were identified as duplicates; only the first instance of each duplicate was considered in ChatGPT's assessment. A subanalysis assessed the impact of different zero-shot prompts on performance. Statistical analysis included χ2 tests of independence to ascertain whether the performance of GPT-4V varied between question types or subspecialties. The McNemar test was used to evaluate performance differences between the prompts, with Benjamini-Hochberg adjustment of the P values conducted to control the false discovery rate (FDR). A P value threshold of less than .05 denoted statistical significance. Results GPT-4V correctly answered 246 (65.3%) of the 377 unique questions, with significantly higher accuracy on text-only questions (81.5%, 159 of 195) than on image-based questions (47.8%, 87 of 182) (χ2 test, P < .001). Subanalysis revealed differences between prompts on text-based questions, where chain-of-thought prompting outperformed long instruction by 6.1% (McNemar, P = .02; FDR = 0.063), basic prompting by 6.8% (P = .009, FDR = 0.044), and the original prompting style by 8.9% (P = .001, FDR = 0.014). No differences were observed between prompts on image-based questions, with P values of .27 to >.99. Conclusion While GPT-4V demonstrated a level of competence in text-based questions, it showed deficits in interpreting radiologic images. © RSNA, 2024 See also the editorial by Deng in this issue.
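As a rough illustration of the paired statistics described above (McNemar tests between prompt styles followed by Benjamini-Hochberg control of the false discovery rate), a minimal Python sketch follows. The per-question 0/1 correctness vectors are randomly generated placeholders, not the study data.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
n_questions = 195  # number of unique text-only questions in the study
# Fabricated 0/1 correctness per question for four prompt styles
prompts = {name: rng.integers(0, 2, n_questions)
           for name in ["chain_of_thought", "long_instruction", "basic", "original"]}

baseline = prompts["chain_of_thought"]
pairs, pvals = [], []
for name, other in prompts.items():
    if name == "chain_of_thought":
        continue
    # 2x2 agreement/disagreement table between the paired prompts on the same questions
    table = [[int(np.sum((baseline == 1) & (other == 1))), int(np.sum((baseline == 1) & (other == 0)))],
             [int(np.sum((baseline == 0) & (other == 1))), int(np.sum((baseline == 0) & (other == 0)))]]
    pvals.append(mcnemar(table, exact=True).pvalue)
    pairs.append(f"chain_of_thought vs {name}")

reject, fdr, _, _ = multipletests(pvals, method="fdr_bh")
for pair, p, q, r in zip(pairs, pvals, fdr, reject):
    print(f"{pair}: P = {p:.3f}, FDR-adjusted = {q:.3f}, significant = {r}")
```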
Affiliation(s)
- Nolan Hayden
- Spencer Gilbert
- Laila M Poisson
- Brent Griffith
- Chad Klochko
- From the Department of Diagnostic Radiology, Henry Ford Health, 2799 W Grand Blvd, Detroit, MI, 48202 (N.H., B.G., C.K.); Michigan State University College of Osteopathic Medicine, East Lansing, Mich (S.G.); and Department of Public Health Sciences, Henry Ford Health, Michigan State University Health Sciences, Detroit, Mich (L.M.P.)

5. Reith TP, D'Alessandro DM, D'Alessandro MP. Capability of multimodal large language models to interpret pediatric radiological images. Pediatr Radiol 2024;54:1729-1737. [PMID: 39133401; DOI: 10.1007/s00247-024-06025-0]
Abstract
BACKGROUND There is a dearth of artificial intelligence (AI) development and research dedicated to pediatric radiology. The newest iterations of large language models (LLMs) like ChatGPT can process image and video input in addition to text. They are thus theoretically capable of providing impressions of input radiological images. OBJECTIVE To assess the ability of multimodal LLMs to interpret pediatric radiological images. MATERIALS AND METHODS Thirty medically significant cases were collected and submitted to GPT-4 (OpenAI, San Francisco, CA), Gemini 1.5 Pro (Google, Mountain View, CA), and Claude 3 Opus (Anthropic, San Francisco, CA) with a short history for a total of 90 images. AI responses were recorded and independently assessed for accuracy by a resident and attending physician. 95% confidence intervals were determined using the adjusted Wald method. RESULTS Overall, the models correctly diagnosed 27.8% (25/90) of images (95% CI=19.5-37.8%), were partially correct for 13.3% (12/90) of images (95% CI=2.7-26.4%), and were incorrect for 58.9% (53/90) of images (95% CI=48.6-68.5%). CONCLUSION Multimodal LLMs are not yet capable of interpreting pediatric radiological images.
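The 95% confidence intervals quoted above come from the adjusted Wald (Agresti-Coull) method; a small Python sketch of that interval, applied to the reported counts, follows. It is illustrative only, and rounding or variant choices may make the published intervals differ slightly.

```python
from math import sqrt

def adjusted_wald_ci(successes: int, n: int, z: float = 1.96):
    """Agresti-Coull interval: add z^2/2 pseudo-successes and pseudo-failures."""
    n_adj = n + z**2
    p_adj = (successes + z**2 / 2) / n_adj
    half = z * sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - half), min(1.0, p_adj + half)

# Counts reported in the abstract (out of 90 submitted images)
for label, correct in [("correct", 25), ("partially correct", 12), ("incorrect", 53)]:
    lo, hi = adjusted_wald_ci(correct, 90)
    print(f"{label}: {correct/90:.1%} (95% CI {lo:.1%}-{hi:.1%})")
```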
Affiliation(s)
- Thomas P Reith
- Department of Radiology, University of Iowa Hospitals and Clinics, Iowa City, IA, 52242, USA.
- Donna M D'Alessandro
- Department of Pediatrics, University of Iowa Hospitals and Clinics, Iowa City, IA, 52242, USA
- Michael P D'Alessandro
- Department of Radiology, University of Iowa Hospitals and Clinics, Iowa City, IA, 52242, USA

6. Deng F. Multimodal Models Are Still a Novice at Radiology Vision. Radiology 2024;312:e242286. [PMID: 39225607; DOI: 10.1148/radiol.242286]
Affiliation(s)
- Francis Deng
- From the Russell H. Morgan Department of Radiology and Radiological Science, Johns Hopkins University School of Medicine, 600 N Wolfe St, Baltimore, MD 21287

7. Crim J. Bone radiographs: sometimes overlooked, often difficult to read, and still important. Skeletal Radiol 2024;53:1687-1698. [PMID: 37914896; DOI: 10.1007/s00256-023-04498-y]
Affiliation(s)
- Julia Crim
- University of Missouri at Columbia, Columbia, MO, USA.

8. Mitsuyama Y, Tatekawa H, Takita H, Sasaki F, Tashiro A, Oue S, Walston SL, Nonomiya Y, Shintani A, Miki Y, Ueda D. Comparative analysis of GPT-4-based ChatGPT's diagnostic performance with radiologists using real-world radiology reports of brain tumors. Eur Radiol 2024. [PMID: 39198333; DOI: 10.1007/s00330-024-11032-8]
Abstract
OBJECTIVES Large language models like GPT-4 have demonstrated potential for diagnosis in radiology. Previous studies investigating this potential primarily utilized quizzes from academic journals. This study aimed to assess the diagnostic capabilities of GPT-4-based Chat Generative Pre-trained Transformer (ChatGPT) using actual clinical radiology reports of brain tumors and compare its performance with that of neuroradiologists and general radiologists. METHODS We collected brain MRI reports written in Japanese from preoperative brain tumor patients at two institutions from January 2017 to December 2021. The MRI reports were translated into English by radiologists. GPT-4 and five radiologists were presented with the same textual findings from the reports and asked to suggest differential and final diagnoses. The pathological diagnosis of the excised tumor served as the ground truth. McNemar's test and Fisher's exact test were used for statistical analysis. RESULTS In a study analyzing 150 radiological reports, GPT-4 achieved a final diagnostic accuracy of 73%, while radiologists' accuracy ranged from 65 to 79%. GPT-4's final diagnostic accuracy using reports from neuroradiologists was higher at 80%, compared to 60% using those from general radiologists. In the realm of differential diagnoses, GPT-4's accuracy was 94%, while radiologists' fell between 73 and 89%. Notably, for these differential diagnoses, GPT-4's accuracy remained consistent whether reports were from neuroradiologists or general radiologists. CONCLUSION GPT-4 exhibited good diagnostic capability, comparable to neuroradiologists in differentiating brain tumors from MRI reports. GPT-4 can be a second opinion for neuroradiologists on final diagnoses and a guidance tool for general radiologists and residents. CLINICAL RELEVANCE STATEMENT This study evaluated GPT-4-based ChatGPT's diagnostic capabilities using real-world clinical MRI reports from brain tumor cases, revealing that its accuracy in interpreting brain tumors from MRI findings is competitive with radiologists. KEY POINTS We investigated the diagnostic accuracy of GPT-4 using real-world clinical MRI reports of brain tumors. GPT-4 achieved final and differential diagnostic accuracy that is comparable with neuroradiologists. GPT-4 has the potential to improve the diagnostic process in clinical radiology.
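As an illustrative aside, the comparison of GPT-4's final-diagnosis accuracy by report source (80% with neuroradiologists' reports vs 60% with general radiologists' reports) can be examined with Fisher's exact test, one of the methods named above. The 2x2 counts below are hypothetical reconstructions that assume an even split of the 150 reports, so they are not the study's actual data.

```python
from scipy.stats import fisher_exact

#                   correct  incorrect
neuro_reports   = [60, 15]   # ~80% of 75 hypothetical neuroradiologist reports
general_reports = [45, 30]   # ~60% of 75 hypothetical general radiologist reports

odds_ratio, p_value = fisher_exact([neuro_reports, general_reports])
print(f"odds ratio = {odds_ratio:.2f}, P = {p_value:.3f}")
```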
Affiliation(s)
- Yasuhito Mitsuyama
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, 1-4-3 Asahi-machi, Abeno-ku, Osaka, 545-8585, Japan
- Hiroyuki Tatekawa
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, 1-4-3 Asahi-machi, Abeno-ku, Osaka, 545-8585, Japan
- Hirotaka Takita
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, 1-4-3 Asahi-machi, Abeno-ku, Osaka, 545-8585, Japan
- Fumi Sasaki
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, 1-4-3 Asahi-machi, Abeno-ku, Osaka, 545-8585, Japan
- Akane Tashiro
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, 1-4-3 Asahi-machi, Abeno-ku, Osaka, 545-8585, Japan
- Satoshi Oue
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, 1-4-3 Asahi-machi, Abeno-ku, Osaka, 545-8585, Japan
- Shannon L Walston
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, 1-4-3 Asahi-machi, Abeno-ku, Osaka, 545-8585, Japan
- Yuta Nonomiya
- Department of Medical Statistics, Graduate School of Medicine, Osaka Metropolitan University, 1-4-3 Asahi-machi, Abeno-ku, Osaka, 545-8585, Japan
- Ayumi Shintani
- Department of Medical Statistics, Graduate School of Medicine, Osaka Metropolitan University, 1-4-3 Asahi-machi, Abeno-ku, Osaka, 545-8585, Japan
- Yukio Miki
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, 1-4-3 Asahi-machi, Abeno-ku, Osaka, 545-8585, Japan
- Daiju Ueda
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, 1-4-3 Asahi-machi, Abeno-ku, Osaka, 545-8585, Japan.
- Center for Health Science Innovation, Osaka Metropolitan University, 1-4-3, Asahi-machi, Abeno-ku, Osaka, 545-8585, Japan.

9. Warren BE, Alkhalifah F, Ahrari A, Min A, Fawzy A, Annamalai G, Jaberi A, Beecroft R, Kachura JR, Mafeld SC. Feasibility of Artificial Intelligence Powered Adverse Event Analysis: Using a Large Language Model to Analyze Microwave Ablation Malfunction Data. Can Assoc Radiol J 2024:8465371241269436. [PMID: 39169480; DOI: 10.1177/08465371241269436]
Abstract
Objectives: To determine whether a large language model (LLM, GPT-4) can label, consolidate, and analyze interventional radiology (IR) microwave ablation device safety event data into meaningful summaries comparable to those produced by humans. Methods: Microwave ablation safety data from January 1, 2011 to October 31, 2023 were collected, and the type of failure was categorized by human readers. Using GPT-4 and iterative prompt development, the data were classified. Iterative summarization of the reports was performed using GPT-4 to generate a final summary of the large text corpus. Results: Training (n = 25), validation (n = 639), and test (n = 79) data were split to reflect real-world deployment of an LLM for this task. GPT-4 demonstrated high accuracy in the multiclass classification problem of microwave ablation device data (accuracy [95% CI]: training data 96.0% [79.7, 99.9], validation 86.4% [83.5, 89.0], test 87.3% [78.0, 93.8]). The text content was distilled through GPT-4 and iterative summarization prompts. A final summary was created that reflected the clinically relevant insights from the microwave ablation data relative to the human interpretation but contained inaccurate event class counts. Conclusion: The LLM emulated the human analysis, suggesting the feasibility of using LLMs to process large volumes of IR safety data as a tool for clinicians. It accurately labelled microwave ablation device event data by type of malfunction through few-shot learning. Content distillation was used to analyze a large text corpus (>650 reports) and generate an insightful summary comparable to the human interpretation.
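A hedged sketch of few-shot labelling of free-text device-malfunction reports with an LLM, in the spirit of the workflow described above, is given below. The label set, few-shot example, and prompt wording are placeholders rather than the study's materials, and the call assumes the current OpenAI Python SDK with an API key in the environment.

```python
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

LABELS = ["antenna failure", "generator error", "cable fault", "other"]  # hypothetical labels
FEW_SHOT = (
    "Report: Device shut down mid-ablation with error code on console.\n"
    "Label: generator error\n"
)

def classify(report: str) -> str:
    """Ask the model to assign exactly one label to a free-text event report."""
    prompt = (
        f"Classify the microwave ablation event report into one of: {', '.join(LABELS)}.\n"
        f"{FEW_SHOT}Report: {report}\nLabel:"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

print(classify("Ablation aborted after the antenna tip fractured during insertion."))
```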
Affiliation(s)
- Blair E Warren
- Department of Medical Imaging, University of Toronto, Temerty Faculty of Medicine, Toronto, ON, Canada
- Division of Vascular and Interventional Radiology, Joint Department of Medical Imaging, University Health Network, Toronto, ON, Canada
- Fahd Alkhalifah
- Department of Medical Imaging, University of Toronto, Temerty Faculty of Medicine, Toronto, ON, Canada
- Division of Vascular and Interventional Radiology, Joint Department of Medical Imaging, University Health Network, Toronto, ON, Canada
- Aida Ahrari
- Department of Medical Imaging, University of Toronto, Temerty Faculty of Medicine, Toronto, ON, Canada
- Division of Vascular and Interventional Radiology, Joint Department of Medical Imaging, University Health Network, Toronto, ON, Canada
- Adam Min
- Department of Medical Imaging, University of Toronto, Temerty Faculty of Medicine, Toronto, ON, Canada
- Division of Vascular and Interventional Radiology, Joint Department of Medical Imaging, University Health Network, Toronto, ON, Canada
- Aly Fawzy
- Department of Medical Imaging, University of Toronto, Temerty Faculty of Medicine, Toronto, ON, Canada
- Ganesan Annamalai
- Department of Medical Imaging, University of Toronto, Temerty Faculty of Medicine, Toronto, ON, Canada
- Division of Vascular and Interventional Radiology, Joint Department of Medical Imaging, University Health Network, Toronto, ON, Canada
- Arash Jaberi
- Department of Medical Imaging, University of Toronto, Temerty Faculty of Medicine, Toronto, ON, Canada
- Division of Vascular and Interventional Radiology, Joint Department of Medical Imaging, University Health Network, Toronto, ON, Canada
- Robert Beecroft
- Department of Medical Imaging, University of Toronto, Temerty Faculty of Medicine, Toronto, ON, Canada
- Division of Vascular and Interventional Radiology, Joint Department of Medical Imaging, University Health Network, Toronto, ON, Canada
- John R Kachura
- Department of Medical Imaging, University of Toronto, Temerty Faculty of Medicine, Toronto, ON, Canada
- Division of Vascular and Interventional Radiology, Joint Department of Medical Imaging, University Health Network, Toronto, ON, Canada
- Sebastian C Mafeld
- Department of Medical Imaging, University of Toronto, Temerty Faculty of Medicine, Toronto, ON, Canada
- Division of Vascular and Interventional Radiology, Joint Department of Medical Imaging, University Health Network, Toronto, ON, Canada

10. Ray PP. Need of Fine-Tuned Radiology Aware Open-Source Large Language Models for Neuroradiology. Clin Neuroradiol 2024. [PMID: 39158608; DOI: 10.1007/s00062-024-01454-8]
Affiliation(s)
- Partha Pratim Ray
- Department of Computer Applications, Sikkim University, 6th Mile, PO-Tadong, 737102, Gangtok, Sikkim, India.

11. Holderried F, Stegemann-Philipps C, Herrmann-Werner A, Festl-Wietek T, Holderried M, Eickhoff C, Mahling M. A Language Model-Powered Simulated Patient With Automated Feedback for History Taking: Prospective Study. JMIR Med Educ 2024;10:e59213. [PMID: 39150749; PMCID: PMC11364946; DOI: 10.2196/59213]
Abstract
BACKGROUND Although history taking is fundamental for diagnosing medical conditions, teaching and providing feedback on the skill can be challenging due to resource constraints. Virtual simulated patients and web-based chatbots have thus emerged as educational tools, with recent advancements in artificial intelligence (AI) such as large language models (LLMs) enhancing their realism and potential to provide feedback. OBJECTIVE In our study, we aimed to evaluate the effectiveness of a Generative Pretrained Transformer (GPT) 4 model to provide structured feedback on medical students' performance in history taking with a simulated patient. METHODS We conducted a prospective study involving medical students performing history taking with a GPT-powered chatbot. To that end, we designed a chatbot to simulate patients' responses and provide immediate feedback on the comprehensiveness of the students' history taking. Students' interactions with the chatbot were analyzed, and feedback from the chatbot was compared with feedback from a human rater. We measured interrater reliability and performed a descriptive analysis to assess the quality of feedback. RESULTS Most of the study's participants were in their third year of medical school. A total of 1894 question-answer pairs from 106 conversations were included in our analysis. GPT-4's role-play and responses were medically plausible in more than 99% of cases. Interrater reliability between GPT-4 and the human rater showed "almost perfect" agreement (Cohen κ=0.832). Lower agreement (κ<0.6), detected for 8 of the 45 feedback categories, highlighted topics on which the model's assessments were overly specific or diverged from human judgment. CONCLUSIONS The GPT model was effective in providing structured feedback on history-taking dialogs provided by medical students. Although we identified some limitations regarding the specificity of feedback for certain categories, the overall high agreement with human raters suggests that LLMs can be a valuable tool for medical education. Our findings thus advocate the careful integration of AI-driven feedback mechanisms in medical training and highlight important aspects to consider when LLMs are used in that context.
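The interrater-reliability figure quoted above is Cohen's kappa; a minimal Python sketch of that calculation follows, with invented label vectors standing in for the per-category ratings.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical 0/1 labels (item addressed vs missed) for one feedback category
gpt4_labels  = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
human_labels = [1, 1, 0, 1, 1, 1, 1, 0, 1, 1]

kappa = cohen_kappa_score(gpt4_labels, human_labels)
print(f"Cohen's kappa = {kappa:.3f}")  # values > 0.8 are conventionally read as 'almost perfect'
```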
Affiliation(s)
- Friederike Holderried
- Tübingen Institute for Medical Education (TIME), Medical Faculty, University of Tübingen, Tübingen, Germany
- Anne Herrmann-Werner
- Tübingen Institute for Medical Education (TIME), Medical Faculty, University of Tübingen, Tübingen, Germany
- Teresa Festl-Wietek
- Tübingen Institute for Medical Education (TIME), Medical Faculty, University of Tübingen, Tübingen, Germany
- Martin Holderried
- Department of Medical Development, Process and Quality Management, University Hospital Tübingen, Tübingen, Germany
- Carsten Eickhoff
- Institute for Applied Medical Informatics, University of Tübingen, Tübingen, Germany
- Moritz Mahling
- Tübingen Institute for Medical Education (TIME), Medical Faculty, University of Tübingen, Tübingen, Germany
- Department of Medical Development, Process and Quality Management, University Hospital Tübingen, Tübingen, Germany

12. Sadeq MA, Ghorab RMF, Ashry MH, Abozaid AM, Banihani HA, Salem M, Aisheh MTA, Abuzahra S, Mourid MR, Assker MM, Ayyad M, Moawad MHED. AI chatbots show promise but limitations on UK medical exam questions: a comparative performance study. Sci Rep 2024;14:18859. [PMID: 39143077; PMCID: PMC11324724; DOI: 10.1038/s41598-024-68996-2]
Abstract
Large language models (LLMs) like ChatGPT have potential applications in medical education, such as helping students study for their licensing exams by discussing unclear questions with them. However, they require evaluation on these complex tasks. The purpose of this study was to evaluate how well publicly accessible LLMs performed on simulated UK medical board exam questions. 423 board-style questions from 9 UK exams (MRCS, MRCP, etc.) were answered by seven LLMs (ChatGPT-3.5, ChatGPT-4, Bard, Perplexity, Claude, Bing, Claude Instant). There were 406 multiple-choice, 13 true/false, and 4 "choose N" questions covering topics in surgery, pediatrics, and other disciplines. The accuracy of the output was graded. Statistical tests were used to analyze differences among LLMs. Leaked questions were excluded from the primary analysis. ChatGPT-4.0 scored 78.2%, Bing 67.2%, Claude 64.4%, and Claude Instant 62.9%; Perplexity scored the lowest (56.1%). Scores differed significantly between LLMs overall (p < 0.001) and in pairwise comparisons. All LLMs scored higher on multiple-choice than on true/false or "choose N" questions. LLMs demonstrated limitations in answering certain questions, indicating that refinements are needed before they can be relied on in medical education. However, their expanding capabilities suggest a potential to improve training if thoughtfully implemented. Further research should explore specialty-specific LLMs and optimal integration into medical curricula.
Affiliation(s)
- Mohammed Ahmed Sadeq
- Misr University for Science and Technology, 6th of October, Egypt.
- Medical Research Platform (MRP), Giza, Egypt.
- Emergency Medicine Department, Elsheikh Zayed Specialized Hospital, Elsheikh Zayed City, Egypt.
- Reem Mohamed Farouk Ghorab
- Misr University for Science and Technology, 6th of October, Egypt
- Medical Research Platform (MRP), Giza, Egypt
- Emergency Medicine Department, Elsheikh Zayed Specialized Hospital, Elsheikh Zayed City, Egypt
- Mohamed Hady Ashry
- Medical Research Platform (MRP), Giza, Egypt
- School of Medicine, New Giza University (NGU), Giza, Egypt
- Ahmed Mohamed Abozaid
- Medical Research Platform (MRP), Giza, Egypt
- Faculty of Medicine, Tanta University, Tanta, Egypt
- Haneen A Banihani
- Medical Research Platform (MRP), Giza, Egypt
- Faculty of Medicine, University of Jordan, Amman, Jordan
- Moustafa Salem
- Medical Research Platform (MRP), Giza, Egypt
- Faculty of Medicine, Mansoura University, Mansoura, Egypt
- Mohammed Tawfiq Abu Aisheh
- Medical Research Platform (MRP), Giza, Egypt
- Department of Medicine, College of Medicine and Health Sciences, An-Najah National University, Nablus, 44839, Palestine
- Saad Abuzahra
- Medical Research Platform (MRP), Giza, Egypt
- Department of Medicine, College of Medicine and Health Sciences, An-Najah National University, Nablus, 44839, Palestine
- Marina Ramzy Mourid
- Medical Research Platform (MRP), Giza, Egypt
- Faculty of Medicine, Alexandria University, Alexandria, Egypt
- Mohamad Monif Assker
- Medical Research Platform (MRP), Giza, Egypt
- Sheikh Khalifa Medical City, Abu Dhabi, UAE
- Mohammed Ayyad
- Medical Research Platform (MRP), Giza, Egypt
- Faculty of Medicine, Al-Quds University, Jerusalem, Palestine
- Mostafa Hossam El Din Moawad
- Medical Research Platform (MRP), Giza, Egypt
- Faculty of Pharmacy Clinical Department, Alexandria University, Alexandria, Egypt
- Faculty of Medicine, Suez Canal University, Ismailia, Egypt

13. Fatima A, Shafique MA, Alam K, Fadlalla Ahmed TK, Mustafa MS. ChatGPT in medicine: A cross-disciplinary systematic review of ChatGPT's (artificial intelligence) role in research, clinical practice, education, and patient interaction. Medicine (Baltimore) 2024;103:e39250. [PMID: 39121303; PMCID: PMC11315549; DOI: 10.1097/md.0000000000039250]
Abstract
BACKGROUND ChatGPT, a powerful AI language model, has gained increasing prominence in medicine, offering potential applications in healthcare, clinical decision support, patient communication, and medical research. This systematic review aims to comprehensively assess the applications of ChatGPT in healthcare education, research, writing, patient communication, and practice while also delineating potential limitations and areas for improvement. METHOD Our comprehensive database search retrieved relevant papers from PubMed, Medline and Scopus. After the screening process, 83 studies met the inclusion criteria. This review includes original studies comprising case reports, analytical studies, and editorials with original findings. RESULT ChatGPT is useful for scientific research and academic writing, and assists with grammar, clarity, and coherence. This helps non-English speakers and improves accessibility by breaking down linguistic barriers. However, its limitations include probable inaccuracy and ethical issues, such as bias and plagiarism. ChatGPT streamlines workflows and offers diagnostic and educational potential in healthcare but exhibits biases and lacks emotional sensitivity. It is useful in patient communication, but requires up-to-date data and faces concerns about the accuracy of information and hallucinatory responses. CONCLUSION Given the potential for ChatGPT to transform healthcare education, research, and practice, it is essential to approach its adoption in these areas with caution due to its inherent limitations.
Affiliation(s)
- Afia Fatima
- Department of Medicine, Jinnah Sindh Medical University, Karachi, Pakistan
- Khadija Alam
- Department of Medicine, Liaquat National Medical College, Karachi, Pakistan

14. Beşler MS. The performance of the multimodal large language model GPT-4 on the European board of radiology examination sample test. Jpn J Radiol 2024;42:927. [PMID: 38568429; DOI: 10.1007/s11604-024-01565-9]
Affiliation(s)
- Muhammed Said Beşler
- Department of Radiology, Kahramanmaraş Necip Fazıl City Hospital, Kahramanmaraş, Turkey.

15. Cao JJ, Kwon DH, Ghaziani TT, Kwo P, Tse G, Kesselman A, Kamaya A, Tse JR. Large language models' responses to liver cancer surveillance, diagnosis, and management questions: accuracy, reliability, readability. Abdom Radiol (NY) 2024. [PMID: 39088019; DOI: 10.1007/s00261-024-04501-7]
Abstract
PURPOSE To assess the accuracy, reliability, and readability of publicly available large language models in answering fundamental questions on hepatocellular carcinoma diagnosis and management. METHODS Twenty questions on liver cancer diagnosis and management were asked in triplicate to ChatGPT-3.5 (OpenAI), Gemini (Google), and Bing (Microsoft). Responses were assessed by six fellowship-trained physicians from three academic liver transplant centers who actively diagnose and/or treat liver cancer. Responses were categorized as accurate (score 1; all information is true and relevant), inadequate (score 0; all information is true, but does not fully answer the question or provides irrelevant information), or inaccurate (score -1; any information is false). Means with standard deviations were recorded. A response was considered accurate as a whole if its mean score was > 0, and reliable if the mean score was > 0 across all responses to the same question. Responses were also quantified for readability using the Flesch Reading Ease Score and Flesch-Kincaid Grade Level. Readability and accuracy across 60 responses were compared using one-way ANOVAs with Tukey's multiple comparison tests. RESULTS Of the twenty questions, ChatGPT answered nine (45%), Gemini answered 12 (60%), and Bing answered six (30%) questions accurately; however, only six (30%), eight (40%), and three (15%), respectively, were both accurate and reliable. There were no significant differences in accuracy between any chatbot. ChatGPT responses were the least readable (mean Flesch Reading Ease Score 29; college graduate), followed by Gemini (30; college) and Bing (40; college; p < 0.001). CONCLUSION Large language models provide complex responses to basic questions on hepatocellular carcinoma diagnosis and management that are seldom accurate, reliable, or readable.
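The readability metrics above are the Flesch Reading Ease and Flesch-Kincaid Grade Level; a small Python sketch of the standard formulas follows, using a naive syllable counter (published tools such as textstat estimate syllables more carefully, so exact values may differ slightly).

```python
import re

def count_syllables(word: str) -> int:
    """Very rough syllable count: contiguous vowel groups, minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_scores(text: str):
    """Return (Flesch Reading Ease, Flesch-Kincaid Grade Level) for a text."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / sentences
    syllables_per_word = syllables / len(words)
    reading_ease = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    grade_level = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return reading_ease, grade_level

sample = "Hepatocellular carcinoma surveillance uses ultrasound every six months."
ease, grade = flesch_scores(sample)
print(f"Flesch Reading Ease = {ease:.0f}, Flesch-Kincaid Grade Level = {grade:.1f}")
```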
Affiliation(s)
- Jennie J Cao
- Department of Radiology, Stanford University School of Medicine, 300 Pasteur Drive, Room H-1307, Stanford, CA, 94305, USA
- Daniel H Kwon
- Department of Medicine, San Francisco School of Medicine, University of California, 505 Parnassus Ave, MC1286C, San Francisco, CA, 94144, USA
- Tara T Ghaziani
- Department of Medicine, Stanford University School of Medicine, 430 Broadway St MC 6341, Redwood City, CA, 94063, USA
- Paul Kwo
- Department of Medicine, Stanford University School of Medicine, 430 Broadway St MC 6341, Redwood City, CA, 94063, USA
- Gary Tse
- Department of Radiological Sciences, Los Angeles David Geffen School of Medicine, University of California, 757 Westwood Plaza Los Angeles, Los Angeles, CA, 90095, USA
- Andrew Kesselman
- Department of Radiology, Stanford University School of Medicine, 875 Blake Wilbur Drive Palo Alto, Stanford, CA, 94304, USA
- Aya Kamaya
- Department of Radiology, Stanford University School of Medicine, 300 Pasteur Drive, Room H-1307, Stanford, CA, 94305, USA
- Justin R Tse
- Department of Radiology, Stanford University School of Medicine, 300 Pasteur Drive, Room H-1307, Stanford, CA, 94305, USA.

16. Adams LC, Truhn D, Busch F, Dorfner F, Nawabi J, Makowski MR, Bressem KK. Llama 3 Challenges Proprietary State-of-the-Art Large Language Models in Radiology Board-style Examination Questions. Radiology 2024;312:e241191. [PMID: 39136566; DOI: 10.1148/radiol.241191]
Affiliation(s)
- Lisa C Adams
- Daniel Truhn
- Felix Busch
- Felix Dorfner
- Jawed Nawabi
- Marcus R Makowski
- Keno K Bressem
- From the Department of Diagnostic and Interventional Radiology, Technical University Munich, Ismaninger Str. 22, 81675 Munich, Germany (L.C.A., M.R.M., K.K.B.); Department of Radiology, University Hospital RWTH Aachen, Aachen, Germany (D.T.); Departments of Radiology (F.B., F.D.) and Neuroradiology (J.N.), Charité-Universitätsmedizin Berlin, Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany; and Department of Radiology and Nuclear Medicine, German Heart Center Munich, Munich, Germany (K.K.B.)

17. Barak-Corren Y, Wolf R, Rozenblum R, Creedon JK, Lipsett SC, Lyons TW, Michelson KA, Miller KA, Shapiro DJ, Reis BY, Fine AM. Harnessing the Power of Generative AI for Clinical Summaries: Perspectives From Emergency Physicians. Ann Emerg Med 2024;84:128-138. [PMID: 38483426; DOI: 10.1016/j.annemergmed.2024.01.039]
Abstract
STUDY OBJECTIVE The workload of clinical documentation contributes to health care costs and professional burnout. The advent of generative artificial intelligence language models presents a promising solution. The perspective of clinicians may contribute to effective and responsible implementation of such tools. This study sought to evaluate 3 uses for generative artificial intelligence for clinical documentation in pediatric emergency medicine, measuring time savings, effort reduction, and physician attitudes and identifying potential risks and barriers. METHODS This mixed-methods study was performed with 10 pediatric emergency medicine attending physicians from a single pediatric emergency department. Participants were asked to write a supervisory note for 4 clinical scenarios, with varying levels of complexity, twice without any assistance and twice with the assistance of ChatGPT Version 4.0. Participants evaluated 2 additional ChatGPT-generated clinical summaries: a structured handoff and a visit summary for a family written at an 8th grade reading level. Finally, a semistructured interview was performed to assess physicians' perspective on the use of ChatGPT in pediatric emergency medicine. Main outcomes and measures included between subjects' comparisons of the effort and time taken to complete the supervisory note with and without ChatGPT assistance. Effort was measured using a self-reported Likert scale of 0 to 10. Physicians' scoring of and attitude toward the ChatGPT-generated summaries were measured using a 0 to 10 Likert scale and open-ended questions. Summaries were scored for completeness, accuracy, efficiency, readability, and overall satisfaction. A thematic analysis was performed to analyze the content of the open-ended questions and to identify key themes. RESULTS ChatGPT yielded a 40% reduction in time and a 33% decrease in effort for supervisory notes in intricate cases, with no discernible effect on simpler notes. ChatGPT-generated summaries for structured handoffs and family letters were highly rated, ranging from 7.0 to 9.0 out of 10, and most participants favored their inclusion in clinical practice. However, there were several critical reservations, out of which a set of general recommendations for applying ChatGPT to clinical summaries was formulated. CONCLUSION Pediatric emergency medicine attendings in our study perceived that ChatGPT can deliver high-quality summaries while saving time and effort in many scenarios, but not all.
Affiliation(s)
- Yuval Barak-Corren
- Predictive Medicine Group, Computational Health Informatics Program, Boston Children's Hospital, Boston, MA; Division of Cardiology, Children's Hospital of Philadelphia, Philadelphia, PA.
- Rebecca Wolf
- Emergency Medicine Boston Children's Hospital, Boston, MA
- Ronen Rozenblum
- Harvard Medical School Boston, MA; Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, MA
- Jessica K Creedon
- Emergency Medicine Boston Children's Hospital, Boston, MA; Harvard Medical School Boston, MA
- Susan C Lipsett
- Emergency Medicine Boston Children's Hospital, Boston, MA; Harvard Medical School Boston, MA
- Todd W Lyons
- Emergency Medicine Boston Children's Hospital, Boston, MA; Harvard Medical School Boston, MA
- Kelsey A Miller
- Emergency Medicine Boston Children's Hospital, Boston, MA; Harvard Medical School Boston, MA
- Daniel J Shapiro
- Division of Pediatric Emergency Medicine, University of California, San Francisco, San Francisco, CA
- Ben Y Reis
- Predictive Medicine Group, Computational Health Informatics Program, Boston Children's Hospital, Boston, MA; Harvard Medical School Boston, MA
- Andrew M Fine
- Emergency Medicine Boston Children's Hospital, Boston, MA; Harvard Medical School Boston, MA

18. D'Anna G, Van Cauter S, Thurnher M, Van Goethem J, Haller S. Can large language models pass official high-grade exams of the European Society of Neuroradiology courses? A direct comparison between OpenAI chatGPT 3.5, OpenAI GPT4 and Google Bard. Neuroradiology 2024;66:1245-1250. [PMID: 38705899; DOI: 10.1007/s00234-024-03371-6]
Abstract
We compared three LLMs, ChatGPT 3.5, GPT-4, and Google Bard, and tested whether their performance differs across subspecialty domains by having them take examinations from four courses of the European Society of Neuroradiology (ESNR): anatomy/embryology, neuro-oncology, head and neck, and pediatrics. Written ESNR exams were used as input data, covering anatomy/embryology (30 questions), neuro-oncology (50 questions), head and neck (50 questions), and pediatrics (50 questions). All exams together, and each exam separately, were presented to the three LLMs: ChatGPT 3.5, GPT-4, and Google Bard. Statistical analyses included a group-wise Friedman test followed by pair-wise Wilcoxon tests with multiple comparison corrections. Overall, there was a significant difference between the 3 LLMs (p < 0.0001), with GPT-4 having the highest accuracy (70%), followed by ChatGPT 3.5 (54%) and Google Bard (36%). The pair-wise comparisons showed significant differences between ChatGPT 3.5 vs GPT-4 (p < 0.0001), ChatGPT 3.5 vs Bard (p < 0.0023), and GPT-4 vs Bard (p < 0.0001). Analyses per subspecialty showed the largest difference between the best LLM (GPT-4, 70%) and the worst LLM (Google Bard, 24%) in the head and neck exam, while the difference was least pronounced in neuro-oncology (GPT-4, 62% vs Google Bard, 48%). We observed significant differences in the performance of the three LLMs on official exams organized by the ESNR. Overall, GPT-4 performed best and Google Bard worst; the difference varied by subspecialty and was most pronounced in head and neck.
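A minimal Python sketch of the statistics described above (group-wise Friedman test, then pairwise Wilcoxon signed-rank tests with a multiple-comparison correction) follows. The per-question 0/1 scores are randomly generated to roughly match the reported accuracies, and Bonferroni is used as an example correction since the abstract does not name one.

```python
import numpy as np
from itertools import combinations
from scipy.stats import friedmanchisquare, wilcoxon
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(1)
n_questions = 180  # anatomy/embryology + neuro-oncology + head and neck + pediatrics
# Fabricated per-question 0/1 scores, roughly matching the reported accuracies
scores = {"GPT-4": rng.binomial(1, 0.70, n_questions),
          "ChatGPT 3.5": rng.binomial(1, 0.54, n_questions),
          "Google Bard": rng.binomial(1, 0.36, n_questions)}

stat, p_friedman = friedmanchisquare(*scores.values())
print(f"Friedman test: chi2 = {stat:.1f}, p = {p_friedman:.2g}")

pairs = list(combinations(scores, 2))
pvals = [wilcoxon(scores[a], scores[b]).pvalue for a, b in pairs]
adjusted = multipletests(pvals, method="bonferroni")[1]
for (a, b), p_adj in zip(pairs, adjusted):
    print(f"{a} vs {b}: adjusted p = {p_adj:.2g}")
```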
Affiliation(s)
- Gennaro D'Anna
- Neuroimaging Unit, ASST Ovest Milanese, Legnano, Milan, Italy.
- Sofie Van Cauter
- Department of Medical Imaging, Ziekenhuis Oost-Limburg, Genk, Belgium
- Department of Medicine and Life Sciences, Hasselt University, Hasselt, Belgium
- Majda Thurnher
- Department for Biomedical Imaging and Image-Guided Therapy, Medical University of Vienna, Vienna, Austria
- Johan Van Goethem
- Department of Medical and Molecular Imaging, VITAZ, Sint-Niklaas, Belgium
- Department of Radiology, University Hospital Antwerp, Antwerp, Belgium
- Sven Haller
- CIMC-Centre d'Imagerie Médicale de Cornavin, Geneva, Switzerland
- Department of Surgical Sciences, Radiology, Uppsala University, Uppsala, Sweden
- Faculty of Medicine, University of Geneva, Geneva, Switzerland
- Department of Radiology, Beijing Tiantan Hospital, Capital Medical University, Beijing, People's Republic of China

19. Kim SE, Lee JH, Choi BS, Han HS, Lee MC, Ro DH. Performance of ChatGPT on Solving Orthopedic Board-Style Questions: A Comparative Analysis of ChatGPT 3.5 and ChatGPT 4. Clin Orthop Surg 2024;16:669-673. [PMID: 39092297; PMCID: PMC11262944; DOI: 10.4055/cios23179]
Abstract
Background The application of artificial intelligence and large language models in the medical field requires an evaluation of their accuracy in providing medical information. This study aimed to assess the performance of Chat Generative Pre-trained Transformer (ChatGPT) models 3.5 and 4 in solving orthopedic board-style questions. Methods A total of 160 text-only questions from the Orthopedic Surgery Department at Seoul National University Hospital, conforming to the format of the Korean Orthopedic Association board certification examinations, were input into the ChatGPT 3.5 and ChatGPT 4 programs. The questions were divided into 11 subcategories. The accuracy rates of the initial answers provided by ChatGPT 3.5 and ChatGPT 4 were analyzed. In addition, inconsistency rates of answers were evaluated by regenerating the responses. Results ChatGPT 3.5 answered 37.5% of the questions correctly, while ChatGPT 4 showed an accuracy rate of 60.0% (p < 0.001). ChatGPT 4 demonstrated superior performance across most subcategories, except for the tumor-related questions. The rates of inconsistency in answers were 47.5% for ChatGPT 3.5 and 9.4% for ChatGPT 4. Conclusions ChatGPT 4 showed the ability to pass orthopedic board-style examinations, outperforming ChatGPT 3.5 in accuracy rate. However, inconsistencies in response generation and instances of incorrect answers with misleading explanations require caution when applying ChatGPT in clinical settings or for educational purposes.
Affiliation(s)
- Sung Eun Kim
- Department of Orthopedic Surgery, Seoul National University Hospital, Seoul National University College of Medicine, Seoul, Korea
- Ji Han Lee
- Department of Orthopedic Surgery, Seoul National University Hospital, Seoul National University College of Medicine, Seoul, Korea
- Byung Sun Choi
- Department of Orthopedic Surgery, Seoul National University Hospital, Seoul National University College of Medicine, Seoul, Korea
- Hyuk-Soo Han
- Department of Orthopedic Surgery, Seoul National University Hospital, Seoul National University College of Medicine, Seoul, Korea
- Myung Chul Lee
- Department of Orthopedic Surgery, Seoul National University Hospital, Seoul National University College of Medicine, Seoul, Korea
- Du Hyun Ro
- Department of Orthopedic Surgery, Seoul National University Hospital, Seoul National University College of Medicine, Seoul, Korea

20. Hirano Y, Hanaoka S, Nakao T, Miki S, Kikuchi T, Nakamura Y, Nomura Y, Yoshikawa T, Abe O. GPT-4 Turbo with Vision fails to outperform text-only GPT-4 Turbo in the Japan Diagnostic Radiology Board Examination. Jpn J Radiol 2024;42:918-926. [PMID: 38733472; PMCID: PMC11286662; DOI: 10.1007/s11604-024-01561-z]
Abstract
PURPOSE To assess the performance of GPT-4 Turbo with Vision (GPT-4TV), OpenAI's latest multimodal large language model, by comparing its ability to process both text and image inputs with that of the text-only GPT-4 Turbo (GPT-4 T) in the context of the Japan Diagnostic Radiology Board Examination (JDRBE). MATERIALS AND METHODS The dataset comprised questions from JDRBE 2021 and 2023. A total of six board-certified diagnostic radiologists discussed the questions and provided ground-truth answers by consulting relevant literature as necessary. The following questions were excluded: those lacking associated images, those with no unanimous agreement on answers, and those including images rejected by the OpenAI application programming interface. The inputs for GPT-4TV included both text and images, whereas those for GPT-4 T were entirely text. Both models were deployed on the dataset, and their performance was compared using McNemar's exact test. The radiological credibility of the responses was assessed by two diagnostic radiologists through the assignment of legitimacy scores on a five-point Likert scale. These scores were subsequently used to compare model performance using Wilcoxon's signed-rank test. RESULTS The dataset comprised 139 questions. GPT-4TV correctly answered 62 questions (45%), whereas GPT-4 T correctly answered 57 questions (41%). A statistical analysis found no significant performance difference between the two models (P = 0.44). The GPT-4TV responses received significantly lower legitimacy scores from both radiologists than the GPT-4 T responses. CONCLUSION No significant enhancement in accuracy was observed when using GPT-4TV with image input compared with that of using text-only GPT-4 T for JDRBE questions.
Collapse
Affiliation(s)
- Yuichiro Hirano
- Department of Radiology, The International University of Health and Welfare Narita Hospital, 852 Hatakeda, Narita, Chiba, Japan.
- Department of Radiology, the University of Tokyo Hospital, 7-3-1 Hongo, Bunkyo-Ku, Tokyo, Japan.
| | - Shouhei Hanaoka
- Department of Radiology, the University of Tokyo Hospital, 7-3-1 Hongo, Bunkyo-Ku, Tokyo, Japan
| | - Takahiro Nakao
- Department of Computational Diagnostic Radiology and Preventive Medicine, the University of Tokyo Hospital, 7-3-1 Hongo, Bunkyo-Ku, Tokyo, Japan
| | - Soichiro Miki
- Department of Computational Diagnostic Radiology and Preventive Medicine, the University of Tokyo Hospital, 7-3-1 Hongo, Bunkyo-Ku, Tokyo, Japan
| | - Tomohiro Kikuchi
- Department of Computational Diagnostic Radiology and Preventive Medicine, the University of Tokyo Hospital, 7-3-1 Hongo, Bunkyo-Ku, Tokyo, Japan
- Department of Radiology, School of Medicine, Jichi Medical University, 3311-1 Yakushiji, Shimotsuke, Tochigi, Japan
| | - Yuta Nakamura
- Department of Computational Diagnostic Radiology and Preventive Medicine, the University of Tokyo Hospital, 7-3-1 Hongo, Bunkyo-Ku, Tokyo, Japan
| | - Yukihiro Nomura
- Department of Computational Diagnostic Radiology and Preventive Medicine, the University of Tokyo Hospital, 7-3-1 Hongo, Bunkyo-Ku, Tokyo, Japan
- Center for Frontier Medical Engineering, Chiba University, 1-33 Yayoicho, Inage-Ku, Chiba, Japan
| | - Takeharu Yoshikawa
- Department of Computational Diagnostic Radiology and Preventive Medicine, the University of Tokyo Hospital, 7-3-1 Hongo, Bunkyo-Ku, Tokyo, Japan
| | - Osamu Abe
- Department of Radiology, the University of Tokyo Hospital, 7-3-1 Hongo, Bunkyo-Ku, Tokyo, Japan
| |
Collapse
|
21
|
Naja F, Taktouk M, Matbouli D, Khaleel S, Maher A, Uzun B, Alameddine M, Nasreddine L. Artificial intelligence chatbots for the nutrition management of diabetes and the metabolic syndrome. Eur J Clin Nutr 2024:10.1038/s41430-024-01476-y. [PMID: 39060542 DOI: 10.1038/s41430-024-01476-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/08/2024] [Revised: 07/16/2024] [Accepted: 07/17/2024] [Indexed: 07/28/2024]
Abstract
BACKGROUND Recently, there has been a growing interest in exploring AI-driven chatbots, such as ChatGPT, as a resource for disease management and education. OBJECTIVE The study aims to evaluate ChatGPT's accuracy and quality/clarity in providing nutritional management for type 2 diabetes mellitus (T2DM), the metabolic syndrome (MetS) and its components, in accordance with the Academy of Nutrition and Dietetics' guidelines. METHODS Three nutrition management-related domains were considered: (1) Dietary management, (2) Nutrition care process (NCP) and (3) Menu planning (1500 kcal). A total of 63 prompts were used. Two experienced dietitians evaluated the chatbot output's concordance with the guidelines. RESULTS Both dietitians provided similar assessments for most conditions examined in the study. Gaps in the ChatGPT-derived outputs were identified and included weight loss recommendations, energy deficit, anthropometric assessment, specific nutrients of concern and the adoption of specific dietary interventions. Gaps in physical activity recommendations were also observed, highlighting ChatGPT's limitations in providing holistic lifestyle interventions. Within the NCP, the generated output provided incomplete examples of diagnostic documentation statements and had significant gaps in the monitoring and evaluation step. In the 1500 kcal one-day menus, the amounts of carbohydrates, fat, vitamin D and calcium were discordant with dietary recommendations. Regarding clarity, dietitians rated the output as either good or excellent. CONCLUSION Although ChatGPT is an increasingly available resource for practitioners, users are encouraged to consider the gaps identified in this study in the dietary management of T2DM and the MetS.
Collapse
Affiliation(s)
- Farah Naja
- Department of Clinical Nutrition and Dietetics, College of Health Sciences, Research Institute of Medical and Health Sciences (RIMHS), University of Sharjah, Sharjah, United Arab Emirates
- Department of Nutrition and Food Sciences, Faculty of Agricultural and Food Sciences, American University of Beirut (AUB), Beirut, Lebanon
| | - Mandy Taktouk
- Department of Nutrition and Food Sciences, Faculty of Agricultural and Food Sciences, American University of Beirut (AUB), Beirut, Lebanon
| | - Dana Matbouli
- Department of Nutrition and Food Sciences, Faculty of Agricultural and Food Sciences, American University of Beirut (AUB), Beirut, Lebanon
| | - Sharfa Khaleel
- Department of Clinical Nutrition and Dietetics, College of Health Sciences, Research Institute of Medical and Health Sciences (RIMHS), University of Sharjah, Sharjah, United Arab Emirates
| | - Ayah Maher
- Department of Clinical Nutrition and Dietetics, College of Health Sciences, Research Institute of Medical and Health Sciences (RIMHS), University of Sharjah, Sharjah, United Arab Emirates
| | - Berna Uzun
- Department of Mathematics, Near East University, Nicosia, Turkey
| | | | - Lara Nasreddine
- Department of Nutrition and Food Sciences, Faculty of Agricultural and Food Sciences, American University of Beirut (AUB), Beirut, Lebanon.
| |
Collapse
|
22
|
Cherif H, Moussa C, Missaoui AM, Salouage I, Mokaddem S, Dhahri B. Appraisal of ChatGPT's Aptitude for Medical Education: Comparative Analysis With Third-Year Medical Students in a Pulmonology Examination. JMIR MEDICAL EDUCATION 2024; 10:e52818. [PMID: 39042876 PMCID: PMC11303904 DOI: 10.2196/52818] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/29/2023] [Revised: 02/05/2024] [Accepted: 02/26/2024] [Indexed: 07/25/2024]
Abstract
BACKGROUND The rapid evolution of ChatGPT has generated substantial interest and led to extensive discussions in both public and academic domains, particularly in the context of medical education. OBJECTIVE This study aimed to evaluate ChatGPT's performance in a pulmonology examination through a comparative analysis with that of third-year medical students. METHODS In this cross-sectional study, we conducted a comparative analysis with 2 distinct groups. The first group comprised 244 third-year medical students who had previously taken our institution's 2020 pulmonology examination, which was conducted in French. The second group involved ChatGPT-3.5 in 2 separate sets of conversations: without contextualization (V1) and with contextualization (V2). In both V1 and V2, ChatGPT received the same set of questions administered to the students. RESULTS V1 demonstrated exceptional proficiency in radiology, microbiology, and thoracic surgery, surpassing the majority of medical students in these domains. However, it faced challenges in pathology, pharmacology, and clinical pneumology. In contrast, V2 consistently delivered more accurate responses across various question categories, regardless of the specialization. ChatGPT exhibited suboptimal performance in multiple choice questions compared to medical students. V2 excelled in responding to structured open-ended questions. Both ChatGPT conversations, particularly V2, outperformed students in addressing questions of low and intermediate difficulty. Interestingly, students showcased enhanced proficiency when confronted with highly challenging questions. V1 fell short of passing the examination. Conversely, V2 successfully achieved examination success, outperforming 139 (62.1%) medical students. CONCLUSIONS While ChatGPT has access to a comprehensive web-based data set, its performance closely mirrors that of an average medical student. Outcomes are influenced by question format, item complexity, and contextual nuances. The model faces challenges in medical contexts requiring information synthesis, advanced analytical aptitude, and clinical judgment, as well as in non-English language assessments and when confronted with data outside mainstream internet sources.
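The V1/V2 contrast above hinges on whether the model receives framing context before each question. The snippet below is a minimal illustration of that setup, assuming the OpenAI Python SDK (version 1.0 or later) and a placeholder model name; the prompt text is paraphrased in English and is not the study's actual French prompt.

```python
from openai import OpenAI

client = OpenAI()   # reads OPENAI_API_KEY from the environment
question = "..."    # one pulmonology examination question, as given to students

# V1: the question alone, without contextualization.
v1 = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": question}],
)

# V2: the same question preceded by a system message that sets the exam context.
context = ("You are a third-year medical student sitting a pulmonology "
           "examination. Answer the following question.")
v2 = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "system", "content": context},
              {"role": "user", "content": question}],
)
print(v1.choices[0].message.content)
print(v2.choices[0].message.content)
```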
Collapse
Affiliation(s)
- Hela Cherif
- Faculté de Médecine de Tunis, Université de Tunis El Manar, Tunis, Tunisia
| | - Chirine Moussa
- Faculté de Médecine de Tunis, Université de Tunis El Manar, Tunis, Tunisia
| | | | - Issam Salouage
- Faculté de Médecine de Tunis, Université de Tunis El Manar, Tunis, Tunisia
| | - Salma Mokaddem
- Faculté de Médecine de Tunis, Université de Tunis El Manar, Tunis, Tunisia
| | - Besma Dhahri
- Faculté de Médecine de Tunis, Université de Tunis El Manar, Tunis, Tunisia
| |
Collapse
|
23
|
Wu Q, Wu Q, Li H, Wang Y, Bai Y, Wu Y, Yu X, Li X, Dong P, Xue J, Shen D, Wang M. Evaluating Large Language Models for Automated Reporting and Data Systems Categorization: Cross-Sectional Study. JMIR Med Inform 2024; 12:e55799. [PMID: 39018102 PMCID: PMC11292156 DOI: 10.2196/55799] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/25/2023] [Revised: 02/02/2024] [Accepted: 05/25/2024] [Indexed: 07/18/2024] Open
Abstract
BACKGROUND Large language models show promise for improving radiology workflows, but their performance on structured radiological tasks such as Reporting and Data Systems (RADS) categorization remains unexplored. OBJECTIVE This study aims to evaluate 3 large language model chatbots-Claude-2, GPT-3.5, and GPT-4-on assigning RADS categories to radiology reports and assess the impact of different prompting strategies. METHODS This cross-sectional study compared 3 chatbots using 30 radiology reports (10 per RADS criteria), using a 3-level prompting strategy: zero-shot, few-shot, and guideline PDF-informed prompts. The cases were grounded in Liver Imaging Reporting & Data System (LI-RADS) version 2018, Lung CT (computed tomography) Screening Reporting & Data System (Lung-RADS) version 2022, and Ovarian-Adnexal Reporting & Data System (O-RADS) magnetic resonance imaging, meticulously prepared by board-certified radiologists. Each report underwent 6 assessments. Two blinded reviewers assessed the chatbots' responses for patient-level RADS categorization and overall ratings. The agreement across repetitions was assessed using Fleiss κ. RESULTS Claude-2 achieved the highest accuracy in overall ratings with few-shot prompts and guideline PDFs (prompt-2), attaining 57% (17/30) average accuracy over 6 runs and 50% (15/30) accuracy with k-pass voting. Without prompt engineering, all chatbots performed poorly. The introduction of a structured exemplar prompt (prompt-1) increased the accuracy of overall ratings for all chatbots. Providing prompt-2 further improved Claude-2's performance, an enhancement not replicated by GPT-4. The interrun agreement was substantial for Claude-2 (κ=0.66 for overall rating and κ=0.69 for RADS categorization), fair for GPT-4 (κ=0.39 for both), and fair for GPT-3.5 (κ=0.21 for overall rating and κ=0.39 for RADS categorization). All chatbots showed significantly higher accuracy with LI-RADS version 2018 than with Lung-RADS version 2022 and O-RADS (P<.05); with prompt-2, Claude-2 achieved the highest overall rating accuracy of 75% (45/60) in LI-RADS version 2018. CONCLUSIONS When equipped with structured prompts and guideline PDFs, Claude-2 demonstrated potential in assigning RADS categories to radiology cases according to established criteria such as LI-RADS version 2018. However, the current generation of chatbots lags in accurately categorizing cases based on more recent RADS criteria.
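Two mechanics in this abstract lend themselves to a short sketch: measuring inter-run agreement with Fleiss κ and deriving a k-pass (majority-vote) category from the 6 repeated runs per report. The code below illustrates both under stated assumptions; the 30 × 6 matrix of category codes is randomly generated and is not the study's data.

```python
import numpy as np
from collections import Counter
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(42)
runs = rng.integers(0, 5, size=(30, 6))   # 30 reports x 6 runs, coded RADS categories

counts, _ = aggregate_raters(runs)        # subjects x categories count table
print("inter-run Fleiss kappa:", round(fleiss_kappa(counts), 2))

# k-pass voting: the category assigned most often across the 6 runs of each report.
majority = [int(Counter(row).most_common(1)[0][0]) for row in runs]
print("majority-vote (k-pass) categories for the first 5 reports:", majority[:5])
```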
Collapse
Affiliation(s)
- Qingxia Wu
- Department of Medical Imaging, Henan Provincial People's Hospital & People's Hospital of Zhengzhou University, Zhengzhou, China
| | - Qingxia Wu
- Research Intelligence Department, Beijing United Imaging Research Institute of Intelligent Imaging, Beijing, China
- Research and Collaboration, United Imaging Intelligence (Beijing) Co, Ltd, Beijing, China
| | - Huali Li
- Department of Radiology, Luoyang Central Hospital, Luoyang, China
| | - Yan Wang
- Department of Medical Imaging, Henan Provincial People's Hospital & People's Hospital of Zhengzhou University, Zhengzhou, China
| | - Yan Bai
- Department of Medical Imaging, Henan Provincial People's Hospital & People's Hospital of Zhengzhou University, Zhengzhou, China
| | - Yaping Wu
- Department of Medical Imaging, Henan Provincial People's Hospital & People's Hospital of Zhengzhou University, Zhengzhou, China
| | - Xuan Yu
- Department of Medical Imaging, Henan Provincial People's Hospital & People's Hospital of Zhengzhou University, Zhengzhou, China
| | - Xiaodong Li
- Department of Medical Imaging, Henan Provincial People's Hospital & People's Hospital of Zhengzhou University, Zhengzhou, China
| | - Pei Dong
- Research Intelligence Department, Beijing United Imaging Research Institute of Intelligent Imaging, Beijing, China
- Research and Collaboration, United Imaging Intelligence (Beijing) Co, Ltd, Beijing, China
| | - Jon Xue
- Research and Collaboration, Shanghai United Imaging Intelligence Co, Ltd, Shanghai, China
| | - Dinggang Shen
- Research and Collaboration, Shanghai United Imaging Intelligence Co, Ltd, Shanghai, China
- School of Biomedical Engineering, Shanghai Tech University, Shanghai, China
| | - Meiyun Wang
- Department of Medical Imaging, Henan Provincial People's Hospital & People's Hospital of Zhengzhou University, Zhengzhou, China
- Biomedical Research Institute, Henan Academy of Sciences, Zhengzhou, China
| |
Collapse
|
24
|
Wada A, Akashi T, Shih G, Hagiwara A, Nishizawa M, Hayakawa Y, Kikuta J, Shimoji K, Sano K, Kamagata K, Nakanishi A, Aoki S. Optimizing GPT-4 Turbo Diagnostic Accuracy in Neuroradiology through Prompt Engineering and Confidence Thresholds. Diagnostics (Basel) 2024; 14:1541. [PMID: 39061677 PMCID: PMC11276551 DOI: 10.3390/diagnostics14141541] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/29/2024] [Revised: 07/02/2024] [Accepted: 07/10/2024] [Indexed: 07/28/2024] Open
Abstract
BACKGROUND AND OBJECTIVES Integrating large language models (LLMs) such as GPT-4 Turbo into diagnostic imaging faces a significant challenge, with current misdiagnosis rates ranging from 30% to 50%. This study evaluates how prompt engineering and confidence thresholds can improve diagnostic accuracy in neuroradiology. METHODS We analyzed 751 neuroradiology cases from the American Journal of Neuroradiology using GPT-4 Turbo with customized prompts to improve diagnostic precision. RESULTS Initially, GPT-4 Turbo achieved a baseline diagnostic accuracy of 55.1%. By reformatting responses to list five diagnostic candidates and applying a 90% confidence threshold, the precision of the top diagnosis increased to 72.9%, with the candidate list containing the correct diagnosis in 85.9% of cases, reducing the misdiagnosis rate to 14.1%. However, this threshold reduced the number of cases for which the model provided a response. CONCLUSIONS Strategic prompt engineering and high confidence thresholds significantly reduce misdiagnoses and improve the precision of LLM diagnoses in neuroradiology. More research is needed to optimize these approaches for broader clinical implementation, balancing accuracy and utility.
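The confidence-threshold mechanism described above trades coverage for precision: the top of five ranked candidates is accepted only if its self-reported confidence reaches 90%, otherwise the case goes unanswered. The sketch below is illustrative only; the function name, data structure, and example values are assumptions, not the paper's implementation.

```python
def apply_confidence_threshold(candidates, threshold=0.90):
    """candidates: list of (diagnosis, confidence) pairs, ranked highest first."""
    top_diagnosis, top_confidence = candidates[0]
    if top_confidence >= threshold:
        return top_diagnosis   # counted toward precision
    return None                # case withheld: no response given

# Hypothetical model output for a single case.
example = [("glioblastoma", 0.93), ("CNS lymphoma", 0.04), ("metastasis", 0.02),
           ("abscess", 0.007), ("tumefactive demyelination", 0.003)]
print(apply_confidence_threshold(example))   # -> "glioblastoma"
```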
Collapse
Affiliation(s)
- Akihiko Wada
- Department of Radiology, Juntendo University Graduate School of Medicine, Tokyo 113-8421, Japan
| | - Toshiaki Akashi
- Department of Radiology, Juntendo University Graduate School of Medicine, Tokyo 113-8421, Japan
| | - George Shih
- Clinical Radiology, Weill Cornell Medical College, New York, NY 10065, USA
| | - Akifumi Hagiwara
- Department of Radiology, Juntendo University Graduate School of Medicine, Tokyo 113-8421, Japan
| | - Mitsuo Nishizawa
- Department of Radiology, Juntendo University Graduate School of Medicine, Tokyo 113-8421, Japan
| | - Yayoi Hayakawa
- Department of Radiology, Juntendo University Graduate School of Medicine, Tokyo 113-8421, Japan
| | - Junko Kikuta
- Department of Radiology, Juntendo University Graduate School of Medicine, Tokyo 113-8421, Japan
| | - Keigo Shimoji
- Department of Radiology, Juntendo University Graduate School of Medicine, Tokyo 113-8421, Japan
| | - Katsuhiro Sano
- Department of Radiology, Juntendo University Graduate School of Medicine, Tokyo 113-8421, Japan
| | - Koji Kamagata
- Department of Radiology, Juntendo University Graduate School of Medicine, Tokyo 113-8421, Japan
| | - Atsushi Nakanishi
- Department of Radiology, Juntendo University Graduate School of Medicine, Tokyo 113-8421, Japan
| | - Shigeki Aoki
- Department of Radiology, Juntendo University Graduate School of Medicine, Tokyo 113-8421, Japan
| |
Collapse
|
25
|
Builoff V, Shanbhag A, Miller RJ, Dey D, Liang JX, Flood K, Bourque JM, Chareonthaitawee P, Phillips LM, Slomka PJ. Evaluating AI Proficiency in Nuclear Cardiology: Large Language Models take on the Board Preparation Exam. MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2024:2024.07.16.24310297. [PMID: 39072028 PMCID: PMC11275690 DOI: 10.1101/2024.07.16.24310297] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 07/30/2024]
Abstract
Background Previous studies evaluated the ability of large language models (LLMs) in medical disciplines; however, few have focused on image analysis, and none specifically on cardiovascular imaging or nuclear cardiology. Objectives This study assesses four LLMs - GPT-4, GPT-4 Turbo, GPT-4 omni (GPT-4o) (OpenAI), and Gemini (Google Inc.) - in responding to questions from the 2023 American Society of Nuclear Cardiology Board Preparation Exam, reflecting the scope of the Certification Board of Nuclear Cardiology (CBNC) examination. Methods We used 168 questions: 141 text-only and 27 image-based, categorized into four sections mirroring the CBNC exam. Each LLM was presented with the same standardized prompt and applied to each section 30 times to account for stochasticity. Performance over six weeks was assessed for all models except GPT-4o. McNemar's test compared correct response proportions. Results GPT-4, Gemini, GPT-4 Turbo, and GPT-4o correctly answered a median of 56.8% (95% confidence interval 55.4%-58.0%), 40.5% (39.9%-42.9%), 60.7% (59.9%-61.3%), and 63.1% (62.5%-64.3%) of questions, respectively. GPT-4o significantly outperformed the other models (p=0.007 vs. GPT-4 Turbo; p<0.001 vs. GPT-4 and Gemini). GPT-4o excelled on text-only questions compared to GPT-4, Gemini, and GPT-4 Turbo (p<0.001, p<0.001, and p=0.001), while Gemini performed worse on image-based questions (p<0.001 for all). Conclusion GPT-4o demonstrated superior performance among the four LLMs, achieving scores likely within or just outside the range required to pass a test akin to the CBNC examination. Although improvements in medical image interpretation are needed, GPT-4o shows potential to support physicians in answering text-based clinical questions.
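The design above repeats each question set 30 times to absorb the stochasticity of LLM outputs and then reports median performance with an interval. The sketch below shows one way such repeated-run results could be summarized; it is not the authors' code, and the correct-answer counts are simulated, with a percentile interval standing in for whatever interval method the study actually used.

```python
import numpy as np

def summarize_runs(per_run_correct, n_questions):
    """Median accuracy and a 95% percentile interval across repeated runs."""
    accuracy = np.asarray(per_run_correct) / n_questions * 100
    low, median, high = np.percentile(accuracy, [2.5, 50, 97.5])
    return median, (low, high)

rng = np.random.default_rng(7)
correct_counts = rng.binomial(n=168, p=0.60, size=30)  # 30 simulated runs of 168 questions
median_acc, interval = summarize_runs(correct_counts, 168)
print(f"median accuracy {median_acc:.1f}% (interval {interval[0]:.1f}-{interval[1]:.1f}%)")
```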
Collapse
|
26
|
Horiuchi D, Tatekawa H, Oura T, Shimono T, Walston SL, Takita H, Matsushita S, Mitsuyama Y, Miki Y, Ueda D. ChatGPT's diagnostic performance based on textual vs. visual information compared to radiologists' diagnostic performance in musculoskeletal radiology. Eur Radiol 2024:10.1007/s00330-024-10902-5. [PMID: 38995378 DOI: 10.1007/s00330-024-10902-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/23/2024] [Revised: 05/02/2024] [Accepted: 06/24/2024] [Indexed: 07/13/2024]
Abstract
OBJECTIVES To compare the diagnostic accuracy of Generative Pre-trained Transformer (GPT)-4-based ChatGPT, GPT-4 with vision (GPT-4V) based ChatGPT, and radiologists in musculoskeletal radiology. MATERIALS AND METHODS We included 106 "Test Yourself" cases from Skeletal Radiology between January 2014 and September 2023. We input the medical history and imaging findings into GPT-4-based ChatGPT and the medical history and images into GPT-4V-based ChatGPT, then both generated a diagnosis for each case. Two radiologists (a radiology resident and a board-certified radiologist) independently provided diagnoses for all cases. The diagnostic accuracy rates were determined based on the published ground truth. Chi-square tests were performed to compare the diagnostic accuracy of GPT-4-based ChatGPT, GPT-4V-based ChatGPT, and radiologists. RESULTS GPT-4-based ChatGPT significantly outperformed GPT-4V-based ChatGPT (p < 0.001) with accuracy rates of 43% (46/106) and 8% (9/106), respectively. The radiology resident and the board-certified radiologist achieved accuracy rates of 41% (43/106) and 53% (56/106). The diagnostic accuracy of GPT-4-based ChatGPT was comparable to that of the radiology resident, but was lower than that of the board-certified radiologist although the differences were not significant (p = 0.78 and 0.22, respectively). The diagnostic accuracy of GPT-4V-based ChatGPT was significantly lower than those of both radiologists (p < 0.001 and < 0.001, respectively). CONCLUSION GPT-4-based ChatGPT demonstrated significantly higher diagnostic accuracy than GPT-4V-based ChatGPT. While GPT-4-based ChatGPT's diagnostic performance was comparable to radiology residents, it did not reach the performance level of board-certified radiologists in musculoskeletal radiology. CLINICAL RELEVANCE STATEMENT GPT-4-based ChatGPT outperformed GPT-4V-based ChatGPT and was comparable to radiology residents, but it did not reach the level of board-certified radiologists in musculoskeletal radiology. Radiologists should comprehend ChatGPT's current performance as a diagnostic tool for optimal utilization. KEY POINTS This study compared the diagnostic performance of GPT-4-based ChatGPT, GPT-4V-based ChatGPT, and radiologists in musculoskeletal radiology. GPT-4-based ChatGPT was comparable to radiology residents, but did not reach the level of board-certified radiologists. When utilizing ChatGPT, it is crucial to input appropriate descriptions of imaging findings rather than the images.
Collapse
Affiliation(s)
- Daisuke Horiuchi
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
| | - Hiroyuki Tatekawa
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
| | - Tatsushi Oura
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
| | - Taro Shimono
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
| | - Shannon L Walston
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
| | - Hirotaka Takita
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
| | - Shu Matsushita
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
| | - Yasuhito Mitsuyama
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
| | - Yukio Miki
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
| | - Daiju Ueda
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan.
- Department of Artificial Intelligence, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan.
| |
Collapse
|
27
|
Haider SA, Pressman SM, Borna S, Gomez-Cabello CA, Sehgal A, Leibovich BC, Forte AJ. Evaluating Large Language Model (LLM) Performance on Established Breast Classification Systems. Diagnostics (Basel) 2024; 14:1491. [PMID: 39061628 PMCID: PMC11275570 DOI: 10.3390/diagnostics14141491] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/12/2024] [Revised: 06/25/2024] [Accepted: 07/09/2024] [Indexed: 07/28/2024] Open
Abstract
Medical researchers are increasingly utilizing advanced LLMs like ChatGPT-4 and Gemini to enhance diagnostic processes in the medical field. This research focuses on their ability to comprehend and apply complex medical classification systems for breast conditions, which can significantly aid plastic surgeons in making informed decisions for diagnosis and treatment, ultimately leading to improved patient outcomes. Fifty clinical scenarios were created to evaluate the classification accuracy of each LLM across five established breast-related classification systems. Scores from 0 to 2 were assigned to LLM responses to denote incorrect, partially correct, or completely correct classifications. Descriptive statistics were employed to compare the performances of ChatGPT-4 and Gemini. Gemini exhibited superior overall performance, achieving 98% accuracy compared to ChatGPT-4's 71%. While both models performed well in the Baker classification for capsular contracture and UTSW classification for gynecomastia, Gemini consistently outperformed ChatGPT-4 in other systems, such as the Fischer Grade Classification for gender-affirming mastectomy, Kajava Classification for ectopic breast tissue, and Regnault Classification for breast ptosis. With further development, integrating LLMs into plastic surgery practice will likely enhance diagnostic support and decision making.
Collapse
Affiliation(s)
- Syed Ali Haider
- Division of Plastic Surgery, Mayo Clinic, Jacksonville, FL 32224, USA
| | | | - Sahar Borna
- Division of Plastic Surgery, Mayo Clinic, Jacksonville, FL 32224, USA
| | | | - Ajai Sehgal
- Center for Digital Health, Mayo Clinic, Rochester, MN 55905, USA
| | - Bradley C. Leibovich
- Center for Digital Health, Mayo Clinic, Rochester, MN 55905, USA
- Department of Urology, Mayo Clinic, Rochester, MN 55905, USA
| | - Antonio Jorge Forte
- Division of Plastic Surgery, Mayo Clinic, Jacksonville, FL 32224, USA
- Center for Digital Health, Mayo Clinic, Rochester, MN 55905, USA
| |
Collapse
|
28
|
Sacoransky E, Kwan BYM, Soboleski D. ChatGPT and assistive AI in structured radiology reporting: A systematic review. Curr Probl Diagn Radiol 2024:S0363-0188(24)00113-0. [PMID: 39004580 DOI: 10.1067/j.cpradiol.2024.07.007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/18/2024] [Revised: 06/08/2024] [Accepted: 07/08/2024] [Indexed: 07/16/2024]
Abstract
INTRODUCTION The rise of transformer-based large language models (LLMs), such as ChatGPT, has captured global attention with recent advancements in artificial intelligence (AI). ChatGPT demonstrates growing potential in structured radiology reporting, a field where AI has traditionally focused on image analysis. METHODS A comprehensive search of MEDLINE and Embase was conducted from inception through May 2024, and primary studies discussing ChatGPT's role in structured radiology reporting were selected based on their content. RESULTS Of the 268 articles screened, eight were ultimately included in this review. These articles explored various applications of ChatGPT, such as generating structured reports from unstructured reports, extracting data from free text, generating impressions from radiology findings and creating structured reports from imaging data. All studies demonstrated optimism regarding ChatGPT's potential to aid radiologists, though common critiques included data privacy concerns, reliability, medical errors, and lack of medical-specific training. CONCLUSION ChatGPT and assistive AI have significant potential to transform radiology reporting, enhancing accuracy and standardization while optimizing healthcare resources. Future developments may involve integrating dynamic few-shot prompting, ChatGPT, and Retrieval Augmented Generation (RAG) into diagnostic workflows. Continued research, development, and ethical oversight are crucial to fully realize AI's potential in radiology.
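The conclusion mentions dynamic few-shot prompting and retrieval-augmented generation as future directions. The toy example below is entirely hypothetical and only sketches the general pattern: retrieve the most similar prior report with TF-IDF and prepend it to the prompt as a dynamic few-shot example before asking an LLM to produce a structured report. The library choices and report text are assumptions, not drawn from the reviewed studies.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

prior_reports = [
    "CT chest: 6 mm solid nodule, right upper lobe ...",
    "MRI brain: no acute infarct; chronic small-vessel changes ...",
]
new_findings = "CT chest: 8 mm part-solid nodule, left lower lobe ..."

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(prior_reports + [new_findings])
similarities = cosine_similarity(matrix)[-1, :-1]   # new findings vs each prior report
example = prior_reports[similarities.argmax()]

prompt = ("Convert the findings into a structured radiology report.\n"
          f"Example report:\n{example}\n\n"
          f"Findings:\n{new_findings}")
print(prompt)   # this prompt would then be sent to the chosen LLM
```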
Collapse
Affiliation(s)
- Ethan Sacoransky
- Queen's University School of Medicine, 15 Arch St, Kingston, ON K7L 3L4, Canada.
| | - Benjamin Y M Kwan
- Queen's University School of Medicine, 15 Arch St, Kingston, ON K7L 3L4, Canada; Department of Diagnostic Radiology, Kingston Health Sciences Centre, Kingston, ON, Canada
| | - Donald Soboleski
- Queen's University School of Medicine, 15 Arch St, Kingston, ON K7L 3L4, Canada; Department of Diagnostic Radiology, Kingston Health Sciences Centre, Kingston, ON, Canada
| |
Collapse
|
29
|
Keshavarz P, Bagherieh S, Nabipoorashrafi SA, Chalian H, Rahsepar AA, Kim GHJ, Hassani C, Raman SS, Bedayat A. ChatGPT in radiology: A systematic review of performance, pitfalls, and future perspectives. Diagn Interv Imaging 2024; 105:251-265. [PMID: 38679540 DOI: 10.1016/j.diii.2024.04.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/31/2024] [Revised: 03/11/2024] [Accepted: 04/16/2024] [Indexed: 05/01/2024]
Abstract
PURPOSE The purpose of this study was to systematically review the reported performances of ChatGPT, identify potential limitations, and explore future directions for its integration, optimization, and ethical considerations in radiology applications. MATERIALS AND METHODS After a comprehensive review of PubMed, Web of Science, Embase, and Google Scholar databases, a cohort of published studies was identified up to January 1, 2024, utilizing ChatGPT for clinical radiology applications. RESULTS Out of 861 studies derived, 44 studies evaluated the performance of ChatGPT; among these, 37 (37/44; 84.1%) demonstrated high performance, and seven (7/44; 15.9%) indicated it had a lower performance in providing information on diagnosis and clinical decision support (6/44; 13.6%) and patient communication and educational content (1/44; 2.3%). Twenty-four (24/44; 54.5%) studies reported the proportion of ChatGPT's performance. Among these, 19 (19/24; 79.2%) studies recorded a median accuracy of 70.5%, and in five (5/24; 20.8%) studies, there was a median agreement of 83.6% between ChatGPT outcomes and reference standards [radiologists' decision or guidelines], generally confirming ChatGPT's high accuracy in these studies. Eleven studies compared two recent ChatGPT versions, and in ten (10/11; 90.9%), ChatGPTv4 outperformed v3.5, showing notable enhancements in addressing higher-order thinking questions, better comprehension of radiology terms, and improved accuracy in describing images. Risks and concerns about using ChatGPT included biased responses, limited originality, and the potential for inaccurate information leading to misinformation, hallucinations, improper citations and fake references, cybersecurity vulnerabilities, and patient privacy risks. CONCLUSION Although ChatGPT's effectiveness has been shown in 84.1% of radiology studies, there are still multiple pitfalls and limitations to address. It is too soon to confirm its complete proficiency and accuracy, and more extensive multicenter studies utilizing diverse datasets and pre-training techniques are required to verify ChatGPT's role in radiology.
Collapse
Affiliation(s)
- Pedram Keshavarz
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA; School of Science and Technology, The University of Georgia, Tbilisi 0171, Georgia
| | - Sara Bagherieh
- Independent Clinical Radiology Researcher, Los Angeles, CA 90024, USA
| | | | - Hamid Chalian
- Department of Radiology, Cardiothoracic Imaging, University of Washington, Seattle, WA 98195, USA
| | - Amir Ali Rahsepar
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA
| | - Grace Hyun J Kim
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA; Department of Radiological Sciences, Center for Computer Vision and Imaging Biomarkers, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA
| | - Cameron Hassani
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA
| | - Steven S Raman
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA
| | - Arash Bedayat
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA.
| |
Collapse
|
30
|
Nishino M, Ballard DH. Multimodal Large Language Models to Solve Image-based Diagnostic Challenges: The Next Big Wave is Already Here. Radiology 2024; 312:e241379. [PMID: 38980181 DOI: 10.1148/radiol.241379] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/10/2024]
Affiliation(s)
- Mizuki Nishino
- From the Department of Radiology, Brigham and Women's Hospital and Dana-Farber Cancer Institute, 450 Brookline Ave, Boston MA 02215 (M.N.); and Mallinckrodt Institute of Radiology, Washington University School of Medicine, St Louis, Mo (D.H.B.)
| | - David H Ballard
- From the Department of Radiology, Brigham and Women's Hospital and Dana-Farber Cancer Institute, 450 Brookline Ave, Boston MA 02215 (M.N.); and Mallinckrodt Institute of Radiology, Washington University School of Medicine, St Louis, Mo (D.H.B.)
| |
Collapse
|
31
|
Payne DL, Purohit K, Borrero WM, Chung K, Hao M, Mpoy M, Jin M, Prasanna P, Hill V. Performance of GPT-4 on the American College of Radiology In-training Examination: Evaluating Accuracy, Model Drift, and Fine-tuning. Acad Radiol 2024; 31:3046-3054. [PMID: 38653599 DOI: 10.1016/j.acra.2024.04.006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/28/2024] [Revised: 04/01/2024] [Accepted: 04/06/2024] [Indexed: 04/25/2024]
Abstract
RATIONALE AND OBJECTIVES In our study, we evaluate GPT-4's performance on the American College of Radiology (ACR) 2022 Diagnostic Radiology In-Training Examination (DXIT). We perform multiple experiments across time points to assess for model drift, as well as after fine-tuning to assess for differences in accuracy. MATERIALS AND METHODS Questions were sequentially input into GPT-4 with a standardized prompt. Each answer was recorded and overall accuracy was calculated, as were logic-adjusted accuracy and accuracy on image-based questions. This experiment was repeated several months later to assess for model drift, then again after fine-tuning to assess for changes in performance. RESULTS GPT-4 achieved 58.5% overall accuracy, lower than the PGY-3 average (61.9%) but higher than the PGY-2 average (52.8%). Adjusted accuracy was 52.8%. GPT-4 showed significantly higher (p = 0.012) confidence for correct answers (87.1%) compared to incorrect answers (84.0%). Performance on image-based questions was significantly poorer (p < 0.001) at 45.4% compared to text-only questions (80.0%), with adjusted accuracy for image-based questions of 36.4%. When the questions were repeated, GPT-4 chose a different answer 25.5% of the time, with no change in accuracy. Fine-tuning did not improve accuracy. CONCLUSION GPT-4 performed between PGY-2 and PGY-3 levels on the 2022 DXIT, significantly poorer on image-based questions, and with large variability in answer choices across time points. Exploratory experiments in fine-tuning did not improve performance. This study underscores the potential and risks of using minimally-prompted general AI models in interpreting radiologic images as a diagnostic tool. Implementers of general AI radiology systems should exercise caution given the possibility of spurious yet confident responses.
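One analysis above compares the model's self-reported confidence on correctly versus incorrectly answered questions. The sketch below illustrates that comparison under assumed data structures with simulated values; it is not the study's pipeline, and a nonparametric test is used purely for illustration since the abstract does not name the test behind p = 0.012.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)
confidence = rng.uniform(70, 100, size=100)           # hypothetical self-rated confidence (%)
correct = rng.integers(0, 2, size=100).astype(bool)   # hypothetical per-question correctness

conf_correct, conf_incorrect = confidence[correct], confidence[~correct]
stat, p = mannwhitneyu(conf_correct, conf_incorrect, alternative="two-sided")
print(f"mean confidence: correct {conf_correct.mean():.1f}%, "
      f"incorrect {conf_incorrect.mean():.1f}%, p = {p:.3f}")
```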
Collapse
Affiliation(s)
- David L Payne
- Stony Brook University Hospital Department of Radiology, 101 Nicolls Road, Stony Brook, New York 11794, USA (D.L.P., K.P., W.M.B., K.C., M.H., M.M., M.J.); Stony Brook University Department of Biomedical Informatics, 1 Lauterbur Drive, Stony Brook, New York 11794, USA (D.L.P., P.P.).
| | - Kush Purohit
- Stony Brook University Hospital Department of Radiology, 101 Nicolls Road, Stony Brook, New York 11794, USA (D.L.P., K.P., W.M.B., K.C., M.H., M.M., M.J.)
| | - Walter Morales Borrero
- Stony Brook University Hospital Department of Radiology, 101 Nicolls Road, Stony Brook, New York 11794, USA (D.L.P., K.P., W.M.B., K.C., M.H., M.M., M.J.)
| | - Katherine Chung
- Stony Brook University Hospital Department of Radiology, 101 Nicolls Road, Stony Brook, New York 11794, USA (D.L.P., K.P., W.M.B., K.C., M.H., M.M., M.J.)
| | - Max Hao
- Stony Brook University Hospital Department of Radiology, 101 Nicolls Road, Stony Brook, New York 11794, USA (D.L.P., K.P., W.M.B., K.C., M.H., M.M., M.J.)
| | - Mutshipay Mpoy
- Stony Brook University Hospital Department of Radiology, 101 Nicolls Road, Stony Brook, New York 11794, USA (D.L.P., K.P., W.M.B., K.C., M.H., M.M., M.J.)
| | - Michael Jin
- Stony Brook University Hospital Department of Radiology, 101 Nicolls Road, Stony Brook, New York 11794, USA (D.L.P., K.P., W.M.B., K.C., M.H., M.M., M.J.)
| | - Prateek Prasanna
- Stony Brook University Department of Biomedical Informatics, 1 Lauterbur Drive, Stony Brook, New York 11794, USA (D.L.P., P.P.)
| | - Virginia Hill
- Northwestern University Feinberg School of Medicine Department of Radiology, 676 North Clair Street, Chicago, Illinois 60611, USA (V.H.)
| |
Collapse
|
32
|
McIlvain G, Oechtering TH, Shammi UA, Bhayana R, Hutter J, Moy L, Schweitzer M. Chatbots for Literature Review and Research-Insights from a Panel Discussion at the Annual Meeting of the International Society of Magnetic Resonance in Medicine (ISMRM) 2023. J Magn Reson Imaging 2024; 60:390-392. [PMID: 37795851 DOI: 10.1002/jmri.29036] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/22/2023] [Revised: 09/07/2023] [Accepted: 09/19/2023] [Indexed: 10/06/2023] Open
Affiliation(s)
- Grace McIlvain
- Department of Biomedical Engineering, Columbia University, New York City, New York, USA
| | - Thekla H Oechtering
- Department of Radiology, University of Wisconsin School of Medicine and Public Health, Madison, Wisconsin, USA
- Department of Radiology and Nuclear Medicine, University of Luebeck, Lübeck, Germany
| | - Ummul Afia Shammi
- Chemical and Biomedical Engineering, University of Missouri, Columbia, Missouri, USA
| | - Rajesh Bhayana
- Department of Medical Imaging, University Health Network Mount Sinai Hospital and Women's College Hospital University of Toronto, Toronto, Ontario, Canada
| | - Jana Hutter
- Centre for the Developing Brain, King's College London, UK
| | - Linda Moy
- Department of Radiology, New York University School of Medicine, New York City, New York, USA
| | - Mark Schweitzer
- Wayne State University School of Medicine, Detroit, Michigan, USA
| |
Collapse
|
33
|
Le Guellec B, Lefèvre A, Geay C, Shorten L, Bruge C, Hacein-Bey L, Amouyel P, Pruvo JP, Kuchcinski G, Hamroun A. Performance of an Open-Source Large Language Model in Extracting Information from Free-Text Radiology Reports. Radiol Artif Intell 2024; 6:e230364. [PMID: 38717292 PMCID: PMC11294959 DOI: 10.1148/ryai.230364] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/31/2023] [Revised: 03/08/2024] [Accepted: 04/23/2024] [Indexed: 06/21/2024]
Abstract
Purpose To assess the performance of a local open-source large language model (LLM) in various information extraction tasks from real-life emergency brain MRI reports. Materials and Methods All consecutive emergency brain MRI reports written in 2022 from a French quaternary center were retrospectively reviewed. Two radiologists identified MRI scans that were performed in the emergency department for headaches. Four radiologists scored the reports' conclusions as either normal or abnormal. Abnormalities were labeled as either headache-causing or incidental. Vicuna (LMSYS Org), an open-source LLM, performed the same tasks. Vicuna's performance metrics were evaluated using the radiologists' consensus as the reference standard. Results Among the 2398 reports during the study period, radiologists identified 595 that included headaches in the indication (median age of patients, 35 years [IQR, 26-51 years]; 68% [403 of 595] women). A positive finding was reported in 227 of 595 (38%) cases, 136 of which could explain the headache. The LLM had a sensitivity of 98.0% (95% CI: 96.5, 99.0) and specificity of 99.3% (95% CI: 98.8, 99.7) for detecting the presence of headache in the clinical context, a sensitivity of 99.4% (95% CI: 98.3, 99.9) and specificity of 98.6% (95% CI: 92.2, 100.0) for the use of contrast medium injection, a sensitivity of 96.0% (95% CI: 92.5, 98.2) and specificity of 98.9% (95% CI: 97.2, 99.7) for study categorization as either normal or abnormal, and a sensitivity of 88.2% (95% CI: 81.6, 93.1) and specificity of 73% (95% CI: 62, 81) for causal inference between MRI findings and headache. Conclusion An open-source LLM was able to extract information from free-text radiology reports with excellent accuracy without requiring further training. Keywords: Large Language Model (LLM), Generative Pretrained Transformers (GPT), Open Source, Information Extraction, Report, Brain, MRI. Supplemental material is available for this article. Published under a CC BY 4.0 license. See also the commentary by Akinci D'Antonoli and Bluethgen in this issue.
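The metrics reported above are sensitivities and specificities with confidence intervals, computed against the radiologists' consensus. The sketch below shows one way such figures could be derived from a confusion matrix; it is not the authors' pipeline, and the counts are hypothetical stand-ins rather than the study's data.

```python
from scipy.stats import binomtest

def sens_spec(tp, fn, tn, fp):
    """Return exact binomial test results for sensitivity and specificity."""
    sensitivity = binomtest(tp, tp + fn)   # TP / (TP + FN)
    specificity = binomtest(tn, tn + fp)   # TN / (TN + FP)
    return sensitivity, specificity

# Hypothetical confusion-matrix counts for one extraction task.
sens, spec = sens_spec(tp=134, fn=2, tn=455, fp=4)
for name, res in [("sensitivity", sens), ("specificity", spec)]:
    ci = res.proportion_ci(confidence_level=0.95)
    print(f"{name}: {res.statistic:.3f} (95% CI {ci.low:.3f}-{ci.high:.3f})")
```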
Collapse
Affiliation(s)
- Bastien Le Guellec
- From the Department of Neuroradiology (B.L.G., A.L., C.B., J.P.P., G.K.), Department of Public Health (B.L.G., P.A., A.H.), and INclude Health Data Warehouse (C.G., L.S.), CHU Lille–Université Lille, Rue Emile Laine, 59000 Lille, France; Department of Radiology, UC Davis Health, Sacramento, Calif (L.H.B.); Université Lille, INSERM, CHU Lille, Institut Pasteur de Lille, U1167-RID-AGE - Facteurs de risque et déterminants moléculaires des maladies liées au vieillissement, Lille, France (P.A., A.H.); INSERM, U1172–LilNCog-Lille Neuroscience & Cognition, Université Lille, Lille, France (J.P.P., G.K.); and UAR 2014-US 41-PLBS–Plateformes Lilloises en Biologie & Santé, Université Lille, Lille, France (J.P.P., G.K.)
| | - Alexandre Lefèvre
- From the Department of Neuroradiology (B.L.G., A.L., C.B., J.P.P., G.K.), Department of Public Health (B.L.G., P.A., A.H.), and INclude Health Data Warehouse (C.G., L.S.), CHU Lille–Université Lille, Rue Emile Laine, 59000 Lille, France; Department of Radiology, UC Davis Health, Sacramento, Calif (L.H.B.); Université Lille, INSERM, CHU Lille, Institut Pasteur de Lille, U1167-RID-AGE - Facteurs de risque et déterminants moléculaires des maladies liées au vieillissement, Lille, France (P.A., A.H.); INSERM, U1172–LilNCog-Lille Neuroscience & Cognition, Université Lille, Lille, France (J.P.P., G.K.); and UAR 2014-US 41-PLBS–Plateformes Lilloises en Biologie & Santé, Université Lille, Lille, France (J.P.P., G.K.)
| | - Charlotte Geay
- From the Department of Neuroradiology (B.L.G., A.L., C.B., J.P.P., G.K.), Department of Public Health (B.L.G., P.A., A.H.), and INclude Health Data Warehouse (C.G., L.S.), CHU Lille–Université Lille, Rue Emile Laine, 59000 Lille, France; Department of Radiology, UC Davis Health, Sacramento, Calif (L.H.B.); Université Lille, INSERM, CHU Lille, Institut Pasteur de Lille, U1167-RID-AGE - Facteurs de risque et déterminants moléculaires des maladies liées au vieillissement, Lille, France (P.A., A.H.); INSERM, U1172–LilNCog-Lille Neuroscience & Cognition, Université Lille, Lille, France (J.P.P., G.K.); and UAR 2014-US 41-PLBS–Plateformes Lilloises en Biologie & Santé, Université Lille, Lille, France (J.P.P., G.K.)
| | - Lucas Shorten
- From the Department of Neuroradiology (B.L.G., A.L., C.B., J.P.P., G.K.), Department of Public Health (B.L.G., P.A., A.H.), and INclude Health Data Warehouse (C.G., L.S.), CHU Lille–Université Lille, Rue Emile Laine, 59000 Lille, France; Department of Radiology, UC Davis Health, Sacramento, Calif (L.H.B.); Université Lille, INSERM, CHU Lille, Institut Pasteur de Lille, U1167-RID-AGE - Facteurs de risque et déterminants moléculaires des maladies liées au vieillissement, Lille, France (P.A., A.H.); INSERM, U1172–LilNCog-Lille Neuroscience & Cognition, Université Lille, Lille, France (J.P.P., G.K.); and UAR 2014-US 41-PLBS–Plateformes Lilloises en Biologie & Santé, Université Lille, Lille, France (J.P.P., G.K.)
| | - Cyril Bruge
- From the Department of Neuroradiology (B.L.G., A.L., C.B., J.P.P., G.K.), Department of Public Health (B.L.G., P.A., A.H.), and INclude Health Data Warehouse (C.G., L.S.), CHU Lille–Université Lille, Rue Emile Laine, 59000 Lille, France; Department of Radiology, UC Davis Health, Sacramento, Calif (L.H.B.); Université Lille, INSERM, CHU Lille, Institut Pasteur de Lille, U1167-RID-AGE - Facteurs de risque et déterminants moléculaires des maladies liées au vieillissement, Lille, France (P.A., A.H.); INSERM, U1172–LilNCog-Lille Neuroscience & Cognition, Université Lille, Lille, France (J.P.P., G.K.); and UAR 2014-US 41-PLBS–Plateformes Lilloises en Biologie & Santé, Université Lille, Lille, France (J.P.P., G.K.)
| | - Lotfi Hacein-Bey
- From the Department of Neuroradiology (B.L.G., A.L., C.B., J.P.P., G.K.), Department of Public Health (B.L.G., P.A., A.H.), and INclude Health Data Warehouse (C.G., L.S.), CHU Lille–Université Lille, Rue Emile Laine, 59000 Lille, France; Department of Radiology, UC Davis Health, Sacramento, Calif (L.H.B.); Université Lille, INSERM, CHU Lille, Institut Pasteur de Lille, U1167-RID-AGE - Facteurs de risque et déterminants moléculaires des maladies liées au vieillissement, Lille, France (P.A., A.H.); INSERM, U1172–LilNCog-Lille Neuroscience & Cognition, Université Lille, Lille, France (J.P.P., G.K.); and UAR 2014-US 41-PLBS–Plateformes Lilloises en Biologie & Santé, Université Lille, Lille, France (J.P.P., G.K.)
| | - Philippe Amouyel
- From the Department of Neuroradiology (B.L.G., A.L., C.B., J.P.P., G.K.), Department of Public Health (B.L.G., P.A., A.H.), and INclude Health Data Warehouse (C.G., L.S.), CHU Lille–Université Lille, Rue Emile Laine, 59000 Lille, France; Department of Radiology, UC Davis Health, Sacramento, Calif (L.H.B.); Université Lille, INSERM, CHU Lille, Institut Pasteur de Lille, U1167-RID-AGE - Facteurs de risque et déterminants moléculaires des maladies liées au vieillissement, Lille, France (P.A., A.H.); INSERM, U1172–LilNCog-Lille Neuroscience & Cognition, Université Lille, Lille, France (J.P.P., G.K.); and UAR 2014-US 41-PLBS–Plateformes Lilloises en Biologie & Santé, Université Lille, Lille, France (J.P.P., G.K.)
| | - Jean-Pierre Pruvo
- From the Department of Neuroradiology (B.L.G., A.L., C.B., J.P.P., G.K.), Department of Public Health (B.L.G., P.A., A.H.), and INclude Health Data Warehouse (C.G., L.S.), CHU Lille–Université Lille, Rue Emile Laine, 59000 Lille, France; Department of Radiology, UC Davis Health, Sacramento, Calif (L.H.B.); Université Lille, INSERM, CHU Lille, Institut Pasteur de Lille, U1167-RID-AGE - Facteurs de risque et déterminants moléculaires des maladies liées au vieillissement, Lille, France (P.A., A.H.); INSERM, U1172–LilNCog-Lille Neuroscience & Cognition, Université Lille, Lille, France (J.P.P., G.K.); and UAR 2014-US 41-PLBS–Plateformes Lilloises en Biologie & Santé, Université Lille, Lille, France (J.P.P., G.K.)
| | - Gregory Kuchcinski
- From the Department of Neuroradiology (B.L.G., A.L., C.B., J.P.P., G.K.), Department of Public Health (B.L.G., P.A., A.H.), and INclude Health Data Warehouse (C.G., L.S.), CHU Lille–Université Lille, Rue Emile Laine, 59000 Lille, France; Department of Radiology, UC Davis Health, Sacramento, Calif (L.H.B.); Université Lille, INSERM, CHU Lille, Institut Pasteur de Lille, U1167-RID-AGE - Facteurs de risque et déterminants moléculaires des maladies liées au vieillissement, Lille, France (P.A., A.H.); INSERM, U1172–LilNCog-Lille Neuroscience & Cognition, Université Lille, Lille, France (J.P.P., G.K.); and UAR 2014-US 41-PLBS–Plateformes Lilloises en Biologie & Santé, Université Lille, Lille, France (J.P.P., G.K.)
| | - Aghiles Hamroun
- From the Department of Neuroradiology (B.L.G., A.L., C.B., J.P.P., G.K.), Department of Public Health (B.L.G., P.A., A.H.), and INclude Health Data Warehouse (C.G., L.S.), CHU Lille–Université Lille, Rue Emile Laine, 59000 Lille, France; Department of Radiology, UC Davis Health, Sacramento, Calif (L.H.B.); Université Lille, INSERM, CHU Lille, Institut Pasteur de Lille, U1167-RID-AGE - Facteurs de risque et déterminants moléculaires des maladies liées au vieillissement, Lille, France (P.A., A.H.); INSERM, U1172–LilNCog-Lille Neuroscience & Cognition, Université Lille, Lille, France (J.P.P., G.K.); and UAR 2014-US 41-PLBS–Plateformes Lilloises en Biologie & Santé, Université Lille, Lille, France (J.P.P., G.K.)
| |
Collapse
|
34
|
Suh PS, Shim WH, Suh CH, Heo H, Park CR, Eom HJ, Park KJ, Choe J, Kim PH, Park HJ, Ahn Y, Park HY, Choi Y, Woo CY, Park H. Comparing Diagnostic Accuracy of Radiologists versus GPT-4V and Gemini Pro Vision Using Image Inputs from Diagnosis Please Cases. Radiology 2024; 312:e240273. [PMID: 38980179 DOI: 10.1148/radiol.240273] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/10/2024]
Abstract
Background The diagnostic abilities of multimodal large language models (LLMs) using direct image inputs and the impact of the temperature parameter of LLMs remain unexplored. Purpose To investigate the ability of GPT-4V and Gemini Pro Vision in generating differential diagnoses at different temperatures compared with radiologists using Radiology Diagnosis Please cases. Materials and Methods This retrospective study included Diagnosis Please cases published from January 2008 to October 2023. Input images included original images and captures of the textual patient history and figure legends (without imaging findings) from PDF files of each case. The LLMs were tasked with providing three differential diagnoses, repeated five times at temperatures 0, 0.5, and 1. Eight subspecialty-trained radiologists solved cases. An experienced radiologist compared generated and final diagnoses, considering the result correct if the generated diagnoses included the final diagnosis after five repetitions. Accuracy was assessed across models, temperatures, and radiology subspecialties, with statistical significance set at P < .007 after Bonferroni correction for multiple comparisons across the LLMs at the three temperatures and with radiologists. Results A total of 190 cases were included in neuroradiology (n = 53), multisystem (n = 27), gastrointestinal (n = 25), genitourinary (n = 23), musculoskeletal (n = 17), chest (n = 16), cardiovascular (n = 12), pediatric (n = 12), and breast (n = 5) subspecialties. Overall accuracy improved with increasing temperature settings (0, 0.5, 1) for both GPT-4V (41% [78 of 190 cases], 45% [86 of 190 cases], 49% [93 of 190 cases], respectively) and Gemini Pro Vision (29% [55 of 190 cases], 36% [69 of 190 cases], 39% [74 of 190 cases], respectively), although there was no evidence of a statistically significant difference after Bonferroni adjustment (GPT-4V, P = .12; Gemini Pro Vision, P = .04). The overall accuracy of radiologists (61% [115 of 190 cases]) was higher than that of Gemini Pro Vision at temperature 1 (T1) (P < .001), while no statistically significant difference was observed between radiologists and GPT-4V at T1 after Bonferroni adjustment (P = .02). Radiologists (range, 45%-88%) outperformed the LLMs at T1 (range, 24%-75%) in most subspecialties. Conclusion Using direct radiologic image inputs, GPT-4V and Gemini Pro Vision showed improved diagnostic accuracy with increasing temperature settings. Although GPT-4V slightly underperformed compared with radiologists, it nonetheless demonstrated promising potential as a supportive tool in diagnostic decision-making. © RSNA, 2024 See also the editorial by Nishino and Ballard in this issue.
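The significance criterion above (P < .007 after Bonferroni correction) corresponds to dividing the usual 0.05 threshold by the number of comparisons, here inferred to be seven. The snippet below is a hedged sketch of that multiplicity handling, not the study code; the p-values are those quoted in the abstract, with 0.0005 standing in for the value reported only as P < .001.

```python
comparisons = {
    "GPT-4V: accuracy trend across temperatures": 0.12,
    "Gemini Pro Vision: accuracy trend across temperatures": 0.04,
    "radiologists vs GPT-4V at temperature 1": 0.02,
    "radiologists vs Gemini Pro Vision at temperature 1": 0.0005,  # reported as P < .001
}
alpha = 0.05
adjusted_alpha = alpha / 7   # inferred from the stated P < .007 criterion
for name, p in comparisons.items():
    verdict = "significant" if p < adjusted_alpha else "not significant"
    print(f"{name}: p = {p} -> {verdict} at the adjusted threshold {adjusted_alpha:.3f}")
```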
Collapse
Affiliation(s)
- Pae Sun Suh
- From the Department of Radiology and Research Institute of Radiology, University of Ulsan College of Medicine, Asan Medical Center, Olympic-ro 33, Seoul 05505, Republic of Korea (P.S.S., W.H.S., C.H.S., H.J.E., K.J.P., J.C., P.H.K., H.J.P., Y.A., H.Y.P.); Department of Radiology and Research Institute of Radiological Science and Center for Clinical Imaging Data Science, Yonsei University College of Medicine, Seoul, Republic of Korea (P.S.S.); Department of Medical Science, University of Ulsan College of Medicine, Asan Medical Institute of Convergence Science and Technology, Seoul, Republic of Korea (W.H.S., H.H., C.R.P.); Medical Research Institute, Ganneung Asan Hospital, University of Ulsan College of Medicine, Gangneung, Republic of Korea (Y.C.); Department of Internal Medicine, University of Ulsan College of Medicine, Asan Medical Center, Seoul, Republic of Korea (C.Y.W.); and Department of Pulmonary and Critical Care Medicine, Gumdan Top Hospital, Incheon, Republic of Korea (H.P.)
| | - Woo Hyun Shim
| | - Chong Hyun Suh
| | - Hwon Heo
| | - Chae Ri Park
| | - Hye Joung Eom
| | - Kye Jin Park
| | - Jooae Choe
| | - Pyeong Hwa Kim
| | - Hyo Jung Park
| | - Yura Ahn
| | - Ho Young Park
| | - Yoonseok Choi
| | - Chang-Yun Woo
| | - Hyungjun Park
| |
Collapse
|
35
|
Rossettini G, Rodeghiero L, Corradi F, Cook C, Pillastrini P, Turolla A, Castellini G, Chiappinotto S, Gianola S, Palese A. Comparative accuracy of ChatGPT-4, Microsoft Copilot and Google Gemini in the Italian entrance test for healthcare sciences degrees: a cross-sectional study. BMC MEDICAL EDUCATION 2024; 24:694. [PMID: 38926809 PMCID: PMC11210096 DOI: 10.1186/s12909-024-05630-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 01/24/2024] [Accepted: 06/04/2024] [Indexed: 06/28/2024]
Abstract
BACKGROUND Artificial intelligence (AI) chatbots are emerging educational tools for students in healthcare science. However, assessing their accuracy is essential prior to adoption in educational settings. This study aimed to assess the accuracy of three AI chatbots (ChatGPT-4, Microsoft Copilot and Google Gemini) in predicting the correct answers in the Italian entrance standardized examination test for healthcare science degrees (CINECA test). Secondarily, we assessed the narrative coherence of the AI chatbots' responses (i.e., text output) based on three qualitative metrics: the logical rationale behind the chosen answer, the presence of information internal to the question, and the presence of information external to the question. METHODS An observational cross-sectional study was performed in September 2023. Accuracy of the three chatbots was evaluated on the CINECA test, in which questions are formatted as multiple choice with a single best answer. The outcome was binary (correct or incorrect). A chi-squared test and a post hoc analysis with Bonferroni correction assessed differences in accuracy among the chatbots. A p-value of < 0.05 was considered statistically significant. A sensitivity analysis was performed, excluding answers that were not applicable (e.g., images). Narrative coherence was analyzed by absolute and relative frequencies of correct answers and errors. RESULTS Overall, of the 820 CINECA multiple-choice questions inputted into all chatbots, 20 questions were not imported into ChatGPT-4 (n = 808) and Google Gemini (n = 808) due to technical limitations. We found statistically significant differences in the ChatGPT-4 vs Google Gemini and Microsoft Copilot vs Google Gemini comparisons (p-value < 0.001). The narrative coherence of the AI chatbots revealed "Logical reasoning" as the prevalent category among correct answers (n = 622, 81.5%) and "Logical error" as the prevalent category among incorrect answers (n = 40, 88.9%). CONCLUSIONS Our main findings reveal that: (A) AI chatbots performed well; (B) ChatGPT-4 and Microsoft Copilot performed better than Google Gemini; and (C) their narrative coherence is primarily logical. Although the AI chatbots showed promising accuracy in predicting the correct answer in the Italian entrance university standardized examination test, we encourage candidates to incorporate this new technology cautiously, as a supplement to their learning rather than a primary resource. TRIAL REGISTRATION Not required.
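The omnibus-plus-post-hoc comparison described in this abstract can be illustrated with a short, self-contained sketch. The per-chatbot correct/incorrect counts below are hypothetical placeholders (the abstract reports only aggregate findings), and the snippet is an illustrative reconstruction, not the authors' analysis code.

```python
# Minimal sketch of a chi-squared comparison of chatbot accuracy with
# Bonferroni-corrected pairwise tests. Counts are hypothetical placeholders.
from itertools import combinations

import numpy as np
from scipy.stats import chi2_contingency

# rows: chatbots, columns: [correct, incorrect] -- hypothetical counts
counts = {
    "ChatGPT-4":         np.array([700, 108]),
    "Microsoft Copilot": np.array([690, 130]),
    "Google Gemini":     np.array([600, 208]),
}

# Omnibus test across the three chatbots
table = np.vstack(list(counts.values()))
chi2, p, dof, _ = chi2_contingency(table)
print(f"omnibus chi2={chi2:.2f}, dof={dof}, p={p:.4g}")

# Post hoc pairwise tests with Bonferroni-adjusted alpha
pairs = list(combinations(counts, 2))
alpha_adj = 0.05 / len(pairs)
for a, b in pairs:
    chi2_ab, p_ab, _, _ = chi2_contingency(np.vstack([counts[a], counts[b]]))
    flag = "significant" if p_ab < alpha_adj else "not significant"
    print(f"{a} vs {b}: p={p_ab:.4g} ({flag} at alpha={alpha_adj:.4f})")
```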
Collapse
Affiliation(s)
- Giacomo Rossettini
- School of Physiotherapy, University of Verona, Verona, Italy.
- Department of Physiotherapy, Faculty of Sport Sciences, Universidad Europea de Madrid, Villaviciosa de Odón, 28670, Spain.
| | - Lia Rodeghiero
- Department of Rehabilitation, Hospital of Merano (SABES-ASDAA), Teaching Hospital of Paracelsus Medical University (PMU), Merano-Meran, Italy.
| | | | - Chad Cook
- Department of Orthopaedics, Duke University, Durham, NC, USA
- Duke Clinical Research Institute, Duke University, Durham, NC, USA
- Department of Population Health Sciences, Duke University, Durham, NC, USA
| | - Paolo Pillastrini
- Department of Biomedical and Neuromotor Sciences (DIBINEM), Alma Mater University of Bologna, Bologna, Italy
- Unit of Occupational Medicine, IRCCS Azienda Ospedaliero-Universitaria Di Bologna, Bologna, Italy
| | - Andrea Turolla
- Department of Biomedical and Neuromotor Sciences (DIBINEM), Alma Mater University of Bologna, Bologna, Italy
- Unit of Occupational Medicine, IRCCS Azienda Ospedaliero-Universitaria Di Bologna, Bologna, Italy
| | - Greta Castellini
- Unit of Clinical Epidemiology, IRCCS Istituto Ortopedico Galeazzi, Milan, Italy
| | | | - Silvia Gianola
- Unit of Clinical Epidemiology, IRCCS Istituto Ortopedico Galeazzi, Milan, Italy.
| | - Alvisa Palese
- Department of Medical Sciences, University of Udine, Udine, Italy.
| |
Collapse
|
36
|
Tong L, Wang J, Rapaka S, Garg PS. Can ChatGPT generate practice question explanations for medical students, a new faculty teaching tool? MEDICAL TEACHER 2024:1-5. [PMID: 38900675 DOI: 10.1080/0142159x.2024.2363486] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/11/2023] [Accepted: 05/30/2024] [Indexed: 06/22/2024]
Abstract
INTRODUCTION Multiple-choice questions (MCQs) are frequently used for formative assessment in medical school but often lack sufficient answer explanations given the time constraints of faculty. Chat Generative Pre-trained Transformer (ChatGPT) has emerged as a potential student learning aid and faculty teaching tool. This study aims to evaluate ChatGPT's performance in answering and providing explanations for MCQs. METHOD Ninety-four faculty-generated MCQs were collected from the pre-clerkship curriculum at a US medical school. ChatGPT's accuracy in answering MCQs was tracked on the first attempt without an answer prompt (Pass 1) and after being given a prompt for the correct answer (Pass 2). Explanations provided by ChatGPT were compared with faculty-generated explanations, and a 3-point evaluation scale was used to assess accuracy and thoroughness relative to faculty-generated answers. RESULTS On the first attempt, ChatGPT demonstrated 75% accuracy in correctly answering faculty-generated MCQs. Among correctly answered questions, 66.4% of ChatGPT's explanations matched faculty explanations, and 89.1% captured some key aspects without providing inaccurate information. The proportion of inaccurate explanations increased significantly if the question was not answered correctly on the first pass (2.7% if correct on first pass vs. 34.6% if incorrect on first pass, p < 0.001). CONCLUSION ChatGPT shows promise in assisting faculty and students with explanations for practice MCQs but should be used with caution. Faculty should review and supplement the explanations to ensure coverage of learning objectives. Students can benefit from ChatGPT's immediate feedback through explanations if ChatGPT answers the question correctly on the first try. If the question is answered incorrectly, students should remain cautious of the explanation and seek clarification from instructors.
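The key contrast above (2.7% vs. 34.6% inaccurate explanations depending on first-pass correctness) is a two-proportion comparison. The sketch below illustrates it with Fisher's exact test on a 2x2 table; the counts are hypothetical placeholders chosen only to be roughly consistent with the reported percentages, not the study's data.

```python
# Minimal sketch of the two-proportion comparison reported above
# (inaccurate explanations when the MCQ was answered correctly vs incorrectly
# on the first pass). Counts are hypothetical placeholders.
from scipy.stats import fisher_exact

# 2x2 table: rows = first-pass result, columns = [inaccurate, accurate] explanations
table = [
    [2, 68],   # answered correctly on first pass (hypothetical)
    [8, 16],   # answered incorrectly on first pass (hypothetical)
]

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
rate_correct = table[0][0] / sum(table[0])
rate_incorrect = table[1][0] / sum(table[1])
print(f"inaccurate-explanation rate: {rate_correct:.1%} vs {rate_incorrect:.1%}, p={p_value:.4g}")
```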
Collapse
Affiliation(s)
- Lilin Tong
- Boston University Chobanian and Avedisian School of Medicine, Boston, MA, USA
| | - Jennifer Wang
- Boston University Chobanian and Avedisian School of Medicine, Boston, MA, USA
| | - Srikar Rapaka
- Boston University Chobanian and Avedisian School of Medicine, Boston, MA, USA
| | - Priya S Garg
- Medical Education Office and Department of Pediatrics, Boston University Chobanian and Avedisian School of Medicine, Boston, MA, USA
| |
Collapse
|
37
|
Kaba E, Akkaya S. Performance of Different Large Language Models in the Sample Test of the European Cardiovascular Radiology Board Examination. Acad Radiol 2024:S1076-6332(24)00369-6. [PMID: 38902112 DOI: 10.1016/j.acra.2024.06.011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/01/2024] [Accepted: 06/04/2024] [Indexed: 06/22/2024]
Affiliation(s)
- Esat Kaba
- Recep Tayyip Erdogan University, Department of Radiology, Rize, Turkey.
| | - Selçuk Akkaya
- Karadeniz Technical University, Department of Radiology, Trabzon, Turkey
| |
Collapse
|
38
|
Suwała S, Szulc P, Guzowski C, Kamińska B, Dorobiała J, Wojciechowska K, Berska M, Kubicka O, Kosturkiewicz O, Kosztulska B, Rajewska A, Junik R. ChatGPT-3.5 passes Poland's medical final examination-Is it possible for ChatGPT to become a doctor in Poland? SAGE Open Med 2024; 12:20503121241257777. [PMID: 38895543 PMCID: PMC11185017 DOI: 10.1177/20503121241257777] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/21/2024] [Accepted: 05/08/2024] [Indexed: 06/21/2024] Open
Abstract
Objectives ChatGPT is an advanced chatbot based on a large language model that can answer questions. Undoubtedly, ChatGPT is capable of transforming communication, education, and customer support; however, can it play the role of a doctor? In Poland, prior to obtaining a medical diploma, candidates must successfully pass the Medical Final Examination. Methods The purpose of this research was to determine how well ChatGPT performed on the Polish Medical Final Examination, which must be passed to become a doctor in Poland (an exam is considered passed if at least 56% of the tasks are answered correctly). A total of 2138 categorized Medical Final Examination questions (from 11 examination sessions held between 2013-2015 and 2021-2023) were presented to ChatGPT-3.5 from 19 to 26 May 2023. For further analysis, the questions were divided into quintiles based on difficulty and duration, as well as by question type (simple A-type or complex K-type). The answers provided by ChatGPT were compared with the official answer key, which was reviewed for any changes resulting from advances in medical knowledge. Results ChatGPT correctly answered 53.4%-64.9% of questions. In 8 out of 11 exam sessions, ChatGPT achieved the score required to pass the examination (60%). The correlation between the efficacy of artificial intelligence and the complexity, difficulty, and length of a question was negative. AI outperformed humans in one category: psychiatry (77.18% vs. 70.25%, p = 0.081). Conclusions The performance of artificial intelligence was satisfactory; however, it was markedly inferior to that of human graduates in most instances. Despite its potential utility in many medical areas, ChatGPT is constrained by inherent limitations that prevent it from entirely supplanting human expertise and knowledge.
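The difficulty-quintile split, the negative accuracy-difficulty association, and the per-session pass check described above can be illustrated on simulated data. Everything in the sketch below (the difficulty index, session labels, and simulated correctness) is an assumption for illustration; only the 56% pass rule comes from the abstract.

```python
# Minimal sketch of the quintile/correlation analysis described above:
# a negative association between question difficulty and accuracy, and a
# pass/fail check per session against the 56% threshold. Data are simulated.
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n = 2138
df = pd.DataFrame({
    "difficulty": rng.uniform(0, 1, n),   # hypothetical difficulty index
    "session": rng.integers(1, 12, n),    # 11 hypothetical exam sessions
})
# Simulate: harder questions are less likely to be answered correctly
df["correct"] = rng.random(n) < (0.75 - 0.3 * df["difficulty"])

# Correlation between difficulty and correctness (expected to be negative)
rho, p = spearmanr(df["difficulty"], df["correct"])
print(f"Spearman rho={rho:.3f}, p={p:.3g}")

# Accuracy by difficulty quintile and pass/fail per session (threshold: 56%)
df["quintile"] = pd.qcut(df["difficulty"], 5, labels=False) + 1
print(df.groupby("quintile")["correct"].mean())
print(df.groupby("session")["correct"].mean().ge(0.56))
```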
Collapse
Affiliation(s)
- Szymon Suwała
- Department of Endocrinology and Diabetology, Nicolaus Copernicus University, Collegium Medicum, Bydgoszcz, Poland
| | - Paulina Szulc
- Evidence-Based Medicine Students Scientific Club of Department of Endocrinology and Diabetology, Nicolaus Copernicus University, Collegium Medicum, Bydgoszcz, Poland
| | - Cezary Guzowski
- Evidence-Based Medicine Students Scientific Club of Department of Endocrinology and Diabetology, Nicolaus Copernicus University, Collegium Medicum, Bydgoszcz, Poland
| | - Barbara Kamińska
- Evidence-Based Medicine Students Scientific Club of Department of Endocrinology and Diabetology, Nicolaus Copernicus University, Collegium Medicum, Bydgoszcz, Poland
| | - Jakub Dorobiała
- Evidence-Based Medicine Students Scientific Club of Department of Endocrinology and Diabetology, Nicolaus Copernicus University, Collegium Medicum, Bydgoszcz, Poland
| | - Karolina Wojciechowska
- Evidence-Based Medicine Students Scientific Club of Department of Endocrinology and Diabetology, Nicolaus Copernicus University, Collegium Medicum, Bydgoszcz, Poland
| | - Maria Berska
- Evidence-Based Medicine Students Scientific Club of Department of Endocrinology and Diabetology, Nicolaus Copernicus University, Collegium Medicum, Bydgoszcz, Poland
| | - Olga Kubicka
- Evidence-Based Medicine Students Scientific Club of Department of Endocrinology and Diabetology, Nicolaus Copernicus University, Collegium Medicum, Bydgoszcz, Poland
| | - Oliwia Kosturkiewicz
- Evidence-Based Medicine Students Scientific Club of Department of Endocrinology and Diabetology, Nicolaus Copernicus University, Collegium Medicum, Bydgoszcz, Poland
| | - Bernadetta Kosztulska
- Evidence-Based Medicine Students Scientific Club of Department of Endocrinology and Diabetology, Nicolaus Copernicus University, Collegium Medicum, Bydgoszcz, Poland
| | - Alicja Rajewska
- Evidence-Based Medicine Students Scientific Club of Department of Endocrinology and Diabetology, Nicolaus Copernicus University, Collegium Medicum, Bydgoszcz, Poland
| | - Roman Junik
- Department of Endocrinology and Diabetology, Nicolaus Copernicus University, Collegium Medicum, Bydgoszcz, Poland
| |
Collapse
|
39
|
Longwell JB, Hirsch I, Binder F, Gonzalez Conchas GA, Mau D, Jang R, Krishnan RG, Grant RC. Performance of Large Language Models on Medical Oncology Examination Questions. JAMA Netw Open 2024; 7:e2417641. [PMID: 38888919 PMCID: PMC11185976 DOI: 10.1001/jamanetworkopen.2024.17641] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 02/01/2024] [Accepted: 04/18/2024] [Indexed: 06/20/2024] Open
Abstract
Importance Large language models (LLMs) recently developed an unprecedented ability to answer questions. Studies of LLMs from other fields may not generalize to medical oncology, a high-stakes clinical setting requiring rapid integration of new information. Objective To evaluate the accuracy and safety of LLM answers on medical oncology examination questions. Design, Setting, and Participants This cross-sectional study was conducted between May 28 and October 11, 2023. The American Society of Clinical Oncology (ASCO) Oncology Self-Assessment Series on ASCO Connection, the European Society of Medical Oncology (ESMO) Examination Trial questions, and an original set of board-style medical oncology multiple-choice questions were presented to 8 LLMs. Main Outcomes and Measures The primary outcome was the percentage of correct answers. Medical oncologists evaluated the explanations provided by the best LLM for accuracy, classified the types of errors, and estimated the likelihood and extent of potential clinical harm. Results Proprietary LLM 2 correctly answered 125 of 147 questions (85.0%; 95% CI, 78.2%-90.4%; P < .001 vs random answering). Proprietary LLM 2 outperformed an earlier version, proprietary LLM 1, which correctly answered 89 of 147 questions (60.5%; 95% CI, 52.2%-68.5%; P < .001), and the best open-source LLM, Mixtral-8x7B-v0.1, which correctly answered 87 of 147 questions (59.2%; 95% CI, 50.0%-66.4%; P < .001). The explanations provided by proprietary LLM 2 contained no or minor errors for 138 of 147 questions (93.9%; 95% CI, 88.7%-97.2%). Incorrect responses were most commonly associated with errors in information retrieval, particularly with recent publications, followed by erroneous reasoning and reading comprehension. If acted upon in clinical practice, 18 of 22 incorrect answers (81.8%; 95% CI, 59.7%-94.8%) would have a medium or high likelihood of moderate to severe harm. Conclusions and Relevance In this cross-sectional study of the performance of LLMs on medical oncology examination questions, the best LLM answered questions with remarkable performance, although errors raised safety concerns. These results demonstrated an opportunity to develop and evaluate LLMs to improve health care clinician experiences and patient care, considering the potential impact on capabilities and safety.
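The headline result above (125 of 147 correct, with a 95% CI and a comparison against random answering) can be reproduced with a binomial test and a proportion confidence interval. The random-answering baseline of 0.25 (four-option questions) and the Wilson interval method are assumptions; the abstract does not state either, so this is an illustrative sketch only.

```python
# Minimal sketch of the accuracy estimate reported above: proportion correct
# with a 95% CI and a binomial test against random answering (assumed 1 in 4).
from scipy.stats import binomtest
from statsmodels.stats.proportion import proportion_confint

correct, total = 125, 147
result = binomtest(correct, total, p=0.25, alternative="greater")
low, high = proportion_confint(correct, total, alpha=0.05, method="wilson")
print(f"accuracy={correct/total:.1%} (95% CI {low:.1%}-{high:.1%}), "
      f"p vs random={result.pvalue:.3g}")
```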
Collapse
Affiliation(s)
- Jack B. Longwell
- Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada
| | - Ian Hirsch
- Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada
- Department of Medicine, University of Toronto, Toronto, Ontario, Canada
| | - Fernando Binder
- Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada
- Department of Medicine, University of Toronto, Toronto, Ontario, Canada
| | | | - Daniel Mau
- Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada
- Institute of Medical Science, University of Toronto, Toronto, Ontario, Canada
| | - Raymond Jang
- Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada
- Department of Medicine, University of Toronto, Toronto, Ontario, Canada
| | - Rahul G. Krishnan
- Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
- Department of Laboratory Medicine and Pathobiology, University of Toronto, Toronto, Ontario, Canada
- Vector Institute, Toronto, Ontario, Canada
| | - Robert C. Grant
- Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada
- Department of Medicine, University of Toronto, Toronto, Ontario, Canada
- Institute of Medical Science, University of Toronto, Toronto, Ontario, Canada
- ICES, Toronto, Ontario, Canada
| |
Collapse
|
40
|
Jenko N, Ariyaratne S, Jeys L, Evans S, Iyengar KP, Botchu R. An evaluation of AI generated literature reviews in musculoskeletal radiology. Surgeon 2024; 22:194-197. [PMID: 38218659 DOI: 10.1016/j.surge.2023.12.005] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/11/2023] [Revised: 12/20/2023] [Accepted: 12/27/2023] [Indexed: 01/15/2024]
Abstract
PURPOSE The use of artificial intelligence (AI) tools to aid in summarizing information in medicine and research has recently garnered a huge amount of interest. While tools such as ChatGPT produce convincing and natural-sounding output, the answers are sometimes incorrect. It is hoped that some of these drawbacks can be avoided by using programmes trained for a more specific scope. In this study we compared the performance of a new AI tool (the-literature.com) with the latest version of OpenAI's ChatGPT (GPT-4) in summarizing topics to which the authors have significantly contributed. METHODS The AI tools were asked to produce a literature review on 7 topics. These were selected as research topics with which the authors were intimately familiar and to which they have contributed through their own publications. The output produced by the AI tools was graded on a 1-5 Likert scale for accuracy, comprehensiveness, and relevance by two fellowship-trained consultant radiologists. RESULTS The-literature.com produced 3 excellent summaries, 3 very poor summaries not relevant to the prompt, and one summary that was relevant but did not include all relevant papers. All of the summaries produced by GPT-4 were relevant, but fewer relevant papers were identified. The average Likert rating was 2.88 for the-literature and 3.86 for GPT-4. There was good agreement between the ratings of both radiologists (ICC = 0.883). CONCLUSION Summaries produced by AI in its current state require careful human validation. GPT-4 on average provides higher-quality summaries. Neither tool can reliably identify all relevant publications.
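The inter-rater agreement reported above (ICC = 0.883) can be computed from a targets-by-raters matrix of Likert scores. The sketch below uses the two-way random-effects, single-measure form (Shrout & Fleiss ICC(2,1)); the ratings are hypothetical placeholders, and the specific ICC form used by the authors is not stated in the abstract, so that choice is an assumption.

```python
# Minimal sketch of a two-rater agreement check on Likert ratings using a
# two-way random-effects ICC (Shrout & Fleiss ICC(2,1)). Ratings are
# hypothetical placeholders, not the study's data.
import numpy as np

# rows = summaries (targets), columns = the two radiologist raters
ratings = np.array([
    [5, 5], [4, 4], [1, 2], [2, 1], [5, 4], [3, 3], [4, 5],
    [5, 5], [1, 1], [2, 2], [4, 4], [3, 4], [5, 5], [2, 3],
], dtype=float)

n, k = ratings.shape
grand = ratings.mean()
row_means = ratings.mean(axis=1)          # per-summary means
col_means = ratings.mean(axis=0)          # per-rater means

ms_rows = k * np.sum((row_means - grand) ** 2) / (n - 1)   # between targets
ms_cols = n * np.sum((col_means - grand) ** 2) / (k - 1)   # between raters
residual = ratings - row_means[:, None] - col_means[None, :] + grand
ms_err = np.sum(residual ** 2) / ((n - 1) * (k - 1))        # interaction/error

icc_2_1 = (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)
print(f"ICC(2,1) = {icc_2_1:.3f}")
```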
Collapse
Affiliation(s)
- N Jenko
- Radiology, Royal Orthopaedic Hospital NHS Foundation Trust, Birmingham, UK.
| | - S Ariyaratne
- Radiology, Royal Orthopaedic Hospital NHS Foundation Trust, Birmingham, UK
| | - L Jeys
- Orthopaedic Surgery, Royal Orthopaedic Hospital NHS Foundation Trust, Birmingham, UK
| | - S Evans
- Orthopaedic Surgery, Royal Orthopaedic Hospital NHS Foundation Trust, Birmingham, UK
| | - K P Iyengar
- Orthopaedic Surgery, Mersey and West Lancashire Teaching Hospitals NHS Trust, Southport, UK
| | - R Botchu
- Radiology, Royal Orthopaedic Hospital NHS Foundation Trust, Birmingham, UK
| |
Collapse
|
41
|
Taesotikul S, Singhan W, Taesotikul T. ChatGPT vs pharmacy students in the pharmacotherapy time-limit test: A comparative study in Thailand. CURRENTS IN PHARMACY TEACHING & LEARNING 2024; 16:404-410. [PMID: 38641483 DOI: 10.1016/j.cptl.2024.04.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 10/17/2023] [Revised: 04/03/2024] [Accepted: 04/04/2024] [Indexed: 04/21/2024]
Abstract
OBJECTIVES ChatGPT is an innovative artificial intelligence designed to enhance human activities and serve as a potent tool for information retrieval. This study aimed to evaluate the performance and limitations of ChatGPT on a fourth-year pharmacy student examination. METHODS This cross-sectional study was conducted in February 2023 at the Faculty of Pharmacy, Chiang Mai University, Thailand. The exam contained 16 multiple-choice questions and 2 short-answer questions, focusing on the classification and medical management of shock and electrolyte disorders. RESULTS ChatGPT answered 44% (8 of 18) of the questions correctly. In contrast, the students achieved a higher accuracy rate of 66% (12 of 18). These findings underscore that while the AI exhibits proficiency, it encounters limitations when confronted with specific queries derived from practical scenarios, in contrast to pharmacy students, who are free to explore and collaborate in ways that mirror real-world practice. CONCLUSIONS Users must exercise caution regarding its reliability, and AI-generated answers should be interpreted judiciously given potential limitations in multi-step analysis and reliance on outdated data. Future advancements in AI models, with refinements and tailored enhancements, offer the potential for improved performance.
Collapse
Affiliation(s)
- Suthinee Taesotikul
- Department of Pharmaceutical Care, Faculty of Pharmacy, Chiang Mai University, Chiang Mai 50200, Thailand.
| | - Wanchana Singhan
- Department of Pharmaceutical Care, Faculty of Pharmacy, Chiang Mai University, Chiang Mai 50200, Thailand.
| | - Theerada Taesotikul
- Department of Biomedicine and Health Informatics, Faculty of Pharmacy, Silpakorn University, Nakhon Pathom 73000, Thailand.
| |
Collapse
|
42
|
Bhayana R, Nanda B, Dehkharghanian T, Deng Y, Bhambra N, Elias G, Datta D, Kambadakone A, Shwaartz CG, Moulton CA, Henault D, Gallinger S, Krishna S. Large Language Models for Automated Synoptic Reports and Resectability Categorization in Pancreatic Cancer. Radiology 2024; 311:e233117. [PMID: 38888478 DOI: 10.1148/radiol.233117] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/20/2024]
Abstract
Background Structured radiology reports for pancreatic ductal adenocarcinoma (PDAC) improve surgical decision-making over free-text reports, but radiologist adoption is variable. Resectability criteria are applied inconsistently. Purpose To evaluate the performance of large language models (LLMs) in automatically creating PDAC synoptic reports from original reports and to explore performance in categorizing tumor resectability. Materials and Methods In this institutional review board-approved retrospective study, 180 consecutive PDAC staging CT reports on patients referred to the authors' European Society for Medical Oncology-designated cancer center from January to December 2018 were included. Reports were reviewed by two radiologists to establish the reference standard for 14 key findings and National Comprehensive Cancer Network (NCCN) resectability category. GPT-3.5 and GPT-4 (accessed September 18-29, 2023) were prompted to create synoptic reports from original reports with the same 14 features, and their performance was evaluated (recall, precision, F1 score). To categorize resectability, three prompting strategies (default knowledge, in-context knowledge, chain-of-thought) were used for both LLMs. Hepatopancreaticobiliary surgeons reviewed original and artificial intelligence (AI)-generated reports to determine resectability, with accuracy and review time compared. The McNemar test, t test, Wilcoxon signed-rank test, and mixed effects logistic regression models were used where appropriate. Results GPT-4 outperformed GPT-3.5 in the creation of synoptic reports (F1 score: 0.997 vs 0.967, respectively). Compared with GPT-3.5, GPT-4 achieved equal or higher F1 scores for all 14 extracted features. GPT-4 had higher precision than GPT-3.5 for extracting superior mesenteric artery involvement (100% vs 88.8%, respectively). For categorizing resectability, GPT-4 outperformed GPT-3.5 for each prompting strategy. For GPT-4, chain-of-thought prompting was most accurate, outperforming in-context knowledge prompting (92% vs 83%, respectively; P = .002), which outperformed the default knowledge strategy (83% vs 67%, P < .001). Surgeons were more accurate in categorizing resectability using AI-generated reports than original reports (83% vs 76%, respectively; P = .03), while spending less time on each report (58%; 95% CI: 0.53, 0.62). Conclusion GPT-4 created near-perfect PDAC synoptic reports from original reports. GPT-4 with chain-of-thought achieved high accuracy in categorizing resectability. Surgeons were more accurate and efficient using AI-generated reports. © RSNA, 2024 Supplemental material is available for this article. See also the editorial by Chang in this issue.
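The per-feature extraction scoring above (recall, precision, F1 for each of 14 synoptic-report features) reduces to simple counts of true/false positives and negatives. The sketch below shows the arithmetic; the feature names and counts are hypothetical placeholders, not the authors' evaluation data or code.

```python
# Minimal sketch of per-feature extraction scoring: precision, recall, and F1
# computed from true-positive, false-positive, and false-negative counts.
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# feature -> (true positives, false positives, false negatives), hypothetical
features = {
    "tumor size": (178, 1, 1),
    "SMA involvement": (170, 2, 8),
    "liver metastases": (175, 0, 5),
}
for name, (tp, fp, fn) in features.items():
    p, r, f1 = precision_recall_f1(tp, fp, fn)
    print(f"{name}: precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```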
Collapse
Affiliation(s)
- Rajesh Bhayana
- From University Medical Imaging Toronto, Joint Department of Medical Imaging, University Health Network, Princess Margaret Cancer Centre, Department of Medical Imaging, University of Toronto, Toronto General Hospital, 200 Elizabeth St, Peter Munk Building, 1st Fl, Toronto, ON, Canada M5G 24C (R.B., B.N., T.D., S.K.); Department of Biostatistics (Y.D.) and HPB Surgical Oncology (C.G.S., C.A.M., D.H., S.G.), University Health Network, Toronto, Ontario, Canada; Departments of Medicine (N.B., G.E., D.D.) and Surgery (C.G.S., C.A.M., D.H., S.G.), University of Toronto, Toronto, Ontario, Canada; and Department of Radiology, Massachusetts General Hospital, Harvard Medical School, Boston, Mass (A.K.)
| | - Bipin Nanda
| | - Taher Dehkharghanian
| | - Yangqing Deng
| | - Nishaant Bhambra
| | - Gavin Elias
| | - Daksh Datta
| | - Avinash Kambadakone
| | - Chaya G Shwaartz
| | - Carol-Anne Moulton
| | - David Henault
| | - Steven Gallinger
| | - Satheesh Krishna
| |
Collapse
|
43
|
Sparks CA, Kraeutler MJ, Chester GA, Contrada EV, Zhu E, Fasulo SM, Scillia AJ. Inadequate Performance of ChatGPT on Orthopedic Board-Style Written Exams. Cureus 2024; 16:e62643. [PMID: 39036109 PMCID: PMC11258215 DOI: 10.7759/cureus.62643] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 06/17/2024] [Indexed: 07/23/2024] Open
Abstract
BACKGROUND Chat Generative Pre-Trained Transformer (ChatGPT) is an artificial intelligence (AI) chatbot capable of delivering human-like responses to a seemingly infinite number of inquiries. For the technology to perform certain healthcare-related tasks or act as a study aid, it should have up-to-date knowledge and the ability to reason through medical information. The purpose of this study was to assess the orthopedic knowledge and reasoning ability of ChatGPT by querying it with orthopedic board-style questions. METHODOLOGY We queried ChatGPT (GPT-3.5) with a total of 472 questions from the Orthobullets dataset (n = 239), the 2022 Orthopaedic In-Training Examination (OITE) (n = 124), and the 2021 OITE (n = 109). The importance, difficulty, and category were recorded for questions from the Orthobullets question bank. Responses were assessed for answer choice correctness, for whether the explanation given matched that of the dataset, for answer integrity, and for the reason for incorrectness. RESULTS ChatGPT correctly answered 55.9% (264/472) of questions and, of those answered correctly, gave an explanation that matched that of the dataset for 92.8% (245/264) of the questions. The chatbot used information internal to the question in all responses (100%) and used information external to the question (98.3%) as well as logical reasoning (96.4%) in most responses. There was no significant difference in the proportion of questions answered correctly between the datasets (P = 0.62). There was no significant difference in the proportion of questions answered correctly by question category (P = 0.67), importance (P = 0.95), or difficulty (P = 0.87) within the Orthobullets dataset questions. ChatGPT most often answered questions incorrectly due to information error (i.e., failure to identify the information required to answer the question; 81.7% of incorrect responses). CONCLUSIONS ChatGPT performs below the threshold likely required to pass the American Board of Orthopedic Surgery (ABOS) Part I written exam. The chatbot's performance on the 2022 and 2021 OITEs was between the average performance of an intern and that of a second-year resident. A major limitation of the current model is the failure to identify the information required to correctly answer the questions.
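The between-dataset comparison above (no significant difference in accuracy across the three question sources) can be illustrated with a chi-squared test of independence. Only the pooled 264/472 figure is reported in the abstract, so the per-dataset split of correct answers in the sketch is hypothetical; the dataset sizes (239, 124, 109) come from the abstract.

```python
# Minimal sketch of a between-dataset comparison: chi-squared test of the
# proportion of correctly answered questions across three question sources.
import numpy as np
from scipy.stats import chi2_contingency

# rows: question source, columns: [correct, incorrect] -- hypothetical split
table = np.array([
    [134, 105],  # Orthobullets (239 questions)
    [ 69,  55],  # 2022 OITE (124 questions)
    [ 61,  48],  # 2021 OITE (109 questions)
])
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2={chi2:.2f}, dof={dof}, p={p:.3f}")
print(f"overall accuracy={table[:, 0].sum() / table.sum():.1%}")
```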
Collapse
Affiliation(s)
- Chandler A Sparks
- Department of Orthopedic Surgery, Hackensack Meridian School of Medicine, Nutley, USA
| | - Matthew J Kraeutler
- Department of Orthopedics, University of Colorado Anschutz Medical Campus, Aurora, USA
| | - Grace A Chester
- Department of Orthopedic Surgery, Hackensack Meridian School of Medicine, Nutley, USA
| | - Edward V Contrada
- Department of Orthopedic Surgery, Hackensack Meridian School of Medicine, Nutley, USA
| | - Eric Zhu
- Department of Orthopedic Surgery, Hackensack Meridian School of Medicine, Nutley, USA
| | - Sydney M Fasulo
- Department of Orthopedic Surgery, St. Joseph's Medical Center, Paterson, USA
| | - Anthony J Scillia
- Department of Sports Medicine/Orthopedics, Seton Hall University, Paterson, USA
| |
Collapse
|
44
|
Altamimi I, Alhumimidi A, Alshehri S, Alrumayan A, Al-khlaiwi T, Meo SA, Temsah MH. The scientific knowledge of three large language models in cardiology: multiple-choice questions examination-based performance. Ann Med Surg (Lond) 2024; 86:3261-3266. [PMID: 38846858 PMCID: PMC11152788 DOI: 10.1097/ms9.0000000000002120] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/28/2023] [Accepted: 04/16/2024] [Indexed: 06/09/2024] Open
Abstract
Background The integration of artificial intelligence (AI) chatbots like Google's Bard, OpenAI's ChatGPT, and Microsoft's Bing Chatbot into academic and professional domains, including cardiology, has been rapidly evolving. Their application in educational and research frameworks, however, raises questions about their efficacy, particularly in specialized fields like cardiology. This study aims to evaluate the knowledge depth and accuracy of these AI chatbots in cardiology using a multiple-choice question (MCQ) format. Methods The study was conducted as an exploratory, cross-sectional study in November 2023 on a bank of 100 MCQs covering various cardiology topics that was created from authoritative textbooks and question banks. These MCQs were then used to assess the knowledge level of Google's Bard, Microsoft Bing, and ChatGPT 4.0. Each question was entered manually into the chatbots, ensuring no memory retention bias. Results The study found that ChatGPT 4.0 demonstrated the highest knowledge score in cardiology, with 87% accuracy, followed by Bing at 60% and Bard at 46%. The performance varied across different cardiology subtopics, with ChatGPT consistently outperforming the others. Notably, the study revealed significant differences in the proficiency of these chatbots in specific cardiology domains. Conclusion This study highlights a spectrum of efficacy among AI chatbots in disseminating cardiology knowledge. ChatGPT 4.0 emerged as a potential auxiliary educational resource in cardiology, surpassing traditional learning methods in some aspects. However, the variability in performance among these AI systems underscores the need for cautious evaluation and continuous improvement, especially for chatbots like Bard, to ensure reliability and accuracy in medical knowledge dissemination.
Collapse
Affiliation(s)
- Ibraheem Altamimi
- College of Medicine
- Evidence-Based Health Care and Knowledge Translation Research Chair, Family and Community Medicine Department, College of Medicine, King Saud University
| | | | | | - Abdullah Alrumayan
- College of Medicine, King Saud Bin Abdulaziz University for Health and Sciences, Riyadh, Saudi Arabia
| | | | | | - Mohamad-Hani Temsah
- College of Medicine
- Evidence-Based Health Care and Knowledge Translation Research Chair, Family and Community Medicine Department, College of Medicine, King Saud University
- Pediatric Intensive Care Unit, Pediatric Department, College of Medicine, King Saud University Medical City
| |
Collapse
|
45
|
Koga S, Du W. Integrating AI in medicine: Lessons from Chat-GPT's limitations in medical imaging. Dig Liver Dis 2024; 56:1114-1115. [PMID: 38429138 DOI: 10.1016/j.dld.2024.02.014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 02/17/2024] [Accepted: 02/19/2024] [Indexed: 03/03/2024]
Affiliation(s)
- Shunsuke Koga
- Department of Pathology and Laboratory Medicine, Hospital of the University of Pennsylvania, 3400 Spruce Street, Philadelphia, PA 19104, United States.
| | - Wei Du
- Department of Pathology and Laboratory Medicine, Hospital of the University of Pennsylvania, 3400 Spruce Street, Philadelphia, PA 19104, United States
| |
Collapse
|
46
|
Mokkarala M, Bentley H, Gomez C, Jiao A, Zaki-Metias KM. The New American Board of Radiology Certifying Oral Examination: How Should Diagnostic Radiology Graduate Medical Education Evolve? Radiographics 2024; 44:e240016. [PMID: 38722783 DOI: 10.1148/rg.240016] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 08/21/2024]
Affiliation(s)
- Mahati Mokkarala
- From the Department of Radiology, Mallinckrodt Institute of Radiology, 510 S Kingshighway Blvd #8131, St Louis, MO 63108 (M.M.); Department of Radiology, University of British Columbia, Vancouver, British Columbia, Canada (H.B.); Department of Radiology and Imaging Sciences, Emory University School of Medicine, Atlanta, Ga (C.G.); Department of Radiology, Brigham and Women's Hospital, Harvard Medical School, Boston, Mass (A.J.); and Department of Radiology, Trinity Health Oakland Hospital, Wayne State University School of Medicine, Pontiac, Mich (K.M.Z.M.)
| | - Helena Bentley
| | - Christian Gomez
| | - Albert Jiao
| | - Kaitlin M Zaki-Metias
47
Mousavi M, Shafiee S, Harley JM, Cheung JCK, Abbasgholizadeh Rahimi S. Performance of generative pre-trained transformers (GPTs) in Certification Examination of the College of Family Physicians of Canada. Fam Med Community Health 2024; 12:e002626. [PMID: 38806403 PMCID: PMC11138270 DOI: 10.1136/fmch-2023-002626] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/30/2024] Open
Abstract
INTRODUCTION The application of large language models such as generative pre-trained transformers (GPTs) has been promising in medical education, and their performance has been tested on various medical exams. This study aims to assess the performance of GPTs in responding to a set of sample questions of short-answer management problems (SAMPs) from the certification exam of the College of Family Physicians of Canada (CFPC). METHOD Between August 8th and 25th, 2023, we used GPT-3.5 and GPT-4 in five rounds to answer a sample of 77 SAMP questions from the CFPC website. Two independent certified family physician reviewers scored the AI-generated responses twice: first, according to the CFPC answer key (ie, CFPC score), and second, based on their knowledge and other references (ie, Reviewers' score). An ordinal logistic generalised estimating equations (GEE) model was applied to analyse repeated measures across the five rounds. RESULTS According to the CFPC answer key, 607 (73.6%) lines of answers by GPT-3.5 and 691 (81%) by GPT-4 were deemed accurate. The reviewers' scoring suggested that about 84% of the lines of answers provided by GPT-3.5 and 93% of those provided by GPT-4 were correct. The GEE analysis confirmed that over the five rounds, the odds of achieving a higher CFPC score percentage were 2.31 times greater for GPT-4 than for GPT-3.5 (OR: 2.31; 95% CI: 1.53 to 3.47; p<0.001). Similarly, the reviewers' score percentages for responses provided by GPT-4 over the five rounds were 2.23 times more likely to exceed those of GPT-3.5 (OR: 2.23; 95% CI: 1.22 to 4.06; p=0.009). Rerunning the GPTs after a one-week interval, regenerating the prompt, or using or not using the prompt did not significantly change the CFPC score percentage. CONCLUSION In our study, we used GPT-3.5 and GPT-4 to answer complex, open-ended sample questions from the CFPC exam and showed that more than 70% of the answers were accurate and that GPT-4 outperformed GPT-3.5 in responding to the questions. Large language models such as GPTs seem promising for assisting candidates of the CFPC exam by providing potential answers. However, their use for family medicine education and exam preparation needs further study.
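As a rough illustration of the repeated-measures analysis described above, the sketch below fits a GEE on simulated line-level scores. It is not the authors' code: it simplifies the ordinal outcome to a binary correct/incorrect indicator, and the data, column names, and correlation structure are assumptions made only for illustration.

```python
# Minimal sketch (assumed data and column names, not the study's materials):
# a logistic GEE comparing GPT-4 with GPT-3.5 on repeated line-level scores.
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Toy long-format data: one row per scored answer, per model, per round.
rng = np.random.default_rng(0)
rows = []
for model, p_correct in [("GPT-3.5", 0.74), ("GPT-4", 0.81)]:
    for rnd in range(1, 6):
        for q in range(1, 78):
            rows.append({"model": model, "round": rnd, "question_id": q,
                         "correct": int(rng.random() < p_correct)})
df = pd.DataFrame(rows)

# Exchangeable working correlation accounts for repeated measurements of the
# same question across rounds; GPT-3.5 is the reference level (alphabetical).
gee = sm.GEE.from_formula(
    "correct ~ model",
    groups="question_id",
    data=df,
    family=sm.families.Binomial(),
    cov_struct=sm.cov_struct.Exchangeable(),
)
result = gee.fit()

# Exponentiated coefficients give odds ratios; the paper reports OR 2.31 for
# GPT-4 vs GPT-3.5 from an ordinal GEE, so this binary sketch will differ.
print(np.exp(result.params))
print(np.exp(result.conf_int()))
```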
Affiliation(s)
- Mehdi Mousavi
- Department of Family Medicine, Faculty of Medicine, University of Saskatchewan, Nipawin, Saskatchewan, Canada
- Shabnam Shafiee
- Department of Family Medicine, Saskatchewan Health Authority, Riverside Health Complex, Turtleford, Saskatchewan, Canada
- Jason M Harley
- Department of Surgery, Faculty of Medicine and Health Sciences, McGill University, Montreal, Quebec, Canada
- Research Institute of the McGill University Health Centre, Montreal, Quebec, Canada
- Institute for Health Sciences Education, Faculty of Medicine and Health Sciences, McGill University, Montreal, Quebec, Canada
- Jackie Chi Kit Cheung
- McGill University School of Computer Science, Montreal, Quebec, Canada
- CIFAR AI Chair, Mila-Quebec AI Institute, Montreal, Quebec, Canada
- Samira Abbasgholizadeh Rahimi
- Department of Family Medicine, McGill University, Montreal, Quebec, Canada
- Mila-Quebec AI Institute, Montreal, Quebec, Canada
- Faculty of Dentistry Medicine and Oral Health Sciences, McGill University, Montreal, Quebec, Canada
- Lady Davis Institute for Medical Research, Jewish General Hospital, Montreal, Quebec, Canada
48
Horiuchi D, Tatekawa H, Oura T, Oue S, Walston SL, Takita H, Matsushita S, Mitsuyama Y, Shimono T, Miki Y, Ueda D. Comparing the Diagnostic Performance of GPT-4-based ChatGPT, GPT-4V-based ChatGPT, and Radiologists in Challenging Neuroradiology Cases. Clin Neuroradiol 2024:10.1007/s00062-024-01426-y. [PMID: 38806794 DOI: 10.1007/s00062-024-01426-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/27/2024] [Accepted: 05/06/2024] [Indexed: 05/30/2024]
Abstract
PURPOSE To compare the diagnostic performance of Generative Pre-trained Transformer (GPT)-4-based ChatGPT, GPT-4 with vision (GPT-4V)-based ChatGPT, and radiologists in challenging neuroradiology cases. METHODS We collected 32 consecutive "Freiburg Neuropathology Case Conference" cases published in the journal Clinical Neuroradiology between March 2016 and December 2023. We input the medical history and imaging findings into GPT-4-based ChatGPT and the medical history and images into GPT-4V-based ChatGPT, and each model generated a diagnosis for every case. Six radiologists (three radiology residents and three board-certified radiologists) independently reviewed all cases and provided diagnoses. The diagnostic accuracy rates of ChatGPT and the radiologists were evaluated against the published ground truth. Chi-square tests were performed to compare the diagnostic accuracy of GPT-4-based ChatGPT, GPT-4V-based ChatGPT, and the radiologists. RESULTS GPT-4-based and GPT-4V-based ChatGPT achieved accuracy rates of 22% (7/32) and 16% (5/32), respectively. The radiologists achieved the following accuracy rates: the three radiology residents 28% (9/32), 31% (10/32), and 28% (9/32); and the three board-certified radiologists 38% (12/32), 47% (15/32), and 44% (14/32). GPT-4-based ChatGPT's diagnostic accuracy was lower than that of each radiologist, although not significantly (all p > 0.07). GPT-4V-based ChatGPT's diagnostic accuracy was also lower than that of each radiologist; the difference was significant for two of the board-certified radiologists (p = 0.02 and 0.03) but not for the radiology residents or the remaining board-certified radiologist (all p > 0.09). CONCLUSION Although GPT-4-based ChatGPT demonstrated higher diagnostic performance than GPT-4V-based ChatGPT, neither model reached the performance level of radiology residents or board-certified radiologists in challenging neuroradiology cases.
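The chi-square comparisons reported above can be illustrated from the counts given in the abstract. The sketch below contrasts GPT-4-based ChatGPT (7/32 correct) with the best-performing board-certified radiologist (15/32 correct); it is not the authors' analysis code, and their exact handling of continuity correction is not stated in the abstract.

```python
# Minimal sketch: 2x2 chi-square test on counts taken from the abstract.
from scipy.stats import chi2_contingency

table = [
    [7, 32 - 7],    # GPT-4-based ChatGPT: correct, incorrect
    [15, 32 - 15],  # best board-certified radiologist: correct, incorrect
]
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p_value:.3f}, dof = {dof}")
```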
Affiliation(s)
- Daisuke Horiuchi, Hiroyuki Tatekawa, Tatsushi Oura, Satoshi Oue, Shannon L Walston, Hirotaka Takita, Shu Matsushita, Yasuhito Mitsuyama, Taro Shimono, Yukio Miki, Daiju Ueda
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan (all authors)
- Center for Health Science Innovation, Osaka Metropolitan University, Osaka, Japan (D. Ueda)
49
Duggan R, Tsuruda KM. ChatGPT performance on radiation technologist and therapist entry to practice exams. J Med Imaging Radiat Sci 2024; 55:101426. [PMID: 38797622 DOI: 10.1016/j.jmir.2024.04.019] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/19/2024] [Revised: 04/24/2024] [Accepted: 04/29/2024] [Indexed: 05/29/2024]
Abstract
BACKGROUND The aim of this study was to describe the proficiency of ChatGPT (GPT-4) on certification-style exams from the Canadian Association of Medical Radiation Technologists (CAMRT) and to describe its performance across multiple exam attempts. METHODS ChatGPT was prompted with questions from CAMRT practice exams in the disciplines of radiological technology, magnetic resonance imaging (MRI), nuclear medicine, and radiation therapy (87-98 questions each). ChatGPT attempted each exam five times. Exam performance was evaluated using descriptive statistics, stratified by discipline and question type (knowledge, application, critical thinking). Light's kappa was used to assess agreement in answers across attempts. RESULTS Using a passing grade of 65%, ChatGPT passed the radiological technology exam only once (20%), MRI all five times (100%), nuclear medicine three times (60%), and radiation therapy all five times (100%). ChatGPT's performance was best on knowledge questions across all disciplines except radiation therapy. It performed worst on critical thinking questions. Agreement in ChatGPT's responses across attempts was substantial within the disciplines of radiological technology, MRI, and nuclear medicine, and almost perfect for radiation therapy. CONCLUSION ChatGPT (GPT-4) was able to pass certification-style exams for radiation technologists and therapists, but its performance varied between disciplines. The algorithm demonstrated substantial to almost perfect agreement in the responses it provided across multiple exam attempts. Future research evaluating ChatGPT's performance on standardized tests should consider using repeated measures.
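Light's kappa, used above to quantify agreement across ChatGPT's five attempts, is commonly computed as the mean of pairwise Cohen's kappa values over all pairs of attempts. The sketch below illustrates that calculation on hypothetical answer sequences; it is not the authors' code or data.

```python
# Minimal sketch: Light's kappa as the mean of pairwise Cohen's kappas
# across five exam attempts (hypothetical answer sequences).
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

# One list of selected answer options per attempt, aligned by question.
attempts = [
    list("ABCDABCDAB"),
    list("ABCDABCDCB"),
    list("ABCDABCDAB"),
    list("ABCDABCDAD"),
    list("ABCAABCDAB"),
]

pairwise = [cohen_kappa_score(a, b) for a, b in combinations(attempts, 2)]
lights_kappa = sum(pairwise) / len(pairwise)
print(f"Light's kappa across {len(attempts)} attempts: {lights_kappa:.2f}")
```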
Affiliation(s)
- Ryan Duggan
- School of Health Sciences, Dalhousie University, Halifax, Nova Scotia, Canada; Miramichi Regional Hospital, Horizon Health Network, New Brunswick, Canada.
50
Igarashi Y, Nakahara K, Norii T, Miyake N, Tagami T, Yokobori S. Performance of a Large Language Model on Japanese Emergency Medicine Board Certification Examinations. J NIPPON MED SCH 2024; 91:155-161. [PMID: 38432929 DOI: 10.1272/jnms.jnms.2024_91-205] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/05/2024]
Abstract
BACKGROUND Emergency physicians need a broad range of knowledge and skills to address critical medical, traumatic, and environmental conditions. Artificial intelligence (AI), including large language models (LLMs), has potential applications in healthcare settings; however, the performance of LLMs in emergency medicine remains unclear. METHODS To evaluate the reliability of the information provided by ChatGPT, an LLM was given the questions set by the Japanese Association of Acute Medicine in its board certification examinations over a 5-year period (2018-2022) and was asked to answer each question twice. Statistical analysis was used to assess agreement between the two sets of responses. RESULTS The LLM provided answers to 465 of the 475 text-based questions, achieving an overall correct response rate of 62.3%. For questions without images, the rate of correct answers was 65.9%. For questions that included images, which were not described to the LLM, the rate of correct answers was only 52.0%. The annual rates of correct answers to questions without images ranged from 56.3% to 78.8%. Accuracy was better for scenario-based questions (69.1%) than for stand-alone questions (62.1%). Agreement between the two responses was substantial (kappa = 0.70). Factual errors accounted for 82% of the incorrectly answered questions. CONCLUSION An LLM performed satisfactorily on Japanese emergency medicine board certification examination questions that did not include images. However, factual errors in the responses highlight the need for physician oversight when using LLMs.
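The breakdowns reported above (overall accuracy, accuracy by question format, and agreement between the two repeated runs) can be illustrated with a minimal sketch. The data and column names below are hypothetical placeholders, not the study's materials.

```python
# Minimal sketch: stratified accuracy and two-run agreement (Cohen's kappa)
# on a toy dataset with assumed column names.
import pandas as pd
from sklearn.metrics import cohen_kappa_score

df = pd.DataFrame({
    "has_image":      [False, False, True, False, True, False],
    "scenario_based": [True,  False, True, True,  False, False],
    "run1_correct":   [1, 0, 0, 1, 1, 1],
    "run2_correct":   [1, 0, 1, 1, 1, 1],
})

overall = df["run1_correct"].mean()
by_image = df.groupby("has_image")["run1_correct"].mean()
by_format = df.groupby("scenario_based")["run1_correct"].mean()
kappa = cohen_kappa_score(df["run1_correct"], df["run2_correct"])

print(f"Overall accuracy (run 1): {overall:.1%}")
print(by_image, by_format, sep="\n")
print(f"Agreement between runs (Cohen's kappa): {kappa:.2f}")
```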
Affiliation(s)
- Yutaka Igarashi
- Department of Emergency and Critical Care Medicine, Nippon Medical School
- Kyoichi Nakahara
- Department of Emergency and Critical Care Medicine, Nippon Medical School
- Tatsuya Norii
- Department of Emergency Medicine, University of New Mexico, NM, United States of America
- Nodoka Miyake
- Department of Emergency and Critical Care Medicine, Nippon Medical School
- Takashi Tagami
- Department of Emergency and Critical Care Medicine, Nippon Medical School Musashi Kosugi Hospital
- Shoji Yokobori
- Department of Emergency and Critical Care Medicine, Nippon Medical School