1. Wu J, Wu X, Qiu Z, Li M, Lin S, Zhang Y, Zheng Y, Yuan C, Yang J. Large language models leverage external knowledge to extend clinical insight beyond language boundaries. J Am Med Inform Assoc 2024;31:2054-2064. [PMID: 38684792; PMCID: PMC11339525; DOI: 10.1093/jamia/ocae079]
Abstract
OBJECTIVES Large Language Models (LLMs) such as ChatGPT and Med-PaLM have excelled in various medical question-answering tasks. However, these English-centric models encounter challenges in non-English clinical settings, primarily due to limited clinical knowledge in the respective languages, a consequence of imbalanced training corpora. We systematically evaluate LLMs in the Chinese medical context and develop a novel in-context learning framework to enhance their performance. MATERIALS AND METHODS The latest China National Medical Licensing Examination (CNMLE-2022) served as the benchmark. We collected 53 medical books and 381 149 medical questions to construct the medical knowledge base and question bank. The proposed Knowledge and Few-shot Enhancement In-context Learning (KFE) framework leverages the in-context learning ability of LLMs to integrate diverse external clinical knowledge sources. We evaluated KFE with ChatGPT (GPT-3.5), GPT-4, Baichuan2-7B, Baichuan2-13B, and QWEN-72B on CNMLE-2022 and further investigated the effectiveness of different pathways for incorporating medical knowledge into LLMs from 7 distinct perspectives. RESULTS Applied directly, ChatGPT failed to qualify for the CNMLE-2022, with a score of 51. When combined with the KFE framework, LLMs of varying sizes yielded consistent and significant improvements. ChatGPT's performance surged to 70.04 and GPT-4 achieved the highest score of 82.59. This surpasses the qualification threshold (60) and exceeds the average human score of 68.70, affirming the effectiveness and robustness of the framework. It also enabled a smaller Baichuan2-13B to pass the examination, showcasing the great potential of the approach in low-resource settings. DISCUSSION AND CONCLUSION This study sheds light on optimal practices for enhancing the capabilities of LLMs in non-English medical scenarios. By synergizing medical knowledge through in-context learning, LLMs can extend clinical insight beyond language barriers in healthcare, significantly reducing language-related disparities of LLM applications and ensuring global benefit in this field.
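The abstract does not reproduce the KFE prompting pipeline itself. As a rough orientation, the minimal Python sketch below illustrates the general pattern of knowledge- and few-shot-enhanced in-context learning it describes: retrieve related knowledge passages and similar solved questions, then assemble them into one prompt. The tiny in-memory corpora and the TF-IDF retriever are illustrative placeholders, not the authors' implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-ins for the medical knowledge base and solved question bank
knowledge_base = [
    "Aspirin irreversibly inhibits cyclooxygenase-1 in platelets.",
    "Warfarin antagonizes vitamin K-dependent clotting factor synthesis.",
]
solved_examples = [
    ("Which drug irreversibly inhibits platelet COX-1?", "Aspirin"),
]

def retrieve(query: str, docs: list, k: int = 1) -> list:
    """Rank documents by TF-IDF cosine similarity to the query and keep the top k."""
    vectorizer = TfidfVectorizer().fit(docs + [query])
    sims = cosine_similarity(vectorizer.transform([query]), vectorizer.transform(docs))[0]
    return [docs[i] for i in sims.argsort()[::-1][:k]]

def build_prompt(question: str) -> str:
    """Assemble retrieved knowledge and few-shot examples around the exam question."""
    knowledge = "\n".join(retrieve(question, knowledge_base, k=1))
    shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in solved_examples)
    return (f"Reference knowledge:\n{knowledge}\n\n"
            f"Worked examples:\n{shots}\n\n"
            f"Q: {question}\nA:")

print(build_prompt("Which anticoagulant blocks vitamin K-dependent factors?"))
```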
Affiliation(s)
- Jiageng Wu
- School of Public Health, Zhejiang University School of Medicine, Hangzhou, 310058, China
- Xian Wu
- Jarvis Research Center, Tencent YouTu Lab, Beijing, 100101, China
- Zhaopeng Qiu
- Jarvis Research Center, Tencent YouTu Lab, Beijing, 100101, China
- Minghui Li
- School of Public Health, Zhejiang University School of Medicine, Hangzhou, 310058, China
- Shixu Lin
- School of Public Health, Zhejiang University School of Medicine, Hangzhou, 310058, China
- Yingying Zhang
- Jarvis Research Center, Tencent YouTu Lab, Beijing, 100101, China
- Yefeng Zheng
- Jarvis Research Center, Tencent YouTu Lab, Beijing, 100101, China
- Changzheng Yuan
- School of Public Health, Zhejiang University School of Medicine, Hangzhou, 310058, China
- Department of Nutrition, Harvard T.H. Chan School of Public Health, Boston, MA 02115, United States
- Jie Yang
- School of Public Health, Zhejiang University School of Medicine, Hangzhou, 310058, China
- Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA 02115, United States

2. Zhao Y, Coppola A, Karamchandani U, Amiras D, Gupte CM. Artificial intelligence applied to magnetic resonance imaging reliably detects the presence, but not the location, of meniscus tears: a systematic review and meta-analysis. Eur Radiol 2024;34:5954-5964. [PMID: 38386028; PMCID: PMC11364796; DOI: 10.1007/s00330-024-10625-7]
Abstract
OBJECTIVES To review and compare the accuracy of convolutional neural networks (CNN) for the diagnosis of meniscal tears in the current literature and analyze the decision-making processes utilized by these CNN algorithms. MATERIALS AND METHODS PubMed, MEDLINE, EMBASE, and Cochrane databases up to December 2022 were searched in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-analysis (PRISMA) statement. Risk of bias analysis was performed for all identified articles. Predictive performance values, including sensitivity and specificity, were extracted for quantitative analysis. The meta-analysis was divided between AI prediction models identifying the presence of meniscus tears and those identifying the location of meniscus tears. RESULTS Eleven articles were included in the final review, with a total of 13,467 patients and 57,551 images. Heterogeneity was statistically significant and large for the sensitivity of the tear identification analysis (I2 = 79%). A higher level of accuracy was observed in identifying the presence of a meniscal tear than in locating tears in specific regions of the meniscus (AUC, 0.939 vs 0.905). Pooled sensitivity and specificity were 0.87 (95% confidence interval (CI) 0.80-0.91) and 0.89 (95% CI 0.83-0.93) for meniscus tear identification and 0.88 (95% CI 0.82-0.91) and 0.84 (95% CI 0.81-0.85) for locating the tears. CONCLUSIONS AI prediction models achieved favorable performance in the diagnosis, but not location, of meniscus tears. Further studies on the clinical utilities of deep learning should include standardized reporting, external validation, and full reports of the predictive performances of these models, with a view to localizing tears more accurately. CLINICAL RELEVANCE STATEMENT Meniscus tears are hard to diagnose on knee magnetic resonance images. AI prediction models may play an important role in improving the diagnostic accuracy of clinicians and radiologists. KEY POINTS • Artificial intelligence (AI) provides great potential in improving the diagnosis of meniscus tears. • The pooled diagnostic performance of artificial intelligence (AI) in identifying meniscus tears (sensitivity 87%, specificity 89%) was better than in locating the tears (sensitivity 88%, specificity 84%). • AI is good at confirming the diagnosis of meniscus tears, but future work is required to guide the management of the disease.
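For readers less familiar with the pooled estimates and the I² statistic quoted above, a schematic Python sketch of random-effects (DerSimonian-Laird) pooling of study-level sensitivities follows. The (TP, FN) counts are invented for illustration; the published analysis used a fuller bivariate model of sensitivity and specificity.

```python
import numpy as np

# Hypothetical (true positive, false negative) counts for four studies
studies = [(80, 12), (45, 9), (150, 30), (60, 5)]

y = np.array([np.log(tp / fn) for tp, fn in studies])    # logit(sensitivity) per study
v = np.array([1 / tp + 1 / fn for tp, fn in studies])    # variance of each logit

w = 1 / v                                                 # fixed-effect weights
q = float(np.sum(w * (y - np.sum(w * y) / np.sum(w)) ** 2))  # Cochran's Q
df = len(studies) - 1
i2 = max(0.0, (q - df) / q) * 100                         # I^2 heterogeneity (%)

tau2 = max(0.0, (q - df) / (np.sum(w) - np.sum(w**2) / np.sum(w)))  # DL between-study variance
w_re = 1 / (v + tau2)                                     # random-effects weights
pooled_logit = np.sum(w_re * y) / np.sum(w_re)
se = np.sqrt(1 / np.sum(w_re))

def expit(x):
    return 1 / (1 + np.exp(-x))

print(f"pooled sensitivity = {expit(pooled_logit):.2f} "
      f"(95% CI {expit(pooled_logit - 1.96 * se):.2f}-{expit(pooled_logit + 1.96 * se):.2f}), "
      f"I^2 = {i2:.0f}%")
```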
Affiliation(s)
- Yi Zhao
- Imperial College London School of Medicine, Exhibition Rd, South Kensington, London, SW7 2BU, UK.
- Andrew Coppola
- Imperial College London School of Medicine, Exhibition Rd, South Kensington, London, SW7 2BU, UK
- Dimitri Amiras
- Imperial College London School of Medicine, Exhibition Rd, South Kensington, London, SW7 2BU, UK
- Imperial College London NHS Trust, London, UK
- Chinmay M Gupte
- Imperial College London School of Medicine, Exhibition Rd, South Kensington, London, SW7 2BU, UK
- Imperial College London NHS Trust, London, UK

3. Ray PP. Integrating AI in radiology: insights from GPT-generated reports and multimodal LLM performance on European Board of Radiology examinations. Jpn J Radiol 2024;42:1083-1084. [PMID: 38647884; DOI: 10.1007/s11604-024-01576-6]

4. Hayden N, Gilbert S, Poisson LM, Griffith B, Klochko C. Performance of GPT-4 with Vision on Text- and Image-based ACR Diagnostic Radiology In-Training Examination Questions. Radiology 2024;312:e240153. [PMID: 39225605; DOI: 10.1148/radiol.240153]
Abstract
Background Recent advancements, including image processing capabilities, present new potential applications of large language models such as ChatGPT (OpenAI), a generative pretrained transformer, in radiology. However, baseline performance of ChatGPT in radiology-related tasks is understudied. Purpose To evaluate the performance of GPT-4 with vision (GPT-4V) on radiology in-training examination questions, including those with images, to gauge the model's baseline knowledge in radiology. Materials and Methods In this prospective study, conducted between September 2023 and March 2024, the September 2023 release of GPT-4V was assessed using 386 retired questions (189 image-based and 197 text-only questions) from the American College of Radiology Diagnostic Radiology In-Training Examinations. Nine question pairs were identified as duplicates; only the first instance of each duplicate was considered in ChatGPT's assessment. A subanalysis assessed the impact of different zero-shot prompts on performance. Statistical analysis included χ2 tests of independence to ascertain whether the performance of GPT-4V varied between question types or subspecialties. The McNemar test was used to evaluate performance differences between the prompts, with Benjamini-Hochberg adjustment of the P values conducted to control the false discovery rate (FDR). A P value threshold of less than .05 denoted statistical significance. Results GPT-4V correctly answered 246 (65.3%) of the 377 unique questions, with significantly higher accuracy on text-only questions (81.5%, 159 of 195) than on image-based questions (47.8%, 87 of 182) (χ2 test, P < .001). Subanalysis revealed differences between prompts on text-based questions, where chain-of-thought prompting outperformed long instruction by 6.1% (McNemar, P = .02; FDR = 0.063), basic prompting by 6.8% (P = .009, FDR = 0.044), and the original prompting style by 8.9% (P = .001, FDR = 0.014). No differences were observed between prompts on image-based questions, with P values of .27 to >.99. Conclusion While GPT-4V demonstrated a level of competence in text-based questions, it showed deficits in interpreting radiologic images. © RSNA, 2024 See also the editorial by Deng in this issue.
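As a rough illustration of the paired statistics described above (McNemar tests between prompt styles followed by Benjamini-Hochberg control of the false discovery rate), a minimal Python sketch follows. The per-question 0/1 correctness vectors are randomly generated placeholders, not the study data.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
n_questions = 195  # number of unique text-only questions in the study
# Fabricated 0/1 correctness per question for four prompt styles
prompts = {name: rng.integers(0, 2, n_questions)
           for name in ["chain_of_thought", "long_instruction", "basic", "original"]}

baseline = prompts["chain_of_thought"]
pairs, pvals = [], []
for name, other in prompts.items():
    if name == "chain_of_thought":
        continue
    # 2x2 agreement/disagreement table between the paired prompts on the same questions
    table = [[int(np.sum((baseline == 1) & (other == 1))), int(np.sum((baseline == 1) & (other == 0)))],
             [int(np.sum((baseline == 0) & (other == 1))), int(np.sum((baseline == 0) & (other == 0)))]]
    pvals.append(mcnemar(table, exact=True).pvalue)
    pairs.append(f"chain_of_thought vs {name}")

reject, fdr, _, _ = multipletests(pvals, method="fdr_bh")
for pair, p, q, r in zip(pairs, pvals, fdr, reject):
    print(f"{pair}: P = {p:.3f}, FDR-adjusted = {q:.3f}, significant = {r}")
```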
Affiliation(s)
- Nolan Hayden
- Spencer Gilbert
- Laila M Poisson
- Brent Griffith
- Chad Klochko
- From the Department of Diagnostic Radiology, Henry Ford Health, 2799 W Grand Blvd, Detroit, MI, 48202 (N.H., B.G., C.K.); Michigan State University College of Osteopathic Medicine, East Lansing, Mich (S.G.); and Department of Public Health Sciences, Henry Ford Health, Michigan State University Health Sciences, Detroit, Mich (L.M.P.)

5. Reith TP, D'Alessandro DM, D'Alessandro MP. Capability of multimodal large language models to interpret pediatric radiological images. Pediatr Radiol 2024;54:1729-1737. [PMID: 39133401; DOI: 10.1007/s00247-024-06025-0]
Abstract
BACKGROUND There is a dearth of artificial intelligence (AI) development and research dedicated to pediatric radiology. The newest iterations of large language models (LLMs) like ChatGPT can process image and video input in addition to text. They are thus theoretically capable of providing impressions of input radiological images. OBJECTIVE To assess the ability of multimodal LLMs to interpret pediatric radiological images. MATERIALS AND METHODS Thirty medically significant cases were collected and submitted to GPT-4 (OpenAI, San Francisco, CA), Gemini 1.5 Pro (Google, Mountain View, CA), and Claude 3 Opus (Anthropic, San Francisco, CA) with a short history for a total of 90 images. AI responses were recorded and independently assessed for accuracy by a resident and attending physician. 95% confidence intervals were determined using the adjusted Wald method. RESULTS Overall, the models correctly diagnosed 27.8% (25/90) of images (95% CI=19.5-37.8%), were partially correct for 13.3% (12/90) of images (95% CI=2.7-26.4%), and were incorrect for 58.9% (53/90) of images (95% CI=48.6-68.5%). CONCLUSION Multimodal LLMs are not yet capable of interpreting pediatric radiological images.
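The 95% confidence intervals quoted above come from the adjusted Wald (Agresti-Coull) method; a small Python sketch of that interval, applied to the reported counts, follows. It is illustrative only, and rounding or variant choices may make the published intervals differ slightly.

```python
from math import sqrt

def adjusted_wald_ci(successes: int, n: int, z: float = 1.96):
    """Agresti-Coull interval: add z^2/2 pseudo-successes and pseudo-failures."""
    n_adj = n + z**2
    p_adj = (successes + z**2 / 2) / n_adj
    half = z * sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - half), min(1.0, p_adj + half)

# Counts reported in the abstract (out of 90 submitted images)
for label, correct in [("correct", 25), ("partially correct", 12), ("incorrect", 53)]:
    lo, hi = adjusted_wald_ci(correct, 90)
    print(f"{label}: {correct/90:.1%} (95% CI {lo:.1%}-{hi:.1%})")
```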
Affiliation(s)
- Thomas P Reith
- Department of Radiology, University of Iowa Hospitals and Clinics, Iowa City, IA, 52242, USA.
- Donna M D'Alessandro
- Department of Pediatrics, University of Iowa Hospitals and Clinics, Iowa City, IA, 52242, USA
- Michael P D'Alessandro
- Department of Radiology, University of Iowa Hospitals and Clinics, Iowa City, IA, 52242, USA

6. Deng F. Multimodal Models Are Still a Novice at Radiology Vision. Radiology 2024;312:e242286. [PMID: 39225607; DOI: 10.1148/radiol.242286]
Affiliation(s)
- Francis Deng
- From the Russell H. Morgan Department of Radiology and Radiological Science, Johns Hopkins University School of Medicine, 600 N Wolfe St, Baltimore, MD 21287

7. Crim J. Bone radiographs: sometimes overlooked, often difficult to read, and still important. Skeletal Radiol 2024;53:1687-1698. [PMID: 37914896; DOI: 10.1007/s00256-023-04498-y]
Affiliation(s)
- Julia Crim
- University of Missouri at Columbia, Columbia, MO, USA.

8. Mitsuyama Y, Tatekawa H, Takita H, Sasaki F, Tashiro A, Oue S, Walston SL, Nonomiya Y, Shintani A, Miki Y, Ueda D. Comparative analysis of GPT-4-based ChatGPT's diagnostic performance with radiologists using real-world radiology reports of brain tumors. Eur Radiol 2024. [PMID: 39198333; DOI: 10.1007/s00330-024-11032-8]
Abstract
OBJECTIVES Large language models like GPT-4 have demonstrated potential for diagnosis in radiology. Previous studies investigating this potential primarily utilized quizzes from academic journals. This study aimed to assess the diagnostic capabilities of GPT-4-based Chat Generative Pre-trained Transformer (ChatGPT) using actual clinical radiology reports of brain tumors and compare its performance with that of neuroradiologists and general radiologists. METHODS We collected brain MRI reports written in Japanese from preoperative brain tumor patients at two institutions from January 2017 to December 2021. The MRI reports were translated into English by radiologists. GPT-4 and five radiologists were presented with the same textual findings from the reports and asked to suggest differential and final diagnoses. The pathological diagnosis of the excised tumor served as the ground truth. McNemar's test and Fisher's exact test were used for statistical analysis. RESULTS In a study analyzing 150 radiological reports, GPT-4 achieved a final diagnostic accuracy of 73%, while radiologists' accuracy ranged from 65 to 79%. GPT-4's final diagnostic accuracy using reports from neuroradiologists was higher at 80%, compared to 60% using those from general radiologists. In the realm of differential diagnoses, GPT-4's accuracy was 94%, while radiologists' fell between 73 and 89%. Notably, for these differential diagnoses, GPT-4's accuracy remained consistent whether reports were from neuroradiologists or general radiologists. CONCLUSION GPT-4 exhibited good diagnostic capability, comparable to neuroradiologists in differentiating brain tumors from MRI reports. GPT-4 can be a second opinion for neuroradiologists on final diagnoses and a guidance tool for general radiologists and residents. CLINICAL RELEVANCE STATEMENT This study evaluated GPT-4-based ChatGPT's diagnostic capabilities using real-world clinical MRI reports from brain tumor cases, revealing that its accuracy in interpreting brain tumors from MRI findings is competitive with radiologists. KEY POINTS We investigated the diagnostic accuracy of GPT-4 using real-world clinical MRI reports of brain tumors. GPT-4 achieved final and differential diagnostic accuracy that is comparable with neuroradiologists. GPT-4 has the potential to improve the diagnostic process in clinical radiology.
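As an illustrative aside, the comparison of GPT-4's final-diagnosis accuracy by report source (80% with neuroradiologists' reports vs 60% with general radiologists' reports) can be examined with Fisher's exact test, one of the methods named above. The 2x2 counts below are hypothetical reconstructions that assume an even split of the 150 reports, so they are not the study's actual data.

```python
from scipy.stats import fisher_exact

#                   correct  incorrect
neuro_reports   = [60, 15]   # ~80% of 75 hypothetical neuroradiologist reports
general_reports = [45, 30]   # ~60% of 75 hypothetical general radiologist reports

odds_ratio, p_value = fisher_exact([neuro_reports, general_reports])
print(f"odds ratio = {odds_ratio:.2f}, P = {p_value:.3f}")
```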
Affiliation(s)
- Yasuhito Mitsuyama
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, 1-4-3 Asahi-machi, Abeno-ku, Osaka, 545-8585, Japan
- Hiroyuki Tatekawa
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, 1-4-3 Asahi-machi, Abeno-ku, Osaka, 545-8585, Japan
- Hirotaka Takita
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, 1-4-3 Asahi-machi, Abeno-ku, Osaka, 545-8585, Japan
- Fumi Sasaki
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, 1-4-3 Asahi-machi, Abeno-ku, Osaka, 545-8585, Japan
- Akane Tashiro
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, 1-4-3 Asahi-machi, Abeno-ku, Osaka, 545-8585, Japan
- Satoshi Oue
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, 1-4-3 Asahi-machi, Abeno-ku, Osaka, 545-8585, Japan
- Shannon L Walston
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, 1-4-3 Asahi-machi, Abeno-ku, Osaka, 545-8585, Japan
- Yuta Nonomiya
- Department of Medical Statistics, Graduate School of Medicine, Osaka Metropolitan University, 1-4-3 Asahi-machi, Abeno-ku, Osaka, 545-8585, Japan
- Ayumi Shintani
- Department of Medical Statistics, Graduate School of Medicine, Osaka Metropolitan University, 1-4-3 Asahi-machi, Abeno-ku, Osaka, 545-8585, Japan
- Yukio Miki
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, 1-4-3 Asahi-machi, Abeno-ku, Osaka, 545-8585, Japan
- Daiju Ueda
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, 1-4-3 Asahi-machi, Abeno-ku, Osaka, 545-8585, Japan.
- Center for Health Science Innovation, Osaka Metropolitan University, 1-4-3, Asahi-machi, Abeno-ku, Osaka, 545-8585, Japan.

9. Warren BE, Alkhalifah F, Ahrari A, Min A, Fawzy A, Annamalai G, Jaberi A, Beecroft R, Kachura JR, Mafeld SC. Feasibility of Artificial Intelligence Powered Adverse Event Analysis: Using a Large Language Model to Analyze Microwave Ablation Malfunction Data. Can Assoc Radiol J 2024:8465371241269436. [PMID: 39169480; DOI: 10.1177/08465371241269436]
Abstract
Objectives: To determine whether a large language model (LLM, GPT-4) can label, consolidate, and analyze interventional radiology (IR) microwave ablation device safety event data into meaningful summaries comparable to those produced by humans. Methods: Microwave ablation safety data from January 1, 2011 to October 31, 2023 were collected, and the type of failure was categorized by human readers. Using GPT-4 and iterative prompt development, the data were classified. Iterative summarization of the reports was performed using GPT-4 to generate a final summary of the large text corpus. Results: Training (n = 25), validation (n = 639), and test (n = 79) data were split to reflect real-world deployment of an LLM for this task. GPT-4 demonstrated high accuracy in the multiclass classification problem of microwave ablation device data (accuracy [95% CI]: training data 96.0% [79.7, 99.9], validation 86.4% [83.5, 89.0], test 87.3% [78.0, 93.8]). The text content was distilled through GPT-4 and iterative summarization prompts. A final summary was created that reflected the clinically relevant insights from the microwave ablation data relative to the human interpretation but contained inaccurate event class counts. Conclusion: The LLM emulated the human analysis, suggesting the feasibility of using LLMs to process large volumes of IR safety data as a tool for clinicians. It accurately labelled microwave ablation device event data by type of malfunction through few-shot learning. Content distillation was used to analyze a large text corpus (>650 reports) and generate an insightful summary comparable to the human interpretation.
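A hedged sketch of few-shot labelling of free-text device-malfunction reports with an LLM, in the spirit of the workflow described above, is given below. The label set, few-shot example, and prompt wording are placeholders rather than the study's materials, and the call assumes the current OpenAI Python SDK with an API key in the environment.

```python
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

LABELS = ["antenna failure", "generator error", "cable fault", "other"]  # hypothetical labels
FEW_SHOT = (
    "Report: Device shut down mid-ablation with error code on console.\n"
    "Label: generator error\n"
)

def classify(report: str) -> str:
    """Ask the model to assign exactly one label to a free-text event report."""
    prompt = (
        f"Classify the microwave ablation event report into one of: {', '.join(LABELS)}.\n"
        f"{FEW_SHOT}Report: {report}\nLabel:"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

print(classify("Ablation aborted after the antenna tip fractured during insertion."))
```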
Affiliation(s)
- Blair E Warren
- Department of Medical Imaging, University of Toronto, Temerty Faculty of Medicine, Toronto, ON, Canada
- Division of Vascular and Interventional Radiology, Joint Department of Medical Imaging, University Health Network, Toronto, ON, Canada
- Fahd Alkhalifah
- Department of Medical Imaging, University of Toronto, Temerty Faculty of Medicine, Toronto, ON, Canada
- Division of Vascular and Interventional Radiology, Joint Department of Medical Imaging, University Health Network, Toronto, ON, Canada
- Aida Ahrari
- Department of Medical Imaging, University of Toronto, Temerty Faculty of Medicine, Toronto, ON, Canada
- Division of Vascular and Interventional Radiology, Joint Department of Medical Imaging, University Health Network, Toronto, ON, Canada
- Adam Min
- Department of Medical Imaging, University of Toronto, Temerty Faculty of Medicine, Toronto, ON, Canada
- Division of Vascular and Interventional Radiology, Joint Department of Medical Imaging, University Health Network, Toronto, ON, Canada
- Aly Fawzy
- Department of Medical Imaging, University of Toronto, Temerty Faculty of Medicine, Toronto, ON, Canada
- Ganesan Annamalai
- Department of Medical Imaging, University of Toronto, Temerty Faculty of Medicine, Toronto, ON, Canada
- Division of Vascular and Interventional Radiology, Joint Department of Medical Imaging, University Health Network, Toronto, ON, Canada
- Arash Jaberi
- Department of Medical Imaging, University of Toronto, Temerty Faculty of Medicine, Toronto, ON, Canada
- Division of Vascular and Interventional Radiology, Joint Department of Medical Imaging, University Health Network, Toronto, ON, Canada
- Robert Beecroft
- Department of Medical Imaging, University of Toronto, Temerty Faculty of Medicine, Toronto, ON, Canada
- Division of Vascular and Interventional Radiology, Joint Department of Medical Imaging, University Health Network, Toronto, ON, Canada
- John R Kachura
- Department of Medical Imaging, University of Toronto, Temerty Faculty of Medicine, Toronto, ON, Canada
- Division of Vascular and Interventional Radiology, Joint Department of Medical Imaging, University Health Network, Toronto, ON, Canada
- Sebastian C Mafeld
- Department of Medical Imaging, University of Toronto, Temerty Faculty of Medicine, Toronto, ON, Canada
- Division of Vascular and Interventional Radiology, Joint Department of Medical Imaging, University Health Network, Toronto, ON, Canada

10. Ray PP. Need of Fine-Tuned Radiology Aware Open-Source Large Language Models for Neuroradiology. Clin Neuroradiol 2024. [PMID: 39158608; DOI: 10.1007/s00062-024-01454-8]
Affiliation(s)
- Partha Pratim Ray
- Department of Computer Applications, Sikkim University, 6th Mile, PO-Tadong, 737102, Gangtok, Sikkim, India.

11. Holderried F, Stegemann-Philipps C, Herrmann-Werner A, Festl-Wietek T, Holderried M, Eickhoff C, Mahling M. A Language Model-Powered Simulated Patient With Automated Feedback for History Taking: Prospective Study. JMIR Med Educ 2024;10:e59213. [PMID: 39150749; PMCID: PMC11364946; DOI: 10.2196/59213]
Abstract
BACKGROUND Although history taking is fundamental for diagnosing medical conditions, teaching and providing feedback on the skill can be challenging due to resource constraints. Virtual simulated patients and web-based chatbots have thus emerged as educational tools, with recent advancements in artificial intelligence (AI) such as large language models (LLMs) enhancing their realism and potential to provide feedback. OBJECTIVE In our study, we aimed to evaluate the effectiveness of a Generative Pretrained Transformer (GPT) 4 model to provide structured feedback on medical students' performance in history taking with a simulated patient. METHODS We conducted a prospective study involving medical students performing history taking with a GPT-powered chatbot. To that end, we designed a chatbot to simulate patients' responses and provide immediate feedback on the comprehensiveness of the students' history taking. Students' interactions with the chatbot were analyzed, and feedback from the chatbot was compared with feedback from a human rater. We measured interrater reliability and performed a descriptive analysis to assess the quality of feedback. RESULTS Most of the study's participants were in their third year of medical school. A total of 1894 question-answer pairs from 106 conversations were included in our analysis. GPT-4's role-play and responses were medically plausible in more than 99% of cases. Interrater reliability between GPT-4 and the human rater showed "almost perfect" agreement (Cohen κ=0.832). Lower agreement (κ<0.6), detected for 8 of the 45 feedback categories, highlighted topics on which the model's assessments were overly specific or diverged from human judgment. CONCLUSIONS The GPT model was effective in providing structured feedback on history-taking dialogs provided by medical students. Although we identified some limitations regarding the specificity of feedback for certain categories, the overall high agreement with human raters suggests that LLMs can be a valuable tool for medical education. Our findings thus advocate the careful integration of AI-driven feedback mechanisms in medical training and highlight important aspects to consider when LLMs are used in that context.
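The interrater-reliability figure quoted above is Cohen's kappa; a minimal Python sketch of that calculation follows, with invented label vectors standing in for the per-category ratings.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical 0/1 labels (item addressed vs missed) for one feedback category
gpt4_labels  = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
human_labels = [1, 1, 0, 1, 1, 1, 1, 0, 1, 1]

kappa = cohen_kappa_score(gpt4_labels, human_labels)
print(f"Cohen's kappa = {kappa:.3f}")  # values > 0.8 are conventionally read as 'almost perfect'
```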
Affiliation(s)
- Friederike Holderried
- Tübingen Institute for Medical Education (TIME), Medical Faculty, University of Tübingen, Tübingen, Germany
- Anne Herrmann-Werner
- Tübingen Institute for Medical Education (TIME), Medical Faculty, University of Tübingen, Tübingen, Germany
- Teresa Festl-Wietek
- Tübingen Institute for Medical Education (TIME), Medical Faculty, University of Tübingen, Tübingen, Germany
- Martin Holderried
- Department of Medical Development, Process and Quality Management, University Hospital Tübingen, Tübingen, Germany
- Carsten Eickhoff
- Institute for Applied Medical Informatics, University of Tübingen, Tübingen, Germany
- Moritz Mahling
- Tübingen Institute for Medical Education (TIME), Medical Faculty, University of Tübingen, Tübingen, Germany
- Department of Medical Development, Process and Quality Management, University Hospital Tübingen, Tübingen, Germany

12. Sadeq MA, Ghorab RMF, Ashry MH, Abozaid AM, Banihani HA, Salem M, Aisheh MTA, Abuzahra S, Mourid MR, Assker MM, Ayyad M, Moawad MHED. AI chatbots show promise but limitations on UK medical exam questions: a comparative performance study. Sci Rep 2024;14:18859. [PMID: 39143077; PMCID: PMC11324724; DOI: 10.1038/s41598-024-68996-2]
Abstract
Large language models (LLMs) like ChatGPT have potential applications in medical education, such as helping students study for their licensing exams by discussing unclear questions with them. However, they require evaluation on these complex tasks. The purpose of this study was to evaluate how well publicly accessible LLMs performed on simulated UK medical board exam questions. 423 board-style questions from 9 UK exams (MRCS, MRCP, etc.) were answered by seven LLMs (ChatGPT-3.5, ChatGPT-4, Bard, Perplexity, Claude, Bing, Claude Instant). There were 406 multiple-choice, 13 true/false, and 4 "choose N" questions covering topics in surgery, pediatrics, and other disciplines. The accuracy of the output was graded. Statistical tests were used to analyze differences among LLMs. Leaked questions were excluded from the primary analysis. ChatGPT-4.0 scored 78.2%, Bing 67.2%, Claude 64.4%, and Claude Instant 62.9%; Perplexity scored the lowest (56.1%). Scores differed significantly between LLMs overall (p < 0.001) and in pairwise comparisons. All LLMs scored higher on multiple-choice than on true/false or "choose N" questions. LLMs demonstrated limitations in answering certain questions, indicating that refinements are needed before they can be relied on in medical education. However, their expanding capabilities suggest a potential to improve training if thoughtfully implemented. Further research should explore specialty-specific LLMs and optimal integration into medical curricula.
Affiliation(s)
- Mohammed Ahmed Sadeq
- Misr University for Science and Technology, 6th of October, Egypt.
- Medical Research Platform (MRP), Giza, Egypt.
- Emergency Medicine Department, Elsheikh Zayed Specialized Hospital, Elsheikh Zayed City, Egypt.
- Reem Mohamed Farouk Ghorab
- Misr University for Science and Technology, 6th of October, Egypt
- Medical Research Platform (MRP), Giza, Egypt
- Emergency Medicine Department, Elsheikh Zayed Specialized Hospital, Elsheikh Zayed City, Egypt
- Mohamed Hady Ashry
- Medical Research Platform (MRP), Giza, Egypt
- School of Medicine, New Giza University (NGU), Giza, Egypt
- Ahmed Mohamed Abozaid
- Medical Research Platform (MRP), Giza, Egypt
- Faculty of Medicine, Tanta University, Tanta, Egypt
- Haneen A Banihani
- Medical Research Platform (MRP), Giza, Egypt
- Faculty of Medicine, University of Jordan, Amman, Jordan
- Moustafa Salem
- Medical Research Platform (MRP), Giza, Egypt
- Faculty of Medicine, Mansoura University, Mansoura, Egypt
- Mohammed Tawfiq Abu Aisheh
- Medical Research Platform (MRP), Giza, Egypt
- Department of Medicine, College of Medicine and Health Sciences, An-Najah National University, Nablus, 44839, Palestine
- Saad Abuzahra
- Medical Research Platform (MRP), Giza, Egypt
- Department of Medicine, College of Medicine and Health Sciences, An-Najah National University, Nablus, 44839, Palestine
- Marina Ramzy Mourid
- Medical Research Platform (MRP), Giza, Egypt
- Faculty of Medicine, Alexandria University, Alexandria, Egypt
- Mohamad Monif Assker
- Medical Research Platform (MRP), Giza, Egypt
- Sheikh Khalifa Medical City, Abu Dhabi, UAE
- Mohammed Ayyad
- Medical Research Platform (MRP), Giza, Egypt
- Faculty of Medicine, Al-Quds University, Jerusalem, Palestine
- Mostafa Hossam El Din Moawad
- Medical Research Platform (MRP), Giza, Egypt
- Faculty of Pharmacy Clinical Department, Alexandria University, Alexandria, Egypt
- Faculty of Medicine, Suez Canal University, Ismailia, Egypt

13. Fatima A, Shafique MA, Alam K, Fadlalla Ahmed TK, Mustafa MS. ChatGPT in medicine: A cross-disciplinary systematic review of ChatGPT's (artificial intelligence) role in research, clinical practice, education, and patient interaction. Medicine (Baltimore) 2024;103:e39250. [PMID: 39121303; PMCID: PMC11315549; DOI: 10.1097/md.0000000000039250]
Abstract
BACKGROUND ChatGPT, a powerful AI language model, has gained increasing prominence in medicine, offering potential applications in healthcare, clinical decision support, patient communication, and medical research. This systematic review aims to comprehensively assess the applications of ChatGPT in healthcare education, research, writing, patient communication, and practice while also delineating potential limitations and areas for improvement. METHOD Our comprehensive database search retrieved relevant papers from PubMed, Medline and Scopus. After the screening process, 83 studies met the inclusion criteria. This review includes original studies comprising case reports, analytical studies, and editorials with original findings. RESULT ChatGPT is useful for scientific research and academic writing, and assists with grammar, clarity, and coherence. This helps non-English speakers and improves accessibility by breaking down linguistic barriers. However, its limitations include probable inaccuracy and ethical issues, such as bias and plagiarism. ChatGPT streamlines workflows and offers diagnostic and educational potential in healthcare but exhibits biases and lacks emotional sensitivity. It is useful in patient communication, but requires up-to-date data and faces concerns about the accuracy of information and hallucinatory responses. CONCLUSION Given the potential for ChatGPT to transform healthcare education, research, and practice, it is essential to approach its adoption in these areas with caution due to its inherent limitations.
Affiliation(s)
- Afia Fatima
- Department of Medicine, Jinnah Sindh Medical University, Karachi, Pakistan
- Khadija Alam
- Department of Medicine, Liaquat National Medical College, Karachi, Pakistan

14. Beşler MS. The performance of the multimodal large language model GPT-4 on the European board of radiology examination sample test. Jpn J Radiol 2024;42:927. [PMID: 38568429; DOI: 10.1007/s11604-024-01565-9]
Affiliation(s)
- Muhammed Said Beşler
- Department of Radiology, Kahramanmaraş Necip Fazıl City Hospital, Kahramanmaraş, Turkey.

15. Cao JJ, Kwon DH, Ghaziani TT, Kwo P, Tse G, Kesselman A, Kamaya A, Tse JR. Large language models' responses to liver cancer surveillance, diagnosis, and management questions: accuracy, reliability, readability. Abdom Radiol (NY) 2024. [PMID: 39088019; DOI: 10.1007/s00261-024-04501-7]
Abstract
PURPOSE To assess the accuracy, reliability, and readability of publicly available large language models in answering fundamental questions on hepatocellular carcinoma diagnosis and management. METHODS Twenty questions on liver cancer diagnosis and management were asked in triplicate to ChatGPT-3.5 (OpenAI), Gemini (Google), and Bing (Microsoft). Responses were assessed by six fellowship-trained physicians from three academic liver transplant centers who actively diagnose and/or treat liver cancer. Responses were categorized as accurate (score 1; all information is true and relevant), inadequate (score 0; all information is true, but does not fully answer the question or provides irrelevant information), or inaccurate (score -1; any information is false). Means with standard deviations were recorded. A response was considered accurate as a whole if its mean score was > 0, and reliable if the mean score was > 0 across all responses to the same question. Responses were also quantified for readability using the Flesch Reading Ease Score and Flesch-Kincaid Grade Level. Readability and accuracy across 60 responses were compared using one-way ANOVAs with Tukey's multiple comparison tests. RESULTS Of the twenty questions, ChatGPT answered nine (45%), Gemini answered 12 (60%), and Bing answered six (30%) questions accurately; however, only six (30%), eight (40%), and three (15%), respectively, were both accurate and reliable. There were no significant differences in accuracy between any chatbot. ChatGPT responses were the least readable (mean Flesch Reading Ease Score 29; college graduate), followed by Gemini (30; college) and Bing (40; college; p < 0.001). CONCLUSION Large language models provide complex responses to basic questions on hepatocellular carcinoma diagnosis and management that are seldom accurate, reliable, or readable.
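The readability metrics above are the Flesch Reading Ease and Flesch-Kincaid Grade Level; a small Python sketch of the standard formulas follows, using a naive syllable counter (published tools such as textstat estimate syllables more carefully, so exact values may differ slightly).

```python
import re

def count_syllables(word: str) -> int:
    """Very rough syllable count: contiguous vowel groups, minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_scores(text: str):
    """Return (Flesch Reading Ease, Flesch-Kincaid Grade Level) for a text."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / sentences
    syllables_per_word = syllables / len(words)
    reading_ease = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    grade_level = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return reading_ease, grade_level

sample = "Hepatocellular carcinoma surveillance uses ultrasound every six months."
ease, grade = flesch_scores(sample)
print(f"Flesch Reading Ease = {ease:.0f}, Flesch-Kincaid Grade Level = {grade:.1f}")
```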
Affiliation(s)
- Jennie J Cao
- Department of Radiology, Stanford University School of Medicine, 300 Pasteur Drive, Room H-1307, Stanford, CA, 94305, USA
- Daniel H Kwon
- Department of Medicine, San Francisco School of Medicine, University of California, 505 Parnassus Ave, MC1286C, San Francisco, CA, 94144, USA
- Tara T Ghaziani
- Department of Medicine, Stanford University School of Medicine, 430 Broadway St MC 6341, Redwood City, CA, 94063, USA
- Paul Kwo
- Department of Medicine, Stanford University School of Medicine, 430 Broadway St MC 6341, Redwood City, CA, 94063, USA
- Gary Tse
- Department of Radiological Sciences, Los Angeles David Geffen School of Medicine, University of California, 757 Westwood Plaza Los Angeles, Los Angeles, CA, 90095, USA
- Andrew Kesselman
- Department of Radiology, Stanford University School of Medicine, 875 Blake Wilbur Drive Palo Alto, Stanford, CA, 94304, USA
- Aya Kamaya
- Department of Radiology, Stanford University School of Medicine, 300 Pasteur Drive, Room H-1307, Stanford, CA, 94305, USA
- Justin R Tse
- Department of Radiology, Stanford University School of Medicine, 300 Pasteur Drive, Room H-1307, Stanford, CA, 94305, USA.

16. Adams LC, Truhn D, Busch F, Dorfner F, Nawabi J, Makowski MR, Bressem KK. Llama 3 Challenges Proprietary State-of-the-Art Large Language Models in Radiology Board-style Examination Questions. Radiology 2024;312:e241191. [PMID: 39136566; DOI: 10.1148/radiol.241191]
Affiliation(s)
- Lisa C Adams
- Daniel Truhn
- Felix Busch
- Felix Dorfner
- Jawed Nawabi
- Marcus R Makowski
- Keno K Bressem
- From the Department of Diagnostic and Interventional Radiology, Technical University Munich, Ismaninger Str. 22, 81675 Munich, Germany (L.C.A., M.R.M., K.K.B.); Department of Radiology, University Hospital RWTH Aachen, Aachen, Germany (D.T.); Departments of Radiology (F.B., F.D.) and Neuroradiology (J.N.), Charité-Universitätsmedizin Berlin, Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany; and Department of Radiology and Nuclear Medicine, German Heart Center Munich, Munich, Germany (K.K.B.)

17. Barak-Corren Y, Wolf R, Rozenblum R, Creedon JK, Lipsett SC, Lyons TW, Michelson KA, Miller KA, Shapiro DJ, Reis BY, Fine AM. Harnessing the Power of Generative AI for Clinical Summaries: Perspectives From Emergency Physicians. Ann Emerg Med 2024;84:128-138. [PMID: 38483426; DOI: 10.1016/j.annemergmed.2024.01.039]
Abstract
STUDY OBJECTIVE The workload of clinical documentation contributes to health care costs and professional burnout. The advent of generative artificial intelligence language models presents a promising solution. The perspective of clinicians may contribute to effective and responsible implementation of such tools. This study sought to evaluate 3 uses for generative artificial intelligence for clinical documentation in pediatric emergency medicine, measuring time savings, effort reduction, and physician attitudes and identifying potential risks and barriers. METHODS This mixed-methods study was performed with 10 pediatric emergency medicine attending physicians from a single pediatric emergency department. Participants were asked to write a supervisory note for 4 clinical scenarios, with varying levels of complexity, twice without any assistance and twice with the assistance of ChatGPT Version 4.0. Participants evaluated 2 additional ChatGPT-generated clinical summaries: a structured handoff and a visit summary for a family written at an 8th grade reading level. Finally, a semistructured interview was performed to assess physicians' perspective on the use of ChatGPT in pediatric emergency medicine. Main outcomes and measures included between subjects' comparisons of the effort and time taken to complete the supervisory note with and without ChatGPT assistance. Effort was measured using a self-reported Likert scale of 0 to 10. Physicians' scoring of and attitude toward the ChatGPT-generated summaries were measured using a 0 to 10 Likert scale and open-ended questions. Summaries were scored for completeness, accuracy, efficiency, readability, and overall satisfaction. A thematic analysis was performed to analyze the content of the open-ended questions and to identify key themes. RESULTS ChatGPT yielded a 40% reduction in time and a 33% decrease in effort for supervisory notes in intricate cases, with no discernible effect on simpler notes. ChatGPT-generated summaries for structured handoffs and family letters were highly rated, ranging from 7.0 to 9.0 out of 10, and most participants favored their inclusion in clinical practice. However, there were several critical reservations, out of which a set of general recommendations for applying ChatGPT to clinical summaries was formulated. CONCLUSION Pediatric emergency medicine attendings in our study perceived that ChatGPT can deliver high-quality summaries while saving time and effort in many scenarios, but not all.
Affiliation(s)
- Yuval Barak-Corren
- Predictive Medicine Group, Computational Health Informatics Program, Boston Children's Hospital, Boston, MA; Division of Cardiology, Children's Hospital of Philadelphia, Philadelphia, PA.
- Rebecca Wolf
- Emergency Medicine Boston Children's Hospital, Boston, MA
- Ronen Rozenblum
- Harvard Medical School Boston, MA; Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, MA
- Jessica K Creedon
- Emergency Medicine Boston Children's Hospital, Boston, MA; Harvard Medical School Boston, MA
- Susan C Lipsett
- Emergency Medicine Boston Children's Hospital, Boston, MA; Harvard Medical School Boston, MA
- Todd W Lyons
- Emergency Medicine Boston Children's Hospital, Boston, MA; Harvard Medical School Boston, MA
- Kelsey A Miller
- Emergency Medicine Boston Children's Hospital, Boston, MA; Harvard Medical School Boston, MA
- Daniel J Shapiro
- Division of Pediatric Emergency Medicine, University of California, San Francisco, San Francisco, CA
- Ben Y Reis
- Predictive Medicine Group, Computational Health Informatics Program, Boston Children's Hospital, Boston, MA; Harvard Medical School Boston, MA
- Andrew M Fine
- Emergency Medicine Boston Children's Hospital, Boston, MA; Harvard Medical School Boston, MA

18. D'Anna G, Van Cauter S, Thurnher M, Van Goethem J, Haller S. Can large language models pass official high-grade exams of the European Society of Neuroradiology courses? A direct comparison between OpenAI chatGPT 3.5, OpenAI GPT4 and Google Bard. Neuroradiology 2024;66:1245-1250. [PMID: 38705899; DOI: 10.1007/s00234-024-03371-6]
Abstract
We compared three LLMs, ChatGPT 3.5, GPT-4, and Google Bard, and tested whether their performance differs across subspecialty domains by having them take examinations from four courses of the European Society of Neuroradiology (ESNR): anatomy/embryology, neuro-oncology, head and neck, and pediatrics. Written ESNR exams were used as input data, covering anatomy/embryology (30 questions), neuro-oncology (50 questions), head and neck (50 questions), and pediatrics (50 questions). All exams together, and each exam separately, were presented to the three LLMs: ChatGPT 3.5, GPT-4, and Google Bard. Statistical analyses included a group-wise Friedman test followed by pair-wise Wilcoxon tests with multiple comparison corrections. Overall, there was a significant difference between the 3 LLMs (p < 0.0001), with GPT-4 having the highest accuracy (70%), followed by ChatGPT 3.5 (54%) and Google Bard (36%). The pair-wise comparisons showed significant differences between ChatGPT 3.5 vs GPT-4 (p < 0.0001), ChatGPT 3.5 vs Bard (p < 0.0023), and GPT-4 vs Bard (p < 0.0001). Analyses per subspecialty showed the largest difference between the best LLM (GPT-4, 70%) and the worst LLM (Google Bard, 24%) in the head and neck exam, while the difference was least pronounced in neuro-oncology (GPT-4, 62% vs Google Bard, 48%). We observed significant differences in the performance of the three LLMs on official exams organized by the ESNR. Overall, GPT-4 performed best and Google Bard worst; the difference varied by subspecialty and was most pronounced in head and neck.
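A minimal Python sketch of the statistics described above (group-wise Friedman test, then pairwise Wilcoxon signed-rank tests with a multiple-comparison correction) follows. The per-question 0/1 scores are randomly generated to roughly match the reported accuracies, and Bonferroni is used as an example correction since the abstract does not name one.

```python
import numpy as np
from itertools import combinations
from scipy.stats import friedmanchisquare, wilcoxon
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(1)
n_questions = 180  # anatomy/embryology + neuro-oncology + head and neck + pediatrics
# Fabricated per-question 0/1 scores, roughly matching the reported accuracies
scores = {"GPT-4": rng.binomial(1, 0.70, n_questions),
          "ChatGPT 3.5": rng.binomial(1, 0.54, n_questions),
          "Google Bard": rng.binomial(1, 0.36, n_questions)}

stat, p_friedman = friedmanchisquare(*scores.values())
print(f"Friedman test: chi2 = {stat:.1f}, p = {p_friedman:.2g}")

pairs = list(combinations(scores, 2))
pvals = [wilcoxon(scores[a], scores[b]).pvalue for a, b in pairs]
adjusted = multipletests(pvals, method="bonferroni")[1]
for (a, b), p_adj in zip(pairs, adjusted):
    print(f"{a} vs {b}: adjusted p = {p_adj:.2g}")
```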
Affiliation(s)
- Gennaro D'Anna
- Neuroimaging Unit, ASST Ovest Milanese, Legnano, Milan, Italy.
- Sofie Van Cauter
- Department of Medical Imaging, Ziekenhuis Oost-Limburg, Genk, Belgium
- Department of Medicine and Life Sciences, Hasselt University, Hasselt, Belgium
- Majda Thurnher
- Department for Biomedical Imaging and Image-Guided Therapy, Medical University of Vienna, Vienna, Austria
- Johan Van Goethem
- Department of Medical and Molecular Imaging, VITAZ, Sint-Niklaas, Belgium
- Department of Radiology, University Hospital Antwerp, Antwerp, Belgium
- Sven Haller
- CIMC-Centre d'Imagerie Médicale de Cornavin, Geneva, Switzerland
- Department of Surgical Sciences, Radiology, Uppsala University, Uppsala, Sweden
- Faculty of Medicine, University of Geneva, Geneva, Switzerland
- Department of Radiology, Beijing Tiantan Hospital, Capital Medical University, Beijing, People's Republic of China

19. Kim SE, Lee JH, Choi BS, Han HS, Lee MC, Ro DH. Performance of ChatGPT on Solving Orthopedic Board-Style Questions: A Comparative Analysis of ChatGPT 3.5 and ChatGPT 4. Clin Orthop Surg 2024;16:669-673. [PMID: 39092297; PMCID: PMC11262944; DOI: 10.4055/cios23179]
Abstract
Background The application of artificial intelligence and large language models in the medical field requires an evaluation of their accuracy in providing medical information. This study aimed to assess the performance of Chat Generative Pre-trained Transformer (ChatGPT) models 3.5 and 4 in solving orthopedic board-style questions. Methods A total of 160 text-only questions from the Orthopedic Surgery Department at Seoul National University Hospital, conforming to the format of the Korean Orthopedic Association board certification examinations, were input into the ChatGPT 3.5 and ChatGPT 4 programs. The questions were divided into 11 subcategories. The accuracy rates of the initial answers provided by ChatGPT 3.5 and ChatGPT 4 were analyzed. In addition, inconsistency rates of answers were evaluated by regenerating the responses. Results ChatGPT 3.5 answered 37.5% of the questions correctly, while ChatGPT 4 showed an accuracy rate of 60.0% (p < 0.001). ChatGPT 4 demonstrated superior performance across most subcategories, except for the tumor-related questions. The rates of inconsistency in answers were 47.5% for ChatGPT 3.5 and 9.4% for ChatGPT 4. Conclusions ChatGPT 4 showed the ability to pass orthopedic board-style examinations, outperforming ChatGPT 3.5 in accuracy rate. However, inconsistencies in response generation and instances of incorrect answers with misleading explanations require caution when applying ChatGPT in clinical settings or for educational purposes.
Affiliation(s)
- Sung Eun Kim
- Department of Orthopedic Surgery, Seoul National University Hospital, Seoul National University College of Medicine, Seoul, Korea
- Ji Han Lee
- Department of Orthopedic Surgery, Seoul National University Hospital, Seoul National University College of Medicine, Seoul, Korea
- Byung Sun Choi
- Department of Orthopedic Surgery, Seoul National University Hospital, Seoul National University College of Medicine, Seoul, Korea
- Hyuk-Soo Han
- Department of Orthopedic Surgery, Seoul National University Hospital, Seoul National University College of Medicine, Seoul, Korea
- Myung Chul Lee
- Department of Orthopedic Surgery, Seoul National University Hospital, Seoul National University College of Medicine, Seoul, Korea
- Du Hyun Ro
- Department of Orthopedic Surgery, Seoul National University Hospital, Seoul National University College of Medicine, Seoul, Korea

20. Hirano Y, Hanaoka S, Nakao T, Miki S, Kikuchi T, Nakamura Y, Nomura Y, Yoshikawa T, Abe O. GPT-4 Turbo with Vision fails to outperform text-only GPT-4 Turbo in the Japan Diagnostic Radiology Board Examination. Jpn J Radiol 2024;42:918-926. [PMID: 38733472; PMCID: PMC11286662; DOI: 10.1007/s11604-024-01561-z]
Abstract
PURPOSE To assess the performance of GPT-4 Turbo with Vision (GPT-4TV), OpenAI's latest multimodal large language model, by comparing its ability to process both text and image inputs with that of the text-only GPT-4 Turbo (GPT-4 T) in the context of the Japan Diagnostic Radiology Board Examination (JDRBE). MATERIALS AND METHODS The dataset comprised questions from JDRBE 2021 and 2023. A total of six board-certified diagnostic radiologists discussed the questions and provided ground-truth answers by consulting relevant literature as necessary. The following questions were excluded: those lacking associated images, those with no unanimous agreement on answers, and those including images rejected by the OpenAI application programming interface. The inputs for GPT-4TV included both text and images, whereas those for GPT-4 T were entirely text. Both models were deployed on the dataset, and their performance was compared using McNemar's exact test. The radiological credibility of the responses was assessed by two diagnostic radiologists through the assignment of legitimacy scores on a five-point Likert scale. These scores were subsequently used to compare model performance using Wilcoxon's signed-rank test. RESULTS The dataset comprised 139 questions. GPT-4TV correctly answered 62 questions (45%), whereas GPT-4 T correctly answered 57 questions (41%). A statistical analysis found no significant performance difference between the two models (P = 0.44). The GPT-4TV responses received significantly lower legitimacy scores from both radiologists than the GPT-4 T responses. CONCLUSION No significant enhancement in accuracy was observed when using GPT-4TV with image input compared with that of using text-only GPT-4 T for JDRBE questions.
Collapse
Affiliation(s)
- Yuichiro Hirano
- Department of Radiology, The International University of Health and Welfare Narita Hospital, 852 Hatakeda, Narita, Chiba, Japan.
- Department of Radiology, the University of Tokyo Hospital, 7-3-1 Hongo, Bunkyo-Ku, Tokyo, Japan.
| | - Shouhei Hanaoka
- Department of Radiology, the University of Tokyo Hospital, 7-3-1 Hongo, Bunkyo-Ku, Tokyo, Japan
| | - Takahiro Nakao
- Department of Computational Diagnostic Radiology and Preventive Medicine, the University of Tokyo Hospital, 7-3-1 Hongo, Bunkyo-Ku, Tokyo, Japan
| | - Soichiro Miki
- Department of Computational Diagnostic Radiology and Preventive Medicine, the University of Tokyo Hospital, 7-3-1 Hongo, Bunkyo-Ku, Tokyo, Japan
| | - Tomohiro Kikuchi
- Department of Computational Diagnostic Radiology and Preventive Medicine, the University of Tokyo Hospital, 7-3-1 Hongo, Bunkyo-Ku, Tokyo, Japan
- Department of Radiology, School of Medicine, Jichi Medical University, 3311-1 Yakushiji, Shimotsuke, Tochigi, Japan
| | - Yuta Nakamura
- Department of Computational Diagnostic Radiology and Preventive Medicine, the University of Tokyo Hospital, 7-3-1 Hongo, Bunkyo-Ku, Tokyo, Japan
| | - Yukihiro Nomura
- Department of Computational Diagnostic Radiology and Preventive Medicine, the University of Tokyo Hospital, 7-3-1 Hongo, Bunkyo-Ku, Tokyo, Japan
- Center for Frontier Medical Engineering, Chiba University, 1-33 Yayoicho, Inage-Ku, Chiba, Japan
| | - Takeharu Yoshikawa
- Department of Computational Diagnostic Radiology and Preventive Medicine, the University of Tokyo Hospital, 7-3-1 Hongo, Bunkyo-Ku, Tokyo, Japan
| | - Osamu Abe
- Department of Radiology, the University of Tokyo Hospital, 7-3-1 Hongo, Bunkyo-Ku, Tokyo, Japan
| |
Collapse
|
21
|
Naja F, Taktouk M, Matbouli D, Khaleel S, Maher A, Uzun B, Alameddine M, Nasreddine L. Artificial intelligence chatbots for the nutrition management of diabetes and the metabolic syndrome. Eur J Clin Nutr 2024:10.1038/s41430-024-01476-y. [PMID: 39060542 DOI: 10.1038/s41430-024-01476-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/08/2024] [Revised: 07/16/2024] [Accepted: 07/17/2024] [Indexed: 07/28/2024]
Abstract
BACKGROUND Recently, there has been a growing interest in exploring AI-driven chatbots, such as ChatGPT, as a resource for disease management and education. OBJECTIVE The study aims to evaluate ChatGPT's accuracy and quality/clarity in providing nutritional management for type 2 diabetes mellitus (T2DM), the metabolic syndrome (MetS) and its components, in accordance with the Academy of Nutrition and Dietetics' guidelines. METHODS Three nutrition management-related domains were considered: (1) Dietary management, (2) Nutrition care process (NCP) and (3) Menu planning (1500 kcal). A total of 63 prompts were used. Two experienced dietitians evaluated the chatbot output's concordance with the guidelines. RESULTS Both dietitians provided similar assessments for most conditions examined in the study. Gaps in the ChatGPT-derived outputs were identified and included weight loss recommendations, energy deficit, anthropometric assessment, specific nutrients of concern and the adoption of specific dietary interventions. Gaps in physical activity recommendations were also observed, highlighting ChatGPT's limitations in providing holistic lifestyle interventions. Within the NCP, the generated output provided incomplete examples of diagnostic documentation statements and had significant gaps in the monitoring and evaluation step. In the 1500 kcal one-day menus, the amounts of carbohydrates, fat, vitamin D and calcium were discordant with dietary recommendations. Regarding clarity, dietitians rated the output as either good or excellent. CONCLUSION Although ChatGPT is an increasingly available resource for practitioners, users are encouraged to consider the gaps identified in this study in the dietary management of T2DM and the MetS.
Collapse
Affiliation(s)
- Farah Naja
- Department of Clinical Nutrition and Dietetics, College of Health Sciences, Research Institute of Medical and Health Sciences (RIMHS), University of Sharjah, Sharjah, United Arab Emirates
- Department of Nutrition and Food Sciences, Faculty of Agricultural and Food Sciences, American University of Beirut (AUB), Beirut, Lebanon
| | - Mandy Taktouk
- Department of Nutrition and Food Sciences, Faculty of Agricultural and Food Sciences, American University of Beirut (AUB), Beirut, Lebanon
| | - Dana Matbouli
- Department of Nutrition and Food Sciences, Faculty of Agricultural and Food Sciences, American University of Beirut (AUB), Beirut, Lebanon
| | - Sharfa Khaleel
- Department of Clinical Nutrition and Dietetics, College of Health Sciences, Research Institute of Medical and Health Sciences (RIMHS), University of Sharjah, Sharjah, United Arab Emirates
| | - Ayah Maher
- Department of Clinical Nutrition and Dietetics, College of Health Sciences, Research Institute of Medical and Health Sciences (RIMHS), University of Sharjah, Sharjah, United Arab Emirates
| | - Berna Uzun
- Department of Mathematics, Near East University, Nicosia, Turkey
| | | | - Lara Nasreddine
- Department of Nutrition and Food Sciences, Faculty of Agricultural and Food Sciences, American University of Beirut (AUB), Beirut, Lebanon.
| |
Collapse
|
22
|
Cherif H, Moussa C, Missaoui AM, Salouage I, Mokaddem S, Dhahri B. Appraisal of ChatGPT's Aptitude for Medical Education: Comparative Analysis With Third-Year Medical Students in a Pulmonology Examination. JMIR MEDICAL EDUCATION 2024; 10:e52818. [PMID: 39042876 PMCID: PMC11303904 DOI: 10.2196/52818] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/29/2023] [Revised: 02/05/2024] [Accepted: 02/26/2024] [Indexed: 07/25/2024]
Abstract
BACKGROUND The rapid evolution of ChatGPT has generated substantial interest and led to extensive discussions in both public and academic domains, particularly in the context of medical education. OBJECTIVE This study aimed to evaluate ChatGPT's performance in a pulmonology examination through a comparative analysis with that of third-year medical students. METHODS In this cross-sectional study, we conducted a comparative analysis with 2 distinct groups. The first group comprised 244 third-year medical students who had previously taken our institution's 2020 pulmonology examination, which was conducted in French. The second group involved ChatGPT-3.5 in 2 separate sets of conversations: without contextualization (V1) and with contextualization (V2). In both V1 and V2, ChatGPT received the same set of questions administered to the students. RESULTS V1 demonstrated exceptional proficiency in radiology, microbiology, and thoracic surgery, surpassing the majority of medical students in these domains. However, it faced challenges in pathology, pharmacology, and clinical pneumology. In contrast, V2 consistently delivered more accurate responses across various question categories, regardless of the specialization. ChatGPT exhibited suboptimal performance in multiple choice questions compared to medical students. V2 excelled in responding to structured open-ended questions. Both ChatGPT conversations, particularly V2, outperformed students in addressing questions of low and intermediate difficulty. Interestingly, students showcased enhanced proficiency when confronted with highly challenging questions. V1 fell short of passing the examination. Conversely, V2 successfully achieved examination success, outperforming 139 (62.1%) medical students. CONCLUSIONS While ChatGPT has access to a comprehensive web-based data set, its performance closely mirrors that of an average medical student. Outcomes are influenced by question format, item complexity, and contextual nuances. The model faces challenges in medical contexts requiring information synthesis, advanced analytical aptitude, and clinical judgment, as well as in non-English language assessments and when confronted with data outside mainstream internet sources.
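The V1/V2 contrast above hinges on whether the model receives framing context before each question. The snippet below is a minimal illustration of that setup, assuming the OpenAI Python SDK (version 1.0 or later) and a placeholder model name; the prompt text is paraphrased in English and is not the study's actual French prompt.

```python
from openai import OpenAI

client = OpenAI()   # reads OPENAI_API_KEY from the environment
question = "..."    # one pulmonology examination question, as given to students

# V1: the question alone, without contextualization.
v1 = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": question}],
)

# V2: the same question preceded by a system message that sets the exam context.
context = ("You are a third-year medical student sitting a pulmonology "
           "examination. Answer the following question.")
v2 = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "system", "content": context},
              {"role": "user", "content": question}],
)
print(v1.choices[0].message.content)
print(v2.choices[0].message.content)
```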
Collapse
Affiliation(s)
- Hela Cherif
- Faculté de Médecine de Tunis, Université de Tunis El Manar, Tunis, Tunisia
| | - Chirine Moussa
- Faculté de Médecine de Tunis, Université de Tunis El Manar, Tunis, Tunisia
| | | | - Issam Salouage
- Faculté de Médecine de Tunis, Université de Tunis El Manar, Tunis, Tunisia
| | - Salma Mokaddem
- Faculté de Médecine de Tunis, Université de Tunis El Manar, Tunis, Tunisia
| | - Besma Dhahri
- Faculté de Médecine de Tunis, Université de Tunis El Manar, Tunis, Tunisia
| |
Collapse
|
23
|
Wu Q, Wu Q, Li H, Wang Y, Bai Y, Wu Y, Yu X, Li X, Dong P, Xue J, Shen D, Wang M. Evaluating Large Language Models for Automated Reporting and Data Systems Categorization: Cross-Sectional Study. JMIR Med Inform 2024; 12:e55799. [PMID: 39018102 PMCID: PMC11292156 DOI: 10.2196/55799] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/25/2023] [Revised: 02/02/2024] [Accepted: 05/25/2024] [Indexed: 07/18/2024] Open
Abstract
BACKGROUND Large language models show promise for improving radiology workflows, but their performance on structured radiological tasks such as Reporting and Data Systems (RADS) categorization remains unexplored. OBJECTIVE This study aims to evaluate 3 large language model chatbots-Claude-2, GPT-3.5, and GPT-4-on assigning RADS categories to radiology reports and assess the impact of different prompting strategies. METHODS This cross-sectional study compared 3 chatbots using 30 radiology reports (10 per RADS criteria), using a 3-level prompting strategy: zero-shot, few-shot, and guideline PDF-informed prompts. The cases were grounded in Liver Imaging Reporting & Data System (LI-RADS) version 2018, Lung CT (computed tomography) Screening Reporting & Data System (Lung-RADS) version 2022, and Ovarian-Adnexal Reporting & Data System (O-RADS) magnetic resonance imaging, meticulously prepared by board-certified radiologists. Each report underwent 6 assessments. Two blinded reviewers assessed the chatbots' responses for patient-level RADS categorization and overall ratings. The agreement across repetitions was assessed using Fleiss κ. RESULTS Claude-2 achieved the highest accuracy in overall ratings with few-shot prompts and guideline PDFs (prompt-2), attaining 57% (17/30) average accuracy over 6 runs and 50% (15/30) accuracy with k-pass voting. Without prompt engineering, all chatbots performed poorly. The introduction of a structured exemplar prompt (prompt-1) increased the accuracy of overall ratings for all chatbots. Providing prompt-2 further improved Claude-2's performance, an enhancement not replicated by GPT-4. The interrun agreement was substantial for Claude-2 (κ=0.66 for overall rating and κ=0.69 for RADS categorization), fair for GPT-4 (κ=0.39 for both), and fair for GPT-3.5 (κ=0.21 for overall rating and κ=0.39 for RADS categorization). All chatbots showed significantly higher accuracy with LI-RADS version 2018 than with Lung-RADS version 2022 and O-RADS (P<.05); with prompt-2, Claude-2 achieved the highest overall rating accuracy of 75% (45/60) in LI-RADS version 2018. CONCLUSIONS When equipped with structured prompts and guideline PDFs, Claude-2 demonstrated potential in assigning RADS categories to radiology cases according to established criteria such as LI-RADS version 2018. However, the current generation of chatbots lags in accurately categorizing cases based on more recent RADS criteria.
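Two mechanics in this abstract lend themselves to a short sketch: measuring inter-run agreement with Fleiss κ and deriving a k-pass (majority-vote) category from the 6 repeated runs per report. The code below illustrates both under stated assumptions; the 30 × 6 matrix of category codes is randomly generated and is not the study's data.

```python
import numpy as np
from collections import Counter
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(42)
runs = rng.integers(0, 5, size=(30, 6))   # 30 reports x 6 runs, coded RADS categories

counts, _ = aggregate_raters(runs)        # subjects x categories count table
print("inter-run Fleiss kappa:", round(fleiss_kappa(counts), 2))

# k-pass voting: the category assigned most often across the 6 runs of each report.
majority = [int(Counter(row).most_common(1)[0][0]) for row in runs]
print("majority-vote (k-pass) categories for the first 5 reports:", majority[:5])
```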
Collapse
Affiliation(s)
- Qingxia Wu
- Department of Medical Imaging, Henan Provincial People's Hospital & People's Hospital of Zhengzhou University, Zhengzhou, China
| | - Qingxia Wu
- Research Intelligence Department, Beijing United Imaging Research Institute of Intelligent Imaging, Beijing, China
- Research and Collaboration, United Imaging Intelligence (Beijing) Co, Ltd, Beijing, China
| | - Huali Li
- Department of Radiology, Luoyang Central Hospital, Luoyang, China
| | - Yan Wang
- Department of Medical Imaging, Henan Provincial People's Hospital & People's Hospital of Zhengzhou University, Zhengzhou, China
| | - Yan Bai
- Department of Medical Imaging, Henan Provincial People's Hospital & People's Hospital of Zhengzhou University, Zhengzhou, China
| | - Yaping Wu
- Department of Medical Imaging, Henan Provincial People's Hospital & People's Hospital of Zhengzhou University, Zhengzhou, China
| | - Xuan Yu
- Department of Medical Imaging, Henan Provincial People's Hospital & People's Hospital of Zhengzhou University, Zhengzhou, China
| | - Xiaodong Li
- Department of Medical Imaging, Henan Provincial People's Hospital & People's Hospital of Zhengzhou University, Zhengzhou, China
| | - Pei Dong
- Research Intelligence Department, Beijing United Imaging Research Institute of Intelligent Imaging, Beijing, China
- Research and Collaboration, United Imaging Intelligence (Beijing) Co, Ltd, Beijing, China
| | - Jon Xue
- Research and Collaboration, Shanghai United Imaging Intelligence Co, Ltd, Shanghai, China
| | - Dinggang Shen
- Research and Collaboration, Shanghai United Imaging Intelligence Co, Ltd, Shanghai, China
- School of Biomedical Engineering, Shanghai Tech University, Shanghai, China
| | - Meiyun Wang
- Department of Medical Imaging, Henan Provincial People's Hospital & People's Hospital of Zhengzhou University, Zhengzhou, China
- Biomedical Research Institute, Henan Academy of Sciences, Zhengzhou, China
| |
Collapse
|
24
|
Wada A, Akashi T, Shih G, Hagiwara A, Nishizawa M, Hayakawa Y, Kikuta J, Shimoji K, Sano K, Kamagata K, Nakanishi A, Aoki S. Optimizing GPT-4 Turbo Diagnostic Accuracy in Neuroradiology through Prompt Engineering and Confidence Thresholds. Diagnostics (Basel) 2024; 14:1541. [PMID: 39061677 PMCID: PMC11276551 DOI: 10.3390/diagnostics14141541] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/29/2024] [Revised: 07/02/2024] [Accepted: 07/10/2024] [Indexed: 07/28/2024] Open
Abstract
BACKGROUND AND OBJECTIVES Integrating large language models (LLMs) such as GPT-4 Turbo into diagnostic imaging faces a significant challenge, with current misdiagnosis rates ranging from 30% to 50%. This study evaluates how prompt engineering and confidence thresholds can improve diagnostic accuracy in neuroradiology. METHODS We analyzed 751 neuroradiology cases from the American Journal of Neuroradiology using GPT-4 Turbo with customized prompts to improve diagnostic precision. RESULTS Initially, GPT-4 Turbo achieved a baseline diagnostic accuracy of 55.1%. By reformatting responses to list five diagnostic candidates and applying a 90% confidence threshold, the precision of the top diagnosis increased to 72.9%, with the candidate list containing the correct diagnosis in 85.9% of cases, reducing the misdiagnosis rate to 14.1%. However, this threshold reduced the number of cases for which the model provided a response. CONCLUSIONS Strategic prompt engineering and high confidence thresholds significantly reduce misdiagnoses and improve the precision of LLM diagnoses in neuroradiology. More research is needed to optimize these approaches for broader clinical implementation, balancing accuracy and utility.
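The confidence-threshold mechanism described above trades coverage for precision: the top of five ranked candidates is accepted only if its self-reported confidence reaches 90%, otherwise the case goes unanswered. The sketch below is illustrative only; the function name, data structure, and example values are assumptions, not the paper's implementation.

```python
def apply_confidence_threshold(candidates, threshold=0.90):
    """candidates: list of (diagnosis, confidence) pairs, ranked highest first."""
    top_diagnosis, top_confidence = candidates[0]
    if top_confidence >= threshold:
        return top_diagnosis   # counted toward precision
    return None                # case withheld: no response given

# Hypothetical model output for a single case.
example = [("glioblastoma", 0.93), ("CNS lymphoma", 0.04), ("metastasis", 0.02),
           ("abscess", 0.007), ("tumefactive demyelination", 0.003)]
print(apply_confidence_threshold(example))   # -> "glioblastoma"
```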
Collapse
Affiliation(s)
- Akihiko Wada
- Department of Radiology, Juntendo University Graduate School of Medicine, Tokyo 113-8421, Japan
| | - Toshiaki Akashi
- Department of Radiology, Juntendo University Graduate School of Medicine, Tokyo 113-8421, Japan
| | - George Shih
- Clinical Radiology, Weill Cornell Medical College, New York, NY 10065, USA
| | - Akifumi Hagiwara
- Department of Radiology, Juntendo University Graduate School of Medicine, Tokyo 113-8421, Japan
| | - Mitsuo Nishizawa
- Department of Radiology, Juntendo University Graduate School of Medicine, Tokyo 113-8421, Japan
| | - Yayoi Hayakawa
- Department of Radiology, Juntendo University Graduate School of Medicine, Tokyo 113-8421, Japan
| | - Junko Kikuta
- Department of Radiology, Juntendo University Graduate School of Medicine, Tokyo 113-8421, Japan
| | - Keigo Shimoji
- Department of Radiology, Juntendo University Graduate School of Medicine, Tokyo 113-8421, Japan
| | - Katsuhiro Sano
- Department of Radiology, Juntendo University Graduate School of Medicine, Tokyo 113-8421, Japan
| | - Koji Kamagata
- Department of Radiology, Juntendo University Graduate School of Medicine, Tokyo 113-8421, Japan
| | - Atsushi Nakanishi
- Department of Radiology, Juntendo University Graduate School of Medicine, Tokyo 113-8421, Japan
| | - Shigeki Aoki
- Department of Radiology, Juntendo University Graduate School of Medicine, Tokyo 113-8421, Japan
| |
Collapse
|
25
|
Builoff V, Shanbhag A, Miller RJ, Dey D, Liang JX, Flood K, Bourque JM, Chareonthaitawee P, Phillips LM, Slomka PJ. Evaluating AI Proficiency in Nuclear Cardiology: Large Language Models take on the Board Preparation Exam. MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2024:2024.07.16.24310297. [PMID: 39072028 PMCID: PMC11275690 DOI: 10.1101/2024.07.16.24310297] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 07/30/2024]
Abstract
Background Previous studies evaluated the ability of large language models (LLMs) in medical disciplines; however, few have focused on image analysis, and none specifically on cardiovascular imaging or nuclear cardiology. Objectives This study assesses four LLMs - GPT-4, GPT-4 Turbo, GPT-4 omni (GPT-4o) (OpenAI), and Gemini (Google Inc.) - in responding to questions from the 2023 American Society of Nuclear Cardiology Board Preparation Exam, reflecting the scope of the Certification Board of Nuclear Cardiology (CBNC) examination. Methods We used 168 questions: 141 text-only and 27 image-based, categorized into four sections mirroring the CBNC exam. Each LLM was presented with the same standardized prompt and applied to each section 30 times to account for stochasticity. Performance over six weeks was assessed for all models except GPT-4o. McNemar's test compared correct response proportions. Results GPT-4, Gemini, GPT-4 Turbo, and GPT-4o correctly answered a median of 56.8% (95% confidence interval 55.4%-58.0%), 40.5% (39.9%-42.9%), 60.7% (59.9%-61.3%), and 63.1% (62.5%-64.3%) of questions, respectively. GPT-4o significantly outperformed the other models (p=0.007 vs. GPT-4 Turbo; p<0.001 vs. GPT-4 and Gemini). GPT-4o excelled on text-only questions compared to GPT-4, Gemini, and GPT-4 Turbo (p<0.001, p<0.001, and p=0.001), while Gemini performed worse on image-based questions (p<0.001 for all). Conclusion GPT-4o demonstrated superior performance among the four LLMs, achieving scores likely within or just outside the range required to pass a test akin to the CBNC examination. Although improvements in medical image interpretation are needed, GPT-4o shows potential to support physicians in answering text-based clinical questions.
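The design above repeats each question set 30 times to absorb the stochasticity of LLM outputs and then reports median performance with an interval. The sketch below shows one way such repeated-run results could be summarized; it is not the authors' code, and the correct-answer counts are simulated, with a percentile interval standing in for whatever interval method the study actually used.

```python
import numpy as np

def summarize_runs(per_run_correct, n_questions):
    """Median accuracy and a 95% percentile interval across repeated runs."""
    accuracy = np.asarray(per_run_correct) / n_questions * 100
    low, median, high = np.percentile(accuracy, [2.5, 50, 97.5])
    return median, (low, high)

rng = np.random.default_rng(7)
correct_counts = rng.binomial(n=168, p=0.60, size=30)  # 30 simulated runs of 168 questions
median_acc, interval = summarize_runs(correct_counts, 168)
print(f"median accuracy {median_acc:.1f}% (interval {interval[0]:.1f}-{interval[1]:.1f}%)")
```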
Collapse
|
26
|
Horiuchi D, Tatekawa H, Oura T, Shimono T, Walston SL, Takita H, Matsushita S, Mitsuyama Y, Miki Y, Ueda D. ChatGPT's diagnostic performance based on textual vs. visual information compared to radiologists' diagnostic performance in musculoskeletal radiology. Eur Radiol 2024:10.1007/s00330-024-10902-5. [PMID: 38995378 DOI: 10.1007/s00330-024-10902-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/23/2024] [Revised: 05/02/2024] [Accepted: 06/24/2024] [Indexed: 07/13/2024]
Abstract
OBJECTIVES To compare the diagnostic accuracy of Generative Pre-trained Transformer (GPT)-4-based ChatGPT, GPT-4 with vision (GPT-4V) based ChatGPT, and radiologists in musculoskeletal radiology. MATERIALS AND METHODS We included 106 "Test Yourself" cases from Skeletal Radiology between January 2014 and September 2023. We input the medical history and imaging findings into GPT-4-based ChatGPT and the medical history and images into GPT-4V-based ChatGPT, then both generated a diagnosis for each case. Two radiologists (a radiology resident and a board-certified radiologist) independently provided diagnoses for all cases. The diagnostic accuracy rates were determined based on the published ground truth. Chi-square tests were performed to compare the diagnostic accuracy of GPT-4-based ChatGPT, GPT-4V-based ChatGPT, and radiologists. RESULTS GPT-4-based ChatGPT significantly outperformed GPT-4V-based ChatGPT (p < 0.001) with accuracy rates of 43% (46/106) and 8% (9/106), respectively. The radiology resident and the board-certified radiologist achieved accuracy rates of 41% (43/106) and 53% (56/106). The diagnostic accuracy of GPT-4-based ChatGPT was comparable to that of the radiology resident, but was lower than that of the board-certified radiologist although the differences were not significant (p = 0.78 and 0.22, respectively). The diagnostic accuracy of GPT-4V-based ChatGPT was significantly lower than those of both radiologists (p < 0.001 and < 0.001, respectively). CONCLUSION GPT-4-based ChatGPT demonstrated significantly higher diagnostic accuracy than GPT-4V-based ChatGPT. While GPT-4-based ChatGPT's diagnostic performance was comparable to radiology residents, it did not reach the performance level of board-certified radiologists in musculoskeletal radiology. CLINICAL RELEVANCE STATEMENT GPT-4-based ChatGPT outperformed GPT-4V-based ChatGPT and was comparable to radiology residents, but it did not reach the level of board-certified radiologists in musculoskeletal radiology. Radiologists should comprehend ChatGPT's current performance as a diagnostic tool for optimal utilization. KEY POINTS This study compared the diagnostic performance of GPT-4-based ChatGPT, GPT-4V-based ChatGPT, and radiologists in musculoskeletal radiology. GPT-4-based ChatGPT was comparable to radiology residents, but did not reach the level of board-certified radiologists. When utilizing ChatGPT, it is crucial to input appropriate descriptions of imaging findings rather than the images.
Collapse
Affiliation(s)
- Daisuke Horiuchi
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
| | - Hiroyuki Tatekawa
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
| | - Tatsushi Oura
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
| | - Taro Shimono
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
| | - Shannon L Walston
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
| | - Hirotaka Takita
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
| | - Shu Matsushita
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
| | - Yasuhito Mitsuyama
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
| | - Yukio Miki
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
| | - Daiju Ueda
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan.
- Department of Artificial Intelligence, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan.
| |
Collapse
|
27
|
Haider SA, Pressman SM, Borna S, Gomez-Cabello CA, Sehgal A, Leibovich BC, Forte AJ. Evaluating Large Language Model (LLM) Performance on Established Breast Classification Systems. Diagnostics (Basel) 2024; 14:1491. [PMID: 39061628 PMCID: PMC11275570 DOI: 10.3390/diagnostics14141491] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/12/2024] [Revised: 06/25/2024] [Accepted: 07/09/2024] [Indexed: 07/28/2024] Open
Abstract
Medical researchers are increasingly utilizing advanced LLMs like ChatGPT-4 and Gemini to enhance diagnostic processes in the medical field. This research focuses on their ability to comprehend and apply complex medical classification systems for breast conditions, which can significantly aid plastic surgeons in making informed decisions for diagnosis and treatment, ultimately leading to improved patient outcomes. Fifty clinical scenarios were created to evaluate the classification accuracy of each LLM across five established breast-related classification systems. Scores from 0 to 2 were assigned to LLM responses to denote incorrect, partially correct, or completely correct classifications. Descriptive statistics were employed to compare the performances of ChatGPT-4 and Gemini. Gemini exhibited superior overall performance, achieving 98% accuracy compared to ChatGPT-4's 71%. While both models performed well in the Baker classification for capsular contracture and UTSW classification for gynecomastia, Gemini consistently outperformed ChatGPT-4 in other systems, such as the Fischer Grade Classification for gender-affirming mastectomy, Kajava Classification for ectopic breast tissue, and Regnault Classification for breast ptosis. With further development, integrating LLMs into plastic surgery practice will likely enhance diagnostic support and decision making.
Collapse
Affiliation(s)
- Syed Ali Haider
- Division of Plastic Surgery, Mayo Clinic, Jacksonville, FL 32224, USA
| | | | - Sahar Borna
- Division of Plastic Surgery, Mayo Clinic, Jacksonville, FL 32224, USA
| | | | - Ajai Sehgal
- Center for Digital Health, Mayo Clinic, Rochester, MN 55905, USA
| | - Bradley C. Leibovich
- Center for Digital Health, Mayo Clinic, Rochester, MN 55905, USA
- Department of Urology, Mayo Clinic, Rochester, MN 55905, USA
| | - Antonio Jorge Forte
- Division of Plastic Surgery, Mayo Clinic, Jacksonville, FL 32224, USA
- Center for Digital Health, Mayo Clinic, Rochester, MN 55905, USA
| |
Collapse
|
28
|
Sacoransky E, Kwan BYM, Soboleski D. ChatGPT and assistive AI in structured radiology reporting: A systematic review. Curr Probl Diagn Radiol 2024:S0363-0188(24)00113-0. [PMID: 39004580 DOI: 10.1067/j.cpradiol.2024.07.007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/18/2024] [Revised: 06/08/2024] [Accepted: 07/08/2024] [Indexed: 07/16/2024]
Abstract
INTRODUCTION The rise of transformer-based large language models (LLMs), such as ChatGPT, has captured global attention with recent advancements in artificial intelligence (AI). ChatGPT demonstrates growing potential in structured radiology reporting, a field where AI has traditionally focused on image analysis. METHODS A comprehensive search of MEDLINE and Embase was conducted from inception through May 2024, and primary studies discussing ChatGPT's role in structured radiology reporting were selected based on their content. RESULTS Of the 268 articles screened, eight were ultimately included in this review. These articles explored various applications of ChatGPT, such as generating structured reports from unstructured reports, extracting data from free text, generating impressions from radiology findings and creating structured reports from imaging data. All studies demonstrated optimism regarding ChatGPT's potential to aid radiologists, though common critiques included data privacy concerns, reliability, medical errors, and lack of medical-specific training. CONCLUSION ChatGPT and assistive AI have significant potential to transform radiology reporting, enhancing accuracy and standardization while optimizing healthcare resources. Future developments may involve integrating dynamic few-shot prompting, ChatGPT, and Retrieval Augmented Generation (RAG) into diagnostic workflows. Continued research, development, and ethical oversight are crucial to fully realize AI's potential in radiology.
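The conclusion mentions dynamic few-shot prompting and retrieval-augmented generation as future directions. The toy example below is entirely hypothetical and only sketches the general pattern: retrieve the most similar prior report with TF-IDF and prepend it to the prompt as a dynamic few-shot example before asking an LLM to produce a structured report. The library choices and report text are assumptions, not drawn from the reviewed studies.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

prior_reports = [
    "CT chest: 6 mm solid nodule, right upper lobe ...",
    "MRI brain: no acute infarct; chronic small-vessel changes ...",
]
new_findings = "CT chest: 8 mm part-solid nodule, left lower lobe ..."

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(prior_reports + [new_findings])
similarities = cosine_similarity(matrix)[-1, :-1]   # new findings vs each prior report
example = prior_reports[similarities.argmax()]

prompt = ("Convert the findings into a structured radiology report.\n"
          f"Example report:\n{example}\n\n"
          f"Findings:\n{new_findings}")
print(prompt)   # this prompt would then be sent to the chosen LLM
```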
Collapse
Affiliation(s)
- Ethan Sacoransky
- Queen's University School of Medicine, 15 Arch St, Kingston, ON K7L 3L4, Canada.
| | - Benjamin Y M Kwan
- Queen's University School of Medicine, 15 Arch St, Kingston, ON K7L 3L4, Canada; Department of Diagnostic Radiology, Kingston Health Sciences Centre, Kingston, ON, Canada
| | - Donald Soboleski
- Queen's University School of Medicine, 15 Arch St, Kingston, ON K7L 3L4, Canada; Department of Diagnostic Radiology, Kingston Health Sciences Centre, Kingston, ON, Canada
| |
Collapse
|
29
|
Keshavarz P, Bagherieh S, Nabipoorashrafi SA, Chalian H, Rahsepar AA, Kim GHJ, Hassani C, Raman SS, Bedayat A. ChatGPT in radiology: A systematic review of performance, pitfalls, and future perspectives. Diagn Interv Imaging 2024; 105:251-265. [PMID: 38679540 DOI: 10.1016/j.diii.2024.04.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/31/2024] [Revised: 03/11/2024] [Accepted: 04/16/2024] [Indexed: 05/01/2024]
Abstract
PURPOSE The purpose of this study was to systematically review the reported performances of ChatGPT, identify potential limitations, and explore future directions for its integration, optimization, and ethical considerations in radiology applications. MATERIALS AND METHODS After a comprehensive review of PubMed, Web of Science, Embase, and Google Scholar databases, a cohort of published studies was identified up to January 1, 2024, utilizing ChatGPT for clinical radiology applications. RESULTS Out of 861 studies derived, 44 studies evaluated the performance of ChatGPT; among these, 37 (37/44; 84.1%) demonstrated high performance, and seven (7/44; 15.9%) indicated it had a lower performance in providing information on diagnosis and clinical decision support (6/44; 13.6%) and patient communication and educational content (1/44; 2.3%). Twenty-four (24/44; 54.5%) studies reported the proportion of ChatGPT's performance. Among these, 19 (19/24; 79.2%) studies recorded a median accuracy of 70.5%, and in five (5/24; 20.8%) studies, there was a median agreement of 83.6% between ChatGPT outcomes and reference standards [radiologists' decision or guidelines], generally confirming ChatGPT's high accuracy in these studies. Eleven studies compared two recent ChatGPT versions, and in ten (10/11; 90.9%), ChatGPTv4 outperformed v3.5, showing notable enhancements in addressing higher-order thinking questions, better comprehension of radiology terms, and improved accuracy in describing images. Risks and concerns about using ChatGPT included biased responses, limited originality, and the potential for inaccurate information leading to misinformation, hallucinations, improper citations and fake references, cybersecurity vulnerabilities, and patient privacy risks. CONCLUSION Although ChatGPT's effectiveness has been shown in 84.1% of radiology studies, there are still multiple pitfalls and limitations to address. It is too soon to confirm its complete proficiency and accuracy, and more extensive multicenter studies utilizing diverse datasets and pre-training techniques are required to verify ChatGPT's role in radiology.
Collapse
Affiliation(s)
- Pedram Keshavarz
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA; School of Science and Technology, The University of Georgia, Tbilisi 0171, Georgia
| | - Sara Bagherieh
- Independent Clinical Radiology Researcher, Los Angeles, CA 90024, USA
| | | | - Hamid Chalian
- Department of Radiology, Cardiothoracic Imaging, University of Washington, Seattle, WA 98195, USA
| | - Amir Ali Rahsepar
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA
| | - Grace Hyun J Kim
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA; Department of Radiological Sciences, Center for Computer Vision and Imaging Biomarkers, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA
| | - Cameron Hassani
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA
| | - Steven S Raman
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA
| | - Arash Bedayat
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA.
| |
Collapse
|
30
|
Nishino M, Ballard DH. Multimodal Large Language Models to Solve Image-based Diagnostic Challenges: The Next Big Wave is Already Here. Radiology 2024; 312:e241379. [PMID: 38980181 DOI: 10.1148/radiol.241379] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/10/2024]
Affiliation(s)
- Mizuki Nishino
- From the Department of Radiology, Brigham and Women's Hospital and Dana-Farber Cancer Institute, 450 Brookline Ave, Boston MA 02215 (M.N.); and Mallinckrodt Institute of Radiology, Washington University School of Medicine, St Louis, Mo (D.H.B.)
| | - David H Ballard
- From the Department of Radiology, Brigham and Women's Hospital and Dana-Farber Cancer Institute, 450 Brookline Ave, Boston MA 02215 (M.N.); and Mallinckrodt Institute of Radiology, Washington University School of Medicine, St Louis, Mo (D.H.B.)
| |
Collapse
|
31
|
Payne DL, Purohit K, Borrero WM, Chung K, Hao M, Mpoy M, Jin M, Prasanna P, Hill V. Performance of GPT-4 on the American College of Radiology In-training Examination: Evaluating Accuracy, Model Drift, and Fine-tuning. Acad Radiol 2024; 31:3046-3054. [PMID: 38653599 DOI: 10.1016/j.acra.2024.04.006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/28/2024] [Revised: 04/01/2024] [Accepted: 04/06/2024] [Indexed: 04/25/2024]
Abstract
RATIONALE AND OBJECTIVES In our study, we evaluate GPT-4's performance on the American College of Radiology (ACR) 2022 Diagnostic Radiology In-Training Examination (DXIT). We perform multiple experiments across time points to assess for model drift, as well as after fine-tuning to assess for differences in accuracy. MATERIALS AND METHODS Questions were sequentially input into GPT-4 with a standardized prompt. Each answer was recorded and overall accuracy was calculated, as were logic-adjusted accuracy and accuracy on image-based questions. This experiment was repeated several months later to assess for model drift, then again after fine-tuning to assess for changes in performance. RESULTS GPT-4 achieved 58.5% overall accuracy, lower than the PGY-3 average (61.9%) but higher than the PGY-2 average (52.8%). Adjusted accuracy was 52.8%. GPT-4 showed significantly higher (p = 0.012) confidence for correct answers (87.1%) compared to incorrect answers (84.0%). Performance on image-based questions was significantly poorer (p < 0.001) at 45.4% compared to text-only questions (80.0%), with adjusted accuracy for image-based questions of 36.4%. When the questions were repeated, GPT-4 chose a different answer 25.5% of the time, with no change in accuracy. Fine-tuning did not improve accuracy. CONCLUSION GPT-4 performed between PGY-2 and PGY-3 levels on the 2022 DXIT, significantly poorer on image-based questions, and with large variability in answer choices across time points. Exploratory experiments in fine-tuning did not improve performance. This study underscores the potential and risks of using minimally-prompted general AI models in interpreting radiologic images as a diagnostic tool. Implementers of general AI radiology systems should exercise caution given the possibility of spurious yet confident responses.
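One analysis above compares the model's self-reported confidence on correctly versus incorrectly answered questions. The sketch below illustrates that comparison under assumed data structures with simulated values; it is not the study's pipeline, and a nonparametric test is used purely for illustration since the abstract does not name the test behind p = 0.012.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)
confidence = rng.uniform(70, 100, size=100)           # hypothetical self-rated confidence (%)
correct = rng.integers(0, 2, size=100).astype(bool)   # hypothetical per-question correctness

conf_correct, conf_incorrect = confidence[correct], confidence[~correct]
stat, p = mannwhitneyu(conf_correct, conf_incorrect, alternative="two-sided")
print(f"mean confidence: correct {conf_correct.mean():.1f}%, "
      f"incorrect {conf_incorrect.mean():.1f}%, p = {p:.3f}")
```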
Collapse
Affiliation(s)
- David L Payne
- Stony Brook University Hospital Department of Radiology, 101 Nicolls Road, Stony Brook, New York 11794, USA (D.L.P., K.P., W.M.B., K.C., M.H., M.M., M.J.); Stony Brook University Department of Biomedical Informatics, 1 Lauterbur Drive, Stony Brook, New York 11794, USA (D.L.P., P.P.).
| | - Kush Purohit
- Stony Brook University Hospital Department of Radiology, 101 Nicolls Road, Stony Brook, New York 11794, USA (D.L.P., K.P., W.M.B., K.C., M.H., M.M., M.J.)
| | - Walter Morales Borrero
- Stony Brook University Hospital Department of Radiology, 101 Nicolls Road, Stony Brook, New York 11794, USA (D.L.P., K.P., W.M.B., K.C., M.H., M.M., M.J.)
| | - Katherine Chung
- Stony Brook University Hospital Department of Radiology, 101 Nicolls Road, Stony Brook, New York 11794, USA (D.L.P., K.P., W.M.B., K.C., M.H., M.M., M.J.)
| | - Max Hao
- Stony Brook University Hospital Department of Radiology, 101 Nicolls Road, Stony Brook, New York 11794, USA (D.L.P., K.P., W.M.B., K.C., M.H., M.M., M.J.)
| | - Mutshipay Mpoy
- Stony Brook University Hospital Department of Radiology, 101 Nicolls Road, Stony Brook, New York 11794, USA (D.L.P., K.P., W.M.B., K.C., M.H., M.M., M.J.)
| | - Michael Jin
- Stony Brook University Hospital Department of Radiology, 101 Nicolls Road, Stony Brook, New York 11794, USA (D.L.P., K.P., W.M.B., K.C., M.H., M.M., M.J.)
| | - Prateek Prasanna
- Stony Brook University Department of Biomedical Informatics, 1 Lauterbur Drive, Stony Brook, New York 11794, USA (D.L.P., P.P.)
| | - Virginia Hill
- Northwestern University Feinberg School of Medicine Department of Radiology, 676 North Clair Street, Chicago, Illinois 60611, USA (V.H.)
| |
Collapse
|
32
|
McIlvain G, Oechtering TH, Shammi UA, Bhayana R, Hutter J, Moy L, Schweitzer M. Chatbots for Literature Review and Research-Insights from a Panel Discussion at the Annual Meeting of the International Society of Magnetic Resonance in Medicine (ISMRM) 2023. J Magn Reson Imaging 2024; 60:390-392. [PMID: 37795851 DOI: 10.1002/jmri.29036] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/22/2023] [Revised: 09/07/2023] [Accepted: 09/19/2023] [Indexed: 10/06/2023] Open
Affiliation(s)
- Grace McIlvain
- Department of Biomedical Engineering, Columbia University, New York City, New York, USA
| | - Thekla H Oechtering
- Department of Radiology, University of Wisconsin School of Medicine and Public Health, Madison, Wisconsin, USA
- Department of Radiology and Nuclear Medicine, University of Luebeck, Lübeck, Germany
| | - Ummul Afia Shammi
- Chemical and Biomedical Engineering, University of Missouri, Columbia, Missouri, USA
| | - Rajesh Bhayana
- Department of Medical Imaging, University Health Network Mount Sinai Hospital and Women's College Hospital University of Toronto, Toronto, Ontario, Canada
| | - Jana Hutter
- Centre for the Developing Brain, King's College London, UK
| | - Linda Moy
- Department of Radiology, New York University School of Medicine, New York City, New York, USA
| | - Mark Schweitzer
- Wayne State University School of Medicine, Detroit, Michigan, USA
| |
Collapse
|
33
|
Le Guellec B, Lefèvre A, Geay C, Shorten L, Bruge C, Hacein-Bey L, Amouyel P, Pruvo JP, Kuchcinski G, Hamroun A. Performance of an Open-Source Large Language Model in Extracting Information from Free-Text Radiology Reports. Radiol Artif Intell 2024; 6:e230364. [PMID: 38717292 PMCID: PMC11294959 DOI: 10.1148/ryai.230364] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/31/2023] [Revised: 03/08/2024] [Accepted: 04/23/2024] [Indexed: 06/21/2024]
Abstract
Purpose To assess the performance of a local open-source large language model (LLM) in various information extraction tasks from real-life emergency brain MRI reports. Materials and Methods All consecutive emergency brain MRI reports written in 2022 from a French quaternary center were retrospectively reviewed. Two radiologists identified MRI scans that were performed in the emergency department for headaches. Four radiologists scored the reports' conclusions as either normal or abnormal. Abnormalities were labeled as either headache-causing or incidental. Vicuna (LMSYS Org), an open-source LLM, performed the same tasks. Vicuna's performance metrics were evaluated using the radiologists' consensus as the reference standard. Results Among the 2398 reports during the study period, radiologists identified 595 that included headaches in the indication (median age of patients, 35 years [IQR, 26-51 years]; 68% [403 of 595] women). A positive finding was reported in 227 of 595 (38%) cases, 136 of which could explain the headache. The LLM had a sensitivity of 98.0% (95% CI: 96.5, 99.0) and specificity of 99.3% (95% CI: 98.8, 99.7) for detecting the presence of headache in the clinical context, a sensitivity of 99.4% (95% CI: 98.3, 99.9) and specificity of 98.6% (95% CI: 92.2, 100.0) for the use of contrast medium injection, a sensitivity of 96.0% (95% CI: 92.5, 98.2) and specificity of 98.9% (95% CI: 97.2, 99.7) for study categorization as either normal or abnormal, and a sensitivity of 88.2% (95% CI: 81.6, 93.1) and specificity of 73% (95% CI: 62, 81) for causal inference between MRI findings and headache. Conclusion An open-source LLM was able to extract information from free-text radiology reports with excellent accuracy without requiring further training. Keywords: Large Language Model (LLM), Generative Pretrained Transformers (GPT), Open Source, Information Extraction, Report, Brain, MRI. Supplemental material is available for this article. Published under a CC BY 4.0 license. See also the commentary by Akinci D'Antonoli and Bluethgen in this issue.
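The metrics reported above are sensitivities and specificities with confidence intervals, computed against the radiologists' consensus. The sketch below shows one way such figures could be derived from a confusion matrix; it is not the authors' pipeline, and the counts are hypothetical stand-ins rather than the study's data.

```python
from scipy.stats import binomtest

def sens_spec(tp, fn, tn, fp):
    """Return exact binomial test results for sensitivity and specificity."""
    sensitivity = binomtest(tp, tp + fn)   # TP / (TP + FN)
    specificity = binomtest(tn, tn + fp)   # TN / (TN + FP)
    return sensitivity, specificity

# Hypothetical confusion-matrix counts for one extraction task.
sens, spec = sens_spec(tp=134, fn=2, tn=455, fp=4)
for name, res in [("sensitivity", sens), ("specificity", spec)]:
    ci = res.proportion_ci(confidence_level=0.95)
    print(f"{name}: {res.statistic:.3f} (95% CI {ci.low:.3f}-{ci.high:.3f})")
```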
Collapse
Affiliation(s)
- Bastien Le Guellec
- From the Department of Neuroradiology (B.L.G., A.L., C.B., J.P.P., G.K.), Department of Public Health (B.L.G., P.A., A.H.), and INclude Health Data Warehouse (C.G., L.S.), CHU Lille–Université Lille, Rue Emile Laine, 59000 Lille, France; Department of Radiology, UC Davis Health, Sacramento, Calif (L.H.B.); Université Lille, INSERM, CHU Lille, Institut Pasteur de Lille, U1167-RID-AGE - Facteurs de risque et déterminants moléculaires des maladies liées au vieillissement, Lille, France (P.A., A.H.); INSERM, U1172–LilNCog-Lille Neuroscience & Cognition, Université Lille, Lille, France (J.P.P., G.K.); and UAR 2014-US 41-PLBS–Plateformes Lilloises en Biologie & Santé, Université Lille, Lille, France (J.P.P., G.K.)
| | - Alexandre Lefèvre
- From the Department of Neuroradiology (B.L.G., A.L., C.B., J.P.P., G.K.), Department of Public Health (B.L.G., P.A., A.H.), and INclude Health Data Warehouse (C.G., L.S.), CHU Lille–Université Lille, Rue Emile Laine, 59000 Lille, France; Department of Radiology, UC Davis Health, Sacramento, Calif (L.H.B.); Université Lille, INSERM, CHU Lille, Institut Pasteur de Lille, U1167-RID-AGE - Facteurs de risque et déterminants moléculaires des maladies liées au vieillissement, Lille, France (P.A., A.H.); INSERM, U1172–LilNCog-Lille Neuroscience & Cognition, Université Lille, Lille, France (J.P.P., G.K.); and UAR 2014-US 41-PLBS–Plateformes Lilloises en Biologie & Santé, Université Lille, Lille, France (J.P.P., G.K.)
| | - Charlotte Geay
- From the Department of Neuroradiology (B.L.G., A.L., C.B., J.P.P., G.K.), Department of Public Health (B.L.G., P.A., A.H.), and INclude Health Data Warehouse (C.G., L.S.), CHU Lille–Université Lille, Rue Emile Laine, 59000 Lille, France; Department of Radiology, UC Davis Health, Sacramento, Calif (L.H.B.); Université Lille, INSERM, CHU Lille, Institut Pasteur de Lille, U1167-RID-AGE - Facteurs de risque et déterminants moléculaires des maladies liées au vieillissement, Lille, France (P.A., A.H.); INSERM, U1172–LilNCog-Lille Neuroscience & Cognition, Université Lille, Lille, France (J.P.P., G.K.); and UAR 2014-US 41-PLBS–Plateformes Lilloises en Biologie & Santé, Université Lille, Lille, France (J.P.P., G.K.)
| | - Lucas Shorten
- From the Department of Neuroradiology (B.L.G., A.L., C.B., J.P.P., G.K.), Department of Public Health (B.L.G., P.A., A.H.), and INclude Health Data Warehouse (C.G., L.S.), CHU Lille–Université Lille, Rue Emile Laine, 59000 Lille, France; Department of Radiology, UC Davis Health, Sacramento, Calif (L.H.B.); Université Lille, INSERM, CHU Lille, Institut Pasteur de Lille, U1167-RID-AGE - Facteurs de risque et déterminants moléculaires des maladies liées au vieillissement, Lille, France (P.A., A.H.); INSERM, U1172–LilNCog-Lille Neuroscience & Cognition, Université Lille, Lille, France (J.P.P., G.K.); and UAR 2014-US 41-PLBS–Plateformes Lilloises en Biologie & Santé, Université Lille, Lille, France (J.P.P., G.K.)
| | - Cyril Bruge
- From the Department of Neuroradiology (B.L.G., A.L., C.B., J.P.P., G.K.), Department of Public Health (B.L.G., P.A., A.H.), and INclude Health Data Warehouse (C.G., L.S.), CHU Lille–Université Lille, Rue Emile Laine, 59000 Lille, France; Department of Radiology, UC Davis Health, Sacramento, Calif (L.H.B.); Université Lille, INSERM, CHU Lille, Institut Pasteur de Lille, U1167-RID-AGE - Facteurs de risque et déterminants moléculaires des maladies liées au vieillissement, Lille, France (P.A., A.H.); INSERM, U1172–LilNCog-Lille Neuroscience & Cognition, Université Lille, Lille, France (J.P.P., G.K.); and UAR 2014-US 41-PLBS–Plateformes Lilloises en Biologie & Santé, Université Lille, Lille, France (J.P.P., G.K.)
| | - Lotfi Hacein-Bey
- From the Department of Neuroradiology (B.L.G., A.L., C.B., J.P.P., G.K.), Department of Public Health (B.L.G., P.A., A.H.), and INclude Health Data Warehouse (C.G., L.S.), CHU Lille–Université Lille, Rue Emile Laine, 59000 Lille, France; Department of Radiology, UC Davis Health, Sacramento, Calif (L.H.B.); Université Lille, INSERM, CHU Lille, Institut Pasteur de Lille, U1167-RID-AGE - Facteurs de risque et déterminants moléculaires des maladies liées au vieillissement, Lille, France (P.A., A.H.); INSERM, U1172–LilNCog-Lille Neuroscience & Cognition, Université Lille, Lille, France (J.P.P., G.K.); and UAR 2014-US 41-PLBS–Plateformes Lilloises en Biologie & Santé, Université Lille, Lille, France (J.P.P., G.K.)
| | - Philippe Amouyel
- From the Department of Neuroradiology (B.L.G., A.L., C.B., J.P.P., G.K.), Department of Public Health (B.L.G., P.A., A.H.), and INclude Health Data Warehouse (C.G., L.S.), CHU Lille–Université Lille, Rue Emile Laine, 59000 Lille, France; Department of Radiology, UC Davis Health, Sacramento, Calif (L.H.B.); Université Lille, INSERM, CHU Lille, Institut Pasteur de Lille, U1167-RID-AGE - Facteurs de risque et déterminants moléculaires des maladies liées au vieillissement, Lille, France (P.A., A.H.); INSERM, U1172–LilNCog-Lille Neuroscience & Cognition, Université Lille, Lille, France (J.P.P., G.K.); and UAR 2014-US 41-PLBS–Plateformes Lilloises en Biologie & Santé, Université Lille, Lille, France (J.P.P., G.K.)
| | - Jean-Pierre Pruvo
- From the Department of Neuroradiology (B.L.G., A.L., C.B., J.P.P., G.K.), Department of Public Health (B.L.G., P.A., A.H.), and INclude Health Data Warehouse (C.G., L.S.), CHU Lille–Université Lille, Rue Emile Laine, 59000 Lille, France; Department of Radiology, UC Davis Health, Sacramento, Calif (L.H.B.); Université Lille, INSERM, CHU Lille, Institut Pasteur de Lille, U1167-RID-AGE - Facteurs de risque et déterminants moléculaires des maladies liées au vieillissement, Lille, France (P.A., A.H.); INSERM, U1172–LilNCog-Lille Neuroscience & Cognition, Université Lille, Lille, France (J.P.P., G.K.); and UAR 2014-US 41-PLBS–Plateformes Lilloises en Biologie & Santé, Université Lille, Lille, France (J.P.P., G.K.)
| | - Gregory Kuchcinski
- From the Department of Neuroradiology (B.L.G., A.L., C.B., J.P.P., G.K.), Department of Public Health (B.L.G., P.A., A.H.), and INclude Health Data Warehouse (C.G., L.S.), CHU Lille–Université Lille, Rue Emile Laine, 59000 Lille, France; Department of Radiology, UC Davis Health, Sacramento, Calif (L.H.B.); Université Lille, INSERM, CHU Lille, Institut Pasteur de Lille, U1167-RID-AGE - Facteurs de risque et déterminants moléculaires des maladies liées au vieillissement, Lille, France (P.A., A.H.); INSERM, U1172–LilNCog-Lille Neuroscience & Cognition, Université Lille, Lille, France (J.P.P., G.K.); and UAR 2014-US 41-PLBS–Plateformes Lilloises en Biologie & Santé, Université Lille, Lille, France (J.P.P., G.K.)
| | - Aghiles Hamroun
- From the Department of Neuroradiology (B.L.G., A.L., C.B., J.P.P., G.K.), Department of Public Health (B.L.G., P.A., A.H.), and INclude Health Data Warehouse (C.G., L.S.), CHU Lille–Université Lille, Rue Emile Laine, 59000 Lille, France; Department of Radiology, UC Davis Health, Sacramento, Calif (L.H.B.); Université Lille, INSERM, CHU Lille, Institut Pasteur de Lille, U1167-RID-AGE - Facteurs de risque et déterminants moléculaires des maladies liées au vieillissement, Lille, France (P.A., A.H.); INSERM, U1172–LilNCog-Lille Neuroscience & Cognition, Université Lille, Lille, France (J.P.P., G.K.); and UAR 2014-US 41-PLBS–Plateformes Lilloises en Biologie & Santé, Université Lille, Lille, France (J.P.P., G.K.)
| |
Collapse
|
34
|
Suh PS, Shim WH, Suh CH, Heo H, Park CR, Eom HJ, Park KJ, Choe J, Kim PH, Park HJ, Ahn Y, Park HY, Choi Y, Woo CY, Park H. Comparing Diagnostic Accuracy of Radiologists versus GPT-4V and Gemini Pro Vision Using Image Inputs from Diagnosis Please Cases. Radiology 2024; 312:e240273. [PMID: 38980179 DOI: 10.1148/radiol.240273] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/10/2024]
Abstract
Background The diagnostic abilities of multimodal large language models (LLMs) using direct image inputs and the impact of the temperature parameter of LLMs remain unexplored. Purpose To investigate the ability of GPT-4V and Gemini Pro Vision in generating differential diagnoses at different temperatures compared with radiologists using Radiology Diagnosis Please cases. Materials and Methods This retrospective study included Diagnosis Please cases published from January 2008 to October 2023. Input images included original images and captures of the textual patient history and figure legends (without imaging findings) from PDF files of each case. The LLMs were tasked with providing three differential diagnoses, repeated five times at temperatures 0, 0.5, and 1. Eight subspecialty-trained radiologists solved cases. An experienced radiologist compared generated and final diagnoses, considering the result correct if the generated diagnoses included the final diagnosis after five repetitions. Accuracy was assessed across models, temperatures, and radiology subspecialties, with statistical significance set at P < .007 after Bonferroni correction for multiple comparisons across the LLMs at the three temperatures and with radiologists. Results A total of 190 cases were included in neuroradiology (n = 53), multisystem (n = 27), gastrointestinal (n = 25), genitourinary (n = 23), musculoskeletal (n = 17), chest (n = 16), cardiovascular (n = 12), pediatric (n = 12), and breast (n = 5) subspecialties. Overall accuracy improved with increasing temperature settings (0, 0.5, 1) for both GPT-4V (41% [78 of 190 cases], 45% [86 of 190 cases], 49% [93 of 190 cases], respectively) and Gemini Pro Vision (29% [55 of 190 cases], 36% [69 of 190 cases], 39% [74 of 190 cases], respectively), although there was no evidence of a statistically significant difference after Bonferroni adjustment (GPT-4V, P = .12; Gemini Pro Vision, P = .04). The overall accuracy of radiologists (61% [115 of 190 cases]) was higher than that of Gemini Pro Vision at temperature 1 (T1) (P < .001), while no statistically significant difference was observed between radiologists and GPT-4V at T1 after Bonferroni adjustment (P = .02). Radiologists (range, 45%-88%) outperformed the LLMs at T1 (range, 24%-75%) in most subspecialties. Conclusion Using direct radiologic image inputs, GPT-4V and Gemini Pro Vision showed improved diagnostic accuracy with increasing temperature settings. Although GPT-4V slightly underperformed compared with radiologists, it nonetheless demonstrated promising potential as a supportive tool in diagnostic decision-making. © RSNA, 2024 See also the editorial by Nishino and Ballard in this issue.
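The significance criterion above (P < .007 after Bonferroni correction) corresponds to dividing the usual 0.05 threshold by the number of comparisons, here inferred to be seven. The snippet below is a hedged sketch of that multiplicity handling, not the study code; the p-values are those quoted in the abstract, with 0.0005 standing in for the value reported only as P < .001.

```python
comparisons = {
    "GPT-4V: accuracy trend across temperatures": 0.12,
    "Gemini Pro Vision: accuracy trend across temperatures": 0.04,
    "radiologists vs GPT-4V at temperature 1": 0.02,
    "radiologists vs Gemini Pro Vision at temperature 1": 0.0005,  # reported as P < .001
}
alpha = 0.05
adjusted_alpha = alpha / 7   # inferred from the stated P < .007 criterion
for name, p in comparisons.items():
    verdict = "significant" if p < adjusted_alpha else "not significant"
    print(f"{name}: p = {p} -> {verdict} at the adjusted threshold {adjusted_alpha:.3f}")
```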
Collapse
Affiliation(s)
- Pae Sun Suh
- From the Department of Radiology and Research Institute of Radiology, University of Ulsan College of Medicine, Asan Medical Center, Olympic-ro 33, Seoul 05505, Republic of Korea (P.S.S., W.H.S., C.H.S., H.J.E., K.J.P., J.C., P.H.K., H.J.P., Y.A., H.Y.P.); Department of Radiology and Research Institute of Radiological Science and Center for Clinical Imaging Data Science, Yonsei University College of Medicine, Seoul, Republic of Korea (P.S.S.); Department of Medical Science, University of Ulsan College of Medicine, Asan Medical Institute of Convergence Science and Technology, Seoul, Republic of Korea (W.H.S., H.H., C.R.P.); Medical Research Institute, Ganneung Asan Hospital, University of Ulsan College of Medicine, Gangneung, Republic of Korea (Y.C.); Department of Internal Medicine, University of Ulsan College of Medicine, Asan Medical Center, Seoul, Republic of Korea (C.Y.W.); and Department of Pulmonary and Critical Care Medicine, Gumdan Top Hospital, Incheon, Republic of Korea (H.P.)
| | - Woo Hyun Shim
| | - Chong Hyun Suh
| | - Hwon Heo
| | - Chae Ri Park
| | - Hye Joung Eom
| | - Kye Jin Park
| | - Jooae Choe
| | - Pyeong Hwa Kim
| | - Hyo Jung Park
| | - Yura Ahn
| | - Ho Young Park
| | - Yoonseok Choi
| | - Chang-Yun Woo
| | - Hyungjun Park
| |
Collapse
|
35
|
Rossettini G, Rodeghiero L, Corradi F, Cook C, Pillastrini P, Turolla A, Castellini G, Chiappinotto S, Gianola S, Palese A. Comparative accuracy of ChatGPT-4, Microsoft Copilot and Google Gemini in the Italian entrance test for healthcare sciences degrees: a cross-sectional study. BMC MEDICAL EDUCATION 2024; 24:694. [PMID: 38926809 PMCID: PMC11210096 DOI: 10.1186/s12909-024-05630-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 01/24/2024] [Accepted: 06/04/2024] [Indexed: 06/28/2024]
Abstract
BACKGROUND Artificial intelligence (AI) chatbots are emerging educational tools for students in healthcare science. However, assessing their accuracy is essential prior to adoption in educational settings. This study aimed to assess the accuracy of three AI chatbots (ChatGPT-4, Microsoft Copilot and Google Gemini) in predicting the correct answers in the Italian entrance standardized examination test for healthcare science degrees (CINECA test). Secondarily, we assessed the narrative coherence of the AI chatbots' responses (i.e., text output) based on three qualitative metrics: the logical rationale behind the chosen answer, the presence of information internal to the question, and the presence of information external to the question. METHODS An observational cross-sectional study was performed in September 2023. Accuracy of the three chatbots was evaluated on the CINECA test, in which questions are formatted as multiple choice with a single best answer. The outcome was binary (correct or incorrect). A chi-squared test and a post hoc analysis with Bonferroni correction assessed differences in accuracy among the chatbots. A p-value of < 0.05 was considered statistically significant. A sensitivity analysis was performed, excluding answers that were not applicable (e.g., images). Narrative coherence was analyzed by absolute and relative frequencies of correct answers and errors. RESULTS Overall, of the 820 CINECA multiple-choice questions inputted into all chatbots, 20 questions were not imported into ChatGPT-4 (n = 808) and Google Gemini (n = 808) due to technical limitations. We found statistically significant differences in the ChatGPT-4 vs Google Gemini and Microsoft Copilot vs Google Gemini comparisons (p-value < 0.001). The narrative coherence of the AI chatbots revealed "Logical reasoning" as the prevalent category among correct answers (n = 622, 81.5%) and "Logical error" as the prevalent category among incorrect answers (n = 40, 88.9%). CONCLUSIONS Our main findings reveal that: (A) AI chatbots performed well; (B) ChatGPT-4 and Microsoft Copilot performed better than Google Gemini; and (C) their narrative coherence is primarily logical. Although the AI chatbots showed promising accuracy in predicting the correct answer in the Italian entrance university standardized examination test, we encourage candidates to incorporate this new technology cautiously, as a supplement to their learning rather than a primary resource. TRIAL REGISTRATION Not required.
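The omnibus-plus-post-hoc comparison described in this abstract can be illustrated with a short, self-contained sketch. The per-chatbot correct/incorrect counts below are hypothetical placeholders (the abstract reports only aggregate findings), and the snippet is an illustrative reconstruction, not the authors' analysis code.

```python
# Minimal sketch of a chi-squared comparison of chatbot accuracy with
# Bonferroni-corrected pairwise tests. Counts are hypothetical placeholders.
from itertools import combinations

import numpy as np
from scipy.stats import chi2_contingency

# rows: chatbots, columns: [correct, incorrect] -- hypothetical counts
counts = {
    "ChatGPT-4":         np.array([700, 108]),
    "Microsoft Copilot": np.array([690, 130]),
    "Google Gemini":     np.array([600, 208]),
}

# Omnibus test across the three chatbots
table = np.vstack(list(counts.values()))
chi2, p, dof, _ = chi2_contingency(table)
print(f"omnibus chi2={chi2:.2f}, dof={dof}, p={p:.4g}")

# Post hoc pairwise tests with Bonferroni-adjusted alpha
pairs = list(combinations(counts, 2))
alpha_adj = 0.05 / len(pairs)
for a, b in pairs:
    chi2_ab, p_ab, _, _ = chi2_contingency(np.vstack([counts[a], counts[b]]))
    flag = "significant" if p_ab < alpha_adj else "not significant"
    print(f"{a} vs {b}: p={p_ab:.4g} ({flag} at alpha={alpha_adj:.4f})")
```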
Collapse
Affiliation(s)
- Giacomo Rossettini
- School of Physiotherapy, University of Verona, Verona, Italy.
- Department of Physiotherapy, Faculty of Sport Sciences, Universidad Europea de Madrid, Villaviciosa de Odón, 28670, Spain.
| | - Lia Rodeghiero
- Department of Rehabilitation, Hospital of Merano (SABES-ASDAA), Teaching Hospital of Paracelsus Medical University (PMU), Merano-Meran, Italy.
| | | | - Chad Cook
- Department of Orthopaedics, Duke University, Durham, NC, USA
- Duke Clinical Research Institute, Duke University, Durham, NC, USA
- Department of Population Health Sciences, Duke University, Durham, NC, USA
| | - Paolo Pillastrini
- Department of Biomedical and Neuromotor Sciences (DIBINEM), Alma Mater University of Bologna, Bologna, Italy
- Unit of Occupational Medicine, IRCCS Azienda Ospedaliero-Universitaria Di Bologna, Bologna, Italy
| | - Andrea Turolla
- Department of Biomedical and Neuromotor Sciences (DIBINEM), Alma Mater University of Bologna, Bologna, Italy
- Unit of Occupational Medicine, IRCCS Azienda Ospedaliero-Universitaria Di Bologna, Bologna, Italy
| | - Greta Castellini
- Unit of Clinical Epidemiology, IRCCS Istituto Ortopedico Galeazzi, Milan, Italy
| | | | - Silvia Gianola
- Unit of Clinical Epidemiology, IRCCS Istituto Ortopedico Galeazzi, Milan, Italy.
| | - Alvisa Palese
- Department of Medical Sciences, University of Udine, Udine, Italy.
| |
Collapse
|
36
|
Tong L, Wang J, Rapaka S, Garg PS. Can ChatGPT generate practice question explanations for medical students, a new faculty teaching tool? MEDICAL TEACHER 2024:1-5. [PMID: 38900675 DOI: 10.1080/0142159x.2024.2363486] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/11/2023] [Accepted: 05/30/2024] [Indexed: 06/22/2024]
Abstract
INTRODUCTION Multiple-choice questions (MCQs) are frequently used for formative assessment in medical school but often lack sufficient answer explanations given the time constraints of faculty. Chat Generative Pre-trained Transformer (ChatGPT) has emerged as a potential student learning aid and faculty teaching tool. This study aims to evaluate ChatGPT's performance in answering and providing explanations for MCQs. METHOD Ninety-four faculty-generated MCQs were collected from the pre-clerkship curriculum at a US medical school. ChatGPT's accuracy in answering MCQs was tracked on the first attempt without an answer prompt (Pass 1) and after being given a prompt for the correct answer (Pass 2). Explanations provided by ChatGPT were compared with faculty-generated explanations, and a 3-point evaluation scale was used to assess accuracy and thoroughness relative to faculty-generated answers. RESULTS On the first attempt, ChatGPT demonstrated 75% accuracy in correctly answering faculty-generated MCQs. Among correctly answered questions, 66.4% of ChatGPT's explanations matched faculty explanations, and 89.1% captured some key aspects without providing inaccurate information. The proportion of inaccurate explanations increased significantly if the question was not answered correctly on the first pass (2.7% if correct on first pass vs. 34.6% if incorrect on first pass, p < 0.001). CONCLUSION ChatGPT shows promise in assisting faculty and students with explanations for practice MCQs but should be used with caution. Faculty should review and supplement the explanations to ensure coverage of learning objectives. Students can benefit from ChatGPT's immediate feedback through explanations if ChatGPT answers the question correctly on the first try. If the question is answered incorrectly, students should remain cautious of the explanation and seek clarification from instructors.
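The key contrast above (2.7% vs. 34.6% inaccurate explanations depending on first-pass correctness) is a two-proportion comparison. The sketch below illustrates it with Fisher's exact test on a 2x2 table; the counts are hypothetical placeholders chosen only to be roughly consistent with the reported percentages, not the study's data.

```python
# Minimal sketch of the two-proportion comparison reported above
# (inaccurate explanations when the MCQ was answered correctly vs incorrectly
# on the first pass). Counts are hypothetical placeholders.
from scipy.stats import fisher_exact

# 2x2 table: rows = first-pass result, columns = [inaccurate, accurate] explanations
table = [
    [2, 68],   # answered correctly on first pass (hypothetical)
    [8, 16],   # answered incorrectly on first pass (hypothetical)
]

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
rate_correct = table[0][0] / sum(table[0])
rate_incorrect = table[1][0] / sum(table[1])
print(f"inaccurate-explanation rate: {rate_correct:.1%} vs {rate_incorrect:.1%}, p={p_value:.4g}")
```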
Collapse
Affiliation(s)
- Lilin Tong
- Boston University Chobanian and Avedisian School of Medicine, Boston, MA, USA
| | - Jennifer Wang
- Boston University Chobanian and Avedisian School of Medicine, Boston, MA, USA
| | - Srikar Rapaka
- Boston University Chobanian and Avedisian School of Medicine, Boston, MA, USA
| | - Priya S Garg
- Medical Education Office and Department of Pediatrics, Boston University Chobanian and Avedisian School of Medicine, Boston, MA, USA
| |
Collapse
|
37
|
Kaba E, Akkaya S. Performance of Different Large Language Models in the Sample Test of the European Cardiovascular Radiology Board Examination. Acad Radiol 2024:S1076-6332(24)00369-6. [PMID: 38902112 DOI: 10.1016/j.acra.2024.06.011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/01/2024] [Accepted: 06/04/2024] [Indexed: 06/22/2024]
Affiliation(s)
- Esat Kaba
- Recep Tayyip Erdogan University, Department of Radiology, Rize, Turkey.
| | - Selçuk Akkaya
- Karadeniz Technical University, Department of Radiology, Trabzon, Turkey
| |
Collapse
|
38
|
Suwała S, Szulc P, Guzowski C, Kamińska B, Dorobiała J, Wojciechowska K, Berska M, Kubicka O, Kosturkiewicz O, Kosztulska B, Rajewska A, Junik R. ChatGPT-3.5 passes Poland's medical final examination-Is it possible for ChatGPT to become a doctor in Poland? SAGE Open Med 2024; 12:20503121241257777. [PMID: 38895543 PMCID: PMC11185017 DOI: 10.1177/20503121241257777] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/21/2024] [Accepted: 05/08/2024] [Indexed: 06/21/2024] Open
Abstract
Objectives ChatGPT is an advanced chatbot based on a large language model that can answer questions. Undoubtedly, ChatGPT is capable of transforming communication, education, and customer support; however, can it play the role of a doctor? In Poland, prior to obtaining a medical diploma, candidates must successfully pass the Medical Final Examination. Methods The purpose of this research was to determine how well ChatGPT performed on the Polish Medical Final Examination, which must be passed to become a doctor in Poland (an exam is considered passed if at least 56% of the tasks are answered correctly). A total of 2138 categorized Medical Final Examination questions (from 11 examination sessions held between 2013-2015 and 2021-2023) were presented to ChatGPT-3.5 from 19 to 26 May 2023. For further analysis, the questions were divided into quintiles based on difficulty and duration, as well as by question type (simple A-type or complex K-type). The answers provided by ChatGPT were compared with the official answer key, which was reviewed for any changes resulting from advances in medical knowledge. Results ChatGPT correctly answered 53.4%-64.9% of questions. In 8 out of 11 exam sessions, ChatGPT achieved the score required to pass the examination (60%). The correlation between the efficacy of artificial intelligence and the complexity, difficulty, and length of a question was negative. AI outperformed humans in one category: psychiatry (77.18% vs. 70.25%, p = 0.081). Conclusions The performance of artificial intelligence was satisfactory; however, it was markedly inferior to that of human graduates in most instances. Despite its potential utility in many medical areas, ChatGPT is constrained by inherent limitations that prevent it from entirely supplanting human expertise and knowledge.
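The difficulty-quintile split, the negative accuracy-difficulty association, and the per-session pass check described above can be illustrated on simulated data. Everything in the sketch below (the difficulty index, session labels, and simulated correctness) is an assumption for illustration; only the 56% pass rule comes from the abstract.

```python
# Minimal sketch of the quintile/correlation analysis described above:
# a negative association between question difficulty and accuracy, and a
# pass/fail check per session against the 56% threshold. Data are simulated.
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n = 2138
df = pd.DataFrame({
    "difficulty": rng.uniform(0, 1, n),   # hypothetical difficulty index
    "session": rng.integers(1, 12, n),    # 11 hypothetical exam sessions
})
# Simulate: harder questions are less likely to be answered correctly
df["correct"] = rng.random(n) < (0.75 - 0.3 * df["difficulty"])

# Correlation between difficulty and correctness (expected to be negative)
rho, p = spearmanr(df["difficulty"], df["correct"])
print(f"Spearman rho={rho:.3f}, p={p:.3g}")

# Accuracy by difficulty quintile and pass/fail per session (threshold: 56%)
df["quintile"] = pd.qcut(df["difficulty"], 5, labels=False) + 1
print(df.groupby("quintile")["correct"].mean())
print(df.groupby("session")["correct"].mean().ge(0.56))
```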
Collapse
Affiliation(s)
- Szymon Suwała
- Department of Endocrinology and Diabetology, Nicolaus Copernicus University, Collegium Medicum, Bydgoszcz, Poland
| | - Paulina Szulc
- Evidence-Based Medicine Students Scientific Club of Department of Endocrinology and Diabetology, Nicolaus Copernicus University, Collegium Medicum, Bydgoszcz, Poland
| | - Cezary Guzowski
- Evidence-Based Medicine Students Scientific Club of Department of Endocrinology and Diabetology, Nicolaus Copernicus University, Collegium Medicum, Bydgoszcz, Poland
| | - Barbara Kamińska
- Evidence-Based Medicine Students Scientific Club of Department of Endocrinology and Diabetology, Nicolaus Copernicus University, Collegium Medicum, Bydgoszcz, Poland
| | - Jakub Dorobiała
- Evidence-Based Medicine Students Scientific Club of Department of Endocrinology and Diabetology, Nicolaus Copernicus University, Collegium Medicum, Bydgoszcz, Poland
| | - Karolina Wojciechowska
- Evidence-Based Medicine Students Scientific Club of Department of Endocrinology and Diabetology, Nicolaus Copernicus University, Collegium Medicum, Bydgoszcz, Poland
| | - Maria Berska
- Evidence-Based Medicine Students Scientific Club of Department of Endocrinology and Diabetology, Nicolaus Copernicus University, Collegium Medicum, Bydgoszcz, Poland
| | - Olga Kubicka
- Evidence-Based Medicine Students Scientific Club of Department of Endocrinology and Diabetology, Nicolaus Copernicus University, Collegium Medicum, Bydgoszcz, Poland
| | - Oliwia Kosturkiewicz
- Evidence-Based Medicine Students Scientific Club of Department of Endocrinology and Diabetology, Nicolaus Copernicus University, Collegium Medicum, Bydgoszcz, Poland
| | - Bernadetta Kosztulska
- Evidence-Based Medicine Students Scientific Club of Department of Endocrinology and Diabetology, Nicolaus Copernicus University, Collegium Medicum, Bydgoszcz, Poland
| | - Alicja Rajewska
- Evidence-Based Medicine Students Scientific Club of Department of Endocrinology and Diabetology, Nicolaus Copernicus University, Collegium Medicum, Bydgoszcz, Poland
| | - Roman Junik
- Department of Endocrinology and Diabetology, Nicolaus Copernicus University, Collegium Medicum, Bydgoszcz, Poland
| |
Collapse
|
39
|
Longwell JB, Hirsch I, Binder F, Gonzalez Conchas GA, Mau D, Jang R, Krishnan RG, Grant RC. Performance of Large Language Models on Medical Oncology Examination Questions. JAMA Netw Open 2024; 7:e2417641. [PMID: 38888919 PMCID: PMC11185976 DOI: 10.1001/jamanetworkopen.2024.17641] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 02/01/2024] [Accepted: 04/18/2024] [Indexed: 06/20/2024] Open
Abstract
Importance Large language models (LLMs) recently developed an unprecedented ability to answer questions. Studies of LLMs from other fields may not generalize to medical oncology, a high-stakes clinical setting requiring rapid integration of new information. Objective To evaluate the accuracy and safety of LLM answers on medical oncology examination questions. Design, Setting, and Participants This cross-sectional study was conducted between May 28 and October 11, 2023. The American Society of Clinical Oncology (ASCO) Oncology Self-Assessment Series on ASCO Connection, the European Society of Medical Oncology (ESMO) Examination Trial questions, and an original set of board-style medical oncology multiple-choice questions were presented to 8 LLMs. Main Outcomes and Measures The primary outcome was the percentage of correct answers. Medical oncologists evaluated the explanations provided by the best LLM for accuracy, classified the types of errors, and estimated the likelihood and extent of potential clinical harm. Results Proprietary LLM 2 correctly answered 125 of 147 questions (85.0%; 95% CI, 78.2%-90.4%; P < .001 vs random answering). Proprietary LLM 2 outperformed an earlier version, proprietary LLM 1, which correctly answered 89 of 147 questions (60.5%; 95% CI, 52.2%-68.5%; P < .001), and the best open-source LLM, Mixtral-8x7B-v0.1, which correctly answered 87 of 147 questions (59.2%; 95% CI, 50.0%-66.4%; P < .001). The explanations provided by proprietary LLM 2 contained no or minor errors for 138 of 147 questions (93.9%; 95% CI, 88.7%-97.2%). Incorrect responses were most commonly associated with errors in information retrieval, particularly with recent publications, followed by erroneous reasoning and reading comprehension. If acted upon in clinical practice, 18 of 22 incorrect answers (81.8%; 95% CI, 59.7%-94.8%) would have a medium or high likelihood of moderate to severe harm. Conclusions and Relevance In this cross-sectional study of the performance of LLMs on medical oncology examination questions, the best LLM answered questions with remarkable performance, although errors raised safety concerns. These results demonstrated an opportunity to develop and evaluate LLMs to improve health care clinician experiences and patient care, considering the potential impact on capabilities and safety.
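The headline result above (125 of 147 correct, with a 95% CI and a comparison against random answering) can be reproduced with a binomial test and a proportion confidence interval. The random-answering baseline of 0.25 (four-option questions) and the Wilson interval method are assumptions; the abstract does not state either, so this is an illustrative sketch only.

```python
# Minimal sketch of the accuracy estimate reported above: proportion correct
# with a 95% CI and a binomial test against random answering (assumed 1 in 4).
from scipy.stats import binomtest
from statsmodels.stats.proportion import proportion_confint

correct, total = 125, 147
result = binomtest(correct, total, p=0.25, alternative="greater")
low, high = proportion_confint(correct, total, alpha=0.05, method="wilson")
print(f"accuracy={correct/total:.1%} (95% CI {low:.1%}-{high:.1%}), "
      f"p vs random={result.pvalue:.3g}")
```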
Collapse
Affiliation(s)
- Jack B. Longwell
- Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada
| | - Ian Hirsch
- Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada
- Department of Medicine, University of Toronto, Toronto, Ontario, Canada
| | - Fernando Binder
- Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada
- Department of Medicine, University of Toronto, Toronto, Ontario, Canada
| | | | - Daniel Mau
- Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada
- Institute of Medical Science, University of Toronto, Toronto, Ontario, Canada
| | - Raymond Jang
- Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada
- Department of Medicine, University of Toronto, Toronto, Ontario, Canada
| | - Rahul G. Krishnan
- Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
- Department of Laboratory Medicine and Pathobiology, University of Toronto, Toronto, Ontario, Canada
- Vector Institute, Toronto, Ontario, Canada
| | - Robert C. Grant
- Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada
- Department of Medicine, University of Toronto, Toronto, Ontario, Canada
- Institute of Medical Science, University of Toronto, Toronto, Ontario, Canada
- ICES, Toronto, Ontario, Canada
| |
Collapse
|
40
|
Jenko N, Ariyaratne S, Jeys L, Evans S, Iyengar KP, Botchu R. An evaluation of AI generated literature reviews in musculoskeletal radiology. Surgeon 2024; 22:194-197. [PMID: 38218659 DOI: 10.1016/j.surge.2023.12.005] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/11/2023] [Revised: 12/20/2023] [Accepted: 12/27/2023] [Indexed: 01/15/2024]
Abstract
PURPOSE The use of artificial intelligence (AI) tools to aid in summarizing information in medicine and research has recently garnered a huge amount of interest. While tools such as ChatGPT produce convincing and natural-sounding output, the answers are sometimes incorrect. It is hoped that some of these drawbacks can be avoided by using programmes trained for a more specific scope. In this study we compared the performance of a new AI tool (the-literature.com) with the latest version of OpenAI's ChatGPT (GPT-4) in summarizing topics to which the authors have significantly contributed. METHODS The AI tools were asked to produce a literature review on 7 topics. These were selected as research topics with which the authors were intimately familiar and to which they have contributed through their own publications. The output produced by the AI tools was graded on a 1-5 Likert scale for accuracy, comprehensiveness, and relevance by two fellowship-trained consultant radiologists. RESULTS The-literature.com produced 3 excellent summaries, 3 very poor summaries not relevant to the prompt, and one summary that was relevant but did not include all relevant papers. All of the summaries produced by GPT-4 were relevant, but fewer relevant papers were identified. The average Likert rating was 2.88 for the-literature and 3.86 for GPT-4. There was good agreement between the ratings of both radiologists (ICC = 0.883). CONCLUSION Summaries produced by AI in its current state require careful human validation. GPT-4 on average provides higher-quality summaries. Neither tool can reliably identify all relevant publications.
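The inter-rater agreement reported above (ICC = 0.883) can be computed from a targets-by-raters matrix of Likert scores. The sketch below uses the two-way random-effects, single-measure form (Shrout & Fleiss ICC(2,1)); the ratings are hypothetical placeholders, and the specific ICC form used by the authors is not stated in the abstract, so that choice is an assumption.

```python
# Minimal sketch of a two-rater agreement check on Likert ratings using a
# two-way random-effects ICC (Shrout & Fleiss ICC(2,1)). Ratings are
# hypothetical placeholders, not the study's data.
import numpy as np

# rows = summaries (targets), columns = the two radiologist raters
ratings = np.array([
    [5, 5], [4, 4], [1, 2], [2, 1], [5, 4], [3, 3], [4, 5],
    [5, 5], [1, 1], [2, 2], [4, 4], [3, 4], [5, 5], [2, 3],
], dtype=float)

n, k = ratings.shape
grand = ratings.mean()
row_means = ratings.mean(axis=1)          # per-summary means
col_means = ratings.mean(axis=0)          # per-rater means

ms_rows = k * np.sum((row_means - grand) ** 2) / (n - 1)   # between targets
ms_cols = n * np.sum((col_means - grand) ** 2) / (k - 1)   # between raters
residual = ratings - row_means[:, None] - col_means[None, :] + grand
ms_err = np.sum(residual ** 2) / ((n - 1) * (k - 1))        # interaction/error

icc_2_1 = (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)
print(f"ICC(2,1) = {icc_2_1:.3f}")
```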
Collapse
Affiliation(s)
- N Jenko
- Radiology, Royal Orthopaedic Hospital NHS Foundation Trust, Birmingham, UK.
| | - S Ariyaratne
- Radiology, Royal Orthopaedic Hospital NHS Foundation Trust, Birmingham, UK
| | - L Jeys
- Orthopaedic Surgery, Royal Orthopaedic Hospital NHS Foundation Trust, Birmingham, UK
| | - S Evans
- Orthopaedic Surgery, Royal Orthopaedic Hospital NHS Foundation Trust, Birmingham, UK
| | - K P Iyengar
- Orthopaedic Surgery, Mersey and West Lancashire Teaching Hospitals NHS Trust, Southport, UK
| | - R Botchu
- Radiology, Royal Orthopaedic Hospital NHS Foundation Trust, Birmingham, UK
| |
Collapse
|
41
|
Taesotikul S, Singhan W, Taesotikul T. ChatGPT vs pharmacy students in the pharmacotherapy time-limit test: A comparative study in Thailand. CURRENTS IN PHARMACY TEACHING & LEARNING 2024; 16:404-410. [PMID: 38641483 DOI: 10.1016/j.cptl.2024.04.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 10/17/2023] [Revised: 04/03/2024] [Accepted: 04/04/2024] [Indexed: 04/21/2024]
Abstract
OBJECTIVES ChatGPT is an innovative artificial intelligence designed to enhance human activities and serve as a potent tool for information retrieval. This study aimed to evaluate the performance and limitations of ChatGPT on a fourth-year pharmacy student examination. METHODS This cross-sectional study was conducted in February 2023 at the Faculty of Pharmacy, Chiang Mai University, Thailand. The exam contained 16 multiple-choice questions and 2 short-answer questions, focusing on the classification and medical management of shock and electrolyte disorders. RESULTS ChatGPT answered 44% (8 of 18) of the questions correctly. In contrast, the students achieved a higher accuracy rate of 66% (12 of 18). These findings underscore that while the AI exhibits proficiency, it encounters limitations when confronted with specific queries derived from practical scenarios, in contrast to pharmacy students, who are free to explore and collaborate in ways that mirror real-world practice. CONCLUSIONS Users must exercise caution regarding its reliability, and AI-generated answers should be interpreted judiciously given potential limitations in multi-step analysis and reliance on outdated data. Future advancements in AI models, with refinements and tailored enhancements, offer the potential for improved performance.
Collapse
Affiliation(s)
- Suthinee Taesotikul
- Department of Pharmaceutical Care, Faculty of Pharmacy, Chiang Mai University, Chiang Mai 50200, Thailand.
| | - Wanchana Singhan
- Department of Pharmaceutical Care, Faculty of Pharmacy, Chiang Mai University, Chiang Mai 50200, Thailand.
| | - Theerada Taesotikul
- Department of Biomedicine and Health Informatics, Faculty of Pharmacy, Silpakorn University, Nakhon Pathom 73000, Thailand.
| |
Collapse
|
42
|
Bhayana R, Nanda B, Dehkharghanian T, Deng Y, Bhambra N, Elias G, Datta D, Kambadakone A, Shwaartz CG, Moulton CA, Henault D, Gallinger S, Krishna S. Large Language Models for Automated Synoptic Reports and Resectability Categorization in Pancreatic Cancer. Radiology 2024; 311:e233117. [PMID: 38888478 DOI: 10.1148/radiol.233117] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/20/2024]
Abstract
Background Structured radiology reports for pancreatic ductal adenocarcinoma (PDAC) improve surgical decision-making over free-text reports, but radiologist adoption is variable. Resectability criteria are applied inconsistently. Purpose To evaluate the performance of large language models (LLMs) in automatically creating PDAC synoptic reports from original reports and to explore performance in categorizing tumor resectability. Materials and Methods In this institutional review board-approved retrospective study, 180 consecutive PDAC staging CT reports on patients referred to the authors' European Society for Medical Oncology-designated cancer center from January to December 2018 were included. Reports were reviewed by two radiologists to establish the reference standard for 14 key findings and National Comprehensive Cancer Network (NCCN) resectability category. GPT-3.5 and GPT-4 (accessed September 18-29, 2023) were prompted to create synoptic reports from original reports with the same 14 features, and their performance was evaluated (recall, precision, F1 score). To categorize resectability, three prompting strategies (default knowledge, in-context knowledge, chain-of-thought) were used for both LLMs. Hepatopancreaticobiliary surgeons reviewed original and artificial intelligence (AI)-generated reports to determine resectability, with accuracy and review time compared. The McNemar test, t test, Wilcoxon signed-rank test, and mixed effects logistic regression models were used where appropriate. Results GPT-4 outperformed GPT-3.5 in the creation of synoptic reports (F1 score: 0.997 vs 0.967, respectively). Compared with GPT-3.5, GPT-4 achieved equal or higher F1 scores for all 14 extracted features. GPT-4 had higher precision than GPT-3.5 for extracting superior mesenteric artery involvement (100% vs 88.8%, respectively). For categorizing resectability, GPT-4 outperformed GPT-3.5 for each prompting strategy. For GPT-4, chain-of-thought prompting was most accurate, outperforming in-context knowledge prompting (92% vs 83%, respectively; P = .002), which outperformed the default knowledge strategy (83% vs 67%, P < .001). Surgeons were more accurate in categorizing resectability using AI-generated reports than original reports (83% vs 76%, respectively; P = .03), while spending less time on each report (58%; 95% CI: 0.53, 0.62). Conclusion GPT-4 created near-perfect PDAC synoptic reports from original reports. GPT-4 with chain-of-thought achieved high accuracy in categorizing resectability. Surgeons were more accurate and efficient using AI-generated reports. © RSNA, 2024 Supplemental material is available for this article. See also the editorial by Chang in this issue.
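The per-feature extraction scoring above (recall, precision, F1 for each of 14 synoptic-report features) reduces to simple counts of true/false positives and negatives. The sketch below shows the arithmetic; the feature names and counts are hypothetical placeholders, not the authors' evaluation data or code.

```python
# Minimal sketch of per-feature extraction scoring: precision, recall, and F1
# computed from true-positive, false-positive, and false-negative counts.
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# feature -> (true positives, false positives, false negatives), hypothetical
features = {
    "tumor size": (178, 1, 1),
    "SMA involvement": (170, 2, 8),
    "liver metastases": (175, 0, 5),
}
for name, (tp, fp, fn) in features.items():
    p, r, f1 = precision_recall_f1(tp, fp, fn)
    print(f"{name}: precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```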
Collapse
Affiliation(s)
- Rajesh Bhayana
- From University Medical Imaging Toronto, Joint Department of Medical Imaging, University Health Network, Princess Margaret Cancer Centre, Department of Medical Imaging, University of Toronto, Toronto General Hospital, 200 Elizabeth St, Peter Munk Building, 1st Fl, Toronto, ON, Canada M5G 24C (R.B., B.N., T.D., S.K.); Department of Biostatistics (Y.D.) and HPB Surgical Oncology (C.G.S., C.A.M., D.H., S.G.), University Health Network, Toronto, Ontario, Canada; Departments of Medicine (N.B., G.E., D.D.) and Surgery (C.G.S., C.A.M., D.H., S.G.), University of Toronto, Toronto, Ontario, Canada; and Department of Radiology, Massachusetts General Hospital, Harvard Medical School, Boston, Mass (A.K.)
| | - Bipin Nanda
| | - Taher Dehkharghanian
| | - Yangqing Deng
| | - Nishaant Bhambra
| | - Gavin Elias
| | - Daksh Datta
| | - Avinash Kambadakone
| | - Chaya G Shwaartz
| | - Carol-Anne Moulton
| | - David Henault
| | - Steven Gallinger
| | - Satheesh Krishna
| |
Collapse
|
43
|
Sparks CA, Kraeutler MJ, Chester GA, Contrada EV, Zhu E, Fasulo SM, Scillia AJ. Inadequate Performance of ChatGPT on Orthopedic Board-Style Written Exams. Cureus 2024; 16:e62643. [PMID: 39036109 PMCID: PMC11258215 DOI: 10.7759/cureus.62643] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 06/17/2024] [Indexed: 07/23/2024] Open
Abstract
BACKGROUND Chat Generative Pre-Trained Transformer (ChatGPT) is an artificial intelligence (AI) chatbot capable of delivering human-like responses to a seemingly infinite number of inquiries. For the technology to perform certain healthcare-related tasks or act as a study aid, it should have up-to-date knowledge and the ability to reason through medical information. The purpose of this study was to assess the orthopedic knowledge and reasoning ability of ChatGPT by querying it with orthopedic board-style questions. METHODOLOGY We queried ChatGPT (GPT-3.5) with a total of 472 questions from the Orthobullets dataset (n = 239), the 2022 Orthopaedic In-Training Examination (OITE) (n = 124), and the 2021 OITE (n = 109). The importance, difficulty, and category were recorded for questions from the Orthobullets question bank. Responses were assessed for answer choice correctness, for whether the explanation given matched that of the dataset, for answer integrity, and for the reason for incorrectness. RESULTS ChatGPT correctly answered 55.9% (264/472) of questions and, of those answered correctly, gave an explanation that matched that of the dataset for 92.8% (245/264) of the questions. The chatbot used information internal to the question in all responses (100%) and used information external to the question (98.3%) as well as logical reasoning (96.4%) in most responses. There was no significant difference in the proportion of questions answered correctly between the datasets (P = 0.62). There was no significant difference in the proportion of questions answered correctly by question category (P = 0.67), importance (P = 0.95), or difficulty (P = 0.87) within the Orthobullets dataset questions. ChatGPT most often answered questions incorrectly due to information error (i.e., failure to identify the information required to answer the question; 81.7% of incorrect responses). CONCLUSIONS ChatGPT performs below the threshold likely required to pass the American Board of Orthopedic Surgery (ABOS) Part I written exam. The chatbot's performance on the 2022 and 2021 OITEs was between the average performance of an intern and that of a second-year resident. A major limitation of the current model is the failure to identify the information required to correctly answer the questions.
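The between-dataset comparison above (no significant difference in accuracy across the three question sources) can be illustrated with a chi-squared test of independence. Only the pooled 264/472 figure is reported in the abstract, so the per-dataset split of correct answers in the sketch is hypothetical; the dataset sizes (239, 124, 109) come from the abstract.

```python
# Minimal sketch of a between-dataset comparison: chi-squared test of the
# proportion of correctly answered questions across three question sources.
import numpy as np
from scipy.stats import chi2_contingency

# rows: question source, columns: [correct, incorrect] -- hypothetical split
table = np.array([
    [134, 105],  # Orthobullets (239 questions)
    [ 69,  55],  # 2022 OITE (124 questions)
    [ 61,  48],  # 2021 OITE (109 questions)
])
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2={chi2:.2f}, dof={dof}, p={p:.3f}")
print(f"overall accuracy={table[:, 0].sum() / table.sum():.1%}")
```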
Collapse
Affiliation(s)
- Chandler A Sparks
- Department of Orthopedic Surgery, Hackensack Meridian School of Medicine, Nutley, USA
| | - Matthew J Kraeutler
- Department of Orthopedics, University of Colorado Anschutz Medical Campus, Aurora, USA
| | - Grace A Chester
- Department of Orthopedic Surgery, Hackensack Meridian School of Medicine, Nutley, USA
| | - Edward V Contrada
- Department of Orthopedic Surgery, Hackensack Meridian School of Medicine, Nutley, USA
| | - Eric Zhu
- Department of Orthopedic Surgery, Hackensack Meridian School of Medicine, Nutley, USA
| | - Sydney M Fasulo
- Department of Orthopedic Surgery, St. Joseph's Medical Center, Paterson, USA
| | - Anthony J Scillia
- Department of Sports Medicine/Orthopedics, Seton Hall University, Paterson, USA
| |
Collapse
|
44
|
Altamimi I, Alhumimidi A, Alshehri S, Alrumayan A, Al-khlaiwi T, Meo SA, Temsah MH. The scientific knowledge of three large language models in cardiology: multiple-choice questions examination-based performance. Ann Med Surg (Lond) 2024; 86:3261-3266. [PMID: 38846858 PMCID: PMC11152788 DOI: 10.1097/ms9.0000000000002120] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/28/2023] [Accepted: 04/16/2024] [Indexed: 06/09/2024] Open
Abstract
Background The integration of artificial intelligence (AI) chatbots like Google's Bard, OpenAI's ChatGPT, and Microsoft's Bing Chatbot into academic and professional domains, including cardiology, has been rapidly evolving. Their application in educational and research frameworks, however, raises questions about their efficacy, particularly in specialized fields like cardiology. This study aims to evaluate the knowledge depth and accuracy of these AI chatbots in cardiology using a multiple-choice question (MCQ) format. Methods The study was conducted as an exploratory, cross-sectional study in November 2023 on a bank of 100 MCQs covering various cardiology topics that was created from authoritative textbooks and question banks. These MCQs were then used to assess the knowledge level of Google's Bard, Microsoft Bing, and ChatGPT 4.0. Each question was entered manually into the chatbots, ensuring no memory retention bias. Results The study found that ChatGPT 4.0 demonstrated the highest knowledge score in cardiology, with 87% accuracy, followed by Bing at 60% and Bard at 46%. The performance varied across different cardiology subtopics, with ChatGPT consistently outperforming the others. Notably, the study revealed significant differences in the proficiency of these chatbots in specific cardiology domains. Conclusion This study highlights a spectrum of efficacy among AI chatbots in disseminating cardiology knowledge. ChatGPT 4.0 emerged as a potential auxiliary educational resource in cardiology, surpassing traditional learning methods in some aspects. However, the variability in performance among these AI systems underscores the need for cautious evaluation and continuous improvement, especially for chatbots like Bard, to ensure reliability and accuracy in medical knowledge dissemination.
Collapse
Affiliation(s)
- Ibraheem Altamimi
- College of Medicine
- Evidence-Based Health Care and Knowledge Translation Research Chair, Family and Community Medicine Department, College of Medicine, King Saud University
| | | | | | - Abdullah Alrumayan
- College of Medicine, King Saud Bin Abdulaziz University for Health and Sciences, Riyadh, Saudi Arabia
| | | | | | - Mohamad-Hani Temsah
- College of Medicine
- Evidence-Based Health Care and Knowledge Translation Research Chair, Family and Community Medicine Department, College of Medicine, King Saud University
- Pediatric Intensive Care Unit, Pediatric Department, College of Medicine, King Saud University Medical City
| |
Collapse
|
45
|
Koga S, Du W. Integrating AI in medicine: Lessons from Chat-GPT's limitations in medical imaging. Dig Liver Dis 2024; 56:1114-1115. [PMID: 38429138 DOI: 10.1016/j.dld.2024.02.014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 02/17/2024] [Accepted: 02/19/2024] [Indexed: 03/03/2024]
Affiliation(s)
- Shunsuke Koga
- Department of Pathology and Laboratory Medicine, Hospital of the University of Pennsylvania, 3400 Spruce Street, Philadelphia, PA 19104, United States.
| | - Wei Du
- Department of Pathology and Laboratory Medicine, Hospital of the University of Pennsylvania, 3400 Spruce Street, Philadelphia, PA 19104, United States
| |
Collapse
|
46
|
Mokkarala M, Bentley H, Gomez C, Jiao A, Zaki-Metias KM. The New American Board of Radiology Certifying Oral Examination: How Should Diagnostic Radiology Graduate Medical Education Evolve? Radiographics 2024; 44:e240016. [PMID: 38722783 DOI: 10.1148/rg.240016] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 08/21/2024]
Affiliation(s)
- Mahati Mokkarala
- From the Department of Radiology, Mallinckrodt Institute of Radiology, 510 S Kingshighway Blvd #8131, St Louis, MO 63108 (M.M.); Department of Radiology, University of British Columbia, Vancouver, British Columbia, Canada (H.B.); Department of Radiology and Imaging Sciences, Emory University School of Medicine, Atlanta, Ga (C.G.); Department of Radiology, Brigham and Women's Hospital, Harvard Medical School, Boston, Mass (A.J.); and Department of Radiology, Trinity Health Oakland Hospital, Wayne State University School of Medicine, Pontiac, Mich (K.M.Z.M.)
| | - Helena Bentley
| | - Christian Gomez
| | - Albert Jiao
| | - Kaitlin M Zaki-Metias
47
Mousavi M, Shafiee S, Harley JM, Cheung JCK, Abbasgholizadeh Rahimi S. Performance of generative pre-trained transformers (GPTs) in Certification Examination of the College of Family Physicians of Canada. Fam Med Community Health 2024; 12:e002626. [PMID: 38806403 PMCID: PMC11138270 DOI: 10.1136/fmch-2023-002626] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/30/2024] Open
Abstract
INTRODUCTION The application of large language models such as generative pre-trained transformers (GPTs) has been promising in medical education, and their performance has been tested on various medical exams. This study aims to assess the performance of GPTs in responding to a set of sample questions of short-answer management problems (SAMPs) from the certification exam of the College of Family Physicians of Canada (CFPC). METHOD Between August 8th and 25th, 2023, we used GPT-3.5 and GPT-4 in five rounds to answer a sample of 77 SAMP questions from the CFPC website. Two independent certified family physician reviewers scored the AI-generated responses twice: first, according to the CFPC answer key (ie, CFPC score), and second, based on their knowledge and other references (ie, Reviewers' score). An ordinal logistic generalised estimating equations (GEE) model was applied to analyse repeated measures across the five rounds. RESULTS According to the CFPC answer key, 607 (73.6%) lines of answers by GPT-3.5 and 691 (81%) by GPT-4 were deemed accurate. The reviewers' scoring suggested that about 84% of the lines of answers provided by GPT-3.5 and 93% of those provided by GPT-4 were correct. The GEE analysis confirmed that over the five rounds, the odds of achieving a higher CFPC score percentage were 2.31 times greater for GPT-4 than for GPT-3.5 (OR: 2.31; 95% CI: 1.53 to 3.47; p<0.001). Similarly, the reviewers' score percentages for responses provided by GPT-4 over the five rounds were 2.23 times more likely to exceed those of GPT-3.5 (OR: 2.23; 95% CI: 1.22 to 4.06; p=0.009). Rerunning the GPTs after a one-week interval, regenerating the prompt, or using or not using the prompt did not significantly change the CFPC score percentage. CONCLUSION In our study, we used GPT-3.5 and GPT-4 to answer complex, open-ended sample questions from the CFPC exam and showed that more than 70% of the answers were accurate and that GPT-4 outperformed GPT-3.5 in responding to the questions. Large language models such as GPTs seem promising for assisting candidates of the CFPC exam by providing potential answers. However, their use for family medicine education and exam preparation needs further study.
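As a rough illustration of the repeated-measures analysis described above, the sketch below fits a GEE on simulated line-level scores. It is not the authors' code: it simplifies the ordinal outcome to a binary correct/incorrect indicator, and the data, column names, and correlation structure are assumptions made only for illustration.

```python
# Minimal sketch (assumed data and column names, not the study's materials):
# a logistic GEE comparing GPT-4 with GPT-3.5 on repeated line-level scores.
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Toy long-format data: one row per scored answer, per model, per round.
rng = np.random.default_rng(0)
rows = []
for model, p_correct in [("GPT-3.5", 0.74), ("GPT-4", 0.81)]:
    for rnd in range(1, 6):
        for q in range(1, 78):
            rows.append({"model": model, "round": rnd, "question_id": q,
                         "correct": int(rng.random() < p_correct)})
df = pd.DataFrame(rows)

# Exchangeable working correlation accounts for repeated measurements of the
# same question across rounds; GPT-3.5 is the reference level (alphabetical).
gee = sm.GEE.from_formula(
    "correct ~ model",
    groups="question_id",
    data=df,
    family=sm.families.Binomial(),
    cov_struct=sm.cov_struct.Exchangeable(),
)
result = gee.fit()

# Exponentiated coefficients give odds ratios; the paper reports OR 2.31 for
# GPT-4 vs GPT-3.5 from an ordinal GEE, so this binary sketch will differ.
print(np.exp(result.params))
print(np.exp(result.conf_int()))
```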
Affiliation(s)
- Mehdi Mousavi
- Department of Family Medicine, Faculty of Medicine, University of Saskatchewan, Nipawin, Saskatchewan, Canada
- Shabnam Shafiee
- Department of Family Medicine, Saskatchewan Health Authority, Riverside Health Complex, Turtleford, Saskatchewan, Canada
- Jason M Harley
- Department of Surgery, Faculty of Medicine and Health Sciences, McGill University, Montreal, Quebec, Canada
- Research Institute of the McGill University Health Centre, Montreal, Quebec, Canada
- Institute for Health Sciences Education, Faculty of Medicine and Health Sciences, McGill University, Montreal, Quebec, Canada
- Jackie Chi Kit Cheung
- McGill University School of Computer Science, Montreal, Quebec, Canada
- CIFAR AI Chair, Mila-Quebec AI Institute, Montreal, Quebec, Canada
- Samira Abbasgholizadeh Rahimi
- Department of Family Medicine, McGill University, Montreal, Quebec, Canada
- Mila-Quebec AI Institute, Montreal, Quebec, Canada
- Faculty of Dentistry Medicine and Oral Health Sciences, McGill University, Montreal, Quebec, Canada
- Lady Davis Institute for Medical Research, Jewish General Hospital, Montreal, Quebec, Canada
48
Horiuchi D, Tatekawa H, Oura T, Oue S, Walston SL, Takita H, Matsushita S, Mitsuyama Y, Shimono T, Miki Y, Ueda D. Comparing the Diagnostic Performance of GPT-4-based ChatGPT, GPT-4V-based ChatGPT, and Radiologists in Challenging Neuroradiology Cases. Clin Neuroradiol 2024:10.1007/s00062-024-01426-y. [PMID: 38806794 DOI: 10.1007/s00062-024-01426-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/27/2024] [Accepted: 05/06/2024] [Indexed: 05/30/2024]
Abstract
PURPOSE To compare the diagnostic performance of Generative Pre-trained Transformer (GPT)-4-based ChatGPT, GPT-4 with vision (GPT-4V)-based ChatGPT, and radiologists in challenging neuroradiology cases. METHODS We collected 32 consecutive "Freiburg Neuropathology Case Conference" cases published in the journal Clinical Neuroradiology between March 2016 and December 2023. We input the medical history and imaging findings into GPT-4-based ChatGPT and the medical history and images into GPT-4V-based ChatGPT, and each model generated a diagnosis for every case. Six radiologists (three radiology residents and three board-certified radiologists) independently reviewed all cases and provided diagnoses. The diagnostic accuracy rates of ChatGPT and the radiologists were evaluated against the published ground truth. Chi-square tests were performed to compare the diagnostic accuracy of GPT-4-based ChatGPT, GPT-4V-based ChatGPT, and the radiologists. RESULTS GPT-4-based and GPT-4V-based ChatGPT achieved accuracy rates of 22% (7/32) and 16% (5/32), respectively. The radiologists achieved the following accuracy rates: the three radiology residents 28% (9/32), 31% (10/32), and 28% (9/32); and the three board-certified radiologists 38% (12/32), 47% (15/32), and 44% (14/32). GPT-4-based ChatGPT's diagnostic accuracy was lower than that of each radiologist, although not significantly (all p > 0.07). GPT-4V-based ChatGPT's diagnostic accuracy was also lower than that of each radiologist; the difference was significant for two of the board-certified radiologists (p = 0.02 and 0.03) but not for the radiology residents or the remaining board-certified radiologist (all p > 0.09). CONCLUSION Although GPT-4-based ChatGPT demonstrated higher diagnostic performance than GPT-4V-based ChatGPT, neither model reached the performance level of radiology residents or board-certified radiologists in challenging neuroradiology cases.
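The chi-square comparisons reported above can be illustrated from the counts given in the abstract. The sketch below contrasts GPT-4-based ChatGPT (7/32 correct) with the best-performing board-certified radiologist (15/32 correct); it is not the authors' analysis code, and their exact handling of continuity correction is not stated in the abstract.

```python
# Minimal sketch: 2x2 chi-square test on counts taken from the abstract.
from scipy.stats import chi2_contingency

table = [
    [7, 32 - 7],    # GPT-4-based ChatGPT: correct, incorrect
    [15, 32 - 15],  # best board-certified radiologist: correct, incorrect
]
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p_value:.3f}, dof = {dof}")
```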
Affiliation(s)
- Daisuke Horiuchi, Hiroyuki Tatekawa, Tatsushi Oura, Satoshi Oue, Shannon L Walston, Hirotaka Takita, Shu Matsushita, Yasuhito Mitsuyama, Taro Shimono, Yukio Miki, Daiju Ueda
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan (all authors)
- Center for Health Science Innovation, Osaka Metropolitan University, Osaka, Japan (D. Ueda)
49
Duggan R, Tsuruda KM. ChatGPT performance on radiation technologist and therapist entry to practice exams. J Med Imaging Radiat Sci 2024; 55:101426. [PMID: 38797622 DOI: 10.1016/j.jmir.2024.04.019] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/19/2024] [Revised: 04/24/2024] [Accepted: 04/29/2024] [Indexed: 05/29/2024]
Abstract
BACKGROUND The aim of this study was to describe the proficiency of ChatGPT (GPT-4) on certification-style exams from the Canadian Association of Medical Radiation Technologists (CAMRT) and to describe its performance across multiple exam attempts. METHODS ChatGPT was prompted with questions from CAMRT practice exams in the disciplines of radiological technology, magnetic resonance imaging (MRI), nuclear medicine, and radiation therapy (87-98 questions each). ChatGPT attempted each exam five times. Exam performance was evaluated using descriptive statistics, stratified by discipline and question type (knowledge, application, critical thinking). Light's kappa was used to assess agreement in answers across attempts. RESULTS Using a passing grade of 65%, ChatGPT passed the radiological technology exam only once (20%), MRI all five times (100%), nuclear medicine three times (60%), and radiation therapy all five times (100%). ChatGPT's performance was best on knowledge questions across all disciplines except radiation therapy. It performed worst on critical thinking questions. Agreement in ChatGPT's responses across attempts was substantial within the disciplines of radiological technology, MRI, and nuclear medicine, and almost perfect for radiation therapy. CONCLUSION ChatGPT (GPT-4) was able to pass certification-style exams for radiation technologists and therapists, but its performance varied between disciplines. The algorithm demonstrated substantial to almost perfect agreement in the responses it provided across multiple exam attempts. Future research evaluating ChatGPT's performance on standardized tests should consider using repeated measures.
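Light's kappa, used above to quantify agreement across ChatGPT's five attempts, is commonly computed as the mean of pairwise Cohen's kappa values over all pairs of attempts. The sketch below illustrates that calculation on hypothetical answer sequences; it is not the authors' code or data.

```python
# Minimal sketch: Light's kappa as the mean of pairwise Cohen's kappas
# across five exam attempts (hypothetical answer sequences).
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

# One list of selected answer options per attempt, aligned by question.
attempts = [
    list("ABCDABCDAB"),
    list("ABCDABCDCB"),
    list("ABCDABCDAB"),
    list("ABCDABCDAD"),
    list("ABCAABCDAB"),
]

pairwise = [cohen_kappa_score(a, b) for a, b in combinations(attempts, 2)]
lights_kappa = sum(pairwise) / len(pairwise)
print(f"Light's kappa across {len(attempts)} attempts: {lights_kappa:.2f}")
```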
Affiliation(s)
- Ryan Duggan
- School of Health Sciences, Dalhousie University, Halifax, Nova Scotia, Canada; Miramichi Regional Hospital, Horizon Health Network, New Brunswick, Canada.
50
Igarashi Y, Nakahara K, Norii T, Miyake N, Tagami T, Yokobori S. Performance of a Large Language Model on Japanese Emergency Medicine Board Certification Examinations. J NIPPON MED SCH 2024; 91:155-161. [PMID: 38432929 DOI: 10.1272/jnms.jnms.2024_91-205] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/05/2024]
Abstract
BACKGROUND Emergency physicians need a broad range of knowledge and skills to address critical medical, traumatic, and environmental conditions. Artificial intelligence (AI), including large language models (LLMs), has potential applications in healthcare settings; however, the performance of LLMs in emergency medicine remains unclear. METHODS To evaluate the reliability of the information provided by ChatGPT, an LLM was given the questions set by the Japanese Association of Acute Medicine in its board certification examinations over a 5-year period (2018-2022) and was asked to answer each question twice. Statistical analysis was used to assess agreement between the two sets of responses. RESULTS The LLM provided answers to 465 of the 475 text-based questions, achieving an overall correct response rate of 62.3%. For questions without images, the rate of correct answers was 65.9%. For questions that included images, which were not described to the LLM, the rate of correct answers was only 52.0%. The annual rates of correct answers to questions without images ranged from 56.3% to 78.8%. Accuracy was better for scenario-based questions (69.1%) than for stand-alone questions (62.1%). Agreement between the two responses was substantial (kappa = 0.70). Factual errors accounted for 82% of the incorrectly answered questions. CONCLUSION An LLM performed satisfactorily on Japanese emergency medicine board certification examination questions that did not include images. However, factual errors in the responses highlight the need for physician oversight when using LLMs.
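The breakdowns reported above (overall accuracy, accuracy by question format, and agreement between the two repeated runs) can be illustrated with a minimal sketch. The data and column names below are hypothetical placeholders, not the study's materials.

```python
# Minimal sketch: stratified accuracy and two-run agreement (Cohen's kappa)
# on a toy dataset with assumed column names.
import pandas as pd
from sklearn.metrics import cohen_kappa_score

df = pd.DataFrame({
    "has_image":      [False, False, True, False, True, False],
    "scenario_based": [True,  False, True, True,  False, False],
    "run1_correct":   [1, 0, 0, 1, 1, 1],
    "run2_correct":   [1, 0, 1, 1, 1, 1],
})

overall = df["run1_correct"].mean()
by_image = df.groupby("has_image")["run1_correct"].mean()
by_format = df.groupby("scenario_based")["run1_correct"].mean()
kappa = cohen_kappa_score(df["run1_correct"], df["run2_correct"])

print(f"Overall accuracy (run 1): {overall:.1%}")
print(by_image, by_format, sep="\n")
print(f"Agreement between runs (Cohen's kappa): {kappa:.2f}")
```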
Affiliation(s)
- Yutaka Igarashi
- Department of Emergency and Critical Care Medicine, Nippon Medical School
- Kyoichi Nakahara
- Department of Emergency and Critical Care Medicine, Nippon Medical School
- Tatsuya Norii
- Department of Emergency Medicine, University of New Mexico, NM, United States of America
- Nodoka Miyake
- Department of Emergency and Critical Care Medicine, Nippon Medical School
- Takashi Tagami
- Department of Emergency and Critical Care Medicine, Nippon Medical School Musashi Kosugi Hospital
- Shoji Yokobori
- Department of Emergency and Critical Care Medicine, Nippon Medical School