51. Şenoymak MC, Erbatur NH, Şenoymak İ, Fırat SN. The Role of Artificial Intelligence in Endocrine Management: Assessing ChatGPT's Responses to Prolactinoma Queries. J Pers Med 2024; 14:330. [PMID: 38672957] [PMCID: PMC11051052] [DOI: 10.3390/jpm14040330]
Abstract
This research investigates the utility of Chat Generative Pre-trained Transformer (ChatGPT) in addressing patient inquiries related to hyperprolactinemia and prolactinoma. A set of 46 commonly asked questions from patients with prolactinoma was presented to ChatGPT, and responses were evaluated for accuracy on a 6-point Likert scale (1: completely inaccurate to 6: completely accurate) and for adequacy on a 5-point Likert scale (1: completely inadequate to 5: completely adequate). Two independent endocrinologists assessed the responses based on international guidelines. Questions were categorized into groups covering general information, the diagnostic process, the treatment process, follow-up, and the pregnancy period. The median accuracy score was 6.0 (IQR, 5.4-6.0), and the median adequacy score was 4.5 (IQR, 3.5-5.0). The lowest accuracy and adequacy score assigned by either evaluator was two. Substantial agreement was observed between the evaluators, demonstrated by a weighted κ of 0.68 (p = 0.08) for accuracy and a κ of 0.66 (p = 0.04) for adequacy. Kruskal-Wallis tests revealed statistically significant differences among the groups for accuracy (p = 0.005) and adequacy (p = 0.023): the pregnancy period group had the lowest accuracy score, and both the pregnancy period and follow-up groups had the lowest adequacy scores. In conclusion, ChatGPT produced commendable responses to prolactinoma queries; however, certain limitations were observed, particularly in providing accurate information related to the pregnancy period, underscoring the need to refine its capabilities in medical contexts.
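For readers who want to set up this kind of analysis, the sketch below illustrates the two statistics named above (weighted Cohen's kappa for inter-rater agreement, and a Kruskal-Wallis test across question categories) using scikit-learn and SciPy. The ratings are hypothetical, not the study's data, and the authors do not state their tooling.

```python
# Minimal sketch with hypothetical ratings (not the study's data):
# weighted Cohen's kappa for inter-rater agreement and a Kruskal-Wallis
# test across question categories, as named in the abstract.
from sklearn.metrics import cohen_kappa_score
from scipy.stats import kruskal

# Hypothetical 6-point accuracy ratings from two independent reviewers.
rater_a = [6, 5, 6, 4, 6, 5, 6, 3, 6, 5]
rater_b = [6, 5, 5, 4, 6, 6, 6, 2, 6, 5]

# Linear weights penalize near-misses less than distant disagreements.
kappa = cohen_kappa_score(rater_a, rater_b, weights="linear")
print(f"weighted kappa = {kappa:.2f}")

# Hypothetical accuracy scores grouped by question category.
general_info = [6, 6, 5, 6]
treatment = [5, 6, 6, 5]
pregnancy = [4, 3, 5, 4]
h_stat, p_value = kruskal(general_info, treatment, pregnancy)
print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {p_value:.3f}")
```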
Affiliation(s)
- Mustafa Can Şenoymak
- Department of Endocrinology and Metabolism, University of Health Sciences, Sultan Abdulhamid Han Training and Research Hospital, Istanbul 34668, Turkey
- Nuriye Hale Erbatur
- Department of Endocrinology and Metabolism, University of Health Sciences, Sultan Abdulhamid Han Training and Research Hospital, Istanbul 34668, Turkey
- İrem Şenoymak
- Family Medicine Department, Usküdar State Hospital, Istanbul 34662, Turkey
- Sevde Nur Fırat
- Department of Endocrinology and Metabolism, University of Health Sciences, Ankara Training and Research Hospital, Ankara 06230, Turkey
52. Chervonski E, Harish KB, Rockman CB, Sadek M, Teter KA, Jacobowitz GR, Berland TL, Lohr J, Moore C, Maldonado TS. Generative artificial intelligence chatbots may provide appropriate informational responses to common vascular surgery questions by patients. Vascular 2024:17085381241240550. [PMID: 38500300] [DOI: 10.1177/17085381241240550]
Abstract
OBJECTIVES Generative artificial intelligence (AI) has emerged as a promising tool to engage with patients. The objective of this study was to assess the quality of AI responses to common patient questions regarding vascular surgery disease processes. METHODS OpenAI's ChatGPT-3.5 and Google Bard were queried with 24 mock patient questions spanning seven vascular surgery disease domains. Six experienced vascular surgery faculty at a tertiary academic center independently graded AI responses on their accuracy (rated 1-4 from completely inaccurate to completely accurate), completeness (rated 1-4 from totally incomplete to totally complete), and appropriateness (binary). Responses were also evaluated with three readability scales. RESULTS ChatGPT responses were rated, on average, more accurate than Bard responses (3.08 ± 0.33 vs 2.82 ± 0.40, p < .01) and more complete (2.98 ± 0.34 vs 2.62 ± 0.36, p < .01). Most ChatGPT responses (75.0%, n = 18) and almost half of Bard responses (45.8%, n = 11) were unanimously deemed appropriate. Almost one-third of Bard responses (29.2%, n = 7) were deemed inappropriate by at least two reviewers, and two Bard responses (8.4%) were considered inappropriate by the majority. The mean Flesch Reading Ease, Flesch-Kincaid Grade Level, and Gunning Fog Index of ChatGPT responses were 29.4 ± 10.8, 14.5 ± 2.2, and 17.7 ± 3.1, respectively, indicating that responses were readable with a post-secondary education. Bard's mean readability scores were 58.9 ± 10.5, 8.2 ± 1.7, and 11.0 ± 2.0, respectively, indicating that responses were readable with a high-school education (p < .0001 for all three metrics). ChatGPT's mean response length (332 ± 79 words) was greater than Bard's (183 ± 53 words, p < .001). There was no difference in the accuracy, completeness, readability, or response length of either chatbot across disease domains (p > .05 for all analyses). CONCLUSIONS AI offers a novel means of educating patients that avoids the inundation of information from "Dr Google" and the time barriers of physician-patient encounters. ChatGPT provides largely valid, though imperfect, responses to myriad patient questions at the expense of readability. While Bard responses are more readable and concise, their quality is poorer. Further research is warranted to better understand failure points for large language models in vascular surgery patient education.
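The three readability scales reported above can be computed directly from response text. A minimal sketch follows, scoring a hypothetical chatbot answer with the third-party textstat package; the study does not state which tooling its authors used.

```python
# Minimal sketch: scoring a hypothetical chatbot response on the three
# readability scales used in the study. Assumes the textstat package
# (pip install textstat); the authors' actual tooling is not stated.
import textstat

response = (
    "Peripheral arterial disease occurs when narrowed arteries reduce "
    "blood flow to the limbs, most often because of atherosclerosis. "
    "Symptoms can include leg pain with walking that eases with rest."
)

print("Flesch Reading Ease:", textstat.flesch_reading_ease(response))
print("Flesch-Kincaid Grade Level:", textstat.flesch_kincaid_grade(response))
print("Gunning Fog Index:", textstat.gunning_fog(response))
```

Higher Flesch Reading Ease means easier text, while the two grade-level indices rise with difficulty, which is why ChatGPT's lower Reading Ease and higher grade scores both point to harder-to-read responses.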
Affiliation(s)
- Ethan Chervonski
- New York University Grossman School of Medicine, New York, NY, USA
- Keerthi B Harish
- New York University Grossman School of Medicine, New York, NY, USA
- Caron B Rockman
- Division of Vascular & Endovascular Surgery, Department of Surgery, New York University Langone Health, New York, NY, USA
- Mikel Sadek
- Division of Vascular & Endovascular Surgery, Department of Surgery, New York University Langone Health, New York, NY, USA
- Katherine A Teter
- Division of Vascular & Endovascular Surgery, Department of Surgery, New York University Langone Health, New York, NY, USA
- Glenn R Jacobowitz
- Division of Vascular & Endovascular Surgery, Department of Surgery, New York University Langone Health, New York, NY, USA
- Todd L Berland
- Division of Vascular & Endovascular Surgery, Department of Surgery, New York University Langone Health, New York, NY, USA
- Joann Lohr
- Dorn Veterans Affairs Medical Center, Columbia, SC, USA
- Thomas S Maldonado
- Division of Vascular & Endovascular Surgery, Department of Surgery, New York University Langone Health, New York, NY, USA
53. Wu RC, Li DX, Feng DC. Re: Michael Eppler, Conner Ganjavi, Lorenzo Storino Ramacciotti, et al. Awareness and Use of ChatGPT and Large Language Models: A Prospective Cross-sectional Global Survey in Urology. Eur Urol. 2024;85:146-53. Eur Urol 2024; 85:e87-e88. [PMID: 38151444] [DOI: 10.1016/j.eururo.2023.11.029]
Affiliation(s)
- Rui-Cheng Wu
- Department of Urology, Institute of Urology, West China Hospital, Sichuan University, Chengdu, China
- Deng-Xiong Li
- Department of Urology, Institute of Urology, West China Hospital, Sichuan University, Chengdu, China
- De-Chao Feng
- Department of Urology, Institute of Urology, West China Hospital, Sichuan University, Chengdu, China
54. Wu SH, Tong WJ, Li MD, Hu HT, Lu XZ, Huang ZR, Lin XX, Lu RF, Lu MD, Chen LD, Wang W. Collaborative Enhancement of Consistency and Accuracy in US Diagnosis of Thyroid Nodules Using Large Language Models. Radiology 2024; 310:e232255. [PMID: 38470237] [DOI: 10.1148/radiol.232255]
Abstract
Background Large language models (LLMs) hold substantial promise for medical imaging interpretation. However, there is a lack of studies on their feasibility in handling reasoning questions associated with medical diagnosis. Purpose To investigate the viability of leveraging three publicly available LLMs to enhance consistency and diagnostic accuracy in medical imaging based on standardized reporting, with pathology as the reference standard. Materials and Methods US images of thyroid nodules with pathologic results were retrospectively collected from a tertiary referral hospital between July 2022 and December 2022 and used to evaluate malignancy diagnoses generated by three LLMs: OpenAI's ChatGPT 3.5 and ChatGPT 4.0, and Google's Bard. Inter- and intra-LLM agreement of diagnoses was evaluated. Diagnostic performance, including accuracy, sensitivity, specificity, and area under the receiver operating characteristic curve (AUC), was then evaluated and compared for the LLMs and three interactive approaches: human reader combined with LLMs, image-to-text model combined with LLMs, and an end-to-end convolutional neural network model. Results A total of 1161 US images of thyroid nodules (498 benign, 663 malignant) from 725 patients (mean age, 42.2 years ± 14.1 [SD]; 516 women) were evaluated. ChatGPT 4.0 and Bard displayed substantial to almost perfect intra-LLM agreement (κ range, 0.65-0.86 [95% CI: 0.64, 0.86]), while ChatGPT 3.5 showed fair to substantial agreement (κ range, 0.36-0.68 [95% CI: 0.36, 0.68]). ChatGPT 4.0 had an accuracy of 78%-86% (95% CI: 76%, 88%) and sensitivity of 86%-95% (95% CI: 83%, 96%), compared with 74%-86% (95% CI: 71%, 88%) and 74%-91% (95% CI: 71%, 93%), respectively, for Bard. Moreover, with ChatGPT 4.0, the image-to-text-LLM strategy exhibited an AUC (0.83 [95% CI: 0.80, 0.85]) and accuracy (84% [95% CI: 82%, 86%]) comparable to those of the human-LLM interaction strategy with two senior readers and one junior reader, and exceeding those of the human-LLM interaction strategy with one junior reader. Conclusion LLMs, particularly when integrated with image-to-text approaches, show potential for enhancing diagnostic medical imaging. ChatGPT 4.0 was optimal for consistency and diagnostic accuracy when compared with Bard and ChatGPT 3.5.
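The diagnostic metrics reported above follow directly from a binary comparison against the pathology reference standard. The sketch below is a hypothetical illustration using scikit-learn, not the study's data or code.

```python
# Minimal sketch with hypothetical labels (not the study's data):
# accuracy, sensitivity, specificity, and AUC against a pathology
# reference standard. y_true encodes pathology (1 = malignant),
# y_pred a model's binary call, y_score its malignancy probability.
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 1])
y_score = np.array([0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.6, 0.1, 0.95, 0.85])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("accuracy:", accuracy_score(y_true, y_pred))
print("sensitivity:", tp / (tp + fn))  # true-positive rate
print("specificity:", tn / (tn + fp))  # true-negative rate
print("AUC:", roc_auc_score(y_true, y_score))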
Affiliation(s)
- Shao-Hong Wu, Wen-Juan Tong, Ming-De Li, Hang-Tong Hu, Xiao-Zhou Lu, Ze-Rong Huang, Xin-Xin Lin, Rui-Fang Lu, Ming-De Lu, Li-Da Chen, Wei Wang
- From the Department of Medical Ultrasonics, Ultrasomics Artificial Intelligence X-Laboratory, Institute of Diagnostic and Interventional Ultrasound, First Affiliated Hospital of Sun Yat-sen University, No. 58 Zhongshan Rd 2, Guangzhou 510080, People's Republic of China (S.H.W., W.J.T., M.D. Li, H.T.H., Z.R.H., X.X.L., R.F.L., M.D. Lu, L.D.C., W.W.); and Department of Traditional Chinese Medicine, First Affiliated Hospital of Sun Yat-sen University, Guangzhou, People's Republic of China (X.Z.L.)
55. Coskun BN, Yagiz B, Ocakoglu G, Dalkilic E, Pehlivan Y. Assessing the accuracy and completeness of artificial intelligence language models in providing information on methotrexate use. Rheumatol Int 2024; 44:509-515. [PMID: 37747564] [DOI: 10.1007/s00296-023-05473-5]
Abstract
We aimed to assess large language models (LLMs) - ChatGPT 3.5, ChatGPT 4, BARD, and Bing - in their accuracy and completeness when answering methotrexate (MTX)-related questions for treating rheumatoid arthritis. We employed 23 questions from an earlier study related to MTX concerns. These questions were entered into the LLMs, and the responses generated by each model were evaluated by two reviewers using Likert scales to assess accuracy and completeness. The GPT models achieved a 100% correct answer rate, while BARD and Bing scored 73.91%. In terms of accuracy of the outputs (completely correct responses), GPT-4 achieved a score of 100%, GPT-3.5 secured 86.96%, and BARD and Bing each scored 60.87%. BARD produced 17.39% incorrect responses and 8.7% non-responses, while Bing recorded 13.04% incorrect and 13.04% non-responses. The ChatGPT models produced significantly more accurate responses than Bing for the "mechanism of action" category, and the GPT-4 model showed significantly higher accuracy than BARD in the "side effects" category. There were no statistically significant differences among the models for the "lifestyle" category. GPT-4 achieved a comprehensive output rate of 100%, followed by GPT-3.5 at 86.96%, BARD at 60.86%, and Bing at 0%. In the "mechanism of action" category, both ChatGPT models and BARD produced significantly more comprehensive outputs than Bing. For the "side effects" and "lifestyle" categories, the ChatGPT models showed significantly higher completeness than Bing. The GPT models, particularly GPT-4, demonstrated superior performance in providing accurate and comprehensive patient information about MTX use. However, the study also identified inaccuracies and shortcomings in the generated responses.
Affiliation(s)
- Belkis Nihan Coskun
- Division of Rheumatology, Department of Internal Medicine, Faculty of Medicine, Bursa Uludag University, Bursa, Turkey
- Burcu Yagiz
- Division of Rheumatology, Department of Internal Medicine, Faculty of Medicine, Bursa Uludag University, Bursa, Turkey
- Gokhan Ocakoglu
- Department of Biostatistics, Faculty of Medicine, Bursa Uludag University, Bursa, Turkey
- Ediz Dalkilic
- Division of Rheumatology, Department of Internal Medicine, Faculty of Medicine, Bursa Uludag University, Bursa, Turkey
- Yavuz Pehlivan
- Division of Rheumatology, Department of Internal Medicine, Faculty of Medicine, Bursa Uludag University, Bursa, Turkey
56. Abi-Rafeh J, Mroueh VJ, Bassiri-Tehrani B, Marks J, Kazan R, Nahai F. Complications Following Body Contouring: Performance Validation of Bard, a Novel AI Large Language Model, in Triaging and Managing Postoperative Patient Concerns. Aesthetic Plast Surg 2024; 48:953-976. [PMID: 38273152] [DOI: 10.1007/s00266-023-03819-9]
Abstract
INTRODUCTION Large language models (LLMs) have revolutionized the way humans interact with artificial intelligence (AI) technology, with marked potential for applications in esthetic surgery. The present study evaluates the performance of Bard, a novel LLM, in identifying and managing postoperative patient concerns for complications following body contouring surgery. METHODS The American Society of Plastic Surgeons' website was queried to identify and simulate all potential postoperative complications following body contouring across different acuities and severities. Bard's accuracy was assessed in providing a differential diagnosis, soliciting a history, suggesting the most likely diagnosis, recommending an appropriate disposition, suggesting treatments/interventions to begin at home, and identifying red-flag signs/symptoms indicating deterioration or requiring urgent emergency department (ED) presentation. RESULTS Twenty-two simulated body contouring complications were examined. Overall, Bard demonstrated 59% accuracy in listing relevant diagnoses on its differentials, with a 52% incidence of incorrect or misleading diagnoses. Following history-taking, Bard demonstrated an overall accuracy of 44% in identifying the most likely diagnosis and 55% accuracy in suggesting the indicated medical disposition. Helpful treatments/interventions to begin at home were suggested with 40% accuracy, whereas red-flag signs/symptoms indicating deterioration were shared with 48% accuracy. A detailed analysis of performance, stratified according to latency of postoperative presentation (<48 hours, 48 hours to 1 month, or >1 month postoperatively) and according to acuity and indicated medical disposition, is presented herein. CONCLUSIONS Despite the promising potential of LLMs and AI in healthcare-related applications, Bard's performance in the present study falls significantly short of accepted clinical standards, indicating a need for further research and development prior to adoption. LEVEL OF EVIDENCE IV This journal requires that authors assign a level of evidence to each article. For a full description of these Evidence-Based Medicine ratings, please refer to the Table of Contents or the online Instructions to Authors at www.springer.com/00266.
Affiliation(s)
- Jad Abi-Rafeh
- Division of Plastic, Reconstructive, and Aesthetic Surgery, McGill University Health Centre, Montreal, QC, Canada
- Vanessa J Mroueh
- Brigham and Women's Hospital, Harvard Medical School, Boston, Massachusetts, USA
- Jacob Marks
- Manhattan Eye, Ear, and Throat Hospital, New York, NY, USA
- Roy Kazan
- Division of Plastic, Reconstructive, and Aesthetic Surgery, McGill University Health Centre, Montreal, QC, Canada
- Foad Nahai
- Department of Surgery, Emory University, Atlanta, GA, USA
57. Rahsepar AA. Large Language Models for Enhancing Radiology Report Impressions: Improve Readability While Decreasing Burnout. Radiology 2024; 310:e240498. [PMID: 38530179] [DOI: 10.1148/radiol.240498]
Affiliation(s)
- Amir Ali Rahsepar
- From the Department of Radiology, Northwestern Memorial Hospital, 676 N Saint Clair St, Arkes Family Pavilion Suite 800, Chicago, IL 60611
58. Bera K, O'Connor G, Jiang S, Tirumani SH, Ramaiya N. Analysis of ChatGPT publications in radiology: Literature so far. Curr Probl Diagn Radiol 2024; 53:215-225. [PMID: 37891083] [DOI: 10.1067/j.cpradiol.2023.10.013]
Abstract
OBJECTIVE To perform a detailed qualitative and quantitative analysis of the literature published on ChatGPT and radiology in the nine months since its public release, detailing the scope of the work in this short timeframe. METHODS A systematic literature search of the MEDLINE and EMBASE databases was carried out through August 15, 2023 for articles focused on ChatGPT and imaging/radiology. Articles were classified into original research and reviews/perspectives. Quantitative analysis was carried out by two experienced radiologists using objective scoring systems for evaluating original and non-original research. RESULTS Fifty-one articles involving ChatGPT and radiology/imaging were published between 26 Jan 2023 and 14 Aug 2023. Twenty-three were original research; the rest were reviews/perspectives or brief communications. For the quantitative analysis scored by two readers, we included 23 original research and 17 non-original research articles (after excluding 11 letters written in response to previous articles). The mean score for original research was 3.20 out of 5 (across five questions), while the mean score for non-original research was 1.17 out of 2 (across six questions). The mean score grading the performance of ChatGPT in original research was 3.20 out of 5 (across two questions). DISCUSSION While it is early days for ChatGPT and its impact on radiology, there has already been a plethora of articles discussing the multifaceted nature of the tool and how it can affect every aspect of radiology, from patient education, pre-authorization, protocol selection, and generating differentials to structuring radiology reports. Most articles show impressive performance by ChatGPT, which can only improve with more research and improvements in the tool itself. Several articles have also highlighted the limitations of ChatGPT in its current iteration, which will allow radiologists and researchers to improve these areas.
Affiliation(s)
- Kaustav Bera
- Department of Radiology, University Hospitals Cleveland Medical Center, 11000 Euclid Avenue, Cleveland, OH 44106, USA
- Gregory O'Connor
- Department of Radiology, University Hospitals Cleveland Medical Center, 11000 Euclid Avenue, Cleveland, OH 44106, USA
- Sirui Jiang
- Department of Radiology, University Hospitals Cleveland Medical Center, 11000 Euclid Avenue, Cleveland, OH 44106, USA
- Sree Harsha Tirumani
- Department of Radiology, University Hospitals Cleveland Medical Center, 11000 Euclid Avenue, Cleveland, OH 44106, USA
- Nikhil Ramaiya
- Department of Radiology, University Hospitals Cleveland Medical Center, 11000 Euclid Avenue, Cleveland, OH 44106, USA
59. Doddi S, Hibshman T, Salichs O, Bera K, Tippareddy C, Ramaiya N, Tirumani SH. Assessing appropriate responses to ACR urologic imaging scenarios using ChatGPT and Bard. Curr Probl Diagn Radiol 2024; 53:226-229. [PMID: 37891086] [DOI: 10.1067/j.cpradiol.2023.10.022]
Abstract
Artificial intelligence (AI) has recently become a trending tool and topic, especially with publicly available free services such as ChatGPT and Bard. In this report, we investigate whether two widely available chatbots, ChatGPT and Bard, can consistently and accurately recommend the best imaging modality for urologic clinical situations, in line with the American College of Radiology (ACR) Appropriateness Criteria (AC). All clinical scenarios provided by the ACR were input into ChatGPT and Bard, and the results were compared against the ACR AC and recorded. Both chatbots selected an appropriate imaging modality in 62% of scenarios, and no significant difference in the proportion of correct imaging modalities was found between the two services overall (p>0.05). Our results indicate that ChatGPT and Bard are similar in their ability to suggest the most appropriate imaging modality in a variety of urologic scenarios based on the ACR AC. Nonetheless, both chatbots lack consistent accuracy, and further development is necessary before they can be implemented in clinical settings or integrated into physician workflows for clinical decision making.
Affiliation(s)
- Sishir Doddi
- University of Toledo College of Medicine, Toledo, OH, United States
- Taryn Hibshman
- University of Toledo College of Medicine, Toledo, OH, United States
- Oscar Salichs
- University of Toledo College of Medicine, Toledo, OH, United States
- Kaustav Bera
- Department of Radiology, University Hospitals Cleveland Medical Center, Cleveland, OH
- Charit Tippareddy
- Department of Radiology, University Hospitals Cleveland Medical Center, Cleveland, OH
- Nikhil Ramaiya
- Department of Radiology, University Hospitals Cleveland Medical Center, Cleveland, OH
- Sree Harsha Tirumani
- Department of Radiology, University Hospitals Cleveland Medical Center, Cleveland, OH
60. Hu Y, Hu Z, Liu W, Gao A, Wen S, Liu S, Lin Z. Exploring the potential of ChatGPT as an adjunct for generating diagnosis based on chief complaint and cone beam CT radiologic findings. BMC Med Inform Decis Mak 2024; 24:55. [PMID: 38374067] [PMCID: PMC10875853] [DOI: 10.1186/s12911-024-02445-y]
Abstract
AIM This study aimed to assess the performance of OpenAI's ChatGPT in generating diagnoses based on the chief complaint and cone beam computed tomography (CBCT) radiologic findings. MATERIALS AND METHODS 102 CBCT reports (48 with dental diseases (DD) and 54 with neoplastic/cystic diseases (N/CD)) were collected. ChatGPT was provided with the chief complaint and CBCT radiologic findings. Diagnostic outputs from ChatGPT were scored on five-point Likert scales: for diagnosis accuracy, scoring was based on the accuracy of the chief-complaint-related diagnosis and chief-complaint-unrelated diagnoses (1-5 points); for diagnosis completeness, scoring was based on how many accurate diagnoses were included in ChatGPT's output for a case (1-5 points); and for text quality, scoring was based on how many text errors were included in ChatGPT's output for a case (1-5 points). For the 54 N/CD cases, the consistency of the diagnoses generated by ChatGPT with the pathological diagnosis was also calculated. The constitution of text errors in ChatGPT's outputs was evaluated. RESULTS After subjective rating by expert reviewers on five-point Likert scales, the final scores for diagnosis accuracy, diagnosis completeness, and text quality across the 102 cases were 3.7, 4.5, and 4.6, respectively. For diagnosis accuracy, ChatGPT performed significantly better on N/CD (3.8/5) than on DD (3.6/5). Of the 54 N/CD cases, 21 (38.9%) had a first diagnosis completely consistent with the pathological diagnosis. No text errors were observed in 88.7% of the 390 text items. CONCLUSION ChatGPT showed potential in generating radiographic diagnoses based on the chief complaint and radiologic findings. However, its performance varied with task complexity, necessitating professional oversight due to a certain error rate.
Affiliation(s)
- Yanni Hu
- Department of Dentomaxillofacial Radiology, Nanjing Stomatological Hospital, Affiliated Hospital of Medical School, Institute of Stomatology, Nanjing University, Nanjing, Jiangsu, People's Republic of China
- Ziyang Hu
- Department of Dentomaxillofacial Radiology, Nanjing Stomatological Hospital, Affiliated Hospital of Medical School, Institute of Stomatology, Nanjing University, Nanjing, Jiangsu, People's Republic of China
- Department of Stomatology, Shenzhen Longhua District Central Hospital, Shenzhen, People's Republic of China
- Wenjing Liu
- Department of Dentomaxillofacial Radiology, Nanjing Stomatological Hospital, Affiliated Hospital of Medical School, Institute of Stomatology, Nanjing University, Nanjing, Jiangsu, People's Republic of China
- Antian Gao
- Department of Dentomaxillofacial Radiology, Nanjing Stomatological Hospital, Affiliated Hospital of Medical School, Institute of Stomatology, Nanjing University, Nanjing, Jiangsu, People's Republic of China
- Shanhui Wen
- Department of Dentomaxillofacial Radiology, Nanjing Stomatological Hospital, Affiliated Hospital of Medical School, Institute of Stomatology, Nanjing University, Nanjing, Jiangsu, People's Republic of China
- Shu Liu
- Department of Dentomaxillofacial Radiology, Nanjing Stomatological Hospital, Affiliated Hospital of Medical School, Institute of Stomatology, Nanjing University, Nanjing, Jiangsu, People's Republic of China
- Zitong Lin
- Department of Dentomaxillofacial Radiology, Nanjing Stomatological Hospital, Affiliated Hospital of Medical School, Institute of Stomatology, Nanjing University, Nanjing, Jiangsu, People's Republic of China
61. Peng W, Feng Y, Yao C, Zhang S, Zhuo H, Qiu T, Zhang Y, Tang J, Gu Y, Sun Y. Evaluating AI in medicine: a comparative analysis of expert and ChatGPT responses to colorectal cancer questions. Sci Rep 2024; 14:2840. [PMID: 38310152] [PMCID: PMC10838275] [DOI: 10.1038/s41598-024-52853-3]
Abstract
Colorectal cancer (CRC) is a global health challenge, and patient education plays a crucial role in its early detection and treatment. Despite progress in AI technology, as exemplified by transformer-based models such as ChatGPT, there remains a lack of in-depth understanding of their efficacy for medical purposes. We aimed to assess the proficiency of ChatGPT in the field of popular science, specifically in answering questions related to CRC diagnosis and treatment, using the book "Colorectal Cancer: Your Questions Answered" as a reference. In total, 131 valid questions from the book were manually input into ChatGPT. Responses were evaluated by clinical physicians in the relevant fields based on comprehensiveness and accuracy of information, and scores were standardized for comparison. ChatGPT showed high reproducibility in its responses, with high uniformity in comprehensiveness, accuracy, and final scores. However, the mean scores of ChatGPT's responses were significantly lower than the benchmarks, indicating that it has not reached an expert level of competence in CRC. While it could provide accurate information, it lacked comprehensiveness. Notably, ChatGPT performed well in the domains of radiation therapy, interventional therapy, stoma care, venous care, and pain control, almost rivaling the benchmarks, but fell short in the basic information, surgery, and internal medicine domains. While ChatGPT demonstrated promise in specific domains, its overall performance in providing CRC information falls short of expert standards, indicating the need for further advancements in AI technology for patient education in healthcare.
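The phrase "scores were standardized for comparison" suggests a z-score normalization so that reviewer scores on different scales become comparable. A minimal sketch of that reading follows, with hypothetical numbers; the paper does not spell out its exact procedure.

```python
# Minimal sketch (hypothetical scores): z-score standardization so that
# ChatGPT response scores and expert benchmarks share a common scale.
# One plausible reading of the abstract, not the authors' documented method.
import numpy as np

scores = np.array([7.5, 8.0, 6.5, 9.0, 7.0])       # raw reviewer scores
z = (scores - scores.mean()) / scores.std(ddof=1)  # sample z-scores
print(np.round(z, 2))
```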
Affiliation(s)
- Wen Peng
- Department of General Surgery, The First Affiliated Hospital with Nanjing Medical University, Nanjing, 210029, Jiangsu, People's Republic of China
- The First School of Clinical Medicine, Nanjing Medical University, Nanjing, China
- Yifei Feng
- Department of General Surgery, The First Affiliated Hospital with Nanjing Medical University, Nanjing, 210029, Jiangsu, People's Republic of China
- The First School of Clinical Medicine, Nanjing Medical University, Nanjing, China
- Cui Yao
- Department of General Surgery, The First Affiliated Hospital with Nanjing Medical University, Nanjing, 210029, Jiangsu, People's Republic of China
- The First School of Clinical Medicine, Nanjing Medical University, Nanjing, China
- Sheng Zhang
- Department of Radiotherapy, The First Affiliated Hospital with Nanjing Medical University, Nanjing, People's Republic of China
- Han Zhuo
- Department of Intervention, The First Affiliated Hospital with Nanjing Medical University, Nanjing, People's Republic of China
- Tianzhu Qiu
- Department of Oncology, The First Affiliated Hospital with Nanjing Medical University, Nanjing, People's Republic of China
- Yi Zhang
- Department of General Surgery, The First Affiliated Hospital with Nanjing Medical University, Nanjing, 210029, Jiangsu, People's Republic of China
- The First School of Clinical Medicine, Nanjing Medical University, Nanjing, China
- Junwei Tang
- Department of General Surgery, The First Affiliated Hospital with Nanjing Medical University, Nanjing, 210029, Jiangsu, People's Republic of China
- The First School of Clinical Medicine, Nanjing Medical University, Nanjing, China
- Yanhong Gu
- Department of Oncology, The First Affiliated Hospital with Nanjing Medical University, Nanjing, People's Republic of China
- Yueming Sun
- Department of General Surgery, The First Affiliated Hospital with Nanjing Medical University, Nanjing, 210029, Jiangsu, People's Republic of China
- The First School of Clinical Medicine, Nanjing Medical University, Nanjing, China
62. Toyama Y, Harigai A, Abe M, Nagano M, Kawabata M, Seki Y, Takase K. Performance evaluation of ChatGPT, GPT-4, and Bard on the official board examination of the Japan Radiology Society. Jpn J Radiol 2024; 42:201-207. [PMID: 37792149] [PMCID: PMC10811006] [DOI: 10.1007/s11604-023-01491-2]
Abstract
PURPOSE Herein, we assessed the accuracy of large language models (LLMs) in generating responses to questions in clinical radiology practice. We compared the performance of ChatGPT, GPT-4, and Google Bard using questions from the Japan Radiology Board Examination (JRBE). MATERIALS AND METHODS In total, 103 questions from the JRBE 2022 were used with permission from the Japan Radiological Society. These questions were categorized by pattern, required level of thinking, and topic. McNemar's test was used to compare the proportion of correct responses between the LLMs, and Fisher's exact test was used to assess the performance of GPT-4 for each topic category. RESULTS ChatGPT, GPT-4, and Google Bard correctly answered 40.8% (42 of 103), 65.0% (67 of 103), and 38.8% (40 of 103) of the questions, respectively. GPT-4 significantly outperformed ChatGPT by 24.2 percentage points (p < 0.001) and Google Bard by 26.2 percentage points (p < 0.001). In the categorical analysis by level of thinking, GPT-4 correctly answered 79.7% of the lower-order questions, significantly more than ChatGPT or Google Bard (p < 0.001). The categorical analysis by question pattern revealed GPT-4's superiority over ChatGPT (67.4% vs. 46.5%, p = 0.004) and Google Bard (39.5%, p < 0.001) on single-answer questions. The categorical analysis by topic revealed that GPT-4 outperformed ChatGPT (40%, p = 0.013) and Google Bard (26.7%, p = 0.004). No significant differences were observed between the LLMs in the remaining categories. The performance of GPT-4 was significantly better in nuclear medicine (93.3%) than in diagnostic radiology (55.8%; p < 0.001). GPT-4 also performed better on lower-order questions than on higher-order questions (79.7% vs. 45.5%, p < 0.001). CONCLUSION ChatGPT Plus based on GPT-4 scored 65% on Japanese questions from the JRBE, outperforming ChatGPT and Google Bard. This highlights the potential of using LLMs to address advanced clinical questions in the field of radiology in Japan.
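Both tests named in the methods are available in standard Python statistics libraries. The sketch below uses hypothetical contingency tables, not the study's data, to show how a paired accuracy comparison (McNemar) and a topic-level comparison (Fisher's exact) are typically set up.

```python
# Minimal sketch with hypothetical counts (not the study's data):
# McNemar's test for paired correct/incorrect answers from two models
# on the same questions, and Fisher's exact test for a topic comparison.
from statsmodels.stats.contingency_tables import mcnemar
from scipy.stats import fisher_exact

# Paired 2x2 table: rows = model A correct/incorrect, cols = model B.
# e.g., 30 questions both correct, 25 only A, 5 only B, 43 neither.
paired = [[30, 25],
          [5, 43]]
result = mcnemar(paired, exact=True)
print(f"McNemar p = {result.pvalue:.4f}")

# Unpaired 2x2 table: correct vs incorrect counts in two topic categories.
topic = [[14, 1],    # hypothetical topic A: correct, incorrect
         [29, 23]]   # hypothetical topic B: correct, incorrect
odds_ratio, p_value = fisher_exact(topic)
print(f"Fisher's exact p = {p_value:.4f}")
```

McNemar's test is the appropriate choice here because the two models answer the same questions, so only the discordant pairs (off-diagonal cells) carry information about which model is better.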
Affiliation(s)
- Yoshitaka Toyama
- Department of Diagnostic Radiology, Tohoku University Hospital, 1-1 Seiryo-Machi, Aoba-Ku, Sendai, 980-8575, Japan
- Ayaka Harigai
- Department of Diagnostic Radiology, Tohoku University Hospital, 1-1 Seiryo-Machi, Aoba-Ku, Sendai, 980-8575, Japan
- Department of Radiology, Tohoku Medical and Pharmaceutical University, Sendai, Japan
- Department of Diagnostic Radiology, Tohoku University Graduate School of Medicine, Sendai, Japan
- Mirei Abe
- Department of Diagnostic Radiology, Tohoku University Hospital, 1-1 Seiryo-Machi, Aoba-Ku, Sendai, 980-8575, Japan
- Masahiro Kawabata
- Department of Diagnostic Radiology, Tohoku University Hospital, 1-1 Seiryo-Machi, Aoba-Ku, Sendai, 980-8575, Japan
- Yasuhiro Seki
- Department of Radiation Oncology, Tohoku University Hospital, Sendai, Japan
- Kei Takase
- Department of Diagnostic Radiology, Tohoku University Graduate School of Medicine, Sendai, Japan
63. Kim S, Lee CK, Kim SS. Large Language Models: A Guide for Radiologists. Korean J Radiol 2024; 25:126-133. [PMID: 38288895] [PMCID: PMC10831297] [DOI: 10.3348/kjr.2023.0997]
Abstract
Large language models (LLMs) have revolutionized the global landscape of technology beyond natural language processing. Owing to their extensive pre-training on vast datasets, contemporary LLMs can handle tasks ranging from general functionalities to domain-specific areas, such as radiology, without additional fine-tuning. General-purpose chatbots based on LLMs can optimize the efficiency of radiologists in their professional work and research endeavors. Importantly, these LLMs are on a trajectory of rapid evolution in which challenges such as "hallucination," high training cost, and efficiency issues are being addressed, along with the inclusion of multimodal inputs. In this review, we aim to offer conceptual knowledge and actionable guidance to radiologists interested in utilizing LLMs, through a succinct overview of the topic and a summary of radiology-specific aspects, from foundational concepts to potential future directions.
Affiliation(s)
- Sunkyu Kim
- Department of Computer Science and Engineering, Korea University, Seoul, Republic of Korea
- AIGEN Sciences, Seoul, Republic of Korea
- Choong-Kun Lee
- Division of Medical Oncology, Department of Internal Medicine, Yonsei University College of Medicine, Seoul, Republic of Korea
- Seung-Seob Kim
- Department of Radiology and Research Institute of Radiological Science, Severance Hospital, Yonsei University College of Medicine, Seoul, Republic of Korea
64. Mediboina A, Badam RK, Chodavarapu S. Assessing the Accuracy of Information on Medication Abortion: A Comparative Analysis of ChatGPT and Google Bard AI. Cureus 2024; 16:e51544. [PMID: 38318564] [PMCID: PMC10840059] [DOI: 10.7759/cureus.51544]
Abstract
Background and objective ChatGPT and Google Bard AI are widely used conversational chatbots, even in healthcare. While they have several strengths, they can generate seemingly correct but erroneous responses, warranting caution in medical contexts. In an era where access to abortion care is diminishing, patients may increasingly rely on online resources and AI-driven language models for information on medication abortions. In light of this, this study aimed to compare the accuracy and comprehensiveness of responses generated by ChatGPT 3.5 and Google Bard AI to medical queries about medication abortions. Methods Fourteen open-ended questions about medication abortion were formulated based on the Frequently Asked Questions (FAQs) from the National Abortion Federation (NAF) and the Reproductive Health Access Project (RHAP) websites. These questions were answered using ChatGPT version 3.5 and Google Bard AI on October 7, 2023. The accuracy of the responses was analyzed by cross-referencing the generated answers against the information provided by NAF and RHAP; any discrepancies were further verified against the guidelines from the American College of Obstetricians and Gynecologists (ACOG). A rating scale used by Johnson et al. was employed for assessment, with a 6-point Likert scale [ranging from 1 (completely incorrect) to 6 (correct)] to evaluate accuracy and a 3-point scale [ranging from 1 (incomplete) to 3 (comprehensive)] to assess completeness. Questions that did not yield answers were assigned a score of 0 and omitted from the correlation analysis. Data analysis and visualization were done using R software version 4.3.1, with statistical significance determined using Spearman's rank correlation and Mann-Whitney U tests. Results All questions were entered sequentially into both chatbots by the same author. On the initial attempt, ChatGPT successfully generated relevant responses for all questions, while Google Bard AI failed to provide answers for five. Repeating the same question in Google Bard AI yielded an answer for one; two were answered with different phrasing; and two remained unanswered despite rephrasing. ChatGPT showed a median accuracy score of 5 (mean: 5.26, SD: 0.73) and a median completeness score of 3 (mean: 2.57, SD: 0.51), achieving the highest accuracy score in six responses and the highest completeness score in eight. In contrast, Google Bard AI had a median accuracy score of 5 (mean: 4.5, SD: 2.03) and a median completeness score of 2 (mean: 2.14, SD: 1.03), achieving the highest accuracy score in five responses and the highest completeness score in six. Spearman's correlation coefficient revealed no significant correlation between accuracy and completeness for ChatGPT (rs = -0.46771, p = 0.09171); however, Google Bard AI showed a marginally significant correlation (rs = 0.5738, p = 0.05108). The Mann-Whitney U test indicated no statistically significant differences between ChatGPT and Google Bard AI in accuracy (U = 82, p > 0.05) or completeness (U = 78, p > 0.05). Conclusion While both chatbots showed similar levels of accuracy, minor errors were noted pertaining to finer aspects that demand specialized knowledge of abortion care, which could explain the lack of a significant correlation between accuracy and completeness. Ultimately, AI-driven language models have the potential to provide information on medication abortions, but continual refinement and oversight are needed.
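The two tests named above are one-liners in SciPy. The sketch below uses hypothetical 6-point accuracy and 3-point completeness ratings, not the study's data, to show the typical setup.

```python
# Minimal sketch with hypothetical ratings (not the study's data):
# Spearman's rank correlation between accuracy and completeness for one
# chatbot, and a Mann-Whitney U test comparing two chatbots' accuracy.
from scipy.stats import spearmanr, mannwhitneyu

chatgpt_accuracy = [6, 5, 6, 4, 5, 6, 5, 6, 5, 6, 4, 5, 6, 5]
chatgpt_complete = [3, 2, 3, 2, 3, 3, 2, 3, 2, 3, 2, 3, 3, 2]
bard_accuracy    = [5, 6, 2, 5, 6, 1, 4, 6, 5, 3, 6, 5, 2, 5]

rs, p = spearmanr(chatgpt_accuracy, chatgpt_complete)
print(f"Spearman rs = {rs:.3f}, p = {p:.3f}")

u, p = mannwhitneyu(chatgpt_accuracy, bard_accuracy,
                    alternative="two-sided")
print(f"Mann-Whitney U = {u:.0f}, p = {p:.3f}")
```

Both are rank-based tests, a sensible choice for ordinal Likert data where means and variances are hard to interpret.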
Affiliation(s)
- Anjali Mediboina
- Community Medicine, Alluri Sita Ramaraju Academy of Medical Sciences, Eluru, IND
- Rajani Kumari Badam
- Obstetrics and Gynaecology, Sri Venkateswara Medical College, Tirupathi, IND
- Sailaja Chodavarapu
- Obstetrics and Gynaecology, Government Medical College, Rajamahendravaram, IND
65. Park SH. Noteworthy Developments in the Korean Journal of Radiology in 2023 and for 2024. Korean J Radiol 2024; 25:1-5. [PMID: 38184762] [PMCID: PMC10788598] [DOI: 10.3348/kjr.2023.1172]
Affiliation(s)
- Seong Ho Park
- Department of Radiology and Research Institute of Radiology, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Republic of Korea
66. Marey A, Saad AM, Killeen BD, Gomez C, Tregubova M, Unberath M, Umair M. Generative Artificial Intelligence: Enhancing Patient Education in Cardiovascular Imaging. BJR Open 2024; 6:tzae018. [PMID: 39086557] [PMCID: PMC11290812] [DOI: 10.1093/bjro/tzae018]
Abstract
Cardiovascular disease (CVD) is a major cause of mortality worldwide, especially in resource-limited countries where access to healthcare is constrained. Early detection and accurate imaging are vital for managing CVD, emphasizing the significance of patient education. Generative artificial intelligence (AI), encompassing algorithms that synthesize text, speech, images, and combinations thereof from a given scenario or prompt, offers promising solutions for enhancing patient education. By combining vision and language models, generative AI enables personalized multimedia content generation through natural language interactions, benefiting patient education in cardiovascular imaging. Simulations, chat-based interactions, and voice-based interfaces can enhance accessibility, especially in resource-limited settings. Despite its potential benefits, implementing generative AI in resource-limited countries faces challenges such as data quality, infrastructure limitations, and ethical considerations; addressing these issues is crucial for successful adoption. Ethical challenges related to data privacy and accuracy must also be overcome to ensure better patient understanding, treatment adherence, and improved healthcare outcomes. Continued research, innovation, and collaboration in generative AI have the potential to revolutionize patient education, empowering patients to make informed decisions about their cardiovascular health and ultimately improving healthcare outcomes in resource-limited settings.
Affiliation(s)
- Ahmed Marey
- Alexandria University Faculty of Medicine, Alexandria, 21521, Egypt
- Catalina Gomez
- Department of Computer Science, Laboratory for Computational Sensing and Robotics, Johns Hopkins University, Baltimore, MD, 21218, United States
- Mariia Tregubova
- Department of Radiology, Amosov National Institute of Cardiovascular Surgery, Kyiv, 02000, Ukraine
- Mathias Unberath
- Department of Computer Science, Laboratory for Computational Sensing and Robotics, Johns Hopkins University, Baltimore, MD, 21218, United States
- Muhammad Umair
- Russell H. Morgan Department of Radiology and Radiological Sciences, The Johns Hopkins Hospital, Baltimore, MD, 21205, United States
67. Bhayana R. Chatbots and Large Language Models in Radiology: A Practical Primer for Clinical and Research Applications. Radiology 2024; 310:e232756. [PMID: 38226883] [DOI: 10.1148/radiol.232756]
Abstract
Although chatbots have existed for decades, the emergence of transformer-based large language models (LLMs) has captivated the world through the most recent wave of artificial intelligence chatbots, including ChatGPT. Transformers are a type of neural network architecture that enables better contextual understanding of language and efficient training on massive amounts of unlabeled data, such as unstructured text from the internet. As LLMs have increased in size, their improved performance and emergent abilities have revolutionized natural language processing. Since language is integral to human thought, applications based on LLMs have transformative potential in many industries. In fact, LLM-based chatbots have demonstrated human-level performance on many professional benchmarks, including in radiology. LLMs offer numerous clinical and research applications in radiology, several of which have been explored in the literature with encouraging results. Multimodal LLMs can simultaneously interpret text and images to generate reports, closely mimicking current diagnostic pathways in radiology. Thus, from requisition to report, LLMs have the opportunity to positively impact nearly every step of the radiology journey. Yet, these impressive models are not without limitations. This article reviews the limitations of LLMs and mitigation strategies, as well as potential uses of LLMs, including multimodal models. Also reviewed are existing LLM-based applications that can enhance efficiency in supervised settings.
Affiliation(s)
- Rajesh Bhayana
- From University Medical Imaging Toronto, Joint Department of Medical Imaging, University Health Network, Mount Sinai Hospital, and Women's College Hospital, University of Toronto, Toronto General Hospital, 200 Elizabeth St, Peter Munk Bldg, 1st Fl, Toronto, ON, Canada M5G 2C4
68. Gan RK, Ogbodo JC, Wee YZ, Gan AZ, González PA. Performance of Google bard and ChatGPT in mass casualty incidents triage. Am J Emerg Med 2024; 75:72-78. [PMID: 37967485] [DOI: 10.1016/j.ajem.2023.10.034]
Abstract
AIM The objective of our research is to evaluate and compare the performance of ChatGPT, Google Bard, and medical students in performing START triage during mass casualty situations. METHOD We conducted a cross-sectional analysis comparing ChatGPT, Google Bard, and medical students in mass casualty incident (MCI) triage using the Simple Triage And Rapid Treatment (START) method. A validated questionnaire with 15 diverse MCI scenarios was used to assess triage accuracy, with content analysis in four categories: "Walking wounded," "Respiration," "Perfusion," and "Mental Status." RESULTS Google Bard demonstrated a notably higher accuracy of 60%, while ChatGPT achieved an accuracy of 26.67% (p = 0.002). By comparison, medical students performed at an accuracy rate of 64.3% in a previous study; however, no significant difference was observed between Google Bard and medical students (p = 0.211). Qualitative content analysis of "walking wounded," "respiration," "perfusion," and "mental status" indicated that Google Bard outperformed ChatGPT. CONCLUSION Google Bard was superior to ChatGPT in correctly performing mass casualty incident triage, achieving an accuracy of 60% versus 26.67% for ChatGPT, a statistically significant difference (p = 0.002).
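The accuracy gap above can be framed as a comparison of two proportions. The sketch below uses Fisher's exact test on hypothetical counts; it illustrates the setup only and does not reproduce the study's analysis.

```python
# Minimal sketch with hypothetical counts (not the study's raw data):
# Fisher's exact test comparing two chatbots' correct-triage proportions.
from scipy.stats import fisher_exact

#            correct  incorrect
bard    = [36, 24]   # e.g., 36/60 triage decisions correct (60%)
chatgpt = [16, 44]   # e.g., 16/60 triage decisions correct (26.7%)

odds_ratio, p_value = fisher_exact([bard, chatgpt])
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.4f}")
```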
Affiliation(s)
- Rick Kye Gan
- Unit for Research in Emergency and Disaster, Faculty of Medicine and Health Sciences, University of Oviedo, Oviedo 33006, Spain
- Jude Chukwuebuka Ogbodo
- Unit for Research in Emergency and Disaster, Faculty of Medicine and Health Sciences, University of Oviedo, Oviedo 33006, Spain; Department of Primary Care and Population Health, Medical School, University of Nicosia, Nicosia 2408, Cyprus
- Yong Zheng Wee
- Faculty of Computing & Informatics, Multimedia University, 63100 Cyberjaya, Selangor, Malaysia
- Ann Zee Gan
- Tenghilan Health Clinic, Tuaran 89208, Sabah, Malaysia; Hospital Universiti Sains Malaysia, 16150 Kota Bharu, Malaysia
- Pedro Arcos González
- Unit for Research in Emergency and Disaster, Faculty of Medicine and Health Sciences, University of Oviedo, Oviedo 33006, Spain
69. Hirosawa T, Harada Y, Mizuta K, Sakamoto T, Tokumasu K, Shimizu T. Diagnostic performance of generative artificial intelligences for a series of complex case reports. Digit Health 2024; 10:20552076241265215. [PMID: 39229463] [PMCID: PMC11369864] [DOI: 10.1177/20552076241265215]
Abstract
Background The diagnostic performance of generative artificial intelligences (AIs) using large language models (LLMs) across comprehensive medical specialties is still unknown. Objective We aimed to evaluate the diagnostic performance of generative AIs using LLMs on complex case series across comprehensive medical fields. Methods We analyzed published case reports from the American Journal of Case Reports from January 2022 to March 2023, excluding pediatric cases and those primarily focused on management. We utilized three generative AIs to generate the top 10 differential-diagnosis (DDx) lists from case descriptions: the fourth-generation chat generative pre-trained transformer (ChatGPT-4), Google Gemini (previously Bard), and the LLM Meta AI 2 (LLaMA2) chatbot. Two independent physicians assessed the inclusion of the final diagnosis in the lists generated by the AIs. Results Of 557 consecutive case reports, 392 were included. The inclusion rates of the final diagnosis within the top 10 DDx lists were 86.7% (340/392) for ChatGPT-4, 68.6% (269/392) for Google Gemini, and 54.6% (214/392) for the LLaMA2 chatbot. The top diagnosis matched the final diagnosis in 54.6% (214/392) for ChatGPT-4, 31.4% (123/392) for Google Gemini, and 23.0% (90/392) for the LLaMA2 chatbot. ChatGPT-4 showed higher diagnostic accuracy than Google Gemini (P < 0.001) and the LLaMA2 chatbot (P < 0.001). Additionally, Google Gemini outperformed the LLaMA2 chatbot within the top 10 DDx lists (P < 0.001) and for the top diagnosis (P = 0.010). Conclusions This study demonstrates the diagnostic performance of generative AIs including ChatGPT-4, Google Gemini, and the LLaMA2 chatbot. ChatGPT-4 exhibited higher diagnostic accuracy than the other platforms. These findings suggest the importance of understanding differences in diagnostic performance among generative AIs, especially on complex case series across comprehensive medical fields such as general medicine.
Collapse
Affiliation(s)
- Takanobu Hirosawa
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Tochigi, Japan
| | - Yukinori Harada
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Tochigi, Japan
| | - Kazuya Mizuta
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Tochigi, Japan
| | - Tetsu Sakamoto
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Tochigi, Japan
| | - Kazuki Tokumasu
- Department of General Medicine, Graduate School of Medicine, Dentistry and Pharmaceutical Sciences, Okayama University, Okayama, Japan
| | - Taro Shimizu
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Tochigi, Japan
| |
Collapse
|
70
|
Huo B, Cacciamani GE, Collins GS, McKechnie T, Lee Y, Guyatt G. Reporting standards for the use of large language model-linked chatbots for health advice. Nat Med 2023; 29:2988. [PMID: 37957381 DOI: 10.1038/s41591-023-02656-2] [Citation(s) in RCA: 13] [Impact Index Per Article: 13.0] [Reference Citation Analysis] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2023]
Affiliation(s)
- Bright Huo
- Division of General Surgery, Department of Surgery, McMaster University, Hamilton, Ontario, Canada.
| | - Giovanni E Cacciamani
- USC Institute of Urology and Catherine and Joseph Aresty Department of Urology, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
- AI Center at USC Urology, University of Southern California, Los Angeles, CA, USA
| | - Gary S Collins
- Centre for Statistics in Medicine, Nuffield Department of Orthopaedics, Rheumatology & Musculoskeletal Sciences, University of Oxford, Oxford, UK
- UK EQUATOR Centre, University of Oxford, Oxford, UK
| | - Tyler McKechnie
- Division of General Surgery, Department of Surgery, McMaster University, Hamilton, Ontario, Canada
- Department of Health Research Methods, Evidence, and Impact, McMaster University, Hamilton, Ontario, Canada
| | - Yung Lee
- Division of General Surgery, Department of Surgery, McMaster University, Hamilton, Ontario, Canada
- Harvard T.H. Chan School of Public Health, Harvard University, Boston, MA, USA
| | - Gordon Guyatt
- Department of Health Research Methods, Evidence, and Impact, McMaster University, Hamilton, Ontario, Canada
- Department of Medicine, McMaster University, Hamilton, Ontario, Canada
| |
Collapse
|
71
|
Alanzi TM, Alzahrani W, Albalawi NS, Allahyani T, Alghamdi A, Al-Zahrani H, Almutairi A, Alzahrani H, Almulhem L, Alanzi N, Al Moarfeg A, Farhah N. Public Awareness of Obesity as a Risk Factor for Cancer in Central Saudi Arabia: Feasibility of ChatGPT as an Educational Intervention. Cureus 2023; 15:e50781. [PMID: 38239542 PMCID: PMC10795720 DOI: 10.7759/cureus.50781] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 12/17/2023] [Indexed: 01/22/2024] Open
Abstract
BACKGROUND While the link between obesity and chronic diseases such as diabetes and cardiovascular disorders is well-documented, there is a growing body of evidence connecting obesity with an increased risk of cancer. However, public awareness of this connection remains limited. STUDY PURPOSE To analyze public awareness of overweight/obesity as a risk factor for cancer and public perceptions of the feasibility of ChatGPT, an artificial intelligence-based conversational agent, as an educational intervention tool. METHODS This study used a mixed-methods approach: a deductive quantitative cross-sectional arm to draw conclusions from empirical evidence on public awareness of the link between obesity and cancer, and an inductive qualitative arm to interpret public perceptions of using ChatGPT to raise awareness of obesity, cancer, and its risk factors. Participants were adult residents of Saudi Arabia; 486 individuals completed the survey and 21 took part in semi-structured interviews. RESULTS About 65% of the participants were not completely aware of cancer and its risk factors. Significant differences in awareness were observed across age groups (p < .0001), socio-economic status (p = .041), and regional distribution (p = .0351). A total of 10 themes emerged from the interview data: five positive factors (accessibility, personalization, cost-effectiveness, anonymity and privacy, and multi-language support) and five negative factors (information inaccuracy, lack of emotional intelligence, dependency and overreliance, data privacy and security, and inability to provide physical support or diagnosis). CONCLUSION This study underscores the potential of leveraging ChatGPT as a valuable public awareness tool for cancer in Saudi Arabia.
Collapse
Affiliation(s)
- Turki M Alanzi
- Department of Health Information Management and Technology, College of Public Health, Imam Abdulrahman Bin Faisal University, Dammam, SAU
| | - Wala Alzahrani
- Department of Clinical Nutrition, College of Applied Medical Sciences, King Abdulaziz University, Jeddah, SAU
| | | | - Taif Allahyani
- College of Applied Medical Sciences, Umm Al-Qura University, Makkah, SAU
| | | | - Haneen Al-Zahrani
- Department of Hematology, Armed Forces Hospital at King Abdulaziz Airbase Dhahran, Dhahran, SAU
| | - Awatif Almutairi
- Department of Clinical Laboratories Sciences, College of Applied Medical Sciences, Jouf University, Jouf, SAU
| | | | | | - Nouf Alanzi
- Department of Clinical Laboratories Sciences, College of Applied Medical Sciences, Jouf University, Jouf, SAU
| | | | - Nesren Farhah
- Department of Health Informatics, College of Health Sciences, Saudi Electronic University, Riyadh, SAU
| |
Collapse
|
72
|
Zhang C, Xu J, Tang R, Yang J, Wang W, Yu X, Shi S. Novel research and future prospects of artificial intelligence in cancer diagnosis and treatment. J Hematol Oncol 2023; 16:114. [PMID: 38012673 PMCID: PMC10680201 DOI: 10.1186/s13045-023-01514-5] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/07/2023] [Accepted: 11/20/2023] [Indexed: 11/29/2023] Open
Abstract
Research into the potential benefits of artificial intelligence for comprehending the intricate biology of cancer has grown as a result of the widespread use of deep learning and machine learning in the healthcare sector and the availability of highly specialized cancer datasets. Here, we review new artificial intelligence approaches and how they are being used in oncology. We describe how artificial intelligence might be used in the detection, prognosis, and administration of cancer treatments and introduce the use of the latest large language models such as ChatGPT in oncology clinics. We highlight artificial intelligence applications for omics data types, and we offer perspectives on how the various data types might be combined to create decision-support tools. We also evaluate the present constraints and challenges to applying artificial intelligence in precision oncology. Finally, we discuss how current challenges may be surmounted to make artificial intelligence useful in clinical settings in the future.
Collapse
Affiliation(s)
- Chaoyi Zhang
- Department of Pancreatic Surgery, Fudan University Shanghai Cancer Center, No. 270 Dong'An Road, Shanghai, 200032, People's Republic of China
- Department of Oncology, Shanghai Medical College, Fudan University, Shanghai, 200032, People's Republic of China
- Shanghai Pancreatic Cancer Institute, No. 399 Lingling Road, Shanghai, 200032, People's Republic of China
- Pancreatic Cancer Institute, Fudan University, Shanghai, 200032, People's Republic of China
| | - Jin Xu
- Department of Pancreatic Surgery, Fudan University Shanghai Cancer Center, No. 270 Dong'An Road, Shanghai, 200032, People's Republic of China
- Department of Oncology, Shanghai Medical College, Fudan University, Shanghai, 200032, People's Republic of China
- Shanghai Pancreatic Cancer Institute, No. 399 Lingling Road, Shanghai, 200032, People's Republic of China
- Pancreatic Cancer Institute, Fudan University, Shanghai, 200032, People's Republic of China
| | - Rong Tang
- Department of Pancreatic Surgery, Fudan University Shanghai Cancer Center, No. 270 Dong'An Road, Shanghai, 200032, People's Republic of China
- Department of Oncology, Shanghai Medical College, Fudan University, Shanghai, 200032, People's Republic of China
- Shanghai Pancreatic Cancer Institute, No. 399 Lingling Road, Shanghai, 200032, People's Republic of China
- Pancreatic Cancer Institute, Fudan University, Shanghai, 200032, People's Republic of China
| | - Jianhui Yang
- Department of Pancreatic Surgery, Fudan University Shanghai Cancer Center, No. 270 Dong'An Road, Shanghai, 200032, People's Republic of China
- Department of Oncology, Shanghai Medical College, Fudan University, Shanghai, 200032, People's Republic of China
- Shanghai Pancreatic Cancer Institute, No. 399 Lingling Road, Shanghai, 200032, People's Republic of China
- Pancreatic Cancer Institute, Fudan University, Shanghai, 200032, People's Republic of China
| | - Wei Wang
- Department of Pancreatic Surgery, Fudan University Shanghai Cancer Center, No. 270 Dong'An Road, Shanghai, 200032, People's Republic of China
- Department of Oncology, Shanghai Medical College, Fudan University, Shanghai, 200032, People's Republic of China
- Shanghai Pancreatic Cancer Institute, No. 399 Lingling Road, Shanghai, 200032, People's Republic of China
- Pancreatic Cancer Institute, Fudan University, Shanghai, 200032, People's Republic of China
| | - Xianjun Yu
- Department of Pancreatic Surgery, Fudan University Shanghai Cancer Center, No. 270 Dong'An Road, Shanghai, 200032, People's Republic of China.
- Department of Oncology, Shanghai Medical College, Fudan University, Shanghai, 200032, People's Republic of China.
- Shanghai Pancreatic Cancer Institute, No. 399 Lingling Road, Shanghai, 200032, People's Republic of China.
- Pancreatic Cancer Institute, Fudan University, Shanghai, 200032, People's Republic of China.
| | - Si Shi
- Department of Pancreatic Surgery, Fudan University Shanghai Cancer Center, No. 270 Dong'An Road, Shanghai, 200032, People's Republic of China.
- Department of Oncology, Shanghai Medical College, Fudan University, Shanghai, 200032, People's Republic of China.
- Shanghai Pancreatic Cancer Institute, No. 399 Lingling Road, Shanghai, 200032, People's Republic of China.
- Pancreatic Cancer Institute, Fudan University, Shanghai, 200032, People's Republic of China.
| |
Collapse
|
73
|
Iannantuono GM, Bracken-Clarke D, Karzai F, Choo-Wosoba H, Gulley JL, Floudas CS. Comparison of Large Language Models in Answering Immuno-Oncology Questions: A Cross-Sectional Study. medRxiv [Preprint] 2023:2023.10.31.23297825. [PMID: 38076813 PMCID: PMC10705618 DOI: 10.1101/2023.10.31.23297825] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 12/19/2023]
Abstract
Background The capability of large language models (LLMs) to understand and generate human-readable text has prompted investigation of their potential as educational and management tools for cancer patients and healthcare providers. Materials and Methods We conducted a cross-sectional study evaluating the ability of ChatGPT-4, ChatGPT-3.5, and Google Bard to answer questions related to four domains of immuno-oncology (Mechanisms, Indications, Toxicities, and Prognosis). We generated 60 open-ended questions (15 per domain). Questions were manually submitted to the LLMs, and responses were collected on June 30th, 2023. Two reviewers evaluated the answers independently. Results ChatGPT-4 and ChatGPT-3.5 answered all questions, whereas Google Bard answered only 53.3% (p < 0.0001). The proportion of questions with reproducible answers was higher for ChatGPT-4 (95%) and ChatGPT-3.5 (88.3%) than for Google Bard (50%) (p < 0.0001). In terms of accuracy, the proportion of answers deemed fully correct was 75.4%, 58.5%, and 43.8% for ChatGPT-4, ChatGPT-3.5, and Google Bard, respectively (p = 0.03). Furthermore, the proportion of responses deemed highly relevant was 71.9%, 77.4%, and 43.8%, respectively (p = 0.04). Regarding readability, the proportion of responses deemed highly readable was higher for ChatGPT-4 (98.1%) and ChatGPT-3.5 (100%) than for Google Bard (87.5%) (p = 0.02). Conclusion ChatGPT-4 and ChatGPT-3.5 are potentially powerful tools in immuno-oncology, whereas Google Bard demonstrated relatively poorer performance. However, the risk of inaccuracy or incompleteness was evident in all three LLMs, highlighting the importance of expert-driven verification of the outputs returned by these technologies.
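From the percentages above, approximate per-model counts of fully correct answers can be reconstructed (e.g., 43/57 = 75.4%, 31/53 = 58.5%, 14/32 = 43.8%, consistent with the reported reproducibility and response rates). The sketch below compares these reconstructed proportions with a chi-square test; both the test choice and the denominators are assumptions, since neither is stated in the abstract.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Reconstructed (approximate) counts of fully correct vs. not-fully-correct
# answers; denominators are inferred from the abstract, not stated in it.
#            ChatGPT-4  ChatGPT-3.5  Google Bard
correct   = [       43,          31,          14]
incorrect = [       14,          22,          18]
table = np.array([correct, incorrect]).T  # one row per model
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}")
```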
Collapse
Affiliation(s)
- Giovanni Maria Iannantuono
- Genitourinary Malignancies Branch, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, United States
| | - Dara Bracken-Clarke
- Center for Immuno-Oncology, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, United States
| | - Fatima Karzai
- Genitourinary Malignancies Branch, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, United States
| | - Hyoyoung Choo-Wosoba
- Biostatistics and Data Management Section, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, United States
| | - James L. Gulley
- Center for Immuno-Oncology, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, United States
| | - Charalampos S. Floudas
- Center for Immuno-Oncology, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, United States
| |
Collapse
|
74
|
Hu JM, Liu FC, Chu CM, Chang YT. Health Care Trainees' and Professionals' Perceptions of ChatGPT in Improving Medical Knowledge Training: Rapid Survey Study. J Med Internet Res 2023; 25:e49385. [PMID: 37851495 PMCID: PMC10620632 DOI: 10.2196/49385] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/29/2023] [Revised: 07/13/2023] [Accepted: 09/29/2023] [Indexed: 10/19/2023] Open
Abstract
BACKGROUND ChatGPT is a powerful pretrained large language model. It has both demonstrated potential and raised concerns related to knowledge translation and knowledge transfer. To apply and improve knowledge transfer in the real world, it is essential to assess the perceptions and acceptance of the users of ChatGPT-assisted training. OBJECTIVE We aimed to investigate the perceptions of health care trainees and professionals on ChatGPT-assisted training, using biomedical informatics as an example. METHODS We used purposeful sampling to include all health care undergraduate trainees and graduate professionals (n=195) from January to May 2023 in the School of Public Health at the National Defense Medical Center in Taiwan. Subjects were asked to watch a 2-minute video introducing 5 scenarios of ChatGPT-assisted training in biomedical informatics and then answer a self-designed online (web- and mobile-based) questionnaire based on the Kirkpatrick model. The survey responses were used to develop 4 constructs: "perceived knowledge acquisition," "perceived training motivation," "perceived training satisfaction," and "perceived training effectiveness." The study used structural equation modeling (SEM) to evaluate and test the structural model and hypotheses. RESULTS The online questionnaire response rate was 152 of 195 (78%); 88 of 152 participants (58%) were undergraduate trainees and 90 of 152 participants (59%) were women. Ages ranged from 18 to 53 years (mean 23.3, SD 6.0 years). There was no statistical difference in perceptions of training evaluation between men and women. Most participants were enthusiastic about the ChatGPT-assisted training, with graduate professionals more enthusiastic than undergraduate trainees. Nevertheless, some concerns were raised about potential cheating on training assessment. The average scores for knowledge acquisition, training motivation, training satisfaction, and training effectiveness were 3.84 (SD 0.80), 3.76 (SD 0.93), 3.75 (SD 0.87), and 3.72 (SD 0.91), respectively (Likert scale 1-5: strongly disagree to strongly agree). Knowledge acquisition had the highest score and training effectiveness the lowest. In the SEM results, training effectiveness was influenced predominantly by knowledge acquisition, partially supporting the hypotheses in the research framework. Knowledge acquisition had a direct effect on training effectiveness, training satisfaction, and training motivation, with β coefficients of .80, .87, and .97, respectively (all P<.001). CONCLUSIONS Most health care trainees and professionals perceived ChatGPT-assisted training as an aid to knowledge transfer. To improve training effectiveness, however, it should be combined with guidance from experienced human experts and two-way interaction. Future studies should use larger samples to evaluate internet-connected large language models for medical knowledge transfer.
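For readers who want to reproduce this kind of analysis, the path model implied by the abstract (knowledge acquisition driving motivation, satisfaction, and effectiveness) can be specified in a few lines with an SEM library. This is a minimal sketch assuming the semopy package and a hypothetical CSV of per-respondent construct scores; it is not the study's actual model specification or data.

```python
# pip install semopy pandas
import pandas as pd
import semopy

# Hypothetical path model mirroring the four constructs named in the abstract;
# the variable names and input file are illustrative, not the study's.
MODEL_DESC = """
training_motivation ~ knowledge_acquisition
training_satisfaction ~ knowledge_acquisition
training_effectiveness ~ knowledge_acquisition + training_motivation + training_satisfaction
"""

df = pd.read_csv("survey_scores.csv")  # assumed file: one column per construct
model = semopy.Model(MODEL_DESC)
model.fit(df)
print(model.inspect())  # path estimates with standard errors and p-values
```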
Collapse
Affiliation(s)
- Je-Ming Hu
- Division of Colorectal Surgery, Department of Surgery, Tri-service General Hospital, National Defense Medical Center, Taipei, Taiwan
- Graduate Institute of Medical Sciences, National Defense Medical Center, Taipei, Taiwan
- School of Medicine, National Defense Medical Center, Taipei, Taiwan
| | - Feng-Cheng Liu
- Division of Rheumatology/Immunology and Allergy, Department of Medicine, Tri-Service General Hospital, National Defense Medical Center, Taipei, Taiwan
| | - Chi-Ming Chu
- Graduate Institute of Medical Sciences, National Defense Medical Center, Taipei, Taiwan
- School of Public Health, National Defense Medical Center, Taipei, Taiwan
- Graduate Institute of Life Sciences, National Defense Medical Center, Taipei, Taiwan
- Big Data Research Center, College of Medicine, Fu-Jen Catholic University, New Taipei City, Taiwan
- Department of Public Health, Kaohsiung Medical University, Kaohsiung, Taiwan
- Department of Public Health, China Medical University, Taichung, Taiwan
| | - Yu-Tien Chang
- School of Public Health, National Defense Medical Center, Taipei, Taiwan
| |
Collapse
|
75
|
Talyshinskii A, Naik N, Hameed BMZ, Zhanbyrbekuly U, Khairli G, Guliev B, Juilebø-Jones P, Tzelves L, Somani BK. Expanding horizons and navigating challenges for enhanced clinical workflows: ChatGPT in urology. Front Surg 2023; 10:1257191. [PMID: 37744723 PMCID: PMC10512827 DOI: 10.3389/fsurg.2023.1257191] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/12/2023] [Accepted: 08/28/2023] [Indexed: 09/26/2023] Open
Abstract
Purpose of review ChatGPT has emerged as a potential tool for facilitating doctors' workflows, but few studies have examined its application in a urological context. Our objective was therefore to analyze the pros and cons of ChatGPT use and how urologists can exploit it. Recent findings ChatGPT can facilitate clinical documentation and note-taking, patient communication and support, medical education, and research. In urology, ChatGPT has shown potential as a virtual healthcare aide for benign prostatic hyperplasia, an educational and prevention tool for prostate cancer, educational support for urological residents, and an assistant in writing urological papers and academic work. However, several concerns about its use remain, such as its lack of web crawling, the risk of accidental plagiarism, and questions about patient-data privacy. Summary These limitations underscore the need for further improvement of ChatGPT, such as ensuring the privacy of patient data, expanding the training dataset to include medical databases, and developing guidance on its appropriate use. Urologists can also help by conducting studies to determine the effectiveness of ChatGPT in clinical scenarios and nosologies beyond those listed above.
Collapse
Affiliation(s)
- Ali Talyshinskii
- Department of Urology, Astana Medical University, Astana, Kazakhstan
| | - Nithesh Naik
- Department of Mechanical and Industrial Engineering, Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal, India
| | | | | | - Gafur Khairli
- Department of Urology, Astana Medical University, Astana, Kazakhstan
| | - Bakhman Guliev
- Department of Urology, Mariinsky Hospital, St Petersburg, Russia
| | | | - Lazaros Tzelves
- Department of Urology, National and Kapodistrian University of Athens, Sismanogleion Hospital, Athens, Marousi, Greece
| | - Bhaskar Kumar Somani
- Department of Urology, University Hospital Southampton NHS Trust, Southampton, United Kingdom
| |
Collapse
|
76
|
Iannantuono GM, Bracken-Clarke D, Floudas CS, Roselli M, Gulley JL, Karzai F. Applications of large language models in cancer care: current evidence and future perspectives. Front Oncol 2023; 13:1268915. [PMID: 37731643 PMCID: PMC10507617 DOI: 10.3389/fonc.2023.1268915] [Citation(s) in RCA: 14] [Impact Index Per Article: 14.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/28/2023] [Accepted: 08/21/2023] [Indexed: 09/22/2023] Open
Abstract
The development of large language models (LLMs) is a recent success in the field of generative artificial intelligence (AI). They are computer models able to perform a wide range of natural language processing tasks, including content generation, question answering, and language translation. In recent months, a growing number of studies have aimed to assess their potential applications in the field of medicine, including cancer care. In this mini review, we describe the published evidence to date on using LLMs in oncology. All the available studies assessed ChatGPT, an advanced language model developed by OpenAI, alone or compared to other LLMs, such as Google Bard, Chatsonic, and Perplexity. Although ChatGPT could provide adequate information on the screening or management of specific solid tumors, it also demonstrated a significant error rate and a tendency to provide obsolete data. Therefore, an accurate, expert-driven verification process remains mandatory to avoid the potential for misinformation and incorrect evidence. Overall, although this new generative AI-based technology has the potential to revolutionize the field of medicine, including cancer care, it will be necessary to develop rules to guide the application of these tools to maximize benefits and minimize risks.
Collapse
Affiliation(s)
- Giovanni Maria Iannantuono
- Genitourinary Malignancies Branch, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, United States
- Medical Oncology Unit, Department of Systems Medicine, University of Rome Tor Vergata, Rome, Italy
| | - Dara Bracken-Clarke
- Center for Immuno-Oncology, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, United States
| | - Charalampos S. Floudas
- Center for Immuno-Oncology, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, United States
| | - Mario Roselli
- Medical Oncology Unit, Department of Systems Medicine, University of Rome Tor Vergata, Rome, Italy
| | - James L. Gulley
- Center for Immuno-Oncology, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, United States
| | - Fatima Karzai
- Genitourinary Malignancies Branch, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, United States
| |
Collapse
|
77
|
Tippareddy C, Jiang S, Bera K, Ramaiya N. Radiology Reading Room for the Future: Harnessing the Power of Large Language Models Like ChatGPT. Curr Probl Diagn Radiol 2023:S0363-0188(23)00133-0. [PMID: 37758604 DOI: 10.1067/j.cpradiol.2023.08.018] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/09/2023] [Revised: 08/28/2023] [Accepted: 08/28/2023] [Indexed: 09/29/2023]
Abstract
Radiology has long been the field of medicine at the forefront of technological advances, often the first to embrace them wholeheartedly. From digitization to cloud-based architecture, radiology has led the way in adopting the latest advances. With the advent of large language models (LLMs), and especially the unprecedented explosion of the freely available ChatGPT, the time is ripe for radiology and radiologists to find novel ways to use the technology to improve their workflow. We believe these LLMs have a key role in the radiology reading room, not only to expedite processes and simplify mundane and archaic tasks, but also to increase the knowledge base of radiologists and radiology trainees at a far faster pace. In this article, we discuss some of the ways we believe ChatGPT and similar models can be harnessed in the reading room.
Collapse
Affiliation(s)
- Charit Tippareddy
- Department of Radiology, University Hospitals Cleveland Medical Center, Cleveland, OH
| | - Sirui Jiang
- Department of Radiology, University Hospitals Cleveland Medical Center, Cleveland, OH
| | - Kaustav Bera
- Department of Radiology, University Hospitals Cleveland Medical Center, Cleveland, OH.
| | - Nikhil Ramaiya
- Department of Radiology, University Hospitals Cleveland Medical Center, Cleveland, OH
| |
Collapse
|
78
|
Patil NS, Huang RS, van der Pol CB, Larocque N. Comparative Performance of ChatGPT and Bard in a Text-Based Radiology Knowledge Assessment. Can Assoc Radiol J 2023:8465371231193716. [PMID: 37578849 DOI: 10.1177/08465371231193716] [Citation(s) in RCA: 26] [Impact Index Per Article: 26.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 08/16/2023] Open
Abstract
PURPOSE Bard by Google, a direct competitor to ChatGPT, was recently released. Understanding the relative performance of these chatbots can provide important insight into their strengths and weaknesses, as well as the roles they are best suited to fill. In this project, we aimed to compare the most recent version of ChatGPT, ChatGPT-4, and Bard by Google in their ability to accurately respond to radiology board examination practice questions. METHODS Text-based questions were collected from the 2017-2021 American College of Radiology Diagnostic Radiology In-Training (DXIT) examinations. ChatGPT-4 and Bard were queried, and their comparative accuracies, response lengths, and response times were documented. Subspecialty-specific performance was analyzed as well. RESULTS 318 questions were included in our analysis. ChatGPT answered significantly more accurately than Bard (87.11% vs 70.44%, P < .0001). ChatGPT's responses were significantly shorter than Bard's (935.28 ± 440.88 characters vs 1437.52 ± 415.91 characters, P < .0001), and its response time was significantly longer (26.79 ± 3.27 seconds vs 7.55 ± 1.88 seconds, P < .0001). ChatGPT outperformed Bard in neuroradiology (100.00% vs 86.21%, P = .03), general & physics (85.39% vs 68.54%, P < .001), nuclear medicine (80.00% vs 56.67%, P < .01), pediatric radiology (93.75% vs 68.75%, P = .03), and ultrasound (100.00% vs 63.64%, P < .001). In the remaining subspecialties, there were no significant differences between ChatGPT's and Bard's performance. CONCLUSION ChatGPT displayed superior radiology knowledge compared to Bard. While both chatbots display reasonable radiology knowledge, they should be used with conscious awareness of their limitations and fallibility. Both chatbots provided incorrect or illogical answer explanations and did not always address the educational content of the question.
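The abstract does not state which test produced the accuracy p-value; a two-proportion z-test is a common choice for comparing 87.11% vs 70.44% over the same 318 questions, so the sketch below should be read as an assumption, with counts reconstructed from the reported percentages.

```python
from statsmodels.stats.proportion import proportions_ztest

# Counts reconstructed from the reported accuracies over 318 questions.
n = 318
correct = [round(0.8711 * n), round(0.7044 * n)]  # ChatGPT-4: 277, Bard: 224
stat, pval = proportions_ztest(count=correct, nobs=[n, n])
print(f"z = {stat:.2f}, p = {pval:.1e}")  # consistent with P < .0001
```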
Collapse
Affiliation(s)
- Nikhil S Patil
- Michael G. DeGroote School of Medicine, McMaster University, Hamilton, ON, Canada
| | - Ryan S Huang
- Temerty Faculty of Medicine, University of Toronto, Toronto, ON, Canada
| | - Christian B van der Pol
- Department of Diagnostic Imaging, Hamilton Health Sciences, Juravinski Hospital and Cancer Centre, Hamilton, ON, Canada
| | - Natasha Larocque
- Department of Radiology, McMaster University, Hamilton, ON, Canada
| |
Collapse
|
79
|
Park SH. Use of Generative Artificial Intelligence, Including Large Language Models Such as ChatGPT, in Scientific Publications: Policies of KJR and Prominent Authorities. Korean J Radiol 2023; 24:715-718. [PMID: 37500572 PMCID: PMC10400373 DOI: 10.3348/kjr.2023.0643] [Citation(s) in RCA: 12] [Impact Index Per Article: 12.0] [Reference Citation Analysis] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/10/2023] [Accepted: 07/10/2023] [Indexed: 07/29/2023] Open
Affiliation(s)
- Seong Ho Park
- Department of Radiology and Research Institute of Radiology, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Republic of Korea.
| |
Collapse
|
80
|
Dhanvijay AKD, Pinjar MJ, Dhokane N, Sorte SR, Kumari A, Mondal H. Performance of Large Language Models (ChatGPT, Bing Search, and Google Bard) in Solving Case Vignettes in Physiology. Cureus 2023; 15:e42972. [PMID: 37671207 PMCID: PMC10475852 DOI: 10.7759/cureus.42972] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 08/04/2023] [Indexed: 09/07/2023] Open
Abstract
Background Large language models (LLMs) have emerged as powerful tools capable of processing and generating human-like text. LLMs such as ChatGPT (OpenAI Incorporated, Mission District, San Francisco, United States), Google Bard (Alphabet Inc., CA, US), and Microsoft Bing (Microsoft Corporation, WA, US) have been applied across various domains, demonstrating their potential to assist in solving complex tasks and improving information accessibility. However, their application to solving case vignettes in physiology has not been explored. This study aimed to assess the performance of three LLMs, namely, ChatGPT (3.5; free research version), Google Bard (Experiment), and Microsoft Bing (Precise), in answering case vignettes in physiology. Methods This cross-sectional study was conducted in July 2023. A total of 77 case vignettes in physiology were prepared by two physiologists and validated by two other content experts. These cases were presented to each LLM, and their responses were collected. Two physiologists independently rated the answers provided by the LLMs for accuracy on a scale from 0 to 4 according to the Structure of Observed Learning Outcome (SOLO) taxonomy (pre-structural = 0, uni-structural = 1, multi-structural = 2, relational = 3, extended-abstract = 4). Scores among the LLMs were compared by Friedman's test, and inter-observer agreement was checked by the intraclass correlation coefficient (ICC). Results The overall scores for ChatGPT, Bing, and Bard across the 77 cases were 3.19±0.3, 2.15±0.6, and 2.91±0.5, respectively (p<0.0001). Hence, ChatGPT 3.5 (free version) obtained the highest score, Bing (Precise) the lowest, and Bard (Experiment) fell in between. The average ICC values for ChatGPT, Bing, and Bard were 0.858 (95% CI: 0.777 to 0.91, p<0.0001), 0.975 (95% CI: 0.961 to 0.984, p<0.0001), and 0.964 (95% CI: 0.944 to 0.977, p<0.0001), respectively. Conclusion ChatGPT outperformed Bard and Bing in answering case vignettes in physiology. Students and teachers may choose LLMs for case-based learning in physiology accordingly. Further exploration of their capabilities is needed before adopting them in medical education and as support for clinical decision-making.
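Friedman's test (for repeated ratings of the same vignettes across models) and the ICC (for agreement between the two raters) are both standard and straightforward to run. Below is a minimal Python sketch on simulated ratings whose means merely mirror the abstract; the individual values, the second-rater simulation, and the seed are all illustrative, not the study's data.

```python
# pip install scipy pingouin pandas
import numpy as np
import pandas as pd
import pingouin as pg
from scipy.stats import friedmanchisquare

# Hypothetical SOLO ratings (0-4) for the same 77 vignettes under each model
rng = np.random.default_rng(1)
chatgpt = np.clip(np.round(rng.normal(3.19, 0.3, 77)), 0, 4)
bing    = np.clip(np.round(rng.normal(2.15, 0.6, 77)), 0, 4)
bard    = np.clip(np.round(rng.normal(2.91, 0.5, 77)), 0, 4)

# Friedman's test: the same vignettes rated under three models (repeated measures)
print(friedmanchisquare(chatgpt, bing, bard))

# ICC between two raters scoring one model's answers (long-format data)
rater_a = chatgpt
rater_b = np.clip(rater_a + rng.choice([-1, 0, 0, 0, 1], 77), 0, 4)
long = pd.DataFrame({
    "vignette": np.tile(np.arange(77), 2),
    "rater": np.repeat(["A", "B"], 77),
    "score": np.concatenate([rater_a, rater_b]),
})
print(pg.intraclass_corr(data=long, targets="vignette", raters="rater",
                         ratings="score")[["Type", "ICC", "CI95%"]])
```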
Collapse
Affiliation(s)
| | | | - Nitin Dhokane
- Physiology, Government Medical College, Sindhudurg, Oros, IND
| | - Smita R Sorte
- Physiology, All India Institute of Medical Sciences, Nagpur, Nagpur, IND
| | - Amita Kumari
- Physiology, All India Institute of Medical Sciences, Deoghar, Deoghar, IND
| | - Himel Mondal
- Physiology, All India Institute of Medical Sciences, Deoghar, Deoghar, IND
| |
Collapse
|
81
|
Agarwal M, Sharma P, Goswami A. Analysing the Applicability of ChatGPT, Bard, and Bing to Generate Reasoning-Based Multiple-Choice Questions in Medical Physiology. Cureus 2023; 15:e40977. [PMID: 37519497 PMCID: PMC10372539 DOI: 10.7759/cureus.40977] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 06/26/2023] [Indexed: 08/01/2023] Open
Abstract
Background Artificial intelligence (AI) is evolving within the medical education system. ChatGPT, Google Bard, and Microsoft Bing are AI-based models that can solve problems in medical education. However, the applicability of AI to creating reasoning-based multiple-choice questions (MCQs) in medical physiology has yet to be explored. Objective We aimed to assess and compare the applicability of ChatGPT, Bard, and Bing in generating reasoning-based MCQs for MBBS (Bachelor of Medicine, Bachelor of Surgery) undergraduate students on the subject of physiology. Methods The National Medical Commission of India has developed an 11-module physiology curriculum with various competencies. Two physiologists independently chose a competency from each module. A third physiologist prompted all three AIs to generate five MCQs for each chosen competency. The two physiologists who provided the competencies rated the MCQs generated by the AIs on a scale of 0-3 for validity, difficulty, and the reasoning ability required to answer them. We analyzed the average of the two scores using the Kruskal-Wallis test to compare the distributions across the total and module-wise responses, followed by a post-hoc test for pairwise comparisons. We used Cohen's kappa (κ) to assess the agreement in scores between the two raters. Data are expressed as medians with interquartile ranges, and statistical significance was set at p<0.05. Results ChatGPT and Bard generated 110 MCQs for the chosen competencies. Bing provided only 100 MCQs, as it failed to generate them for two competencies. The validity of the MCQs was rated as 3 (3-3) for ChatGPT, 3 (1.5-3) for Bard, and 3 (1.5-3) for Bing, showing a significant difference (p<0.001) among the models. The difficulty of the MCQs was rated as 1 (0-1) for ChatGPT, 1 (1-2) for Bard, and 1 (1-2) for Bing, with a significant difference (p=0.006). The required reasoning ability was rated as 1 (1-2) for all three models, with no significant difference (p=0.235). κ was ≥ 0.8 for all three parameters across all three AI models. Conclusion AI still needs to evolve to generate reasoning-based MCQs in medical physiology. ChatGPT, Bard, and Bing showed certain limitations: Bing generated the least valid MCQs and ChatGPT the least difficult, with statistically significant differences.
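The comparison pipeline for this design differs from the paired case above: the models generated different numbers of MCQs, so the samples are unpaired and a Kruskal-Wallis test applies, with Cohen's kappa for rater agreement. The simulated ratings below only echo the reported group sizes (110, 110, 100); they are not the study's data, and the rating distributions are invented for illustration.

```python
import numpy as np
from scipy.stats import kruskal
from sklearn.metrics import cohen_kappa_score

# Hypothetical 0-3 validity ratings; group sizes mirror the abstract
rng = np.random.default_rng(2)
chatgpt = rng.choice([2, 3], size=110, p=[0.2, 0.8])
bard    = rng.choice([1, 2, 3], size=110, p=[0.2, 0.2, 0.6])
bing    = rng.choice([1, 2, 3], size=100, p=[0.3, 0.2, 0.5])
print(kruskal(chatgpt, bard, bing))  # H statistic and p-value across 3 models

# Inter-rater agreement between the two physiologists on one model's MCQs
rater1 = chatgpt
rater2 = np.clip(rater1 + rng.choice([-1, 0, 0, 0, 0, 1], 110), 0, 3)
print(cohen_kappa_score(rater1, rater2))  # unweighted Cohen's kappa
```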
Collapse
Affiliation(s)
- Mayank Agarwal
- Physiology, All India Institute of Medical Sciences, Raebareli, IND
| | - Priyanka Sharma
- Physiology, School of Medical Sciences and Research, Sharda University, Greater Noida, IND
| | - Ayan Goswami
- Physiology, Santiniketan Medical College, Bolpur, IND
| |
Collapse
|