1
Chen JS, Reddy AJ, Al-Sharif E, Shoji MK, Kalaw FGP, Eslani M, Lang PZ, Arya M, Koretz ZA, Bolo KA, Arnett JJ, Roginiel AC, Do JL, Robbins SL, Camp AS, Scott NL, Rudell JC, Weinreb RN, Baxter SL, Granet DB. Analysis of ChatGPT Responses to Ophthalmic Cases: Can ChatGPT Think like an Ophthalmologist? Ophthalmology Science 2025; 5:100600. [PMID: 39346575 PMCID: PMC11437840 DOI: 10.1016/j.xops.2024.100600]
Abstract
Objective: Large language models such as ChatGPT have demonstrated significant potential in question-answering within ophthalmology, but there is a paucity of literature evaluating their ability to generate clinical assessments and discussions. The objectives of this study were to (1) assess the accuracy of assessments and plans generated by ChatGPT and (2) evaluate ophthalmologists' abilities to distinguish between responses generated by clinicians versus ChatGPT. Design: Cross-sectional mixed-methods study. Subjects: Sixteen ophthalmologists from a single academic center, of whom 10 were board-eligible and 6 were board-certified, were recruited to participate in this study. Methods: Prompt engineering was used to ensure ChatGPT output discussions in the style of the ophthalmologist author of the Medical College of Wisconsin Ophthalmic Case Studies. Cases where ChatGPT accurately identified the primary diagnoses were included and then paired. Masked human-generated and ChatGPT-generated discussions were sent to participating ophthalmologists to identify the author of the discussions. Response confidence was assessed using a 5-point Likert scale score, and subjective feedback was manually reviewed. Main Outcome Measures: Accuracy of ophthalmologist identification of discussion author, as well as subjective perceptions of human-generated versus ChatGPT-generated discussions. Results: Overall, ChatGPT correctly identified the primary diagnosis in 15 of 17 (88.2%) cases. Two cases were excluded from the paired comparison due to hallucinations or fabrications of non-user-provided data. Ophthalmologists correctly identified the author in 77.9% ± 26.6% of the 13 included cases, with a mean Likert scale confidence rating of 3.6 ± 1.0. No significant differences in performance or confidence were found between board-certified and board-eligible ophthalmologists. Subjectively, ophthalmologists found that discussions written by ChatGPT tended to contain more generic responses and irrelevant information, hallucinated more frequently, and had distinct syntactic patterns (all P < 0.01). Conclusions: Large language models have the potential to synthesize clinical data and generate ophthalmic discussions. While these findings have exciting implications for artificial intelligence-assisted health care delivery, more rigorous real-world evaluation of these models is necessary before clinical deployment. Financial Disclosures: The author(s) have no proprietary or commercial interest in any materials discussed in this article.
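As a concrete illustration of the scoring used in this kind of masked author-identification evaluation, the short Python sketch below computes per-rater accuracy against the known author of each case and summarizes mean accuracy and Likert confidence. The arrays are made-up placeholders, not the study's data, and the abstract does not specify the statistical procedure behind the between-group comparison.

```python
import numpy as np

# Illustrative (made-up) data: 4 raters x 13 paired cases.
# true_author[c] is the actual author of case c; guesses[r][c] is rater r's call;
# confidence[r][c] is the corresponding 1-5 Likert confidence rating.
true_author = np.array(["human", "chatgpt"] * 6 + ["human"])       # 13 cases
rng = np.random.default_rng(0)
guesses = rng.choice(["human", "chatgpt"], size=(4, 13))           # hypothetical guesses
confidence = rng.integers(1, 6, size=(4, 13))                      # hypothetical ratings

per_rater_accuracy = (guesses == true_author).mean(axis=1) * 100   # % correct per rater
print(f"accuracy: {per_rater_accuracy.mean():.1f}% +/- {per_rater_accuracy.std(ddof=1):.1f}%")
print(f"confidence: {confidence.mean():.1f} +/- {confidence.std(ddof=1):.1f}")
```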
Affiliation(s)
- Jimmy S Chen, Fritz Gerald P Kalaw, Robert N Weinreb, Sally L Baxter: Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California; UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, California
- Akshay J Reddy: School of Medicine, California University of Science and Medicine, Colton, California
- Eman Al-Sharif: Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California; Surgery Department, College of Medicine, Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia
- Marissa K Shoji, Medi Eslani, Paul Z Lang, Malvika Arya, Zachary A Koretz, Kyle A Bolo, Justin J Arnett, Aliya C Roginiel, Jiun L Do, Shira L Robbins, Andrew S Camp, Nathan L Scott, Jolene C Rudell, David B Granet: Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California
2
Di Paolo LD, White B, Guénin-Carlut A, Constant A, Clark A. Active inference goes to school: the importance of active learning in the age of large language models. Philos Trans R Soc Lond B Biol Sci 2024; 379:20230148. [PMID: 39155715 PMCID: PMC11391319 DOI: 10.1098/rstb.2023.0148]
Abstract
Human learning essentially involves embodied interactions with the material world. But our worlds now include increasing numbers of powerful and (apparently) disembodied generative artificial intelligence (AI). In what follows we ask how best to understand these new (somewhat 'alien', because of their disembodied nature) resources and how to incorporate them in our educational practices. We focus on methodologies that encourage exploration and embodied interactions with 'prepared' material environments, such as the carefully organized settings of Montessori education. Using the active inference framework, we approach our questions by thinking about human learning as epistemic foraging and prediction error minimization. We end by arguing that generative AI should figure naturally as new elements in prepared learning environments by facilitating sequences of precise prediction error enabling trajectories of self-correction. In these ways, we anticipate new synergies between (apparently) disembodied and (essentially) embodied forms of intelligence. This article is part of the theme issue 'Minds in movement: embodied cognition in the age of artificial intelligence'.
Affiliation(s)
- Laura Desirèe Di Paolo: Department of Engineering and Informatics, The University of Sussex, Brighton, UK; School of Psychology, Children & Technology Lab, The University of Sussex, Falmer (Brighton), UK
- Ben White: Department of Philosophy, The University of Sussex, Sussex, UK
- Avel Guénin-Carlut: Department of Engineering and Informatics, The University of Sussex, Brighton, UK
- Axel Constant: Department of Engineering and Informatics, The University of Sussex, Brighton, UK
- Andy Clark: Department of Engineering and Informatics, The University of Sussex, Brighton, UK; Department of Philosophy, The University of Sussex, Sussex, UK; Department of Philosophy, Macquarie University, Sydney, New South Wales, Australia
3
Wang A, Zhou J, Zhang P, Cao H, Xin H, Xu X, Zhou H. Large language model answers medical questions about standard pathology reports. Front Med (Lausanne) 2024; 11:1402457. [PMID: 39359921 PMCID: PMC11445125 DOI: 10.3389/fmed.2024.1402457]
Abstract
This study aims to evaluate the feasibility of a large language model (LLM) in answering pathology questions based on pathology reports (PRs) of colorectal cancer (CRC). Four common questions (CQs) and corresponding answers about pathology were retrieved from public webpages. These questions were input as prompts for Chat Generative Pretrained Transformer (ChatGPT) (gpt-3.5-turbo). The quality indicators (understanding, scientificity, satisfaction) of all answers were evaluated by gastroenterologists. Standard PRs from 5 CRC patients who received radical surgeries in Shanghai Changzheng Hospital were selected. Six report questions (RQs) and corresponding answers were generated by a gastroenterologist and a pathologist. We developed an interactive PR interpretation system that allows users to upload standard PRs as JPG images, and ChatGPT's responses to the RQs were then generated. The quality indicators of all answers were evaluated by gastroenterologists and outpatients. For the CQs, gastroenterologists rated AI answers similarly to non-AI answers in understanding, scientificity, and satisfaction. For RQ1-3, gastroenterologists and patients rated the AI mean scores higher than the non-AI scores across the quality indicators. However, for RQ4-6, gastroenterologists rated the AI mean scores lower than the non-AI scores in understanding and satisfaction. In RQ4, gastroenterologists rated the AI scores lower than the non-AI scores in scientificity (P = 0.011); patients rated the AI scores lower than the non-AI scores in understanding (P = 0.004) and satisfaction (P = 0.011). In conclusion, the LLM could generate credible answers to common pathology questions and conceptual questions on the PRs. It holds great potential for improving doctor-patient communication.
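The abstract does not state which statistical test produced the reported p-values; for ordinal Likert-type quality ratings, a nonparametric comparison such as the Mann-Whitney U test is one common choice. The sketch below, with made-up ratings rather than the study's data, shows how such a comparison might be run in Python.

```python
from scipy.stats import mannwhitneyu

# Hypothetical 1-5 quality ratings for the same question set (illustrative only).
ai_scores     = [5, 4, 4, 5, 3, 4, 5, 4, 3, 4]   # ChatGPT-generated answers
non_ai_scores = [3, 4, 3, 3, 4, 3, 2, 4, 3, 3]   # clinician-written answers

stat, p_value = mannwhitneyu(ai_scores, non_ai_scores, alternative="two-sided")
print(f"Mann-Whitney U = {stat:.1f}, p = {p_value:.3f}")
```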
Affiliation(s)
- Anqi Wang: Division of Colorectal Surgery, Changzheng Hospital, Navy Medical University, Shanghai, China
- Jieli Zhou: UM-SJTU Joint Institute, Shanghai Jiao Tong University, Shanghai, China
- Peng Zhang: Division of Colorectal Surgery, Changzheng Hospital, Navy Medical University, Shanghai, China
- Haotian Cao: Department of Pathology, Changzheng Hospital, Navy Medical University, Shanghai, China
- Hongyi Xin: UM-SJTU Joint Institute, Shanghai Jiao Tong University, Shanghai, China
- Xinyun Xu: Division of Breast and Thyroid Surgery, Changzheng Hospital, Navy Medical University, Shanghai, China
- Haiyang Zhou: Division of Colorectal Surgery, Changzheng Hospital, Navy Medical University, Shanghai, China
4
Filetti S, Fenza G, Gallo A. Research design and writing of scholarly articles: new artificial intelligence tools available for researchers. Endocrine 2024; 85:1104-1116. [PMID: 39085566 DOI: 10.1007/s12020-024-03977-z]
5
Kim SE, Lee JH, Choi BS, Han HS, Lee MC, Ro DH. Performance of ChatGPT on Solving Orthopedic Board-Style Questions: A Comparative Analysis of ChatGPT 3.5 and ChatGPT 4. Clin Orthop Surg 2024; 16:669-673. [PMID: 39092297 PMCID: PMC11262944 DOI: 10.4055/cios23179]
Abstract
Background: The application of artificial intelligence and large language models in the medical field requires an evaluation of their accuracy in providing medical information. This study aimed to assess the performance of Chat Generative Pre-trained Transformer (ChatGPT) models 3.5 and 4 in solving orthopedic board-style questions. Methods: A total of 160 text-only questions from the Orthopedic Surgery Department at Seoul National University Hospital, conforming to the format of the Korean Orthopedic Association board certification examinations, were input into the ChatGPT 3.5 and ChatGPT 4 programs. The questions were divided into 11 subcategories. The accuracy rates of the initial answers provided by ChatGPT 3.5 and ChatGPT 4 were analyzed. In addition, inconsistency rates of answers were evaluated by regenerating the responses. Results: ChatGPT 3.5 answered 37.5% of the questions correctly, while ChatGPT 4 showed an accuracy rate of 60.0% (p < 0.001). ChatGPT 4 demonstrated superior performance across most subcategories, except for the tumor-related questions. The rates of inconsistency in answers were 47.5% for ChatGPT 3.5 and 9.4% for ChatGPT 4. Conclusions: ChatGPT 4 showed the ability to pass orthopedic board-style examinations, outperforming ChatGPT 3.5 in accuracy rate. However, inconsistencies in response generation and instances of incorrect answers with misleading explanations require caution when applying ChatGPT in clinical settings or for educational purposes.
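Because both models answered the same 160 questions, their accuracy rates form paired binary outcomes, and McNemar's test is one standard way to compare them; the abstract does not state which test was actually used. In the sketch below the marginal totals are chosen to match the reported 37.5% and 60.0%, but the split between the discordant cells is an assumption for illustration only.

```python
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical 2x2 table of question-level outcomes (160 questions):
# rows = ChatGPT 3.5 correct / incorrect, cols = ChatGPT 4 correct / incorrect
table = [[55,  5],    # both correct / only ChatGPT 3.5 correct
         [41, 59]]    # only ChatGPT 4 correct / both incorrect

result = mcnemar(table, exact=True)   # exact binomial test on the discordant cells
print(f"statistic = {result.statistic}, p = {result.pvalue:.4f}")
```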
Affiliation(s)
- Sung Eun Kim, Ji Han Lee, Byung Sun Choi, Hyuk-Soo Han, Myung Chul Lee, Du Hyun Ro: Department of Orthopedic Surgery, Seoul National University Hospital, Seoul National University College of Medicine, Seoul, Korea
6
Yaïci R, Cieplucha M, Bock R, Moayed F, Bechrakis NE, Berens P, Feltgen N, Friedburg D, Gräf M, Guthoff R, Hoffmann EM, Hoerauf H, Hintschich C, Kohnen T, Messmer EM, Nentwich MM, Pleyer U, Schaudig U, Seitz B, Geerling G, Roth M. [ChatGPT and the German board examination for ophthalmology: an evaluation]. Die Ophthalmologie 2024; 121:554-564. [PMID: 38801461 DOI: 10.1007/s00347-024-02046-0]
Abstract
PURPOSE In recent years, artificial intelligence (AI), as a new segment of computer science, has become increasingly important in medicine. The aim of this project was to investigate whether the current version of ChatGPT (ChatGPT 4.0) is able to answer open questions that could be asked in the context of a German board examination in ophthalmology. METHODS After excluding image-based questions, 10 questions from each of 15 different chapters/topics were selected from the textbook 1000 questions in ophthalmology (1000 Fragen Augenheilkunde, 2nd edition, 2014). ChatGPT was instructed by means of a prompt to assume the role of a board-certified ophthalmologist and to concentrate on the essentials when answering. A human expert with considerable expertise in the respective topic evaluated the answers regarding their correctness, relevance, and internal coherence. Additionally, the overall performance was rated with school grades, and it was assessed whether the answers would have been sufficient to pass the ophthalmology board examination. RESULTS ChatGPT would have passed the board examination in 12 out of 15 topics. The overall performance, however, was limited, with only 53.3% completely correct answers. While the correctness of the results in the different topics was highly variable (uveitis and lens/cataract 100%; optics and refraction 20%), the answers consistently had a high thematic fit (70%) and internal coherence (71%). CONCLUSION The fact that ChatGPT 4.0 would have passed the specialist examination in 12 out of 15 topics is remarkable considering that this AI was not specifically trained for medical questions; however, there is considerable performance variability between topics, with some serious shortcomings that currently rule out its safe use in clinical practice.
Affiliation(s)
- Rémi Yaïci, M Cieplucha, R Bock, F Moayed, R Guthoff, G Geerling, M Roth: Klinik für Augenheilkunde, Medizinische Fakultät, Universitätsklinikum Düsseldorf, Heinrich-Heine Universität Düsseldorf, Moorenstr. 5, 40225, Düsseldorf, Deutschland
- N E Bechrakis: Augenklinik, Universitätsklinikum Essen, Essen, Deutschland
- P Berens: Hertie Institute for AI in Brain Health (Hertie AI), Tübingen, Deutschland
- N Feltgen: Augenklinik, Universitätsspital Basel, Basel, Schweiz
- M Gräf: Universitätsklinikum Gießen und Marburg, Marburg, Gießen, Deutschland
- E M Hoffmann: Augenklinik, Universitätsklinikum Mainz, Mainz, Deutschland
- H Hoerauf: Augenklinik, Universitätsklinikum Göttingen, Göttingen, Deutschland
- C Hintschich, E M Messmer: Augenklinik und Poliklinik, LMU Klinikum, Ludwigs-Maximilians-Universität München, München, Deutschland
- T Kohnen: Augenklinik, Universitätsklinikum Frankfurt, Frankfurt, Deutschland
- M M Nentwich: Augenklinik, Universitätsklinikum Würzburg, Würzburg, Deutschland
- U Pleyer: Charité - Universitätsmedizin Berlin, Berlin, Deutschland
- U Schaudig: Asklepios Klinik Barmbek, Hamburg, Deutschland
- B Seitz: Klinik für Augenheilkunde, Universitätsklinikum des Saarlandes, Homburg, Deutschland
7
Ahmed W, Zaidat B, Duey A, Saturno M, Cho S. Answer to the Letter to the Editor of G. Shen, et al. concerning "ChatGPT versus NASS clinical guidelines for degenerative spondylolisthesis: a comparative analysis" by Ahmed W, et al. (Eur Spine J [2024]: doi:10.1007/s00586-024-08198-6). Eur Spine J 2024; 33:2920. [PMID: 38695950 DOI: 10.1007/s00586-024-08282-x]
Affiliation(s)
- Wasil Ahmed, Bashar Zaidat, Akiro Duey, Michael Saturno, Samuel Cho: Department of Orthopedics, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, New York, NY, 10029, USA
8
Shin E, Yu Y, Bies RR, Ramanathan M. Evaluation of ChatGPT and Gemini large language models for pharmacometrics with NONMEM. J Pharmacokinet Pharmacodyn 2024; 51:187-197. [PMID: 38656706 DOI: 10.1007/s10928-024-09921-y]
Abstract
To assess the ChatGPT 4.0 (ChatGPT) and Gemini Ultra 1.0 (Gemini) large language models on NONMEM coding tasks relevant to pharmacometrics and clinical pharmacology. ChatGPT and Gemini were assessed on tasks mimicking real-world applications of NONMEM. The tasks ranged from providing a curriculum for learning NONMEM and an overview of NONMEM code structure to generating code. Prompts in lay language were used to elicit NONMEM code for a linear pharmacokinetic (PK) model with oral administration and for a more complex model with two parallel first-order absorption mechanisms. Reproducibility and the impact of "temperature" hyperparameter settings were assessed. The code was reviewed by two NONMEM experts. ChatGPT and Gemini provided NONMEM curriculum structures combining foundational knowledge with advanced concepts (e.g., covariate modeling and Bayesian approaches) and practical skills, including NONMEM code structure and syntax. ChatGPT provided an informative summary of the NONMEM control stream structure and outlined the key NONMEM Translator (NM-TRAN) records needed. ChatGPT and Gemini were able to generate code blocks for the NONMEM control stream from the lay language prompts for the two coding tasks. The control streams contained focal structural and syntax errors that required revision before they could be executed without errors and warnings. The code output from ChatGPT and Gemini was not reproducible, and varying the temperature hyperparameter did not reduce the errors and omissions substantively. Large language models may be useful in pharmacometrics for efficiently generating an initial coding template for modeling projects. However, the output can contain errors and omissions that require correction.
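For orientation, the "linear pharmacokinetic model with oral administration" referred to above corresponds to the standard one-compartment model with first-order absorption and elimination, whose concentration-time course is C(t) = F*Dose*ka / (V*(ka - ke)) * (exp(-ke*t) - exp(-ka*t)). The Python sketch below evaluates that profile for illustrative parameter values; it is not the NONMEM code the study elicited, and all numbers are assumptions.

```python
import numpy as np

def oral_one_compartment(t, dose=100.0, F=1.0, ka=1.5, ke=0.2, V=30.0):
    """Concentration-time profile (mg/L) for a one-compartment model with
    first-order absorption and elimination; parameters are illustrative."""
    return (F * dose * ka) / (V * (ka - ke)) * (np.exp(-ke * t) - np.exp(-ka * t))

t = np.linspace(0, 24, 49)            # hours
conc = oral_one_compartment(t)
print(f"Cmax ~ {conc.max():.2f} mg/L at t ~ {t[conc.argmax()]:.1f} h")
```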
Affiliation(s)
- Euibeom Shin, Yifan Yu, Robert R Bies, Murali Ramanathan: Department of Pharmaceutical Sciences, University at Buffalo, The State University of New York, Buffalo, NY, 14214-8033, USA
9
Paul S, Govindaraj S, Jk J. ChatGPT Versus National Eligibility cum Entrance Test for Postgraduate (NEET PG). Cureus 2024; 16:e63048. [PMID: 39050297 PMCID: PMC11268980 DOI: 10.7759/cureus.63048]
Abstract
Introduction With both suspicion and excitement, artificial intelligence tools are being integrated into nearly every aspect of human existence, including medical sciences and medical education. The newest large language model (LLM) in the class of autoregressive language models is ChatGPT. While ChatGPT's potential to revolutionize clinical practice and medical education is under investigation, further research is necessary to understand its strengths and limitations in this field comprehensively. Methods Two hundred National Eligibility cum Entrance Test for Postgraduate 2023 questions were gathered from various public education websites and individually entered into Microsoft Bing (GPT-4 Version 2.2.1). Microsoft Bing Chatbot is currently the only platform incorporating all of GPT-4's multimodal features, including image recognition. The results were subsequently analyzed. Results Out of 200 questions, ChatGPT-4 answered 129 correctly. The most tested specialties were medicine (15%), obstetrics and gynecology (15%), general surgery (14%), and pathology (10%), respectively. Conclusion This study sheds light on how well the GPT-4 performs in addressing the NEET-PG entrance test. ChatGPT has potential as an adjunctive instrument within medical education and clinical settings. Its capacity to react intelligently and accurately in complicated clinical settings demonstrates its versatility.
Affiliation(s)
- Sam Paul: General Surgery, St John's Medical College Hospital, Bengaluru, IND
- Sridar Govindaraj: Surgical Gastroenterology and Laparoscopy, St John's Medical College Hospital, Bengaluru, IND
- Jerisha Jk: Pediatrics and Neonatology, Christian Medical College Ludhiana, Ludhiana, IND
10
Uygun İlikhan S, Özer M, Tanberkan H, Bozkurt V. How to mitigate the risks of deployment of artificial intelligence in medicine? Turk J Med Sci 2024; 54:483-492. [PMID: 39050000 PMCID: PMC11265878 DOI: 10.55730/1300-0144.5814]
Abstract
The aim of this study is to examine the risks associated with the use of artificial intelligence (AI) in medicine and to offer policy suggestions to reduce these risks and optimize the benefits of AI technology. AI is a multifaceted technology. If harnessed effectively, it has the capacity to significantly impact the future of humanity in the field of health, as well as in several other areas. However, the rapid spread of this technology also raises significant ethical, legal, and social issues. This study examines the potential dangers of AI integration in medicine by reviewing current scientific work and exploring strategies to mitigate these risks. Biases in the data sets used by AI systems can lead to inequities in health care: training data that narrowly represents a particular demographic group can lead to biased results from AI systems for those who do not belong to that group. In addition, the concepts of explainability and accountability in AI systems could create challenges for healthcare professionals in understanding and evaluating AI-generated diagnoses or treatment recommendations. This could jeopardize patient safety and lead to the selection of inappropriate treatments. Ensuring the security of personal health information will be critical as AI systems become more widespread. Therefore, improving patient privacy and security protocols for AI systems is imperative. This study offers suggestions for reducing the risks associated with the increasing use of AI systems in the medical sector. These include increasing AI literacy, implementing a participatory society-in-the-loop management strategy, and creating ongoing education and auditing systems. Integrating ethical principles and cultural values into the design of AI systems can help reduce healthcare disparities and improve patient care. Implementing these recommendations will help ensure the efficient and equitable use of AI systems in medicine, improve the quality of healthcare services, and ensure patient safety.
Affiliation(s)
- Sevil Uygun İlikhan: Department of Internal Medicine Sciences, Gülhane Faculty of Medicine, University of Health Sciences, Ankara, Turkiye
- Mahmut Özer: Commission of National Education, Culture, Youth and Sports of the Parliament, Ankara, Turkiye
- Veysel Bozkurt: Department of Economic Sociology, Faculty of Economics, İstanbul University, İstanbul, Turkiye
11
Khan AA, Yunus R, Sohail M, Rehman TA, Saeed S, Bu Y, Jackson CD, Sharkey A, Mahmood F, Matyal R. Artificial Intelligence for Anesthesiology Board-Style Examination Questions: Role of Large Language Models. J Cardiothorac Vasc Anesth 2024; 38:1251-1259. [PMID: 38423884 DOI: 10.1053/j.jvca.2024.01.032]
Abstract
New artificial intelligence tools have been developed that have implications for medical usage. Large language models (LLMs), such as the widely used ChatGPT developed by OpenAI, have not been explored in the context of anesthesiology education. Understanding the reliability of various publicly available LLMs for medical specialties could offer insight into their understanding of the physiology, pharmacology, and practical applications of anesthesiology. An exploratory prospective review was conducted using 3 commercially available LLMs--OpenAI's ChatGPT GPT-3.5 version (GPT-3.5), OpenAI's ChatGPT GPT-4 (GPT-4), and Google's Bard--on questions from a widely used anesthesia board examination review book. Of the 884 eligible questions, the overall correct answer rates were 47.9% for GPT-3.5, 69.4% for GPT-4, and 45.2% for Bard. GPT-4 exhibited significantly higher performance than both GPT-3.5 and Bard (p = 0.001 and p < 0.001, respectively). None of the LLMs met the criteria required to secure American Board of Anesthesiology certification, according to the 70% passing score approximation. GPT-4 significantly outperformed GPT-3.5 and Bard in terms of overall performance, but lacked consistency in providing explanations that aligned with scientific and medical consensus. Although GPT-4 shows promise, current LLMs are not sufficiently advanced to answer anesthesiology board examination questions with passing success. Further iterations and domain-specific training may enhance their utility in medical education.
Affiliation(s)
- Adnan A Khan, Rayaan Yunus, Mahad Sohail, Taha A Rehman, Shirin Saeed, Yifan Bu, Cullen D Jackson, Aidan Sharkey, Feroze Mahmood, Robina Matyal: Department of Anesthesia, Critical Care, and Pain Medicine, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, MA
12
Shin E, Ramanathan M. Evaluation of prompt engineering strategies for pharmacokinetic data analysis with the ChatGPT large language model. J Pharmacokinet Pharmacodyn 2024; 51:101-108. [PMID: 37952004 DOI: 10.1007/s10928-023-09892-6]
Abstract
To systematically assess the ChatGPT large language model on diverse tasks relevant to pharmacokinetic data analysis. ChatGPT was evaluated with prototypical tasks related to report writing, code generation, non-compartmental analysis, and pharmacokinetic word problems. The writing task consisted of writing an introduction for this paper from a draft title. The coding tasks consisted of generating R code for semi-logarithmic graphing of concentration-time profiles and calculating the area under the curve and the area under the moment curve from time zero to infinity. Pharmacokinetics word problems on single intravenous, extravascular bolus, and multiple dosing were taken from a pharmacokinetics textbook. Chain-of-thought and problem separation were assessed as prompt engineering strategies when errors occurred. ChatGPT showed satisfactory performance on the report writing and code generation tasks and provided accurate information on the principles and methods underlying pharmacokinetic data analysis. However, ChatGPT had high error rates in numerical calculations involving exponential functions. The outputs generated by ChatGPT were not reproducible: the precise content of the output was variable, albeit not necessarily erroneous, for different instances of the same prompt. Incorporation of prompt engineering strategies reduced but did not eliminate errors in numerical calculations. ChatGPT has the potential to become a powerful productivity tool for writing, knowledge encapsulation, and coding tasks in pharmacokinetic data analysis. The poor accuracy of ChatGPT in numerical calculations requires resolution before it can be reliably used for PK and pharmacometrics data analysis.
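As context for the non-compartmental analysis task mentioned above, AUC and AUMC from time zero to infinity are typically obtained by trapezoidal integration up to the last sampling time plus an extrapolated tail based on the terminal rate constant: AUC_inf = AUC_last + C_last/lambda_z and AUMC_inf = AUMC_last + t_last*C_last/lambda_z + C_last/lambda_z**2. The Python sketch below uses illustrative data, not the prompts or outputs from the study.

```python
import numpy as np

# Illustrative concentration-time data (hours, mg/L)
t = np.array([0.0, 0.5, 1, 2, 4, 8, 12, 24])
c = np.array([0.0, 2.2, 3.5, 3.1, 2.0, 0.9, 0.45, 0.06])

# Linear trapezoidal rule up to the last observation
auc_last = np.sum(np.diff(t) * (c[1:] + c[:-1]) / 2)
aumc_last = np.sum(np.diff(t) * (c[1:] * t[1:] + c[:-1] * t[:-1]) / 2)

# Terminal rate constant from log-linear regression over the last 3 points
slope, _ = np.polyfit(t[-3:], np.log(c[-3:]), 1)
lz = -slope

auc_inf = auc_last + c[-1] / lz
aumc_inf = aumc_last + t[-1] * c[-1] / lz + c[-1] / lz**2
print(f"AUC_inf = {auc_inf:.2f} mg*h/L, MRT = {aumc_inf / auc_inf:.2f} h")
```

The linear rule is used here only for brevity; log-linear or mixed trapezoidal rules are often preferred for the declining phase.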
Affiliation(s)
- Euibeom Shin, Murali Ramanathan: Department of Pharmaceutical Sciences, University at Buffalo, The State University of New York, 355 Pharmacy, Buffalo, NY, 14214-8033, USA
13
Javid M, Bhandari M, Parameshwari P, Reddiboina M, Prasad S. Evaluation of ChatGPT for Patient Counseling in Kidney Stone Clinic: A Prospective Study. J Endourol 2024; 38:377-383. [PMID: 38411835 DOI: 10.1089/end.2023.0571]
Abstract
Introduction: Large language models (LLMs) have the potential to improve clinical workflow and make patient care more efficient. We prospectively evaluated the performance of the LLM ChatGPT as a patient counseling tool in the urology stone clinic and compared the generated responses with those of urologists. Methods: We collected 61 questions from 12 kidney stone patients and posed them to ChatGPT and to a panel of experienced urologists (Level 1). Subsequently, the blinded responses of the urologists and ChatGPT were presented to two expert urologists (Level 2) for comparative evaluation on preset domains: accuracy, relevance, empathy, completeness, and practicality. All responses were rated on a Likert scale of 1 to 10 for psychometric response evaluation. The mean difference in the scores given by the Level 2 urologists was analyzed, and interrater reliability (IRR), the level of agreement between the Level 2 urologists, was assessed with Cohen's kappa. Results: The mean scores of the responses from ChatGPT and the urologists differed significantly in accuracy (p < 0.001), empathy (p < 0.001), completeness (p < 0.001), and practicality (p < 0.001), but not in the relevance domain (p = 0.051), with ChatGPT's responses being rated higher. The IRR analysis revealed significant agreement only in the empathy domain [k = 0.163 (0.059-0.266)]. Conclusion: We believe the introduction of ChatGPT into the clinical workflow could further optimize the information provided to patients in a busy stone clinic. In this preliminary study, ChatGPT supplemented the answers provided by the urologists, adding value to the conversation. However, in its current state, it is still not ready to be a direct source of authentic information for patients. We recommend its use as a source to build a comprehensive Frequently Asked Questions bank as a prelude to developing an LLM chatbot for patient counseling.
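For readers unfamiliar with the IRR metric used above, Cohen's kappa measures agreement between two raters beyond chance; with ordinal Likert scores a weighted variant is often preferred. A small sketch using scikit-learn with made-up ratings, not the study's data:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical 1-10 Likert ratings from two expert reviewers for ten responses
rater_a = [8, 7, 9, 6, 7, 8, 5, 9, 6, 7]
rater_b = [7, 7, 8, 6, 6, 8, 6, 9, 5, 7]

kappa = cohen_kappa_score(rater_a, rater_b)                          # unweighted
kappa_w = cohen_kappa_score(rater_a, rater_b, weights="quadratic")   # weighted for ordinal data
print(f"kappa = {kappa:.3f}, quadratic-weighted kappa = {kappa_w:.3f}")
```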
Affiliation(s)
- Mohamed Javid: Department of Urology, Chengalpattu Medical College, Chengalpattu, Tamil Nadu, India
- Mahendra Bhandari: Vattikuti Urology Institute, Henry Ford Hospital, Detroit, Michigan, USA
- P Parameshwari: Department of Community Medicine, Chengalpattu Medical College, Chengalpattu, Tamil Nadu, India
- Srikala Prasad: Department of Urology, Chengalpattu Medical College, Chengalpattu, Tamil Nadu, India
14
Abou-Abdallah M, Dar T, Mahmudzade Y, Michaels J, Talwar R, Tornari C. The quality and readability of patient information provided by ChatGPT: can AI reliably explain common ENT operations? Eur Arch Otorhinolaryngol 2024. [PMID: 38530460 DOI: 10.1007/s00405-024-08598-w]
Abstract
PURPOSE Access to high-quality and comprehensible patient information is crucial. However, information provided by increasingly prevalent Artificial Intelligence tools has not been thoroughly investigated. This study assesses the quality and readability of information from ChatGPT regarding three index ENT operations: tonsillectomy, adenoidectomy, and grommets. METHODS We asked ChatGPT standard and simplified questions. Readability was calculated using Flesch-Kincaid Reading Ease Score (FRES), Flesch-Kincaid Grade Level (FKGL), Gunning Fog Index (GFI) and Simple Measure of Gobbledygook (SMOG) scores. We assessed quality using the DISCERN instrument and compared these with ENT UK patient leaflets. RESULTS ChatGPT readability was poor, with mean FRES of 38.9 and 55.1 pre- and post-simplification, respectively. Simplified information from ChatGPT was 43.6% more readable (FRES) but scored 11.6% lower for quality. ENT UK patient information readability and quality were consistently higher. CONCLUSIONS ChatGPT can simplify information at the expense of quality, resulting in shorter answers with important omissions. Limitations in knowledge and insight curb its reliability for healthcare information. Patients should use reputable sources from professional organisations alongside clear communication with their clinicians for well-informed consent and decision-making.
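For context, the Flesch Reading Ease Score and Flesch-Kincaid Grade Level reported above are simple functions of sentence length and syllable counts: FRES = 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words), and FKGL = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59. The sketch below uses a naive vowel-group syllable counter, so its values will only approximate those from dedicated readability tools (for example, the textstat package also covers GFI and SMOG).

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count groups of consecutive vowels (minimum 1 per word).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str):
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps, spw = len(words) / sentences, syllables / len(words)
    fres = 206.835 - 1.015 * wps - 84.6 * spw
    fkgl = 0.39 * wps + 11.8 * spw - 15.59
    return fres, fkgl

sample = "A grommet is a tiny tube placed in the eardrum. It lets air enter the middle ear."
fres, fkgl = readability(sample)
print(f"FRES = {fres:.1f}, FKGL = {fkgl:.1f}")
```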
Affiliation(s)
- Michel Abou-Abdallah, Talib Dar, Joshua Michaels, Rishi Talwar, Chrysostomos Tornari: Ear, Nose and Throat Department, Luton and Dunstable University Hospital, Lewsey Rd, Luton, LU4 0DZ, UK
- Yasamin Mahmudzade: Foundation Programme, East and North Hertfordshire NHS Trust, Stevenage, UK
15
Sandmann S, Riepenhausen S, Plagwitz L, Varghese J. Systematic analysis of ChatGPT, Google search and Llama 2 for clinical decision support tasks. Nat Commun 2024; 15:2050. [PMID: 38448475 PMCID: PMC10917796 DOI: 10.1038/s41467-024-46411-8]
Abstract
It is likely that individuals are turning to Large Language Models (LLMs) to seek health advice, much like searching for diagnoses on Google. We evaluate the clinical accuracy of GPT-3.5 and GPT-4 for suggesting initial diagnosis, examination steps and treatment of 110 medical cases across diverse clinical disciplines. Moreover, two model configurations of the Llama 2 open source LLMs are assessed in a sub-study. For benchmarking the diagnostic task, we conduct a naïve Google search for comparison. Overall, GPT-4 performed best, with superior performance over GPT-3.5 for diagnosis and examination and superior performance over Google for diagnosis. Except for treatment, better performance on frequent vs rare diseases is evident for all three approaches. The sub-study indicates slightly lower performances for Llama models. In conclusion, the commercial LLMs show growing potential for medical question answering in two successive major releases. However, some weaknesses underscore the need for robust and regulated AI models in health care. Open source LLMs can be a viable option to address specific needs regarding data privacy and transparency of training.
Affiliation(s)
- Sarah Sandmann, Sarah Riepenhausen, Lucas Plagwitz, Julian Varghese: Institute of Medical Informatics, University of Münster, Münster, Germany
16
Romano MF, Shih LC, Paschalidis IC, Au R, Kolachalama VB. Large Language Models in Neurology Research and Future Practice. Neurology 2023; 101:1058-1067. [PMID: 37816646 PMCID: PMC10752640 DOI: 10.1212/wnl.0000000000207967]
Abstract
Recent advancements in generative artificial intelligence, particularly using large language models (LLMs), are gaining increased public attention. We provide a perspective on the potential of LLMs to analyze enormous amounts of data from medical records and gain insights on specific topics in neurology. In addition, we explore use cases for LLMs, such as early diagnosis, supporting patients and caregivers, and acting as an assistant for clinicians. We point to the potential ethical and technical challenges raised by LLMs, such as concerns about privacy and data security, potential biases in the data for model training, and the need for careful validation of results. Researchers must consider these challenges and take steps to address them to ensure that their work is conducted in a safe and responsible manner. Despite these challenges, LLMs offer promising opportunities for improving care and treatment of various neurologic disorders.
Affiliation(s)
- Michael F Romano, Ludy C Shih, Ioannis C Paschalidis, Rhoda Au, Vijaya B Kolachalama: From the Department of Medicine (M.F.R., R.A., V.B.K.), Boston University Chobanian & Avedisian School of Medicine, MA; Department of Radiology and Biomedical Imaging (M.F.R.), University of California, San Francisco; Department of Neurology (L.C.S., R.A.), Boston University Chobanian & Avedisian School of Medicine; Department of Electrical and Computer Engineering (I.C.P.), Division of Systems Engineering, and Department of Biomedical Engineering; Faculty of Computing and Data Sciences (I.C.P., V.B.K.), Boston University; Department of Anatomy and Neurobiology (R.A.); The Framingham Heart Study, Boston University Chobanian & Avedisian School of Medicine; Department of Epidemiology, Boston University School of Public Health; Boston University Alzheimer's Disease Research Center (R.A.); and Department of Computer Science (V.B.K.), Boston University, MA
17
Tay JQ. Re: Online patient education in body contouring: A comparison between Google and ChatGPT. J Plast Reconstr Aesthet Surg 2023; 87:440-441. [PMID: 37944454 DOI: 10.1016/j.bjps.2023.10.121]
Affiliation(s)
- Jing Qin Tay: Plastic, Burns and Reconstructive Surgery Department, Salisbury District Hospital, Thames Valley/Wessex Deanery, UK
18
Gilvaz VJ, Reginato AM. Artificial intelligence in rheumatoid arthritis: potential applications and future implications. Front Med (Lausanne) 2023; 10:1280312. [PMID: 38034534 PMCID: PMC10687464 DOI: 10.3389/fmed.2023.1280312]
Abstract
The widespread adoption of digital health records, coupled with the rise of advanced diagnostic testing, has resulted in an explosion of patient data, comparable in scope to genomic datasets. This vast information repository offers significant potential for improving patient outcomes and decision-making, provided one can extract meaningful insights from it. This is where artificial intelligence (AI) tools like machine learning (ML) and deep learning come into play, helping us leverage these enormous datasets to predict outcomes and make informed decisions. AI models can be trained to analyze and interpret patient data, including physician notes, laboratory testing, and imaging, to aid in the management of patients with rheumatic diseases. As one of the most common autoimmune diseases, rheumatoid arthritis (RA) has attracted considerable attention, particularly concerning the evolution of diagnostic techniques and therapeutic interventions. Our aim is to underscore those areas where AI, according to recent research, demonstrates promising potential to enhance the management of patients with RA.
Affiliation(s)
- Vinit J. Gilvaz: Division of Rheumatology, Department of Medicine, Rhode Island Hospital, Warren Alpert Medical School of Brown University, Providence, RI, United States
- Anthony M. Reginato: Division of Rheumatology, Department of Medicine, Rhode Island Hospital, Warren Alpert Medical School of Brown University, Providence, RI, United States; Department of Dermatology, Rhode Island Hospital, Warren Alpert Medical School of Brown University, Providence, RI, United States
19
Wu RT, Dang RR. ChatGPT in head and neck scientific writing: A precautionary anecdote. Am J Otolaryngol 2023; 44:103980. [PMID: 37459740 DOI: 10.1016/j.amjoto.2023.103980]
Abstract
PURPOSE To evaluate the accuracy of ChatGPT references in scientific writing relevant to head and neck surgery. MATERIALS AND METHODS Five commonly researched keywords relevant to head and neck surgery were selected (osteoradionecrosis of the jaws, oral cancer, adjuvant therapy for oral cancer, TORS, and free flap reconstruction in oral cancer). The AI chatbot was then asked to provide ten complete citations for each of the keywords. Two independent authors reviewed the results for accuracy and assigned each article a numerical score based on pre-selected criteria. RESULTS Among 50 total references provided by ChatGPT, only five (10%) were found to have the correct title, journal, authors, year of publication, and DOI. Merely 14% of the presented references had a correct DOI. References regarding free flap reconstruction for oral cancer were the least accurate of the five categories, with no correct DOIs. Complete inter-rater agreement was noted while evaluating the citations. CONCLUSION Only 10% of the articles provided by ChatGPT relevant to head and neck surgery were correct. A high degree of academic hallucination was noted.
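One practical way to screen model-generated references like those evaluated above is to look each DOI up in a bibliographic registry. The sketch below assumes the public Crossref REST API (https://api.crossref.org) and the requests package; it simply checks whether a DOI resolves and whether the registered title loosely matches the claimed one, and is an illustration rather than the scoring procedure used in the study.

```python
import requests

def check_doi(doi: str, claimed_title: str) -> bool:
    """Return True if the DOI is registered in Crossref and its title loosely matches."""
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    if resp.status_code != 200:
        return False                        # DOI not registered
    titles = resp.json()["message"].get("title") or []
    if not titles:
        return False
    registered, claimed = titles[0].lower(), claimed_title.lower()
    return claimed in registered or registered in claimed

# Example: verify the citation for this very entry
print(check_doi("10.1016/j.amjoto.2023.103980",
                "ChatGPT in head and neck scientific writing: A precautionary anecdote"))
```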
Affiliation(s)
- Robin T Wu: Department of Plastic and Reconstruction Surgery, Chang Gung Memorial Hospital, Linkou, Taiwan; Division of Plastic and Reconstructive Surgery, Department of Surgery, Stanford University Hospital, Stanford, CA, USA
- Rushil R Dang: Department of Plastic and Reconstruction Surgery, Chang Gung Memorial Hospital, Linkou, Taiwan; Former fellow, Maxillofacial Oncology and Reconstructive Surgery, Department of Oral and Maxillofacial Surgery, Boston Medical Center, Boston, MA, USA
20
Ordak M. ChatGPT's Skills in Statistical Analysis Using the Example of Allergology: Do We Have Reason for Concern? Healthcare (Basel) 2023; 11:2554. [PMID: 37761751 PMCID: PMC10530997 DOI: 10.3390/healthcare11182554]
Abstract
BACKGROUND Content generated by artificial intelligence is sometimes not truthful. To date, there have been a number of medical studies related to the validity of ChatGPT's responses; however, there is a lack of studies addressing various aspects of statistical analysis. The aim of this study was to assess the validity of the answers provided by ChatGPT in relation to statistical analysis, as well as to identify recommendations to be implemented in the future in connection with the results obtained. METHODS The study was divided into four parts and was based on the exemplary medical field of allergology. The first part consisted of asking ChatGPT 30 different questions related to statistical analysis. The next five questions included a request for ChatGPT to perform the relevant statistical analyses, and another five requested ChatGPT to indicate which statistical test should be applied to articles accepted for publication in Allergy. The final part of the survey involved asking ChatGPT the same statistical question three times. RESULTS ChatGPT did not fully answer half of the 40 general questions on broad statistical analysis. The assumptions necessary for the application of specific statistical tests were not included. ChatGPT also gave completely divergent answers to one question about which test should be used. CONCLUSION The answers provided by ChatGPT to various statistical questions may give rise to the use of inappropriate statistical tests and, consequently, to misinterpretation of the research results obtained. Questions asked in this regard need to be framed more precisely.
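The point about omitted assumptions can be made concrete: before comparing two groups, one typically checks, for example, normality, and falls back to a nonparametric test when the assumption fails. A minimal Python sketch of that decision with simulated data; the specific tests and thresholds are illustrative choices, not a reconstruction of the questions posed to ChatGPT.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(50, 10, size=30)        # simulated, roughly normal measurements
group_b = rng.lognormal(3.9, 0.4, size=30)   # simulated, skewed measurements

# Check the normality assumption in each group before choosing the test
normal = all(stats.shapiro(g).pvalue > 0.05 for g in (group_a, group_b))
if normal:
    name, result = "Welch t-test", stats.ttest_ind(group_a, group_b, equal_var=False)
else:
    name, result = "Mann-Whitney U", stats.mannwhitneyu(group_a, group_b)
print(f"{name}: statistic = {result.statistic:.2f}, p = {result.pvalue:.4f}")
```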
Collapse
Affiliation(s)
- Michal Ordak
- Department of Pharmacotherapy and Pharmaceutical Care, Faculty of Pharmacy, Medical University of Warsaw, Banacha 1 Str., 02-097 Warsaw, Poland
| |
Collapse
|
21
|
Jiao C, Edupuganti NR, Patel PA, Bui T, Sheth V. Evaluating the Artificial Intelligence Performance Growth in Ophthalmic Knowledge. Cureus 2023; 15:e45700. [PMID: 37868408 PMCID: PMC10590143 DOI: 10.7759/cureus.45700] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 09/20/2023] [Indexed: 10/24/2023] Open
Abstract
OBJECTIVE We aim to compare the capabilities of Chat Generative Pre-Trained Transformer (ChatGPT)-3.5 and ChatGPT-4.0 (OpenAI, San Francisco, CA, USA) in addressing multiple-choice ophthalmic case challenges. METHODS AND ANALYSIS The accuracy of both models was compared across ophthalmology subspecialties using multiple-choice ophthalmic clinical cases from the American Academy of Ophthalmology (AAO) "Diagnose This" questions. Additional analyses considered image content, question difficulty, character length of the models' responses, and each model's alignment with responses from human respondents. χ2 test, Fisher's exact test, Student's t-test, and one-way analysis of variance (ANOVA) were conducted where appropriate, with p<0.05 considered significant. RESULTS GPT-4.0 significantly outperformed GPT-3.5 (75% versus 46%, p<0.01), with the most noticeable improvement in neuro-ophthalmology (100% versus 38%, p=0.03). While both models struggled with uveitis and refractive questions, GPT-4.0 excelled in other areas, such as pediatric questions (82%). In image-related questions, GPT-4.0 also displayed superior accuracy that trended toward significance (73% versus 46%, p=0.07). GPT-4.0 performed better on easier questions (93.8% (least difficult) versus 76.2% (middle) versus 53.3% (most difficult), p=0.03) and generated more concise answers than GPT-3.5 (651.7±342.9 versus 1,112.9±328.8 characters, p<0.01). Moreover, GPT-4.0's answers were more in line with those of AAO respondents (57.3% versus 41.4%, p<0.01), with a strong correlation between its accuracy and the proportion of AAO respondents who selected GPT-4.0's answer (ρ=0.713, p<0.01). CONCLUSION AND RELEVANCE Our study demonstrates that GPT-4.0 significantly outperforms GPT-3.5 in addressing ophthalmic case challenges, especially in neuro-ophthalmology, with improved accuracy even in image-related questions. These findings underscore the potential of advancing artificial intelligence (AI) models to enhance ophthalmic diagnostics and medical education.
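The two headline analyses in this abstract, the accuracy difference between models and the correlation between accuracy and respondent agreement, can be reproduced in outline with scipy. All counts and agreement values below are placeholders for illustration, not the study's data.

# Illustrative re-creation of the abstract's two comparisons with placeholder numbers.
import numpy as np
from scipy.stats import fisher_exact, spearmanr

# 2x2 table of [correct, incorrect] answers for each model on a hypothetical 48-question set.
table = np.array([[36, 12],    # GPT-4.0: 75% correct
                  [22, 26]])   # GPT-3.5: ~46% correct
odds_ratio, p_prop = fisher_exact(table)

# Per-question correctness of GPT-4.0 (1/0) versus the share of AAO respondents
# choosing the same answer (placeholder values).
gpt4_correct = np.array([1, 1, 0, 1, 1, 0, 1, 1, 1, 0])
aao_agreement = np.array([0.71, 0.64, 0.22, 0.58, 0.80, 0.31, 0.66, 0.73, 0.55, 0.28])
rho, p_rho = spearmanr(gpt4_correct, aao_agreement)

print(f"accuracy difference: OR={odds_ratio:.2f}, p={p_prop:.3f}")
print(f"Spearman rho={rho:.2f}, p={p_rho:.3f}")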
Collapse
Affiliation(s)
- Cheng Jiao
- Ophthalmology, Augusta University Medical College of Georgia, Augusta, USA
| | - Neel R Edupuganti
- Ophthalmology, Augusta University Medical College of Georgia, Augusta, USA
| | - Parth A Patel
- Neurology, Augusta University Medical College of Georgia, Augusta, USA
| | - Tommy Bui
- Ophthalmology, Augusta University Medical College of Georgia, Augusta, USA
| | - Veeral Sheth
- Ophthalmology, University Retina and Macula Associates, Oak Forest, USA
| |
Collapse
|
22
|
Kumar M, Mani UA, Tripathi P, Saalim M, Roy S. Artificial Hallucinations by Google Bard: Think Before You Leap. Cureus 2023; 15:e43313. [PMID: 37700993 PMCID: PMC10492900 DOI: 10.7759/cureus.43313] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 08/10/2023] [Indexed: 09/14/2023] Open
Abstract
One of the critical challenges posed by artificial intelligence (AI) tools like Google Bard (Google LLC, Mountain View, California, United States) is the potential for "artificial hallucinations." These refer to instances where an AI chatbot generates fictional, erroneous, or unsubstantiated information in response to queries. In research, such inaccuracies can lead to the propagation of misinformation and undermine the credibility of scientific literature. The experience presented here highlights the importance of cross-checking the information provided by AI tools with reliable sources and maintaining a cautious approach when utilizing these tools in research writing.
Collapse
Affiliation(s)
- Mukesh Kumar
- Emergency Medicine, King George's Medical University, Lucknow, IND
| | - Utsav Anand Mani
- Emergency Medicine, King George's Medical University, Lucknow, IND
| | | | - Mohd Saalim
- Emergency Medicine, King George's Medical University, Lucknow, IND
| | - Sneha Roy
- Medicine, King George's Medical University, Lucknow, IND
| |
Collapse
|
23
|
Li H, Moon JT, Iyer D, Balthazar P, Krupinski EA, Bercu ZL, Newsome JM, Banerjee I, Gichoya JW, Trivedi HM. Decoding radiology reports: Potential application of OpenAI ChatGPT to enhance patient understanding of diagnostic reports. Clin Imaging 2023; 101:137-141. [PMID: 37336169 DOI: 10.1016/j.clinimag.2023.06.008] [Citation(s) in RCA: 15] [Impact Index Per Article: 15.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/13/2023] [Revised: 05/26/2023] [Accepted: 06/06/2023] [Indexed: 06/21/2023]
Abstract
PURPOSE To evaluate the complexity of diagnostic radiology reports across major imaging modalities and the ability of ChatGPT (Early March 2023 Version, OpenAI, California, USA) to simplify these reports to the 8th grade reading level of the average U.S. adult. METHODS We randomly sampled 100 radiographs (XR), 100 ultrasound (US), 100 CT, and 100 MRI radiology reports from our institution's database dated between 2022 and 2023 (N = 400). These were processed by ChatGPT using the prompt "Explain this radiology report to a patient in layman's terms in second person: <Report Text>". Mean report length, Flesch reading ease score (FRES), and Flesch-Kincaid reading level (FKRL) were calculated for each report and ChatGPT output. T-tests were used to determine significance. RESULTS Mean report length was 164 ± 117 words, FRES was 38.0 ± 11.8, and FKRL was 10.4 ± 1.9. FKRL was significantly higher for CT and MRI than for US and XR. Only 60/400 (15%) had a FKRL <8.5. The mean simplified ChatGPT output length was 103 ± 36 words, FRES was 83.5 ± 5.6, and FKRL was 5.8 ± 1.1. This reflects a mean decrease of 61 words (p < 0.01), increase in FRES of 45.5 (p < 0.01), and decrease in FKRL of 4.6 (p < 0.01). All simplified outputs had FKRL <8.5. DISCUSSION Our study demonstrates the effective use of ChatGPT when tasked with simplifying radiology reports to below the 8th grade reading level. We report significant improvements in FRES, FKRL, and word count, the last of which requires modality-specific context.
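For readers unfamiliar with the two readability metrics, the following Python sketch computes them with the open-source textstat package on a fabricated report excerpt. The formulas in the comments are the standard Flesch equations; the abstract does not specify the authors' exact tooling, so this is one plausible way to obtain the same scores.

# Sketch: compute the readability metrics named in the abstract with textstat.
# Underlying formulas:
#   FRES = 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)
#   FKRL = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59
# The report text below is a fabricated placeholder, not data from the study.
import textstat

original = ("Impression: No acute intracranial hemorrhage, mass effect, or midline shift. "
            "Chronic microvascular ischemic changes are noted in the periventricular white matter.")
simplified = ("Your brain scan does not show any bleeding or swelling. "
              "There are some small changes from aging blood vessels, which is common.")

for label, text in [("original", original), ("simplified", simplified)]:
    print(label,
          "FRES:", textstat.flesch_reading_ease(text),
          "FKRL:", textstat.flesch_kincaid_grade(text),
          "words:", textstat.lexicon_count(text))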
Collapse
Affiliation(s)
- Hanzhou Li
- Emory University School of Medicine, Department of Radiology and Imaging Science, 1364 Clifton Rd, Atlanta, GA 30322, United States of America.
| | - John T Moon
- Emory University School of Medicine, Department of Radiology and Imaging Science, 1364 Clifton Rd, Atlanta, GA 30322, United States of America. https://twitter.com/johntmoon
| | - Deepak Iyer
- Emory University School of Medicine, Department of Radiology and Imaging Science, 1364 Clifton Rd, Atlanta, GA 30322, United States of America. https://twitter.com/d_iyer7
| | - Patricia Balthazar
- Emory University School of Medicine, Department of Radiology and Imaging Science, 1364 Clifton Rd, Atlanta, GA 30322, United States of America. https://twitter.com/PBalthazarMD
| | - Elizabeth A Krupinski
- Emory University School of Medicine, Department of Radiology and Imaging Science, 1364 Clifton Rd, Atlanta, GA 30322, United States of America. https://twitter.com/EAKrup
| | - Zachary L Bercu
- Emory University School of Medicine, Department of Radiology and Imaging Science, 1364 Clifton Rd, Atlanta, GA 30322, United States of America. https://twitter.com/ZachBercuMD
| | - Janice M Newsome
- Emory University School of Medicine, Department of Radiology and Imaging Science, 1364 Clifton Rd, Atlanta, GA 30322, United States of America. https://twitter.com/angiowoman
| | - Imon Banerjee
- Mayo Clinic, Department of Radiology, Phoenix, AZ, United States of America. https://twitter.com/ImonBanerjee6
| | - Judy W Gichoya
- Emory University School of Medicine, Department of Radiology and Imaging Science, 1364 Clifton Rd, Atlanta, GA 30322, United States of America. https://twitter.com/judywawira
| | - Hari M Trivedi
- Emory University School of Medicine, Department of Radiology and Imaging Science, 1364 Clifton Rd, Atlanta, GA 30322, United States of America. https://twitter.com/HariTrivediMD
| |
Collapse
|
24
|
Li Y, Li Z, Zhang K, Dan R, Jiang S, Zhang Y. ChatDoctor: A Medical Chat Model Fine-Tuned on a Large Language Model Meta-AI (LLaMA) Using Medical Domain Knowledge. Cureus 2023; 15:e40895. [PMID: 37492832 PMCID: PMC10364849 DOI: 10.7759/cureus.40895] [Citation(s) in RCA: 15] [Impact Index Per Article: 15.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 06/24/2023] [Indexed: 07/27/2023] Open
Abstract
Objective The primary aim of this research was to address the limitations observed in the medical knowledge of prevalent large language models (LLMs) such as ChatGPT by creating a specialized language model with enhanced accuracy in medical advice. Methods We achieved this by adapting and refining the large language model meta-AI (LLaMA) using a large dataset of 100,000 patient-doctor dialogues sourced from a widely used online medical consultation platform. These conversations were cleaned and anonymized to respect privacy concerns. In addition to the model refinement, we incorporated a self-directed information retrieval mechanism, allowing the model to access and utilize real-time information from online sources such as Wikipedia and from curated offline medical databases. Results Fine-tuning the model with real-world patient-doctor interactions significantly improved its ability to understand patient needs and provide informed advice. By equipping the model with self-directed information retrieval from reliable online and offline sources, we observed substantial improvements in the accuracy of its responses. Conclusion Our proposed ChatDoctor represents a significant advancement in medical LLMs, demonstrating a marked improvement in understanding patient inquiries and providing accurate advice. Given the high stakes and low error tolerance of the medical field, such enhancements in providing accurate and reliable information are not only beneficial but essential.
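To make the fine-tuning setup concrete, the following Python sketch shows one plausible way to convert anonymized patient-doctor exchanges into instruction-tuning records. The field names, instruction text, file layout, and example dialogue are assumptions for illustration, not the ChatDoctor authors' exact pipeline.

# Sketch: turn anonymized patient-doctor exchanges into instruction-tuning
# records (one JSON object per line). The schema and the sample dialogue are
# illustrative assumptions, not the published ChatDoctor data format.
import json

raw_dialogues = [
    {"patient": "I have had a dry cough and mild fever for three days. Should I worry?",
     "doctor": "A short viral illness is most likely. Rest, fluids, and fever control are "
               "reasonable; seek care if you develop breathlessness or the fever lasts beyond five days."},
]

with open("chatdoctor_style_train.jsonl", "w", encoding="utf-8") as f:
    for d in raw_dialogues:
        record = {
            "instruction": "If you are a doctor, please answer the medical question based on the patient's description.",
            "input": d["patient"],
            "output": d["doctor"],
        }
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

Each line of the resulting file can then be fed to a standard supervised fine-tuning loop for a LLaMA-style model; the retrieval component described in the abstract would be layered on top at inference time.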
Collapse
Affiliation(s)
- Yunxiang Li
- Department of Radiation Oncology, University of Texas Southwestern Medical Center, Dallas, USA
| | - Zihan Li
- Department of Computer Science, University of Illinois at Urbana-Champaign, Illinois, USA
| | - Kai Zhang
- Department of Computer Science and Engineering, The Ohio State University, Columbus, USA
| | - Ruilong Dan
- College of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, CHN
| | - Steve Jiang
- Department of Radiation Oncology, University of Texas Southwestern Medical Center, Dallas, USA
| | - You Zhang
- Department of Radiation Oncology, University of Texas Southwestern Medical Center, Dallas, USA
| |
Collapse
|