1. Chen JS, Reddy AJ, Al-Sharif E, Shoji MK, Kalaw FGP, Eslani M, Lang PZ, Arya M, Koretz ZA, Bolo KA, Arnett JJ, Roginiel AC, Do JL, Robbins SL, Camp AS, Scott NL, Rudell JC, Weinreb RN, Baxter SL, Granet DB. Analysis of ChatGPT Responses to Ophthalmic Cases: Can ChatGPT Think like an Ophthalmologist? Ophthalmology Science 2025;5:100600. PMID: 39346575; PMCID: PMC11437840; DOI: 10.1016/j.xops.2024.100600.
Abstract
Objective Large language models such as ChatGPT have demonstrated significant potential in question-answering within ophthalmology, but there is a paucity of literature evaluating their ability to generate clinical assessments and discussions. The objectives of this study were to (1) assess the accuracy of assessment and plans generated by ChatGPT and (2) evaluate ophthalmologists' abilities to distinguish between responses generated by clinicians versus ChatGPT. Design Cross-sectional mixed-methods study. Subjects Sixteen ophthalmologists from a single academic center, of whom 10 were board-eligible and 6 were board-certified, were recruited to participate in this study. Methods Prompt engineering was used to ensure ChatGPT output discussions in the style of the ophthalmologist author of the Medical College of Wisconsin Ophthalmic Case Studies. Cases where ChatGPT accurately identified the primary diagnoses were included and then paired. Masked human-generated and ChatGPT-generated discussions were sent to participating ophthalmologists to identify the author of the discussions. Response confidence was assessed using a 5-point Likert scale score, and subjective feedback was manually reviewed. Main Outcome Measures Accuracy of ophthalmologist identification of discussion author, as well as subjective perceptions of human-generated versus ChatGPT-generated discussions. Results Overall, ChatGPT correctly identified the primary diagnosis in 15 of 17 (88.2%) cases. Two cases were excluded from the paired comparison due to hallucinations or fabrications of nonuser-provided data. Ophthalmologists correctly identified the author in 77.9% ± 26.6% of the 13 included cases, with a mean Likert scale confidence rating of 3.6 ± 1.0. No significant differences in performance or confidence were found between board-certified and board-eligible ophthalmologists. Subjectively, ophthalmologists found that discussions written by ChatGPT tended to include more generic responses and irrelevant information, hallucinated more frequently, and had distinct syntactic patterns (all P < 0.01). Conclusions Large language models have the potential to synthesize clinical data and generate ophthalmic discussions. While these findings have exciting implications for artificial intelligence-assisted health care delivery, more rigorous real-world evaluation of these models is necessary before clinical deployment. Financial Disclosures The author(s) have no proprietary or commercial interest in any materials discussed in this article.
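As a rough illustration of the identification analysis described above (per-rater accuracy over 13 paired cases, compared between board-certified and board-eligible groups), the sketch below uses synthetic ratings and assumes a Mann-Whitney U comparison; it is not the authors' actual analysis or data.

```python
# Illustrative sketch only: synthetic ratings, not the study's real data.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

# 1 = rater correctly identified the discussion author, 0 = incorrect,
# for 13 paired cases per rater (values are made up).
board_certified = rng.integers(0, 2, size=(6, 13))   # 6 raters
board_eligible = rng.integers(0, 2, size=(10, 13))   # 10 raters

acc_certified = board_certified.mean(axis=1) * 100    # per-rater accuracy (%)
acc_eligible = board_eligible.mean(axis=1) * 100

print(f"Board-certified: {acc_certified.mean():.1f} ± {acc_certified.std(ddof=1):.1f}%")
print(f"Board-eligible:  {acc_eligible.mean():.1f} ± {acc_eligible.std(ddof=1):.1f}%")

# Nonparametric comparison of the two rater groups (assumed test choice).
stat, p = mannwhitneyu(acc_certified, acc_eligible, alternative="two-sided")
print(f"Mann-Whitney U = {stat:.1f}, p = {p:.3f}")
```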
Affiliation(s)
- Jimmy S Chen: Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California; UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, California
- Akshay J Reddy: School of Medicine, California University of Science and Medicine, Colton, California
- Eman Al-Sharif: Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California; Surgery Department, College of Medicine, Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia
- Marissa K Shoji: Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California
- Fritz Gerald P Kalaw: Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California; UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, California
- Medi Eslani: Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California
- Paul Z Lang: Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California
- Malvika Arya: Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California
- Zachary A Koretz: Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California
- Kyle A Bolo: Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California
- Justin J Arnett: Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California
- Aliya C Roginiel: Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California
- Jiun L Do: Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California
- Shira L Robbins: Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California
- Andrew S Camp: Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California
- Nathan L Scott: Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California
- Jolene C Rudell: Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California
- Robert N Weinreb: Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California; UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, California
- Sally L Baxter: Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California; UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, California
- David B Granet: Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California

2. Suárez A, Jiménez J, Llorente de Pedro M, Andreu-Vázquez C, Díaz-Flores García V, Gómez Sánchez M, Freire Y. Beyond the Scalpel: Assessing ChatGPT's potential as an auxiliary intelligent virtual assistant in oral surgery. Comput Struct Biotechnol J 2024;24:46-52. PMID: 38162955; PMCID: PMC10755495; DOI: 10.1016/j.csbj.2023.11.058.
Abstract
AI has revolutionized the way we interact with technology. Noteworthy advances in AI algorithms and large language models (LLMs) have led to the development of natural generative language (NGL) systems such as ChatGPT. Although these LLMs can simulate human conversations and generate content in real time, they face challenges related to the topicality and accuracy of the information they generate. This study aimed to assess whether ChatGPT-4 could provide accurate and reliable answers to general dentists in the field of oral surgery, and thus explore its potential as an intelligent virtual assistant in clinical decision making in oral surgery. Thirty questions related to oral surgery were posed to ChatGPT-4, each question repeated 30 times. Subsequently, a total of 900 responses were obtained. Two surgeons graded the answers according to the guidelines of the Spanish Society of Oral Surgery, using a three-point Likert scale (correct, partially correct/incomplete, and incorrect). Disagreements were arbitrated by an experienced oral surgeon, who provided the final grade. Accuracy was found to be 71.7%, and the consistency of the experts' grading across iterations ranged from moderate to almost perfect. ChatGPT-4, with its potential capabilities, will inevitably be integrated into dental disciplines, including oral surgery. In the future, it could be considered an auxiliary intelligent virtual assistant, though it would never replace oral surgery experts. Proper training and verified information by experts will remain vital to the implementation of the technology. More comprehensive research is needed to ensure the safe and successful application of AI in oral surgery.
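The grading workflow described here (two graders assigning three-point Likert grades to 900 responses, with agreement checked between graders) can be sketched as below. The grade distributions are invented and Cohen's kappa is an assumed agreement metric, not necessarily the statistic used in the study.

```python
# Illustrative sketch: scoring 3-point Likert grades and grader agreement.
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(1)
GRADES = ["correct", "partially correct", "incorrect"]

# Hypothetical grades from two surgeons for 900 responses (30 questions x 30 runs).
rater_a = rng.choice(GRADES, size=900, p=[0.7, 0.2, 0.1])
rater_b = rater_a.copy()
flip = rng.random(900) < 0.15                  # simulate some disagreement
rater_b[flip] = rng.choice(GRADES, size=flip.sum())

accuracy = np.mean(rater_a == "correct") * 100
kappa = cohen_kappa_score(rater_a, rater_b)    # chance-corrected agreement
print(f"Accuracy (rater A): {accuracy:.1f}%")
print(f"Cohen's kappa between graders: {kappa:.2f}")
```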
Affiliation(s)
- Ana Suárez: Department of Pre-Clinic Dentistry, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, 28670 Madrid, Spain
- Jaime Jiménez: Department of Clinic Dentistry, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, 28670 Madrid, Spain
- María Llorente de Pedro: Department of Pre-Clinic Dentistry, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, 28670 Madrid, Spain
- Cristina Andreu-Vázquez: Department of Veterinary Medicine, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, 28670 Madrid, Spain
- Víctor Díaz-Flores García: Department of Pre-Clinic Dentistry, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, 28670 Madrid, Spain
- Margarita Gómez Sánchez: Department of Pre-Clinic Dentistry, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, 28670 Madrid, Spain
- Yolanda Freire: Department of Pre-Clinic Dentistry, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, 28670 Madrid, Spain

3. Chan J, Dong T, Angelini GD. The performance of large language models in intercollegiate Membership of the Royal College of Surgeons examination. Ann R Coll Surg Engl 2024;106:700-704. PMID: 38445611; PMCID: PMC11528401; DOI: 10.1308/rcsann.2024.0023.
Abstract
INTRODUCTION Large language models (LLMs), such as Chat Generative Pre-trained Transformer (ChatGPT) and Bard, utilise deep learning algorithms that have been trained on a massive data set of text and code to generate human-like responses. Several studies have demonstrated satisfactory performance on postgraduate examinations, including the United States Medical Licensing Examination. We aimed to evaluate artificial intelligence performance in Part A of the intercollegiate Membership of the Royal College of Surgeons (MRCS) examination. METHODS The MRCS mock examination from Pastest, a commonly used question bank for examinees, was used to assess the performance of three LLMs: GPT-3.5, GPT-4.0 and Bard. Three hundred mock questions were input into the three LLMs, and the responses provided by the LLMs were recorded and analysed. The pass mark was set at 70%. RESULTS The overall accuracies for GPT-3.5, GPT-4.0 and Bard were 67.33%, 71.67% and 65.67%, respectively (p = 0.27). The performances of GPT-3.5, GPT-4.0 and Bard in Applied Basic Sciences were 68.89%, 72.78% and 63.33% (p = 0.15), respectively. Furthermore, the three LLMs obtained correct answers in 65.00%, 70.00% and 69.17% of the Principles of Surgery in General questions (p = 0.67). There were no differences among the three LLMs in overall performance or in the subcategories. CONCLUSIONS Our findings demonstrated satisfactory performance for all three LLMs in the MRCS Part A examination, with GPT-4.0 being the only LLM to achieve the set pass mark.
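A minimal sketch of how the three models' overall accuracies might be compared is shown below. The correct-answer counts are back-calculated from the reported percentages out of 300 questions; the chi-square test is an assumed choice, not necessarily the paper's exact method.

```python
# Illustrative sketch: comparing three models' MCQ accuracy with a chi-square test.
import numpy as np
from scipy.stats import chi2_contingency

n_questions = 300
correct = {"GPT-3.5": 202, "GPT-4.0": 215, "Bard": 197}   # ~67.33%, 71.67%, 65.67%

# Rows: models; columns: correct vs incorrect counts.
table = np.array([[c, n_questions - c] for c in correct.values()])
chi2, p, dof, _expected = chi2_contingency(table)

for model, c in correct.items():
    print(f"{model}: {c / n_questions:.2%} correct")
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.2f}")
```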
Affiliation(s)
- J Chan: Bristol Heart Institute, University of Bristol, UK
- T Dong: Bristol Heart Institute, University of Bristol, UK
- GD Angelini: Bristol Heart Institute, University of Bristol, UK

4. Anguita R, Downie C, Ferro Desideri L, Sagoo MS. Assessing large language models' accuracy in providing patient support for choroidal melanoma. Eye (Lond) 2024;38:3113-3117. PMID: 39003430; DOI: 10.1038/s41433-024-03231-w.
Abstract
PURPOSE This study aimed to evaluate the accuracy of information that patients can obtain from large language models (LLMs) when seeking answers to common questions about choroidal melanoma. METHODS Comparative study in which frequently asked questions from choroidal melanoma patients were compiled and posed to three major LLMs: ChatGPT 3.5, Bing AI, and DocsGPT. Answers were reviewed by three ocular oncology experts and scored as accurate, partially accurate, or inaccurate. Statistical analysis compared the quality of responses across models. RESULTS For medical advice questions, ChatGPT gave 92% accurate responses compared to 58% for Bing AI and DocsGPT. For pre/post-op questions, ChatGPT and Bing AI were 86% accurate while DocsGPT was 73% accurate. There were no statistically significant differences between models. ChatGPT responses were the longest while Bing AI responses were the shortest, but length did not affect accuracy. All LLMs appropriately directed patients to seek medical advice from professionals. CONCLUSION LLMs show promising capability to address common choroidal melanoma patient questions at generally acceptable accuracy levels. However, inconsistent and inaccurate responses do occur, highlighting the need for improved fine-tuning and oversight before integration into clinical practice.
Affiliation(s)
- Rodrigo Anguita: Moorfields Eye Hospital NHS Foundation Trust, City Road London, London, UK; Department of Ophthalmology, Inselspital University Hospital of Bern, Bern, Switzerland
- Catriona Downie: Moorfields Eye Hospital NHS Foundation Trust, City Road London, London, UK
- Mandeep S Sagoo: Moorfields Eye Hospital NHS Foundation Trust, City Road London, London, UK; NIHR Biomedical Research Centre for Ophthalmology at Moorfields Eye Hospital and University College London Institute of Ophthalmology, London, UK

5. Marshall RF, Mallem K, Xu H, Thorne J, Burkholder B, Chaon B, Liberman P, Berkenstock M. Investigating the Accuracy and Completeness of an Artificial Intelligence Large Language Model About Uveitis: An Evaluation of ChatGPT. Ocul Immunol Inflamm 2024;32:2052-2055. PMID: 38394625; DOI: 10.1080/09273948.2024.2317417.
Abstract
PURPOSE To assess the accuracy and completeness of ChatGPT-generated answers regarding uveitis description, prevention, treatment, and prognosis. METHODS Thirty-two uveitis-related questions were generated by a uveitis specialist and inputted into ChatGPT 3.5. Answers were compiled into a survey and were reviewed by five uveitis specialists using standardized Likert scales of accuracy and completeness. RESULTS In total, the median accuracy score for all the uveitis questions (n = 32) was 4.00 (between "more correct than incorrect" and "nearly all correct"), and the median completeness score was 2.00 ("adequate, addresses all aspects of the question and provides the minimum amount of information required to be considered complete"). The interrater variability assessment had a total kappa value of 0.0278 for accuracy and 0.0847 for completeness. CONCLUSION ChatGPT can provide relatively high accuracy responses for various questions related to uveitis; however, the answers it provides are incomplete, with some inaccuracies. Its utility in providing medical information requires further validation and development prior to serving as a source of uveitis information for patients.
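The interrater variability assessment mentioned above can be illustrated with a Fleiss-kappa sketch for five raters; the ratings below are synthetic, and the specific kappa variant is an assumption rather than a detail stated in the abstract.

```python
# Illustrative sketch: interrater agreement for five specialists grading 32 answers.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(2)

# Hypothetical accuracy scores (1-5 Likert) from 5 raters for 32 answers.
ratings = rng.integers(1, 6, size=(32, 5))

# aggregate_raters converts (subjects x raters) labels into per-category counts.
table, _categories = aggregate_raters(ratings)
kappa = fleiss_kappa(table, method="fleiss")
print(f"Fleiss' kappa across the five raters: {kappa:.3f}")
```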
Affiliation(s)
- Rayna F Marshall: The Drexel University College of Medicine, Philadelphia, Pennsylvania, USA
- Krishna Mallem: The Drexel University College of Medicine, Philadelphia, Pennsylvania, USA
- Hannah Xu: University of California San Diego, San Diego, California, USA
- Jennifer Thorne: The Wilmer Eye Institute, Division of Ocular Immunology, The Johns Hopkins University School of Medicine, Baltimore, Maryland, USA
- Bryn Burkholder: The Wilmer Eye Institute, Division of Ocular Immunology, The Johns Hopkins University School of Medicine, Baltimore, Maryland, USA
- Benjamin Chaon: The Wilmer Eye Institute, Division of Ocular Immunology, The Johns Hopkins University School of Medicine, Baltimore, Maryland, USA
- Paulina Liberman: The Wilmer Eye Institute, Division of Ocular Immunology, The Johns Hopkins University School of Medicine, Baltimore, Maryland, USA
- Meghan Berkenstock: The Wilmer Eye Institute, Division of Ocular Immunology, The Johns Hopkins University School of Medicine, Baltimore, Maryland, USA

6. Kalaw FGP, Baxter SL. Ethical considerations for large language models in ophthalmology. Curr Opin Ophthalmol 2024;35:438-446. PMID: 39259616; PMCID: PMC11427135; DOI: 10.1097/icu.0000000000001083.
Abstract
PURPOSE OF REVIEW This review aims to summarize and discuss the ethical considerations regarding large language model (LLM) use in the field of ophthalmology. RECENT FINDINGS This review of 47 articles on LLM applications in ophthalmology highlights their diverse potential uses, including education, research, clinical decision support, and surgical assistance (as an aid in operative notes). We also review ethical considerations such as the inability of LLMs to interpret data accurately, the risk of promoting controversial or harmful recommendations, and breaches of data privacy. These concerns imply the need for cautious integration of artificial intelligence in healthcare, emphasizing human oversight, transparency, and accountability to mitigate risks and uphold ethical standards. SUMMARY The integration of LLMs in ophthalmology offers potential advantages such as aiding in clinical decision support and facilitating medical education through their ability to process queries and analyze ophthalmic imaging and clinical cases. However, their utilization also raises ethical concerns regarding data privacy, potential misinformation, and biases inherent in the datasets used. These concerns should be addressed in order to optimize the utility of LLMs in the healthcare setting. More importantly, responsible and careful use by consumers should be promoted.
Affiliation(s)
- Fritz Gerald P Kalaw: Division of Ophthalmology Informatics and Data Science, Viterbi Family Department of Ophthalmology and Shiley Eye Institute; Department of Biomedical Informatics, University of California San Diego Health System, University of California San Diego, La Jolla, California, USA
- Sally L Baxter: Division of Ophthalmology Informatics and Data Science, Viterbi Family Department of Ophthalmology and Shiley Eye Institute; Department of Biomedical Informatics, University of California San Diego Health System, University of California San Diego, La Jolla, California, USA

7. Lucas HC, Upperman JS, Robinson JR. A systematic review of large language models and their implications in medical education. Medical Education 2024;58:1276-1285. PMID: 38639098; DOI: 10.1111/medu.15402.
Abstract
INTRODUCTION In the past year, the use of large language models (LLMs) has generated significant interest and excitement because of their potential to revolutionise various fields, including medical education for aspiring physicians. Although medical students undergo a demanding educational process to become competent health care professionals, the emergence of LLMs presents a promising solution to challenges like information overload, time constraints and pressure on clinical educators. However, integrating LLMs into medical education raises critical concerns and challenges for educators, professionals and students. This systematic review aims to explore LLM applications in medical education, specifically their impact on medical students' learning experiences. METHODS A systematic search was performed in PubMed, Web of Science and Embase for articles discussing the applications of LLMs in medical education using selected keywords related to LLMs and medical education, from the time of ChatGPT's debut until February 2024. Only articles available in full text or English were reviewed. The credibility of each study was critically appraised by two independent reviewers. RESULTS The systematic review identified 166 studies, of which 40 were found on review to be relevant. Among the 40 relevant studies, key themes included LLM capabilities, benefits such as personalised learning and challenges regarding content accuracy. Importantly, 42.5% of these studies specifically evaluated LLMs, including ChatGPT, in a novel way in contexts such as medical exams and clinical/biomedical information, highlighting their potential in replicating human-level performance in medical knowledge. The remaining studies broadly discussed the prospective role of LLMs in medical education, reflecting a keen interest in their future potential despite current constraints. CONCLUSIONS The responsible implementation of LLMs in medical education offers a promising opportunity to enhance learning experiences. However, ensuring information accuracy, emphasising skill-building and maintaining ethical safeguards are crucial. Continuous critical evaluation and interdisciplinary collaboration are essential for the appropriate integration of LLMs in medical education.
Affiliation(s)
- Jeffrey S Upperman: Department of Pediatric Surgery, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Jamie R Robinson: Department of Pediatric Surgery, Vanderbilt University Medical Center, Nashville, Tennessee, USA; Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA

8. Edhem Yılmaz İ, Berhuni M, Özer Özcan Z, Doğan L. Chatbots talk Strabismus: Can AI become the new patient Educator? Int J Med Inform 2024;191:105592. PMID: 39159506; DOI: 10.1016/j.ijmedinf.2024.105592.
Abstract
BACKGROUND Strabismus is a common eye condition affecting both children and adults. Effective patient education is crucial for informed decision-making, but traditional methods often lack accessibility and engagement. Chatbots powered by AI have emerged as a promising solution. AIM This study aims to evaluate and compare the performance of three chatbots (ChatGPT, Bard, and Copilot) and a reliable website (AAPOS) in answering real patient questions about strabismus. METHOD Three chatbots (ChatGPT, Bard, and Copilot) were compared to a reliable website (AAPOS) using real patient questions. Metrics included accuracy (SOLO taxonomy), understandability/actionability (PEMAT), and readability (Flesch-Kincaid). We also performed a sentiment analysis to capture the emotional tone and impact of the responses. RESULTS The AAPOS website achieved the highest mean SOLO score (4.14 ± 0.47), followed by Bard, Copilot, and ChatGPT. Bard scored highest on both PEMAT-U (74.8 ± 13.3) and PEMAT-A (66.2 ± 13.6) measures. Flesch-Kincaid Ease Scores revealed the AAPOS website as the easiest to read (mean score: 55.8 ± 14.11), closely followed by Copilot; ChatGPT and Bard had lower readability scores. The sentiment analysis revealed notable differences in emotional tone across sources. CONCLUSION Chatbots, particularly Bard and Copilot, show promise in patient education for strabismus with strengths in understandability and actionability. However, the AAPOS website outperformed them in accuracy and readability.
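A hedged sketch of the readability and sentiment metrics named in the methods is given below. The use of the textstat and NLTK VADER packages, and the example answer text, are illustrative assumptions rather than the study's actual pipeline.

```python
# Illustrative sketch: readability and sentiment scoring of chatbot answers.
import textstat
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)

answers = {
    "ChatGPT": "Strabismus is a condition in which the eyes are misaligned and do not point in the same direction.",
    "Bard": "Strabismus, sometimes called crossed eyes, happens when the eye muscles do not work together.",
}

sia = SentimentIntensityAnalyzer()
for source, text in answers.items():
    ease = textstat.flesch_reading_ease(text)          # higher = easier to read
    grade = textstat.flesch_kincaid_grade(text)        # approximate US grade level
    sentiment = sia.polarity_scores(text)["compound"]  # -1 (negative) to +1 (positive)
    print(f"{source}: ease={ease:.1f}, grade={grade:.1f}, sentiment={sentiment:+.2f}")
```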
Affiliation(s)
- İbrahim Edhem Yılmaz: Ophthalmology Department, Gaziantep Islam Science and Technology University, Gaziantep, Turkey
- Mustafa Berhuni: Ophthalmology Department, Gaziantep Islam Science and Technology University, Gaziantep, Turkey
- Zeynep Özer Özcan: Ophthalmology Department, Dr Ersin Aslan Teaching and Research Hospital, Gaziantep, Turkey
- Levent Doğan: Ophthalmology Department, Omer Halis Demir University, Nigde, Turkey

9. Bellanda VCF, Santos MLD, Ferraz DA, Jorge R, Melo GB. Applications of ChatGPT in the diagnosis, management, education, and research of retinal diseases: a scoping review. Int J Retina Vitreous 2024;10:79. PMID: 39420407; PMCID: PMC11487877; DOI: 10.1186/s40942-024-00595-9.
Abstract
PURPOSE This scoping review aims to explore the current applications of ChatGPT in the retina field, highlighting its potential, challenges, and limitations. METHODS A comprehensive literature search was conducted across multiple databases, including PubMed, Scopus, MEDLINE, and Embase, to identify relevant articles published from 2022 onwards. The inclusion criteria focused on studies evaluating the use of ChatGPT in retinal healthcare. Data were extracted and synthesized to map the scope of ChatGPT's applications in retinal care, categorizing articles into various practical application areas such as academic research, charting, coding, diagnosis, disease management, and patient counseling. RESULTS A total of 68 articles were included in the review, distributed across several categories: 8 related to academics and research, 5 to charting, 1 to coding and billing, 44 to diagnosis, 49 to disease management, 2 to literature consulting, 23 to medical education, and 33 to patient counseling. Many articles were classified into multiple categories due to overlapping topics. The findings indicate that while ChatGPT shows significant promise in areas such as medical education and diagnostic support, concerns regarding accuracy, reliability, and the potential for misinformation remain prevalent. CONCLUSION ChatGPT offers substantial potential in advancing retinal healthcare by supporting clinical decision-making, enhancing patient education, and automating administrative tasks. However, its current limitations, particularly in clinical accuracy and the risk of generating misinformation, necessitate cautious integration into practice, with continuous oversight from healthcare professionals. Future developments should focus on improving accuracy, incorporating up-to-date medical guidelines, and minimizing the risks associated with AI-driven healthcare tools.
Affiliation(s)
- Victor C F Bellanda: Ribeirão Preto Medical School, University of São Paulo, 3900 Bandeirantes Ave, Ribeirão Preto, SP, 14049-900, Brazil
- Rodrigo Jorge: Ribeirão Preto Medical School, University of São Paulo, 3900 Bandeirantes Ave, Ribeirão Preto, SP, 14049-900, Brazil
- Gustavo Barreto Melo: Sergipe Eye Hospital, Aracaju, SE, Brazil; Paulista School of Medicine, Federal University of São Paulo, São Paulo, SP, Brazil

10. Howard EC, Carnino JM, Chong NYK, Levi JR. Navigating ChatGPT's alignment with expert consensus on pediatric OSA management. Int J Pediatr Otorhinolaryngol 2024;186:112131. PMID: 39423592; DOI: 10.1016/j.ijporl.2024.112131.
Abstract
OBJECTIVE This study aimed to evaluate the potential integration of artificial intelligence (AI), specifically ChatGPT, into healthcare decision-making, focusing on its alignment with expert consensus statements regarding the management of persistent pediatric obstructive sleep apnea. METHODS We analyzed ChatGPT's responses to 52 statements from the 2024 expert consensus statement (ECS) on the management of pediatric persistent OSA after adenotonsillectomy. Each statement was input into ChatGPT using a 9-point Likert scale format, with each statement entered three times to calculate mean scores and standard deviations. Statistical analysis was performed using Excel. RESULTS ChatGPT's responses were within 1.0 of the consensus statement mean score for 63% (33/52) of the statements. 13% (7/52) were statements in which the ChatGPT mean response was different from the ECS mean by 2.0 or greater, the majority of which were in the categories of surgical and medical management. Statements with ChatGPT mean scores differing by more than 2.0 from the consensus mean highlighted the risk of disseminating incorrect information on established medical topics, with a notable variation in responses suggesting inconsistencies in ChatGPT's reliability. CONCLUSION While ChatGPT demonstrated a promising ability to align with expert medical opinions in many cases, its inconsistencies and potential to propagate inaccuracies in contested areas raise important considerations for its application in clinical settings. The findings underscore the need for ongoing evaluation and refinement of AI tools in healthcare, emphasizing collaboration between AI developers, healthcare professionals, and regulatory bodies to ensure AI's safe and effective integration into medical decision-making processes.
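The averaging-and-flagging procedure described in the methods (three repeated ChatGPT ratings per statement, compared against the consensus mean with a 2.0-point threshold) can be sketched as follows; all numbers below are invented placeholders, not study data.

```python
# Illustrative sketch: averaging repeated Likert responses and flagging deviations.
import numpy as np

# rows = consensus statements, columns = three repeated ChatGPT ratings (1-9 scale)
chatgpt_runs = np.array([
    [8, 9, 8],
    [3, 5, 4],
    [7, 7, 6],
])
consensus_mean = np.array([8.2, 7.0, 6.5])   # hypothetical ECS mean scores

gpt_mean = chatgpt_runs.mean(axis=1)
gpt_sd = chatgpt_runs.std(axis=1, ddof=1)
deviation = np.abs(gpt_mean - consensus_mean)

for i, (m, s, d) in enumerate(zip(gpt_mean, gpt_sd, deviation), start=1):
    flag = "DIVERGENT" if d >= 2.0 else "aligned"
    print(f"Statement {i}: ChatGPT {m:.1f} ± {s:.1f}, |diff| = {d:.1f} -> {flag}")
```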
Affiliation(s)
- Eileen C Howard: Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA
- Jonathan M Carnino: Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA
- Nicholas Y K Chong: Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA
- Jessica R Levi: Department of Otolaryngology - Head and Neck Surgery, Boston Medical Center, Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA

11. Duran A, Cortuk O, Ok B. Future Perspective of Risk Prediction in Aesthetic Surgery: Is Artificial Intelligence Reliable? Aesthet Surg J 2024;44:NP839-NP849. PMID: 38941487; DOI: 10.1093/asj/sjae140.
Abstract
BACKGROUND Artificial intelligence (AI) techniques are showing significant potential in the medical field. The rapid advancement in artificial intelligence methods suggests their soon-to-be essential role in physicians' practices. OBJECTIVES In this study, we sought to assess and compare the readability, clarity, and precision of medical knowledge responses provided by 3 large language models (LLMs) and informed consent forms for 14 common aesthetic surgical procedures, as prepared by the American Society of Plastic Surgeons (ASPS). METHODS The efficacy, readability, and accuracy of 3 leading LLMs, ChatGPT-4 (OpenAI, San Francisco, CA), Gemini (Google, Mountain View, CA), and Copilot (Microsoft, Redmond, WA), were systematically evaluated with 14 different prompts related to the risks of 14 common aesthetic procedures. Alongside these LLM responses, risk sections from the informed consent forms for these procedures, provided by the ASPS, were also reviewed. RESULTS The risk factor segments of the combined general and specific operation consent forms were rated highest for medical knowledge accuracy (P < .05). Regarding readability and clarity, the procedure-specific informed consent forms, along with the LLM responses, achieved the highest scores (P < .05). However, these same forms received the lowest score for medical knowledge accuracy (P < .05). Interestingly, surgeons preferred patient-facing materials created by ChatGPT-4, citing superior accuracy and medical information compared to other AI tools. CONCLUSIONS Physicians prefer patient-facing materials created by ChatGPT-4 over other AI tools due to their precise and comprehensive medical knowledge. Importantly, adherence to the strong recommendation of ASPS for signing both the procedure-specific and the general informed consent forms can avoid potential future complications and ethical concerns, thereby ensuring patients receive adequate information.

12. Barbosa-Silva J, Driusso P, Ferreira EA, de Abreu RM. Exploring the Efficacy of Artificial Intelligence: A Comprehensive Analysis of CHAT-GPT's Accuracy and Completeness in Addressing Urinary Incontinence Queries. Neurourol Urodyn 2024. PMID: 39390731; DOI: 10.1002/nau.25603.
Abstract
BACKGROUND Artificial intelligence models are increasingly gaining popularity among patients and healthcare professionals. While it is impossible to restrict patients' access to different sources of information on the Internet, healthcare professionals need to be aware of the content quality available across different platforms. OBJECTIVE To investigate the accuracy and completeness of Chat Generative Pretrained Transformer (ChatGPT) in addressing frequently asked questions related to the management and treatment of female urinary incontinence (UI), compared to recommendations from guidelines. METHODS This is a cross-sectional study. Two researchers developed 14 frequently asked questions related to UI. Then, they were inserted into the ChatGPT platform on September 16, 2023. The accuracy (scores from 1 to 5) and completeness (score from 1 to 3) of ChatGPT's answers were assessed individually by two experienced researchers in the Women's Health field, following the recommendations proposed by the guidelines for UI. RESULTS Most of the answers were classified as "more correct than incorrect" (n = 6), followed by "more incorrect than correct" (n = 3), "approximately equal correct and incorrect" (n = 2), "nearly all correct" (n = 2), and "correct" (n = 1). Regarding the appropriateness, most of the answers were classified as adequate, as they provided the minimum information expected to be classified as correct. CONCLUSION These results showed an inconsistency when evaluating the accuracy of answers generated by ChatGPT compared with scientific guidelines. Almost none of the answers provided the complete content expected or reported in previous guidelines, which highlights to healthcare professionals and the scientific community a concern about using artificial intelligence in patient counseling.
Affiliation(s)
- Jordana Barbosa-Silva: Women's Health Research Laboratory, Physical Therapy Department, Federal University of São Carlos, São Carlos, Brazil
- Patricia Driusso: Women's Health Research Laboratory, Physical Therapy Department, Federal University of São Carlos, São Carlos, Brazil
- Elizabeth A Ferreira: Department of Obstetrics and Gynecology, FMUSP School of Medicine, University of São Paulo, São Paulo, Brazil; Department of Physiotherapy, Speech Therapy and Occupational Therapy, School of Medicine, University of São Paulo, São Paulo, Brazil
- Raphael M de Abreu: Department of Physiotherapy, LUNEX University, International University of Health, Exercise & Sports S.A., Differdange, Luxembourg; LUNEX ASBL Luxembourg Health & Sport Sciences Research Institute, Differdange, Luxembourg

13. Arun G, Perumal V, Urias FPJB, Ler YE, Tan BWT, Vallabhajosyula R, Tan E, Ng O, Ng KB, Mogali SR. ChatGPT versus a customized AI chatbot (Anatbuddy) for anatomy education: A comparative pilot study. Anatomical Sciences Education 2024;17:1396-1405. PMID: 39169464; DOI: 10.1002/ase.2502.
Abstract
Large Language Models (LLMs) have the potential to improve education by personalizing learning. However, ChatGPT-generated content has been criticized for sometimes producing false, biased, and/or hallucinatory information. To evaluate AI's ability to return clear and accurate anatomy information, this study generated a custom interactive and intelligent chatbot (Anatbuddy) through an OpenAI Application Programming Interface (API) that enables seamless AI-driven interactions within a secured cloud infrastructure. Anatbuddy was programmed through a Retrieval Augmented Generation (RAG) method to provide context-aware responses to user queries based on a predetermined knowledge base. To compare their outputs, various queries (i.e., prompts) on thoracic anatomy (n = 18) were fed into Anatbuddy and ChatGPT 3.5. A panel comprising three experienced anatomists evaluated both tools' responses for factual accuracy, relevance, completeness, coherence, and fluency on a 5-point Likert scale. These ratings were reviewed by a third party blinded to the study, who revised and finalized scores as needed. Anatbuddy's factual accuracy (mean ± SD = 4.78/5.00 ± 0.43; median = 5.00) was rated significantly higher (U = 84, p = 0.01) than ChatGPT's accuracy (4.11 ± 0.83; median = 4.00). No statistically significant differences were detected between the chatbots for the other variables. Given ChatGPT's current content knowledge limitations, we strongly recommend that the anatomy profession develop a custom AI chatbot for anatomy education utilizing a carefully curated knowledge base to ensure accuracy. Further research is needed to determine students' acceptance of custom chatbots for anatomy education and their influence on learning experiences and outcomes.
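A minimal retrieval-augmented generation (RAG) sketch in the spirit of the Anatbuddy description is shown below. The model names, the tiny knowledge base, and the cosine-similarity retrieval are illustrative assumptions; this is not the study's implementation.

```python
# Minimal RAG sketch: embed a small knowledge base, retrieve the most similar
# chunks for a question, and answer using only that context.
import numpy as np
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

knowledge_base = [
    "The thoracic duct drains lymph from most of the body into the left venous angle.",
    "The phrenic nerve (C3-C5) provides motor supply to the diaphragm.",
    "The azygos vein arches over the root of the right lung to enter the SVC.",
]

def embed(texts):
    # Model name is an assumption, not the one used in the study.
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

kb_vectors = embed(knowledge_base)

def answer(question, k=2):
    q_vec = embed([question])[0]
    # cosine similarity between the question and each knowledge-base chunk
    sims = kb_vectors @ q_vec / (np.linalg.norm(kb_vectors, axis=1) * np.linalg.norm(q_vec))
    context = "\n".join(knowledge_base[i] for i in np.argsort(sims)[::-1][:k])
    chat = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model for illustration
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return chat.choices[0].message.content

print(answer("Which nerve supplies the diaphragm?"))
```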
Affiliation(s)
- Gautham Arun: Lee Kong Chian School of Medicine, Nanyang Technological University Singapore, Singapore, Singapore; Singapore Polytechnic, Singapore, Singapore
- Vivek Perumal: Lee Kong Chian School of Medicine, Nanyang Technological University Singapore, Singapore, Singapore
- Yan En Ler: Singapore Polytechnic, Singapore, Singapore
- Emmanuel Tan: Lee Kong Chian School of Medicine, Nanyang Technological University Singapore, Singapore, Singapore
- Olivia Ng: Lee Kong Chian School of Medicine, Nanyang Technological University Singapore, Singapore, Singapore
- Kian Bee Ng: Lee Kong Chian School of Medicine, Nanyang Technological University Singapore, Singapore, Singapore

14. Patel J, Robinson P, Illing E, Anthony B. Is ChatGPT 3.5 smarter than Otolaryngology trainees? A comparison study of board style exam questions. PLoS One 2024;19:e0306233. PMID: 39325705; PMCID: PMC11426521; DOI: 10.1371/journal.pone.0306233.
Abstract
OBJECTIVES This study compares the performance of the artificial intelligence (AI) platform Chat Generative Pre-Trained Transformer (ChatGPT) to Otolaryngology trainees on board-style exam questions. METHODS We administered a set of 30 Otolaryngology board-style questions to medical students (MS) and Otolaryngology residents (OR). 31 MSs and 17 ORs completed the questionnaire. The same test was administered to ChatGPT version 3.5, five times. Comparisons of performance were achieved using a one-way ANOVA with Tukey Post Hoc test, along with a regression analysis to explore the relationship between education level and performance. RESULTS The average scores increased each year from MS1 to PGY5. A one-way ANOVA revealed that ChatGPT outperformed trainee years MS1, MS2, and MS3 (p < 0.001, p = 0.003, and p = 0.019, respectively). PGY4 and PGY5 otolaryngology residents outperformed ChatGPT (p = 0.033 and 0.002, respectively). For years MS4, PGY1, PGY2, and PGY3, there was no statistically significant difference between trainee scores and ChatGPT (p = .104, .996, and 1.000, respectively). CONCLUSION ChatGPT can outperform lower-level medical trainees on an Otolaryngology board-style exam but still lacks the ability to outperform higher-level trainees. These questions primarily test rote memorization of medical facts; in contrast, the art of practicing medicine is predicated on the synthesis of complex presentations of disease and multilayered application of knowledge of the healing process. Given that upper-level trainees outperform ChatGPT, it is unlikely that ChatGPT, in its current form, will provide significant clinical utility over an Otolaryngologist.
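The statistical comparison named in the methods (one-way ANOVA with a Tukey post hoc test across training levels and ChatGPT) can be sketched as below; the scores are synthetic placeholders, not the study's data.

```python
# Illustrative sketch: one-way ANOVA plus Tukey HSD across groups.
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(3)
groups = {
    "MS1": rng.normal(12, 3, 8),      # hypothetical scores out of 30
    "MS4": rng.normal(18, 3, 8),
    "PGY5": rng.normal(26, 2, 6),
    "ChatGPT": rng.normal(21, 1, 5),  # five repeated administrations
}

f_stat, p = f_oneway(*groups.values())
print(f"One-way ANOVA: F = {f_stat:.2f}, p = {p:.4f}")

scores = np.concatenate(list(groups.values()))
labels = np.concatenate([[name] * len(vals) for name, vals in groups.items()])
print(pairwise_tukeyhsd(scores, labels))   # pairwise comparisons with Tukey correction
```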
Affiliation(s)
- Jaimin Patel: Department of Otolaryngology-Head and Neck Surgery, Indiana University School of Medicine, Indianapolis, IN, United States of America
- Peyton Robinson: Indiana University School of Medicine, Indianapolis, IN, United States of America
- Elisa Illing: Department of Otolaryngology-Head and Neck Surgery, Indiana University School of Medicine, Indianapolis, IN, United States of America
- Benjamin Anthony: Department of Otolaryngology-Head and Neck Surgery, Indiana University School of Medicine, Indianapolis, IN, United States of America

15. Sevgi M, Ruffell E, Antaki F, Chia MA, Keane PA. Foundation models in ophthalmology: opportunities and challenges. Curr Opin Ophthalmol 2024:00055735-990000000-00198. PMID: 39329204; DOI: 10.1097/icu.0000000000001091.
Abstract
PURPOSE OF REVIEW Last year marked the development of the first foundation model in ophthalmology, RETFound, setting the stage for generalizable medical artificial intelligence (GMAI) that can adapt to novel tasks. Additionally, rapid advancements in large language model (LLM) technology, including models such as GPT-4 and Gemini, have been tailored for medical specialization and evaluated on clinical scenarios with promising results. This review explores the opportunities and challenges for further advancements in these technologies. RECENT FINDINGS RETFound outperforms traditional deep learning models in specific tasks, even when only fine-tuned on small datasets. Additionally, LMMs like Med-Gemini and Medprompt GPT-4 perform better than out-of-the-box models for ophthalmology tasks. However, there is still a significant deficiency in ophthalmology-specific multimodal models. This gap is primarily due to the substantial computational resources required to train these models and the limitations of high-quality ophthalmology datasets. SUMMARY Overall, foundation models in ophthalmology present promising opportunities but face challenges, particularly the need for high-quality, standardized datasets for training and specialization. Although development has primarily focused on large language and vision models, the greatest opportunities lie in advancing large multimodal models, which can more closely mimic the capabilities of clinicians.
Affiliation(s)
- Mertcan Sevgi: Institute of Ophthalmology, University College London; Moorfields Eye Hospital NHS Foundation Trust; NIHR Biomedical Research Centre at Moorfields Eye Hospital NHS Foundation Trust, London, UK
- Eden Ruffell: Institute of Ophthalmology, University College London; Institute of Health Informatics; Centre for Medical Image Computing, University College London; NIHR Biomedical Research Centre at Moorfields Eye Hospital NHS Foundation Trust, London, UK
- Fares Antaki: Institute of Ophthalmology, University College London; Moorfields Eye Hospital NHS Foundation Trust; The CHUM School of Artificial Intelligence in Healthcare, Montreal, Quebec, Canada
- Mark A Chia: Institute of Ophthalmology, University College London; Moorfields Eye Hospital NHS Foundation Trust; NIHR Biomedical Research Centre at Moorfields Eye Hospital NHS Foundation Trust, London, UK
- Pearse A Keane: Institute of Ophthalmology, University College London; Moorfields Eye Hospital NHS Foundation Trust; NIHR Biomedical Research Centre at Moorfields Eye Hospital NHS Foundation Trust, London, UK

16. Chia MA, Antaki F, Zhou Y, Turner AW, Lee AY, Keane PA. Foundation models in ophthalmology. Br J Ophthalmol 2024;108:1341-1348. PMID: 38834291; PMCID: PMC11503093; DOI: 10.1136/bjo-2024-325459.
Abstract
Foundation models represent a paradigm shift in artificial intelligence (AI), evolving from narrow models designed for specific tasks to versatile, generalisable models adaptable to a myriad of diverse applications. Ophthalmology as a specialty has the potential to act as an exemplar for other medical specialties, offering a blueprint for integrating foundation models broadly into clinical practice. This review hopes to serve as a roadmap for eyecare professionals seeking to better understand foundation models, while equipping readers with the tools to explore the use of foundation models in their own research and practice. We begin by outlining the key concepts and technological advances which have enabled the development of these models, providing an overview of novel training approaches and modern AI architectures. Next, we summarise existing literature on the topic of foundation models in ophthalmology, encompassing progress in vision foundation models, large language models and large multimodal models. Finally, we outline major challenges relating to privacy, bias and clinical validation, and propose key steps forward to maximise the benefit of this powerful technology.
Affiliation(s)
- Mark A Chia: Institute of Ophthalmology, University College London, London, UK; NIHR Biomedical Research Centre, Moorfields Eye Hospital NHS Foundation Trust, London, UK
- Fares Antaki: Institute of Ophthalmology, University College London, London, UK; NIHR Biomedical Research Centre, Moorfields Eye Hospital NHS Foundation Trust, London, UK; The CHUM School of Artificial Intelligence in Healthcare, Montreal, Quebec, Canada
- Yukun Zhou: Institute of Ophthalmology, University College London, London, UK; NIHR Biomedical Research Centre, Moorfields Eye Hospital NHS Foundation Trust, London, UK
- Angus W Turner: Lions Outback Vision, Lions Eye Institute, Nedlands, Western Australia, Australia; University of Western Australia, Perth, Western Australia, Australia
- Aaron Y Lee: Department of Ophthalmology, University of Washington, Seattle, Washington, USA; Roger and Angie Karalis Johnson Retina Center, University of Washington, Seattle, Washington, USA
- Pearse A Keane: Institute of Ophthalmology, University College London, London, UK; NIHR Biomedical Research Centre, Moorfields Eye Hospital NHS Foundation Trust, London, UK

17. Fowler T, Pullen S, Birkett L. Performance of ChatGPT and Bard on the official part 1 FRCOphth practice questions. Br J Ophthalmol 2024;108:1379-1383. PMID: 37932006; DOI: 10.1136/bjo-2023-324091.
Abstract
BACKGROUND Chat Generative Pre-trained Transformer (ChatGPT), a large language model by OpenAI, and Bard, Google's artificial intelligence (AI) chatbot, have been evaluated in various contexts. This study aims to assess these models' proficiency in the part 1 Fellowship of the Royal College of Ophthalmologists (FRCOphth) Multiple Choice Question (MCQ) examination, highlighting their potential in medical education. METHODS The MRCS mock examination was not used; both models were tested on a sample question bank for the part 1 FRCOphth MCQ exam. Their performances were compared with historical human performance on the exam, focusing on the ability to comprehend, retain and apply information related to ophthalmology. We also tested it on the book 'MCQs for FRCOphth part 1' and assessed its performance across subjects. RESULTS ChatGPT demonstrated a strong performance, surpassing historical human pass marks and examination performance, while Bard underperformed. The comparison indicates the potential of certain AI models to match, and even exceed, human standards in such tasks. CONCLUSION The results demonstrate the potential of AI models, such as ChatGPT, in processing and applying medical knowledge at a postgraduate level. However, performance varied among different models, highlighting the importance of appropriate AI selection. The study underlines the potential for AI applications in medical education and the necessity for further investigation into their strengths and limitations.
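One simple way to formalize "surpassing a pass mark", sketched below, is a one-sample binomial test of a model's correct-answer count against the pass-mark proportion; the question count, pass mark, and scores here are hypothetical placeholders, not figures from the study.

```python
# Illustrative sketch: testing whether a chatbot's MCQ score exceeds a pass mark.
from scipy.stats import binomtest

n_questions = 360     # hypothetical exam length
pass_mark = 0.61      # hypothetical pass-mark proportion

results = {"ChatGPT": 260, "Bard": 200}   # hypothetical numbers of correct answers

for model, n_correct in results.items():
    test = binomtest(n_correct, n_questions, p=pass_mark, alternative="greater")
    print(f"{model}: {n_correct / n_questions:.1%} correct, "
          f"p = {test.pvalue:.4f} versus a {pass_mark:.0%} pass mark")
```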
Affiliation(s)
- Thomas Fowler: Department of Medicine, Barking Havering and Redbridge University Hospitals NHS Trust, London, UK
- Simon Pullen: Department of Anaesthetics, Princess Alexandra Hospital, Harlow, UK
- Liam Birkett: Emergency Medicine, Royal Free Hospital, London, UK

18. Sonmez SC, Sevgi M, Antaki F, Huemer J, Keane PA. Generative artificial intelligence in ophthalmology: current innovations, future applications and challenges. Br J Ophthalmol 2024;108:1335-1340. PMID: 38925907; PMCID: PMC11503064; DOI: 10.1136/bjo-2024-325458.
Abstract
The rapid advancements in generative artificial intelligence are set to significantly influence the medical sector, particularly ophthalmology. Generative adversarial networks and diffusion models enable the creation of synthetic images, aiding the development of deep learning models tailored for specific imaging tasks. Additionally, the advent of multimodal foundational models, capable of generating images, text and videos, presents a broad spectrum of applications within ophthalmology. These range from enhancing diagnostic accuracy to improving patient education and training healthcare professionals. Despite the promising potential, this area of technology is still in its infancy, and there are several challenges to be addressed, including data bias, safety concerns and the practical implementation of these technologies in clinical settings.
Affiliation(s)
- Mertcan Sevgi: Institute of Ophthalmology, University College London, London, UK; Moorfields Eye Hospital, NIHR Moorfields Biomedical Research Centre, London, UK
- Fares Antaki: Institute of Ophthalmology, University College London, London, UK; Moorfields Eye Hospital, NIHR Moorfields Biomedical Research Centre, London, UK; The CHUM School of Artificial Intelligence in Healthcare, Montreal, Quebec, Canada
- Josef Huemer: Moorfields Eye Hospital, NIHR Moorfields Biomedical Research Centre, London, UK; Department of Ophthalmology and Optometry, Kepler University Hospital, Linz, Austria
- Pearse A Keane: Institute of Ophthalmology, University College London, London, UK; Moorfields Eye Hospital, NIHR Moorfields Biomedical Research Centre, London, UK

19. Milad D, Antaki F, Milad J, Farah A, Khairy T, Mikhail D, Giguère CÉ, Touma S, Bernstein A, Szigiato AA, Nayman T, Mullie GA, Duval R. Assessing the medical reasoning skills of GPT-4 in complex ophthalmology cases. Br J Ophthalmol 2024;108:1398-1405. PMID: 38365427; DOI: 10.1136/bjo-2023-325053.
Abstract
BACKGROUND/AIMS This study assesses the proficiency of Generative Pre-trained Transformer (GPT)-4 in answering questions about complex clinical ophthalmology cases. METHODS We tested GPT-4 on 422 Journal of the American Medical Association Ophthalmology Clinical Challenges, and prompted the model to determine the diagnosis (open-ended question) and identify the next-step (multiple-choice question). We generated responses using two zero-shot prompting strategies, including zero-shot plan-and-solve+ (PS+), to improve the reasoning of the model. We compared the best-performing model to human graders in a benchmarking effort. RESULTS Using PS+ prompting, GPT-4 achieved mean accuracies of 48.0% (95% CI (43.1% to 52.9%)) and 63.0% (95% CI (58.2% to 67.6%)) in diagnosis and next step, respectively. Next-step accuracy did not significantly differ by subspecialty (p=0.44). However, diagnostic accuracy in pathology and tumours was significantly higher than in uveitis (p=0.027). When the diagnosis was accurate, 75.2% (95% CI (68.6% to 80.9%)) of the next steps were correct. Conversely, when the diagnosis was incorrect, 50.2% (95% CI (43.8% to 56.6%)) of the next steps were accurate. The next step was three times more likely to be accurate when the initial diagnosis was correct (p<0.001). No significant differences were observed in diagnostic accuracy and decision-making between board-certified ophthalmologists and GPT-4. Among trainees, senior residents outperformed GPT-4 in diagnostic accuracy (p≤0.001 and 0.049) and in accuracy of next step (p=0.002 and 0.020). CONCLUSION Improved prompting enhances GPT-4's performance in complex clinical situations, although it does not surpass ophthalmology trainees in our context. Specialised large language models hold promise for future assistance in medical decision-making and diagnosis.
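The zero-shot plan-and-solve+ (PS+) prompting strategy mentioned in the methods can be sketched as below. The instruction text paraphrases the general PS+ idea, and the model name is an assumption; this is not the exact prompt used in the study.

```python
# Illustrative sketch: zero-shot plan-and-solve style prompting for a vignette.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

PS_PLUS_INSTRUCTION = (
    "Let's first understand the case and extract the relevant clinical findings. "
    "Then let's devise a plan, carry it out step by step, and state the single "
    "most likely diagnosis and the most appropriate next step."
)

def solve_case(vignette: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model for illustration
        temperature=0,
        messages=[
            {"role": "system", "content": "You are an ophthalmology consultant."},
            {"role": "user", "content": f"{vignette}\n\n{PS_PLUS_INSTRUCTION}"},
        ],
    )
    return response.choices[0].message.content

case = ("A 62-year-old presents with sudden painless vision loss in one eye and "
        "a relative afferent pupillary defect. Fundoscopy shows a pale retina "
        "with a cherry-red spot.")
print(solve_case(case))
```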
Affiliation(s)
- Daniel Milad: Department of Ophthalmology, University of Montreal, Montreal, Quebec, Canada; Department of Ophthalmology, Hôpital Maisonneuve-Rosement, Montreal, Quebec, Canada
- Fares Antaki: Department of Ophthalmology, University of Montreal, Montreal, Quebec, Canada; Institute of Ophthalmology, University College London, London, UK; CHUM School of Artificial Intelligence in Healthcare (SAIH), Centre Hospitalier de l'Université de Montréal (CHUM), Montreal, Quebec, Canada
- Jason Milad: Department of Software Engineering, University of Waterloo, Waterloo, Ontario, Canada
- Andrew Farah: Faculty of Medicine, McGill University, Montreal, Quebec, Canada
- Thomas Khairy: Faculty of Medicine, McGill University, Montreal, Quebec, Canada
- David Mikhail: Faculty of Medicine, University of Toronto, Toronto, Ontario, Canada
- Charles-Édouard Giguère: Centre de recherche de l'Institut universitaire en santé mentale de Montréal, Montréal, Quebec, Canada
- Samir Touma: Department of Ophthalmology, University of Montreal, Montreal, Quebec, Canada; Department of Ophthalmology, Hôpital Maisonneuve-Rosement, Montreal, Quebec, Canada
- Allison Bernstein: Department of Ophthalmology, University of Montreal, Montreal, Quebec, Canada; Department of Ophthalmology, Hôpital Maisonneuve-Rosement, Montreal, Quebec, Canada
- Andrei-Alexandru Szigiato: Department of Ophthalmology, University of Montreal, Montreal, Quebec, Canada; Department of Ophthalmology, Hôpital du Sacré-Coeur de Montréal, Montreal, Quebec, Canada
- Taylor Nayman: Department of Ophthalmology, University of Montreal, Montreal, Quebec, Canada; Department of Ophthalmology, Hôpital Maisonneuve-Rosement, Montreal, Quebec, Canada
- Guillaume A Mullie: Department of Ophthalmology, University of Montreal, Montreal, Quebec, Canada; Department of Ophthalmology, Cité-de-la-Santé Hospital, Laval, Quebec, Canada
- Renaud Duval: Department of Ophthalmology, University of Montreal, Montreal, Quebec, Canada; Department of Ophthalmology, Hôpital Maisonneuve-Rosement, Montreal, Quebec, Canada
20
|
Antaki F, Milad D, Chia MA, Giguère CÉ, Touma S, El-Khoury J, Keane PA, Duval R. Capabilities of GPT-4 in ophthalmology: an analysis of model entropy and progress towards human-level medical question answering. Br J Ophthalmol 2024; 108:1371-1378. [PMID: 37923374 DOI: 10.1136/bjo-2023-324438] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/24/2023] [Accepted: 10/01/2023] [Indexed: 11/07/2023]
Abstract
BACKGROUND Evidence on the performance of Generative Pre-trained Transformer 4 (GPT-4), a large language model (LLM), in the ophthalmology question-answering domain is needed. METHODS We tested GPT-4 on two 260-question multiple choice question sets from the Basic and Clinical Science Course (BCSC) Self-Assessment Program and the OphthoQuestions question banks. We compared the accuracy of GPT-4 models with varying temperatures (creativity setting) and evaluated their responses in a subset of questions. We also compared the best-performing GPT-4 model to GPT-3.5 and to historical human performance. RESULTS GPT-4-0.3 (GPT-4 with a temperature of 0.3) achieved the highest accuracy among GPT-4 models, with 75.8% on the BCSC set and 70.0% on the OphthoQuestions set. The combined accuracy was 72.9%, which represents an 18.3% raw improvement in accuracy compared with GPT-3.5 (p<0.001). Human graders preferred responses from models with a temperature higher than 0 (more creative). Exam section, question difficulty and cognitive level were all predictive of GPT-4-0.3 answer accuracy. GPT-4-0.3's performance was numerically superior to human performance on the BCSC (75.8% vs 73.3%) and OphthoQuestions (70.0% vs 63.0%), but the difference was not statistically significant (p=0.55 and p=0.09). CONCLUSION GPT-4, an LLM trained on non-ophthalmology-specific data, performs significantly better than its predecessor on simulated ophthalmology board-style exams. Remarkably, its performance tended to be superior to historical human performance, but that difference was not statistically significant in our study.
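A minimal sketch of how one might sweep the temperature setting and score multiple-choice accuracy with the OpenAI API is shown below. The model name, the naive answer-letter parsing, and the question data are assumptions; the BCSC and OphthoQuestions items themselves are proprietary and are not reproduced.
```python
# Sketch: compare multiple-choice accuracy across temperature settings.
# Question content, answer parsing, and model name are illustrative only.
import re
from openai import OpenAI

client = OpenAI()

def ask(question: str, model: str, temperature: float) -> str:
    reply = client.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=[{"role": "user",
                   "content": question + "\nAnswer with a single letter (A-D)."}],
    )
    return reply.choices[0].message.content

def accuracy(questions, model="gpt-4", temperature=0.3):
    """questions: list of (prompt_text, correct_letter) pairs."""
    correct = 0
    for prompt_text, answer in questions:
        text = ask(prompt_text, model, temperature)
        match = re.search(r"\b([A-D])\b", text)  # naive answer extraction
        correct += bool(match) and match.group(1) == answer
    return correct / len(questions)

# Usage sketch (question_bank would be a list of (question, answer) pairs):
# for t in (0.0, 0.3, 0.7, 1.0):
#     print(t, accuracy(question_bank, temperature=t))
```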
Collapse
Affiliation(s)
- Fares Antaki
- Moorfields Eye Hospital NHS Foundation Trust, London, UK
- Institute of Ophthalmology, UCL, London, UK
- The CHUM School of Artificial Intelligence in Healthcare, Montreal, Quebec, Canada
- Department of Ophthalmology, University of Montreal, Montreal, Quebec, Canada
- Department of Ophthalmology, Centre Hospitalier de l'Universite de Montreal (CHUM), Montreal, Quebec, Canada
| | - Daniel Milad
- Department of Ophthalmology, University of Montreal, Montreal, Quebec, Canada
- Department of Ophthalmology, Centre Hospitalier de l'Universite de Montreal (CHUM), Montreal, Quebec, Canada
- Department of Ophthalmology, Hopital Maisonneuve-Rosemont, Montreal, Quebec, Canada
| | - Mark A Chia
- Moorfields Eye Hospital NHS Foundation Trust, London, UK
- Institute of Ophthalmology, UCL, London, UK
| | | | - Samir Touma
- Department of Ophthalmology, University of Montreal, Montreal, Quebec, Canada
- Department of Ophthalmology, Centre Hospitalier de l'Universite de Montreal (CHUM), Montreal, Quebec, Canada
- Department of Ophthalmology, Hopital Maisonneuve-Rosemont, Montreal, Quebec, Canada
| | - Jonathan El-Khoury
- Department of Ophthalmology, University of Montreal, Montreal, Quebec, Canada
- Department of Ophthalmology, Centre Hospitalier de l'Universite de Montreal (CHUM), Montreal, Quebec, Canada
- Department of Ophthalmology, Hopital Maisonneuve-Rosemont, Montreal, Quebec, Canada
| | - Pearse A Keane
- Moorfields Eye Hospital NHS Foundation Trust, London, UK
- Institute of Ophthalmology, UCL, London, UK
- NIHR Moorfields Biomedical Research Centre, London, UK
| | - Renaud Duval
- Department of Ophthalmology, University of Montreal, Montreal, Quebec, Canada
- Department of Ophthalmology, Hopital Maisonneuve-Rosemont, Montreal, Quebec, Canada
| |
Collapse
|
21
|
Wong M, Lim ZW, Pushpanathan K, Cheung CY, Wang YX, Chen D, Tham YC. Review of emerging trends and projection of future developments in large language models research in ophthalmology. Br J Ophthalmol 2024; 108:1362-1370. [PMID: 38164563 DOI: 10.1136/bjo-2023-324734] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/11/2023] [Accepted: 11/14/2023] [Indexed: 01/03/2024]
Abstract
BACKGROUND Large language models (LLMs) are fast emerging as potent tools in healthcare, including ophthalmology. This systematic review offers a twofold contribution: it summarises current trends in ophthalmology-related LLM research and projects future directions for this burgeoning field. METHODS We systematically searched across various databases (PubMed, Europe PMC, Scopus and Web of Science) for articles related to LLM use in ophthalmology, published between 1 January 2022 and 31 July 2023. Selected articles were summarised, and categorised by type (editorial, commentary, original research, etc) and their research focus (eg, evaluating ChatGPT's performance in ophthalmology examinations or clinical tasks). FINDINGS We identified 32 articles meeting our criteria, published between January and July 2023, with a peak in June (n=12). Most were original research evaluating LLMs' proficiency in clinically related tasks (n=9). Studies demonstrated that ChatGPT-4.0 outperformed its predecessor, ChatGPT-3.5, in ophthalmology exams. Furthermore, ChatGPT excelled in constructing discharge notes (n=2), evaluating diagnoses (n=2) and answering general medical queries (n=6). However, it struggled with generating scientific articles or abstracts (n=3) and answering specific subdomain questions, especially those regarding specific treatment options (n=2). ChatGPT's performance relative to other LLMs (Google's Bard, Microsoft's Bing) varied by study design. Ethical concerns such as data hallucination (n=27), authorship (n=5) and data privacy (n=2) were frequently cited. INTERPRETATION While LLMs hold transformative potential for healthcare and ophthalmology, concerns over accountability, accuracy and data security remain. Future research should focus on application programming interface integration, comparative assessments of popular LLMs, their ability to interpret image-based data and the establishment of standardised evaluation frameworks.
Collapse
Affiliation(s)
| | - Zhi Wei Lim
- Yong Loo Lin School of Medicine, National University of Singapore, Singapore
| | - Krithi Pushpanathan
- Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health & Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
| | - Carol Y Cheung
- Ophthalmology and Visual Sciences, The Chinese University of Hong Kong, Hong Kong, Hong Kong
| | - Ya Xing Wang
- Beijing Institute of Ophthalmology, Beijing Tongren Hospital, Capital University of Medical Science, Beijing, China
| | - David Chen
- Centre for Innovation and Precision Eye Health & Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Department of Ophthalmology, National University Hospital, Singapore
| | - Yih Chung Tham
- Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health & Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Singapore Eye Research Institute, Singapore National Eye Centre, Singapore
| |
Collapse
|
22
|
Xu P, Chen X, Zhao Z, Shi D. Unveiling the clinical incapabilities: a benchmarking study of GPT-4V(ision) for ophthalmic multimodal image analysis. Br J Ophthalmol 2024; 108:1384-1389. [PMID: 38789133 DOI: 10.1136/bjo-2023-325054] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/10/2023] [Accepted: 05/13/2024] [Indexed: 05/26/2024]
Abstract
PURPOSE To evaluate the capabilities and incapabilities of a GPT-4V(ision)-based chatbot in interpreting ocular multimodal images. METHODS We developed a digital ophthalmologist app using GPT-4V and evaluated its performance with a dataset (60 images, 60 ophthalmic conditions, 6 modalities) that included slit-lamp, scanning laser ophthalmoscopy, fundus photography of the posterior pole (FPP), optical coherence tomography, fundus fluorescein angiography and ocular ultrasound images. The chatbot was tested with ten open-ended questions per image, covering examination identification, lesion detection, diagnosis and decision support. The responses were manually assessed for accuracy, usability, safety and diagnosis repeatability. Auto-evaluation was performed using sentence similarity and GPT-4-based auto-evaluation. RESULTS Out of 600 responses, 30.6% were accurate, 21.5% were highly usable and 55.6% were judged to pose no harm. GPT-4V performed best with slit-lamp images, with 42.0%, 38.5% and 68.5% of the responses rated as accurate, highly usable and posing no harm, respectively. However, its performance was weaker for FPP images, with only 13.7%, 3.7% and 38.5% in the same categories. GPT-4V correctly identified 95.6% of the imaging modalities and showed varying accuracies in lesion identification (25.6%), diagnosis (16.1%) and decision support (24.0%). The overall repeatability of GPT-4V in diagnosing ocular images was 63.3% (38/60). The overall sentence similarity between responses generated by GPT-4V and human answers was 55.5%, with Spearman correlations of 0.569 for accuracy and 0.576 for usability. CONCLUSION GPT-4V is not yet suitable for clinical decision-making in ophthalmology. Our study serves as a benchmark for enhancing ophthalmic multimodal models.
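The two auto-evaluation steps mentioned above (sentence similarity against reference answers, and rank correlation of those similarities with human ratings) can be sketched as follows. The sentence encoder and the toy response/reference pairs are assumptions for illustration, not the study's actual pipeline or data.
```python
# Sketch: sentence-similarity auto-evaluation plus Spearman correlation with
# human ratings. Encoder choice and all data below are illustrative only.
from sentence_transformers import SentenceTransformer, util
from scipy.stats import spearmanr

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works

responses = [
    "Findings are consistent with proliferative diabetic retinopathy.",
    "This appears to be a normal macular OCT scan.",
    "The image shows a dense vitreous haemorrhage on ultrasound.",
    "There is a corneal ulcer with hypopyon on slit-lamp examination.",
]
references = [
    "Fundus photograph showing proliferative diabetic retinopathy.",
    "OCT demonstrating cystoid macular oedema.",
    "B-scan ultrasound showing vitreous haemorrhage.",
    "Slit-lamp photograph of bacterial keratitis with hypopyon.",
]
human_ratings = [2, 0, 2, 1]  # e.g., 0 = incorrect, 1 = partly correct, 2 = correct

def similarity(a: str, b: str) -> float:
    """Cosine similarity between the sentence embeddings of a and b."""
    emb = encoder.encode([a, b], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))

sims = [similarity(r, ref) for r, ref in zip(responses, references)]
rho, p = spearmanr(sims, human_ratings)
print("mean similarity:", sum(sims) / len(sims))
print(f"Spearman rho = {rho:.3f}, p = {p:.3f}")
```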
Collapse
Affiliation(s)
- Pusheng Xu
- School of Optometry, The Hong Kong Polytechnic University, Kowloon, Hong Kong SAR, China
| | - Xiaolan Chen
- School of Optometry, The Hong Kong Polytechnic University, Kowloon, Hong Kong SAR, China
| | - Ziwei Zhao
- School of Optometry, The Hong Kong Polytechnic University, Kowloon, Hong Kong SAR, China
| | - Danli Shi
- School of Optometry, The Hong Kong Polytechnic University, Kowloon, Hong Kong SAR, China
- Research Centre for SHARP Vision, The Hong Kong Polytechnic University, Kowloon, Hong Kong SAR, China
- Centre for Eye and Vision Research (CEVR), 17W Hong Kong Science Park, Hong Kong
| |
Collapse
|
23
|
Bahir D, Zur O, Attal L, Nujeidat Z, Knaanie A, Pikkel J, Mimouni M, Plopsky G. Gemini AI vs. ChatGPT: A comprehensive examination alongside ophthalmology residents in medical knowledge. Graefes Arch Clin Exp Ophthalmol 2024:10.1007/s00417-024-06625-4. [PMID: 39277830 DOI: 10.1007/s00417-024-06625-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/04/2024] [Revised: 08/15/2024] [Accepted: 08/16/2024] [Indexed: 09/17/2024] Open
Abstract
INTRODUCTION The rapid advancement of artificial intelligence (AI), particularly in large language models like ChatGPT and Google's Gemini AI, marks a transformative era in technological innovation. This study explores the potential of AI in ophthalmology, focusing on the capabilities of ChatGPT and Gemini AI. While these models hold promise for medical education and clinical support, their integration requires comprehensive evaluation. This research aims to bridge a gap in the literature by comparing Gemini AI and ChatGPT, assessing their performance against ophthalmology residents using a dataset derived from ophthalmology board exams. METHODS A dataset comprising 600 questions across 12 subspecialties was curated from Israeli ophthalmology residency exams, encompassing text and image-based formats. Four AI models - ChatGPT-3.5, ChatGPT-4, Gemini, and Gemini Advanced - underwent testing on this dataset. The study includes a comparative analysis with Israeli ophthalmology residents, employing specific metrics for performance assessment. RESULTS Gemini Advanced demonstrated superior performance with a 66% accuracy rate. Notably, ChatGPT-4 exhibited improvement at 62%, Gemini at 58%, and ChatGPT-3.5 served as the reference at 46%. Comparative analysis with residents offered insights into AI models' performance relative to human-level medical knowledge. Further analysis delved into yearly performance trends, topic-specific variations, and the impact of images on chatbot accuracy. CONCLUSION The study unveils nuanced AI model capabilities in ophthalmology, emphasizing domain-specific variations. The superior performance of Gemini Advanced indicates significant advancements, while ChatGPT-4's improvement is noteworthy. Both Gemini and ChatGPT-3.5 demonstrated commendable performance. The comparative analysis underscores AI's evolving role as a supplementary tool in medical education. This research contributes vital insights into AI effectiveness in ophthalmology, highlighting areas for refinement. As AI models evolve, targeted improvements can enhance adaptability across subspecialties, making them valuable tools for medical professionals and enriching patient care. KEY MESSAGES What is known: AI breakthroughs, like ChatGPT and Google's Gemini AI, are reshaping healthcare. In ophthalmology, AI integration has overhauled clinical workflows, particularly in analyzing images for diseases like diabetic retinopathy and glaucoma. What is new: This study presents a pioneering comparison between Gemini AI and ChatGPT, evaluating their performance against ophthalmology residents using a meticulously curated dataset derived from real-world ophthalmology board exams. Notably, Gemini Advanced demonstrates superior performance, showcasing substantial advancements, while the evolution of ChatGPT-4 also merits attention. Both models exhibit commendable capabilities. These findings offer crucial insights into the efficacy of AI in ophthalmology, shedding light on areas ripe for further enhancement and optimization.
Collapse
Affiliation(s)
- Daniel Bahir
- Department of Ophthalmology, Tzafon Medical Center, Poriya, Israel.
- Azrieli Faculty of Medicine, Bar Ilan University, Safed, Israel.
| | - Omri Zur
- Azrieli Faculty of Medicine, Bar Ilan University, Safed, Israel
| | - Leah Attal
- Azrieli Faculty of Medicine, Bar Ilan University, Safed, Israel
| | - Zaki Nujeidat
- Department of Ophthalmology, Tzafon Medical Center, Poriya, Israel
- Azrieli Faculty of Medicine, Bar Ilan University, Safed, Israel
| | - Ariela Knaanie
- Department of Ophthalmology, Samson Assuta Ashdod Hospital, Ashdod, Israel
| | - Joseph Pikkel
- Department of Ophthalmology, Samson Assuta Ashdod Hospital, Ashdod, Israel
- Faculty of Health Sciences, Ben-Gurion University of the Negev, Be'er Sheva, Israel
| | - Michael Mimouni
- Department of Ophthalmology, Rambam Health Care Campus, Haifa, Israel
| | - Gilad Plopsky
- Department of Ophthalmology, Samson Assuta Ashdod Hospital, Ashdod, Israel
- Faculty of Health Sciences, Ben-Gurion University of the Negev, Be'er Sheva, Israel
| |
Collapse
|
24
|
Guastafierro V, Corbitt DN, Bressan A, Fernandes B, Mintemur Ö, Magnoli F, Ronchi S, La Rosa S, Uccella S, Renne SL. Unveiling the risks of ChatGPT in diagnostic surgical pathology. Virchows Arch 2024:10.1007/s00428-024-03918-1. [PMID: 39269615 DOI: 10.1007/s00428-024-03918-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/20/2024] [Revised: 08/28/2024] [Accepted: 08/29/2024] [Indexed: 09/15/2024]
Abstract
ChatGPT, an AI capable of processing and generating human-like language, has been studied in medical education and care, yet its potential in histopathological diagnosis remains unexplored. This study evaluates ChatGPT's reliability in addressing pathology-related diagnostic questions across ten subspecialties and its ability to provide scientific references. We crafted five clinico-pathological scenarios per subspecialty, simulating a pathologist using ChatGPT to refine differential diagnoses. Each scenario, aligned with current diagnostic guidelines and validated by expert pathologists, was posed as open-ended or multiple-choice questions, either requesting scientific references or not. Outputs were assessed by six pathologists according to: (1) usefulness in supporting the diagnosis and (2) absolute number of errors. We used directed acyclic graphs and structural causal models to determine the effect of each scenario type, field, question modality, and pathologist evaluation. This yielded 894 evaluations. ChatGPT provided useful answers in 62.2% of cases, and 32.1% of outputs contained no errors, while the remainder had at least one error. ChatGPT provided 214 bibliographic references: 70.1% correct, 12.1% inaccurate, and 17.8% non-existing. Scenario variability had the greatest impact on ratings, and latent knowledge across fields showed minimal variation. Although ChatGPT provided useful responses in one-third of cases, the frequency of errors and variability underscores its inadequacy for routine diagnostic use and highlights the need for discretion as a support tool. Imprecise referencing also suggests caution as a self-learning tool. It is essential to recognize the irreplaceable role of human experts in synthesizing images, clinical data, and experience for the intricate task of histopathological diagnosis.
Collapse
Affiliation(s)
- Vincenzo Guastafierro
- Department of Biomedical Sciences, Humanitas University, Via Rita Levi Montalcini 4, 20072, Pieve Emanuele, Milan, Italy
- Department of Pathology, IRCCS Humanitas Research Hospital, Via Manzoni 56, 20089, Rozzano, Milan, Italy
| | - Devin N Corbitt
- Department of Biomedical Sciences, Humanitas University, Via Rita Levi Montalcini 4, 20072, Pieve Emanuele, Milan, Italy
| | - Alessandra Bressan
- Department of Biomedical Sciences, Humanitas University, Via Rita Levi Montalcini 4, 20072, Pieve Emanuele, Milan, Italy
- Department of Pathology, IRCCS Humanitas Research Hospital, Via Manzoni 56, 20089, Rozzano, Milan, Italy
| | - Bethania Fernandes
- Department of Pathology, IRCCS Humanitas Research Hospital, Via Manzoni 56, 20089, Rozzano, Milan, Italy
| | - Ömer Mintemur
- Department of Pathology, IRCCS Humanitas Research Hospital, Via Manzoni 56, 20089, Rozzano, Milan, Italy
| | - Francesca Magnoli
- Unit of Pathology, Department of Oncology, ASST Sette Laghi, Varese, Italy
| | - Susanna Ronchi
- Unit of Pathology, Department of Oncology, ASST Sette Laghi, Varese, Italy
| | - Stefano La Rosa
- Unit of Pathology, Department of Oncology, ASST Sette Laghi, Varese, Italy
- Unit of Pathology, Department of Medicine and Technological Innovation, University of Insubria, Varese, Italy
| | - Silvia Uccella
- Department of Biomedical Sciences, Humanitas University, Via Rita Levi Montalcini 4, 20072, Pieve Emanuele, Milan, Italy
- Department of Pathology, IRCCS Humanitas Research Hospital, Via Manzoni 56, 20089, Rozzano, Milan, Italy
| | - Salvatore Lorenzo Renne
- Department of Biomedical Sciences, Humanitas University, Via Rita Levi Montalcini 4, 20072, Pieve Emanuele, Milan, Italy.
- Department of Pathology, IRCCS Humanitas Research Hospital, Via Manzoni 56, 20089, Rozzano, Milan, Italy.
| |
Collapse
|
25
|
Jung H, Oh J, Stephenson KAJ, Joe AW, Mammo ZN. Prompt engineering with ChatGPT3.5 and GPT4 to improve patient education on retinal diseases. CANADIAN JOURNAL OF OPHTHALMOLOGY 2024:S0008-4182(24)00258-8. [PMID: 39245293 DOI: 10.1016/j.jcjo.2024.08.010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Received: 02/12/2024] [Revised: 04/24/2024] [Accepted: 08/18/2024] [Indexed: 09/10/2024]
Abstract
OBJECTIVE To assess the effect of prompt engineering on the accuracy, comprehensiveness, readability, and empathy of large language model (LLM)-generated responses to patient questions regarding retinal disease. DESIGN Prospective qualitative study. PARTICIPANTS Retina specialists, ChatGPT3.5, and GPT4. METHODS Twenty common patient questions regarding 5 retinal conditions were inputted to ChatGPT3.5 and GPT4 as a stand-alone question, preceded by an optimized prompt (prompt A), or preceded by prompt A with specified limits to length and grade reading level (prompt B). Accuracy and comprehensiveness were graded by 3 retina specialists on a Likert scale from 1 to 5 (1: very poor to 5: very good). Readability of responses was assessed using Readable.com, an online readability tool. RESULTS There were no significant differences between ChatGPT3.5 and GPT4 across any of the metrics tested. Median accuracies of responses to stand-alone, prompt A, and prompt B questions were 5.0, 5.0, and 4.0, respectively. Median comprehensiveness scores for responses to stand-alone, prompt A, and prompt B questions were 5.0, 5.0, and 4.0, respectively. The use of prompt B was associated with lower accuracy and comprehensiveness than responses to stand-alone or prompt A questions (p < 0.001). Average grade reading levels of responses across both LLMs were 13.45, 11.5, and 10.3 for stand-alone, prompt A, and prompt B questions, respectively (p < 0.001). CONCLUSIONS Prompt engineering can significantly improve the readability of LLM-generated responses, although at the cost of reduced accuracy and comprehensiveness. Further study is needed to understand the utility and bioethical implications of LLMs as a patient educational resource.
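The study graded readability with Readable.com; as a rough stand-in, the open-source textstat package computes comparable grade-level metrics. A minimal sketch follows; the sample response text is invented, and textstat will not reproduce Readable.com's exact scores.
```python
# Sketch: estimate the grade reading level of an LLM-generated patient response.
# textstat is used here only as an open-source proxy for Readable.com.
import textstat

response = (
    "Age-related macular degeneration is a condition that affects the central "
    "part of the retina, called the macula, and can blur your central vision."
)

print("Flesch-Kincaid grade level:", textstat.flesch_kincaid_grade(response))
print("Flesch Reading Ease:", textstat.flesch_reading_ease(response))
```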
Collapse
Affiliation(s)
- Hoyoung Jung
- Faculty of Medicine, University of British Columbia, Vancouver BC, Canada
| | - Jean Oh
- Faculty of Medicine, University of British Columbia, Vancouver BC, Canada
| | - Kirk A J Stephenson
- Department of Ophthalmology and Visual Sciences, University of British Columbia, Vancouver BC, Canada
| | - Aaron W Joe
- Department of Ophthalmology and Visual Sciences, University of British Columbia, Vancouver BC, Canada
| | - Zaid N Mammo
- Department of Ophthalmology and Visual Sciences, University of British Columbia, Vancouver BC, Canada.
| |
Collapse
|
26
|
Strzalkowski P, Strzalkowska A, Chhablani J, Pfau K, Errera MH, Roth M, Schaub F, Bechrakis NE, Hoerauf H, Reiter C, Schuster AK, Geerling G, Guthoff R. Evaluation of the accuracy and readability of ChatGPT-4 and Google Gemini in providing information on retinal detachment: a multicenter expert comparative study. Int J Retina Vitreous 2024; 10:61. [PMID: 39223678 PMCID: PMC11367851 DOI: 10.1186/s40942-024-00579-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/07/2024] [Accepted: 08/22/2024] [Indexed: 09/04/2024] Open
Abstract
BACKGROUND Large language models (LLMs) such as ChatGPT-4 and Google Gemini show potential for patient health education, but concerns about their accuracy require careful evaluation. This study evaluates the readability and accuracy of ChatGPT-4 and Google Gemini in answering questions about retinal detachment. METHODS Comparative study analyzing responses from ChatGPT-4 and Google Gemini to 13 retinal detachment questions, categorized by difficulty levels (D1, D2, D3). Masked responses were reviewed by ten vitreoretinal specialists and rated on correctness, errors, thematic accuracy, coherence, and overall quality grading. Analysis included Flesch Readability Ease Score, word and sentence counts. RESULTS Both Artificial Intelligence tools required college-level understanding for all difficulty levels. Google Gemini was easier to understand (p = 0.03), while ChatGPT-4 provided more correct answers for the more difficult questions (p = 0.0005) with fewer serious errors. ChatGPT-4 scored highest on most challenging questions, showing superior thematic accuracy (p = 0.003). ChatGPT-4 outperformed Google Gemini in 8 of 13 questions, with higher overall quality grades in the easiest (p = 0.03) and hardest levels (p = 0.0002), showing a lower grade as question difficulty increased. CONCLUSIONS ChatGPT-4 and Google Gemini effectively address queries about retinal detachment, offering mostly accurate answers with few critical errors, though patients require higher education for comprehension. The implementation of AI tools may contribute to improving medical care by providing accurate and relevant healthcare information quickly.
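For reference, the standard published Flesch Reading Ease (FRE) and Flesch-Kincaid grade-level (FKGL) formulas that underlie readability scores of the kind reported above are given below; these are the generic formulas, not notation or values taken from the study.
```latex
\mathrm{FRE} = 206.835 - 1.015\,\frac{\#\text{words}}{\#\text{sentences}} - 84.6\,\frac{\#\text{syllables}}{\#\text{words}}
\qquad
\mathrm{FKGL} = 0.39\,\frac{\#\text{words}}{\#\text{sentences}} + 11.8\,\frac{\#\text{syllables}}{\#\text{words}} - 15.59
```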
Collapse
Affiliation(s)
- Piotr Strzalkowski
- Department of Ophthalmology, Medical Faculty and University Hospital Düsseldorf - Heinrich Heine University Düsseldorf, Düsseldorf, Germany.
| | - Alicja Strzalkowska
- Department of Ophthalmology, Medical Faculty and University Hospital Düsseldorf - Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Jay Chhablani
- UPMC Eye Center, University of Pittsburgh, Pittsburgh, PA, USA
| | - Kristina Pfau
- Department of Ophthalmology, University Hospital of Basel, Basel, Switzerland
| | | | - Mathias Roth
- Department of Ophthalmology, Medical Faculty and University Hospital Düsseldorf - Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Friederike Schaub
- Department of Ophthalmology, University Medical Centre Rostock, Rostock, Germany
| | | | - Hans Hoerauf
- Department of Ophthalmology, University Medical Center Göttingen, Göttingen, Germany
| | - Constantin Reiter
- Department of Ophthalmology, Helios HSK Wiesbaden, Wiesbaden, Germany
| | - Alexander K Schuster
- Department of Ophthalmology, Mainz University Medical Centre of the Johannes Gutenberg, University of Mainz, Mainz, Germany
| | - Gerd Geerling
- Department of Ophthalmology, Medical Faculty and University Hospital Düsseldorf - Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Rainer Guthoff
- Department of Ophthalmology, Medical Faculty and University Hospital Düsseldorf - Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| |
Collapse
|
27
|
Tailor PD, D'Souza HS, Li H, Starr MR. Vision of the future: large language models in ophthalmology. Curr Opin Ophthalmol 2024; 35:391-402. [PMID: 38814572 DOI: 10.1097/icu.0000000000001062] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/31/2024]
Abstract
PURPOSE OF REVIEW Large language models (LLMs) are rapidly entering the landscape of medicine in areas from patient interaction to clinical decision-making. This review discusses the evolving role of LLMs in ophthalmology, focusing on their current applications and future potential in enhancing ophthalmic care. RECENT FINDINGS LLMs in ophthalmology have demonstrated potential in improving patient communication and aiding preliminary diagnostics because of their ability to process complex language and generate human-like domain-specific interactions. However, some studies have shown potential for harm and there have been no prospective real-world studies evaluating the safety and efficacy of LLMs in practice. SUMMARY While current applications are largely theoretical and require rigorous safety testing before implementation, LLMs exhibit promise in augmenting patient care quality and efficiency. Challenges such as data privacy and user acceptance must be overcome before LLMs can be fully integrated into clinical practice.
Collapse
Affiliation(s)
| | - Haley S D'Souza
- Department of Ophthalmology, Mayo Clinic, Rochester, Minnesota
| | - Hanzhou Li
- Department of Radiology, Emory University, Atlanta, Georgia, USA
| | - Matthew R Starr
- Department of Ophthalmology, Mayo Clinic, Rochester, Minnesota
| |
Collapse
|
28
|
Carlà MM, Gambini G, Baldascino A, Boselli F, Giannuzzi F, Margollicci F, Rizzo S. Large language models as assistance for glaucoma surgical cases: a ChatGPT vs. Google Gemini comparison. Graefes Arch Clin Exp Ophthalmol 2024; 262:2945-2959. [PMID: 38573349 PMCID: PMC11377518 DOI: 10.1007/s00417-024-06470-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/22/2024] [Revised: 03/11/2024] [Accepted: 03/20/2024] [Indexed: 04/05/2024] Open
Abstract
PURPOSE The aim of this study was to define the capability of ChatGPT-4 and Google Gemini in analyzing detailed glaucoma case descriptions and suggesting an accurate surgical plan. METHODS Retrospective analysis of 60 medical records of surgical glaucoma cases, divided into "ordinary" (n = 40) and "challenging" (n = 20) scenarios. Case descriptions were entered into the ChatGPT and Gemini (formerly Bard) interfaces with the question "What kind of surgery would you perform?" and repeated three times to analyze the answers' consistency. After collecting the answers, we assessed the level of agreement with the unified opinion of three glaucoma surgeons. Moreover, we graded the quality of the responses with scores from 1 (poor quality) to 5 (excellent quality) according to the Global Quality Score (GQS) and compared the results. RESULTS ChatGPT's surgical choice was consistent with that of the glaucoma specialists in 35/60 cases (58%), compared with 19/60 (32%) for Gemini (p = 0.0001). Gemini was not able to complete the task in 16 cases (27%). Trabeculectomy was the most frequent choice for both chatbots (53% and 50% for ChatGPT and Gemini, respectively). In "challenging" cases, ChatGPT agreed with the specialists in 9/20 choices (45%), outperforming Google Gemini (4/20, 20%). Overall, GQS scores were 3.5 ± 1.2 and 2.1 ± 1.5 for ChatGPT and Gemini (p = 0.002). This difference was even more marked when focusing only on "challenging" cases (1.5 ± 1.4 vs. 3.0 ± 1.5, p = 0.001). CONCLUSION ChatGPT-4 showed good analytical performance for glaucoma surgical cases, both ordinary and challenging. Google Gemini, on the other hand, showed strong limitations in this setting, with high rates of imprecise or missing answers.
Collapse
Affiliation(s)
- Matteo Mario Carlà
- Ophthalmology Department, Fondazione Policlinico Universitario A. Gemelli, IRCCS, 00168, Rome, Italy.
- Ophthalmology Department, Catholic University "Sacro Cuore,", Largo A. Gemelli, 8, Rome, Italy.
| | - Gloria Gambini
- Ophthalmology Department, Fondazione Policlinico Universitario A. Gemelli, IRCCS, 00168, Rome, Italy
- Ophthalmology Department, Catholic University "Sacro Cuore,", Largo A. Gemelli, 8, Rome, Italy
| | - Antonio Baldascino
- Ophthalmology Department, Fondazione Policlinico Universitario A. Gemelli, IRCCS, 00168, Rome, Italy
- Ophthalmology Department, Catholic University "Sacro Cuore,", Largo A. Gemelli, 8, Rome, Italy
| | - Francesco Boselli
- Ophthalmology Department, Fondazione Policlinico Universitario A. Gemelli, IRCCS, 00168, Rome, Italy
- Ophthalmology Department, Catholic University "Sacro Cuore,", Largo A. Gemelli, 8, Rome, Italy
| | - Federico Giannuzzi
- Ophthalmology Department, Fondazione Policlinico Universitario A. Gemelli, IRCCS, 00168, Rome, Italy
- Ophthalmology Department, Catholic University "Sacro Cuore,", Largo A. Gemelli, 8, Rome, Italy
| | - Fabio Margollicci
- Ophthalmology Department, Fondazione Policlinico Universitario A. Gemelli, IRCCS, 00168, Rome, Italy
- Ophthalmology Department, Catholic University "Sacro Cuore,", Largo A. Gemelli, 8, Rome, Italy
| | - Stanislao Rizzo
- Ophthalmology Department, Fondazione Policlinico Universitario A. Gemelli, IRCCS, 00168, Rome, Italy
- Ophthalmology Department, Catholic University "Sacro Cuore,", Largo A. Gemelli, 8, Rome, Italy
| |
Collapse
|
29
|
Kenney RC, Requarth TW, Jack AI, Hyman SW, Galetta SL, Grossman SN. AI in Neuro-Ophthalmology: Current Practice and Future Opportunities. J Neuroophthalmol 2024; 44:308-318. [PMID: 38965655 DOI: 10.1097/wno.0000000000002205] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/06/2024]
Abstract
BACKGROUND Neuro-ophthalmology frequently requires a complex and multi-faceted clinical assessment supported by sophisticated imaging techniques in order to assess disease status. The current approach to diagnosis requires substantial expertise and time. The emergence of AI has brought forth innovative solutions to streamline and enhance this diagnostic process, which is especially valuable given the shortage of neuro-ophthalmologists. Machine learning algorithms, in particular, have demonstrated significant potential in interpreting imaging data, identifying subtle patterns, and aiding clinicians in making more accurate and timely diagnoses while also supplementing nonspecialist evaluations of neuro-ophthalmic disease. EVIDENCE ACQUISITION Electronic searches of published literature were conducted using PubMed and Google Scholar. A comprehensive search of the following terms was conducted within the Journal of Neuro-Ophthalmology: AI, artificial intelligence, machine learning, deep learning, natural language processing, computer vision, large language models, and generative AI. RESULTS This review aims to provide a comprehensive overview of the evolving landscape of AI applications in neuro-ophthalmology. It delves into the diverse applications of AI, from the analysis of optical coherence tomography (OCT) and fundus photography to the development of predictive models for disease progression. Additionally, the review explores the integration of generative AI into neuro-ophthalmic education and clinical practice. CONCLUSIONS We review the current state of AI in neuro-ophthalmology and its potentially transformative impact. The inclusion of AI in neuro-ophthalmic practice and research not only holds promise for improving diagnostic accuracy but also opens avenues for novel therapeutic interventions. We emphasize its potential to improve access to scarce subspecialty resources while examining the current challenges associated with the integration of AI into clinical practice and research.
Collapse
Affiliation(s)
- Rachel C Kenney
- Departments of Neurology (RCK, AJ, SH, SG, SNG), Population Health (RCK), and Ophthalmology (SG), New York University Grossman School of Medicine, New York, New York; and Vilcek Institute of Graduate Biomedical Sciences (TR), New York University Grossman School of Medicine, New York, New York
Collapse
|
30
|
Wu JH, Nishida T, Liu TYA. Accuracy of large language models in answering ophthalmology board-style questions: A meta-analysis. Asia Pac J Ophthalmol (Phila) 2024; 13:100106. [PMID: 39374807 DOI: 10.1016/j.apjo.2024.100106] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/04/2024] [Revised: 09/20/2024] [Accepted: 09/26/2024] [Indexed: 10/09/2024] Open
Abstract
PURPOSE To evaluate the accuracy of large language models (LLMs) in answering ophthalmology board-style questions. DESIGN Meta-analysis. METHODS Literature search was conducted using PubMed and Embase in March 2024. We included full-length articles and research letters published in English that reported the accuracy of LLMs in answering ophthalmology board-style questions. Data on LLM performance, including the number of questions submitted and correct responses generated, were extracted for each question set from individual studies. Pooled accuracy was calculated using a random-effects model. Subgroup analyses were performed based on the LLMs used and specific ophthalmology topics assessed. RESULTS Among the 14 studies retrieved, 13 (93 %) tested LLMs on multiple ophthalmology topics. ChatGPT-3.5, ChatGPT-4, Bard, and Bing Chat were assessed in 12 (86 %), 11 (79 %), 4 (29 %), and 4 (29 %) studies, respectively. The overall pooled accuracy of LLMs was 0.65 (95 % CI: 0.61-0.69). Among the different LLMs, ChatGPT-4 achieved the highest pooled accuracy at 0.74 (95 % CI: 0.73-0.79), while ChatGPT-3.5 recorded the lowest at 0.52 (95 % CI: 0.51-0.54). LLMs performed best in "pathology" (0.78 [95 % CI: 0.70-0.86]) and worst in "fundamentals and principles of ophthalmology" (0.52 [95 % CI: 0.48-0.56]). CONCLUSIONS The overall accuracy of LLMs in answering ophthalmology board-style questions was acceptable but not exceptional, with ChatGPT-4 and Bing Chat being top-performing models. Performance varied significantly based on specific ophthalmology topics tested. Inconsistent performances are of concern, highlighting the need for future studies to include ophthalmology board-style questions with images to more comprehensively examine the competency of LLMs.
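One common way to pool per-study accuracies under a random-effects model (not necessarily the estimator or software used in this meta-analysis) is a DerSimonian-Laird analysis on logit-transformed proportions, sketched below with made-up study counts.
```python
# Sketch: DerSimonian-Laird random-effects pooling of per-study accuracies on
# the logit scale. All counts are invented for illustration.
import numpy as np
from scipy.special import logit, expit
from scipy.stats import norm

# (correct answers, total questions) per hypothetical study
studies = [(150, 230), (88, 120), (199, 260), (61, 110), (140, 200)]
x = np.array([c for c, _ in studies], dtype=float)
n = np.array([t for _, t in studies], dtype=float)

y = logit(x / n)                 # per-study effect on the logit scale
v = 1.0 / x + 1.0 / (n - x)      # approximate within-study variances

# Heterogeneity estimate (DerSimonian-Laird tau^2)
w = 1.0 / v
y_fixed = np.sum(w * y) / np.sum(w)
Q = np.sum(w * (y - y_fixed) ** 2)
C = np.sum(w) - np.sum(w ** 2) / np.sum(w)
tau2 = max(0.0, (Q - (len(y) - 1)) / C)

# Random-effects pooled estimate and 95% CI, back-transformed to a proportion
w_re = 1.0 / (v + tau2)
pooled = np.sum(w_re * y) / np.sum(w_re)
se = np.sqrt(1.0 / np.sum(w_re))
z = norm.ppf(0.975)
low, high = expit(pooled - z * se), expit(pooled + z * se)
print(f"Pooled accuracy {expit(pooled):.3f} (95% CI {low:.3f}-{high:.3f}), tau^2 = {tau2:.3f}")
```
The logit transform keeps the pooled estimate and its confidence limits inside (0, 1) after back-transformation with the inverse logit (expit).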
Collapse
Affiliation(s)
- Jo-Hsuan Wu
- Edward S. Harkness Eye Institute, Department of Ophthalmology, Columbia University Irving Medical Center, New York, NY 10032, USA; Shiley Eye Institute and Viterbi Family Department of Ophthalmology, University of California, San Diego, CA 92093, USA
| | - Takashi Nishida
- Shiley Eye Institute and Viterbi Family Department of Ophthalmology, University of California, San Diego, CA 92093, USA
| | - T Y Alvin Liu
- Retina Division, Wilmer Eye Institute, Johns Hopkins University, Baltimore, MD 21287, USA.
| |
Collapse
|
31
|
Gill GS, Tsai J, Moxam J, Sanghvi HA, Gupta S. Comparison of Gemini Advanced and ChatGPT 4.0's Performances on the Ophthalmology Resident Ophthalmic Knowledge Assessment Program (OKAP) Examination Review Question Banks. Cureus 2024; 16:e69612. [PMID: 39421095 PMCID: PMC11486483 DOI: 10.7759/cureus.69612] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 09/17/2024] [Indexed: 10/19/2024] Open
Abstract
Background With advancements in natural language processing, tools such as Chat Generative Pre-Trained Transformer (ChatGPT) version 4.0 and Google's Gemini Advanced are being increasingly evaluated for their potential in various medical applications. The objective of this study was to systematically assess the performance of these large language models (LLMs) on both image and non-image-based questions within the specialized field of Ophthalmology. We used a review question bank for the Ophthalmic Knowledge Assessment Program (OKAP), used by ophthalmology residents nationally to prepare for the Ophthalmology Board Exam, to assess the accuracy and performance of ChatGPT and Gemini Advanced. Methodology A total of 260 randomly generated multiple-choice questions from the OphthoQuestions question bank were run through ChatGPT and Gemini Advanced. A simulated 260-question OKAP examination was created at random from the bank. Question-specific data were analyzed, including overall percent correct, subspecialty accuracy, whether the question was "high yield," difficulty (1-4), and question type (e.g., image, text). To compare the performance of ChatGPT-4.0 and Gemini Advanced across question difficulty, we utilized the standard deviation of user answer choices to determine question difficulty. In this study, a statistical analysis was conducted in Google Sheets using two-tailed t-tests with unequal variance to compare the performance of ChatGPT-4.0 and Google's Gemini Advanced across various question types, subspecialties, and difficulty levels. Results In total, 259 of the 260 questions were included in the study, as one question used a video that no form of ChatGPT could interpret as of May 1, 2024. For text-only questions, ChatGPT-4.0 correctly answered 57.14% (148/259, p < 0.018), and Gemini Advanced correctly answered 46.72% (121/259, p < 0.018). Both versions answered most questions without a secondary prompt and would have received a below-average score on the OKAP. Moreover, 27 questions required a secondary prompt in ChatGPT-4.0 compared with 67 questions in Gemini Advanced. ChatGPT-4.0 scored 68.99% on easier questions (<2 on a scale from 1-4) and 44.96% on harder questions (>2 on a scale from 1-4). Gemini Advanced, on the other hand, scored 49.61% on easier questions (<2 on a scale from 1-4) and 44.19% on harder questions (>2 on a scale from 1-4). There was a statistically significant difference in accuracy between ChatGPT-4.0 and Gemini Advanced for easy (p < 0.0015) but not for hard (p < 0.55) questions. For image-only questions, ChatGPT-4.0 correctly answered 39.58% (19/48, p < 0.013), and Gemini Advanced correctly answered 33.33% (16/48, p < 0.022), a statistically insignificant difference between the two models (p < 0.530). A comparison between text-only and image-based questions demonstrated a statistically significant difference in accuracy for both ChatGPT-4.0 (p < 0.013) and Gemini Advanced (p < 0.022). Conclusions This study provides evidence that ChatGPT-4.0 performs better than Gemini Advanced on OKAP-style ophthalmic multiple-choice questions. This may indicate an opportunity for greater use of ChatGPT in ophthalmic medical education. While showing promise within medical education, caution should be used, as a more detailed evaluation of reliability is needed.
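The unequal-variance (Welch's) two-tailed t-test described above can be sketched as follows on per-question correctness indicators; the 0/1 vectors below are simulated placeholders, not the study's data.
```python
# Sketch: Welch's t-test (unequal variance) comparing two models' per-question
# correctness. The correctness vectors are random placeholders.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
chatgpt_correct = rng.binomial(1, 0.57, size=259)  # 1 = answered correctly
gemini_correct = rng.binomial(1, 0.47, size=259)

t_stat, p_value = ttest_ind(chatgpt_correct, gemini_correct, equal_var=False)
print(f"Welch's t = {t_stat:.2f}, two-tailed p = {p_value:.4f}")
```
For binary correctness data, a two-proportion z-test or a chi-square test would be a common alternative to the t-test, though the study reports using the latter approach shown here.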
Collapse
Affiliation(s)
- Gurnoor S Gill
- Medical School, Florida Atlantic University Charles E. Schmidt College of Medicine, Boca Raton, USA
| | - Joby Tsai
- Ophthalmology, Broward Health, Fort Lauderdale, USA
| | - Jillene Moxam
- School of Medicine, University of Florida, Gainesville, USA
- Department of Technology and Clinical Trials, Advanced Research, Deerfield Beach, USA
| | - Harshal A Sanghvi
- Department of Biomedical Sciences, Florida Atlantic University, Boca Raton, USA
- Department of Technology and Clinical Trials, Advanced Research, Deerfield Beach, USA
Collapse
|
32
|
Al-Naser Y, Halka F, Ng B, Mountford D, Sharma S, Niure K, Yong-Hing C, Khosa F, Van der Pol C. Evaluating Artificial Intelligence Competency in Education: Performance of ChatGPT-4 in the American Registry of Radiologic Technologists (ARRT) Radiography Certification Exam. Acad Radiol 2024:S1076-6332(24)00572-5. [PMID: 39153961 DOI: 10.1016/j.acra.2024.08.009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/30/2024] [Revised: 07/12/2024] [Accepted: 08/06/2024] [Indexed: 08/19/2024]
Abstract
RATIONALE AND OBJECTIVES The American Registry of Radiologic Technologists (ARRT) leads the certification process with an exam comprising 200 multiple-choice questions. This study aims to evaluate ChatGPT-4's performance in responding to practice questions similar to those found in the ARRT board examination. MATERIALS AND METHODS We used a dataset of 200 practice multiple-choice questions for the ARRT certification exam from BoardVitals. Each question was fed to ChatGPT-4 fifteen times, resulting in 3000 observations to account for response variability. RESULTS ChatGPT's overall performance was 80.56%, with higher accuracy on text-based questions (86.3%) compared with image-based questions (45.6%). Response times were longer for image-based questions (18.01 s) than for text-based questions (13.27 s). Performance varied by domain: 72.6% for Safety, 70.6% for Image Production, 67.3% for Patient Care, and 53.4% for Procedures. As anticipated, performance was best on easy questions (78.5%). CONCLUSION ChatGPT demonstrated effective performance on the BoardVitals question bank for ARRT certification. Future studies could benefit from analyzing the correlation between BoardVitals scores and actual exam outcomes. Further development in AI, particularly in image processing and interpretation, is necessary to enhance its utility in educational settings.
Collapse
Affiliation(s)
- Yousif Al-Naser
- Medical Radiation Sciences, McMaster University, Hamilton, ON, Canada; Department of Diagnostic Imaging, Trillium Health Partners, Mississauga, ON, Canada.
| | - Felobater Halka
- Department of Pathology and Laboratory Medicine, Schulich School of Medicine & Dentistry, Western University, Canada
| | - Boris Ng
- Department of Mechanical and Industrial Engineering, University of Toronto, ON, Canada
| | - Dwight Mountford
- Medical Radiation Sciences, McMaster University, Hamilton, ON, Canada
| | - Sonali Sharma
- Department of Radiology, Faculty of Medicine, University of British Columbia, Vancouver, British Columbia, Canada
| | - Ken Niure
- Department of Diagnostic Imaging, Trillium Health Partners, Mississauga, ON, Canada
| | - Charlotte Yong-Hing
- Department of Radiology, Faculty of Medicine, University of British Columbia, Vancouver, British Columbia, Canada
| | - Faisal Khosa
- Department of Radiology, Faculty of Medicine, University of British Columbia, Vancouver, British Columbia, Canada
| | - Christian Van der Pol
- Department of Diagnostic Imaging, Juravinski Hospital and Cancer Centre, Hamilton Health Sciences, McMaster University, Hamilton, Ontario, Canada
| |
Collapse
|
33
|
Yurtcu E, Ozvural S, Keyif B. Analyzing the performance of ChatGPT in answering inquiries about cervical cancer. Int J Gynaecol Obstet 2024. [PMID: 39148482 DOI: 10.1002/ijgo.15861] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/29/2023] [Revised: 07/04/2024] [Accepted: 08/05/2024] [Indexed: 08/17/2024]
Abstract
OBJECTIVE To analyze the knowledge of ChatGPT about cervical cancer (CC). METHODS Official websites of professional health institutes and websites created by patients and charities underwent strict screening. Using CC-related keywords, common inquiries by the public and comments about CC were searched in social media applications. With these data, a list of frequently asked questions (FAQs) was prepared. When preparing questions about CC, the European Society of Gynecological Oncology (ESGO), European Society for Radiotherapy and Oncology (ESTRO), and European Society of Pathology (ESP) guidelines were used. The answers given by ChatGPT were scored according to the Global Quality Score (GQS). RESULTS When all ChatGPT answers to FAQs about CC were evaluated with regard to the GQS, 68 ChatGPT answers were classified as score 5, and none of the ChatGPT answers to FAQs were scored as 2 or 1. Moreover, ChatGPT answered 33 of 53 (62.3%) CC-related questions based on the ESGO, ESTRO, and ESP guidelines with completely accurate and satisfactory responses (GQS 5). In addition, eight answers (15.1%), seven answers (13.2%), four answers (7.5%), and one answer (1.9%) were categorized as GQS 4, GQS 3, GQS 2, and GQS 1, respectively. The reproducibility rates of ChatGPT answers to CC-related FAQs and of responses to the guideline-based questions were 93.2% and 88.7%, respectively. CONCLUSION ChatGPT had an accurate and satisfactory response rate for FAQs about CC with regard to the GQS. However, the accuracy and quality of ChatGPT answers decreased significantly for questions based on guidelines.
Collapse
Affiliation(s)
- Engin Yurtcu
- Department of Obstetrics and Gynecology, Faculty of Medicine, Duzce University, Duzce, Türkiye
| | - Seyfettin Ozvural
- Department of Obstetrics and Gynecology, Acıbadem Hospital, Biruni University, Istanbul, Türkiye
| | - Betul Keyif
- Department of Obstetrics and Gynecology, Faculty of Medicine, Duzce University, Duzce, Türkiye
| |
Collapse
|
34
|
Deng J, Qin Y. Current Status, Hotspots, and Prospects of Artificial Intelligence in Ophthalmology: A Bibliometric Analysis (2003-2023). Ophthalmic Epidemiol 2024:1-14. [PMID: 39146462 DOI: 10.1080/09286586.2024.2373956] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/16/2024] [Revised: 06/01/2024] [Accepted: 06/18/2024] [Indexed: 08/17/2024]
Abstract
PURPOSE Artificial intelligence (AI) has gained significant attention in ophthalmology. This paper reviews, classifies, and summarizes the research literature in this field and aims to provide readers with a detailed understanding of the current status and future directions, laying a solid foundation for further research and decision-making. METHODS Literature was retrieved from the Web of Science database. Bibliometric analysis was performed using VOSviewer, CiteSpace, and the R package Bibliometrix. RESULTS The study included 3,377 publications from 4,035 institutions in 98 countries. China and the United States had the most publications. Sun Yat-sen University is a leading institution. "Translational Vision Science & Technology" published the most articles, while "Ophthalmology" had the most co-citations. Among 13,145 researchers, Ting DSW had the most publications and citations. Keywords included "Deep learning," "Diabetic retinopathy," "Machine learning," and others. CONCLUSION The study highlights the promising prospects of AI in ophthalmology. Automated eye disease screening, particularly its core technology of retinal image segmentation and recognition, has become a research hotspot. AI is also expanding into complex areas such as surgical assistance and predictive models. Multimodal AI, Generative Adversarial Networks, and ChatGPT have driven further technological innovation. However, implementing AI in ophthalmology also faces many challenges, including technical, regulatory, and ethical issues. As these challenges are overcome, we anticipate more innovative applications, paving the way for more effective and safer eye disease treatments.
Collapse
Affiliation(s)
- Jie Deng
- First Clinical College of Traditional Chinese Medicine, Hunan University of Chinese Medicine, Changsha, Hunan, China
- Graduate School, Hunan University of Chinese Medicine, Changsha, Hunan, China
| | - YuHui Qin
- First Clinical College of Traditional Chinese Medicine, Hunan University of Chinese Medicine, Changsha, Hunan, China
- Graduate School, Hunan University of Chinese Medicine, Changsha, Hunan, China
| |
Collapse
|
35
|
Sadeq MA, Ghorab RMF, Ashry MH, Abozaid AM, Banihani HA, Salem M, Aisheh MTA, Abuzahra S, Mourid MR, Assker MM, Ayyad M, Moawad MHED. AI chatbots show promise but limitations on UK medical exam questions: a comparative performance study. Sci Rep 2024; 14:18859. [PMID: 39143077 PMCID: PMC11324724 DOI: 10.1038/s41598-024-68996-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/11/2023] [Accepted: 07/30/2024] [Indexed: 08/16/2024] Open
Abstract
Large language models (LLMs) like ChatGPT have potential applications in medical education, such as helping students study for their licensing exams by discussing unclear questions with them. However, they require evaluation on these complex tasks. The purpose of this study was to evaluate how well publicly accessible LLMs performed on simulated UK medical board exam questions. 423 board-style questions from 9 UK exams (MRCS, MRCP, etc.) were answered by seven LLMs (ChatGPT-3.5, ChatGPT-4, Bard, Perplexity, Claude, Bing, Claude Instant). There were 406 multiple-choice, 13 true/false, and 4 "choose N" questions covering topics in surgery, pediatrics, and other disciplines. The accuracy of the output was graded. Statistics were used to analyze differences among LLMs. Leaked questions were excluded from the primary analysis. ChatGPT-4.0 scored 78.2%, Bing 67.2%, Claude 64.4%, and Claude Instant 62.9%. Perplexity scored the lowest (56.1%). Scores differed significantly between LLMs overall (p < 0.001) and in pairwise comparisons. All LLMs scored higher on multiple-choice than on true/false or "choose N" questions. LLMs demonstrated limitations in answering certain questions, indicating that refinements are needed before they can be relied on as a primary resource in medical education. However, their expanding capabilities suggest a potential to improve training if thoughtfully implemented. Further research should explore specialty-specific LLMs and their optimal integration into medical curricula.
Collapse
Affiliation(s)
- Mohammed Ahmed Sadeq
- Misr University for Science and Technology, 6th of October, Egypt.
- Medical Research Platform (MRP), Giza, Egypt.
- Emergency Medicine Department, Elsheikh Zayed Specialized Hospital, Elsheikh Zayed City, Egypt.
| | - Reem Mohamed Farouk Ghorab
- Misr University for Science and Technology, 6th of October, Egypt
- Medical Research Platform (MRP), Giza, Egypt
- Emergency Medicine Department, Elsheikh Zayed Specialized Hospital, Elsheikh Zayed City, Egypt
| | - Mohamed Hady Ashry
- Medical Research Platform (MRP), Giza, Egypt
- School of Medicine, New Giza University (NGU), Giza, Egypt
| | - Ahmed Mohamed Abozaid
- Medical Research Platform (MRP), Giza, Egypt
- Faculty of Medicine, Tanta University, Tanta, Egypt
| | - Haneen A Banihani
- Medical Research Platform (MRP), Giza, Egypt
- Faculty of Medicine, University of Jordan, Amman, Jordan
| | - Moustafa Salem
- Medical Research Platform (MRP), Giza, Egypt
- Faculty of Medicine, Mansoura University, Mansoura, Egypt
| | - Mohammed Tawfiq Abu Aisheh
- Medical Research Platform (MRP), Giza, Egypt
- Department of Medicine, College of Medicine and Health Sciences, An-Najah National University, Nablus, 44839, Palestine
| | - Saad Abuzahra
- Medical Research Platform (MRP), Giza, Egypt
- Department of Medicine, College of Medicine and Health Sciences, An-Najah National University, Nablus, 44839, Palestine
| | - Marina Ramzy Mourid
- Medical Research Platform (MRP), Giza, Egypt
- Faculty of Medicine, Alexandria University, Alexandria, Egypt
| | - Mohamad Monif Assker
- Medical Research Platform (MRP), Giza, Egypt
- Sheikh Khalifa Medical City, Abu Dhabi, UAE
| | - Mohammed Ayyad
- Medical Research Platform (MRP), Giza, Egypt
- Faculty of Medicine, Al-Quds University, Jerusalem, Palestine
| | - Mostafa Hossam El Din Moawad
- Medical Research Platform (MRP), Giza, Egypt
- Faculty of Pharmacy Clinical Department, Alexandria University, Alexandria, Egypt
- Faculty of Medicine, Suez Canal University, Ismailia, Egypt
| |
Collapse
|
36
|
Ming S, Guo Q, Cheng W, Lei B. Influence of Model Evolution and System Roles on ChatGPT's Performance in Chinese Medical Licensing Exams: Comparative Study. JMIR MEDICAL EDUCATION 2024; 10:e52784. [PMID: 39140269 PMCID: PMC11336778 DOI: 10.2196/52784] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/15/2023] [Revised: 05/20/2024] [Accepted: 06/20/2024] [Indexed: 08/15/2024]
Abstract
Background With the increasing application of large language models like ChatGPT in various industries, their potential in the medical domain, especially in standardized examinations, has become a focal point of research. Objective The aim of this study is to assess the clinical performance of ChatGPT, focusing on its accuracy and reliability in the Chinese National Medical Licensing Examination (CNMLE). Methods The CNMLE 2022 question set, consisting of 500 single-answer multiple-choice questions, was reclassified into 15 medical subspecialties. Each question was tested 8 to 12 times in Chinese on the OpenAI platform from April 24 to May 15, 2023. Three key factors were considered: the model version (GPT-3.5 or GPT-4.0), the prompt's designation of system roles tailored to medical subspecialties, and repetition for coherence. A passing accuracy threshold was established as 60%. χ2 tests and κ values were employed to evaluate the model's accuracy and consistency. Results GPT-4.0 achieved a passing accuracy of 72.7%, which was significantly higher than that of GPT-3.5 (54%; P<.001). The variability rate of repeated responses from GPT-4.0 was lower than that of GPT-3.5 (9% vs 19.5%; P<.001). However, both models showed relatively good response coherence, with κ values of 0.778 and 0.610, respectively. System roles numerically increased accuracy for both GPT-4.0 (0.3%-3.7%) and GPT-3.5 (1.3%-4.5%) and reduced variability by 1.7% and 1.8%, respectively (P>.05). In subgroup analysis, ChatGPT achieved comparable accuracy among different question types (P>.05). GPT-4.0 surpassed the accuracy threshold in 14 of 15 subspecialties, while GPT-3.5 did so in 7 of 15 on the first response. Conclusions GPT-4.0 passed the CNMLE and outperformed GPT-3.5 in key areas such as accuracy, consistency, and medical subspecialty expertise. Adding a system role modestly enhanced the model's reliability and answer coherence, although the effect was not statistically significant. GPT-4.0 showed promising potential in medical education and clinical practice, meriting further study.
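The two reliability analyses described above (κ for agreement across repeated runs, χ2 for comparing correct-answer counts between model versions) can be sketched as follows. The answer vectors are invented, and the contingency counts are only chosen to roughly match the 72.7% and 54% accuracies reported above.
```python
# Sketch: Cohen's kappa for run-to-run agreement, and a chi-square test
# comparing correct/incorrect counts between two model versions. Toy data only.
from sklearn.metrics import cohen_kappa_score
from scipy.stats import chi2_contingency

# Answer letters from two repeated runs over the same questions
run_1 = ["A", "C", "B", "D", "A", "B", "C", "C"]
run_2 = ["A", "C", "B", "A", "A", "B", "C", "D"]
print("kappa:", round(cohen_kappa_score(run_1, run_2), 3))

# Correct vs incorrect counts out of 500 questions for each model version
#                 correct  incorrect
table = [[364, 136],       # GPT-4.0 (~72.7%)
         [270, 230]]       # GPT-3.5 (54%)
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4g}")
```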
Collapse
Affiliation(s)
- Shuai Ming
- Department of Ophthalmology, Henan Eye Hospital, Henan Provincial People’s Hospital, Zhengzhou, China
- Eye Institute, Henan Academy of Innovations in Medical Science, Zhengzhou, China
- Henan Clinical Research Center for Ocular Diseases, People’s Hospital of Zhengzhou University, Zhengzhou, China
| | - Qingge Guo
- Department of Ophthalmology, Henan Eye Hospital, Henan Provincial People’s Hospital, Zhengzhou, China
- Eye Institute, Henan Academy of Innovations in Medical Science, Zhengzhou, China
- Henan Clinical Research Center for Ocular Diseases, People’s Hospital of Zhengzhou University, Zhengzhou, China
| | - Wenjun Cheng
- Department of Ophthalmology, People’s Hospital of Zhengzhou University, Zhengzhou, China
| | - Bo Lei
- Department of Ophthalmology, Henan Eye Hospital, Henan Provincial People’s Hospital, Zhengzhou, China
- Eye Institute, Henan Academy of Innovations in Medical Science, Zhengzhou, China
- Henan Clinical Research Center for Ocular Diseases, People’s Hospital of Zhengzhou University, Zhengzhou, China
| |
Collapse
|
37
|
Alqudah AA, Aleshawi AJ, Baker M, Alnajjar Z, Ayasrah I, Ta’ani Y, Al Salkhadi M, Aljawarneh S. Evaluating accuracy and reproducibility of ChatGPT responses to patient-based questions in Ophthalmology: An observational study. Medicine (Baltimore) 2024; 103:e39120. [PMID: 39121263 PMCID: PMC11315477 DOI: 10.1097/md.0000000000039120] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 02/23/2024] [Accepted: 07/08/2024] [Indexed: 08/11/2024] Open
Abstract
Chat Generative Pre-Trained Transformer (ChatGPT) is an online large language model that appears to be a popular source of health information, as it can provide patients with answers in the form of human-like text, although the accuracy and safety of its responses have not been established. This study aimed to evaluate the accuracy and reproducibility of ChatGPT responses to patient-based questions in ophthalmology. We collected 150 questions from the "Ask an ophthalmologist" page of the American Academy of Ophthalmology, which were reviewed and refined for eligibility by two ophthalmologists. Each question was input into ChatGPT twice using the "new chat" option. The grading scale was as follows: (1) comprehensive, (2) correct but inadequate, (3) some correct and some incorrect, and (4) completely incorrect. In total, 117 questions were input into ChatGPT, which provided "comprehensive" responses to 70/117 (59.8%) of questions. Reproducibility was defined as no difference in grading categories (1 and 2 vs 3 and 4) between the two responses to each question; by this criterion, ChatGPT provided reproducible responses to 91.5% of questions. This study shows moderate accuracy and reproducibility of ChatGPT responses to patients' questions in ophthalmology. After further refinement, ChatGPT may serve as a supplementary source of health information, used as an adjunct to, but not a substitute for, medical advice. The reliability of ChatGPT warrants further investigation.
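The reproducibility criterion described above (two responses per question, counted as reproducible when both graded responses fall on the same side of the 1-2 vs 3-4 split) can be expressed compactly in code. The sketch below is a hypothetical illustration assuming the grades are stored as paired integers; it is not the authors' analysis script, and the example pairs are invented.

```python
# Hypothetical sketch of the reproducibility rule described in the abstract:
# a question is "reproducible" when both of its two graded responses fall in
# the same coarse category (grades 1-2 = acceptable, grades 3-4 = not).
from typing import List, Tuple

def is_reproducible(grade_1: int, grade_2: int) -> bool:
    """Both grades on the same side of the 1-2 vs 3-4 split."""
    return (grade_1 <= 2) == (grade_2 <= 2)

def reproducibility_rate(paired_grades: List[Tuple[int, int]]) -> float:
    reproducible = sum(is_reproducible(g1, g2) for g1, g2 in paired_grades)
    return reproducible / len(paired_grades)

# Invented example: 6 questions, each graded twice on the 1-4 scale.
pairs = [(1, 1), (2, 1), (3, 4), (1, 3), (2, 2), (4, 4)]
print(f"Reproducible: {reproducibility_rate(pairs):.1%}")  # 5/6, about 83.3%
```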
Collapse
Affiliation(s)
- Asem A. Alqudah
- Faculty of Medicine, Jordan University of Science and Technology (JUST), Irbid, Jordan
| | | | - Mohammed Baker
- Faculty of Medicine, Jordan University of Science and Technology (JUST), Irbid, Jordan
| | - Zaina Alnajjar
- Faculty of Medicine, Hashemite University, Zarqa, Jordan
| | - Ibrahim Ayasrah
- Faculty of Medicine, Jordan University of Science and Technology (JUST), Irbid, Jordan
| | - Yaqoot Ta’ani
- Faculty of Medicine, Jordan University of Science and Technology (JUST), Irbid, Jordan
| | - Mohammad Al Salkhadi
- Faculty of Medicine, Jordan University of Science and Technology (JUST), Irbid, Jordan
| | - Shaima’a Aljawarneh
- Faculty of Medicine, Jordan University of Science and Technology (JUST), Irbid, Jordan
| |
Collapse
|
38
|
Wang Y, Chen Y, Sheng J. Assessing ChatGPT as a Medical Consultation Assistant for Chronic Hepatitis B: Cross-Language Study of English and Chinese. JMIR Med Inform 2024; 12:e56426. [PMID: 39115930 PMCID: PMC11342014 DOI: 10.2196/56426] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/16/2024] [Revised: 05/24/2024] [Accepted: 07/21/2024] [Indexed: 08/10/2024] Open
Abstract
BACKGROUND Chronic hepatitis B (CHB) imposes substantial economic and social burdens globally. The management of CHB involves intricate monitoring and adherence challenges, particularly in regions like China, where a high prevalence of CHB intersects with health care resource limitations. This study explores the potential of ChatGPT-3.5, an emerging artificial intelligence (AI) assistant, to address these complexities. With notable capabilities in medical education and practice, ChatGPT-3.5's role is examined in managing CHB, particularly in regions with distinct health care landscapes. OBJECTIVE This study aimed to uncover insights into ChatGPT-3.5's potential and limitations in delivering personalized medical consultation assistance for CHB patients across diverse linguistic contexts. METHODS Questions sourced from published guidelines, online CHB communities, and search engines in English and Chinese were refined, translated, and compiled into 96 inquiries. Subsequently, these questions were presented to both ChatGPT-3.5 and ChatGPT-4.0 in independent dialogues. The responses were then evaluated by senior physicians, focusing on informativeness, emotional management, consistency across repeated inquiries, and cautionary statements regarding medical advice. Additionally, a true-or-false questionnaire was employed to further discern the variance in information accuracy for closed questions between ChatGPT-3.5 and ChatGPT-4.0. RESULTS Over half of the responses (228/370, 61.6%) from ChatGPT-3.5 were considered comprehensive. In contrast, ChatGPT-4.0 exhibited a higher percentage at 74.5% (172/222; P<.001). Notably, superior performance was evident in English, particularly in terms of informativeness and consistency across repeated queries. However, deficiencies were identified in emotional management guidance, with only 3.2% (6/186) in ChatGPT-3.5 and 8.1% (15/154) in ChatGPT-4.0 (P=.04). ChatGPT-3.5 included a disclaimer in 10.8% (24/222) of responses, while ChatGPT-4.0 included a disclaimer in 13.1% (29/222) of responses (P=.46). When responding to true-or-false questions, ChatGPT-4.0 achieved an accuracy rate of 93.3% (168/180), significantly surpassing ChatGPT-3.5's accuracy rate of 65.0% (117/180) (P<.001). CONCLUSIONS In this study, ChatGPT demonstrated basic capabilities as a medical consultation assistant for CHB management. The choice of working language for ChatGPT-3.5 was considered a potential factor influencing its performance, particularly in the use of terminology and colloquial language, and this potentially affects its applicability within specific target populations. However, as an updated model, ChatGPT-4.0 exhibits improved information processing capabilities, overcoming the language impact on information accuracy. This suggests that the implications of model advancement on applications need to be considered when selecting large language models as medical consultation assistants. Given that both models performed inadequately in emotional guidance management, this study highlights the importance of providing specific language training and emotional management strategies when deploying ChatGPT for medical purposes. Furthermore, the tendency of these models to use disclaimers in conversations should be further investigated to understand the impact on patients' experiences in practical applications.
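Several of the comparisons reported above are contrasts between two proportions (for example, true-or-false accuracy for ChatGPT-4.0 versus ChatGPT-3.5). The hedged sketch below shows one generic way such a contrast might be tested in Python; the counts are taken from the abstract only to make the example concrete, and this is not the study's actual analysis code (statsmodels is assumed to be installed).

```python
# Two-proportion z-test comparing accuracy rates, illustrated with the
# true-or-false counts quoted in the abstract (168/180 vs 117/180).
# A sketch, not the authors' analysis code; requires statsmodels.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

successes = np.array([168, 117])   # correct answers: ChatGPT-4.0, ChatGPT-3.5
trials = np.array([180, 180])      # questions answered by each model

z_stat, p_value = proportions_ztest(successes, trials)
print(f"z = {z_stat:.2f}, p = {p_value:.2e}")
print(f"accuracy: {successes / trials}")
```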
Collapse
Affiliation(s)
- Yijie Wang
- State Key Laboratory for Diagnosis and Treatment of Infectious Diseases, Collaborative Innovation Center for Diagnosis and Treatment of Infectious Disease, The First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
| | - Yining Chen
- Department of Urology, Sir Run Run Shaw Hospital, Zhejiang University School of Medicine, Hangzhou, China
| | - Jifang Sheng
- State Key Laboratory for Diagnosis and Treatment of Infectious Diseases, Collaborative Innovation Center for Diagnosis and Treatment of Infectious Disease, The First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
| |
Collapse
|
39
|
Sawamura S, Kohiyama K, Takenaka T, Sera T, Inoue T, Nagai T. Performance of ChatGPT 4.0 on Japan's National Physical Therapist Examination: A Comprehensive Analysis of Text and Visual Question Handling. Cureus 2024; 16:e67347. [PMID: 39310431 PMCID: PMC11413471 DOI: 10.7759/cureus.67347] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 08/20/2024] [Indexed: 09/25/2024] Open
Abstract
INTRODUCTION ChatGPT 4.0, a large language model (LLM) developed by OpenAI, has demonstrated the capability to pass Japan's national medical examination and other medical assessments. However, the impact of imaging-based questions and of different question types on its performance has not been thoroughly examined. This study evaluated ChatGPT 4.0's performance on Japan's national examination for physical therapists, particularly its ability to handle complex questions involving images and tables. The study also assessed the model's potential in the field of rehabilitation and its performance with Japanese-language inputs. METHODS The evaluation used 1,000 questions from the 54th to 58th national exams for physical therapists in Japan, comprising 160 general questions and 40 practical questions per exam. All questions were input in Japanese, together with any accompanying information such as images or tables. The answers generated by ChatGPT were then compared with the official correct answers. ANALYSIS Accuracy rates were compared across several criteria: general versus practical questions (Fisher's exact test); A-type (single correct answer) versus X2-type (two correct answers) questions; text-only questions versus questions with images and tables; and questions of different lengths (Student's t-test). RESULTS ChatGPT 4.0 met the passing criteria with an overall accuracy of 73.4%. The accuracy rates for general and practical questions were 80.1% and 46.6%, respectively. No significant difference was found between the accuracy rates for A-type (74.3%) and X2-type (67.4%) questions. However, a significant difference was observed between the accuracy rates for text-only questions (80.5%) and questions with images and tables (35.4%). DISCUSSION The results indicate that ChatGPT 4.0 satisfies the passing criteria for the national exam and demonstrates adequate knowledge and application skills. However, its performance on practical questions and on those with images and tables is lower, indicating areas for improvement. The effective handling of Japanese inputs suggests potential use in non-English-speaking regions. CONCLUSION ChatGPT 4.0 can pass the national examination for physical therapists, particularly on text-based questions. However, improvements are needed for specialized practical questions and those involving images and tables. The model shows promise for supporting clinical rehabilitation and medical education in Japanese-speaking contexts, though further enhancements are required for comprehensive application.
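The analysis above contrasts accuracy across question categories (Fisher's exact test) and compares question lengths by correctness (Student's t-test). The sketch below illustrates both tests in Python on invented data; it is only a hedged approximation of the kind of analysis described, not the study's code, and all counts and lengths are placeholders.

```python
# Illustration of the two tests named in the abstract, on invented data:
# Fisher's exact test on correct/incorrect counts by question category, and
# Student's t-test comparing question lengths for correct vs incorrect answers.
import numpy as np
from scipy.stats import fisher_exact, ttest_ind

# Invented (correct, incorrect) counts for general vs practical questions.
general = (641, 159)      # e.g. 800 general questions
practical = (93, 107)     # e.g. 200 practical questions
table = [list(general), list(practical)]
_, p_category = fisher_exact(table)
print(f"Fisher's exact test, general vs practical: p = {p_category:.3g}")

# Invented question lengths (characters), split by whether the answer was correct.
rng = np.random.default_rng(1)
lengths_correct = rng.normal(loc=120, scale=30, size=734)
lengths_incorrect = rng.normal(loc=135, scale=30, size=266)
t_stat, p_length = ttest_ind(lengths_correct, lengths_incorrect, equal_var=True)
print(f"Student's t-test on question length: t = {t_stat:.2f}, p = {p_length:.3g}")
```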
Collapse
Affiliation(s)
- Shogo Sawamura
- Department of Rehabilitation, Heisei College of Health Sciences, Gifu, JPN
| | - Kengo Kohiyama
- Department of Rehabilitation, Heisei College of Health Sciences, Gifu, JPN
| | - Takahiro Takenaka
- Department of Rehabilitation, Heisei College of Health Sciences, Gifu, JPN
| | - Tatsuya Sera
- Department of Rehabilitation, Heisei College of Health Sciences, Gifu, JPN
| | - Tadatoshi Inoue
- Department of Rehabilitation, Heisei College of Health Sciences, Gifu, JPN
| | - Takashi Nagai
- Department of Rehabilitation, Heisei College of Health Sciences, Gifu, JPN
| |
Collapse
|
40
|
Moulaei K, Yadegari A, Baharestani M, Farzanbakhsh S, Sabet B, Reza Afrash M. Generative artificial intelligence in healthcare: A scoping review on benefits, challenges and applications. Int J Med Inform 2024; 188:105474. [PMID: 38733640 DOI: 10.1016/j.ijmedinf.2024.105474] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/06/2024] [Revised: 05/03/2024] [Accepted: 05/04/2024] [Indexed: 05/13/2024]
Abstract
BACKGROUND Generative artificial intelligence (GAI) is revolutionizing healthcare with solutions for complex challenges, enhancing diagnosis, treatment, and care through new data and insights. However, its integration raises questions about applications, benefits, and challenges. Our study explores these aspects, offering an overview of GAI's applications and future prospects in healthcare. METHODS This scoping review searched Web of Science, PubMed, and Scopus. The selection of studies involved screening titles, reviewing abstracts, and examining full texts, adhering to the PRISMA-ScR guidelines throughout the process. RESULTS From 1406 articles across three databases, 109 met inclusion criteria after screening and deduplication. Nine GAI models were utilized in healthcare, with ChatGPT (n = 102, 74%), Google Bard (Gemini) (n = 16, 11%), and Microsoft Bing AI (n = 10, 7%) being the most frequently employed. A total of 24 different applications of GAI in healthcare were identified, with the most common being "offering insights and information on health conditions through answering questions" (n = 41) and "diagnosis and prediction of diseases" (n = 17). In total, 606 benefits and challenges were identified, which were condensed to 48 benefits and 61 challenges after consolidation. The predominant benefits included "Providing rapid access to information and valuable insights" and "Improving prediction and diagnosis accuracy", while the primary challenges comprised "generating inaccurate or fictional content", "unknown source of information and fake references for texts", and "lower accuracy in answering questions". CONCLUSION This scoping review identified the applications, benefits, and challenges of GAI in healthcare. This synthesis offers a crucial overview of GAI's potential to revolutionize healthcare, emphasizing the imperative to address its limitations.
Collapse
Affiliation(s)
- Khadijeh Moulaei
- Department of Health Information Technology, School of Paramedical, Ilam University of Medical Sciences, Ilam, Iran
| | - Atiye Yadegari
- Department of Pediatric Dentistry, School of Dentistry, Hamadan University of Medical Sciences, Hamadan, Iran
| | - Mahdi Baharestani
- Network of Interdisciplinarity in Neonates and Infants (NINI), Universal Scientific Education and Research Network (USERN), Tehran, Iran
| | - Shayan Farzanbakhsh
- Network of Interdisciplinarity in Neonates and Infants (NINI), Universal Scientific Education and Research Network (USERN), Tehran, Iran
| | - Babak Sabet
- Department of Surgery, Faculty of Medicine, Shahid Beheshti University of Medical Sciences, Tehran, Iran
| | - Mohammad Reza Afrash
- Department of Artificial Intelligence, Smart University of Medical Sciences, Tehran, Iran.
| |
Collapse
|
41
|
Rodgers DL, Hernandez J, Ahmed RA. Response to Bhutiani, Hester, and Lonsdale. Simul Healthc 2024; 19:270. [PMID: 39073873 DOI: 10.1097/sih.0000000000000817] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/30/2024]
|
42
|
Tan DNH, Tham YC, Koh V, Loon SC, Aquino MC, Lun K, Cheng CY, Ngiam KY, Tan M. Evaluating Chatbot responses to patient questions in the field of glaucoma. Front Med (Lausanne) 2024; 11:1359073. [PMID: 39050528 PMCID: PMC11267485 DOI: 10.3389/fmed.2024.1359073] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/20/2023] [Accepted: 06/20/2024] [Indexed: 07/27/2024] Open
Abstract
Objective The aim of this study was to evaluate the accuracy, comprehensiveness, and safety of a publicly available large language model (LLM), ChatGPT, in the subdomain of glaucoma. Design Evaluation of diagnostic test or technology. Subjects, Participants, and/or Controls We evaluated the responses of an artificial intelligence chatbot, ChatGPT (version GPT-3.5, OpenAI). Methods, Intervention, or Testing We curated 24 clinically relevant questions in the domain of glaucoma, spanning four categories: diagnosis, treatment, surgeries, and ocular emergencies. Each question was posed to the LLM, and the responses were graded by an expert panel of three glaucoma specialists with more than 30 years of combined experience in the field. For responses that performed poorly, the LLM was further prompted to self-correct, and the subsequent responses were re-evaluated by the expert panel. Main Outcome Measures Accuracy, comprehensiveness, and safety of the responses of a public-domain LLM. Results There were 24 questions and three expert graders, giving a total of n = 72 graded responses. Scores ranged from 1 to 4, where 4 represents the best score, indicating a complete and accurate response. The mean score across the expert panel was 3.29 with a standard deviation of 0.484. Of the 24 question-response pairs, seven (29.2%) had a mean inter-grader score of 3 or less; these were given a chance to self-correct. The mean score of these seven question-response pairs rose from 2.96 to 3.58 after the opportunity to self-correct (z-score = -3.27, p = 0.001, Mann-Whitney U test). After self-correction, the proportion of responses obtaining a full score increased from 22/72 (30.6%) to 12/21 (57.1%) (p = 0.026, χ2 test). Conclusion LLMs show great promise in the realm of glaucoma, with the additional capability of self-correction. The application of LLMs in glaucoma is still in its infancy and requires further research and validation.
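The improvement after self-correction reported above is tested with a Mann-Whitney U comparison of grader scores before and after the follow-up prompt. The sketch below shows how such a comparison might look in Python on invented 1-4 grader scores; it is illustrative only and is not the study's analysis.

```python
# Hypothetical Mann-Whitney U comparison of 1-4 grader scores for poorly
# performing question-response pairs before and after a self-correction prompt.
from scipy.stats import mannwhitneyu

# Invented scores from three graders for seven poorly performing responses
# (7 pairs x 3 graders = 21 scores per condition).
scores_before = [3, 3, 2, 3, 3, 2, 3, 3, 3, 3, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 3]
scores_after  = [4, 4, 3, 4, 3, 4, 4, 3, 4, 4, 4, 3, 4, 4, 3, 4, 4, 4, 3, 4, 4]

u_stat, p_value = mannwhitneyu(scores_before, scores_after, alternative="two-sided")
print(f"U = {u_stat:.1f}, p = {p_value:.4f}")
print(f"mean before = {sum(scores_before)/len(scores_before):.2f}, "
      f"mean after = {sum(scores_after)/len(scores_after):.2f}")
```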
Collapse
Affiliation(s)
| | - Yih-Chung Tham
- Centre of Innovation and Precision Eye Health, Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore, Singapore
- Singapore Eye Research Institute, Singapore National Eye Centre, Singapore, Singapore
- Eye Academic Clinical Program (Eye ACP), Duke NUS Medical School, Singapore, Singapore
| | - Victor Koh
- Department of Ophthalmology, National University Hospital, Singapore, Singapore
| | - Seng Chee Loon
- Department of Ophthalmology, National University Hospital, Singapore, Singapore
| | | | - Katherine Lun
- Department of Ophthalmology, National University Hospital, Singapore, Singapore
| | - Ching-Yu Cheng
- Centre of Innovation and Precision Eye Health, Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore, Singapore
- Singapore Eye Research Institute, Singapore National Eye Centre, Singapore, Singapore
- Eye Academic Clinical Program (Eye ACP), Duke NUS Medical School, Singapore, Singapore
| | - Kee Yuan Ngiam
- Division of General Surgery (Endocrine & Thyroid Surgery), Department of Surgery, National University Hospital, Singapore, Singapore
| | - Marcus Tan
- Department of Ophthalmology, National University Hospital, Singapore, Singapore
| |
Collapse
|
43
|
Lee Y, Tessier L, Brar K, Malone S, Jin D, McKechnie T, Jung JJ, Kroh M, Dang JT. Performance of artificial intelligence in bariatric surgery: comparative analysis of ChatGPT-4, Bing, and Bard in the American Society for Metabolic and Bariatric Surgery textbook of bariatric surgery questions. Surg Obes Relat Dis 2024; 20:609-613. [PMID: 38782611 DOI: 10.1016/j.soard.2024.04.014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/12/2024] [Accepted: 04/14/2024] [Indexed: 05/25/2024]
Abstract
BACKGROUND The American Society for Metabolic and Bariatric Surgery (ASMBS) textbook serves as a comprehensive resource for bariatric surgery, covering recent advancements and clinical questions. Testing artificial intelligence (AI) engines against this authoritative source ensures accurate and up-to-date information and provides insight into their potential implications for surgical education and training. OBJECTIVES To determine the quality of, and to compare, different large language models' (LLMs) ability to respond to textbook questions relating to bariatric surgery. SETTING Remote. METHODS Prompts entered into the LLMs were multiple-choice questions from "The ASMBS Textbook of Bariatric Surgery, Second Edition." The prompts were queried in 3 LLMs: OpenAI's ChatGPT-4, Microsoft's Bing, and Google's Bard. The generated responses were assessed on overall accuracy, the number of correct answers by subject matter, and the number of correct answers by question type. Statistical analysis was performed to determine the number of correct responses per LLM per category. RESULTS Two hundred questions were used to query the AI models. There was an overall significant difference in answer accuracy: 83.0% for ChatGPT-4, followed by Bard (76.0%) and Bing (65.0%). Subgroup analysis revealed a significant difference in the models' performance across question categories, with ChatGPT-4 demonstrating the highest proportion of correct answers in questions related to treatment and surgical procedures (83.1%) and complications (91.7%). There was also a significant difference in performance across question types, with ChatGPT-4 showing superior performance on inclusionary questions. Bard and Bing were unable to answer certain questions, whereas ChatGPT-4 left no questions unanswered. CONCLUSIONS LLMs, particularly ChatGPT-4, demonstrated promising accuracy when answering clinical questions related to bariatric surgery. Continued AI advancement and research are required to elucidate the potential applications of LLMs in training and education.
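The comparison above tallies correct answers per model and per question category across 200 textbook questions, with an overall significance test across models. A hedged sketch of how such a tally might be organized in Python with pandas is shown below; the per-question records are invented placeholders, not the study's data, and the overall test shown (chi-square) is one plausible choice rather than the study's confirmed method.

```python
# Hypothetical tally of multiple-choice results per model and per category,
# plus an overall chi-square test of accuracy across the three models.
import pandas as pd
from scipy.stats import chi2_contingency

# Invented per-question records: model, question category, and correctness.
records = pd.DataFrame({
    "model":    ["ChatGPT-4", "Bard", "Bing", "ChatGPT-4", "Bard", "Bing"] * 100,
    "category": (["complications"] * 6 + ["surgical procedures"] * 6) * 50,
    "correct":  [True, True, False, True, False, True] * 100,
})

# Accuracy per model and per category.
print(records.groupby(["model", "category"])["correct"].mean().unstack())

# Overall chi-square across models on correct vs incorrect counts.
counts = pd.crosstab(records["model"], records["correct"])
chi2, p_value, _, _ = chi2_contingency(counts)
print(f"chi2 = {chi2:.2f}, p = {p_value:.3g}")
```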
Collapse
Affiliation(s)
- Yung Lee
- Division of General Surgery, McMaster University, Hamilton, Ontario, Canada; Harvard T.H. Chan School of Public Health, Harvard University, Boston, Massachusetts
| | - Léa Tessier
- Division of General Surgery, McMaster University, Hamilton, Ontario, Canada
| | - Karanbir Brar
- Division of General Surgery, McMaster University, Hamilton, Ontario, Canada
| | - Sarah Malone
- Division of General Surgery, McMaster University, Hamilton, Ontario, Canada
| | - David Jin
- Division of General Surgery, McMaster University, Hamilton, Ontario, Canada
| | - Tyler McKechnie
- Division of General Surgery, McMaster University, Hamilton, Ontario, Canada
| | - James J Jung
- Division of General Surgery, University of Toronto, Toronto, Ontario, Canada
| | - Matthew Kroh
- Digestive Disease Institute, Cleveland Clinic, Cleveland, Ohio
| | - Jerry T Dang
- Digestive Disease Institute, Cleveland Clinic, Cleveland, Ohio.
| |
Collapse
|
44
|
Kerci SG, Sahan B. An Analysis of ChatGPT4 to Respond to Glaucoma-Related Questions. J Glaucoma 2024; 33:486-489. [PMID: 38647417 DOI: 10.1097/ijg.0000000000002408] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/29/2023] [Accepted: 03/11/2024] [Indexed: 04/25/2024]
Abstract
PRÉCIS In recent years, ChatGPT has been widely used as a source of information. Our study revealed that ChatGPT gives accurate information about glaucoma. PURPOSE We examined ChatGPT's knowledge about glaucoma. MATERIALS AND METHODS Frequently asked questions about glaucoma found on the websites of ophthalmology associations and hospitals and on social media applications were assessed. Evidence-based recommendations in the European Glaucoma Society Terminology and Guidelines for Glaucoma, Fifth Edition, were also evaluated. Using ChatGPT-4, each question was asked twice on different computers to assess the reproducibility of answers. The answers were recorded, and 2 specialist ophthalmologists evaluated them independently, assigning scores from 1 to 4. RESULTS Across all questions about glaucoma, 88.7% of answers were completely correct, 7.5% were correct but insufficient, and 3.8% contained a mix of misleading and correct information. No question was answered completely incorrectly. While 85.8% of the general knowledge questions were answered correctly, 91.7%, 86.6%, and 91.7% of questions about diagnosis, treatment, and prevention were answered correctly, respectively. Sixteen questions were prepared based on the European Glaucoma Society Terminology and Guidelines for Glaucoma; the rate of completely correct answers to these questions was 75.0% (12). Three answers (18.8%) were correct but insufficient, and 1 response (6.3%) contained a mix of false and correct information. CONCLUSIONS Our study revealed that ChatGPT answered 9 out of 10 questions about general information, diagnosis, treatment, prevention, and follow-up of glaucoma with acceptable and satisfactory accuracy. In addition, 3 of 4 answers given by ChatGPT to guideline-based questions were completely correct according to the Terminology and Guidelines for Glaucoma.
Collapse
Affiliation(s)
- Suleyman G Kerci
- Department of Ophthalmology, Medicana International Izmir Hospital, İzmir, Turkey
| | | |
Collapse
|
45
|
Yang Z, Wang D, Zhou F, Song D, Zhang Y, Jiang J, Kong K, Liu X, Qiao Y, Chang RT, Han Y, Li F, Tham CC, Zhang X. Understanding natural language: Potential application of large language models to ophthalmology. Asia Pac J Ophthalmol (Phila) 2024; 13:100085. [PMID: 39059558 DOI: 10.1016/j.apjo.2024.100085] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/17/2024] [Revised: 06/19/2024] [Accepted: 07/19/2024] [Indexed: 07/28/2024] Open
Abstract
Large language models (LLMs), a natural language processing technology based on deep learning, are currently in the spotlight. These models closely mimic natural language comprehension and generation. Their evolution has undergone several waves of innovation similar to convolutional neural networks. The transformer architecture advancement in generative artificial intelligence marks a monumental leap beyond early-stage pattern recognition via supervised learning. With the expansion of parameters and training data (terabytes), LLMs unveil remarkable human interactivity, encompassing capabilities such as memory retention and comprehension. These advances make LLMs particularly well-suited for roles in healthcare communication between medical practitioners and patients. In this comprehensive review, we discuss the trajectory of LLMs and their potential implications for clinicians and patients. For clinicians, LLMs can be used for automated medical documentation, and given better inputs and extensive validation, LLMs may be able to autonomously diagnose and treat in the future. For patient care, LLMs can be used for triage suggestions, summarization of medical documents, explanation of a patient's condition, and customizing patient education materials tailored to their comprehension level. The limitations of LLMs and possible solutions for real-world use are also presented. Given the rapid advancements in this area, this review attempts to briefly cover many roles that LLMs may play in the ophthalmic space, with a focus on improving the quality of healthcare delivery.
Collapse
Affiliation(s)
- Zefeng Yang
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-Sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, Guangzhou 510060, China
| | - Deming Wang
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-Sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, Guangzhou 510060, China
| | - Fengqi Zhou
- Ophthalmology, Mayo Clinic Health System, Eau Claire, Wisconsin, USA
| | - Diping Song
- Shanghai Artificial Intelligence Laboratory, Shanghai, China
| | - Yinhang Zhang
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-Sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, Guangzhou 510060, China
| | - Jiaxuan Jiang
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-Sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, Guangzhou 510060, China
| | - Kangjie Kong
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-Sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, Guangzhou 510060, China
| | - Xiaoyi Liu
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-Sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, Guangzhou 510060, China
| | - Yu Qiao
- Shanghai Artificial Intelligence Laboratory, Shanghai, China
| | - Robert T Chang
- Department of Ophthalmology, Byers Eye Institute at Stanford University, Palo Alto, CA, USA
| | - Ying Han
- Department of Ophthalmology, University of California, San Francisco, San Francisco, CA, USA
| | - Fei Li
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-Sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, Guangzhou 510060, China.
| | - Clement C Tham
- Department of Ophthalmology and Visual Sciences, The Chinese University of Hong Kong, Hong Kong SAR, China; Hong Kong Eye Hospital, Kowloon, Hong Kong SAR, China; Department of Ophthalmology and Visual Sciences, Prince of Wales Hospital, Shatin, Hong Kong SAR, China.
| | - Xiulan Zhang
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-Sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, Guangzhou 510060, China.
| |
Collapse
|
46
|
Heinke A, Radgoudarzi N, Huang BB, Baxter SL. A review of ophthalmology education in the era of generative artificial intelligence. Asia Pac J Ophthalmol (Phila) 2024; 13:100089. [PMID: 39134176 DOI: 10.1016/j.apjo.2024.100089] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/16/2024] [Revised: 07/31/2024] [Accepted: 08/02/2024] [Indexed: 08/18/2024] Open
Abstract
PURPOSE To explore the integration of generative AI, specifically large language models (LLMs), in ophthalmology education and practice, addressing their applications, benefits, challenges, and future directions. DESIGN A literature review and analysis of current AI applications and educational programs in ophthalmology. METHODS Analysis of published studies, reviews, articles, websites, and institutional reports on AI use in ophthalmology. Examination of educational programs incorporating AI, including curriculum frameworks, training methodologies, and evaluations of AI performance on medical examinations and clinical case studies. RESULTS Generative AI, particularly LLMs, shows potential to improve diagnostic accuracy and patient care in ophthalmology. Applications include aiding in patient, physician, and medical students' education. However, challenges such as AI hallucinations, biases, lack of interpretability, and outdated training data limit clinical deployment. Studies revealed varying levels of accuracy of LLMs on ophthalmology board exam questions, underscoring the need for more reliable AI integration. Several educational programs nationwide provide AI and data science training relevant to clinical medicine and ophthalmology. CONCLUSIONS Generative AI and LLMs offer promising advancements in ophthalmology education and practice. Addressing challenges through comprehensive curricula that include fundamental AI principles, ethical guidelines, and updated, unbiased training data is crucial. Future directions include developing clinically relevant evaluation metrics, implementing hybrid models with human oversight, leveraging image-rich data, and benchmarking AI performance against ophthalmologists. Robust policies on data privacy, security, and transparency are essential for fostering a safe and ethical environment for AI applications in ophthalmology.
Collapse
Affiliation(s)
- Anna Heinke
- Division of Ophthalmology Informatics and Data Science, The Viterbi Family Department of Ophthalmology and Shiley Eye Institute, University of California San Diego, 9415 Campus Point Drive, La Jolla, CA 92037, USA; Jacobs Retina Center, 9415 Campus Point Drive, La Jolla, CA 92037, USA
| | - Niloofar Radgoudarzi
- Division of Ophthalmology Informatics and Data Science, The Viterbi Family Department of Ophthalmology and Shiley Eye Institute, University of California San Diego, 9415 Campus Point Drive, La Jolla, CA 92037, USA; Division of Biomedical Informatics, Department of Medicine, University of California San Diego Health System, University of California San Diego, La Jolla, CA, USA
| | - Bonnie B Huang
- Division of Ophthalmology Informatics and Data Science, The Viterbi Family Department of Ophthalmology and Shiley Eye Institute, University of California San Diego, 9415 Campus Point Drive, La Jolla, CA 92037, USA; Division of Biomedical Informatics, Department of Medicine, University of California San Diego Health System, University of California San Diego, La Jolla, CA, USA; Northwestern University Feinberg School of Medicine, Chicago, IL, USA
| | - Sally L Baxter
- Division of Ophthalmology Informatics and Data Science, The Viterbi Family Department of Ophthalmology and Shiley Eye Institute, University of California San Diego, 9415 Campus Point Drive, La Jolla, CA 92037, USA; Division of Biomedical Informatics, Department of Medicine, University of California San Diego Health System, University of California San Diego, La Jolla, CA, USA.
| |
Collapse
|
47
|
Shemer A, Cohen M, Altarescu A, Atar-Vardi M, Hecht I, Dubinsky-Pertzov B, Shoshany N, Zmujack S, Or L, Einan-Lifshitz A, Pras E. Diagnostic capabilities of ChatGPT in ophthalmology. Graefes Arch Clin Exp Ophthalmol 2024; 262:2345-2352. [PMID: 38183467 DOI: 10.1007/s00417-023-06363-z] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/28/2023] [Revised: 12/04/2023] [Accepted: 12/23/2023] [Indexed: 01/08/2024] Open
Abstract
PURPOSE The purpose of this study was to assess the diagnostic accuracy of ChatGPT in the field of ophthalmology. METHODS This was a retrospective cohort study conducted in one academic tertiary medical center. We reviewed data of patients admitted to the ophthalmology department from 06/2022 to 01/2023 and created two clinical cases for each patient: the first based on the medical history alone (Hx), the second adding the clinical examination findings (Hx and Ex). For each case, we asked ChatGPT, residents, and attendings for the three most likely diagnoses and compared the accuracy rates (at least one correct diagnosis) of all groups. We also compared the total time each group took to complete the task. RESULTS ChatGPT, residents, and attendings evaluated 126 cases from 63 patients (history only, or history and exam findings, for each patient). ChatGPT achieved a significantly lower diagnostic accuracy (54%) on the Hx cases compared with the residents (75%; p < 0.01) and attendings (71%; p < 0.01). After adding the clinical examination findings, ChatGPT's diagnostic accuracy was 68%, whereas that of the residents and attendings increased to 94% (p < 0.01) and 86% (p < 0.01), respectively. ChatGPT was 4 to 5 times faster than the attendings and residents. CONCLUSIONS AND RELEVANCE ChatGPT showed lower diagnostic accuracy in ophthalmology cases than residents and attendings, whether based on patient history alone or with the addition of clinical examination findings. However, ChatGPT completed the task faster than the physicians.
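The accuracy metric used above counts a case as correct when any of the three proposed diagnoses matches the final diagnosis. The sketch below is a hypothetical illustration of that top-3 matching rule in Python; the diagnosis strings and the simple normalization helper are invented for the example, and real matching would require clinician adjudication.

```python
# Hypothetical "top-3 accuracy" rule: a case counts as correct when any of the
# three proposed diagnoses matches the reference (final) diagnosis.
from typing import List

def normalize(diagnosis: str) -> str:
    # Simplistic normalization for the example; real matching would need
    # synonym handling and clinician review.
    return diagnosis.strip().lower()

def top3_accuracy(predictions: List[List[str]], references: List[str]) -> float:
    hits = sum(
        normalize(ref) in {normalize(d) for d in top3}
        for top3, ref in zip(predictions, references)
    )
    return hits / len(references)

# Invented example cases.
preds = [
    ["acute angle-closure glaucoma", "uveitis", "keratitis"],
    ["central retinal artery occlusion", "retinal detachment", "optic neuritis"],
]
refs = ["Keratitis", "optic neuropathy"]
print(f"top-3 accuracy: {top3_accuracy(preds, refs):.0%}")  # 50%
```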
Collapse
Affiliation(s)
- Asaf Shemer
- Department of Ophthalmology, Shamir Medical Center (Formerly Assaf-Harofeh), Tzrifin, Israel.
- Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel.
| | - Michal Cohen
- Department of Ophthalmology, Shamir Medical Center (Formerly Assaf-Harofeh), Tzrifin, Israel
- Faculty of Health Science, Ben-Gurion University of the Negev, South District, Beer-Sheva, Israel
| | - Aya Altarescu
- Department of Ophthalmology, Shamir Medical Center (Formerly Assaf-Harofeh), Tzrifin, Israel
- Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
| | - Maya Atar-Vardi
- Department of Ophthalmology, Shamir Medical Center (Formerly Assaf-Harofeh), Tzrifin, Israel
- Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
| | - Idan Hecht
- Department of Ophthalmology, Shamir Medical Center (Formerly Assaf-Harofeh), Tzrifin, Israel
- Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
| | - Biana Dubinsky-Pertzov
- Department of Ophthalmology, Shamir Medical Center (Formerly Assaf-Harofeh), Tzrifin, Israel
- Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
| | - Nadav Shoshany
- Department of Ophthalmology, Shamir Medical Center (Formerly Assaf-Harofeh), Tzrifin, Israel
- Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
| | - Sigal Zmujack
- Department of Ophthalmology, Shamir Medical Center (Formerly Assaf-Harofeh), Tzrifin, Israel
- Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
| | - Lior Or
- Department of Ophthalmology, Shamir Medical Center (Formerly Assaf-Harofeh), Tzrifin, Israel
- Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
| | - Adi Einan-Lifshitz
- Department of Ophthalmology, Shamir Medical Center (Formerly Assaf-Harofeh), Tzrifin, Israel
- Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
| | - Eran Pras
- Department of Ophthalmology, Shamir Medical Center (Formerly Assaf-Harofeh), Tzrifin, Israel
- Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
- The Matlow's Ophthalmo-Genetics Laboratory, Department of Ophthalmology, Shamir Medical Center (Formerly Assaf-Harofeh), Tzrifin, Israel
| |
Collapse
|
48
|
Tong L, Wang J, Rapaka S, Garg PS. Can ChatGPT generate practice question explanations for medical students, a new faculty teaching tool? MEDICAL TEACHER 2024:1-5. [PMID: 38900675 DOI: 10.1080/0142159x.2024.2363486] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/11/2023] [Accepted: 05/30/2024] [Indexed: 06/22/2024]
Abstract
INTRODUCTION Multiple-choice questions (MCQs) are frequently used for formative assessment in medical school but often lack sufficient answer explanations given faculty time constraints. Chat Generative Pre-trained Transformer (ChatGPT) has emerged as a potential student learning aid and faculty teaching tool. This study aims to evaluate ChatGPT's performance in answering and providing explanations for MCQs. METHOD Ninety-four faculty-generated MCQs were collected from the pre-clerkship curriculum at a US medical school. ChatGPT's accuracy in answering MCQs was tracked on the first attempt without an answer prompt (Pass 1) and after being given a prompt for the correct answer (Pass 2). Explanations provided by ChatGPT were compared with faculty-generated explanations, and a 3-point evaluation scale was used to assess their accuracy and thoroughness relative to the faculty-generated answers. RESULTS On the first attempt, ChatGPT demonstrated 75% accuracy in answering faculty-generated MCQs. Among correctly answered questions, 66.4% of ChatGPT's explanations matched faculty explanations, and 89.1% captured some key aspects without providing inaccurate information. The proportion of inaccurate explanations increased significantly when the question was not answered correctly on the first pass (2.7% if correct on first pass vs. 34.6% if incorrect on first pass, p < 0.001). CONCLUSION ChatGPT shows promise in assisting faculty and students with explanations for practice MCQs but should be used with caution. Faculty should review and supplement the explanations to ensure coverage of learning objectives. Students can benefit from ChatGPT's explanations for immediate feedback when it answers a question correctly on the first try. If the question is answered incorrectly, students should remain cautious of the explanation and seek clarification from instructors.
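The two-pass protocol described above (a first attempt without the answer, a prompted second attempt, and a graded explanation) maps naturally onto a small bookkeeping structure. The sketch below shows one hypothetical way to tabulate such results in Python; the field names, the 3-point grade encoding, and the example records are invented for illustration and do not reproduce the study's data.

```python
# Hypothetical bookkeeping for a two-pass MCQ evaluation: whether the model was
# correct on the first pass, and a 3-point grade of its explanation
# (2 = matches faculty explanation, 1 = partially matches, 0 = inaccurate).
from dataclasses import dataclass
from statistics import mean

@dataclass
class MCQResult:
    question_id: int
    correct_first_pass: bool
    explanation_grade: int  # 0, 1, or 2

# Invented results for a handful of questions.
results = [
    MCQResult(1, True, 2), MCQResult(2, True, 1), MCQResult(3, False, 0),
    MCQResult(4, True, 2), MCQResult(5, False, 1), MCQResult(6, True, 2),
]

first_pass_accuracy = mean(r.correct_first_pass for r in results)
inaccurate_if_correct = [r.explanation_grade == 0 for r in results if r.correct_first_pass]
inaccurate_if_wrong = [r.explanation_grade == 0 for r in results if not r.correct_first_pass]

print(f"first-pass accuracy: {first_pass_accuracy:.0%}")
print(f"inaccurate explanations | correct first pass: {mean(inaccurate_if_correct):.0%}")
print(f"inaccurate explanations | wrong first pass:   {mean(inaccurate_if_wrong):.0%}")
```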
Collapse
Affiliation(s)
- Lilin Tong
- Boston University Chobanian and Avedisian School of Medicine, Boston, MA, USA
| | - Jennifer Wang
- Boston University Chobanian and Avedisian School of Medicine, Boston, MA, USA
| | - Srikar Rapaka
- Boston University Chobanian and Avedisian School of Medicine, Boston, MA, USA
| | - Priya S Garg
- Medical Education Office and Department of Pediatrics, Boston University Chobanian and Avedisian School of Medicine, Boston, MA, USA
| |
Collapse
|
49
|
Antaki F, Chopra R, Keane PA. Vision-Language Models for Feature Detection of Macular Diseases on Optical Coherence Tomography. JAMA Ophthalmol 2024; 142:573-576. [PMID: 38696177 PMCID: PMC11066758 DOI: 10.1001/jamaophthalmol.2024.1165] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/22/2023] [Accepted: 02/24/2024] [Indexed: 05/05/2024]
Abstract
Importance Vision-language models (VLMs) are a novel artificial intelligence technology capable of processing image and text inputs. While demonstrating strong generalist capabilities, their performance in ophthalmology has not been extensively studied. Objective To assess the performance of the Gemini Pro VLM in expert-level tasks for macular diseases from optical coherence tomography (OCT) scans. Design, Setting, and Participants This was a cross-sectional diagnostic accuracy study evaluating a generalist VLM on ophthalmology-specific tasks using the open-source Optical Coherence Tomography Image Database. The dataset included OCT B-scans from 50 unique patients: healthy individuals and those with macular hole, diabetic macular edema, central serous chorioretinopathy, and age-related macular degeneration. Each OCT scan was labeled for 10 key pathological features, referral recommendations, and treatments. The images were captured using a Cirrus high definition OCT machine (Carl Zeiss Meditec) at Sankara Nethralaya Eye Hospital, Chennai, India, and the dataset was published in December 2018. Image acquisition dates were not specified. Exposures Gemini Pro, using a standard prompt to extract structured responses on December 15, 2023. Main Outcomes and Measures The primary outcome was model responses compared against expert labels, calculating F1 scores for each pathological feature. Secondary outcomes included accuracy in diagnosis, referral urgency, and treatment recommendation. The model's internal concordance was evaluated by measuring the alignment between referral and treatment recommendations, independent of diagnostic accuracy. Results The mean F1 score was 10.7% (95% CI, 2.4-19.2). Measurable F1 scores were obtained for macular hole (36.4%; 95% CI, 0-71.4), pigment epithelial detachment (26.1%; 95% CI, 0-46.2), subretinal hyperreflective material (24.0%; 95% CI, 0-45.2), and subretinal fluid (20.0%; 95% CI, 0-45.5). A correct diagnosis was achieved in 17 of 50 cases (34%; 95% CI, 22-48). Referral recommendations varied: 28 of 50 were correct (56%; 95% CI, 42-70), 10 of 50 were overcautious (20%; 95% CI, 10-32), and 12 of 50 were undercautious (24%; 95% CI, 12-36). Referral and treatment concordance were very high, with 48 of 50 (96%; 95 % CI, 90-100) and 48 of 49 (98%; 95% CI, 94-100) correct answers, respectively. Conclusions and Relevance In this study, a generalist VLM demonstrated limited vision capabilities for feature detection and management of macular disease. However, it showed low self-contradiction, suggesting strong language capabilities. As VLMs continue to improve, validating their performance on large benchmarking datasets will help ascertain their potential in ophthalmology.
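The primary outcome above is a per-feature F1 score comparing the model's structured outputs with expert labels, reported with 95% confidence intervals. The sketch below shows a hedged, generic way to compute a per-feature F1 with a percentile-bootstrap interval in Python; the label arrays are invented, and the bootstrap approach is an assumption rather than the study's confirmed method.

```python
# Generic per-feature F1 with a percentile-bootstrap 95% CI, on invented
# binary labels (1 = feature present on the scan). Not the study's code.
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(42)
n_scans = 50
expert = rng.integers(0, 2, size=n_scans)                        # expert labels
model = np.where(rng.random(n_scans) < 0.7, expert, 1 - expert)  # noisy model labels

point_estimate = f1_score(expert, model)

boot = []
for _ in range(2000):
    idx = rng.integers(0, n_scans, size=n_scans)   # resample scans with replacement
    boot.append(f1_score(expert[idx], model[idx], zero_division=0))
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])

print(f"F1 = {point_estimate:.3f} (95% CI {ci_low:.3f}-{ci_high:.3f})")
```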
Collapse
Affiliation(s)
- Fares Antaki
- Institute of Ophthalmology, University College London, London, United Kingdom
- Moorfields Eye Hospital National Health Service Foundation Trust, London, United Kingdom
- The Centre Hospitalier de l’Université de Montréal School of Artificial Intelligence in Healthcare, Montreal, Quebec, Canada
| | - Reena Chopra
- Institute of Ophthalmology, University College London, London, United Kingdom
- Moorfields Eye Hospital National Health Service Foundation Trust, London, United Kingdom
- National Institute for Health and Care Research Biomedical Research Centre at Moorfields Eye Hospital National Health Service Foundation Trust, London, United Kingdom
| | - Pearse A. Keane
- Institute of Ophthalmology, University College London, London, United Kingdom
- Moorfields Eye Hospital National Health Service Foundation Trust, London, United Kingdom
- National Institute for Health and Care Research Biomedical Research Centre at Moorfields Eye Hospital National Health Service Foundation Trust, London, United Kingdom
| |
Collapse
|
50
|
Maywood MJ, Parikh R, Deobhakta A, Begaj T. PERFORMANCE ASSESSMENT OF AN ARTIFICIAL INTELLIGENCE CHATBOT IN CLINICAL VITREORETINAL SCENARIOS. Retina 2024; 44:954-964. [PMID: 38271674 DOI: 10.1097/iae.0000000000004053] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/27/2024]
Abstract
PURPOSE To determine how often ChatGPT is able to provide accurate and comprehensive information regarding clinical vitreoretinal scenarios. To assess the types of sources ChatGPT primarily uses and to determine whether they are hallucinated. METHODS This was a retrospective cross-sectional study. The authors designed 40 open-ended clinical scenarios across four main topics in vitreoretinal disease. Responses were graded on correctness and comprehensiveness by three blinded retina specialists. The primary outcome was the number of clinical scenarios that ChatGPT answered correctly and comprehensively. Secondary outcomes included theoretical harm to patients, the distribution of the type of references used by the chatbot, and the frequency of hallucinated references. RESULTS In June 2023, ChatGPT answered 83% of clinical scenarios (33/40) correctly but provided a comprehensive answer in only 52.5% of cases (21/40). Subgroup analysis demonstrated an average correct score of 86.7% in neovascular age-related macular degeneration, 100% in diabetic retinopathy, 76.7% in retinal vascular disease, and 70% in the surgical domain. There were six incorrect responses with one case (16.7%) of no harm, three cases (50%) of possible harm, and two cases (33.3%) of definitive harm. CONCLUSION ChatGPT correctly answered more than 80% of complex open-ended vitreoretinal clinical scenarios, with a reduced capability to provide a comprehensive response.
Collapse
Affiliation(s)
- Michael J Maywood
- Department of Ophthalmology, Corewell Health William Beaumont University Hospital, Royal Oak, Michigan
| | - Ravi Parikh
- Manhattan Retina and Eye Consultants, New York, New York
- Department of Ophthalmology, New York University School of Medicine, New York, New York
| | | | - Tedi Begaj
- Department of Ophthalmology, Corewell Health William Beaumont University Hospital, Royal Oak, Michigan
- Associated Retinal Consultants, Royal Oak, Michigan
| |
Collapse
|