1
Kalaw FGP, Baxter SL. Ethical considerations for large language models in ophthalmology. Curr Opin Ophthalmol 2024; 35:438-446. [PMID: 39259616 PMCID: PMC11427135 DOI: 10.1097/icu.0000000000001083]
Abstract
PURPOSE OF REVIEW This review aims to summarize and discuss the ethical considerations regarding large language model (LLM) use in the field of ophthalmology. RECENT FINDINGS This review of 47 articles on LLM applications in ophthalmology highlights their diverse potential uses, including education, research, clinical decision support, and surgical assistance (as an aid in operative notes). We also review ethical considerations such as the inability of LLMs to interpret data accurately, the risk of promoting controversial or harmful recommendations, and breaches of data privacy. These concerns imply the need for cautious integration of artificial intelligence in healthcare, emphasizing human oversight, transparency, and accountability to mitigate risks and uphold ethical standards. SUMMARY The integration of LLMs in ophthalmology offers potential advantages such as aiding in clinical decision support and facilitating medical education through their ability to process queries and analyze ophthalmic imaging and clinical cases. However, their utilization also raises ethical concerns regarding data privacy, potential misinformation, and biases inherent in the datasets used. These concerns should be addressed in order to optimize the utility of LLMs in the healthcare setting. More importantly, responsible and careful use by consumers should be promoted.
Affiliation(s)
- Fritz Gerald P Kalaw
- Division of Ophthalmology Informatics and Data Science, Viterbi Family Department of Ophthalmology and Shiley Eye Institute
- Department of Biomedical Informatics, University of California San Diego Health System, University of California San Diego, La Jolla, California, USA
- Sally L Baxter
- Division of Ophthalmology Informatics and Data Science, Viterbi Family Department of Ophthalmology and Shiley Eye Institute
- Department of Biomedical Informatics, University of California San Diego Health System, University of California San Diego, La Jolla, California, USA
2
Kayabaşı M, Köksaldı S, Durmaz Engin C. Evaluating the reliability of the responses of large language models to keratoconus-related questions. Clin Exp Optom 2024:1-8. [PMID: 39448387 DOI: 10.1080/08164622.2024.2419524]
Abstract
CLINICAL RELEVANCE Artificial intelligence has undergone a rapid evolution, and large language models (LLMs) have become promising tools for healthcare, with the ability to provide human-like responses to questions. The capabilities of these tools in addressing questions related to keratoconus (KCN) have not been previously explored. BACKGROUND In this study, responses from three LLMs - ChatGPT-4, Copilot, and Gemini - to common patient questions regarding KCN were evaluated. METHODS Fifty real-life patient inquiries regarding general information, aetiology, symptoms and diagnosis, progression, and treatment of KCN were presented to the LLMs. Evaluations of the answers were conducted by three ophthalmologists using a 5-point Likert scale ranging from 'strongly disagreed' to 'strongly agreed'. The reliability of the responses provided by LLMs was evaluated using the DISCERN and the Ensuring Quality Information for Patients (EQIP) scales. Readability metrics (Flesch Reading Ease Score, Flesch-Kincaid Grade Level, and Coleman-Liau Index) were calculated to evaluate the complexity of responses. RESULTS ChatGPT-4 consistently scored 3 points or higher for all (100%) of its responses, while Copilot had five (10%) and Gemini had two (4%) responses scoring 2 points or below. ChatGPT-4 achieved a 'strongly agree' rate of 74% across all questions, markedly superior to Copilot at 34% and Gemini at 42% (p < 0.001), and recorded the highest 'strongly agree' rates in the general information and symptoms & diagnosis categories (90% for both). The median Likert scores differed among the LLMs (p < 0.001), with ChatGPT-4 scoring highest and Copilot scoring lowest. Although ChatGPT-4 exhibited greater reliability based on the DISCERN scale, it was characterised by lower readability and higher complexity. While all LLMs provided responses categorised as 'extremely difficult to read', the responses provided by Copilot showed higher readability. CONCLUSIONS Despite its responses exhibiting lower readability and greater complexity, ChatGPT-4 emerged as the most proficient in answering KCN-related questions.
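For readers unfamiliar with the readability metrics named in this abstract, the sketch below shows how the three indices are typically computed from raw text. It is a minimal illustration using the standard published formulas, not the authors' code; the syllable counter is a naive vowel-group heuristic, so the scores it produces are only approximate.

```python
# Minimal sketch of the three readability indices named above, using the standard
# published formulas; the syllable counter is a rough heuristic, so scores are approximate.
import re

def count_syllables(word: str) -> int:
    # Naive heuristic: count groups of consecutive vowels, at least one per word.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> dict:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    letters = sum(len(w) for w in words)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / len(sentences)          # words per sentence
    spw = syllables / len(words)               # syllables per word
    L = letters / len(words) * 100             # letters per 100 words
    S = len(sentences) / len(words) * 100      # sentences per 100 words
    return {
        "flesch_reading_ease": 206.835 - 1.015 * wps - 84.6 * spw,
        "flesch_kincaid_grade": 0.39 * wps + 11.8 * spw - 15.59,
        "coleman_liau_index": 0.0588 * L - 0.296 * S - 15.8,
    }

print(readability("Keratoconus is a progressive thinning of the cornea. "
                  "It usually begins in the teenage years."))
```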
Affiliation(s)
- Seher Köksaldı
- Department of Ophthalmology, Mus State Hospital, Mus, Turkey
- Ceren Durmaz Engin
- Department of Ophthalmology, Izmir Democracy University Buca Seyfi Demirsoy Education and Research Hospital, Izmir, Turkey
3
Bellanda VCF, Santos MLD, Ferraz DA, Jorge R, Melo GB. Applications of ChatGPT in the diagnosis, management, education, and research of retinal diseases: a scoping review. Int J Retina Vitreous 2024; 10:79. [PMID: 39420407 PMCID: PMC11487877 DOI: 10.1186/s40942-024-00595-9]
Abstract
PURPOSE This scoping review aims to explore the current applications of ChatGPT in the retina field, highlighting its potential, challenges, and limitations. METHODS A comprehensive literature search was conducted across multiple databases, including PubMed, Scopus, MEDLINE, and Embase, to identify relevant articles published from 2022 onwards. The inclusion criteria focused on studies evaluating the use of ChatGPT in retinal healthcare. Data were extracted and synthesized to map the scope of ChatGPT's applications in retinal care, categorizing articles into various practical application areas such as academic research, charting, coding, diagnosis, disease management, and patient counseling. RESULTS A total of 68 articles were included in the review, distributed across several categories: 8 related to academics and research, 5 to charting, 1 to coding and billing, 44 to diagnosis, 49 to disease management, 2 to literature consulting, 23 to medical education, and 33 to patient counseling. Many articles were classified into multiple categories due to overlapping topics. The findings indicate that while ChatGPT shows significant promise in areas such as medical education and diagnostic support, concerns regarding accuracy, reliability, and the potential for misinformation remain prevalent. CONCLUSION ChatGPT offers substantial potential in advancing retinal healthcare by supporting clinical decision-making, enhancing patient education, and automating administrative tasks. However, its current limitations, particularly in clinical accuracy and the risk of generating misinformation, necessitate cautious integration into practice, with continuous oversight from healthcare professionals. Future developments should focus on improving accuracy, incorporating up-to-date medical guidelines, and minimizing the risks associated with AI-driven healthcare tools.
Affiliation(s)
- Victor C F Bellanda
- Ribeirão Preto Medical School, University of São Paulo, 3900 Bandeirantes Ave, Ribeirão Preto, SP, 14049-900, Brazil.
- Rodrigo Jorge
- Ribeirão Preto Medical School, University of São Paulo, 3900 Bandeirantes Ave, Ribeirão Preto, SP, 14049-900, Brazil
- Gustavo Barreto Melo
- Sergipe Eye Hospital, Aracaju, SE, Brazil
- Paulista School of Medicine, Federal University of São Paulo, São Paulo, SP, Brazil
4
Bedi S, Liu Y, Orr-Ewing L, Dash D, Koyejo S, Callahan A, Fries JA, Wornow M, Swaminathan A, Lehmann LS, Hong HJ, Kashyap M, Chaurasia AR, Shah NR, Singh K, Tazbaz T, Milstein A, Pfeffer MA, Shah NH. Testing and Evaluation of Health Care Applications of Large Language Models: A Systematic Review. JAMA 2024:2825147. [PMID: 39405325 PMCID: PMC11480901 DOI: 10.1001/jama.2024.21700]
Abstract
Importance Large language models (LLMs) can assist in various health care activities, but current evaluation approaches may not adequately identify the most useful application areas. Objective To summarize existing evaluations of LLMs in health care in terms of 5 components: (1) evaluation data type, (2) health care task, (3) natural language processing (NLP) and natural language understanding (NLU) tasks, (4) dimension of evaluation, and (5) medical specialty. Data Sources A systematic search of PubMed and Web of Science was performed for studies published between January 1, 2022, and February 19, 2024. Study Selection Studies evaluating 1 or more LLMs in health care. Data Extraction and Synthesis Three independent reviewers categorized studies via keyword searches based on the data used, the health care tasks, the NLP and NLU tasks, the dimensions of evaluation, and the medical specialty. Results Of 519 studies reviewed, published between January 1, 2022, and February 19, 2024, only 5% used real patient care data for LLM evaluation. The most common health care tasks were assessing medical knowledge such as answering medical licensing examination questions (44.5%) and making diagnoses (19.5%). Administrative tasks such as assigning billing codes (0.2%) and writing prescriptions (0.2%) were less studied. For NLP and NLU tasks, most studies focused on question answering (84.2%), while tasks such as summarization (8.9%) and conversational dialogue (3.3%) were infrequent. Almost all studies (95.4%) used accuracy as the primary dimension of evaluation; fairness, bias, and toxicity (15.8%), deployment considerations (4.6%), and calibration and uncertainty (1.2%) were infrequently measured. Finally, in terms of medical specialty area, most studies were in generic health care applications (25.6%), internal medicine (16.4%), surgery (11.4%), and ophthalmology (6.9%), with nuclear medicine (0.6%), physical medicine (0.4%), and medical genetics (0.2%) being the least represented. Conclusions and Relevance Existing evaluations of LLMs mostly focus on accuracy of question answering for medical examinations, without consideration of real patient care data. Dimensions such as fairness, bias, and toxicity and deployment considerations received limited attention. Future evaluations should adopt standardized applications and metrics, use clinical data, and broaden focus to include a wider range of tasks and specialties.
Affiliation(s)
- Suhana Bedi
- Department of Biomedical Data Science, Stanford School of Medicine, Stanford, California
- Yutong Liu
- Clinical Excellence Research Center, Stanford University, Stanford, California
- Lucy Orr-Ewing
- Clinical Excellence Research Center, Stanford University, Stanford, California
- Dev Dash
- Clinical Excellence Research Center, Stanford University, Stanford, California
- Center for Biomedical Informatics Research, Stanford University, Stanford, California
- Sanmi Koyejo
- Department of Computer Science, Stanford University, Stanford, California
- Alison Callahan
- Center for Biomedical Informatics Research, Stanford University, Stanford, California
- Jason A. Fries
- Center for Biomedical Informatics Research, Stanford University, Stanford, California
- Michael Wornow
- Center for Biomedical Informatics Research, Stanford University, Stanford, California
- Akshay Swaminathan
- Center for Biomedical Informatics Research, Stanford University, Stanford, California
- Hyo Jung Hong
- Department of Anesthesiology, Stanford University, Stanford, California
- Mehr Kashyap
- Stanford University School of Medicine, Stanford, California
- Akash R. Chaurasia
- Center for Biomedical Informatics Research, Stanford University, Stanford, California
- Nirav R. Shah
- Clinical Excellence Research Center, Stanford University, Stanford, California
- Karandeep Singh
- Digital Health Innovation, University of California San Diego Health, San Diego
- Troy Tazbaz
- Digital Health Center of Excellence, US Food and Drug Administration, Washington, DC
- Arnold Milstein
- Clinical Excellence Research Center, Stanford University, Stanford, California
- Michael A. Pfeffer
- Department of Medicine, Stanford University School of Medicine, Stanford, California
- Nigam H. Shah
- Clinical Excellence Research Center, Stanford University, Stanford, California
- Center for Biomedical Informatics Research, Stanford University, Stanford, California
5
Guirguis PG, Youssef MP, Punreddy A, Botros M, Raiford M, McDowell S. Is Information About Musculoskeletal Malignancies From Large Language Models or Web Resources at a Suitable Reading Level for Patients? Clin Orthop Relat Res 2024:00003086-990000000-01751. [PMID: 39330944 DOI: 10.1097/corr.0000000000003263]
Abstract
BACKGROUND Patients and caregivers may experience immense distress when receiving the diagnosis of a primary musculoskeletal malignancy and subsequently turn to internet resources for more information. It is not clear whether these resources, including Google and ChatGPT, offer patients information that is readable, a measure of how easy text is to understand. Since many patients turn to Google and artificial intelligence resources for healthcare information, we thought it was important to ascertain whether the information they find is readable and easy to understand. The objective of this study was to compare readability of Google search results and ChatGPT answers to frequently asked questions and assess whether these sources meet NIH recommendations for readability. QUESTIONS/PURPOSES (1) What is the readability of ChatGPT-3.5 as a source of patient information for the three most common primary bone malignancies compared with top online resources from Google search? (2) Do ChatGPT-3.5 responses and online resources meet NIH readability guidelines for patient education materials? METHODS This was a cross-sectional analysis of the 12 most common online questions about osteosarcoma, chondrosarcoma, and Ewing sarcoma. To be consistent with other studies of similar design that utilized national society frequently asked questions lists, questions were selected from the American Cancer Society and categorized based on content, including diagnosis, treatment, and recovery and prognosis. Google was queried using all 36 questions, and top responses were recorded. Author types, such as hospital systems, national health organizations, or independent researchers, were recorded. ChatGPT-3.5 was provided each question in independent queries without further prompting. Responses were assessed with validated reading indices to determine readability by grade level. An independent t-test was performed with significance set at p < 0.05. RESULTS Google (n = 36) and ChatGPT-3.5 (n = 36) answers were recorded, 12 for each of the three cancer types. Reading grade levels based on mean readability scores were 11.0 ± 2.9 and 16.1 ± 3.6, respectively. This corresponds to the eleventh grade reading level for Google and a fourth-year undergraduate student level for ChatGPT-3.5. Google answers were more readable across all individual indices, without differences in word count. No difference in readability was present across author type, question category, or cancer type. Of 72 total responses across both search modalities, none met NIH readability criteria at the sixth-grade level. CONCLUSION Google material was presented at a high school reading level, whereas ChatGPT-3.5 was at an undergraduate reading level. The readability of both resources was inadequate based on NIH recommendations. Improving readability is crucial for better patient understanding during cancer treatment. Physicians should assess patients' needs, offer them tailored materials, and guide them to reliable resources to prevent reliance on online information that is hard to understand. LEVEL OF EVIDENCE Level III, prognostic study.
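The comparison reported here boils down to an independent t-test on per-response reading grade levels plus a check against the NIH sixth-grade target. The sketch below reproduces that analysis pattern on hypothetical grade-level values (not the study data), assuming SciPy is available.

```python
# Hedged illustration of the analysis described above: hypothetical per-response
# reading grade levels for the two sources, an independent t-test, and a check
# against the NIH sixth-grade recommendation.
from scipy import stats

google_grades = [10.2, 11.5, 9.8, 12.0, 11.3, 10.9]      # hypothetical values, not study data
chatgpt_grades = [15.8, 16.4, 17.0, 15.2, 16.9, 16.3]

t_stat, p_value = stats.ttest_ind(google_grades, chatgpt_grades)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

NIH_TARGET_GRADE = 6
for name, grades in [("Google", google_grades), ("ChatGPT-3.5", chatgpt_grades)]:
    meets = sum(g <= NIH_TARGET_GRADE for g in grades)
    print(f"{name}: {meets}/{len(grades)} responses meet the sixth-grade target")
```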
Affiliation(s)
- Paul G Guirguis
- University of Rochester School of Medicine and Dentistry, Rochester, NY, USA
- Mark P Youssef
- A.T. Still School of Osteopathic Medicine, Mesa, AZ, USA
- Ankit Punreddy
- University of Rochester School of Medicine and Dentistry, Rochester, NY, USA
- Mina Botros
- Department of Orthopaedics and Physical Performance, University of Rochester Medical Center, Rochester, NY, USA
- Mattie Raiford
- Department of Orthopaedics and Physical Performance, University of Rochester Medical Center, Rochester, NY, USA
- Susan McDowell
- Department of Orthopaedics and Physical Performance, University of Rochester Medical Center, Rochester, NY, USA
6
Tong L, Zhang C, Liu R, Yang J, Sun Z. Comparative performance analysis of large language models: ChatGPT-3.5, ChatGPT-4 and Google Gemini in glucocorticoid-induced osteoporosis. J Orthop Surg Res 2024; 19:574. [PMID: 39289734 PMCID: PMC11409482 DOI: 10.1186/s13018-024-04996-2]
Abstract
BACKGROUND The use of large language models (LLMs) in medicine can help physicians improve the quality and effectiveness of health care by increasing the efficiency of medical information management, patient care, medical research, and clinical decision-making. METHODS We collected 34 frequently asked questions about glucocorticoid-induced osteoporosis (GIOP), covering topics related to the disease's clinical manifestations, pathogenesis, diagnosis, treatment, prevention, and risk factors. We also generated 25 questions based on the 2022 American College of Rheumatology Guideline for the Prevention and Treatment of Glucocorticoid-Induced Osteoporosis (2022 ACR-GIOP Guideline). Each question was posed to the LLMs (ChatGPT-3.5, ChatGPT-4, and Google Gemini), and three senior orthopedic surgeons independently rated each response on a scale of 1 to 4 points. A total score (TS) > 9 indicated 'good' responses, 6 ≤ TS ≤ 9 indicated 'moderate' responses, and TS < 6 indicated 'poor' responses. RESULTS In response to the general questions related to GIOP and the 2022 ACR-GIOP Guideline, Google Gemini provided more concise answers than the other LLMs. In terms of pathogenesis, ChatGPT-4 had significantly higher total scores (TSs) than ChatGPT-3.5. The TSs for answering questions related to the 2022 ACR-GIOP Guideline by ChatGPT-4 were significantly higher than those for Google Gemini. ChatGPT-3.5 and ChatGPT-4 had significantly higher self-corrected TSs than pre-corrected TSs, while Google Gemini's self-corrected responses were not significantly different from its initial responses. CONCLUSIONS Our study showed that Google Gemini provides more concise and intuitive responses than ChatGPT-3.5 and ChatGPT-4. ChatGPT-4 performed significantly better than ChatGPT-3.5 and Google Gemini in terms of answering general questions about GIOP and the 2022 ACR-GIOP Guideline. ChatGPT-3.5 and ChatGPT-4 self-corrected better than Google Gemini.
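The scoring scheme described in the methods reduces to simple arithmetic, sketched below for clarity: three raters each assign 1 to 4 points, the total score (TS) is their sum, and TS is binned into the study's good/moderate/poor categories. This is an illustration of the stated thresholds, not the authors' code.

```python
# Minimal sketch of the total-score binning described above.
def categorize(ratings: list[int]) -> tuple[int, str]:
    ts = sum(ratings)          # three raters, each 1-4 points, so TS ranges 3-12
    if ts > 9:
        label = "good"
    elif ts >= 6:
        label = "moderate"
    else:
        label = "poor"
    return ts, label

print(categorize([4, 4, 3]))   # (11, 'good')
print(categorize([2, 3, 2]))   # (7, 'moderate')
print(categorize([1, 2, 2]))   # (5, 'poor')
```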
Affiliation(s)
- Linjian Tong
- Clinical College of Neurology, Neurosurgery and Neurorehabilitation, Tianjin Medical University, Tianjin, 300070, China
- Chaoyang Zhang
- Department of Orthopedics, Tianjin Medical University Baodi Hospital, Tianjin, 301800, China
- Rui Liu
- Clinical College of Neurology, Neurosurgery and Neurorehabilitation, Tianjin Medical University, Tianjin, 300070, China
- Jia Yang
- Clinical College of Neurology, Neurosurgery and Neurorehabilitation, Tianjin Medical University, Tianjin, 300070, China
- Zhiming Sun
- Clinical College of Neurology, Neurosurgery and Neurorehabilitation, Tianjin Medical University, Tianjin, 300070, China
7
Tailor PD, D'Souza HS, Li H, Starr MR. Vision of the future: large language models in ophthalmology. Curr Opin Ophthalmol 2024; 35:391-402. [PMID: 38814572 DOI: 10.1097/icu.0000000000001062]
Abstract
PURPOSE OF REVIEW Large language models (LLMs) are rapidly entering the landscape of medicine in areas from patient interaction to clinical decision-making. This review discusses the evolving role of LLMs in ophthalmology, focusing on their current applications and future potential in enhancing ophthalmic care. RECENT FINDINGS LLMs in ophthalmology have demonstrated potential in improving patient communication and aiding preliminary diagnostics because of their ability to process complex language and generate human-like domain-specific interactions. However, some studies have shown potential for harm and there have been no prospective real-world studies evaluating the safety and efficacy of LLMs in practice. SUMMARY While current applications are largely theoretical and require rigorous safety testing before implementation, LLMs exhibit promise in augmenting patient care quality and efficiency. Challenges such as data privacy and user acceptance must be overcome before LLMs can be fully integrated into clinical practice.
Affiliation(s)
- Haley S D'Souza
- Department of Ophthalmology, Mayo Clinic, Rochester, Minnesota
- Hanzhou Li
- Department of Radiology, Emory University, Atlanta, Georgia, USA
- Matthew R Starr
- Department of Ophthalmology, Mayo Clinic, Rochester, Minnesota
8
Shi R, Liu S, Xu X, Ye Z, Yang J, Le Q, Qiu J, Tian L, Wei A, Shan K, Zhao C, Sun X, Zhou X, Hong J. Benchmarking four large language models' performance of addressing Chinese patients' inquiries about dry eye disease: A two-phase study. Heliyon 2024; 10:e34391. [PMID: 39113991 PMCID: PMC11305187 DOI: 10.1016/j.heliyon.2024.e34391]
Abstract
Purpose To evaluate the performance of four large language models (LLMs)-GPT-4, PaLM 2, Qwen, and Baichuan 2-in generating responses to inquiries from Chinese patients about dry eye disease (DED). Design Two-phase study, including a cross-sectional test in the first phase and a real-world clinical assessment in the second phase. Subjects Eight board-certified ophthalmologists and 46 patients with DED. Methods The chatbots' responses to Chinese patients' inquiries about DED were evaluated in two phases. In the first phase, six senior ophthalmologists subjectively rated the chatbots' responses using a 5-point Likert scale across five domains: correctness, completeness, readability, helpfulness, and safety. Objective readability analysis was performed using a Chinese readability analysis platform. In the second phase, 46 representative patients with DED asked questions of the two language models (GPT-4 and Baichuan 2) that performed best in the first phase and then rated the answers for satisfaction and readability. Two senior ophthalmologists then assessed the responses across the five domains. Main outcome measures Subjective scores for the five domains and objective readability scores in the first phase; patient satisfaction, readability scores, and subjective scores for the five domains in the second phase. Results In the first phase, GPT-4 exhibited superior performance across the five domains (correctness: 4.47; completeness: 4.39; readability: 4.47; helpfulness: 4.49; safety: 4.47, p < 0.05). However, the readability analysis revealed that GPT-4's responses were highly complex, with an average score of 12.86 (p < 0.05) compared to scores of 10.87, 11.53, and 11.26 for Qwen, Baichuan 2, and PaLM 2, respectively. In the second phase, as shown by the scores for the five domains, both GPT-4 and Baichuan 2 were adept at answering questions posed by patients with DED. However, the completeness of Baichuan 2's responses was relatively poor (4.04 vs. 4.48 for GPT-4, p < 0.05). Nevertheless, Baichuan 2's recommendations were more comprehensible than those of GPT-4 (patient readability: 3.91 vs. 4.61, p < 0.05; ophthalmologist readability: 2.67 vs. 4.33). Conclusions The findings underscore the potential of LLMs, particularly GPT-4 and Baichuan 2, in delivering accurate and comprehensive responses to questions from Chinese patients about DED.
Affiliation(s)
- Runhan Shi
- Department of Ophthalmology and Vision Science, State Key Laboratory of Molecular Engineering of Polymers, Fudan University, Shanghai, 200031, China
- NHC Key laboratory of molecular engineering of polymers, Fudan University, Shanghai, 200031, China
- Shanghai Engineering Research Center of Synthetic Immunology, Shanghai, 200032, China
- Department of Ophthalmology, Children's Hospital of Fudan University, National Pediatric Medical Center of China, Shanghai, China
- Steven Liu
- Department of Statistics, College of Liberal Arts & Sciences, University of Illinois Urbana-Champaign, Illinois, USA
- Xinwei Xu
- Faculty of Business and Economics, Hong Kong University, Hong Kong Special Administrative Region, China
- Zhengqiang Ye
- Department of Ophthalmology and Vision Science, State Key Laboratory of Molecular Engineering of Polymers, Fudan University, Shanghai, 200031, China
- Jin Yang
- Department of Ophthalmology and Vision Science, State Key Laboratory of Molecular Engineering of Polymers, Fudan University, Shanghai, 200031, China
- Qihua Le
- Department of Ophthalmology and Vision Science, State Key Laboratory of Molecular Engineering of Polymers, Fudan University, Shanghai, 200031, China
- Jini Qiu
- Department of Ophthalmology and Vision Science, State Key Laboratory of Molecular Engineering of Polymers, Fudan University, Shanghai, 200031, China
- Lijia Tian
- Department of Ophthalmology and Vision Science, State Key Laboratory of Molecular Engineering of Polymers, Fudan University, Shanghai, 200031, China
- Anji Wei
- Department of Ophthalmology and Vision Science, State Key Laboratory of Molecular Engineering of Polymers, Fudan University, Shanghai, 200031, China
- Kun Shan
- Department of Ophthalmology and Vision Science, State Key Laboratory of Molecular Engineering of Polymers, Fudan University, Shanghai, 200031, China
- Chen Zhao
- Department of Ophthalmology and Vision Science, State Key Laboratory of Molecular Engineering of Polymers, Fudan University, Shanghai, 200031, China
- Xinghuai Sun
- Department of Ophthalmology and Vision Science, State Key Laboratory of Molecular Engineering of Polymers, Fudan University, Shanghai, 200031, China
- Xingtao Zhou
- Department of Ophthalmology and Vision Science, State Key Laboratory of Molecular Engineering of Polymers, Fudan University, Shanghai, 200031, China
- Jiaxu Hong
- Department of Ophthalmology and Vision Science, State Key Laboratory of Molecular Engineering of Polymers, Fudan University, Shanghai, 200031, China
- NHC Key laboratory of molecular engineering of polymers, Fudan University, Shanghai, 200031, China
- Shanghai Engineering Research Center of Synthetic Immunology, Shanghai, 200032, China
- Department of Ophthalmology, Children's Hospital of Fudan University, National Pediatric Medical Center of China, Shanghai, China
9
Kim YI, Kim KH, Oh HJ, Seo Y, Kwon SM, Sung KS, Chong K, Lee MH. Assessing the Suitability of Artificial Intelligence-Based Chatbots as Counseling Agents for Patients with Brain Tumor: A Comprehensive Survey Analysis. World Neurosurg 2024; 187:e963-e981. [PMID: 38735564 DOI: 10.1016/j.wneu.2024.05.023]
Abstract
OBJECTIVE The internet, particularly social media, has become a popular resource for learning about health and investigating one's own health conditions. The development of artificial intelligence (AI) chatbots has been fueled by the increasing availability of digital health data and advances in natural language processing techniques. While these chatbots are more accessible than before, they sometimes fail to provide accurate information. METHODS We used representative chatbots currently available (Chat Generative Pretrained Transformer-3.5, Bing Chat, and Google Bard) to answer questions commonly asked by brain tumor patients. Simulated situations containing such questions were created and selected by the brain tumor committee. The goal of the study was introduced to each chatbot, the situation was explained, and questions were asked. All responses were collected without modification. The answers were shown to the committee members, and they were asked to judge the responses while blinded to the type of chatbot. RESULTS There was no significant difference in accuracy and communication ability among the 3 groups (P = 0.253, 0.090, respectively). For empathy, Bing Chat and Google Bard were superior to Chat Generative Pretrained Transformer (P = 0.004, 0.002, respectively). The purpose of this study was not to assess or verify the relative superiority of each chatbot; instead, the aim was to identify the shortcomings and changes needed if AI chatbots are to be used for patient medical purposes. CONCLUSION AI-based chatbots are a convenient way for patients and the general public to access medical information. Under such circumstances, medical professionals must ensure that the information provided to chatbot users is accurate and safe.
Affiliation(s)
- Young Il Kim
- Department of Neurosurgery, St. Vincent's Hospital, College of Medicine, The Catholic University of Korea, Seoul, South Korea
- Kyung Hwan Kim
- Department of Neurosurgery, Chungnam National University Hospital, Chungnam National University School of Medicine, Daejeon, South Korea
- Hyuk-Jin Oh
- Department of Neurosurgery, Soonchunhyang University Cheonan Hospital, Cheonan, South Korea
- Youngbeom Seo
- Department of Neurosurgery, Yeungnam University Hospital, Yeungnam University College of Medicine, Daegu, South Korea
- Sae Min Kwon
- Department of Neurosurgery, Dongsan Medical Center, Keimyung University School of Medicine, Daegu, South Korea
- Kyoung Su Sung
- Department of Neurosurgery, Dong-A University Hospital, Dong-A University College of Medicine, Busan, South Korea
- Kyuha Chong
- Department of Neurosurgery, Brain Tumor Center, Samsung Medical Center, Sungkyunkwan University School of Medicine, Seoul, South Korea
- Min Ho Lee
- Department of Neurosurgery, Uijeongbu St. Mary's Hospital, School of Medicine, The Catholic University of Korea, Seoul, South Korea
10
Yüce A, Yerli M, Misir A, Çakar M. Enhancing patient information texts in orthopaedics: How OpenAI's 'ChatGPT' can help. J Exp Orthop 2024; 11:e70019. [PMID: 39291057 PMCID: PMC11406043 DOI: 10.1002/jeo2.70019]
Abstract
Purpose The internet has become a primary source for patients seeking healthcare information, but the quality of online information, particularly in orthopaedics, often falls short. Orthopaedic surgeons now have the added responsibility of evaluating and guiding patients to credible online resources. This study aimed to assess ChatGPT's ability to identify deficiencies in patient information texts related to total hip arthroplasty websites and to evaluate its potential for enhancing the quality of these texts. Methods In August 2023, 25 websites related to total hip arthroplasty were assessed using a standardized search on Google. Peer-reviewed scientific articles, empty pages, dictionary definitions, and unrelated content were excluded. The remaining 10 websites were evaluated using the hip information scoring system (HISS). ChatGPT was then used to assess these texts, identify deficiencies and provide recommendations. Results The mean HISS score of the websites was 9.5, indicating low to moderate quality. However, after implementing ChatGPT's suggested improvements, the score increased to 21.5, signifying excellent quality. ChatGPT's recommendations included using simpler language, adding FAQs, incorporating patient experiences, addressing cost and insurance issues, detailing preoperative and postoperative phases, including references, and emphasizing emotional and psychological support. The study demonstrates that ChatGPT can significantly enhance patient information quality. Conclusion ChatGPT's role in elevating patient education regarding total hip arthroplasty is promising. This study sheds light on the potential of ChatGPT as an aid to orthopaedic surgeons in producing high-quality patient information materials. Although it cannot replace human expertise, it offers a valuable means of enhancing the quality of healthcare information available online. Level of Evidence Level IV.
Affiliation(s)
- Ali Yüce
- Department of Orthopedic and Traumatology, Prof. Dr. Cemil Taşcıoğlu City Hospital, İstanbul, Turkey
- Mustafa Yerli
- Department of Orthopedic and Traumatology, Prof. Dr. Cemil Taşcıoğlu City Hospital, İstanbul, Turkey
- Abdulhamit Misir
- Department of Orthopedic and Traumatology, Göztepe Medical Park Hospital, İstanbul, Turkey
- Murat Çakar
- Department of Orthopedic and Traumatology, Prof. Dr. Cemil Taşcıoğlu City Hospital, İstanbul, Turkey
11
Yang Z, Wang D, Zhou F, Song D, Zhang Y, Jiang J, Kong K, Liu X, Qiao Y, Chang RT, Han Y, Li F, Tham CC, Zhang X. Understanding natural language: Potential application of large language models to ophthalmology. Asia Pac J Ophthalmol (Phila) 2024; 13:100085. [PMID: 39059558 DOI: 10.1016/j.apjo.2024.100085]
Abstract
Large language models (LLMs), a natural language processing technology based on deep learning, are currently in the spotlight. These models closely mimic natural language comprehension and generation. Their evolution has undergone several waves of innovation similar to convolutional neural networks. The transformer architecture advancement in generative artificial intelligence marks a monumental leap beyond early-stage pattern recognition via supervised learning. With the expansion of parameters and training data (terabytes), LLMs unveil remarkable human interactivity, encompassing capabilities such as memory retention and comprehension. These advances make LLMs particularly well-suited for roles in healthcare communication between medical practitioners and patients. In this comprehensive review, we discuss the trajectory of LLMs and their potential implications for clinicians and patients. For clinicians, LLMs can be used for automated medical documentation, and given better inputs and extensive validation, LLMs may be able to autonomously diagnose and treat in the future. For patient care, LLMs can be used for triage suggestions, summarization of medical documents, explanation of a patient's condition, and customizing patient education materials tailored to their comprehension level. The limitations of LLMs and possible solutions for real-world use are also presented. Given the rapid advancements in this area, this review attempts to briefly cover many roles that LLMs may play in the ophthalmic space, with a focus on improving the quality of healthcare delivery.
Affiliation(s)
- Zefeng Yang
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-Sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, Guangzhou 510060, China
- Deming Wang
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-Sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, Guangzhou 510060, China
- Fengqi Zhou
- Ophthalmology, Mayo Clinic Health System, Eau Claire, Wisconsin, USA
- Diping Song
- Shanghai Artificial Intelligence Laboratory, Shanghai, China
- Yinhang Zhang
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-Sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, Guangzhou 510060, China
- Jiaxuan Jiang
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-Sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, Guangzhou 510060, China
- Kangjie Kong
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-Sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, Guangzhou 510060, China
- Xiaoyi Liu
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-Sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, Guangzhou 510060, China
- Yu Qiao
- Shanghai Artificial Intelligence Laboratory, Shanghai, China
- Robert T Chang
- Department of Ophthalmology, Byers Eye Institute at Stanford University, Palo Alto, CA, USA
- Ying Han
- Department of Ophthalmology, University of California, San Francisco, San Francisco, CA, USA
- Fei Li
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-Sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, Guangzhou 510060, China
- Clement C Tham
- Department of Ophthalmology and Visual Sciences, The Chinese University of Hong Kong, Hong Kong SAR, China; Hong Kong Eye Hospital, Kowloon, Hong Kong SAR, China; Department of Ophthalmology and Visual Sciences, Prince of Wales Hospital, Shatin, Hong Kong SAR, China
- Xiulan Zhang
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-Sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, Guangzhou 510060, China
12
Mandalos A, Tsouris D. Artificial Versus Human Intelligence in the Diagnostic Approach of Ophthalmic Case Scenarios: A Qualitative Evaluation of Performance and Consistency. Cureus 2024; 16:e62471. [PMID: 39015855 PMCID: PMC11251728 DOI: 10.7759/cureus.62471]
Abstract
PURPOSE To evaluate the efficiency of three artificial intelligence (AI) chatbots (ChatGPT-3.5 (OpenAI, San Francisco, California, United States), Bing Copilot (Microsoft Corporation, Redmond, Washington, United States), Google Gemini (Google LLC, Mountain View, California, United States)) in assisting the ophthalmologist in the diagnostic approach and management of challenging ophthalmic cases and compare their performance with that of a practicing human ophthalmic specialist. The secondary aim was to assess the short- and medium-term consistency of ChatGPT's responses. METHODS Eleven ophthalmic case scenarios of variable complexity were presented to the AI chatbots and to an ophthalmic specialist in a stepwise fashion. Advice regarding the initial differential diagnosis, the final diagnosis, further investigation, and management was asked for. One month later, the same process was repeated twice on the same day for ChatGPT only. RESULTS The individual diagnostic performance of all three AI chatbots was inferior to that of the ophthalmic specialist; however, they provided useful complementary input in the diagnostic algorithm. This was especially true for ChatGPT and Bing Copilot. ChatGPT exhibited reasonable short- and medium-term consistency, with the mean Jaccard similarity coefficient of responses varying between 0.58 and 0.76. CONCLUSION AI chatbots may act as useful assisting tools in the diagnosis and management of challenging ophthalmic cases; however, their responses should be scrutinized for potential inaccuracies, and by no means can they replace consultation with an ophthalmic specialist.
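The consistency measure used here, the Jaccard similarity coefficient, can be computed as in the sketch below; treating each response as a set of lowercased words is an assumption made for illustration, since the authors' exact tokenisation is not described.

```python
# Hedged sketch of the Jaccard similarity coefficient between two chatbot
# responses, treated as sets of lowercased words (an illustrative assumption).
def jaccard(a: str, b: str) -> float:
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    return len(set_a & set_b) / len(set_a | set_b)

first  = "likely diagnosis is acute angle closure glaucoma refer urgently"
repeat = "acute angle closure glaucoma is the likely diagnosis urgent referral advised"
print(f"Jaccard similarity: {jaccard(first, repeat):.2f}")
```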
13
Berce C. Artificial intelligence generated clinical score sheets: looking at the two faces of Janus. Lab Anim Res 2024; 40:21. [PMID: 38750604 PMCID: PMC11097593 DOI: 10.1186/s42826-024-00206-6]
Abstract
In vivo experiments are increasingly using clinical score sheets to ensure minimal distress to the animals. A score sheet is a document that includes a list of specific symptoms, behaviours and intervention guidelines, all balanced to allow an objective clinical assessment of experimental animals. Artificial Intelligence (AI) technologies are increasingly being applied in the field of preclinical research, not only in analysis but also in documentation processes, reflecting a significant shift towards more technologically advanced research methodologies. The present study explores the application of Large Language Models (LLM) in generating score sheets for an animal welfare assessment in a preclinical research setting. Focusing on a mouse model of inflammatory bowel disease, the study evaluates the performance of three LLM - ChatGPT-4, ChatGPT-3.5, and Google Bard - in creating clinical score sheets based on specified criteria such as weight loss, stool consistency, and visible fecal blood. Key parameters evaluated include the consistency of structure, accuracy in representing severity levels, and appropriateness of intervention thresholds. The findings reveal a duality in LLM-generated score sheets: while some LLM consistently structure their outputs effectively, all models exhibit notable variations in assigning numerical values to symptoms and defining intervention thresholds accurately. This emphasizes the dual nature of AI performance in this field: its potential to create useful foundational drafts and the critical need for professional review to ensure precision and reliability. The results highlight the significance of balancing AI-generated tools with expert oversight in preclinical research.
Affiliation(s)
- Cristian Berce
- Animal Health and Welfare Division, Federal Food Safety and Veterinary Office, Bern, Switzerland.
14
Biswas S, Davies LN, Sheppard AL, Logan NS, Wolffsohn JS. Utility of artificial intelligence-based large language models in ophthalmic care. Ophthalmic Physiol Opt 2024; 44:641-671. [PMID: 38404172 DOI: 10.1111/opo.13284]
Abstract
PURPOSE With the introduction of ChatGPT, artificial intelligence (AI)-based large language models (LLMs) are rapidly becoming popular within the scientific community. They use natural language processing to generate human-like responses to queries. However, the application of LLMs and comparison of the abilities among different LLMs with their human counterparts in ophthalmic care remain under-reported. RECENT FINDINGS Hitherto, studies in eye care have demonstrated the utility of ChatGPT in generating patient information, clinical diagnosis and passing ophthalmology question-based examinations, among others. LLMs' performance (median accuracy, %) is influenced by factors such as the iteration, prompts utilised and the domain. Human expert (86%) demonstrated the highest proficiency in disease diagnosis, while ChatGPT-4 outperformed others in ophthalmology examinations (75.9%), symptom triaging (98%) and providing information and answering questions (84.6%). LLMs exhibited superior performance in general ophthalmology but reduced accuracy in ophthalmic subspecialties. Although AI-based LLMs like ChatGPT are deemed more efficient than their human counterparts, these AIs are constrained by their nonspecific and outdated training, no access to current knowledge, generation of plausible-sounding 'fake' responses or hallucinations, inability to process images, lack of critical literature analysis and ethical and copyright issues. A comprehensive evaluation of recently published studies is crucial to deepen understanding of LLMs and the potential of these AI-based LLMs. SUMMARY Ophthalmic care professionals should undertake a conservative approach when using AI, as human judgement remains essential for clinical decision-making and monitoring the accuracy of information. This review identified the ophthalmic applications and potential usages which need further exploration. With the advancement of LLMs, setting standards for benchmarking and promoting best practices is crucial. Potential clinical deployment requires the evaluation of these LLMs to move away from artificial settings, delve into clinical trials and determine their usefulness in the real world.
Affiliation(s)
- Sayantan Biswas
- School of Optometry, College of Health and Life Sciences, Aston University, Birmingham, UK
- Leon N Davies
- School of Optometry, College of Health and Life Sciences, Aston University, Birmingham, UK
- Amy L Sheppard
- School of Optometry, College of Health and Life Sciences, Aston University, Birmingham, UK
- Nicola S Logan
- School of Optometry, College of Health and Life Sciences, Aston University, Birmingham, UK
- James S Wolffsohn
- School of Optometry, College of Health and Life Sciences, Aston University, Birmingham, UK
15
Mondal H, Komarraju S, D S, Muralidharan S. Assessing the Capability of Large Language Models in Naturopathy Consultation. Cureus 2024; 16:e59457. [PMID: 38826991 PMCID: PMC11141616 DOI: 10.7759/cureus.59457]
Abstract
Background The rapid advancements in natural language processing have brought about the widespread use of large language models (LLMs) across various medical domains. However, their effectiveness in specialized fields, such as naturopathy, remains relatively unexplored. Objective The study aimed to assess the capability of freely available LLM chatbots in providing naturopathy consultations for various types of diseases and disorders. Methods Five free LLMs (viz., Gemini, Copilot, ChatGPT, Claude, and Perplexity) were used to converse with 20 clinical cases (simulations of real-world scenarios). Each case had the case details and questions pertinent to naturopathy. The responses were presented to three naturopathy doctors with > 5 years of practice. The answers were rated by them on a five-point Likert-like scale for language fluency, coherence, accuracy, and relevancy. The average of these four attributes is termed perfection in this study. Results The overall scores of the LLMs were Gemini 3.81±0.23, Copilot 4.34±0.28, ChatGPT 4.43±0.2, Claude 3.8±0.26, and Perplexity 3.91±0.28 (ANOVA F(3.034, 57.64) = 33.47, P < 0.0001). Together, they showed overall ~80% perfection in consultation. The average measure intraclass correlation coefficient among the LLMs for the overall score was 0.463 (95% CI = -0.028 to 0.76), P = 0.03. Conclusion Although the LLM chatbots could help in providing naturopathy and yoga treatment consultation with an overall fair level of perfection, their responses varied across chatbots and the reliability among them was very low.
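The reported F statistic has non-integer degrees of freedom, which suggests a repeated-measures design with a sphericity correction; the sketch below is only a simplified one-way analogue on hypothetical scores, included to illustrate how an across-chatbot comparison of overall ratings is set up.

```python
# Simplified, hypothetical illustration of comparing overall scores across
# chatbots with a one-way ANOVA (the study itself used a repeated-measures design).
from scipy import stats

gemini     = [3.9, 3.7, 3.8, 4.0, 3.6]
copilot    = [4.3, 4.5, 4.2, 4.4, 4.3]
chatgpt    = [4.4, 4.5, 4.3, 4.6, 4.4]
claude     = [3.8, 3.7, 3.9, 3.6, 3.8]
perplexity = [3.9, 4.0, 3.8, 4.1, 3.7]

f_stat, p_value = stats.f_oneway(gemini, copilot, chatgpt, claude, perplexity)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```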
Affiliation(s)
- Himel Mondal
- Physiology, All India Institute of Medical Sciences, Deoghar, Jharkhand, IND
- Sathyanath D
- Naturopathy and Yoga, National Institute of Naturopathy, Pune, IND
16
Ocakoglu SR, Coskun B. The Emerging Role of AI in Patient Education: A Comparative Analysis of LLM Accuracy for Pelvic Organ Prolapse. Med Princ Pract 2024; 33:000538538. [PMID: 38527444 PMCID: PMC11324208 DOI: 10.1159/000538538]
Abstract
OBJECTIVE This study aimed to evaluate the accuracy, completeness, precision, and readability of outputs generated by three Large Language Models (LLMs): GPT by OpenAI, BARD by Google, and Bing by Microsoft, in comparison to patient education material on Pelvic Organ Prolapse (POP) provided by the Royal College of Obstetricians and Gynecologists (RCOG). METHODS A total of 15 questions were retrieved from the RCOG website and input into the three LLMs. Two independent reviewers evaluated the outputs for accuracy, completeness, and precision. Readability was assessed using the Simplified Measure of Gobbledygook (SMOG) score and the Flesch-Kincaid Grade Level (FKGL) score. RESULTS Significant differences were observed in completeness and precision metrics. ChatGPT ranked highest in completeness (66.7%), while Bing led in precision (100%). No significant differences were observed in accuracy across all models. In terms of readability, ChatGPT exhibited higher difficulty than BARD, Bing, and the original RCOG answers. CONCLUSION While all models displayed a variable degree of correctness, ChatGPT excelled in completeness, significantly surpassing BARD and Bing. However, Bing led in precision, providing the most relevant and concise answers. Regarding readability, ChatGPT exhibited higher difficulty. The study found that while all LLMs showed varying degrees of correctness in answering RCOG questions on patient information for Pelvic Organ Prolapse (POP), ChatGPT was the most comprehensive, but its answers were harder to read. Bing, on the other hand, was the most precise. The findings highlight the potential of LLMs in health information dissemination and the need for careful interpretation of their outputs.
Affiliation(s)
- Burhan Coskun
- Department of Urology, Bursa Uludag University, Bursa, Turkey
17
Wang L, Chen X, Deng X, Wen H, You M, Liu W, Li Q, Li J. Prompt engineering in consistency and reliability with the evidence-based guideline for LLMs. NPJ Digit Med 2024; 7:41. [PMID: 38378899 PMCID: PMC10879172 DOI: 10.1038/s41746-024-01029-4]
Abstract
The use of large language models (LLMs) in clinical medicine is currently thriving. Effectively transferring LLMs' pertinent theoretical knowledge from computer science to their application in clinical medicine is crucial. Prompt engineering has shown potential as an effective method in this regard. To explore the application of prompt engineering in LLMs and to examine the reliability of LLMs, different styles of prompts were designed and used to ask different LLMs about their agreement with the American Academy of Orthopedic Surgeons (AAOS) osteoarthritis (OA) evidence-based guidelines. Each question was asked 5 times. We compared the consistency of the findings with guidelines across different evidence levels for different prompts and assessed the reliability of different prompts by asking the same question 5 times. gpt-4-Web with ROT prompting had the highest overall consistency (62.9%) and a significant performance for strong recommendations, with a total consistency of 77.5%. The reliability of the different LLMs for different prompts was not stable (Fleiss kappa ranged from -0.002 to 0.984). This study revealed that different prompts had variable effects across various models, and the gpt-4-Web with ROT prompt was the most consistent. An appropriate prompt could improve the accuracy of responses to professional medical questions.
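Fleiss' kappa, the agreement statistic used here to gauge how reliably a model repeats the same answer, can be computed as in the minimal sketch below. The table layout (questions as rows, answer categories as columns, five repeated queries per question) and the counts are illustrative assumptions, not the study data.

```python
# Minimal numpy sketch of Fleiss' kappa for repeated-query agreement: each row is
# one guideline question, each column an answer category, and each cell counts how
# many of the repeated queries fell into that category (equal repeats assumed).
import numpy as np

def fleiss_kappa(table: np.ndarray) -> float:
    table = np.asarray(table, dtype=float)
    n = table.sum(axis=1)[0]                      # repeats per question
    N = table.shape[0]                            # number of questions
    p_j = table.sum(axis=0) / (N * n)             # category proportions
    P_i = (np.sum(table ** 2, axis=1) - n) / (n * (n - 1))
    P_bar, P_e = P_i.mean(), np.sum(p_j ** 2)
    return (P_bar - P_e) / (1 - P_e)

# Hypothetical counts: 4 questions x 3 answer categories, 5 repeats each.
counts = np.array([[5, 0, 0],
                   [4, 1, 0],
                   [2, 2, 1],
                   [0, 5, 0]])
print(f"Fleiss' kappa = {fleiss_kappa(counts):.3f}")
```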
Affiliation(s)
- Li Wang
- Sports Medicine Center, West China Hospital, Sichuan University, Chengdu, China
- Department of Orthopedics and Orthopedic Research Institute, West China Hospital, Sichuan University, Chengdu, China
- Xi Chen
- Sports Medicine Center, West China Hospital, Sichuan University, Chengdu, China
- Department of Orthopedics and Orthopedic Research Institute, West China Hospital, Sichuan University, Chengdu, China
- XiangWen Deng
- Shenzhen International Graduate School, Tsinghua University, Beijing, China
- Hao Wen
- Shenzhen International Graduate School, Tsinghua University, Beijing, China
- MingKe You
- Sports Medicine Center, West China Hospital, Sichuan University, Chengdu, China
- Department of Orthopedics and Orthopedic Research Institute, West China Hospital, Sichuan University, Chengdu, China
- WeiZhi Liu
- Sports Medicine Center, West China Hospital, Sichuan University, Chengdu, China
- Department of Orthopedics and Orthopedic Research Institute, West China Hospital, Sichuan University, Chengdu, China
- Qi Li
- Sports Medicine Center, West China Hospital, Sichuan University, Chengdu, China
- Department of Orthopedics and Orthopedic Research Institute, West China Hospital, Sichuan University, Chengdu, China
- Jian Li
- Sports Medicine Center, West China Hospital, Sichuan University, Chengdu, China
- Department of Orthopedics and Orthopedic Research Institute, West China Hospital, Sichuan University, Chengdu, China
18
Zandi R, Fahey JD, Drakopoulos M, Bryan JM, Dong S, Bryar PJ, Bidwell AE, Bowen RC, Lavine JA, Mirza RG. Exploring Diagnostic Precision and Triage Proficiency: A Comparative Study of GPT-4 and Bard in Addressing Common Ophthalmic Complaints. Bioengineering (Basel) 2024; 11:120. [PMID: 38391606 PMCID: PMC10886029 DOI: 10.3390/bioengineering11020120]
Abstract
In the modern era, patients often resort to the internet for answers to their health-related concerns, and clinics face challenges in providing timely responses to those concerns. This has created a need to investigate the capabilities of AI chatbots for ophthalmic diagnosis and triage. In this in silico study, 80 simulated patient complaints in ophthalmology with varying urgency levels and clinical descriptors were entered into both ChatGPT and Bard in a systematic 3-step submission process asking the chatbots to triage, diagnose, and evaluate urgency. Three ophthalmologists graded the chatbot responses. Chatbots were significantly better at ophthalmic triage than at diagnosis (90.0% appropriate triage vs. 48.8% correct leading diagnosis; p < 0.001), and GPT-4 performed better than Bard for appropriate triage recommendations (96.3% vs. 83.8%; p = 0.008), grader satisfaction for patient use (81.3% vs. 55.0%; p < 0.001), and lower potential harm rates (6.3% vs. 20.0%; p = 0.010). Including more descriptors improved diagnostic accuracy for both GPT-4 and Bard. These results indicate that chatbots may not need to recognize the correct diagnosis to provide appropriate ophthalmic triage, and these tools have potential utility in aiding patients or triage staff; however, they are not a replacement for professional ophthalmic evaluation or advice.
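As a rough, hedged illustration of the kind of head-to-head comparison reported above, the sketch below tests whether two chatbots differ in their appropriate-triage rates using Fisher's exact test on a 2x2 table. The counts are assumptions reconstructed from the reported percentages (roughly 96.3% and 83.8% of 80 vignettes); the study's actual statistical procedure may differ.
```python
# Minimal sketch (illustrative counts, not the study's data pipeline):
# compare the proportion of appropriately triaged vignettes between two
# chatbots with Fisher's exact test on a 2x2 contingency table.
from scipy.stats import fisher_exact

N_CASES = 80                 # simulated complaints given to each chatbot
gpt4_appropriate = 77        # assumed count, ~96.3% of 80
bard_appropriate = 67        # assumed count, ~83.8% of 80

table = [
    [gpt4_appropriate, N_CASES - gpt4_appropriate],   # GPT-4: appropriate vs. not
    [bard_appropriate, N_CASES - bard_appropriate],   # Bard: appropriate vs. not
]
odds_ratio, p_value = fisher_exact(table)
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.3f}")
```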
Collapse
Affiliation(s)
- Roya Zandi
- Department of Ophthalmology, Feinberg School of Medicine, Northwestern University, Chicago, IL 60611, USA
| | - Joseph D Fahey
- Department of Ophthalmology, Feinberg School of Medicine, Northwestern University, Chicago, IL 60611, USA
| | - Michael Drakopoulos
- Department of Ophthalmology, Feinberg School of Medicine, Northwestern University, Chicago, IL 60611, USA
| | - John M Bryan
- Department of Ophthalmology, Feinberg School of Medicine, Northwestern University, Chicago, IL 60611, USA
| | - Siyuan Dong
- Division of Biostatistics, Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, Chicago, IL 60611, USA
| | - Paul J Bryar
- Department of Ophthalmology, Feinberg School of Medicine, Northwestern University, Chicago, IL 60611, USA
| | - Ann E Bidwell
- Department of Ophthalmology, Feinberg School of Medicine, Northwestern University, Chicago, IL 60611, USA
| | - R Chris Bowen
- Department of Ophthalmology, Feinberg School of Medicine, Northwestern University, Chicago, IL 60611, USA
| | - Jeremy A Lavine
- Department of Ophthalmology, Feinberg School of Medicine, Northwestern University, Chicago, IL 60611, USA
| | - Rukhsana G Mirza
- Department of Ophthalmology, Feinberg School of Medicine, Northwestern University, Chicago, IL 60611, USA
| |
Collapse
|
19
|
Malik S, Zaheer S. ChatGPT as an aid for pathological diagnosis of cancer. Pathol Res Pract 2024; 253:154989. [PMID: 38056135 DOI: 10.1016/j.prp.2023.154989] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 08/23/2023] [Revised: 11/26/2023] [Accepted: 11/27/2023] [Indexed: 12/08/2023]
Abstract
Diagnostic workup of cancer patients is highly reliant on the science of pathology, using cytopathology, histopathology, and ancillary techniques such as immunohistochemistry and molecular cytogenetics. Data processing and learning by means of artificial intelligence (AI) have become a spearhead for the advancement of medicine, with pathology and laboratory medicine being no exceptions. ChatGPT, an AI-based chatbot recently launched by OpenAI, is currently the talk of the town, and its role in cancer diagnosis is also being explored meticulously. A pathology workflow that integrates digital slides, advanced algorithms, and computer-aided diagnostic techniques extends the pathologist's view beyond the microscopic slide and enables effective integration, assimilation, and utilization of knowledge beyond human limits. Despite its numerous advantages in the pathological diagnosis of cancer, it comes with several challenges, such as the integration of digital slides with input language parameters, problems of bias, and legal issues, which must be addressed soon so that pathologists diagnosing malignancies remain on the same bandwagon and do not miss the train.
Collapse
Affiliation(s)
- Shaivy Malik
- Department of Pathology, Vardhman Mahavir Medical College and Safdarjung Hospital, New Delhi, India
| | - Sufian Zaheer
- Department of Pathology, Vardhman Mahavir Medical College and Safdarjung Hospital, New Delhi, India.
| |
Collapse
|
20
|
Alotaibi SS, Rehman A, Hasnain M. Revolutionizing ocular cancer management: a narrative review on exploring the potential role of ChatGPT. Front Public Health 2023; 11:1338215. [PMID: 38192545 PMCID: PMC10773849 DOI: 10.3389/fpubh.2023.1338215] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/14/2023] [Accepted: 12/04/2023] [Indexed: 01/10/2024] Open
Abstract
This paper pioneers the exploration of ocular cancer and its management with the help of artificial intelligence (AI) technology. Existing literature reports a significant increase in new eye cancer cases in 2023, reflecting a higher incidence rate. Extensive searches were conducted using online databases such as PubMed, the ACM Digital Library, ScienceDirect, and Springer, and the review follows the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. Of the 62 studies collected, only 20 met the inclusion criteria. The review identifies seven ocular cancer types and highlights important challenges associated with ocular cancer, including limited awareness about eye cancer, restricted healthcare access, financial barriers, and insufficient infrastructure support. Financial barriers are among the most widely examined ocular cancer challenges in the literature. The potential role and limitations of ChatGPT are discussed, emphasizing its usefulness in providing general information to physicians while noting its inability to deliver up-to-date information. The paper concludes by presenting potential future applications of ChatGPT to advance research on ocular cancer globally.
Collapse
Affiliation(s)
- Saud S. Alotaibi
- Information Systems Department, Umm Al-Qura University, Makkah, Saudi Arabia
| | - Amna Rehman
- Department of Computer Science, Lahore Leads University, Lahore, Pakistan
| | - Muhammad Hasnain
- Department of Computer Science, Lahore Leads University, Lahore, Pakistan
| |
Collapse
|