1. Kalaw FGP, Baxter SL. Ethical considerations for large language models in ophthalmology. Curr Opin Ophthalmol 2024;35:438-446. PMID: 39259616; PMCID: PMC11427135; DOI: 10.1097/icu.0000000000001083.
Abstract
PURPOSE OF REVIEW This review aims to summarize and discuss the ethical considerations regarding large language model (LLM) use in the field of ophthalmology. RECENT FINDINGS This review of 47 articles on LLM applications in ophthalmology highlights their diverse potential uses, including education, research, clinical decision support, and surgical assistance (as an aid in operative notes). We also review ethical considerations such as the inability of LLMs to interpret data accurately, the risk of promoting controversial or harmful recommendations, and breaches of data privacy. These concerns imply the need for cautious integration of artificial intelligence in healthcare, emphasizing human oversight, transparency, and accountability to mitigate risks and uphold ethical standards. SUMMARY The integration of LLMs in ophthalmology offers potential advantages such as aiding in clinical decision support and facilitating medical education through their ability to process queries and analyze ophthalmic imaging and clinical cases. However, their utilization also raises ethical concerns regarding data privacy, potential misinformation, and biases inherent in the datasets used. These concerns must be addressed in order to optimize the utility of LLMs in the healthcare setting, and responsible, careful use by consumers should be promoted.
Affiliation(s)
- Fritz Gerald P Kalaw
- Division of Ophthalmology Informatics and Data Science, Viterbi Family Department of Ophthalmology and Shiley Eye Institute
- Department of Biomedical Informatics, University of California San Diego Health System, University of California San Diego, La Jolla, California, USA
- Sally L Baxter
- Division of Ophthalmology Informatics and Data Science, Viterbi Family Department of Ophthalmology and Shiley Eye Institute
- Department of Biomedical Informatics, University of California San Diego Health System, University of California San Diego, La Jolla, California, USA
2. Bellanda VCF, Santos MLD, Ferraz DA, Jorge R, Melo GB. Applications of ChatGPT in the diagnosis, management, education, and research of retinal diseases: a scoping review. Int J Retina Vitreous 2024;10:79. PMID: 39420407; PMCID: PMC11487877; DOI: 10.1186/s40942-024-00595-9.
Abstract
PURPOSE This scoping review aims to explore the current applications of ChatGPT in the retina field, highlighting its potential, challenges, and limitations. METHODS A comprehensive literature search was conducted across multiple databases, including PubMed, Scopus, MEDLINE, and Embase, to identify relevant articles published from 2022 onwards. The inclusion criteria focused on studies evaluating the use of ChatGPT in retinal healthcare. Data were extracted and synthesized to map the scope of ChatGPT's applications in retinal care, categorizing articles into various practical application areas such as academic research, charting, coding, diagnosis, disease management, and patient counseling. RESULTS A total of 68 articles were included in the review, distributed across several categories: 8 related to academics and research, 5 to charting, 1 to coding and billing, 44 to diagnosis, 49 to disease management, 2 to literature consulting, 23 to medical education, and 33 to patient counseling. Many articles were classified into multiple categories due to overlapping topics. The findings indicate that while ChatGPT shows significant promise in areas such as medical education and diagnostic support, concerns regarding accuracy, reliability, and the potential for misinformation remain prevalent. CONCLUSION ChatGPT offers substantial potential in advancing retinal healthcare by supporting clinical decision-making, enhancing patient education, and automating administrative tasks. However, its current limitations, particularly in clinical accuracy and the risk of generating misinformation, necessitate cautious integration into practice, with continuous oversight from healthcare professionals. Future developments should focus on improving accuracy, incorporating up-to-date medical guidelines, and minimizing the risks associated with AI-driven healthcare tools.
Affiliation(s)
- Victor C F Bellanda
- Ribeirão Preto Medical School, University of São Paulo, 3900 Bandeirantes Ave, Ribeirão Preto, SP, 14049-900, Brazil.
- Rodrigo Jorge
- Ribeirão Preto Medical School, University of São Paulo, 3900 Bandeirantes Ave, Ribeirão Preto, SP, 14049-900, Brazil
- Gustavo Barreto Melo
- Sergipe Eye Hospital, Aracaju, SE, Brazil
- Paulista School of Medicine, Federal University of São Paulo, São Paulo, SP, Brazil
3. Chotcomwongse P, Ruamviboonsuk P, Grzybowski A. Utilizing Large Language Models in Ophthalmology: The Current Landscape and Challenges. Ophthalmol Ther 2024;13:2543-2558. PMID: 39180701; PMCID: PMC11408418; DOI: 10.1007/s40123-024-01018-6.
Abstract
A large language model (LLM) is an artificial intelligence (AI) model that uses natural language processing (NLP) to understand, interpret, and generate human-like language responses from unstructured text input. Its real-time response capabilities and eloquent dialogue enhance the interactive user experience in human-AI communication like never before. By drawing on numerous sources from the internet, LLM chatbots can interact with users and respond to a wide range of queries, including problem solving, text summarization, and the creation of informative notes. Since ophthalmology is one of the medical fields integrating image analysis, telemedicine, AI, and other technologies, LLMs are likely to play an important role in eye care in the near future. This review summarizes the performance and potential applicability of LLMs in ophthalmology according to currently available publications.
Affiliation(s)
- Peranut Chotcomwongse
- Vitreoretina Unit, Department of Ophthalmology, Rajavithi Hospital, Rangsit University, Bangkok, Thailand
- Paisan Ruamviboonsuk
- Vitreoretina Unit, Department of Ophthalmology, Rajavithi Hospital, Rangsit University, Bangkok, Thailand
- Andrzej Grzybowski
- University of Warmia and Mazury, Olsztyn, Poland.
- Institute for Research in Ophthalmology, Foundation for Ophthalmology Development, 61-553, Poznan, Poland.
4. Wang Y, Liu C, Zhou K, Zhu T, Han X. Towards regulatory generative AI in ophthalmology healthcare: a security and privacy perspective. Br J Ophthalmol 2024;108:1349-1353. PMID: 38834290; DOI: 10.1136/bjo-2024-325167.
Abstract
As the healthcare community increasingly harnesses the power of generative artificial intelligence (AI), critical issues of security, privacy and regulation take centre stage. In this paper, we explore the security and privacy risks of generative AI from model-level and data-level perspectives. Moreover, we elucidate the potential consequences and case studies within the domain of ophthalmology. Model-level risks include knowledge leakage from the model and model safety under AI-specific attacks, while data-level risks involve unauthorised data collection and data accuracy concerns. Within the healthcare context, these risks can bear severe consequences, encompassing potential breaches of sensitive information, violations of privacy rights and threats to patient safety. This paper not only highlights these challenges but also elucidates governance-driven solutions that adhere to AI and healthcare regulations. We advocate for preparedness against potential threats, call for transparency enhancements and underscore the necessity of clinical validation before real-world implementation. Improving security and privacy in generative AI warrants emphasising the role of ophthalmologists and other healthcare providers, as well as the timely introduction of comprehensive regulations.
Affiliation(s)
- Yueye Wang
- Sun Yat-sen University Zhongshan Ophthalmic Center State Key Laboratory of Ophthalmology, Guangzhou, Guangdong, China
- Chi Liu
- Faculty of Data Science, City University of Macau, Macao SAR, China
- Keyao Zhou
- Department of Ophthalmology, Guangdong Provincial People's Hospital, Guangzhou, Guangdong, China
- Department of Neurosurgery, Huashan Hospital, Fudan University, Shanghai, China
- Tianqing Zhu
- Faculty of Data Science, City University of Macau, Macao SAR, China
- Xiaotong Han
- Sun Yat-sen University Zhongshan Ophthalmic Center State Key Laboratory of Ophthalmology, Guangzhou, Guangdong, China
5. Wu JH, Nishida T, Liu TYA. Accuracy of large language models in answering ophthalmology board-style questions: A meta-analysis. Asia Pac J Ophthalmol (Phila) 2024;13:100106. PMID: 39374807; DOI: 10.1016/j.apjo.2024.100106.
Abstract
PURPOSE To evaluate the accuracy of large language models (LLMs) in answering ophthalmology board-style questions. DESIGN Meta-analysis. METHODS A literature search was conducted using PubMed and Embase in March 2024. We included full-length articles and research letters published in English that reported the accuracy of LLMs in answering ophthalmology board-style questions. Data on LLM performance, including the number of questions submitted and correct responses generated, were extracted for each question set from individual studies. Pooled accuracy was calculated using a random-effects model. Subgroup analyses were performed based on the LLMs used and specific ophthalmology topics assessed. RESULTS Among the 14 studies retrieved, 13 (93%) tested LLMs on multiple ophthalmology topics. ChatGPT-3.5, ChatGPT-4, Bard, and Bing Chat were assessed in 12 (86%), 11 (79%), 4 (29%), and 4 (29%) studies, respectively. The overall pooled accuracy of LLMs was 0.65 (95% CI: 0.61-0.69). Among the different LLMs, ChatGPT-4 achieved the highest pooled accuracy at 0.74 (95% CI: 0.73-0.79), while ChatGPT-3.5 recorded the lowest at 0.52 (95% CI: 0.51-0.54). LLMs performed best in "pathology" (0.78 [95% CI: 0.70-0.86]) and worst in "fundamentals and principles of ophthalmology" (0.52 [95% CI: 0.48-0.56]). CONCLUSIONS The overall accuracy of LLMs in answering ophthalmology board-style questions was acceptable but not exceptional, with ChatGPT-4 and Bing Chat being the top-performing models. Performance varied significantly based on the specific ophthalmology topics tested. These inconsistent performances are of concern, highlighting the need for future studies to include ophthalmology board-style questions with images to more comprehensively examine the competency of LLMs.
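To make the pooling step concrete, the following is a minimal Python sketch of a random-effects meta-analysis of accuracy proportions on the logit scale with a DerSimonian-Laird estimate of between-study variance; this is one common approach rather than necessarily the exact method used in the study, and the question-set counts below are invented placeholders, not data from the meta-analysis.

# Minimal sketch (not the study's code) of random-effects pooling of accuracy
# proportions. Each tuple is (correct responses, questions submitted) for a
# hypothetical question set; the numbers are placeholders.
import math

studies = [(130, 200), (88, 120), (410, 600)]

# Logit-transform each proportion; the approximate within-study variance of
# the logit is 1/k + 1/(n - k).
effects, variances = [], []
for k, n in studies:
    p = k / n
    effects.append(math.log(p / (1 - p)))
    variances.append(1 / k + 1 / (n - k))

# DerSimonian-Laird estimate of between-study variance (tau^2).
w = [1 / v for v in variances]
fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
tau2 = max(0.0, (q - (len(studies) - 1)) / c)

# Random-effects weights, pooled logit, and back-transformation to a proportion.
w_re = [1 / (v + tau2) for v in variances]
pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
se = math.sqrt(1 / sum(w_re))

def expit(x):
    return 1 / (1 + math.exp(-x))

print(f"Pooled accuracy {expit(pooled):.2f} "
      f"(95% CI {expit(pooled - 1.96 * se):.2f}-{expit(pooled + 1.96 * se):.2f})")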
Affiliation(s)
- Jo-Hsuan Wu
- Edward S. Harkness Eye Institute, Department of Ophthalmology, Columbia University Irving Medical Center, New York, NY 10032, USA; Shiley Eye Institute and Viterbi Family Department of Ophthalmology, University of California, San Diego, CA 92093, USA
- Takashi Nishida
- Shiley Eye Institute and Viterbi Family Department of Ophthalmology, University of California, San Diego, CA 92093, USA
- T Y Alvin Liu
- Retina Division, Wilmer Eye Institute, Johns Hopkins University, Baltimore, MD 21287, USA.
6. Vaishya R, Iyengar KP, Patralekh MK, Botchu R, Shirodkar K, Jain VK, Vaish A, Scarlat MM. Effectiveness of AI-powered Chatbots in responding to orthopaedic postgraduate exam questions-an observational study. Int Orthop 2024;48:1963-1969. PMID: 38619565; DOI: 10.1007/s00264-024-06182-9.
Abstract
PURPOSE This study analyses the performance and proficiency of three artificial intelligence (AI) generative chatbots (ChatGPT-3.5, ChatGPT-4.0, and Bard Google AI®) in answering the multiple choice questions (MCQs) of postgraduate (PG) level orthopaedic qualifying examinations. METHODS A series of 120 mock 'Single Best Answer' (SBA) MCQs, each with four possible options (A, B, C and D), covering various musculoskeletal (MSK) conditions across the Trauma and Orthopaedic curricula was compiled. A standardised text prompt was used to feed the questions to the ChatGPT (3.5 and 4.0) and Google Bard programs, and the responses were then statistically analysed. RESULTS Significant differences were found between the responses of ChatGPT-3.5 and ChatGPT-4.0 (Chi square = 27.2, P < 0.001), and on comparing both ChatGPT-3.5 (Chi square = 63.852, P < 0.001) and ChatGPT-4.0 (Chi square = 44.246, P < 0.001) with Bard Google AI®. Bard Google AI® had 100% efficiency and was significantly more efficient than both ChatGPT-3.5 and ChatGPT-4.0 (p < 0.0001). CONCLUSION The results demonstrate the variable ability of the different AI generative chatbots (ChatGPT-3.5, ChatGPT-4.0 and Google Bard) to answer the MCQs of PG-level orthopaedic qualifying examinations. Bard Google AI® showed superior performance to both ChatGPT versions, underlining the potential of such large language models in processing and applying orthopaedic subspecialty knowledge at a PG level.
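As a rough illustration of the chi-square comparisons reported above, the sketch below (Python, using scipy) tests whether two chatbots differ in the proportion of the 120 MCQs answered correctly; the counts and chatbot labels are hypothetical and do not reproduce the study's data.

# Hypothetical 2x2 contingency table: rows = chatbots, columns = (correct,
# incorrect) out of 120 MCQs. The counts are placeholders, not the study's results.
from scipy.stats import chi2_contingency

observed = [
    [78, 42],   # chatbot A: correct, incorrect
    [104, 16],  # chatbot B: correct, incorrect
]

chi2, p, dof, _ = chi2_contingency(observed)
print(f"Chi square = {chi2:.3f}, dof = {dof}, P = {p:.4f}")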
Affiliation(s)
- Raju Vaishya
- Department of Orthopaedics, Indraprastha Apollo Hospitals, Sarita Vihar, New Delhi, 110076, India.
- Karthikeyan P Iyengar
- Department of Orthopaedics, Southport and Ormskirk Hospital, Mersey West Lancashire Teaching NHS Trust, Southport, UK
- Rajesh Botchu
- Department of Musculoskeletal Radiology, Royal Orthopedic Hospital, Birmingham, UK
- Kapil Shirodkar
- Department of Musculoskeletal Radiology, Royal Orthopedic Hospital, Birmingham, UK
- Abhishek Vaish
- Department of Orthopaedics, Indraprastha Apollo Hospitals, Sarita Vihar, New Delhi, 110076, India
7. Yang Z, Wang D, Zhou F, Song D, Zhang Y, Jiang J, Kong K, Liu X, Qiao Y, Chang RT, Han Y, Li F, Tham CC, Zhang X. Understanding natural language: Potential application of large language models to ophthalmology. Asia Pac J Ophthalmol (Phila) 2024;13:100085. PMID: 39059558; DOI: 10.1016/j.apjo.2024.100085.
Abstract
Large language models (LLMs), a natural language processing technology based on deep learning, are currently in the spotlight. These models closely mimic natural language comprehension and generation. Their evolution has undergone several waves of innovation similar to those of convolutional neural networks. The advancement of the transformer architecture in generative artificial intelligence marks a monumental leap beyond early-stage pattern recognition via supervised learning. With the expansion of parameters and training data (terabytes), LLMs exhibit remarkable interactivity with humans, encompassing capabilities such as memory retention and comprehension. These advances make LLMs particularly well-suited for roles in healthcare communication between medical practitioners and patients. In this comprehensive review, we discuss the trajectory of LLMs and their potential implications for clinicians and patients. For clinicians, LLMs can be used for automated medical documentation, and, given better inputs and extensive validation, they may be able to diagnose and treat autonomously in the future. For patient care, LLMs can be used for triage suggestions, summarization of medical documents, explanation of a patient's condition, and customization of patient education materials tailored to their comprehension level. The limitations of LLMs and possible solutions for real-world use are also presented. Given the rapid advancements in this area, this review attempts to briefly cover the many roles that LLMs may play in the ophthalmic space, with a focus on improving the quality of healthcare delivery.
Affiliation(s)
- Zefeng Yang
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-Sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, Guangzhou 510060, China
- Deming Wang
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-Sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, Guangzhou 510060, China
- Fengqi Zhou
- Ophthalmology, Mayo Clinic Health System, Eau Claire, Wisconsin, USA
- Diping Song
- Shanghai Artificial Intelligence Laboratory, Shanghai, China
- Yinhang Zhang
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-Sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, Guangzhou 510060, China
- Jiaxuan Jiang
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-Sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, Guangzhou 510060, China
- Kangjie Kong
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-Sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, Guangzhou 510060, China
- Xiaoyi Liu
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-Sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, Guangzhou 510060, China
- Yu Qiao
- Shanghai Artificial Intelligence Laboratory, Shanghai, China
- Robert T Chang
- Department of Ophthalmology, Byers Eye Institute at Stanford University, Palo Alto, CA, USA
- Ying Han
- Department of Ophthalmology, University of California, San Francisco, San Francisco, CA, USA
- Fei Li
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-Sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, Guangzhou 510060, China.
- Clement C Tham
- Department of Ophthalmology and Visual Sciences, The Chinese University of Hong Kong, Hong Kong SAR, China; Hong Kong Eye Hospital, Kowloon, Hong Kong SAR, China; Department of Ophthalmology and Visual Sciences, Prince of Wales Hospital, Shatin, Hong Kong SAR, China.
- Xiulan Zhang
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-Sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, Guangzhou 510060, China.
8. Rossettini G, Rodeghiero L, Corradi F, Cook C, Pillastrini P, Turolla A, Castellini G, Chiappinotto S, Gianola S, Palese A. Comparative accuracy of ChatGPT-4, Microsoft Copilot and Google Gemini in the Italian entrance test for healthcare sciences degrees: a cross-sectional study. BMC Med Educ 2024;24:694. PMID: 38926809; PMCID: PMC11210096; DOI: 10.1186/s12909-024-05630-9.
Abstract
BACKGROUND Artificial intelligence (AI) chatbots are emerging educational tools for students in the healthcare sciences. However, assessing their accuracy is essential prior to adoption in educational settings. This study aimed to assess the accuracy of three AI chatbots (ChatGPT-4, Microsoft Copilot and Google Gemini) in predicting the correct answers in the Italian entrance standardized examination test for healthcare science degrees (CINECA test). Secondarily, we assessed the narrative coherence of the AI chatbots' responses (i.e., text output) based on three qualitative metrics: the logical rationale behind the chosen answer, the presence of information internal to the question, and the presence of information external to the question. METHODS An observational cross-sectional design was used in September 2023. The accuracy of the three chatbots was evaluated on the CINECA test, in which questions follow a multiple-choice format with a single best answer. The outcome was binary (correct or incorrect). A chi-squared test and a post hoc analysis with Bonferroni correction assessed differences in accuracy among the chatbots. A p-value of < 0.05 was considered statistically significant. A sensitivity analysis was performed, excluding answers that were not applicable (e.g., images). Narrative coherence was analyzed using absolute and relative frequencies of correct answers and errors. RESULTS Overall, of the 820 CINECA multiple-choice questions inputted into all chatbots, 20 questions were not imported into ChatGPT-4 (n = 808) and Google Gemini (n = 808) due to technical limitations. We found statistically significant differences in the ChatGPT-4 vs Google Gemini and Microsoft Copilot vs Google Gemini comparisons (p-value < 0.001). The analysis of narrative coherence revealed "Logical reasoning" as the prevalent category among correct answers (n = 622, 81.5%) and "Logical error" as the prevalent category among incorrect answers (n = 40, 88.9%). CONCLUSIONS Our main findings reveal that: (A) the AI chatbots performed well; (B) ChatGPT-4 and Microsoft Copilot performed better than Google Gemini; and (C) their narrative coherence is primarily logical. Although the AI chatbots showed promising accuracy in predicting the correct answers in the Italian entrance university standardized examination test, we encourage candidates to incorporate this new technology cautiously, as a supplement to their learning rather than a primary resource. TRIAL REGISTRATION Not required.
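The pairwise accuracy comparison with Bonferroni correction described above can be sketched as follows (Python with scipy); the per-chatbot counts and labels are invented placeholders, not the study's data.

# Pairwise chi-squared tests of accuracy between chatbots, with a Bonferroni
# correction for the number of comparisons. Counts are made-up placeholders.
from itertools import combinations
from scipy.stats import chi2_contingency

# chatbot -> (correct, incorrect) answers; hypothetical values
results = {
    "ChatGPT-4": (700, 108),
    "Copilot": (690, 130),
    "Gemini": (600, 208),
}

pairs = list(combinations(results, 2))
alpha = 0.05 / len(pairs)  # Bonferroni-adjusted significance threshold

for a, b in pairs:
    table = [list(results[a]), list(results[b])]
    chi2, p, _, _ = chi2_contingency(table)
    flag = "significant" if p < alpha else "not significant"
    print(f"{a} vs {b}: chi2 = {chi2:.2f}, p = {p:.4g} "
          f"({flag} at Bonferroni alpha = {alpha:.4f})")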
Affiliation(s)
- Giacomo Rossettini
- School of Physiotherapy, University of Verona, Verona, Italy.
- Department of Physiotherapy, Faculty of Sport Sciences, Universidad Europea de Madrid, Villaviciosa de Odón, 28670, Spain.
- Lia Rodeghiero
- Department of Rehabilitation, Hospital of Merano (SABES-ASDAA), Teaching Hospital of Paracelsus Medical University (PMU), Merano-Meran, Italy.
- Chad Cook
- Department of Orthopaedics, Duke University, Durham, NC, USA
- Duke Clinical Research Institute, Duke University, Durham, NC, USA
- Department of Population Health Sciences, Duke University, Durham, NC, USA
- Paolo Pillastrini
- Department of Biomedical and Neuromotor Sciences (DIBINEM), Alma Mater University of Bologna, Bologna, Italy
- Unit of Occupational Medicine, IRCCS Azienda Ospedaliero-Universitaria Di Bologna, Bologna, Italy
- Andrea Turolla
- Department of Biomedical and Neuromotor Sciences (DIBINEM), Alma Mater University of Bologna, Bologna, Italy
- Unit of Occupational Medicine, IRCCS Azienda Ospedaliero-Universitaria Di Bologna, Bologna, Italy
- Greta Castellini
- Unit of Clinical Epidemiology, IRCCS Istituto Ortopedico Galeazzi, Milan, Italy
- Silvia Gianola
- Unit of Clinical Epidemiology, IRCCS Istituto Ortopedico Galeazzi, Milan, Italy.
- Alvisa Palese
- Department of Medical Sciences, University of Udine, Udine, Italy.
9. Wang L, Chen X, Deng X, Wen H, You M, Liu W, Li Q, Li J. Prompt engineering in consistency and reliability with the evidence-based guideline for LLMs. NPJ Digit Med 2024;7:41. PMID: 38378899; PMCID: PMC10879172; DOI: 10.1038/s41746-024-01029-4.
Abstract
The use of large language models (LLMs) in clinical medicine is currently thriving. Effectively transferring LLMs' pertinent theoretical knowledge from computer science to their application in clinical medicine is crucial, and prompt engineering has shown potential as an effective method in this regard. To explore the application of prompt engineering to LLMs and to examine their reliability, different styles of prompts were designed and used to ask different LLMs about their agreement with the American Academy of Orthopaedic Surgeons (AAOS) osteoarthritis (OA) evidence-based guidelines. Each question was asked 5 times. We compared the consistency of the findings with the guidelines across different evidence levels for different prompts and assessed the reliability of each prompt by asking the same question 5 times. The gpt-4-Web model with ROT prompting had the highest overall consistency (62.9%) and significant performance for strong recommendations, with a total consistency of 77.5%. The reliability of the different LLMs for different prompts was not stable (Fleiss kappa ranged from -0.002 to 0.984). This study revealed that different prompts had variable effects across the various models, and that gpt-4-Web with the ROT prompt was the most consistent. An appropriate prompt can improve the accuracy of responses to professional medical questions.
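To show how the reliability of repeated answers can be quantified with Fleiss' kappa, here is a minimal hand-rolled sketch in Python; the answer matrix (5 repetitions per question across 3 hypothetical response categories) is invented for illustration and does not come from the study.

# Fleiss' kappa for agreement across repeated LLM responses.
# Rows = questions, columns = response categories (e.g., agree / neutral / disagree);
# each cell counts how many of the 5 repetitions fell in that category.
# The matrix below is hypothetical, not data from the study.

def fleiss_kappa(table):
    n_subjects = len(table)
    n_raters = sum(table[0])           # repetitions per question (5 here)
    n_categories = len(table[0])

    # Per-question observed agreement
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in table
    ]
    p_bar = sum(p_i) / n_subjects

    # Expected agreement by chance from overall category proportions
    totals = [sum(row[j] for row in table) for j in range(n_categories)]
    p_j = [t / (n_subjects * n_raters) for t in totals]
    p_e = sum(p * p for p in p_j)

    return (p_bar - p_e) / (1 - p_e)

answers = [
    [5, 0, 0],  # all 5 repetitions gave the same response
    [3, 1, 1],
    [2, 2, 1],
    [0, 5, 0],
]
print(f"Fleiss kappa = {fleiss_kappa(answers):.3f}")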
Affiliation(s)
- Li Wang
- Sports Medicine Center, West China Hospital, Sichuan University, Chengdu, China
- Department of Orthopedics and Orthopedic Research Institute, West China Hospital, Sichuan University, Chengdu, China
- Xi Chen
- Sports Medicine Center, West China Hospital, Sichuan University, Chengdu, China
- Department of Orthopedics and Orthopedic Research Institute, West China Hospital, Sichuan University, Chengdu, China
- XiangWen Deng
- Shenzhen International Graduate School, Tsinghua University, Beijing, China
- Hao Wen
- Shenzhen International Graduate School, Tsinghua University, Beijing, China
- MingKe You
- Sports Medicine Center, West China Hospital, Sichuan University, Chengdu, China
- Department of Orthopedics and Orthopedic Research Institute, West China Hospital, Sichuan University, Chengdu, China
- WeiZhi Liu
- Sports Medicine Center, West China Hospital, Sichuan University, Chengdu, China
- Department of Orthopedics and Orthopedic Research Institute, West China Hospital, Sichuan University, Chengdu, China
- Qi Li
- Sports Medicine Center, West China Hospital, Sichuan University, Chengdu, China.
- Department of Orthopedics and Orthopedic Research Institute, West China Hospital, Sichuan University, Chengdu, China.
- Jian Li
- Sports Medicine Center, West China Hospital, Sichuan University, Chengdu, China.
- Department of Orthopedics and Orthopedic Research Institute, West China Hospital, Sichuan University, Chengdu, China.