1.
Hack S, Alsleibi S, Saleh N, Alon EE, Rabinovics N, Remer E. Are chatbots a reliable source for patient frequently asked questions on neck masses? Eur Arch Otorhinolaryngol 2025. [PMID: 40307608; DOI: 10.1007/s00405-025-09433-6] [Received: 01/30/2025; Accepted: 04/08/2025]
Abstract
Purpose: To evaluate the reliability and accuracy of large language models in answering patient frequently asked questions about adult neck masses. Methods: Twenty-four questions from the American Academy of Otolaryngology-Head and Neck Surgery were presented to ChatGPT, Claude, and Gemini. Five independent otolaryngologists evaluated responses using six criteria: accuracy, extensiveness, misleading information, resource quality, guideline citations, and overall reliability. Statistical analysis used Fisher's exact tests and Fleiss' kappa. Results: All models showed high reliability (91.7-100%). Paid GPT and Gemini achieved the highest accuracy (95.8%). Extensiveness varied significantly (p = 0.012), with Gemini scoring lowest (62.5%). Resource quality ranged from 58.3% (Claude) to 100% (Paid GPT). Guideline citations were highest for the GPT models (50%) and lowest for Gemini (16.7%). Misleading information was rare (0-16.7%). Inter-rater reliability was near-perfect across the five reviewers (κ = 0.95). Conclusion: Large language models demonstrate high reliability and accuracy for neck mass patient education, with paid versions showing marginally better performance. While promising as educational tools, variable guideline adherence and occasional misinformation suggest they should complement rather than replace professional medical advice.
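The five-rater agreement statistic reported here, Fleiss' kappa, has a standard closed-form definition over a table of rating counts. A minimal sketch in Python for illustration only (this is not the study's code, and the function name is my own):

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of item-by-category rating counts.

    counts[i][j] = number of raters assigning item i to category j;
    every row must sum to the same number of raters n.
    """
    N = len(counts)          # number of rated items
    n = sum(counts[0])       # number of raters per item
    # Per-item agreement: fraction of rater pairs agreeing on item i.
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    P_bar = sum(P_i) / N
    # Expected chance agreement from the marginal category proportions.
    k = len(counts[0])
    p_j = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)
```

For example, five raters splitting 4-1 on one of three items yields a kappa well below 1, while unanimous ratings across mixed categories yield exactly 1.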
Affiliation(s)
- Sholem Hack
- St. Georges University London School of Medicine, Program Delivered by University of Nicosia at The Chaim Sheba Medical Center, Ramat Gan, Israel
- Shibli Alsleibi
- Department of Otolaryngology, Sheba Medical Center, Ramat Gan, Israel
- Faculty of Medical and Health Sciences, Tel Aviv University, Tel Aviv, Israel
- Naseem Saleh
- Department of Otolaryngology, Sheba Medical Center, Ramat Gan, Israel
- Faculty of Medical and Health Sciences, Tel Aviv University, Tel Aviv, Israel
- Eran E Alon
- Department of Otolaryngology, Sheba Medical Center, Ramat Gan, Israel
- Faculty of Medical and Health Sciences, Tel Aviv University, Tel Aviv, Israel
- Naomi Rabinovics
- Department of Otolaryngology, Sheba Medical Center, Ramat Gan, Israel
- Faculty of Medical and Health Sciences, Tel Aviv University, Tel Aviv, Israel
- Eric Remer
- Department of Otolaryngology, Sheba Medical Center, Ramat Gan, Israel
- Faculty of Medical and Health Sciences, Tel Aviv University, Tel Aviv, Israel
2.
Shaari AL, Bhalla SR, Salehi PP. Improving Accessibility to Facial Plastic and Reconstructive Surgery Patient Resources Using Artificial Intelligence: A Pilot Study in Patient Education Materials. Facial Plast Surg Aesthet Med 2025. [PMID: 40241315; DOI: 10.1089/fpsam.2024.0376]
Abstract
Background: The applications of artificial intelligence (AI) are evolving, offering new opportunities to enhance patient care. Objective: To determine whether the use of AI platforms for translating patient education materials (PEMs) improves their readability for patients seeking information on facial plastic and reconstructive surgery (FPRS) procedures. Methods: Text from 25 PEMs on topics such as rhytidectomy, rhinoplasty, and blepharoplasty was extracted. ChatGPT-4o, ChatGPT 3.5, Microsoft Copilot, and Google Gemini were prompted to translate American Academy of Facial Plastic and Reconstructive Surgery (AAFPRS) PEMs to the 6th-grade reading level, the accepted readability standard for PEMs. Readability was determined using the Flesch-Kincaid Grade Level (FKGL), Gunning Fog Index (GFI), and Flesch-Kincaid Reading Ease (FKRE). Statistical analysis was performed. Results: A total of 125 PEMs were reviewed. Original PEMs had a mean FKGL, GFI, and FKRE of 10.7, 13.48, and 50.8, respectively, exceeding the recommended reading level. The translated AI-generated PEMs had a mean FKGL, GFI, and FKRE of 8.41, 10.62, and 64.43, respectively, representing an improvement in readability (p < 0.001). Conclusion: With physician supervision, AI platforms can improve the readability of PEMs for common FPRS procedures. This strategy may increase the accessibility of educational resources for diverse patient populations.
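The three indices used above (FKGL, GFI, FKRE) are closed-form functions of word, sentence, and syllable counts. A minimal sketch under the standard published formulas (the syllable counter below is a crude vowel-group heuristic; validated tools use pronunciation dictionaries, so exact scores will differ from the study's, and all names here are my own):

```python
import re

def count_syllables(word):
    """Rough heuristic: count vowel groups, drop a common silent final 'e'."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def readability(text):
    """FKGL, FKRE, and GFI from word, sentence, and syllable counts."""
    words = re.findall(r"[A-Za-z']+", text)
    W = len(words)
    S = max(len(re.findall(r"[.!?]+", text)), 1)
    syllables = sum(count_syllables(w) for w in words)
    # GFI's "complex words" are those with three or more syllables.
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)
    return {
        "FKGL": 0.39 * W / S + 11.8 * syllables / W - 15.59,
        "FKRE": 206.835 - 1.015 * W / S - 84.6 * syllables / W,
        "GFI": 0.4 * (W / S + 100 * complex_words / W),
    }
```

Note the direction of each scale: FKGL and GFI approximate a school grade level (lower is easier), while FKRE runs 0-100 with higher scores indicating easier text, which is why the post-translation FKRE rose while FKGL and GFI fell.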
Affiliation(s)
- Shreya R Bhalla
- Rutgers Robert Wood Johnson Medical School, New Brunswick, New Jersey, USA
- Parsa P Salehi
- SalehiMD Facial Plastic Surgery, Facial Plastic and Reconstructive Surgeon, Beverly Hills, California, USA
- Beverly Hills Center for Plastic and Laser Surgery, Facial Plastic and Reconstructive Surgeon, Beverly Hills, California, USA
3.
Yan C, Li Z, Liang Y, Shao S, Ma F, Zhang N, Li B, Wang C, Zhou K. Assessing large language models as assistive tools in medical consultations for Kawasaki disease. Front Artif Intell 2025; 8:1571503. [PMID: 40231209; PMCID: PMC11994668; DOI: 10.3389/frai.2025.1571503] [Received: 02/17/2025; Accepted: 03/06/2025]
Abstract
Background: Kawasaki disease (KD) presents complex clinical challenges in diagnosis, treatment, and long-term management, requiring a comprehensive understanding by both parents and healthcare providers. With advancements in artificial intelligence (AI), large language models (LLMs) have shown promise in supporting medical practice. This study aims to evaluate and compare the appropriateness and comprehensibility of different LLMs in answering clinically relevant questions about KD and to assess the impact of different prompting strategies. Methods: Twenty-five questions were formulated, incorporating three prompting strategies: No prompting (NO), Parent-friendly (PF), and Doctor-level (DL). These questions were input into three LLMs: ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro. Responses were evaluated for appropriateness, educational quality, comprehensibility, cautionary statements, references, and potential misinformation, using the Information Quality Grade, Global Quality Scale (GQS), Flesch Reading Ease (FRE) score, and word count. Results: Significant differences were found among the LLMs in educational quality, accuracy, and comprehensibility (p < 0.001). Claude 3.5 provided the highest proportion of completely correct responses (51.1%) and achieved the highest median GQS score (5.0), significantly outperforming GPT-4o (4.0) and Gemini 1.5 (3.0). Gemini 1.5 achieved the highest FRE score (31.5) and provided the highest proportion of responses assessed as comprehensible (80.4%). Prompting strategies significantly affected LLM responses: Claude 3.5 Sonnet with DL prompting had the highest completely correct rate (81.3%), while PF prompting yielded the most acceptable responses (97.3%). Gemini 1.5 Pro showed minimal variation across prompts but excelled in comprehensibility (98.7% under PF prompting).
Conclusion: This study indicates that LLMs have great potential in providing information about KD, but their use requires caution due to quality inconsistencies and misinformation risks. Significant discrepancies existed across LLMs and prompting strategies. Claude 3.5 Sonnet offered the best response quality and accuracy, while Gemini 1.5 Pro excelled in comprehensibility; PF prompting with Claude 3.5 Sonnet is most recommended for parents seeking KD information. As AI evolves, expanding research and refining models will be crucial to ensuring reliable, high-quality information.
Affiliation(s)
- Chunyi Yan
- Department of Pediatric Cardiology, West China Second University Hospital, Sichuan University, Chengdu, China
- Key Laboratory of Birth Defects and Related Diseases of Women and Children (Sichuan University), Ministry of Education, Chengdu, China
- Zexi Li
- Department of Cardiology, West China Hospital, Sichuan University, Chengdu, China
- Yongzhou Liang
- Department of Pediatric Cardiology, West China Second University Hospital, Sichuan University, Chengdu, China
- Key Laboratory of Birth Defects and Related Diseases of Women and Children (Sichuan University), Ministry of Education, Chengdu, China
- Shuran Shao
- Department of Pediatric Cardiology, West China Second University Hospital, Sichuan University, Chengdu, China
- Key Laboratory of Birth Defects and Related Diseases of Women and Children (Sichuan University), Ministry of Education, Chengdu, China
- Fan Ma
- Department of Pediatric Cardiology, West China Second University Hospital, Sichuan University, Chengdu, China
- Key Laboratory of Birth Defects and Related Diseases of Women and Children (Sichuan University), Ministry of Education, Chengdu, China
- Nanjun Zhang
- Department of Pediatric Cardiology, West China Second University Hospital, Sichuan University, Chengdu, China
- Key Laboratory of Birth Defects and Related Diseases of Women and Children (Sichuan University), Ministry of Education, Chengdu, China
- Bowen Li
- Department of Pediatric Cardiology, West China Second University Hospital, Sichuan University, Chengdu, China
- Key Laboratory of Birth Defects and Related Diseases of Women and Children (Sichuan University), Ministry of Education, Chengdu, China
- Chuan Wang
- Department of Pediatric Cardiology, West China Second University Hospital, Sichuan University, Chengdu, China
- Key Laboratory of Birth Defects and Related Diseases of Women and Children (Sichuan University), Ministry of Education, Chengdu, China
- Kaiyu Zhou
- Department of Pediatric Cardiology, West China Second University Hospital, Sichuan University, Chengdu, China
- Key Laboratory of Birth Defects and Related Diseases of Women and Children (Sichuan University), Ministry of Education, Chengdu, China
4.
Kleebayoon A, Wiwanitkit V. Comment on Evaluation of Rhinoplasty Information from ChatGPT, Gemini, and Claude. Aesthetic Plast Surg 2024. [PMID: 39470819; DOI: 10.1007/s00266-024-04465-5] [Received: 09/20/2024; Accepted: 10/08/2024]
Abstract
Level of Evidence V. This journal requires that authors assign a level of evidence to each article. For a full description of these Evidence-Based Medicine ratings, please refer to the Table of Contents or the online Instructions to Authors at www.springer.com/00266.
Affiliation(s)
- Viroj Wiwanitkit
- University Centre for Research & Development, Department of Pharmaceutical Sciences, Chandigarh University Gharuan, Mohali, Punjab, India