1. Lechien JR. Generative AI and Otolaryngology-Head & Neck Surgery. Otolaryngol Clin North Am 2024;57:753-765. PMID: 38839556. DOI: 10.1016/j.otc.2024.04.006.
Abstract
The increasing development of artificial intelligence (AI) generative models in otolaryngology-head and neck surgery will progressively change our practice. Practitioners and patients have access to AI resources, improving information, knowledge, and practice of patient care. This article summarizes the currently investigated applications of AI generative models, particularly Chatbot Generative Pre-trained Transformer, in otolaryngology-head and neck surgery.
Affiliation(s)
- Jérôme R Lechien
- Research Committee of Young Otolaryngologists of the International Federation of Otorhinolaryngological Societies (IFOS), Paris, France; Division of Laryngology and Broncho-esophagology, Department of Otolaryngology-Head Neck Surgery, EpiCURA Hospital, UMONS Research Institute for Health Sciences and Technology, University of Mons (UMons), Mons, Belgium; Department of Otorhinolaryngology and Head and Neck Surgery, Foch Hospital, Paris Saclay University, Phonetics and Phonology Laboratory (UMR 7018 CNRS, Université Sorbonne Nouvelle/Paris 3), Paris, France; Department of Otorhinolaryngology and Head and Neck Surgery, CHU Saint-Pierre, Brussels, Belgium.
2. Incerti Parenti S, Bartolucci ML, Biondi E, Maglioni A, Corazza G, Gracco A, Alessandri-Bonetti G. Online Patient Education in Obstructive Sleep Apnea: ChatGPT versus Google Search. Healthcare (Basel) 2024;12:1781. PMID: 39273804. PMCID: PMC11394980. DOI: 10.3390/healthcare12171781.
Abstract
The widespread implementation of artificial intelligence technologies provides an appealing alternative to traditional search engines for online patient healthcare education. This study assessed ChatGPT-3.5's capabilities as a source of obstructive sleep apnea (OSA) information, using Google Search as a comparison. Ten frequently searched questions related to OSA were entered into Google Search and ChatGPT-3.5. The responses were assessed by two independent researchers using the Global Quality Score (GQS), Patient Education Materials Assessment Tool (PEMAT), DISCERN instrument, CLEAR tool, and readability scores (Flesch Reading Ease and Flesch-Kincaid Grade Level). ChatGPT-3.5 significantly outperformed Google Search in terms of GQS (5.00 vs. 2.50, p < 0.0001), DISCERN reliability (35.00 vs. 29.50, p = 0.001), and quality (11.50 vs. 7.00, p = 0.02). The CLEAR tool scores indicated that ChatGPT-3.5 provided excellent content (25.00 vs. 15.50, p < 0.001). PEMAT scores showed higher understandability (60-91% vs. 44-80%) and actionability for ChatGPT-3.5 (0-40% vs. 0%). Readability analysis revealed that Google Search responses were easier to read (FRE: 56.05 vs. 22.00; FKGL: 9.00 vs. 14.00, p < 0.0001). ChatGPT-3.5 delivers higher quality and more comprehensive OSA information compared to Google Search, although its responses are less readable. This suggests that while ChatGPT-3.5 can be a valuable tool for patient education, efforts to improve readability are necessary to ensure accessibility and utility for all patients. Healthcare providers should be aware of the strengths and weaknesses of various healthcare information resources and emphasize the importance of critically evaluating online health information, advising patients on its reliability and relevance.
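For orientation, the two readability indices used in this study (and in several entries below) depend only on average sentence length and average syllables per word. The standard Flesch formulations are:

```latex
\mathrm{FRE}  = 206.835 - 1.015\,\frac{\text{total words}}{\text{total sentences}} - 84.6\,\frac{\text{total syllables}}{\text{total words}}
\qquad
\mathrm{FKGL} = 0.39\,\frac{\text{total words}}{\text{total sentences}} + 11.8\,\frac{\text{total syllables}}{\text{total words}} - 15.59
```

Higher FRE indicates easier text; FKGL approximates the U.S. school grade needed to understand it.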
Affiliation(s)
- Serena Incerti Parenti
- Unit of Orthodontics and Sleep Dentistry, Department of Biomedical and Neuromotor Sciences (DIBINEM), University of Bologna, Via San Vitale 59, 40125 Bologna, Italy
- Maria Lavinia Bartolucci
- Unit of Orthodontics and Sleep Dentistry, Department of Biomedical and Neuromotor Sciences (DIBINEM), University of Bologna, Via San Vitale 59, 40125 Bologna, Italy
- Elena Biondi
- Unit of Orthodontics and Sleep Dentistry, Department of Biomedical and Neuromotor Sciences (DIBINEM), University of Bologna, Via San Vitale 59, 40125 Bologna, Italy
- Postgraduate School of Orthodontics, University of Bologna, Via San Vitale 59, 40125 Bologna, Italy
- Alessandro Maglioni
- Unit of Orthodontics and Sleep Dentistry, Department of Biomedical and Neuromotor Sciences (DIBINEM), University of Bologna, Via San Vitale 59, 40125 Bologna, Italy
- Postgraduate School of Orthodontics, University of Bologna, Via San Vitale 59, 40125 Bologna, Italy
- Giulia Corazza
- Unit of Orthodontics and Sleep Dentistry, Department of Biomedical and Neuromotor Sciences (DIBINEM), University of Bologna, Via San Vitale 59, 40125 Bologna, Italy
- Antonio Gracco
- Postgraduate School of Orthodontics, Department of Neurosciences, Section of Dentistry, University of Padua, 35122 Padua, Italy
- Giulio Alessandri-Bonetti
- Unit of Orthodontics and Sleep Dentistry, Department of Biomedical and Neuromotor Sciences (DIBINEM), University of Bologna, Via San Vitale 59, 40125 Bologna, Italy
3. Lechien JR, Rameau A. Applications of ChatGPT in Otolaryngology-Head Neck Surgery: A State of the Art Review. Otolaryngol Head Neck Surg 2024;171:667-677. PMID: 38716790. DOI: 10.1002/ohn.807.
Abstract
OBJECTIVE To review the current literature on the application, accuracy, and performance of Chatbot Generative Pre-Trained Transformer (ChatGPT) in Otolaryngology-Head and Neck Surgery. DATA SOURCES PubMed, Cochrane Library, and Scopus. REVIEW METHODS A comprehensive review of the literature on the applications of ChatGPT in otolaryngology was conducted according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement. CONCLUSIONS ChatGPT provides imperfect patient information or general knowledge related to diseases found in Otolaryngology-Head and Neck Surgery. In clinical practice, despite suboptimal performance, studies reported that the model is more accurate in providing diagnoses than in suggesting the most adequate additional examinations and treatments related to clinical vignettes or real clinical cases. ChatGPT has been used as an adjunct tool to improve scientific reports (referencing, spelling correction), to elaborate study protocols, or to take student or resident exams, with varying levels of reported accuracy. The stability of ChatGPT responses across repeated questions appeared high, but many studies reported some hallucination events, particularly in providing scientific references. IMPLICATIONS FOR PRACTICE To date, most applications of ChatGPT are limited to generating disease or treatment information and to improving the management of clinical cases. The lack of comparison of ChatGPT performance with other large language models is the main limitation of the current research. Its ability to analyze clinical images has not yet been investigated in otolaryngology, although upper airway or ear images are an important step in the diagnosis of most common ear, nose, and throat conditions. This review may help otolaryngologists conceive new applications in further research.
Affiliation(s)
- Jérôme R Lechien
- Research Committee of Young Otolaryngologists of the International Federation of Otorhinolaryngological Societies (IFOS), Paris, France
- Division of Laryngology and Broncho-Esophagology, Department of Otolaryngology-Head Neck Surgery, EpiCURA Hospital, UMONS Research Institute for Health Sciences and Technology, University of Mons (UMons), Mons, Belgium
- Department of Otorhinolaryngology and Head and Neck Surgery, Foch Hospital, Phonetics and Phonology Laboratory (UMR 7018 CNRS, Université Sorbonne Nouvelle/Paris 3), Paris Saclay University, Paris, France
- Department of Otorhinolaryngology and Head and Neck Surgery, CHU Saint-Pierre, Brussels, Belgium
- Anais Rameau
- Department of Otolaryngology-Head and Neck Surgery, Sean Parker Institute for the Voice, Weill Cornell Medicine, New York City, New York, USA
4. Oliva AD, Pasick LJ, Hoffer ME, Rosow DE. Improving readability and comprehension levels of otolaryngology patient education materials using ChatGPT. Am J Otolaryngol 2024;45:104502. PMID: 39197330. DOI: 10.1016/j.amjoto.2024.104502.
Abstract
OBJECTIVE A publicly available large language model platform may help determine the current readability levels of otolaryngology patient education materials, as well as translate these materials to the recommended 6th-grade and 8th-grade reading levels. STUDY DESIGN Cross-sectional analysis. SETTING Online, using the large language model ChatGPT. METHODS The Patient Education pages of the American Laryngological Association (ALA) and American Academy of Otolaryngology-Head and Neck Surgery (AAO-HNS) websites were accessed. Materials were input into ChatGPT (OpenAI, San Francisco, CA; version 3.5) and Microsoft Word (Microsoft, Redmond, WA; version 16.74). Both programs calculated Flesch Reading Ease (FRE) scores, with higher scores indicating easier readability, and Flesch-Kincaid (FK) grade levels, estimating the U.S. grade level required to understand the text. ChatGPT was prompted to "translate to a 5th-grade reading level" and provide new scores. Scores were compared for statistical differences, as well as for differences between ChatGPT and Word gradings. RESULTS Patient education materials were reviewed, and 37 ALA and 72 AAO-HNS topics were translated. Overall FRE scores and FK grade levels demonstrated significant improvements following translation of materials, as scored by ChatGPT (p < 0.001). Word also scored significant improvements in FRE and FK following translation by ChatGPT for AAO-HNS materials overall (p < 0.001) but not for individual topics or for subspecialty-specific categories. Compared with Word, ChatGPT significantly exaggerated the change in FRE scores and FK grade levels (p < 0.001). CONCLUSION Otolaryngology patient education materials were found to be written at higher reading levels than recommended. Artificial intelligence may prove to be a useful resource for simplifying content to make it more accessible to patients.
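The study obtained its readability scores from ChatGPT and Microsoft Word; as a rough illustration of what such tools compute, the same Flesch metrics can be approximated in Python. The `count_syllables` heuristic and `flesch_scores` helper below are an illustrative sketch of the formulas, not the authors' implementation.

```python
import re

def count_syllables(word: str) -> int:
    """Rough vowel-group heuristic; production tools use dictionaries or better rules."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and count > 1:
        count -= 1  # treat a trailing 'e' as silent
    return max(count, 1)

def flesch_scores(text: str) -> tuple[float, float]:
    """Return (Flesch Reading Ease, Flesch-Kincaid Grade Level) for a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / max(len(sentences), 1)
    syllables_per_word = syllables / max(len(words), 1)
    fre = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    fkgl = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return fre, fkgl

if __name__ == "__main__":
    sample = "Tonsillectomy is surgery to remove the tonsils. Most children go home the same day."
    print(flesch_scores(sample))
```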
Affiliation(s)
- Allison D Oliva
- Department of Otolaryngology-Head and Neck Surgery, University of Miami Miller School of Medicine, United States of America
- Luke J Pasick
- Department of Otolaryngology-Head and Neck Surgery, University of Miami Miller School of Medicine, United States of America
- Michael E Hoffer
- Department of Otolaryngology-Head and Neck Surgery, University of Miami Miller School of Medicine, United States of America
- David E Rosow
- Department of Otolaryngology-Head and Neck Surgery, University of Miami Miller School of Medicine, United States of America.
5. Garg N, Campbell DJ, Yang A, McCann A, Moroco AE, Estephan LE, Palmer WJ, Krein H, Heffelfinger R. Chatbots as Patient Education Resources for Aesthetic Facial Plastic Surgery: Evaluation of ChatGPT and Google Bard Responses. Facial Plast Surg Aesthet Med 2024. PMID: 38946595. DOI: 10.1089/fpsam.2023.0368.
Abstract
Background: ChatGPT and Google Bard™ are popular artificial intelligence chatbots with utility for patients, including those undergoing aesthetic facial plastic surgery. Objective: To compare the accuracy and readability of chatbot-generated responses to patient education questions regarding aesthetic facial plastic surgery using a response accuracy scale and readability testing. Method: ChatGPT and Google Bard™ were asked 28 identical questions using four prompts: none, patient friendly, eighth-grade level, and references. Accuracy was assessed using Global Quality Scale (range: 1-5). Flesch-Kincaid grade level was calculated, and chatbot-provided references were analyzed for veracity. Results: Although 59.8% of responses were good quality (Global Quality Scale ≥4), ChatGPT generated more accurate responses than Google Bard™ on patient-friendly prompting (p < 0.001). Google Bard™ responses were of a significantly lower grade level than ChatGPT for all prompts (p < 0.05). Despite eighth-grade prompting, response grade level for both chatbots was high: ChatGPT (10.5 ± 1.8) and Google Bard™ (9.6 ± 1.3). Prompting for references yielded 108/108 of chatbot-generated references. Forty-one (38.0%) citations were legitimate. Twenty (18.5%) provided accurately reported information from the reference. Conclusion: Although ChatGPT produced more accurate responses and at a higher education level than Google Bard™, both chatbots provided responses above recommended grade levels for patients and failed to provide accurate references.
Affiliation(s)
- Neha Garg
- Department of Otolaryngology - Head and Neck Surgery, Thomas Jefferson University Hospitals, Philadelphia, Pennsylvania, USA
- Daniel J Campbell
- Department of Otolaryngology - Head and Neck Surgery, Thomas Jefferson University Hospitals, Philadelphia, Pennsylvania, USA
- Angela Yang
- Sidney Kimmel Medical College, Philadelphia, Pennsylvania, USA
- Adam McCann
- Department of Otolaryngology - Head and Neck Surgery, Thomas Jefferson University Hospitals, Philadelphia, Pennsylvania, USA
- Annie E Moroco
- Department of Otolaryngology - Head and Neck Surgery, Thomas Jefferson University Hospitals, Philadelphia, Pennsylvania, USA
- Leonard E Estephan
- Department of Otolaryngology - Head and Neck Surgery, Thomas Jefferson University Hospitals, Philadelphia, Pennsylvania, USA
- William J Palmer
- Department of Otolaryngology - Head and Neck Surgery, Thomas Jefferson University Hospitals, Philadelphia, Pennsylvania, USA
- Howard Krein
- Department of Otolaryngology - Head and Neck Surgery, Thomas Jefferson University Hospitals, Philadelphia, Pennsylvania, USA
- Ryan Heffelfinger
- Department of Otolaryngology - Head and Neck Surgery, Thomas Jefferson University Hospitals, Philadelphia, Pennsylvania, USA
6. Sina EM, Campbell DJ, Duffy A, Mandloi S, Benedict P, Farquhar D, Unsal A, Nyquist G. Evaluating ChatGPT as a Patient Education Tool for COVID-19-Induced Olfactory Dysfunction. OTO Open 2024;8:e70011. PMID: 39286736. PMCID: PMC11403001. DOI: 10.1002/oto2.70011.
Abstract
Objective While most patients with COVID-19-induced olfactory dysfunction (OD) recover spontaneously, those with persistent OD face significant physical and psychological sequelae. ChatGPT, an artificial intelligence chatbot, has grown as a tool for patient education. This study seeks to evaluate the quality of ChatGPT-generated responses for COVID-19 OD. Study Design Quantitative observational study. Setting Publicly available online website. Methods ChatGPT (GPT-4) was queried 4 times with 30 identical questions. Prior to questioning, ChatGPT was "prompted" to respond (1) to a patient, (2) to an eighth grader, (3) with references, and (4) with no prompt. Answer accuracy was independently scored by 4 rhinologists using the Global Quality Score (GQS, range: 1-5). Proportions of responses at incremental score thresholds were compared using χ2 analysis. Flesch-Kincaid grade level was calculated for each answer. The relationship between prompt type and grade level was assessed via analysis of variance. Results Across all graded responses (n = 480), 364 responses (75.8%) were "at least good" (GQS ≥ 4). Proportions of responses that were "at least good" (P < .0001) or "excellent" (GQS = 5) (P < .0001) differed by prompt; "at least moderate" (GQS ≥ 3) responses did not (P = .687). Eighth-grade-level (14.06 ± 2.3) and patient-friendly (14.33 ± 2.0) responses had significantly lower mean grade levels than responses with no prompting (P < .0001). Conclusion ChatGPT provides appropriate answers to most questions on COVID-19 OD regardless of prompting. However, prompting influences response quality and grade level. ChatGPT responds at grade levels above accepted recommendations for presenting medical information to patients. Currently, ChatGPT offers significant potential for patient education as an adjunct to the conventional patient-physician relationship.
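The χ2 comparison of response proportions described above can be set up with scipy; the counts in this sketch are invented placeholders, since the study's per-prompt contingency table is not reproduced here.

```python
from scipy.stats import chi2_contingency

# Hypothetical counts of "at least good" (GQS >= 4) vs. lower-scored responses
# for each of the four prompt types; placeholders, not the study's data.
observed = [
    [98, 22],  # patient-friendly prompt
    [95, 25],  # eighth-grade prompt
    [90, 30],  # references prompt
    [81, 39],  # no prompt
]
chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4f}")
```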
Affiliation(s)
- Elliott M Sina
- Sidney Kimmel Medical College Thomas Jefferson University Philadelphia Pennsylvania USA
- Daniel J Campbell
- Department of Otolaryngology Thomas Jefferson University Hospital Philadelphia Pennsylvania USA
- Alexander Duffy
- Department of Otolaryngology Thomas Jefferson University Hospital Philadelphia Pennsylvania USA
- Shreya Mandloi
- Department of Otolaryngology Thomas Jefferson University Hospital Philadelphia Pennsylvania USA
- Peter Benedict
- Department of Otolaryngology Thomas Jefferson University Hospital Philadelphia Pennsylvania USA
- Douglas Farquhar
- Department of Otolaryngology Thomas Jefferson University Hospital Philadelphia Pennsylvania USA
- Aykut Unsal
- Department of Otolaryngology Thomas Jefferson University Hospital Philadelphia Pennsylvania USA
- Gurston Nyquist
- Department of Otolaryngology Thomas Jefferson University Hospital Philadelphia Pennsylvania USA
7. Carnino JM, Pellegrini WR, Willis M, Cohen MB, Paz-Lansberg M, Davis EM, Grillone GA, Levi JR. Assessing ChatGPT's Responses to Otolaryngology Patient Questions. Ann Otol Rhinol Laryngol 2024;133:658-664. PMID: 38676440. DOI: 10.1177/00034894241249621.
Abstract
OBJECTIVE This study aims to evaluate ChatGPT's performance in addressing real-world otolaryngology patient questions, focusing on accuracy, comprehensiveness, and patient safety, to assess its suitability for integration into healthcare. METHODS A cross-sectional study was conducted using patient questions from the public online forum Reddit's r/AskDocs, where medical advice is sought from healthcare professionals. Patient questions were input into ChatGPT (GPT-3.5), and responses were reviewed by 5 board-certified otolaryngologists. The evaluation criteria included difficulty, accuracy, comprehensiveness, and bedside manner/empathy. Statistical analysis explored the relationship between patient question characteristics and ChatGPT response scores. Potentially dangerous responses were also identified. RESULTS Patient questions averaged 224.93 words, while ChatGPT responses were longer at 414.93 words. The accuracy scores for ChatGPT responses were 3.76/5, comprehensiveness scores were 3.59/5, and bedside manner/empathy scores were 4.28/5. Longer patient questions did not correlate with higher response ratings. However, longer ChatGPT responses scored higher in bedside manner/empathy. Higher question difficulty correlated with lower comprehensiveness. Five responses were flagged as potentially dangerous. CONCLUSION While ChatGPT exhibits promise in addressing otolaryngology patient questions, this study demonstrates its limitations, particularly in accuracy and comprehensiveness. The identification of potentially dangerous responses underscores the need for a cautious approach to AI in medical advice. Responsible integration of AI into healthcare necessitates thorough assessments of model performance and ethical considerations for patient safety.
Affiliation(s)
- Jonathan M Carnino
- Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA
- William R Pellegrini
- Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA
- Department of Otolaryngology-Head and Neck Surgery, Boston Medical Center, Boston, MA, USA
- Megan Willis
- Department of Biostatistics, Boston University, Boston, MA, USA
- Michael B Cohen
- Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA
- Department of Otolaryngology-Head and Neck Surgery, Boston Medical Center, Boston, MA, USA
- Marianella Paz-Lansberg
- Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA
- Department of Otolaryngology-Head and Neck Surgery, Boston Medical Center, Boston, MA, USA
- Elizabeth M Davis
- Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA
- Department of Otolaryngology-Head and Neck Surgery, Boston Medical Center, Boston, MA, USA
- Gregory A Grillone
- Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA
- Department of Otolaryngology-Head and Neck Surgery, Boston Medical Center, Boston, MA, USA
- Jessica R Levi
- Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA
- Department of Otolaryngology-Head and Neck Surgery, Boston Medical Center, Boston, MA, USA
8. Adelstein JM, Sinkler MA, Li LT, Mistovich RJ. ChatGPT Responses to Common Questions About Slipped Capital Femoral Epiphysis: A Reliable Resource for Parents? J Pediatr Orthop 2024;44:353-357. PMID: 38597253. DOI: 10.1097/bpo.0000000000002681.
Abstract
BACKGROUND We sought to evaluate the ability of ChatGPT, an AI-powered online chatbot, to answer frequently asked questions (FAQs) regarding slipped capital femoral epiphysis (SCFE). METHODS Seven FAQs regarding SCFE were presented to ChatGPT. Initial responses were recorded and compared with evidence-based literature and reputable online resources. Responses were subjectively rated as "excellent response requiring no further clarification," "satisfactory response requiring minimal clarification," "satisfactory response requiring moderate clarification," or "unsatisfactory response requiring substantial clarification." RESULTS ChatGPT was frequently able to provide satisfactory responses that required only minimal clarification. One response received an excellent rating and required no further clarification, while only 1 response from ChatGPT was rated unsatisfactory and required substantial clarification. CONCLUSIONS ChatGPT is able to frequently provide satisfactory responses to FAQs regarding SCFE while appropriately reiterating the importance of always consulting a medical professional.
Affiliation(s)
- Jeremy M Adelstein
- Department of Orthopaedic Surgery, Case Western Reserve University/University Hospitals, Cleveland, OH
- Margaret A Sinkler
- Department of Orthopaedic Surgery, Case Western Reserve University/University Hospitals, Cleveland, OH
- Lambert T Li
- Department of Orthopaedic Surgery, Case Western Reserve University/University Hospitals, Cleveland, OH
- R Justin Mistovich
- Department of Orthopaedic Surgery, Case Western Reserve University/University Hospitals, Cleveland, OH
- Division of Pediatric Orthopaedics, Rainbow Babies and Children's Hospital, Case Western Reserve University School of Medicine
9. Lee TJ, Campbell DJ, Rao AK, Hossain A, Elkattawy O, Radfar N, Lee P, Gardin JM. Evaluating ChatGPT Responses on Atrial Fibrillation for Patient Education. Cureus 2024;16:e61680. PMID: 38841294. PMCID: PMC11151148. DOI: 10.7759/cureus.61680.
Abstract
Background ChatGPT is a language model that has gained widespread popularity for its fine-tuned conversational abilities. However, a known drawback to the artificial intelligence (AI) chatbot is its tendency to confidently present users with inaccurate information. We evaluated the quality of ChatGPT responses to questions pertaining to atrial fibrillation for patient education. Our analysis included the accuracy and estimated grade level of answers and whether references were provided for the answers. Methodology ChatGPT was prompted four times and 16 frequently asked questions on atrial fibrillation from the American Heart Association were asked. Prompts included Form 1 (no prompt), Form 2 (patient-friendly prompt), Form 3 (physician-level prompt), and Form 4 (prompting for statistics/references). Responses were scored as incorrect, partially correct, or correct with references (perfect). Flesch-Kincaid grade-level unique words and response lengths were recorded for answers. Proportions of the responses at differing scores were compared using the chi-square analysis. The relationship between form and grade level was assessed using the analysis of variance. Results Across all forms, scoring frequencies were one (1.6%) incorrect, five (7.8%) partially correct, 55 (85.9%) correct, and three (4.7%) perfect. Proportions of responses that were at least correct did not differ by form (p = 0.350), but perfect responses did (p = 0.001). Form 2 answers had a lower mean grade level (12.80 ± 3.38) than Forms 1 (14.23 ± 2.34), 3 (16.73 ± 2.65), and 4 (14.85 ± 2.76) (p < 0.05). Across all forms, references were provided in only three (4.7%) answers. Notably, when additionally prompted for sources or references, ChatGPT still only provided sources on three responses out of 16 (18.8%). Conclusions ChatGPT holds significant potential for enhancing patient education through accurate, adaptive responses. Its ability to alter response complexity based on user input, combined with high accuracy rates, supports its use as an informational resource in healthcare settings. Future advancements and continuous monitoring of AI capabilities will be crucial in maximizing the benefits while mitigating the risks associated with AI-driven patient education.
Affiliation(s)
- Thomas J Lee
- Department of Medicine, Rutgers University New Jersey Medical School, Newark, USA
- Daniel J Campbell
- Otolaryngology-Head and Neck Surgery, Thomas Jefferson University Hospital, Philadelphia, USA
- Abhinav K Rao
- Department of Medicine, Trident Medical Center, Charleston, USA
- Afif Hossain
- Department of Medicine/Division of Cardiology, Rutgers University New Jersey Medical School, Newark, USA
- Omar Elkattawy
- Department of Medicine, Rutgers University New Jersey Medical School, Newark, USA
- Navid Radfar
- Department of Medicine, Rutgers University New Jersey Medical School, Newark, USA
- Paul Lee
- Department of Medicine, Rutgers University New Jersey Medical School, Newark, USA
- Julius M Gardin
- Department of Medicine/Division of Cardiology, Rutgers University New Jersey Medical School, Newark, USA
10. Lee TJ, Rao AK, Campbell DJ, Radfar N, Dayal M, Khrais A. Evaluating ChatGPT-3.5 and ChatGPT-4.0 Responses on Hyperlipidemia for Patient Education. Cureus 2024;16:e61067. PMID: 38803402. PMCID: PMC11128363. DOI: 10.7759/cureus.61067.
Abstract
Introduction Hyperlipidemia is prevalent worldwide and affects a significant number of US adults. It significantly contributes to ischemic heart disease and millions of deaths annually. With the increasing use of the internet for health information, tools like ChatGPT (OpenAI, San Francisco, CA, USA) have gained traction. ChatGPT version 4.0, launched in March 2023, offers enhanced features over its predecessor but requires a monthly fee. This study compares the accuracy, comprehensibility, and response length of the free and paid versions of ChatGPT for patient education on hyperlipidemia. Materials and methods ChatGPT versions 3.5 and 4.0 were prompted in three different ways and asked 25 questions from the Cleveland Clinic's frequently asked questions (FAQs) on hyperlipidemia. Prompts included no prompting (Form 1), patient-friendly prompting (Form 2), and physician-level prompting (Form 3). Responses were categorized as incorrect, partially correct, or correct. Additionally, the grade level and word count of each response were recorded for analysis. Results Overall, scoring frequencies for ChatGPT version 3.5 were: five (6.67%) incorrect, 18 (24.00%) partially correct, and 52 (69.33%) correct. Scoring frequencies for ChatGPT version 4.0 were: one (1.33%) incorrect, 18 (24.00%) partially correct, and 56 (74.67%) correct. Correct answers did not significantly differ between ChatGPT version 3.5 and ChatGPT version 4.0 (p = 0.586). ChatGPT version 3.5 had a significantly higher grade reading level than version 4.0 (p = 0.0002). ChatGPT version 3.5 also had a significantly higher word count than version 4.0 (p = 0.0073). Discussion There was no significant difference in accuracy between the free and paid versions on hyperlipidemia FAQs. Both versions provided accurate but sometimes partially complete responses. Version 4.0 offered more concise and readable information, aligning with the readability of most online medical resources despite exceeding the National Institutes of Health's (NIH's) recommended eighth-grade reading level. The paid version demonstrated superior adaptability in tailoring responses based on the input. Conclusion Both versions of ChatGPT provide reliable medical information, with the paid version offering more adaptable and readable responses. Healthcare providers can recommend ChatGPT as a source of patient education, regardless of the version used. Future research should explore diverse question formulations and ChatGPT's handling of incorrect information.
Affiliation(s)
- Thomas J Lee
- Department of Medicine, Rutgers University New Jersey Medical School, Newark, USA
- Abhinav K Rao
- Department of Medicine, Trident Medical Center, Charleston, USA
- Daniel J Campbell
- Department of Otolaryngology - Head and Neck Surgery, Thomas Jefferson University Hospital, Philadelphia, USA
- Navid Radfar
- Department of Medicine, Rutgers University New Jersey Medical School, Newark, USA
- Manik Dayal
- Department of Medicine, Rutgers University New Jersey Medical School, Newark, USA
- Ayham Khrais
- Department of Medicine, Rutgers University New Jersey Medical School, Newark, USA
11. Bragazzi NL, Garbarino S. Assessing the Accuracy of Generative Conversational Artificial Intelligence in Debunking Sleep Health Myths: Mixed Methods Comparative Study With Expert Analysis. JMIR Form Res 2024;8:e55762. PMID: 38501898. PMCID: PMC11061787. DOI: 10.2196/55762.
Abstract
BACKGROUND Adequate sleep is essential for maintaining individual and public health, positively affecting cognition and well-being, and reducing chronic disease risks. It plays a significant role in driving the economy, public safety, and managing health care costs. Digital tools, including websites, sleep trackers, and apps, are key in promoting sleep health education. Conversational artificial intelligence (AI) such as ChatGPT (OpenAI, Microsoft Corp) offers accessible, personalized advice on sleep health but raises concerns about potential misinformation. This underscores the importance of ensuring that AI-driven sleep health information is accurate, given its significant impact on individual and public health, and the spread of sleep-related myths. OBJECTIVE This study aims to examine ChatGPT's capability to debunk sleep-related disbeliefs. METHODS A mixed methods design was leveraged. ChatGPT categorized 20 sleep-related myths identified by 10 sleep experts and rated them in terms of falseness and public health significance, on a 5-point Likert scale. Sensitivity, positive predictive value, and interrater agreement were also calculated. A qualitative comparative analysis was also conducted. RESULTS ChatGPT labeled a significant portion (n=17, 85%) of the statements as "false" (n=9, 45%) or "generally false" (n=8, 40%), with varying accuracy across different domains. For instance, it correctly identified most myths about "sleep timing," "sleep duration," and "behaviors during sleep," while it had varying degrees of success with other categories such as "pre-sleep behaviors" and "brain function and sleep." ChatGPT's assessment of the degree of falseness and public health significance, on the 5-point Likert scale, revealed an average score of 3.45 (SD 0.87) and 3.15 (SD 0.99), respectively, indicating a good level of accuracy in identifying the falseness of statements and a good understanding of their impact on public health. The AI-based tool showed a sensitivity of 85% and a positive predictive value of 100%. Overall, this indicates that when ChatGPT labels a statement as false, it is highly reliable, but it may miss identifying some false statements. When comparing with expert ratings, high intraclass correlation coefficients (ICCs) between ChatGPT's appraisals and expert opinions could be found, suggesting that the AI's ratings were generally aligned with expert views on falseness (ICC=.83, P<.001) and public health significance (ICC=.79, P=.001) of sleep-related myths. Qualitatively, both ChatGPT and sleep experts refuted sleep-related misconceptions. However, ChatGPT adopted a more accessible style and provided a more generalized view, focusing on broad concepts, while experts sometimes used technical jargon, providing evidence-based explanations. CONCLUSIONS ChatGPT-4 can accurately address sleep-related queries and debunk sleep-related myths, with a performance comparable to sleep experts, even if, given its limitations, the AI cannot completely replace expert opinions, especially in nuanced and complex fields such as sleep health, but can be a valuable complement in the dissemination of updated information and promotion of healthy behaviors.
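The sensitivity and positive predictive value quoted above follow directly from the counts reported in the abstract (20 myths, all actually false; 17 labeled false, 3 missed, none wrongly labeled false). A minimal sketch of that arithmetic:

```python
def sensitivity_ppv(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Sensitivity = TP / (TP + FN); positive predictive value = TP / (TP + FP)."""
    return tp / (tp + fn), tp / (tp + fp)

# 17 true positives, 0 false positives, 3 false negatives, per the abstract.
print(sensitivity_ppv(tp=17, fp=0, fn=3))  # -> (0.85, 1.0)
```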
Affiliation(s)
- Nicola Luigi Bragazzi
- Human Nutrition Unit, Department of Food and Drugs, University of Parma, Parma, Italy
- Department of Neuroscience, Rehabilitation, Ophthalmology, Genetics and Maternal/Child Sciences, University of Genoa, Genoa, Italy
- Laboratory for Industrial and Applied Mathematics, Department of Mathematics and Statistics, York University, Toronto, ON, Canada
- Sergio Garbarino
- Department of Neuroscience, Rehabilitation, Ophthalmology, Genetics and Maternal/Child Sciences, University of Genoa, Genoa, Italy
- Post-Graduate School of Occupational Health, Università Cattolica del Sacro Cuore, Rome, Italy
12. Garbarino S, Bragazzi NL. Evaluating the effectiveness of artificial intelligence-based tools in detecting and understanding sleep health misinformation: Comparative analysis using Google Bard and OpenAI ChatGPT-4. J Sleep Res 2024:e14210. PMID: 38577714. DOI: 10.1111/jsr.14210.
Abstract
This study evaluates the performance of two major artificial intelligence-based tools (ChatGPT-4 and Google Bard) in debunking sleep-related myths. More in detail, the present research assessed 20 sleep misconceptions using a 5-point Likert scale for falseness and public health significance, comparing responses of artificial intelligence tools with expert opinions. The results indicated that Google Bard correctly identified 19 out of 20 statements as false (95.0% accuracy), not differing from ChatGPT-4 (85.0% accuracy, Fisher's exact test p = 0.615). Google Bard's ratings of the falseness of the sleep misconceptions averaged 4.25 ± 0.70, showing a moderately negative skewness (-0.42) and kurtosis (-0.83), and suggesting a distribution with fewer extreme values compared with ChatGPT-4. In assessing public health significance, Google Bard's mean score was 2.4 ± 0.80, with skewness and kurtosis of 0.36 and -0.07, respectively, indicating a more normal distribution compared with ChatGPT-4. The inter-rater agreement between Google Bard and sleep experts had an intra-class correlation coefficient of 0.58 for falseness and 0.69 for public health significance, showing moderate alignment (p = 0.065 and p = 0.014, respectively). Text-mining analysis revealed Google Bard's focus on practical advice, while ChatGPT-4 concentrated on theoretical aspects of sleep. The readability analysis suggested Google Bard's responses were more accessible, aligning with 8th-grade level material, versus ChatGPT-4's 12th-grade level complexity. The study demonstrates the potential of artificial intelligence in public health education, especially in sleep health, and underscores the importance of accurate, reliable artificial intelligence-generated information, calling for further collaboration between artificial intelligence developers, sleep health professionals and educators to enhance the effectiveness of sleep health promotion.
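The 95.0% vs. 85.0% comparison reported above (19/20 vs. 17/20 myths correctly identified) corresponds to a 2x2 Fisher's exact test; a sketch is below. Small differences from the published p = 0.615 may arise from the exact two-sided convention used.

```python
from scipy.stats import fisher_exact

# Correct vs. incorrect identifications out of 20 myths each, per the abstract.
table = [[19, 1],   # Google Bard
         [17, 3]]   # ChatGPT-4
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.3f}")
```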
Affiliation(s)
- Sergio Garbarino
- Department of Neuroscience, Rehabilitation, Ophthalmology, Genetics and Maternal, Child Sciences (DINOGMI), University of Genoa, Genoa, Italy
- Post-Graduate School of Occupational Health, Università Cattolica del Sacro Cuore, Rome, Italy
- Nicola Luigi Bragazzi
- Department of Neuroscience, Rehabilitation, Ophthalmology, Genetics and Maternal, Child Sciences (DINOGMI), University of Genoa, Genoa, Italy
- Laboratory for Industrial and Applied Mathematics (LIAM), Department of Mathematics and Statistics, York University, Toronto, Ontario, Canada
- Human Nutrition Unit (HNU), Department of Food and Drugs, University of Parma, Parma, Italy
13. Alapati R, Campbell D, Molin N, Creighton E, Wei Z, Boon M, Huntley C. Evaluating insomnia queries from an artificial intelligence chatbot for patient education. J Clin Sleep Med 2024;20:583-594. PMID: 38217478. PMCID: PMC10985291. DOI: 10.5664/jcsm.10948.
Abstract
STUDY OBJECTIVES We evaluated the accuracy of ChatGPT in addressing insomnia-related queries for patient education and assessed ChatGPT's ability to provide varied responses based on differing prompting scenarios. METHODS Four identical sets of 20 insomnia-related queries were posed to ChatGPT. Each set differed by the context in which ChatGPT was prompted: no prompt, patient-centered, physician-centered, and with references and statistics. Responses were reviewed by 2 academic sleep surgeons, 1 academic sleep medicine physician, and 2 sleep medicine fellows across 4 domains: clinical accuracy, prompt adherence, referencing, and statistical precision, using a binary grading system. Flesch-Kincaid grade-level scores were calculated to estimate the grade level of the responses, with statistical differences between prompts analyzed via analysis of variance and Tukey's test. Interrater reliability was calculated using Fleiss's kappa. RESULTS The study revealed significant variations in the Flesch-Kincaid grade-level scores across 4 prompts: unprompted (13.2 ± 2.2), patient-centered (8.1 ± 1.9), physician-centered (15.4 ± 2.8), and with references and statistics (17.3 ± 2.3, P < .001). Despite poor Fleiss kappa scores, indicating low interrater reliability for clinical accuracy and relevance, all evaluators agreed that the majority of ChatGPT's responses were clinically accurate, with the highest variability on Form 4. The responses were also uniformly relevant to the given prompts (100% agreement). Eighty percent of the references ChatGPT cited were verified as both real and relevant, and only 25% of cited statistics were corroborated within referenced articles. CONCLUSIONS ChatGPT can be used to generate clinically accurate responses to insomnia-related inquiries. CITATION Alapati R, Campbell D, Molin N, et al. Evaluating insomnia queries from an artificial intelligence chatbot for patient education. J Clin Sleep Med. 2024;20(4):583-594.
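The abstract reports Fleiss's kappa for interrater reliability but not the computation; one common route is statsmodels, sketched below with invented placeholder ratings rather than the study's data.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical binary ratings (1 = clinically accurate, 0 = not) from 5 reviewers
# on 8 responses; placeholders only.
ratings = np.array([
    [1, 1, 1, 1, 1],
    [1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 1, 1],
    [1, 1, 1, 0, 1],
    [1, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
])
table, _ = aggregate_raters(ratings)  # rows: items, columns: rating-category counts
print(fleiss_kappa(table, method="fleiss"))
```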
Affiliation(s)
- Rahul Alapati
- Sidney Kimmel Medical College, Thomas Jefferson University, Philadelphia, Pennsylvania
- Daniel Campbell
- Department of Otolaryngology, Thomas Jefferson University Hospitals, Philadelphia, Pennsylvania
- Nicole Molin
- Department of Otolaryngology, Thomas Jefferson University Hospitals, Philadelphia, Pennsylvania
- Department of Neurology, Jefferson Sleep Disorders Center, Thomas Jefferson University Hospitals, Philadelphia, Pennsylvania
- Erin Creighton
- Department of Otolaryngology, Thomas Jefferson University Hospitals, Philadelphia, Pennsylvania
- Department of Neurology, Jefferson Sleep Disorders Center, Thomas Jefferson University Hospitals, Philadelphia, Pennsylvania
- Zhikui Wei
- Department of Neurology, Jefferson Sleep Disorders Center, Thomas Jefferson University Hospitals, Philadelphia, Pennsylvania
- Maurits Boon
- Department of Otolaryngology, Thomas Jefferson University Hospitals, Philadelphia, Pennsylvania
- Colin Huntley
- Department of Otolaryngology, Thomas Jefferson University Hospitals, Philadelphia, Pennsylvania
14. Dhar S, Kothari D, Vasquez M, Clarke T, Maroda A, McClain WG, Sheyn A, Tuliszewski RM, Tang DM, Rangarajan SV. The utility and accuracy of ChatGPT in providing post-operative instructions following tonsillectomy: A pilot study. Int J Pediatr Otorhinolaryngol 2024;179:111901. PMID: 38447265. DOI: 10.1016/j.ijporl.2024.111901.
Abstract
OBJECTIVE To investigate the utility of answers generated by ChatGPT, a large language model, to common questions parents have about their children following tonsillectomy. METHODS Twenty Otolaryngology residents anonymously submitted common questions asked by parents of pediatric patients following tonsillectomy. After identifying the 16 most common questions via a consensus-based approach, we asked ChatGPT to generate responses to these queries. Satisfaction with the AI-generated answers was rated from 1 (Worst) to 5 (Best) by an expert panel of 3 pediatric Otolaryngologists. RESULTS The distribution of questions across the five most common domains, their mean satisfaction scores, and their Krippendorff interrater reliability coefficients were: Pain management [6, (3.67), (0.434)], Complications [4, (3.58), (-0.267)], Diet [3, (4.33), (-0.357)], Physical Activity [2, (4.33), (-0.318)], and Follow-up [1, (2.67), (-0.250)]. The panel noted that answers for diet, bleeding complications, and return to school were thorough. Pain management and follow-up recommendations were inaccurate, including a recommendation to prescribe codeine to children despite a black-box warning, and a suggested post-operative follow-up at 1 week rather than the 2-4 weeks customary for our panel. CONCLUSION Although ChatGPT can provide accurate answers to common patient questions following tonsillectomy, it sometimes provides eloquently written but inaccurate information. This may lead patients to use AI-generated medical advice contrary to physician advice. The inaccuracy in pain management answers likely reflects regional practice variability. If trained appropriately, ChatGPT could be an excellent resource for Otolaryngologists and patients to answer questions in the postoperative period. Future research should investigate whether Otolaryngologist-trained models can increase the accuracy of responses.
Affiliation(s)
- Sarit Dhar
- Department of Otolaryngology Head & Neck Surgery, University of Tennessee Health Science Center, 910 Madison Ave, Memphis, TN, 38163, USA
- Dhruv Kothari
- Department of Otolaryngology Head & Neck Surgery, University of Tennessee Health Science Center, 910 Madison Ave, Memphis, TN, 38163, USA; Department of Otolaryngology Head & Neck Surgery, Cedars-Sinai Medical Center, 8700 Beverly Blvd, Los Angeles, CA, 90048, USA
- Missael Vasquez
- Department of Otolaryngology Head & Neck Surgery, Cedars-Sinai Medical Center, 8700 Beverly Blvd, Los Angeles, CA, 90048, USA
- Travis Clarke
- Department of Otolaryngology Head & Neck Surgery, University of Tennessee Health Science Center, 910 Madison Ave, Memphis, TN, 38163, USA
- Andrew Maroda
- Department of Otolaryngology Head & Neck Surgery, University of Tennessee Health Science Center, 910 Madison Ave, Memphis, TN, 38163, USA
- Wade G McClain
- Department of Otolaryngology Head & Neck Surgery, University of Tennessee Health Science Center, 910 Madison Ave, Memphis, TN, 38163, USA
- Anthony Sheyn
- Department of Otolaryngology Head & Neck Surgery, University of Tennessee Health Science Center, 910 Madison Ave, Memphis, TN, 38163, USA
- Robert M Tuliszewski
- Department of Otolaryngology Head & Neck Surgery, University of Tennessee Health Science Center, 910 Madison Ave, Memphis, TN, 38163, USA
- Dennis M Tang
- Department of Otolaryngology Head & Neck Surgery, Cedars-Sinai Medical Center, 8700 Beverly Blvd, Los Angeles, CA, 90048, USA
- Sanjeet V Rangarajan
- Department of Otolaryngology-Head and Neck Surgery, University Hospitals Cleveland Medical Center, Case Western Reserve University School of Medicine, 11100 Euclid Ave, Cleveland, OH, 44106, USA.
15. Coskun BN, Yagiz B, Ocakoglu G, Dalkilic E, Pehlivan Y. Assessing the accuracy and completeness of artificial intelligence language models in providing information on methotrexate use. Rheumatol Int 2024;44:509-515. PMID: 37747564. DOI: 10.1007/s00296-023-05473-5.
Abstract
We aimed to assess Large Language Models (LLMs)-ChatGPT 3.5-4, BARD, and Bing-in their accuracy and completeness when answering Methotrexate (MTX) related questions for treating rheumatoid arthritis. We employed 23 questions from an earlier study related to MTX concerns. These questions were entered into the LLMs, and the responses generated by each model were evaluated by two reviewers using Likert scales to assess accuracy and completeness. The GPT models achieved a 100% correct answer rate, while BARD and Bing scored 73.91%. In terms of accuracy of the outputs (completely correct responses), GPT-4 achieved a score of 100%, GPT 3.5 secured 86.96%, and BARD and Bing each scored 60.87%. BARD produced 17.39% incorrect responses and 8.7% non-responses, while Bing recorded 13.04% incorrect and 13.04% non-responses. The ChatGPT models produced significantly more accurate responses than Bing for the "mechanism of action" category, and GPT-4 model showed significantly higher accuracy than BARD in the "side effects" category. There were no statistically significant differences among the models for the "lifestyle" category. GPT-4 achieved a comprehensive output of 100%, followed by GPT-3.5 at 86.96%, BARD at 60.86%, and Bing at 0%. In the "mechanism of action" category, both ChatGPT models and BARD produced significantly more comprehensive outputs than Bing. For the "side effects" and "lifestyle" categories, the ChatGPT models showed significantly higher completeness than Bing. The GPT models, particularly GPT 4, demonstrated superior performance in providing accurate and comprehensive patient information about MTX use. However, the study also identified inaccuracies and shortcomings in the generated responses.
Affiliation(s)
- Belkis Nihan Coskun
- Division of Rheumatology, Department of Internal Medicine, Faculty of Medicine, Bursa Uludag University, Bursa, Turkey.
- Burcu Yagiz
- Division of Rheumatology, Department of Internal Medicine, Faculty of Medicine, Bursa Uludag University, Bursa, Turkey
- Gokhan Ocakoglu
- Department of Biostatistics, Faculty of Medicine, Bursa Uludag University, Bursa, Turkey
- Ediz Dalkilic
- Division of Rheumatology, Department of Internal Medicine, Faculty of Medicine, Bursa Uludag University, Bursa, Turkey
- Yavuz Pehlivan
- Division of Rheumatology, Department of Internal Medicine, Faculty of Medicine, Bursa Uludag University, Bursa, Turkey
16. Zalzal HG, Abraham A, Cheng J, Shah RK. Can ChatGPT help patients answer their otolaryngology questions? Laryngoscope Investig Otolaryngol 2024;9:e1193. PMID: 38362184. PMCID: PMC10866598. DOI: 10.1002/lio2.1193.
Abstract
Background Over the past year, the world has been captivated by the potential of artificial intelligence (AI). The appetite for AI in science, specifically healthcare is huge. It is imperative to understand the credibility of large language models in assisting the public in medical queries. Objective To evaluate the ability of ChatGPT to provide reasonably accurate answers to public queries within the domain of Otolaryngology. Methods Two board-certified otolaryngologists (HZ, RS) inputted 30 text-based patient queries into the ChatGPT-3.5 model. ChatGPT responses were rated by physicians on a scale (accurate, partially accurate, incorrect), while a similar 3-point scale involving confidence was given to layperson reviewers. Demographic data involving gender and education level was recorded for the public reviewers. Inter-rater agreement percentage was based on binomial distribution for calculating the 95% confidence intervals and performing significance tests. Statistical significance was defined as p < .05 for two-sided tests. Results In testing patient queries, both Otolaryngology physicians found that ChatGPT answered 98.3% of questions correctly, but only 79.8% (range 51.7%-100%) of patients were confident that the AI model was accurate in its responses (corrected agreement = 0.682; p < .001). Among the layperson responses, the corrected coefficient was of moderate agreement (0.571; p < .001). No correlation was noted among age, gender, or education level for the layperson responses. Conclusion ChatGPT is highly accurate in responding to questions posed by the public with regards to Otolaryngology from a physician standpoint. Public reviewers were not fully confident in believing the AI model, with subjective concerns related to less trust in AI answers compared to physician explanation. Larger evaluations with a representative public sample and broader medical questions should immediately be conducted by appropriate organizations, governing bodies, and/or governmental agencies to instill public confidence in AI and ChatGPT as a medical resource. Level of Evidence 4.
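The binomial 95% confidence intervals mentioned above can be computed from an agreement count and a total; the sketch below uses one binomial-based interval (Wilson score), and its counts are an assumption (59 of 60 physician gradings, consistent with the 98.3% figure) rather than values given in the abstract.

```python
from statsmodels.stats.proportion import proportion_confint

# Assumed illustrative counts: 59 of 60 gradings marked accurate (~98.3%).
low, high = proportion_confint(count=59, nobs=60, alpha=0.05, method="wilson")
print(f"95% CI for agreement: {low:.3f} to {high:.3f}")
```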
Affiliation(s)
- Habib G. Zalzal
- Division of Otolaryngology-Head and Neck Surgery, Children's National Hospital, Washington, District of Columbia, USA
- Jenhao Cheng
- Quality, Safety, Analytics, Children's National Hospital, Washington, District of Columbia, USA
- Rahul K. Shah
- Division of Otolaryngology-Head and Neck Surgery, Children's National Hospital, Washington, District of Columbia, USA
17. Zaleski AL, Berkowsky R, Craig KJT, Pescatello LS. Comprehensiveness, Accuracy, and Readability of Exercise Recommendations Provided by an AI-Based Chatbot: Mixed Methods Study. JMIR Med Educ 2024;10:e51308. PMID: 38206661. PMCID: PMC10811574. DOI: 10.2196/51308.
Abstract
BACKGROUND Regular physical activity is critical for health and disease prevention. Yet, health care providers and patients face barriers to implement evidence-based lifestyle recommendations. The potential to augment care with the increased availability of artificial intelligence (AI) technologies is limitless; however, the suitability of AI-generated exercise recommendations has yet to be explored. OBJECTIVE The purpose of this study was to assess the comprehensiveness, accuracy, and readability of individualized exercise recommendations generated by a novel AI chatbot. METHODS A coding scheme was developed to score AI-generated exercise recommendations across ten categories informed by gold-standard exercise recommendations, including (1) health condition-specific benefits of exercise, (2) exercise preparticipation health screening, (3) frequency, (4) intensity, (5) time, (6) type, (7) volume, (8) progression, (9) special considerations, and (10) references to the primary literature. The AI chatbot was prompted to provide individualized exercise recommendations for 26 clinical populations using an open-source application programming interface. Two independent reviewers coded AI-generated content for each category and calculated comprehensiveness (%) and factual accuracy (%) on a scale of 0%-100%. Readability was assessed using the Flesch-Kincaid formula. Qualitative analysis identified and categorized themes from AI-generated output. RESULTS AI-generated exercise recommendations were 41.2% (107/260) comprehensive and 90.7% (146/161) accurate, with the majority (8/15, 53%) of inaccuracy related to the need for exercise preparticipation medical clearance. Average readability level of AI-generated exercise recommendations was at the college level (mean 13.7, SD 1.7), with an average Flesch reading ease score of 31.1 (SD 7.7). Several recurring themes and observations of AI-generated output included concern for liability and safety, preference for aerobic exercise, and potential bias and direct discrimination against certain age-based populations and individuals with disabilities. CONCLUSIONS There were notable gaps in the comprehensiveness, accuracy, and readability of AI-generated exercise recommendations. Exercise and health care professionals should be aware of these limitations when using and endorsing AI-based technologies as a tool to support lifestyle change involving exercise.
Affiliation(s)
- Amanda L Zaleski
- Clinical Evidence Development, Aetna Medical Affairs, CVS Health Corporation, Hartford, CT, United States
- Department of Preventive Cardiology, Hartford Hospital, Hartford, CT, United States
- Rachel Berkowsky
- Department of Kinesiology, University of Connecticut, Storrs, CT, United States
- Kelly Jean Thomas Craig
- Clinical Evidence Development, Aetna Medical Affairs, CVS Health Corporation, Hartford, CT, United States
- Linda S Pescatello
- Department of Kinesiology, University of Connecticut, Storrs, CT, United States
18. Campbell DJ, Estephan LE. ChatGPT for patient education: an evolving investigation. J Clin Sleep Med 2023;19:2135-2136. PMID: 37677075. PMCID: PMC10692945. DOI: 10.5664/jcsm.10808.
Affiliation(s)
- Daniel J. Campbell
- Department of Otolaryngology–Head and Neck Surgery, Thomas Jefferson University Hospitals, Philadelphia, Pennsylvania
- Leonard E. Estephan
- Department of Otolaryngology–Head and Neck Surgery, Thomas Jefferson University Hospitals, Philadelphia, Pennsylvania