1. Lechien JR. Generative AI and Otolaryngology-Head & Neck Surgery. Otolaryngol Clin North Am 2024;57:753-765. PMID: 38839556. DOI: 10.1016/j.otc.2024.04.006.
Abstract
The increasing development of artificial intelligence (AI) generative models in otolaryngology-head and neck surgery will progressively change our practice. Practitioners and patients have access to AI resources, improving information, knowledge, and practice of patient care. This article summarizes the currently investigated applications of AI generative models, particularly Chatbot Generative Pre-trained Transformer, in otolaryngology-head and neck surgery.
Affiliation(s)
- Jérôme R Lechien
- Research Committee of Young Otolaryngologists of the International Federation of Otorhinolaryngological Societies (IFOS), Paris, France; Division of Laryngology and Broncho-esophagology, Department of Otolaryngology-Head Neck Surgery, EpiCURA Hospital, UMONS Research Institute for Health Sciences and Technology, University of Mons (UMons), Mons, Belgium; Department of Otorhinolaryngology and Head and Neck Surgery, Foch Hospital, Paris Saclay University, Phonetics and Phonology Laboratory (UMR 7018 CNRS, Université Sorbonne Nouvelle/Paris 3), Paris, France; Department of Otorhinolaryngology and Head and Neck Surgery, CHU Saint-Pierre, Brussels, Belgium.
2. Incerti Parenti S, Bartolucci ML, Biondi E, Maglioni A, Corazza G, Gracco A, Alessandri-Bonetti G. Online Patient Education in Obstructive Sleep Apnea: ChatGPT versus Google Search. Healthcare (Basel) 2024;12:1781. PMID: 39273804. PMCID: PMC11394980. DOI: 10.3390/healthcare12171781.
Abstract
The widespread implementation of artificial intelligence technologies provides an appealing alternative to traditional search engines for online patient healthcare education. This study assessed ChatGPT-3.5's capabilities as a source of obstructive sleep apnea (OSA) information, using Google Search as a comparison. Ten frequently searched questions related to OSA were entered into Google Search and ChatGPT-3.5. The responses were assessed by two independent researchers using the Global Quality Score (GQS), Patient Education Materials Assessment Tool (PEMAT), DISCERN instrument, CLEAR tool, and readability scores (Flesch Reading Ease and Flesch-Kincaid Grade Level). ChatGPT-3.5 significantly outperformed Google Search in terms of GQS (5.00 vs. 2.50, p < 0.0001), DISCERN reliability (35.00 vs. 29.50, p = 0.001), and quality (11.50 vs. 7.00, p = 0.02). The CLEAR tool scores indicated that ChatGPT-3.5 provided excellent content (25.00 vs. 15.50, p < 0.001). PEMAT scores showed higher understandability (60-91% vs. 44-80%) and actionability for ChatGPT-3.5 (0-40% vs. 0%). Readability analysis revealed that Google Search responses were easier to read (FRE: 56.05 vs. 22.00; FKGL: 9.00 vs. 14.00, p < 0.0001). ChatGPT-3.5 delivers higher quality and more comprehensive OSA information compared to Google Search, although its responses are less readable. This suggests that while ChatGPT-3.5 can be a valuable tool for patient education, efforts to improve readability are necessary to ensure accessibility and utility for all patients. Healthcare providers should be aware of the strengths and weaknesses of various healthcare information resources and emphasize the importance of critically evaluating online health information, advising patients on its reliability and relevance.
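For orientation, the two readability indices used in this study (and in several entries below) depend only on average sentence length and average syllables per word. The standard Flesch formulations are:

```latex
\mathrm{FRE}  = 206.835 - 1.015\,\frac{\text{total words}}{\text{total sentences}} - 84.6\,\frac{\text{total syllables}}{\text{total words}}
\qquad
\mathrm{FKGL} = 0.39\,\frac{\text{total words}}{\text{total sentences}} + 11.8\,\frac{\text{total syllables}}{\text{total words}} - 15.59
```

Higher FRE indicates easier text; FKGL approximates the U.S. school grade needed to understand it.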
Affiliation(s)
- Serena Incerti Parenti
- Unit of Orthodontics and Sleep Dentistry, Department of Biomedical and Neuromotor Sciences (DIBINEM), University of Bologna, Via San Vitale 59, 40125 Bologna, Italy
- Maria Lavinia Bartolucci
- Unit of Orthodontics and Sleep Dentistry, Department of Biomedical and Neuromotor Sciences (DIBINEM), University of Bologna, Via San Vitale 59, 40125 Bologna, Italy
- Elena Biondi
- Unit of Orthodontics and Sleep Dentistry, Department of Biomedical and Neuromotor Sciences (DIBINEM), University of Bologna, Via San Vitale 59, 40125 Bologna, Italy
- Postgraduate School of Orthodontics, University of Bologna, Via San Vitale 59, 40125 Bologna, Italy
- Alessandro Maglioni
- Unit of Orthodontics and Sleep Dentistry, Department of Biomedical and Neuromotor Sciences (DIBINEM), University of Bologna, Via San Vitale 59, 40125 Bologna, Italy
- Postgraduate School of Orthodontics, University of Bologna, Via San Vitale 59, 40125 Bologna, Italy
- Giulia Corazza
- Unit of Orthodontics and Sleep Dentistry, Department of Biomedical and Neuromotor Sciences (DIBINEM), University of Bologna, Via San Vitale 59, 40125 Bologna, Italy
- Antonio Gracco
- Postgraduate School of Orthodontics, Department of Neurosciences, Section of Dentistry, University of Padua, 35122 Padua, Italy
- Giulio Alessandri-Bonetti
- Unit of Orthodontics and Sleep Dentistry, Department of Biomedical and Neuromotor Sciences (DIBINEM), University of Bologna, Via San Vitale 59, 40125 Bologna, Italy
3. Lechien JR, Rameau A. Applications of ChatGPT in Otolaryngology-Head Neck Surgery: A State of the Art Review. Otolaryngol Head Neck Surg 2024;171:667-677. PMID: 38716790. DOI: 10.1002/ohn.807.
Abstract
OBJECTIVE To review the current literature on the application, accuracy, and performance of Chatbot Generative Pre-Trained Transformer (ChatGPT) in Otolaryngology-Head and Neck Surgery. DATA SOURCES PubMed, Cochrane Library, and Scopus. REVIEW METHODS A comprehensive review of the literature on the applications of ChatGPT in otolaryngology was conducted according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement. CONCLUSIONS ChatGPT provides imperfect patient information or general knowledge related to diseases found in Otolaryngology-Head and Neck Surgery. In clinical practice, despite suboptimal performance, studies reported that the model is more accurate in providing diagnoses than in suggesting the most adequate additional examinations and treatments related to clinical vignettes or real clinical cases. ChatGPT has been used as an adjunct tool to improve scientific reports (referencing, spelling correction), to elaborate study protocols, or to take student or resident exams, with varying levels of reported accuracy. The stability of ChatGPT responses across repeated questions appeared high, but many studies reported some hallucination events, particularly in providing scientific references. IMPLICATIONS FOR PRACTICE To date, most applications of ChatGPT are limited to generating disease or treatment information and to improving the management of clinical cases. The lack of comparison of ChatGPT performance with other large language models is the main limitation of the current research. Its ability to analyze clinical images has not yet been investigated in otolaryngology, although upper airway or ear images are an important step in the diagnosis of most common ear, nose, and throat conditions. This review may help otolaryngologists conceive new applications in further research.
Affiliation(s)
- Jérôme R Lechien
- Research Committee of Young Otolaryngologists of the International Federation of Otorhinolaryngological Societies (IFOS), Paris, France
- Division of Laryngology and Broncho-Esophagology, Department of Otolaryngology-Head Neck Surgery, EpiCURA Hospital, UMONS Research Institute for Health Sciences and Technology, University of Mons (UMons), Mons, Belgium
- Department of Otorhinolaryngology and Head and Neck Surgery, Foch Hospital, Phonetics and Phonology Laboratory (UMR 7018 CNRS, Université Sorbonne Nouvelle/Paris 3), Paris Saclay University, Paris, France
- Department of Otorhinolaryngology and Head and Neck Surgery, CHU Saint-Pierre, Brussels, Belgium
- Anais Rameau
- Department of Otolaryngology-Head and Neck Surgery, Sean Parker Institute for the Voice, Weill Cornell Medicine, New York City, New York, USA
4. Oliva AD, Pasick LJ, Hoffer ME, Rosow DE. Improving readability and comprehension levels of otolaryngology patient education materials using ChatGPT. Am J Otolaryngol 2024;45:104502. PMID: 39197330. DOI: 10.1016/j.amjoto.2024.104502.
Abstract
OBJECTIVE A publicly available large language model platform may help determine the current readability levels of otolaryngology patient education materials, as well as translate these materials to the recommended 6th-grade and 8th-grade reading levels. STUDY DESIGN Cross-sectional analysis. SETTING Online, using the large language model ChatGPT. METHODS The Patient Education pages of the American Laryngological Association (ALA) and American Academy of Otolaryngology-Head and Neck Surgery (AAO-HNS) websites were accessed. Materials were input into ChatGPT (OpenAI, San Francisco, CA; version 3.5) and Microsoft Word (Microsoft, Redmond, WA; version 16.74). Both programs calculated Flesch Reading Ease (FRE) scores, with higher scores indicating easier readability, and Flesch-Kincaid (FK) grade levels, estimating the U.S. grade level required to understand the text. ChatGPT was prompted to "translate to a 5th-grade reading level" and provide new scores. Scores were compared for statistical differences, as well as for differences between ChatGPT and Word gradings. RESULTS Patient education materials were reviewed, and 37 ALA and 72 AAO-HNS topics were translated. Overall FRE scores and FK grade levels demonstrated significant improvements following translation of materials, as scored by ChatGPT (p < 0.001). Word also scored significant improvements in FRE and FK following translation by ChatGPT for AAO-HNS materials overall (p < 0.001) but not for individual topics or for subspecialty-specific categories. Compared with Word, ChatGPT significantly exaggerated the change in FRE scores and FK grade levels (p < 0.001). CONCLUSION Otolaryngology patient education materials were found to be written at higher reading levels than recommended. Artificial intelligence may prove to be a useful resource for simplifying content to make it more accessible to patients.
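The study obtained its readability scores from ChatGPT and Microsoft Word; as a rough illustration of what such tools compute, the same Flesch metrics can be approximated in Python. The `count_syllables` heuristic and `flesch_scores` helper below are an illustrative sketch of the formulas, not the authors' implementation.

```python
import re

def count_syllables(word: str) -> int:
    """Rough vowel-group heuristic; production tools use dictionaries or better rules."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and count > 1:
        count -= 1  # treat a trailing 'e' as silent
    return max(count, 1)

def flesch_scores(text: str) -> tuple[float, float]:
    """Return (Flesch Reading Ease, Flesch-Kincaid Grade Level) for a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / max(len(sentences), 1)
    syllables_per_word = syllables / max(len(words), 1)
    fre = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    fkgl = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return fre, fkgl

if __name__ == "__main__":
    sample = "Tonsillectomy is surgery to remove the tonsils. Most children go home the same day."
    print(flesch_scores(sample))
```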
Affiliation(s)
- Allison D Oliva
- Department of Otolaryngology-Head and Neck Surgery, University of Miami Miller School of Medicine, United States of America
- Luke J Pasick
- Department of Otolaryngology-Head and Neck Surgery, University of Miami Miller School of Medicine, United States of America
- Michael E Hoffer
- Department of Otolaryngology-Head and Neck Surgery, University of Miami Miller School of Medicine, United States of America
- David E Rosow
- Department of Otolaryngology-Head and Neck Surgery, University of Miami Miller School of Medicine, United States of America.
5. Garg N, Campbell DJ, Yang A, McCann A, Moroco AE, Estephan LE, Palmer WJ, Krein H, Heffelfinger R. Chatbots as Patient Education Resources for Aesthetic Facial Plastic Surgery: Evaluation of ChatGPT and Google Bard Responses. Facial Plast Surg Aesthet Med 2024. PMID: 38946595. DOI: 10.1089/fpsam.2023.0368.
Abstract
Background: ChatGPT and Google Bard™ are popular artificial intelligence chatbots with utility for patients, including those undergoing aesthetic facial plastic surgery. Objective: To compare the accuracy and readability of chatbot-generated responses to patient education questions regarding aesthetic facial plastic surgery using a response accuracy scale and readability testing. Method: ChatGPT and Google Bard™ were asked 28 identical questions using four prompts: none, patient friendly, eighth-grade level, and references. Accuracy was assessed using Global Quality Scale (range: 1-5). Flesch-Kincaid grade level was calculated, and chatbot-provided references were analyzed for veracity. Results: Although 59.8% of responses were good quality (Global Quality Scale ≥4), ChatGPT generated more accurate responses than Google Bard™ on patient-friendly prompting (p < 0.001). Google Bard™ responses were of a significantly lower grade level than ChatGPT for all prompts (p < 0.05). Despite eighth-grade prompting, response grade level for both chatbots was high: ChatGPT (10.5 ± 1.8) and Google Bard™ (9.6 ± 1.3). Prompting for references yielded 108/108 of chatbot-generated references. Forty-one (38.0%) citations were legitimate. Twenty (18.5%) provided accurately reported information from the reference. Conclusion: Although ChatGPT produced more accurate responses and at a higher education level than Google Bard™, both chatbots provided responses above recommended grade levels for patients and failed to provide accurate references.
Affiliation(s)
- Neha Garg
- Department of Otolaryngology - Head and Neck Surgery, Thomas Jefferson University Hospitals, Philadelphia, Pennsylvania, USA
- Daniel J Campbell
- Department of Otolaryngology - Head and Neck Surgery, Thomas Jefferson University Hospitals, Philadelphia, Pennsylvania, USA
- Angela Yang
- Sidney Kimmel Medical College, Philadelphia, Pennsylvania, USA
- Adam McCann
- Department of Otolaryngology - Head and Neck Surgery, Thomas Jefferson University Hospitals, Philadelphia, Pennsylvania, USA
- Annie E Moroco
- Department of Otolaryngology - Head and Neck Surgery, Thomas Jefferson University Hospitals, Philadelphia, Pennsylvania, USA
- Leonard E Estephan
- Department of Otolaryngology - Head and Neck Surgery, Thomas Jefferson University Hospitals, Philadelphia, Pennsylvania, USA
- William J Palmer
- Department of Otolaryngology - Head and Neck Surgery, Thomas Jefferson University Hospitals, Philadelphia, Pennsylvania, USA
- Howard Krein
- Department of Otolaryngology - Head and Neck Surgery, Thomas Jefferson University Hospitals, Philadelphia, Pennsylvania, USA
- Ryan Heffelfinger
- Department of Otolaryngology - Head and Neck Surgery, Thomas Jefferson University Hospitals, Philadelphia, Pennsylvania, USA
6. Sina EM, Campbell DJ, Duffy A, Mandloi S, Benedict P, Farquhar D, Unsal A, Nyquist G. Evaluating ChatGPT as a Patient Education Tool for COVID-19-Induced Olfactory Dysfunction. OTO Open 2024;8:e70011. PMID: 39286736. PMCID: PMC11403001. DOI: 10.1002/oto2.70011.
Abstract
Objective While most patients with COVID-19-induced olfactory dysfunction (OD) recover spontaneously, those with persistent OD face significant physical and psychological sequelae. ChatGPT, an artificial intelligence chatbot, has grown as a tool for patient education. This study seeks to evaluate the quality of ChatGPT-generated responses for COVID-19 OD. Study Design Quantitative observational study. Setting Publicly available online website. Methods ChatGPT (GPT-4) was queried 4 times with 30 identical questions. Prior to questioning, ChatGPT was "prompted" to respond (1) to a patient, (2) to an eighth grader, (3) with references, and (4) with no prompt. Answer accuracy was independently scored by 4 rhinologists using the Global Quality Score (GQS, range: 1-5). Proportions of responses at incremental score thresholds were compared using χ2 analysis. Flesch-Kincaid grade level was calculated for each answer. The relationship between prompt type and grade level was assessed via analysis of variance. Results Across all graded responses (n = 480), 364 responses (75.8%) were "at least good" (GQS ≥ 4). Proportions of responses that were "at least good" (P < .0001) or "excellent" (GQS = 5) (P < .0001) differed by prompt; "at least moderate" (GQS ≥ 3) responses did not (P = .687). Eighth-grade-level (14.06 ± 2.3) and patient-friendly (14.33 ± 2.0) responses had significantly lower mean grade levels than responses with no prompting (P < .0001). Conclusion ChatGPT provides appropriate answers to most questions on COVID-19 OD regardless of prompting. However, prompting influences response quality and grade level. ChatGPT responds at grade levels above accepted recommendations for presenting medical information to patients. Currently, ChatGPT offers significant potential for patient education as an adjunct to the conventional patient-physician relationship.
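The χ2 comparison of response proportions described above can be set up with scipy; the counts in this sketch are invented placeholders, since the study's per-prompt contingency table is not reproduced here.

```python
from scipy.stats import chi2_contingency

# Hypothetical counts of "at least good" (GQS >= 4) vs. lower-scored responses
# for each of the four prompt types; placeholders, not the study's data.
observed = [
    [98, 22],  # patient-friendly prompt
    [95, 25],  # eighth-grade prompt
    [90, 30],  # references prompt
    [81, 39],  # no prompt
]
chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4f}")
```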
Affiliation(s)
- Elliott M Sina
- Sidney Kimmel Medical College Thomas Jefferson University Philadelphia Pennsylvania USA
- Daniel J Campbell
- Department of Otolaryngology Thomas Jefferson University Hospital Philadelphia Pennsylvania USA
- Alexander Duffy
- Department of Otolaryngology Thomas Jefferson University Hospital Philadelphia Pennsylvania USA
- Shreya Mandloi
- Department of Otolaryngology Thomas Jefferson University Hospital Philadelphia Pennsylvania USA
- Peter Benedict
- Department of Otolaryngology Thomas Jefferson University Hospital Philadelphia Pennsylvania USA
- Douglas Farquhar
- Department of Otolaryngology Thomas Jefferson University Hospital Philadelphia Pennsylvania USA
- Aykut Unsal
- Department of Otolaryngology Thomas Jefferson University Hospital Philadelphia Pennsylvania USA
- Gurston Nyquist
- Department of Otolaryngology Thomas Jefferson University Hospital Philadelphia Pennsylvania USA
7. Carnino JM, Pellegrini WR, Willis M, Cohen MB, Paz-Lansberg M, Davis EM, Grillone GA, Levi JR. Assessing ChatGPT's Responses to Otolaryngology Patient Questions. Ann Otol Rhinol Laryngol 2024;133:658-664. PMID: 38676440. DOI: 10.1177/00034894241249621.
Abstract
OBJECTIVE This study aims to evaluate ChatGPT's performance in addressing real-world otolaryngology patient questions, focusing on accuracy, comprehensiveness, and patient safety, to assess its suitability for integration into healthcare. METHODS A cross-sectional study was conducted using patient questions from the public online forum Reddit's r/AskDocs, where medical advice is sought from healthcare professionals. Patient questions were input into ChatGPT (GPT-3.5), and responses were reviewed by 5 board-certified otolaryngologists. The evaluation criteria included difficulty, accuracy, comprehensiveness, and bedside manner/empathy. Statistical analysis explored the relationship between patient question characteristics and ChatGPT response scores. Potentially dangerous responses were also identified. RESULTS Patient questions averaged 224.93 words, while ChatGPT responses were longer at 414.93 words. The accuracy scores for ChatGPT responses were 3.76/5, comprehensiveness scores were 3.59/5, and bedside manner/empathy scores were 4.28/5. Longer patient questions did not correlate with higher response ratings. However, longer ChatGPT responses scored higher in bedside manner/empathy. Higher question difficulty correlated with lower comprehensiveness. Five responses were flagged as potentially dangerous. CONCLUSION While ChatGPT exhibits promise in addressing otolaryngology patient questions, this study demonstrates its limitations, particularly in accuracy and comprehensiveness. The identification of potentially dangerous responses underscores the need for a cautious approach to AI in medical advice. Responsible integration of AI into healthcare necessitates thorough assessments of model performance and ethical considerations for patient safety.
Affiliation(s)
- Jonathan M Carnino
- Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA
- William R Pellegrini
- Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA
- Department of Otolaryngology-Head and Neck Surgery, Boston Medical Center, Boston, MA, USA
- Megan Willis
- Department of Biostatistics, Boston University, Boston, MA, USA
- Michael B Cohen
- Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA
- Department of Otolaryngology-Head and Neck Surgery, Boston Medical Center, Boston, MA, USA
- Marianella Paz-Lansberg
- Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA
- Department of Otolaryngology-Head and Neck Surgery, Boston Medical Center, Boston, MA, USA
- Elizabeth M Davis
- Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA
- Department of Otolaryngology-Head and Neck Surgery, Boston Medical Center, Boston, MA, USA
- Gregory A Grillone
- Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA
- Department of Otolaryngology-Head and Neck Surgery, Boston Medical Center, Boston, MA, USA
- Jessica R Levi
- Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA
- Department of Otolaryngology-Head and Neck Surgery, Boston Medical Center, Boston, MA, USA
8. Adelstein JM, Sinkler MA, Li LT, Mistovich RJ. ChatGPT Responses to Common Questions About Slipped Capital Femoral Epiphysis: A Reliable Resource for Parents? J Pediatr Orthop 2024;44:353-357. PMID: 38597253. DOI: 10.1097/bpo.0000000000002681.
Abstract
BACKGROUND We sought to evaluate the ability of ChatGPT, an AI-powered online chatbot, to answer frequently asked questions (FAQs) regarding slipped capital femoral epiphysis (SCFE). METHODS Seven FAQs regarding SCFE were presented to ChatGPT. Initial responses were recorded and compared with evidence-based literature and reputable online resources. Responses were subjectively rated as "excellent response requiring no further clarification," "satisfactory response requiring minimal clarification," "satisfactory response requiring moderate clarification," or "unsatisfactory response requiring substantial clarification." RESULTS ChatGPT was frequently able to provide satisfactory responses that required only minimal clarification. One response received an excellent rating and required no further clarification, while only 1 response from ChatGPT was rated unsatisfactory and required substantial clarification. CONCLUSIONS ChatGPT is able to frequently provide satisfactory responses to FAQs regarding SCFE while appropriately reiterating the importance of always consulting a medical professional.
Affiliation(s)
- Jeremy M Adelstein
- Department of Orthopaedic Surgery, Case Western Reserve University/University Hospitals, Cleveland, OH
- Margaret A Sinkler
- Department of Orthopaedic Surgery, Case Western Reserve University/University Hospitals, Cleveland, OH
- Lambert T Li
- Department of Orthopaedic Surgery, Case Western Reserve University/University Hospitals, Cleveland, OH
- R Justin Mistovich
- Department of Orthopaedic Surgery, Case Western Reserve University/University Hospitals, Cleveland, OH
- Division of Pediatric Orthopaedics, Rainbow Babies and Children's Hospital, Case Western Reserve University School of Medicine
9. Lee TJ, Campbell DJ, Rao AK, Hossain A, Elkattawy O, Radfar N, Lee P, Gardin JM. Evaluating ChatGPT Responses on Atrial Fibrillation for Patient Education. Cureus 2024;16:e61680. PMID: 38841294. PMCID: PMC11151148. DOI: 10.7759/cureus.61680.
Abstract
Background ChatGPT is a language model that has gained widespread popularity for its fine-tuned conversational abilities. However, a known drawback to the artificial intelligence (AI) chatbot is its tendency to confidently present users with inaccurate information. We evaluated the quality of ChatGPT responses to questions pertaining to atrial fibrillation for patient education. Our analysis included the accuracy and estimated grade level of answers and whether references were provided for the answers. Methodology ChatGPT was prompted four times and 16 frequently asked questions on atrial fibrillation from the American Heart Association were asked. Prompts included Form 1 (no prompt), Form 2 (patient-friendly prompt), Form 3 (physician-level prompt), and Form 4 (prompting for statistics/references). Responses were scored as incorrect, partially correct, or correct with references (perfect). Flesch-Kincaid grade-level unique words and response lengths were recorded for answers. Proportions of the responses at differing scores were compared using the chi-square analysis. The relationship between form and grade level was assessed using the analysis of variance. Results Across all forms, scoring frequencies were one (1.6%) incorrect, five (7.8%) partially correct, 55 (85.9%) correct, and three (4.7%) perfect. Proportions of responses that were at least correct did not differ by form (p = 0.350), but perfect responses did (p = 0.001). Form 2 answers had a lower mean grade level (12.80 ± 3.38) than Forms 1 (14.23 ± 2.34), 3 (16.73 ± 2.65), and 4 (14.85 ± 2.76) (p < 0.05). Across all forms, references were provided in only three (4.7%) answers. Notably, when additionally prompted for sources or references, ChatGPT still only provided sources on three responses out of 16 (18.8%). Conclusions ChatGPT holds significant potential for enhancing patient education through accurate, adaptive responses. Its ability to alter response complexity based on user input, combined with high accuracy rates, supports its use as an informational resource in healthcare settings. Future advancements and continuous monitoring of AI capabilities will be crucial in maximizing the benefits while mitigating the risks associated with AI-driven patient education.
Affiliation(s)
- Thomas J Lee
- Department of Medicine, Rutgers University New Jersey Medical School, Newark, USA
- Daniel J Campbell
- Otolaryngology-Head and Neck Surgery, Thomas Jefferson University Hospital, Philadelphia, USA
- Abhinav K Rao
- Department of Medicine, Trident Medical Center, Charleston, USA
- Afif Hossain
- Department of Medicine/Division of Cardiology, Rutgers University New Jersey Medical School, Newark, USA
- Omar Elkattawy
- Department of Medicine, Rutgers University New Jersey Medical School, Newark, USA
- Navid Radfar
- Department of Medicine, Rutgers University New Jersey Medical School, Newark, USA
- Paul Lee
- Department of Medicine, Rutgers University New Jersey Medical School, Newark, USA
- Julius M Gardin
- Department of Medicine/Division of Cardiology, Rutgers University New Jersey Medical School, Newark, USA
10. Lee TJ, Rao AK, Campbell DJ, Radfar N, Dayal M, Khrais A. Evaluating ChatGPT-3.5 and ChatGPT-4.0 Responses on Hyperlipidemia for Patient Education. Cureus 2024;16:e61067. PMID: 38803402. PMCID: PMC11128363. DOI: 10.7759/cureus.61067.
Abstract
Introduction Hyperlipidemia is prevalent worldwide and affects a significant number of US adults. It significantly contributes to ischemic heart disease and millions of deaths annually. With the increasing use of the internet for health information, tools like ChatGPT (OpenAI, San Francisco, CA, USA) have gained traction. ChatGPT version 4.0, launched in March 2023, offers enhanced features over its predecessor but requires a monthly fee. This study compares the accuracy, comprehensibility, and response length of the free and paid versions of ChatGPT for patient education on hyperlipidemia. Materials and methods ChatGPT versions 3.5 and 4.0 were prompted in three different ways and asked 25 questions from the Cleveland Clinic's frequently asked questions (FAQs) on hyperlipidemia. Prompts included no prompting (Form 1), patient-friendly prompting (Form 2), and physician-level prompting (Form 3). Responses were categorized as incorrect, partially correct, or correct. Additionally, the grade level and word count of each response were recorded for analysis. Results Overall, scoring frequencies for ChatGPT version 3.5 were: five (6.67%) incorrect, 18 (24.00%) partially correct, and 52 (69.33%) correct. Scoring frequencies for ChatGPT version 4.0 were: one (1.33%) incorrect, 18 (24.00%) partially correct, and 56 (74.67%) correct. Correct answers did not significantly differ between ChatGPT version 3.5 and ChatGPT version 4.0 (p = 0.586). ChatGPT version 3.5 had a significantly higher grade reading level than version 4.0 (p = 0.0002). ChatGPT version 3.5 also had a significantly higher word count than version 4.0 (p = 0.0073). Discussion There was no significant difference in accuracy between the free and paid versions on hyperlipidemia FAQs. Both versions provided accurate but sometimes partially complete responses. Version 4.0 offered more concise and readable information, aligning with the readability of most online medical resources despite exceeding the National Institutes of Health's (NIH's) recommended eighth-grade reading level. The paid version demonstrated superior adaptability in tailoring responses based on the input. Conclusion Both versions of ChatGPT provide reliable medical information, with the paid version offering more adaptable and readable responses. Healthcare providers can recommend ChatGPT as a source of patient education, regardless of the version used. Future research should explore diverse question formulations and ChatGPT's handling of incorrect information.
Affiliation(s)
- Thomas J Lee
- Department of Medicine, Rutgers University New Jersey Medical School, Newark, USA
- Abhinav K Rao
- Department of Medicine, Trident Medical Center, Charleston, USA
- Daniel J Campbell
- Department of Otolaryngology - Head and Neck Surgery, Thomas Jefferson University Hospital, Philadelphia, USA
- Navid Radfar
- Department of Medicine, Rutgers University New Jersey Medical School, Newark, USA
- Manik Dayal
- Department of Medicine, Rutgers University New Jersey Medical School, Newark, USA
- Ayham Khrais
- Department of Medicine, Rutgers University New Jersey Medical School, Newark, USA
11. Bragazzi NL, Garbarino S. Assessing the Accuracy of Generative Conversational Artificial Intelligence in Debunking Sleep Health Myths: Mixed Methods Comparative Study With Expert Analysis. JMIR Form Res 2024;8:e55762. PMID: 38501898. PMCID: PMC11061787. DOI: 10.2196/55762.
Abstract
BACKGROUND Adequate sleep is essential for maintaining individual and public health, positively affecting cognition and well-being, and reducing chronic disease risks. It plays a significant role in driving the economy, public safety, and managing health care costs. Digital tools, including websites, sleep trackers, and apps, are key in promoting sleep health education. Conversational artificial intelligence (AI) such as ChatGPT (OpenAI, Microsoft Corp) offers accessible, personalized advice on sleep health but raises concerns about potential misinformation. This underscores the importance of ensuring that AI-driven sleep health information is accurate, given its significant impact on individual and public health, and the spread of sleep-related myths. OBJECTIVE This study aims to examine ChatGPT's capability to debunk sleep-related disbeliefs. METHODS A mixed methods design was leveraged. ChatGPT categorized 20 sleep-related myths identified by 10 sleep experts and rated them in terms of falseness and public health significance, on a 5-point Likert scale. Sensitivity, positive predictive value, and interrater agreement were also calculated. A qualitative comparative analysis was also conducted. RESULTS ChatGPT labeled a significant portion (n=17, 85%) of the statements as "false" (n=9, 45%) or "generally false" (n=8, 40%), with varying accuracy across different domains. For instance, it correctly identified most myths about "sleep timing," "sleep duration," and "behaviors during sleep," while it had varying degrees of success with other categories such as "pre-sleep behaviors" and "brain function and sleep." ChatGPT's assessment of the degree of falseness and public health significance, on the 5-point Likert scale, revealed an average score of 3.45 (SD 0.87) and 3.15 (SD 0.99), respectively, indicating a good level of accuracy in identifying the falseness of statements and a good understanding of their impact on public health. The AI-based tool showed a sensitivity of 85% and a positive predictive value of 100%. Overall, this indicates that when ChatGPT labels a statement as false, it is highly reliable, but it may miss identifying some false statements. When comparing with expert ratings, high intraclass correlation coefficients (ICCs) between ChatGPT's appraisals and expert opinions could be found, suggesting that the AI's ratings were generally aligned with expert views on falseness (ICC=.83, P<.001) and public health significance (ICC=.79, P=.001) of sleep-related myths. Qualitatively, both ChatGPT and sleep experts refuted sleep-related misconceptions. However, ChatGPT adopted a more accessible style and provided a more generalized view, focusing on broad concepts, while experts sometimes used technical jargon, providing evidence-based explanations. CONCLUSIONS ChatGPT-4 can accurately address sleep-related queries and debunk sleep-related myths, with a performance comparable to sleep experts, even if, given its limitations, the AI cannot completely replace expert opinions, especially in nuanced and complex fields such as sleep health, but can be a valuable complement in the dissemination of updated information and promotion of healthy behaviors.
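The sensitivity and positive predictive value quoted above follow directly from the counts reported in the abstract (20 myths, all actually false; 17 labeled false, 3 missed, none wrongly labeled false). A minimal sketch of that arithmetic:

```python
def sensitivity_ppv(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Sensitivity = TP / (TP + FN); positive predictive value = TP / (TP + FP)."""
    return tp / (tp + fn), tp / (tp + fp)

# 17 true positives, 0 false positives, 3 false negatives, per the abstract.
print(sensitivity_ppv(tp=17, fp=0, fn=3))  # -> (0.85, 1.0)
```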
Affiliation(s)
- Nicola Luigi Bragazzi
- Human Nutrition Unit, Department of Food and Drugs, University of Parma, Parma, Italy
- Department of Neuroscience, Rehabilitation, Ophthalmology, Genetics and Maternal/Child Sciences, University of Genoa, Genoa, Italy
- Laboratory for Industrial and Applied Mathematics, Department of Mathematics and Statistics, York University, Toronto, ON, Canada
- Sergio Garbarino
- Department of Neuroscience, Rehabilitation, Ophthalmology, Genetics and Maternal/Child Sciences, University of Genoa, Genoa, Italy
- Post-Graduate School of Occupational Health, Università Cattolica del Sacro Cuore, Rome, Italy
12. Garbarino S, Bragazzi NL. Evaluating the effectiveness of artificial intelligence-based tools in detecting and understanding sleep health misinformation: Comparative analysis using Google Bard and OpenAI ChatGPT-4. J Sleep Res 2024:e14210. PMID: 38577714. DOI: 10.1111/jsr.14210.
Abstract
This study evaluates the performance of two major artificial intelligence-based tools (ChatGPT-4 and Google Bard) in debunking sleep-related myths. More in detail, the present research assessed 20 sleep misconceptions using a 5-point Likert scale for falseness and public health significance, comparing responses of artificial intelligence tools with expert opinions. The results indicated that Google Bard correctly identified 19 out of 20 statements as false (95.0% accuracy), not differing from ChatGPT-4 (85.0% accuracy, Fisher's exact test p = 0.615). Google Bard's ratings of the falseness of the sleep misconceptions averaged 4.25 ± 0.70, showing a moderately negative skewness (-0.42) and kurtosis (-0.83), and suggesting a distribution with fewer extreme values compared with ChatGPT-4. In assessing public health significance, Google Bard's mean score was 2.4 ± 0.80, with skewness and kurtosis of 0.36 and -0.07, respectively, indicating a more normal distribution compared with ChatGPT-4. The inter-rater agreement between Google Bard and sleep experts had an intra-class correlation coefficient of 0.58 for falseness and 0.69 for public health significance, showing moderate alignment (p = 0.065 and p = 0.014, respectively). Text-mining analysis revealed Google Bard's focus on practical advice, while ChatGPT-4 concentrated on theoretical aspects of sleep. The readability analysis suggested Google Bard's responses were more accessible, aligning with 8th-grade level material, versus ChatGPT-4's 12th-grade level complexity. The study demonstrates the potential of artificial intelligence in public health education, especially in sleep health, and underscores the importance of accurate, reliable artificial intelligence-generated information, calling for further collaboration between artificial intelligence developers, sleep health professionals and educators to enhance the effectiveness of sleep health promotion.
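The 95.0% vs. 85.0% comparison reported above (19/20 vs. 17/20 myths correctly identified) corresponds to a 2x2 Fisher's exact test; a sketch is below. Small differences from the published p = 0.615 may arise from the exact two-sided convention used.

```python
from scipy.stats import fisher_exact

# Correct vs. incorrect identifications out of 20 myths each, per the abstract.
table = [[19, 1],   # Google Bard
         [17, 3]]   # ChatGPT-4
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.3f}")
```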
Affiliation(s)
- Sergio Garbarino
- Department of Neuroscience, Rehabilitation, Ophthalmology, Genetics and Maternal, Child Sciences (DINOGMI), University of Genoa, Genoa, Italy
- Post-Graduate School of Occupational Health, Università Cattolica del Sacro Cuore, Rome, Italy
- Nicola Luigi Bragazzi
- Department of Neuroscience, Rehabilitation, Ophthalmology, Genetics and Maternal, Child Sciences (DINOGMI), University of Genoa, Genoa, Italy
- Laboratory for Industrial and Applied Mathematics (LIAM), Department of Mathematics and Statistics, York University, Toronto, Ontario, Canada
- Human Nutrition Unit (HNU), Department of Food and Drugs, University of Parma, Parma, Italy
13. Alapati R, Campbell D, Molin N, Creighton E, Wei Z, Boon M, Huntley C. Evaluating insomnia queries from an artificial intelligence chatbot for patient education. J Clin Sleep Med 2024;20:583-594. PMID: 38217478. PMCID: PMC10985291. DOI: 10.5664/jcsm.10948.
Abstract
STUDY OBJECTIVES We evaluated the accuracy of ChatGPT in addressing insomnia-related queries for patient education and assessed ChatGPT's ability to provide varied responses based on differing prompting scenarios. METHODS Four identical sets of 20 insomnia-related queries were posed to ChatGPT. Each set differed by the context in which ChatGPT was prompted: no prompt, patient-centered, physician-centered, and with references and statistics. Responses were reviewed by 2 academic sleep surgeons, 1 academic sleep medicine physician, and 2 sleep medicine fellows across 4 domains: clinical accuracy, prompt adherence, referencing, and statistical precision, using a binary grading system. Flesch-Kincaid grade-level scores were calculated to estimate the grade level of the responses, with statistical differences between prompts analyzed via analysis of variance and Tukey's test. Interrater reliability was calculated using Fleiss's kappa. RESULTS The study revealed significant variations in the Flesch-Kincaid grade-level scores across 4 prompts: unprompted (13.2 ± 2.2), patient-centered (8.1 ± 1.9), physician-centered (15.4 ± 2.8), and with references and statistics (17.3 ± 2.3, P < .001). Despite poor Fleiss kappa scores, indicating low interrater reliability for clinical accuracy and relevance, all evaluators agreed that the majority of ChatGPT's responses were clinically accurate, with the highest variability on Form 4. The responses were also uniformly relevant to the given prompts (100% agreement). Eighty percent of the references ChatGPT cited were verified as both real and relevant, and only 25% of cited statistics were corroborated within referenced articles. CONCLUSIONS ChatGPT can be used to generate clinically accurate responses to insomnia-related inquiries. CITATION Alapati R, Campbell D, Molin N, et al. Evaluating insomnia queries from an artificial intelligence chatbot for patient education. J Clin Sleep Med. 2024;20(4):583-594.
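The abstract reports Fleiss's kappa for interrater reliability but not the computation; one common route is statsmodels, sketched below with invented placeholder ratings rather than the study's data.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical binary ratings (1 = clinically accurate, 0 = not) from 5 reviewers
# on 8 responses; placeholders only.
ratings = np.array([
    [1, 1, 1, 1, 1],
    [1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 1, 1],
    [1, 1, 1, 0, 1],
    [1, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
])
table, _ = aggregate_raters(ratings)  # rows: items, columns: rating-category counts
print(fleiss_kappa(table, method="fleiss"))
```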
Affiliation(s)
- Rahul Alapati
- Sidney Kimmel Medical College, Thomas Jefferson University, Philadelphia, Pennsylvania
- Daniel Campbell
- Department of Otolaryngology, Thomas Jefferson University Hospitals, Philadelphia, Pennsylvania
- Nicole Molin
- Department of Otolaryngology, Thomas Jefferson University Hospitals, Philadelphia, Pennsylvania
- Department of Neurology, Jefferson Sleep Disorders Center, Thomas Jefferson University Hospitals, Philadelphia, Pennsylvania
- Erin Creighton
- Department of Otolaryngology, Thomas Jefferson University Hospitals, Philadelphia, Pennsylvania
- Department of Neurology, Jefferson Sleep Disorders Center, Thomas Jefferson University Hospitals, Philadelphia, Pennsylvania
- Zhikui Wei
- Department of Neurology, Jefferson Sleep Disorders Center, Thomas Jefferson University Hospitals, Philadelphia, Pennsylvania
- Maurits Boon
- Department of Otolaryngology, Thomas Jefferson University Hospitals, Philadelphia, Pennsylvania
- Colin Huntley
- Department of Otolaryngology, Thomas Jefferson University Hospitals, Philadelphia, Pennsylvania
14. Dhar S, Kothari D, Vasquez M, Clarke T, Maroda A, McClain WG, Sheyn A, Tuliszewski RM, Tang DM, Rangarajan SV. The utility and accuracy of ChatGPT in providing post-operative instructions following tonsillectomy: A pilot study. Int J Pediatr Otorhinolaryngol 2024;179:111901. PMID: 38447265. DOI: 10.1016/j.ijporl.2024.111901.
Abstract
OBJECTIVE To investigate the utility of answers generated by ChatGPT, a large language model, to common questions parents have about their children following tonsillectomy. METHODS Twenty Otolaryngology residents anonymously submitted common questions asked by parents of pediatric patients following tonsillectomy. After identifying the 16 most common questions via a consensus-based approach, we asked ChatGPT to generate responses to these queries. Satisfaction with the AI-generated answers was rated from 1 (Worst) to 5 (Best) by an expert panel of 3 pediatric Otolaryngologists. RESULTS The distribution of questions across the five most common domains, their mean satisfaction scores, and their Krippendorff interrater reliability coefficients were: Pain management [6, (3.67), (0.434)], Complications [4, (3.58), (-0.267)], Diet [3, (4.33), (-0.357)], Physical Activity [2, (4.33), (-0.318)], and Follow-up [1, (2.67), (-0.250)]. The panel noted that answers for diet, bleeding complications, and return to school were thorough. Pain management and follow-up recommendations were inaccurate, including a recommendation to prescribe codeine to children despite a black-box warning, and a suggested post-operative follow-up at 1 week rather than the 2-4 weeks customary for our panel. CONCLUSION Although ChatGPT can provide accurate answers to common patient questions following tonsillectomy, it sometimes provides eloquently written but inaccurate information. This may lead patients to use AI-generated medical advice contrary to physician advice. The inaccuracy in pain management answers likely reflects regional practice variability. If trained appropriately, ChatGPT could be an excellent resource for Otolaryngologists and patients to answer questions in the postoperative period. Future research should investigate whether Otolaryngologist-trained models can increase the accuracy of responses.
Affiliation(s)
- Sarit Dhar
- Department of Otolaryngology Head & Neck Surgery, University of Tennessee Health Science Center, 910 Madison Ave, Memphis, TN, 38163, USA
- Dhruv Kothari
- Department of Otolaryngology Head & Neck Surgery, University of Tennessee Health Science Center, 910 Madison Ave, Memphis, TN, 38163, USA; Department of Otolaryngology Head & Neck Surgery, Cedars-Sinai Medical Center, 8700 Beverly Blvd, Los Angeles, CA, 90048, USA
- Missael Vasquez
- Department of Otolaryngology Head & Neck Surgery, Cedars-Sinai Medical Center, 8700 Beverly Blvd, Los Angeles, CA, 90048, USA
- Travis Clarke
- Department of Otolaryngology Head & Neck Surgery, University of Tennessee Health Science Center, 910 Madison Ave, Memphis, TN, 38163, USA
- Andrew Maroda
- Department of Otolaryngology Head & Neck Surgery, University of Tennessee Health Science Center, 910 Madison Ave, Memphis, TN, 38163, USA
- Wade G McClain
- Department of Otolaryngology Head & Neck Surgery, University of Tennessee Health Science Center, 910 Madison Ave, Memphis, TN, 38163, USA
- Anthony Sheyn
- Department of Otolaryngology Head & Neck Surgery, University of Tennessee Health Science Center, 910 Madison Ave, Memphis, TN, 38163, USA
- Robert M Tuliszewski
- Department of Otolaryngology Head & Neck Surgery, University of Tennessee Health Science Center, 910 Madison Ave, Memphis, TN, 38163, USA
- Dennis M Tang
- Department of Otolaryngology Head & Neck Surgery, Cedars-Sinai Medical Center, 8700 Beverly Blvd, Los Angeles, CA, 90048, USA
- Sanjeet V Rangarajan
- Department of Otolaryngology-Head and Neck Surgery, University Hospitals Cleveland Medical Center, Case Western Reserve University School of Medicine, 11100 Euclid Ave, Cleveland, OH, 44106, USA.
15. Coskun BN, Yagiz B, Ocakoglu G, Dalkilic E, Pehlivan Y. Assessing the accuracy and completeness of artificial intelligence language models in providing information on methotrexate use. Rheumatol Int 2024;44:509-515. PMID: 37747564. DOI: 10.1007/s00296-023-05473-5.
Abstract
We aimed to assess Large Language Models (LLMs)-ChatGPT 3.5-4, BARD, and Bing-in their accuracy and completeness when answering Methotrexate (MTX) related questions for treating rheumatoid arthritis. We employed 23 questions from an earlier study related to MTX concerns. These questions were entered into the LLMs, and the responses generated by each model were evaluated by two reviewers using Likert scales to assess accuracy and completeness. The GPT models achieved a 100% correct answer rate, while BARD and Bing scored 73.91%. In terms of accuracy of the outputs (completely correct responses), GPT-4 achieved a score of 100%, GPT 3.5 secured 86.96%, and BARD and Bing each scored 60.87%. BARD produced 17.39% incorrect responses and 8.7% non-responses, while Bing recorded 13.04% incorrect and 13.04% non-responses. The ChatGPT models produced significantly more accurate responses than Bing for the "mechanism of action" category, and GPT-4 model showed significantly higher accuracy than BARD in the "side effects" category. There were no statistically significant differences among the models for the "lifestyle" category. GPT-4 achieved a comprehensive output of 100%, followed by GPT-3.5 at 86.96%, BARD at 60.86%, and Bing at 0%. In the "mechanism of action" category, both ChatGPT models and BARD produced significantly more comprehensive outputs than Bing. For the "side effects" and "lifestyle" categories, the ChatGPT models showed significantly higher completeness than Bing. The GPT models, particularly GPT 4, demonstrated superior performance in providing accurate and comprehensive patient information about MTX use. However, the study also identified inaccuracies and shortcomings in the generated responses.
Affiliation(s)
- Belkis Nihan Coskun
- Division of Rheumatology, Department of Internal Medicine, Faculty of Medicine, Bursa Uludag University, Bursa, Turkey.
- Burcu Yagiz
- Division of Rheumatology, Department of Internal Medicine, Faculty of Medicine, Bursa Uludag University, Bursa, Turkey
- Gokhan Ocakoglu
- Department of Biostatistics, Faculty of Medicine, Bursa Uludag University, Bursa, Turkey
- Ediz Dalkilic
- Division of Rheumatology, Department of Internal Medicine, Faculty of Medicine, Bursa Uludag University, Bursa, Turkey
- Yavuz Pehlivan
- Division of Rheumatology, Department of Internal Medicine, Faculty of Medicine, Bursa Uludag University, Bursa, Turkey
16. Zalzal HG, Abraham A, Cheng J, Shah RK. Can ChatGPT help patients answer their otolaryngology questions? Laryngoscope Investig Otolaryngol 2024;9:e1193. PMID: 38362184. PMCID: PMC10866598. DOI: 10.1002/lio2.1193.
Abstract
Background Over the past year, the world has been captivated by the potential of artificial intelligence (AI). The appetite for AI in science, specifically healthcare is huge. It is imperative to understand the credibility of large language models in assisting the public in medical queries. Objective To evaluate the ability of ChatGPT to provide reasonably accurate answers to public queries within the domain of Otolaryngology. Methods Two board-certified otolaryngologists (HZ, RS) inputted 30 text-based patient queries into the ChatGPT-3.5 model. ChatGPT responses were rated by physicians on a scale (accurate, partially accurate, incorrect), while a similar 3-point scale involving confidence was given to layperson reviewers. Demographic data involving gender and education level was recorded for the public reviewers. Inter-rater agreement percentage was based on binomial distribution for calculating the 95% confidence intervals and performing significance tests. Statistical significance was defined as p < .05 for two-sided tests. Results In testing patient queries, both Otolaryngology physicians found that ChatGPT answered 98.3% of questions correctly, but only 79.8% (range 51.7%-100%) of patients were confident that the AI model was accurate in its responses (corrected agreement = 0.682; p < .001). Among the layperson responses, the corrected coefficient was of moderate agreement (0.571; p < .001). No correlation was noted among age, gender, or education level for the layperson responses. Conclusion ChatGPT is highly accurate in responding to questions posed by the public with regards to Otolaryngology from a physician standpoint. Public reviewers were not fully confident in believing the AI model, with subjective concerns related to less trust in AI answers compared to physician explanation. Larger evaluations with a representative public sample and broader medical questions should immediately be conducted by appropriate organizations, governing bodies, and/or governmental agencies to instill public confidence in AI and ChatGPT as a medical resource. Level of Evidence 4.
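The binomial 95% confidence intervals mentioned above can be computed from an agreement count and a total; the sketch below uses one binomial-based interval (Wilson score), and its counts are an assumption (59 of 60 physician gradings, consistent with the 98.3% figure) rather than values given in the abstract.

```python
from statsmodels.stats.proportion import proportion_confint

# Assumed illustrative counts: 59 of 60 gradings marked accurate (~98.3%).
low, high = proportion_confint(count=59, nobs=60, alpha=0.05, method="wilson")
print(f"95% CI for agreement: {low:.3f} to {high:.3f}")
```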
Affiliation(s)
- Habib G. Zalzal
- Division of Otolaryngology-Head and Neck Surgery, Children's National Hospital, Washington, District of Columbia, USA
- Jenhao Cheng
- Quality, Safety, Analytics, Children's National Hospital, Washington, District of Columbia, USA
- Rahul K. Shah
- Division of Otolaryngology-Head and Neck Surgery, Children's National Hospital, Washington, District of Columbia, USA
17. Zaleski AL, Berkowsky R, Craig KJT, Pescatello LS. Comprehensiveness, Accuracy, and Readability of Exercise Recommendations Provided by an AI-Based Chatbot: Mixed Methods Study. JMIR Med Educ 2024;10:e51308. PMID: 38206661. PMCID: PMC10811574. DOI: 10.2196/51308.
Abstract
BACKGROUND Regular physical activity is critical for health and disease prevention. Yet, health care providers and patients face barriers to implement evidence-based lifestyle recommendations. The potential to augment care with the increased availability of artificial intelligence (AI) technologies is limitless; however, the suitability of AI-generated exercise recommendations has yet to be explored. OBJECTIVE The purpose of this study was to assess the comprehensiveness, accuracy, and readability of individualized exercise recommendations generated by a novel AI chatbot. METHODS A coding scheme was developed to score AI-generated exercise recommendations across ten categories informed by gold-standard exercise recommendations, including (1) health condition-specific benefits of exercise, (2) exercise preparticipation health screening, (3) frequency, (4) intensity, (5) time, (6) type, (7) volume, (8) progression, (9) special considerations, and (10) references to the primary literature. The AI chatbot was prompted to provide individualized exercise recommendations for 26 clinical populations using an open-source application programming interface. Two independent reviewers coded AI-generated content for each category and calculated comprehensiveness (%) and factual accuracy (%) on a scale of 0%-100%. Readability was assessed using the Flesch-Kincaid formula. Qualitative analysis identified and categorized themes from AI-generated output. RESULTS AI-generated exercise recommendations were 41.2% (107/260) comprehensive and 90.7% (146/161) accurate, with the majority (8/15, 53%) of inaccuracy related to the need for exercise preparticipation medical clearance. Average readability level of AI-generated exercise recommendations was at the college level (mean 13.7, SD 1.7), with an average Flesch reading ease score of 31.1 (SD 7.7). Several recurring themes and observations of AI-generated output included concern for liability and safety, preference for aerobic exercise, and potential bias and direct discrimination against certain age-based populations and individuals with disabilities. CONCLUSIONS There were notable gaps in the comprehensiveness, accuracy, and readability of AI-generated exercise recommendations. Exercise and health care professionals should be aware of these limitations when using and endorsing AI-based technologies as a tool to support lifestyle change involving exercise.
Affiliation(s)
- Amanda L Zaleski
- Clinical Evidence Development, Aetna Medical Affairs, CVS Health Corporation, Hartford, CT, United States
- Department of Preventive Cardiology, Hartford Hospital, Hartford, CT, United States
- Rachel Berkowsky
- Department of Kinesiology, University of Connecticut, Storrs, CT, United States
- Kelly Jean Thomas Craig
- Clinical Evidence Development, Aetna Medical Affairs, CVS Health Corporation, Hartford, CT, United States
- Linda S Pescatello
- Department of Kinesiology, University of Connecticut, Storrs, CT, United States
18. Campbell DJ, Estephan LE. ChatGPT for patient education: an evolving investigation. J Clin Sleep Med 2023;19:2135-2136. PMID: 37677075. PMCID: PMC10692945. DOI: 10.5664/jcsm.10808.
Affiliation(s)
- Daniel J. Campbell
- Department of Otolaryngology–Head and Neck Surgery, Thomas Jefferson University Hospitals, Philadelphia, Pennsylvania
- Leonard E. Estephan
- Department of Otolaryngology–Head and Neck Surgery, Thomas Jefferson University Hospitals, Philadelphia, Pennsylvania