1. Lee TJ, Campbell DJ, Rao AK, Hossain A, Elkattawy O, Radfar N, Lee P, Gardin JM. Evaluating ChatGPT Responses on Atrial Fibrillation for Patient Education. Cureus 2024;16:e61680. PMID: 38841294; PMCID: PMC11151148; DOI: 10.7759/cureus.61680.
Abstract
Background: ChatGPT is a language model that has gained widespread popularity for its fine-tuned conversational abilities. However, a known drawback of the artificial intelligence (AI) chatbot is its tendency to confidently present users with inaccurate information. We evaluated the quality of ChatGPT responses to questions pertaining to atrial fibrillation for patient education, including the accuracy and estimated grade level of answers and whether references were provided.

Methodology: ChatGPT was prompted in four different ways and asked 16 frequently asked questions on atrial fibrillation from the American Heart Association. The prompting conditions were Form 1 (no prompt), Form 2 (patient-friendly prompt), Form 3 (physician-level prompt), and Form 4 (prompting for statistics/references). Responses were scored as incorrect, partially correct, correct, or correct with references (perfect). Flesch-Kincaid grade level, unique word count, and response length were recorded for each answer. Proportions of responses at each score were compared using chi-square analysis, and the relationship between form and grade level was assessed using analysis of variance (ANOVA).

Results: Across all forms, scoring frequencies were one (1.6%) incorrect, five (7.8%) partially correct, 55 (85.9%) correct, and three (4.7%) perfect. The proportion of responses that were at least correct did not differ by form (p = 0.350), but the proportion of perfect responses did (p = 0.001). Form 2 answers had a lower mean grade level (12.80 ± 3.38) than Forms 1 (14.23 ± 2.34), 3 (16.73 ± 2.65), and 4 (14.85 ± 2.76) (p < 0.05). Across all forms, references were provided in only three (4.7%) answers; notably, even when explicitly prompted for sources or references, ChatGPT provided them in only three of 16 responses (18.8%).

Conclusions: ChatGPT holds significant potential for enhancing patient education through accurate, adaptive responses. Its ability to alter response complexity based on user input, combined with high accuracy rates, supports its use as an informational resource in healthcare settings. Future advancements and continuous monitoring of AI capabilities will be crucial in maximizing the benefits while mitigating the risks of AI-driven patient education.
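A minimal sketch, assuming scipy is installed, of the two analyses this abstract names: a chi-square comparison of score proportions across the four prompt forms and a one-way ANOVA relating form to Flesch-Kincaid grade level. All counts and grade-level values below are hypothetical placeholders, not the study's data.

```python
# Hypothetical illustration of the abstract's statistics; not the study's data.
from scipy.stats import chi2_contingency, f_oneway

# Rows: Forms 1-4; columns: [at-least-correct, below-correct] out of
# 16 questions per form (placeholder counts).
score_table = [
    [15, 1],
    [16, 0],
    [14, 2],
    [13, 3],
]
chi2, p_chi2, dof, expected = chi2_contingency(score_table)
print(f"chi-square = {chi2:.2f}, dof = {dof}, p = {p_chi2:.3f}")

# One-way ANOVA: does mean Flesch-Kincaid grade level differ by form?
# (placeholder grade levels, four answers per form)
form1 = [14.1, 13.9, 14.5, 14.4]
form2 = [12.5, 13.0, 12.9, 12.8]
form3 = [16.9, 16.5, 16.8, 16.7]
form4 = [14.6, 15.0, 14.9, 14.9]
f_stat, p_anova = f_oneway(form1, form2, form3, form4)
print(f"F = {f_stat:.2f}, p = {p_anova:.4f}")
```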
Affiliation(s)
- Thomas J Lee
- Department of Medicine, Rutgers University New Jersey Medical School, Newark, USA
- Daniel J Campbell
- Department of Otolaryngology-Head and Neck Surgery, Thomas Jefferson University Hospital, Philadelphia, USA
- Abhinav K Rao
- Department of Medicine, Trident Medical Center, Charleston, USA
- Afif Hossain
- Department of Medicine/Division of Cardiology, Rutgers University New Jersey Medical School, Newark, USA
- Omar Elkattawy
- Department of Medicine, Rutgers University New Jersey Medical School, Newark, USA
- Navid Radfar
- Department of Medicine, Rutgers University New Jersey Medical School, Newark, USA
- Paul Lee
- Department of Medicine, Rutgers University New Jersey Medical School, Newark, USA
- Julius M Gardin
- Department of Medicine/Division of Cardiology, Rutgers University New Jersey Medical School, Newark, USA
2. Lee TJ, Rao AK, Campbell DJ, Radfar N, Dayal M, Khrais A. Evaluating ChatGPT-3.5 and ChatGPT-4.0 Responses on Hyperlipidemia for Patient Education. Cureus 2024;16:e61067. PMID: 38803402; PMCID: PMC11128363; DOI: 10.7759/cureus.61067.
Abstract
Introduction: Hyperlipidemia is prevalent worldwide and affects a significant number of US adults, contributing substantially to ischemic heart disease and millions of deaths annually. With the increasing use of the internet for health information, tools like ChatGPT (OpenAI, San Francisco, CA, USA) have gained traction. ChatGPT version 4.0, launched in March 2023, offers enhanced features over its predecessor but requires a monthly fee. This study compares the accuracy, comprehensibility, and response length of the free and paid versions of ChatGPT for patient education on hyperlipidemia.

Materials and methods: ChatGPT versions 3.5 and 4.0 were each prompted in three different ways and asked 25 questions from the Cleveland Clinic's frequently asked questions (FAQs) on hyperlipidemia. Prompts included no prompting (Form 1), patient-friendly prompting (Form 2), and physician-level prompting (Form 3). Responses were categorized as incorrect, partially correct, or correct. The grade level and word count of each response were also recorded for analysis.

Results: Scoring frequencies for ChatGPT version 3.5 were five (6.67%) incorrect, 18 (24.00%) partially correct, and 52 (69.33%) correct; for ChatGPT version 4.0, they were one (1.33%) incorrect, 18 (24.00%) partially correct, and 56 (74.67%) correct. The proportion of correct answers did not differ significantly between versions (p = 0.586). ChatGPT version 3.5 had a significantly higher grade reading level (p = 0.0002) and a significantly higher word count (p = 0.0073) than version 4.0.

Discussion: There was no significant difference in accuracy between the free and paid versions' responses to hyperlipidemia FAQs. Both versions provided accurate but sometimes incomplete responses. Version 4.0 offered more concise and readable information, aligning with the readability of most online medical resources despite exceeding the National Institutes of Health's (NIH's) recommended eighth-grade reading level. The paid version also demonstrated superior adaptability in tailoring responses to the prompt.

Conclusion: Both versions of ChatGPT provide reliable medical information, with the paid version offering more adaptable and readable responses. Healthcare providers can recommend ChatGPT as a source of patient education, regardless of the version used. Future research should explore diverse question formulations and ChatGPT's handling of incorrect information.
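For the version comparison above, the correct-response counts reported in the abstract (52/75 for version 3.5 vs. 56/75 for version 4.0) can be checked with a chi-square test, and the readability comparison follows the same pattern as an unpaired t test. A sketch assuming scipy; the grade-level values are hypothetical placeholders.

```python
from scipy.stats import chi2_contingency, ttest_ind

# Correct vs. not-correct counts per version, taken from the abstract:
# v3.5: 52 of 75 correct; v4.0: 56 of 75 correct.
accuracy_table = [[52, 23], [56, 19]]
chi2, p_acc, dof, _ = chi2_contingency(accuracy_table)  # Yates-corrected 2x2
print(f"accuracy: chi-square = {chi2:.2f}, p = {p_acc:.3f}")
# p comes out near 0.59, consistent with the abstract's p = 0.586.

# Unpaired t test on per-response grade levels (hypothetical values).
fkgl_v35 = [14.2, 13.8, 15.1, 14.5, 13.9]
fkgl_v40 = [11.9, 12.3, 11.5, 12.1, 11.8]
t_stat, p_grade = ttest_ind(fkgl_v35, fkgl_v40)
print(f"grade level: t = {t_stat:.2f}, p = {p_grade:.4f}")
```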
Affiliation(s)
- Thomas J Lee
- Department of Medicine, Rutgers University New Jersey Medical School, Newark, USA
- Abhinav K Rao
- Department of Medicine, Trident Medical Center, Charleston, USA
- Daniel J Campbell
- Department of Otolaryngology-Head and Neck Surgery, Thomas Jefferson University Hospital, Philadelphia, USA
- Navid Radfar
- Department of Medicine, Rutgers University New Jersey Medical School, Newark, USA
- Manik Dayal
- Department of Medicine, Rutgers University New Jersey Medical School, Newark, USA
- Ayham Khrais
- Department of Medicine, Rutgers University New Jersey Medical School, Newark, USA
3. Zaleski AL, Berkowsky R, Craig KJT, Pescatello LS. Comprehensiveness, Accuracy, and Readability of Exercise Recommendations Provided by an AI-Based Chatbot: Mixed Methods Study. JMIR Med Educ 2024;10:e51308. PMID: 38206661; PMCID: PMC10811574; DOI: 10.2196/51308.
Abstract
BACKGROUND: Regular physical activity is critical for health and disease prevention, yet health care providers and patients face barriers to implementing evidence-based lifestyle recommendations. The increased availability of artificial intelligence (AI) technologies offers considerable potential to augment care; however, the suitability of AI-generated exercise recommendations has yet to be explored.

OBJECTIVE: The purpose of this study was to assess the comprehensiveness, accuracy, and readability of individualized exercise recommendations generated by a novel AI chatbot.

METHODS: A coding scheme was developed to score AI-generated exercise recommendations across ten categories informed by gold-standard exercise recommendations: (1) health condition-specific benefits of exercise, (2) exercise preparticipation health screening, (3) frequency, (4) intensity, (5) time, (6) type, (7) volume, (8) progression, (9) special considerations, and (10) references to the primary literature. The AI chatbot was prompted to provide individualized exercise recommendations for 26 clinical populations using an open-source application programming interface. Two independent reviewers coded the AI-generated content for each category and calculated comprehensiveness (%) and factual accuracy (%) on a scale of 0%-100%. Readability was assessed using the Flesch-Kincaid formula. Qualitative analysis identified and categorized themes from the AI-generated output.

RESULTS: AI-generated exercise recommendations were 41.2% (107/260) comprehensive and 90.7% (146/161) accurate, with the majority (8/15, 53%) of inaccuracies related to the need for exercise preparticipation medical clearance. The average readability of AI-generated exercise recommendations was at the college level (Flesch-Kincaid grade level: mean 13.7, SD 1.7), with an average Flesch reading ease score of 31.1 (SD 7.7). Recurring themes in the AI-generated output included concern for liability and safety, preference for aerobic exercise, and potential bias and direct discrimination against certain age-based populations and individuals with disabilities.

CONCLUSIONS: There were notable gaps in the comprehensiveness, accuracy, and readability of the AI-generated exercise recommendations. Exercise and health care professionals should be aware of these limitations when using and endorsing AI-based technologies as a tool to support lifestyle change involving exercise.
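A sketch of how the ten-category coding scheme could be tallied into the comprehensiveness percentage reported above (107 of 260 category cells across 26 populations). The category names come from the abstract; the per-population flags are hypothetical.

```python
# The ten coding categories named in the abstract.
CATEGORIES = [
    "condition-specific benefits", "preparticipation health screening",
    "frequency", "intensity", "time", "type", "volume",
    "progression", "special considerations", "primary-literature references",
]

def comprehensiveness(flags_by_population):
    """Share of covered category cells across all populations (0-1)."""
    covered = sum(sum(flags) for flags in flags_by_population)
    total = sum(len(flags) for flags in flags_by_population)
    return covered / total

# Hypothetical coding: two populations, 1 = category addressed, 0 = absent.
coded = [
    [1, 0, 1, 1, 0, 1, 0, 0, 0, 0],  # population A: 4/10 covered
    [1, 0, 1, 1, 1, 1, 0, 1, 0, 0],  # population B: 6/10 covered
]
print(f"comprehensiveness = {comprehensiveness(coded):.1%}")  # 50.0%
```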
Affiliation(s)
- Amanda L Zaleski
- Clinical Evidence Development, Aetna Medical Affairs, CVS Health Corporation, Hartford, CT, United States
- Department of Preventive Cardiology, Hartford Hospital, Hartford, CT, United States
- Rachel Berkowsky
- Department of Kinesiology, University of Connecticut, Storrs, CT, United States
- Kelly Jean Thomas Craig
- Clinical Evidence Development, Aetna Medical Affairs, CVS Health Corporation, Hartford, CT, United States
- Linda S Pescatello
- Department of Kinesiology, University of Connecticut, Storrs, CT, United States
4. Seneviratne NU, Ho SY, Boro A, Correa DJ. Readability and content gaps in online epilepsy surgery materials as potential health literacy and shared-decision-making barriers. Epilepsia Open 2023;8:1566-1575. PMID: 37805810; PMCID: PMC10690683; DOI: 10.1002/epi4.12842.
Abstract
OBJECTIVE: Epilepsy surgery is an effective albeit underused treatment for refractory epilepsy, and online materials are vital to patient understanding of the complex process. Our goal was to analyze the readability and content inclusion of online patient health education materials on epilepsy surgery.

METHODS: A private browser setting was used on Google and Bing to identify the top 100 search results for the term "epilepsy+surgery". Scientific papers, insurance pages, paywalled sites, and non-text content were excluded. Website text was reformatted to exclude graphics, contact information, links, and headers, and readability metrics were calculated using an online tool. Text content was analyzed for inclusion of important concepts (presurgical evaluation, complications, risks of continued seizures, types of surgery, and complementary diagrams/audiovisual material). Readability and content inclusion were compared as a function of organization type (epilepsy center, community health organization, pediatric-specific) and location (region, country).

RESULTS: The browser search yielded 82 distinct websites with information regarding epilepsy surgery, 98.7% of which exceeded the recommended 6th-grade reading level for health information. Epilepsy centers had significantly worse readability (Flesch-Kincaid Grade Level (FKGL), P < 0.01; Flesch Reading Ease (FRE), P < 0.05). Content analysis showed that only 37% of websites discussed surgical side effects and only 23% mentioned the risks of continued seizures. Epilepsy centers were less likely to report information on surgical side effects (P < 0.001). UK-based websites had better readability (FKGL, P < 0.01; FRE, P < 0.01) and were more likely to discuss side effects (P = 0.01) than US-based websites.

SIGNIFICANCE: The majority of online health content on epilepsy surgery is overly complex and relatively incomplete in multiple key areas important to health literacy and understanding of surgical candidacy. Our findings suggest that academic organizations, including level 4 epilepsy centers, need to simplify and broaden their online education resources. More comprehensive, publicly accessible, and readable information may lead to better shared decision-making.
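The content analysis above was presumably performed by human reviewers; as a rough stand-in, the sketch below flags whether a website's text mentions each key concept via keyword matching. The concept keywords are assumptions for illustration only, not the study's coding criteria.

```python
# Rough keyword stand-in for manual content-inclusion coding;
# keyword lists are illustrative assumptions, not the study's criteria.
CONCEPTS = {
    "presurgical evaluation": ("presurgical", "pre-surgical", "evaluation"),
    "complications/side effects": ("complication", "side effect"),
    "risks of continued seizures": ("continued seizure", "ongoing seizure"),
    "types of surgery": ("resection", "lobectomy", "laser ablation"),
}

def content_inclusion(page_text: str) -> dict:
    """Map each concept to True if any of its keywords appears in the text."""
    text = page_text.lower()
    return {concept: any(kw in text for kw in keywords)
            for concept, keywords in CONCEPTS.items()}

sample = ("Temporal lobectomy is one type of resection. Complications are "
          "rare, but your team will review them during evaluation.")
print(content_inclusion(sample))
```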
Affiliation(s)
- Sophey Y. Ho
- Albert Einstein College of Medicine, The Bronx, New York, USA
- Alexis Boro
- Saul R. Korey Department of Neurology, Albert Einstein College of Medicine, The Bronx, New York, USA
- Daniel J. Correa
- Saul R. Korey Department of Neurology, Albert Einstein College of Medicine, The Bronx, New York, USA
5. Kaya E, Görmez S. Quality and readability of online information on plantar fasciitis and calcaneal spur. Rheumatol Int 2022;42:1965-1972. PMID: 35763090; DOI: 10.1007/s00296-022-05165-6.
Abstract
Plantar fasciitis and calcaneal spur are common causes of heel pain in the community, and people use the internet to obtain medical information about these conditions. We reviewed internet information sources on plantar fasciitis and calcaneal spur for quality and readability. The first 50 websites for each search term ("calcaneal spur", "heel spur", and "plantar fasciitis") were retrieved from www.google.com. Six validated tools were used to assess information quality and readability, and included websites were checked for HONcode (Health On the Net Foundation Code of Conduct) certification. The total mean DISCERN score was 50.52 ± 14.62, and the total mean JAMA (Journal of the American Medical Association) benchmark score was 2.42 ± 1.26. In total, 25.72% of the 97 websites carried the HONcode seal. Average readability scores were Flesch-Kincaid Grade Level (FKGL) 7.27 ± 1.71, Gunning Fog 8.46 ± 2.17, Simple Measure of Gobbledygook (SMOG) 6.89 ± 1.24, and Coleman-Liau Index 15.56 ± 1.85. For-profit websites were the most common source type, and overall website quality and readability were moderate. A significant proportion of the websites have a financial bias and provide low-quality information; a mechanism for systematically monitoring the quality and readability of online information should be established.
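The four readability indices reported above have standard published formulas. The sketch below implements them with a naive vowel-group syllable counter; production tools (e.g., the textstat package) tokenize more carefully, so treat this as an approximation.

```python
import math
import re

def words(text):
    return re.findall(r"[A-Za-z]+", text)

def sentence_count(text):
    # Count runs of sentence terminators; assume at least one sentence.
    return max(1, len(re.findall(r"[.!?]+", text)))

def syllables(word):
    # Naive heuristic: count runs of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text):
    ws = words(text)
    n_w, n_s = len(ws), sentence_count(text)
    n_syll = sum(syllables(w) for w in ws)
    n_complex = sum(1 for w in ws if syllables(w) >= 3)  # 3+ syllables
    n_letters = sum(len(w) for w in ws)
    return {
        # Flesch-Kincaid Grade Level
        "FKGL": 0.39 * n_w / n_s + 11.8 * n_syll / n_w - 15.59,
        # Gunning Fog index
        "Fog": 0.4 * (n_w / n_s + 100 * n_complex / n_w),
        # Simple Measure of Gobbledygook
        "SMOG": 1.043 * math.sqrt(n_complex * 30 / n_s) + 3.1291,
        # Coleman-Liau index (L = letters/100 words, S = sentences/100 words)
        "CLI": 0.0588 * (100 * n_letters / n_w)
               - 0.296 * (100 * n_s / n_w) - 15.8,
    }

print(readability("Plantar fasciitis is a common cause of heel pain. "
                  "Conservative treatment usually relieves the symptoms."))
```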
Affiliation(s)
- Erhan Kaya
- Department of Public Health, Faculty of Medicine, Kahramanmaras Sutcu Imam University, Kahramanmaras, Turkey
- Sinan Görmez
- Department of Orthopedics and Traumatology, Bulancak State Hospital, Giresun, Turkey
6. Shneyderman M, Snow GE, Davis R, Best S, Akst LM. Readability of Online Materials Related to Vocal Cord Leukoplakia. OTO Open 2021;5:2473974X211032644. PMID: 34396027; PMCID: PMC8358515; DOI: 10.1177/2473974X211032644.
Abstract
Objectives: To assess the readability and understandability of online materials for vocal cord leukoplakia.

Study Design: Review of online materials.

Setting: Academic medical center.

Methods: A Google search of "vocal cord leukoplakia" was performed, and the first 50 websites were considered for analysis. Readability was measured by the Flesch Reading Ease Score (FRES), Flesch-Kincaid Grade Level (FKGL), and Simple Measure of Gobbledygook (SMOG). Understandability and actionability were assessed by 2 independent reviewers with the PEMAT-P (Patient Education Materials Assessment Tool for Printable Materials). Unpaired t tests compared scores between sites aimed at physicians and those aimed at patients, and Cohen's kappa was calculated to measure interrater reliability.

Results: Twenty-two websites (17 patient oriented, 5 physician oriented) met inclusion criteria. For the entire cohort, FRES, FKGL, and SMOG scores (mean ± SD) were 36.90 ± 20.65, 12.96 ± 3.28, and 15.65 ± 3.57, respectively, indicating that materials were difficult to read and written above a 12th-grade level. PEMAT-P understandability and actionability scores were 73.65% ± 7.05% and 13.63% ± 22.47%, respectively. Patient-oriented sites were statistically easier to read than physician-oriented sites (P < .02 for each of the FRES, FKGL, and SMOG comparisons); understandability and actionability scores did not differ between these categories.

Conclusion: Online materials for vocal cord leukoplakia are written at a level more advanced than what is recommended for patient education materials. Awareness of the ways these online materials currently fail patients may lead to improved education materials in the future.
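For the interrater-reliability step above, Cohen's kappa can be computed from the two reviewers' binary PEMAT-P item ratings. A minimal sketch assuming scikit-learn is available; the ratings below are hypothetical.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical PEMAT-P item ratings for one website:
# 1 = item satisfied, 0 = not satisfied, one entry per instrument item.
reviewer_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
reviewer_b = [1, 1, 0, 1, 1, 1, 1, 0, 1, 0]

kappa = cohen_kappa_score(reviewer_a, reviewer_b)
print(f"Cohen's kappa = {kappa:.2f}")
```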
Affiliation(s)
- Grace E Snow
- Department of Otolaryngology-Head and Neck Surgery, School of Medicine, Johns Hopkins University, Baltimore, Maryland, USA
- Ruth Davis
- Department of Otolaryngology-Head and Neck Surgery, School of Medicine, Johns Hopkins University, Baltimore, Maryland, USA
- Simon Best
- Department of Otolaryngology-Head and Neck Surgery, School of Medicine, Johns Hopkins University, Baltimore, Maryland, USA
- Lee M Akst
- Department of Otolaryngology-Head and Neck Surgery, School of Medicine, Johns Hopkins University, Baltimore, Maryland, USA