Dihan Q, Chauhan MZ, Eleiwa TK, Hassan AK, Sallam AB, Khouri AS, Chang TC, Elhusseiny AM. Using Large Language Models to Generate Educational Materials on Childhood Glaucoma.
Am J Ophthalmol 2024:S0002-9394(24)00144-2. [PMID:
38614196 DOI:
10.1016/j.ajo.2024.04.004]
[Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/14/2023] [Revised: 03/29/2024] [Accepted: 04/03/2024] [Indexed: 04/15/2024]
Abstract
PURPOSE
To evaluate the quality, readability, and accuracy of large language model (LLM) generated patient education materials (PEMs) on childhood glaucoma, and their ability to improve existing online information's readability.
DESIGN
Cross-sectional comparative study.
METHODS
We evaluated responses of ChatGPT-3.5, ChatGPT-4, and Bard to three separate prompts requesting they write PEMs on "childhood glaucoma." Prompt A required PEMs be "easily understandable by the average American." Prompt B required PEMs be written "at a 6th-grade level using Simple Measure of Gobbledygook (SMOG) readability formula." We then compared responses' quality (DISCERN questionnaire, Patient Education Materials Assessment Tool (PEMAT)), readability (SMOG, Flesch-Kincaid Grading Level (FKGL)), and accuracy (Likert Misinformation scale). To assess the improvement of readability for existing online information, Prompt C requested LLM rewrite 20 resources from a Google search of keyword "childhood glaucoma" to the American Medical Association-recommended "6th-grade level." Rewrites were compared on key metrics such as readability, complex words (≥3 syllables), and sentence count.
RESULTS
All 3 LLM generated PEMs that were of high quality, understandability, and accuracy (DISCERN≥4, ≥70% PEMAT understandability, Misinformation score=1). Prompt B responses were more readable than Prompt A responses for all 3 LLM (p≤0.001). ChatGPT-4 generated the most readable PEMs compared to ChatGPT-3.5 and Bard (p≤0.001). Although Prompt C responses showed consistent reduction of mean SMOG and FKGL scores, only ChatGPT-4 achieved the specified 6th-grade reading level (4.8 ± 0.8 and 3.7 ± 1.9, respectively).
CONCLUSION
LLMs can serve as strong supplementary tools in generating high quality, accurate, and novel PEMs, and improving the readability of existing PEMs on childhood glaucoma.
Collapse