1. Dharia SN, Traversone J, Wortman R, Mulligan M. Assessing the quality and readability of ChatGPT responses to frequently asked questions about trigger finger release. J Plast Reconstr Aesthet Surg 2025;105:170-172. PMID: 40300385. DOI: 10.1016/j.bjps.2025.04.033.
Abstract
Artificial intelligence, specifically large language models like ChatGPT, is rapidly transforming the healthcare landscape. As ChatGPT becomes more popular for obtaining medical information, there are concerns regarding the accuracy and quality of its content. While prior studies in various medical specialties have yielded mixed results regarding ChatGPT's reliability, little research has focused on its ability to address questions regarding specific orthopedic procedures, such as surgical intervention for stenosing tenosynovitis (trigger finger). This study assessed the accuracy, clarity, and readability of ChatGPT's responses to ten commonly asked patient questions regarding trigger finger release. The questions were obtained from Google's "People also ask" section and queried in ChatGPT 4.0 on September 24, 2024. Responses were evaluated by two authors using a four-point accuracy scale. Additionally, the education level required to understand the responses was assessed using the Flesch-Kincaid scale. ChatGPT's responses achieved an average score of 1.9, falling between "excellent, requiring no clarification" and "satisfactory, requiring minimal clarification." Although the chatbot provided largely accurate information, it produced an incorrect response in one case and displayed occasional factual inaccuracies, particularly regarding treatment recommendations. The average reading level of responses was at a 12th-grade level, which exceeds the recommended 7th-8th-grade level for patient materials. ChatGPT can serve as a useful starting point for patients seeking information about orthopedic procedures like trigger finger release, but healthcare providers should guide patients in validating AI-generated content to enhance medical literacy and ensure accurate understanding.
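The Flesch-Kincaid grade level used in this assessment is a deterministic formula over sentence, word, and syllable counts, so the readability step can be reproduced programmatically. The following is a minimal sketch of that calculation; the vowel-group syllable counter and the sample response are illustrative assumptions, not the authors' actual scoring pipeline.

```python
import re

def count_syllables(word: str) -> int:
    # Rough heuristic: count groups of consecutive vowels (sufficient for a sketch).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    # Split into sentences and words with simple regexes.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    # Standard Flesch-Kincaid Grade Level formula.
    return 0.39 * (len(words) / len(sentences)) + 11.8 * (syllables / len(words)) - 15.59

# Hypothetical chatbot response used only to demonstrate the calculation.
sample = ("Trigger finger release is a brief outpatient procedure. "
          "The surgeon divides the A1 pulley so the tendon can glide freely.")
print(round(flesch_kincaid_grade(sample), 1))
```

Applying the same function to each chatbot answer and averaging the results mirrors the kind of grade-level summary reported in the study.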
Affiliation(s)
- Sohil N Dharia: Department of Orthopaedic Surgery, Albany Medical Center, Albany, NY, USA
- John Traversone: Department of Orthopaedic Surgery, Albany Medical Center, Albany, NY, USA
- Ryan Wortman: Department of Orthopaedic Surgery, Albany Medical Center, Albany, NY, USA
- Michael Mulligan: Department of Orthopaedic Surgery, Albany Medical Center, Albany, NY, USA
2. Isch EL, Lee J, Self DM, Sambangi A, Habarth-Morales TE, Vaile J, Caterson EJ. Artificial Intelligence in Surgical Coding: Evaluating Large Language Models for Current Procedural Terminology Accuracy in Hand Surgery. J Hand Surg Glob Online 2025;7:181-185. PMID: 40182863. PMCID: PMC11963066. DOI: 10.1016/j.jhsg.2024.11.013.
Abstract
Purpose The advent of large language models (LLMs) like ChatGPT has introduced notable advancements in various surgical disciplines. These developments have led to an increased interest in the use of LLMs for Current Procedural Terminology (CPT) coding in surgery. With CPT coding being a complex and time-consuming process, often exacerbated by the scarcity of professional coders, there is a pressing need for innovative solutions to enhance coding efficiency and accuracy. Methods This observational study evaluated the effectiveness of five publicly available large language models (Perplexity.AI, Bard, BingAI, ChatGPT 3.5, and ChatGPT 4.0) in accurately identifying CPT codes for hand surgery procedures. A consistent query format was employed to test each model, ensuring the inclusion of detailed procedure components where necessary. The responses were classified as correct, partially correct, or incorrect based on their alignment with established CPT coding for the specified procedures. Results In the evaluation of artificial intelligence (AI) model performance on simple procedures, Perplexity.AI achieved the highest number of correct outcomes (15), followed by Bard and Bing AI (14 each). ChatGPT 4 and ChatGPT 3.5 yielded 8 and 7 correct outcomes, respectively. For complex procedures, Perplexity.AI and Bard each had three correct outcomes, whereas ChatGPT models had none. Bing AI had the highest number of partially correct outcomes (5). There were significant associations between AI models and performance outcomes for both simple and complex procedures. Conclusions This study highlights the feasibility and potential benefits of integrating LLMs into the CPT coding process for hand surgery. The findings advocate for further refinement and training of AI models to improve their accuracy and practicality, suggesting a future where AI-assisted coding could become a standard component of surgical workflows, aligning with the ongoing digital transformation in health care. Type of study/level of evidence Observational, IIIb.
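The workflow described here (a consistent query per procedure, then grading the returned codes as correct, partially correct, or incorrect) is straightforward to script. The sketch below is hypothetical: the prompt wording, the model name, and the small answer key are assumptions, since the study does not publish its exact queries.

```python
from openai import OpenAI  # assumes the openai package is installed and OPENAI_API_KEY is set

client = OpenAI()

# Small illustrative answer key; the study graded against established CPT coding per procedure.
procedures = {
    "Open carpal tunnel release": {"64721"},
    "Trigger finger release, single digit": {"26055"},
}

def grade(returned: set[str], expected: set[str]) -> str:
    # Classify a response as correct, partially correct, or incorrect.
    if returned == expected:
        return "correct"
    if returned & expected:
        return "partially correct"
    return "incorrect"

for name, expected in procedures.items():
    # A consistent query format, analogous in spirit to the study's prompting.
    reply = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user",
                   "content": f"List the CPT code(s) for the following hand surgery procedure: {name}. "
                              f"Return only the numeric codes."}],
    )
    codes = {c.strip(".,") for c in reply.choices[0].message.content.split() if c.strip(".,").isdigit()}
    print(name, "->", grade(codes, expected))
```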
Affiliation(s)
- Emily L. Isch: Department of General Surgery, Thomas Jefferson University, Philadelphia, PA
- Jamie Lee: Drexel University College of Medicine, Philadelphia, PA
- D. Mitchell Self: Department of Neurosurgery, Thomas Jefferson University and Jefferson Hospital for Neuroscience, Philadelphia, PA
- Abhijeet Sambangi: Sidney Kimmel Medical College at Thomas Jefferson University, Philadelphia, PA
- John Vaile: Sidney Kimmel Medical College at Thomas Jefferson University, Philadelphia, PA
- EJ Caterson: Department of Surgery, Division of Plastic Surgery, Nemours Children's Hospital, Wilmington, DE
3. Graul S, Pais MA, Loucas R, Rohrbach T, Volkmer E, Leitsch S, Holzbach T. Pilot Study on AI Image Analysis for Lower-Limb Reconstruction-Assessing ChatGPT-4's Recommendations in Comparison to Board-Certified Plastic Surgeons and Resident Physicians. Life (Basel) 2025;15:66. PMID: 39860006. PMCID: PMC11766909. DOI: 10.3390/life15010066.
Abstract
AI, especially ChatGPT, is impacting healthcare through applications in research, patient communication, and training. To our knowledge, this is the first study to examine ChatGPT-4's ability to analyze images of lower leg defects and to assess its understanding of complex case reports in comparison to the performance of board-certified surgeons and residents. We conducted a cross-sectional survey in Switzerland, Germany, and Austria, in which 52 participants reviewed images depicting lower leg defects within fictitious patient profiles and selected the optimal reconstruction techniques. The questionnaire included cases of varied difficulty, and the answer options did not always include the most obvious choices. The findings highlight that ChatGPT-4 successfully evaluated various reconstruction methods but struggled to determine the optimal solution from the available visual and written information. A chi-squared test of independence was performed to investigate the overall association between answer options (A, B, C, and D) and rater group (board-certified surgeons, ChatGPT-4, and residents). Inter-group rater associations showed significant overall test results (p < 0.001), with high agreement among board-certified surgeons. Our results suggest that board-certified plastic surgeons remain essential for patient-specific treatment planning, while AI can support decision-making. This reaffirms the role of AI as a supportive tool, rather than a replacement, in reconstructive surgery.
Affiliation(s)
- Silke Graul: Department of Hand and Plastic Surgery, Thurgau Hospital Group, 8501 Frauenfeld, Switzerland
- Michael A. Pais: Department of Hand and Plastic Surgery, Thurgau Hospital Group, 8501 Frauenfeld, Switzerland; Department for BioMedical Research, University of Bern, 3012 Bern, Switzerland
- Rafael Loucas: Department of Plastic, Hand and Reconstructive Surgery, University Hospital Regensburg, 93053 Regensburg, Germany
- Tobias Rohrbach: Australian Centre of Health Engagement, Evidence and Values (ACHEEV), University of Wollongong, Wollongong 2500, Australia
- Elias Volkmer: Department of Hand Surgery, Helios Klinikum Munich West, 81241 Munich, Germany
- Sebastian Leitsch: Department of Hand and Plastic Surgery, Thurgau Hospital Group, 8501 Frauenfeld, Switzerland
- Thomas Holzbach: Department of Hand and Plastic Surgery, Thurgau Hospital Group, 8501 Frauenfeld, Switzerland
4. Hendrix CG, Young S, Forro SD, Norris BL. Navigating the intersection of AI and orthopaedic trauma research: Promise, pitfalls, and the path forward. Injury 2025;56:112085. PMID: 39694774. DOI: 10.1016/j.injury.2024.112085.
Affiliation(s)
- Christopher G Hendrix: Oklahoma State Center for Health Sciences, Dept. of Orthopaedic Surgery, Tulsa, OK, USA; Trauma Institute, Saint Francis Health System, Tulsa, OK, USA
- Sean Young: Oklahoma State Center for Health Sciences, Dept. of Orthopaedic Surgery, Tulsa, OK, USA
- Stephen D Forro: Oklahoma State Center for Health Sciences, Dept. of Orthopaedic Surgery, Tulsa, OK, USA; Orthopaedic and Trauma Services of Oklahoma (OTSO), Tulsa, OK, USA
- Brent L Norris: Oklahoma State Center for Health Sciences, Dept. of Orthopaedic Surgery, Tulsa, OK, USA; Orthopaedic and Trauma Services of Oklahoma (OTSO), Tulsa, OK, USA
5. Swisher AR, Wu AW, Liu GC, Lee MK, Carle TR, Tang DM. Enhancing Health Literacy: Evaluating the Readability of Patient Handouts Revised by ChatGPT's Large Language Model. Otolaryngol Head Neck Surg 2024;171:1751-1757. PMID: 39105460. DOI: 10.1002/ohn.927.
Abstract
OBJECTIVE To use an artificial intelligence (AI)-powered large language model (LLM) to improve readability of patient handouts. STUDY DESIGN Review of online material modified by AI. SETTING Academic center. METHODS Five handout materials obtained from the American Rhinologic Society (ARS) and the American Academy of Facial Plastic and Reconstructive Surgery websites were assessed using validated readability metrics. The handouts were inputted into OpenAI's ChatGPT-4 after prompting: "Rewrite the following at a 6th-grade reading level." The understandability and actionability of both native and LLM-revised versions were evaluated using the Patient Education Materials Assessment Tool (PEMAT). Results were compared using Wilcoxon rank-sum tests. RESULTS The mean readability scores of the standard (ARS, American Academy of Facial Plastic and Reconstructive Surgery) materials corresponded to "difficult," with reading categories ranging between high school and university grade levels. Conversely, the LLM-revised handouts had an average seventh-grade reading level. LLM-revised handouts had better readability in nearly all metrics tested: Flesch-Kincaid Reading Ease (70.8 vs 43.9; P < .05), Gunning Fog Score (10.2 vs 14.42; P < .05), Simple Measure of Gobbledygook (9.9 vs 13.1; P < .05), Coleman-Liau (8.8 vs 12.6; P < .05), and Automated Readability Index (8.2 vs 10.7; P = .06). PEMAT scores were significantly higher in the LLM-revised handouts for understandability (91 vs 74%; P < .05) with similar actionability (42 vs 34%; P = .15) when compared to the standard materials. CONCLUSION Patient-facing handouts can be augmented by ChatGPT with simple prompting to tailor information with improved readability. This study demonstrates the utility of LLMs to aid in rewriting patient handouts and may serve as a tool to help optimize education materials. LEVEL OF EVIDENCE Level VI.
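The statistical comparison reported here, Wilcoxon rank-sum tests on readability scores of standard versus LLM-revised handouts, can be sketched as follows. The score vectors below are invented placeholders standing in for the five handouts, not the study's data.

```python
from scipy.stats import ranksums

# Illustrative Flesch Reading Ease scores for five handouts (higher = easier to read).
native_fre = [41.0, 45.5, 39.8, 47.2, 44.0]
revised_fre = [68.9, 72.4, 70.1, 74.3, 69.5]

# Two-sided Wilcoxon rank-sum test comparing the two independent groups.
stat, p_value = ranksums(native_fre, revised_fre)
print(f"rank-sum statistic = {stat:.2f}, p = {p_value:.4f}")
```

The same call can be repeated for each readability metric (Gunning Fog, SMOG, Coleman-Liau, ARI) to build a comparison table like the one summarized in the abstract.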
Affiliation(s)
- Austin R Swisher: Department of Otolaryngology-Head and Neck Surgery, Mayo Clinic, Phoenix, Arizona, USA
- Arthur W Wu: Division of Otolaryngology-Head and Neck Surgery, Cedars-Sinai, Los Angeles, California, USA
- Gene C Liu: Division of Otolaryngology-Head and Neck Surgery, Cedars-Sinai, Los Angeles, California, USA
- Matthew K Lee: Division of Otolaryngology-Head and Neck Surgery, Cedars-Sinai, Los Angeles, California, USA
- Taylor R Carle: Division of Otolaryngology-Head and Neck Surgery, Cedars-Sinai, Los Angeles, California, USA
- Dennis M Tang: Division of Otolaryngology-Head and Neck Surgery, Cedars-Sinai, Los Angeles, California, USA
6. Lim B, Seth I, Cuomo R, Kenney PS, Ross RJ, Sofiadellis F, Pentangelo P, Ceccaroni A, Alfano C, Rozen WM. Can AI Answer My Questions? Utilizing Artificial Intelligence in the Perioperative Assessment for Abdominoplasty Patients. Aesthetic Plast Surg 2024;48:4712-4724. PMID: 38898239. PMCID: PMC11645314. DOI: 10.1007/s00266-024-04157-0.
Abstract
BACKGROUND Abdominoplasty is a common operation, used for a range of cosmetic and functional issues, often in the context of divarication of recti, significant weight loss, and after pregnancy. Despite this, patient-surgeon communication gaps can hinder informed decision-making. The integration of large language models (LLMs) in healthcare offers potential for enhancing patient information. This study evaluated the feasibility of using LLMs for answering perioperative queries. METHODS This study assessed the efficacy of four leading LLMs (OpenAI's ChatGPT-3.5, Anthropic's Claude, Google's Gemini, and Bing's CoPilot) using fifteen unique prompts. All outputs were evaluated using the Flesch-Kincaid, Flesch Reading Ease score, and Coleman-Liau index for readability assessment. The DISCERN score and a Likert scale were utilized to evaluate quality. Scores were assigned by two plastic surgical residents and then reviewed and discussed until a consensus was reached by five plastic surgeon specialists. RESULTS ChatGPT-3.5 required the highest reading level for comprehension, followed by Gemini, Claude, then CoPilot. Claude provided the most appropriate and actionable advice. In terms of patient-friendliness, CoPilot outperformed the rest, enhancing engagement and information comprehensiveness. ChatGPT-3.5 and Gemini offered adequate, though unremarkable, advice, employing more professional language. CoPilot uniquely included visual aids and was the only model to use hyperlinks, although these were not particularly helpful or appropriate, and it faced limitations in responding to certain queries. CONCLUSION ChatGPT-3.5, Gemini, Claude, and Bing's CoPilot showcased differences in readability and reliability. LLMs offer unique advantages for patient care but require careful selection. Future research should integrate LLM strengths and address weaknesses for optimal patient education. LEVEL OF EVIDENCE V This journal requires that authors assign a level of evidence to each article. For a full description of these Evidence-Based Medicine ratings, please refer to the Table of Contents or the online Instructions to Authors at www.springer.com/00266.
Affiliation(s)
- Bryan Lim: Department of Plastic Surgery, Peninsula Health, Melbourne, Victoria, 3199, Australia
- Ishith Seth: Department of Plastic Surgery, Peninsula Health, Melbourne, Victoria, 3199, Australia
- Roberto Cuomo: Plastic Surgery Unit, Department of Medicine, Surgery and Neuroscience, University of Siena, Siena, Italy
- Peter Sinkjær Kenney: Department of Plastic Surgery, Vejle Hospital, Beriderbakken 4, 7100, Vejle, Denmark; Department of Plastic and Breast Surgery, Aarhus University Hospital, Aarhus, Denmark
- Richard J Ross: Department of Plastic Surgery, Peninsula Health, Melbourne, Victoria, 3199, Australia
- Foti Sofiadellis: Department of Plastic Surgery, Peninsula Health, Melbourne, Victoria, 3199, Australia
- Warren Matthew Rozen: Department of Plastic Surgery, Peninsula Health, Melbourne, Victoria, 3199, Australia
7. Sarikonda A, Abishek R, Isch EL, Momin AA, Self M, Sambangi A, Carreras A, Jallo J, Harrop J, Sivaganesan A. Assessing the Clinical Appropriateness and Practical Utility of ChatGPT as an Educational Resource for Patients Considering Minimally Invasive Spine Surgery. Cureus 2024;16:e71105. PMID: 39525124. PMCID: PMC11548952. DOI: 10.7759/cureus.71105.
Abstract
Introduction Minimally invasive spine surgery (MISS) has evolved over the last three decades as a less invasive alternative to traditional spine surgery, offering benefits such as smaller incisions, faster recovery, and lower complication rates. With patients frequently seeking information about MISS online, the comprehensibility and accuracy of this information are crucial. Recent studies have shown that much of the online material regarding spine surgery exceeds the recommended readability levels, making it difficult for patients to understand. This study explores the clinical appropriateness and readability of responses generated by Chat Generative Pre-Trained Transformer (ChatGPT) to frequently asked questions (FAQs) about MISS. Methods A set of 15 FAQs was formulated based on clinical expertise and existing literature on MISS. Each question was independently inputted into ChatGPT five times, and the generated responses were evaluated by three neurosurgery attendings for clinical appropriateness. Appropriateness was judged based on accuracy, readability, and patient accessibility. Readability was assessed using seven standardized readability tests, including the Flesch-Kincaid Grade Level and Flesch Reading Ease (FRE) scores. Statistical analysis was performed to compare readability scores across preoperative, postoperative, and intraoperative/technical question categories. Results The mean readability scores for preoperative, postoperative, and intraoperative/technical questions were 15±2.8, 16±3, and 15.7±3.2, respectively, significantly exceeding the recommended sixth- to eighth-grade reading level for patient education (p=0.017). Differences in readability across individual questions were also statistically significant (p<0.001). All responses required a reading level above 11th grade, with a majority indicating college-level comprehension. Although preoperative and postoperative questions generally elicited clinically appropriate responses, 50% of intraoperative/technical questions yielded either "inappropriate" or "unreliable" responses, particularly for inquiries about radiation exposure and the use of lasers in MISS. Conclusions While ChatGPT is proficient in providing clinically appropriate responses to certain FAQs about MISS, it frequently produces responses that exceed the recommended readability level for patient education. This limitation suggests that its utility may be confined to highly educated patients, potentially exacerbating existing disparities in patient comprehension. Future AI-based patient education tools must prioritize clear and accessible communication, with oversight from medical professionals to ensure accuracy and appropriateness. Further research comparing ChatGPT's performance with other AI models could enhance its application in patient education across medical specialties.
Affiliation(s)
- Advith Sarikonda: Department of Neurological Surgery, Thomas Jefferson University, Philadelphia, USA
- Robert Abishek: Department of Neurological Surgery, Thomas Jefferson University, Philadelphia, USA
- Emily L Isch: Department of General Surgery, Division of Plastic Surgery, Thomas Jefferson University Hospital, Philadelphia, USA
- Arbaz A Momin: Department of Neurological Surgery, Thomas Jefferson University Hospital, Philadelphia, USA
- Mitchell Self: Department of Neurological Surgery, Thomas Jefferson University Hospital, Philadelphia, USA
- Abhijeet Sambangi: Department of Neurological Surgery, Thomas Jefferson University, Philadelphia, USA
- Angeleah Carreras: Department of Neurological Surgery, Thomas Jefferson University, Philadelphia, USA
- Jack Jallo: Department of Neurosurgery, Thomas Jefferson Medical College, Philadelphia, USA
- Jim Harrop: Department of Neurological Surgery, Thomas Jefferson University Hospital, Philadelphia, USA
- Ahilan Sivaganesan: Department of Neurological Surgery, Thomas Jefferson University Hospital, Philadelphia, USA
8. Boroumand S, Gu E, Huelsboemer L, Stögner VA, Parikh N, Kauke-Navarro M, Pomahac B. To Face Transplant or Not Face Transplant? Evaluating the Limitations of ChatGPT's Consideration of Ethical Themes. Ann Plast Surg 2024;93:527-529. PMID: 39331750. DOI: 10.1097/sap.0000000000004072.
Affiliation(s)
- Sam Boroumand: Division of Plastic & Reconstructive Surgery, Department of Surgery, Yale School of Medicine, New Haven, CT
- Emily Gu: Division of Plastic & Reconstructive Surgery, Department of Surgery, Yale School of Medicine, New Haven, CT
- Lioba Huelsboemer: Division of Plastic & Reconstructive Surgery, Department of Surgery, Yale School of Medicine, New Haven, CT
- Neil Parikh: Division of Plastic & Reconstructive Surgery, Department of Surgery, Yale School of Medicine, New Haven, CT
- Martin Kauke-Navarro: Division of Plastic & Reconstructive Surgery, Department of Surgery, Yale School of Medicine, New Haven, CT
- Bohdan Pomahac: Division of Plastic & Reconstructive Surgery, Department of Surgery, Yale School of Medicine, New Haven, CT
9. Drouaud A, Stocchi C, Tang J, Gonsalves G, Cheung Z, Szatkowski J, Forsh D. Exploring the Performance of ChatGPT in an Orthopaedic Setting and Its Potential Use as an Educational Tool. JB JS Open Access 2024;9:e24.00081. PMID: 39600798. PMCID: PMC11584220. DOI: 10.2106/jbjs.oa.24.00081.
Abstract
Introduction We assessed ChatGPT-4 Vision's (GPT-4V) performance in image interpretation, diagnosis formulation, and patient management. We aim to shed light on its potential as an educational tool addressing real-life cases for medical students. Methods Ten of the most popular orthopaedic trauma cases from OrthoBullets were selected. GPT-4V interpreted medical imaging and patient information, provided diagnoses, and guided responses to OrthoBullets questions. Four fellowship-trained orthopaedic trauma surgeons rated GPT-4V responses using a 5-point Likert scale (strongly disagree to strongly agree). Each of GPT-4V's answers was assessed for alignment with current medical knowledge (accuracy), whether its rationale was logical (rationale), relevancy to the specific case (relevance), and whether surgeons would trust the answers (trustworthiness). Mean scores from surgeon ratings were calculated. Results In total, 10 clinical cases, comprising 97 questions, were analyzed (10 imaging, 35 management, and 52 treatment). The surgeons assigned a mean overall rating of 3.46/5.00 to GPT-4V's imaging responses (accuracy 3.28, rationale 3.68, relevance 3.75, and trustworthiness 3.15). Management questions received an overall score of 3.76 (accuracy 3.61, rationale 3.84, relevance 4.01, and trustworthiness 3.58), while treatment questions had an average overall score of 4.04 (accuracy 3.99, rationale 4.08, relevance 4.15, and trustworthiness 3.93). Conclusion This is the first study evaluating GPT-4V's imaging interpretation, personalized management, and treatment approaches as a medical educational tool. Surgeon ratings indicate overall fair agreement with GPT-4V's reasoning behind decision-making. GPT-4V performed less favorably in imaging interpretation compared with its management and treatment performance. The performance of GPT-4V falls below our fellowship-trained orthopaedic trauma surgeons' standards as a standalone tool for medical education.
Affiliation(s)
- Arthur Drouaud: George Washington University School of Medicine, Washington, District of Columbia
- Carolina Stocchi: Department of Orthopaedic Surgery, Mount Sinai, New York, New York
- Justin Tang: Department of Orthopaedic Surgery, Mount Sinai, New York, New York
- Grant Gonsalves: Department of Orthopaedic Surgery, Mount Sinai, New York, New York
- Zoe Cheung: Department of Orthopaedic Surgery, Staten Island University Hospital, Staten Island, New York
- Jan Szatkowski: Department of Orthopaedic Surgery, Indiana University Health Methodist Hospital, Indianapolis, Indiana
- David Forsh: Department of Orthopaedic Surgery, Mount Sinai, New York, New York
10. Chen CJ, Sobol K, Hickey C, Raphael J. The Comparative Performance of Large Language Models on the Hand Surgery Self-Assessment Examination. Hand (N Y) 2024:15589447241279460. PMID: 39324769. PMCID: PMC11559719. DOI: 10.1177/15589447241279460.
Abstract
BACKGROUND Generative artificial intelligence (AI) models have emerged as capable of producing human-like responses and have showcased their potential in general medical specialties. This study explores the performance of AI systems on the American Society for Surgery of the Hand (ASSH) Self-Assessment Exams (SAE). METHODS ChatGPT 4.0 and Bing AI were evaluated on a set of multiple-choice questions drawn from the ASSH SAE online question bank spanning 5 years (2019-2023). Each system was evaluated with 999 questions. Images and video links were inserted into question prompts to allow for complete AI interpretation. The performance of both systems was standardized using the May 2023 version of ChatGPT 4.0 and Microsoft Bing AI, both of which had web browsing and image capabilities. RESULTS ChatGPT 4.0 scored an average of 66.5% on the ASSH questions. Bing AI scored higher, with an average of 75.3%. Bing AI outperformed ChatGPT 4.0 by an average of 8.8%. As a benchmark, a minimum passing score of 50% was required for continuing medical education credit. Both ChatGPT 4.0 and Bing AI had poorer performance on video-type and image-type questions on analysis of variance testing. Responses from both models contained elements from sources such as PubMed, Journal of Hand Surgery, and American Academy of Orthopedic Surgeons. CONCLUSIONS ChatGPT 4.0 with browsing and Bing AI can both be anticipated to achieve passing scores on the ASSH SAE. Generative AI, with its ability to provide logical responses and literature citations, presents a convincing argument for use as an interactive learning aid and educational tool.
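The authors supplied images and video links inside the question prompts through the chat interfaces of ChatGPT 4.0 and Bing AI. A rough equivalent using a vision-capable chat API is sketched below; the model name, question text, and image URL are placeholders rather than the study's actual inputs.

```python
from openai import OpenAI  # assumes OPENAI_API_KEY is configured

client = OpenAI()

question = ("A 45-year-old sustains the injury shown. "
            "Which answer choice is the most appropriate next step? A) ... B) ... C) ... D) ...")
image_url = "https://example.com/radiograph.png"  # placeholder; the study pasted images from the question bank

# Send the question stem and the image together so the model can interpret both.
response = client.chat.completions.create(
    model="gpt-4o",  # placeholder for a vision-capable model; the study used the May 2023 ChatGPT 4.0 interface
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }],
)
print(response.choices[0].message.content)
```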
Affiliation(s)
- Clark J. Chen: Albert Einstein Healthcare Network, Philadelphia, PA, USA
- Keenan Sobol: Albert Einstein Healthcare Network, Philadelphia, PA, USA
- Connor Hickey: Albert Einstein Healthcare Network, Philadelphia, PA, USA
- James Raphael: Albert Einstein Healthcare Network, Philadelphia, PA, USA
11. Isch EL, Sarikonda A, Sambangi A, Carreras A, Sircar A, Self DM, Habarth-Morales TE, Caterson EJ, Aycart M. Evaluating the Efficacy of Large Language Models in CPT Coding for Craniofacial Surgery: A Comparative Analysis. J Craniofac Surg 2024:00001665-990000000-01868. PMID: 39221924. DOI: 10.1097/scs.0000000000010575.
Abstract
BACKGROUND The advent of Large Language Models (LLMs) like ChatGPT has introduced significant advancements in various surgical disciplines. These developments have led to an increased interest in the utilization of LLMs for Current Procedural Terminology (CPT) coding in surgery. With CPT coding being a complex and time-consuming process, often exacerbated by the scarcity of professional coders, there is a pressing need for innovative solutions to enhance coding efficiency and accuracy. METHODS This observational study evaluated the effectiveness of 5 publicly available large language models (Perplexity.AI, Bard, BingAI, ChatGPT 3.5, and ChatGPT 4.0) in accurately identifying CPT codes for craniofacial procedures. A consistent query format was employed to test each model, ensuring the inclusion of detailed procedure components where necessary. The responses were classified as correct, partially correct, or incorrect based on their alignment with established CPT coding for the specified procedures. RESULTS The results indicate that while there is no overall significant association between the type of AI model and the correctness of CPT code identification, there are notable differences in performance for simple and complex CPT codes among the models. Specifically, ChatGPT 4.0 showed higher accuracy for complex codes, whereas Perplexity.AI and Bard were more consistent with simple codes. DISCUSSION The use of AI chatbots for CPT coding in craniofacial surgery presents a promising avenue for reducing the administrative burden and associated costs of manual coding. Despite the lower accuracy rates compared with specialized, trained algorithms, the accessibility and minimal training requirements of the AI chatbots make them attractive alternatives. The study also suggests that priming AI models with operative notes may enhance their accuracy, offering a resource-efficient strategy for improving CPT coding in clinical practice. CONCLUSIONS This study highlights the feasibility and potential benefits of integrating LLMs into the CPT coding process for craniofacial surgery. The findings advocate for further refinement and training of AI models to improve their accuracy and practicality, suggesting a future where AI-assisted coding could become a standard component of surgical workflows, aligning with the ongoing digital transformation in health care.
Affiliation(s)
- Emily L Isch: Department of General Surgery, Thomas Jefferson University
- Adrija Sircar: Sidney Kimmel Medical College at Thomas Jefferson University
- D Mitchell Self: Department of Neurosurgery, Thomas Jefferson University and Jefferson Hospital for Neuroscience, Philadelphia, PA
- E J Caterson: Department of Surgery, Division of Plastic Surgery, Nemours Children's Hospital, Wilmington, DE
- Mario Aycart: Department of Surgery, Division of Plastic Surgery, Nemours Children's Hospital, Wilmington, DE
12. Warrier A, Singh R, Haleem A, Zaki H, Eloy JA. The Comparative Diagnostic Capability of Large Language Models in Otolaryngology. Laryngoscope 2024;134:3997-4002. PMID: 38563415. DOI: 10.1002/lary.31434.
Abstract
OBJECTIVES Evaluate and compare the ability of large language models (LLMs) to diagnose various ailments in otolaryngology. METHODS We collected all 100 clinical vignettes from the second edition of Otolaryngology Cases-The University of Cincinnati Clinical Portfolio by Pensak et al. With the addition of the prompt "Provide a diagnosis given the following history," we prompted ChatGPT-3.5, Google Bard, and Bing-GPT4 to provide a diagnosis for each vignette. These diagnoses were compared to the portfolio for accuracy and recorded. All queries were run in June 2023. RESULTS ChatGPT-3.5 was the most accurate model (89% success rate), followed by Google Bard (82%) and Bing GPT (74%). A chi-squared test revealed a significant difference between the three LLMs in providing correct diagnoses (p = 0.023). Of the 100 vignettes, seven required additional testing results (i.e., biopsy, non-contrast CT) for accurate clinical diagnosis. When omitting these vignettes, the revised success rates were 95.7% for ChatGPT-3.5, 88.17% for Google Bard, and 78.72% for Bing-GPT4 (p = 0.002). CONCLUSIONS ChatGPT-3.5 offers the most accurate diagnoses when given established clinical vignettes as compared to Google Bard and Bing-GPT4. LLMs may accurately offer assessments for common otolaryngology conditions but currently require detailed prompt information and critical supervision from clinicians. There is vast potential in the clinical applicability of LLMs; however, practitioners should be wary of possible "hallucinations" and misinformation in responses. LEVEL OF EVIDENCE 3 Laryngoscope, 134:3997-4002, 2024.
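The chi-squared comparison of diagnostic accuracy can be reconstructed from the reported success rates (89, 82, and 74 correct diagnoses out of 100 vignettes per model); the sketch below assumes a simple correct/incorrect split per model.

```python
from scipy.stats import chi2_contingency

# Correct vs. incorrect diagnoses out of 100 vignettes, taken from the reported success rates.
table = [
    [89, 11],  # ChatGPT-3.5
    [82, 18],  # Google Bard
    [74, 26],  # Bing-GPT4
]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}")
```

With these counts the test returns p of roughly 0.02, in line with the reported p = 0.023.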
Affiliation(s)
- Akshay Warrier: Department of Otolaryngology-Head and Neck Surgery, Rutgers New Jersey Medical School, Newark, New Jersey, USA
- Rohan Singh: Department of Otolaryngology-Head and Neck Surgery, Rutgers New Jersey Medical School, Newark, New Jersey, USA
- Afash Haleem: Department of Otolaryngology-Head and Neck Surgery, Rutgers New Jersey Medical School, Newark, New Jersey, USA
- Haider Zaki: Department of Otolaryngology-Head and Neck Surgery, Rutgers New Jersey Medical School, Newark, New Jersey, USA
- Jean Anderson Eloy: Department of Otolaryngology-Head and Neck Surgery, Rutgers New Jersey Medical School, Newark, New Jersey, USA; Center for Skull Base and Pituitary Surgery, Neurological Institute of New Jersey, Rutgers New Jersey Medical School, Newark, New Jersey, USA
13. Su Z, Tang G, Huang R, Qiao Y, Zhang Z, Dai X. Based on Medicine, The Now and Future of Large Language Models. Cell Mol Bioeng 2024;17:263-277. PMID: 39372551. PMCID: PMC11450117. DOI: 10.1007/s12195-024-00820-3.
Abstract
OBJECTIVES This review explores the potential applications of large language models (LLMs) such as ChatGPT, GPT-3.5, and GPT-4 in the medical field, aiming to encourage their prudent use, provide professional support, and develop accessible medical AI tools that adhere to healthcare standards. METHODS This paper examines the impact of technologies such as OpenAI's Generative Pre-trained Transformers (GPT) series, including GPT-3.5 and GPT-4, and other large language models (LLMs) in medical education, scientific research, clinical practice, and nursing. Specifically, it includes supporting curriculum design, acting as personalized learning assistants, creating standardized simulated patient scenarios in education; assisting with writing papers, data analysis, and optimizing experimental designs in scientific research; aiding in medical imaging analysis, decision-making, patient education, and communication in clinical practice; and reducing repetitive tasks, promoting personalized care and self-care, providing psychological support, and enhancing management efficiency in nursing. RESULTS LLMs, including ChatGPT, have demonstrated significant potential and effectiveness in the aforementioned areas, yet their deployment in healthcare settings is fraught with ethical complexities, potential lack of empathy, and risks of biased responses. CONCLUSION Despite these challenges, significant medical advancements can be expected through the proper use of LLMs and appropriate policy guidance. Future research should focus on overcoming these barriers to ensure the effective and ethical application of LLMs in the medical field.
Affiliation(s)
- Ziqing Su: Department of Neurosurgery, The First Affiliated Hospital of Anhui Medical University, 218 Jixi Road, Hefei 230022, P.R. China; Department of Clinical Medicine, The First Clinical College of Anhui Medical University, Hefei 230022, P.R. China
- Guozhang Tang: Department of Neurosurgery, The First Affiliated Hospital of Anhui Medical University, 218 Jixi Road, Hefei 230022, P.R. China; Department of Clinical Medicine, The Second Clinical College of Anhui Medical University, Hefei 230032, Anhui, P.R. China
- Rui Huang: Department of Neurosurgery, The First Affiliated Hospital of Anhui Medical University, 218 Jixi Road, Hefei 230022, P.R. China; Department of Clinical Medicine, The First Clinical College of Anhui Medical University, Hefei 230022, P.R. China
- Yang Qiao: Department of Neurosurgery, The First Affiliated Hospital of Anhui Medical University, 218 Jixi Road, Hefei 230022, P.R. China
- Zheng Zhang: Department of Neurosurgery, The First Affiliated Hospital of Anhui Medical University, 218 Jixi Road, Hefei 230022, P.R. China; Department of Clinical Medicine, The First Clinical College of Anhui Medical University, Hefei 230022, P.R. China
- Xingliang Dai: Department of Neurosurgery, The First Affiliated Hospital of Anhui Medical University, 218 Jixi Road, Hefei 230022, P.R. China; Department of Research & Development, East China Institute of Digital Medical Engineering, Shangrao 334000, P.R. China
14. Telich-Tarriba JE. Navigating the Impact of AI in Research Manuscript Creation. Indian J Plast Surg 2024;57:235-236. PMID: 39139677. PMCID: PMC11319009. DOI: 10.1055/s-0044-1782522.
15. Pressman SM, Borna S, Gomez-Cabello CA, Haider SA, Haider CR, Forte AJ. Clinical and Surgical Applications of Large Language Models: A Systematic Review. J Clin Med 2024;13:3041. PMID: 38892752. PMCID: PMC11172607. DOI: 10.3390/jcm13113041.
Abstract
Background: Large language models (LLMs) represent a recent advancement in artificial intelligence with medical applications across various healthcare domains. The objective of this review is to highlight how LLMs can be utilized by clinicians and surgeons in their everyday practice. Methods: A systematic review was conducted following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines. Six databases were searched to identify relevant articles. Eligibility criteria emphasized articles focused primarily on clinical and surgical applications of LLMs. Results: The literature search yielded 333 results, with 34 meeting eligibility criteria. All articles were from 2023. There were 14 original research articles, four letters, one interview, and 15 review articles. These articles covered a wide variety of medical specialties, including various surgical subspecialties. Conclusions: LLMs have the potential to enhance healthcare delivery. In clinical settings, LLMs can assist in diagnosis, treatment guidance, patient triage, physician knowledge augmentation, and administrative tasks. In surgical settings, LLMs can assist surgeons with documentation, surgical planning, and intraoperative guidance. However, addressing their limitations and concerns, particularly those related to accuracy and biases, is crucial. LLMs should be viewed as tools to complement, not replace, the expertise of healthcare professionals.
Affiliation(s)
- Sahar Borna: Division of Plastic Surgery, Mayo Clinic, Jacksonville, FL 32224, USA
- Syed Ali Haider: Division of Plastic Surgery, Mayo Clinic, Jacksonville, FL 32224, USA
- Clifton R. Haider: Department of Physiology and Biomedical Engineering, Mayo Clinic, Rochester, MN 55905, USA
- Antonio Jorge Forte: Division of Plastic Surgery, Mayo Clinic, Jacksonville, FL 32224, USA; Center for Digital Health, Mayo Clinic, Rochester, MN 55905, USA
16. Pressman SM, Borna S, Gomez-Cabello CA, Haider SA, Haider C, Forte AJ. AI and Ethics: A Systematic Review of the Ethical Considerations of Large Language Model Use in Surgery Research. Healthcare (Basel) 2024;12:825. PMID: 38667587. PMCID: PMC11050155. DOI: 10.3390/healthcare12080825.
Abstract
INTRODUCTION As large language models receive greater attention in medical research, the investigation of ethical considerations is warranted. This review aims to explore surgery literature to identify ethical concerns surrounding these artificial intelligence models and evaluate how autonomy, beneficence, nonmaleficence, and justice are represented within these ethical discussions to provide insights in order to guide further research and practice. METHODS A systematic review was conducted in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines. Five electronic databases were searched in October 2023. Eligible studies included surgery-related articles that focused on large language models and contained adequate ethical discussion. Study details, including specialty and ethical concerns, were collected. RESULTS The literature search yielded 1179 articles, with 53 meeting the inclusion criteria. Plastic surgery, orthopedic surgery, and neurosurgery were the most represented surgical specialties. Autonomy was the most explicitly cited ethical principle. The most frequently discussed ethical concern was accuracy (n = 45, 84.9%), followed by bias, patient confidentiality, and responsibility. CONCLUSION The ethical implications of using large language models in surgery are complex and evolving. The integration of these models into surgery necessitates continuous ethical discourse to ensure responsible and ethical use, balancing technological advancement with human dignity and safety.
Affiliation(s)
- Sahar Borna: Division of Plastic Surgery, Mayo Clinic, Jacksonville, FL 32224, USA
- Syed A. Haider: Division of Plastic Surgery, Mayo Clinic, Jacksonville, FL 32224, USA
- Clifton Haider: Department of Physiology and Biomedical Engineering, Mayo Clinic, Rochester, MN 55905, USA
- Antonio J. Forte: Division of Plastic Surgery, Mayo Clinic, Jacksonville, FL 32224, USA; Center for Digital Health, Mayo Clinic, Rochester, MN 55905, USA
17. Boyd CJ, Hemal K, Sorenson TJ, Patel PA, Bekisz JM, Choi M, Karp NS. Artificial Intelligence as a Triage Tool during the Perioperative Period: Pilot Study of Accuracy and Accessibility for Clinical Application. Plast Reconstr Surg Glob Open 2024;12:e5580. PMCID: PMC10836902. DOI: 10.1097/gox.0000000000005580.
Abstract
Background Given the dialogistic properties of ChatGPT, we hypothesized that this artificial intelligence (AI) function can be used as a self-service tool where clinical questions can be directly answered by AI. Our objective was to assess the content, accuracy, and accessibility of AI-generated content regarding common perioperative questions for reduction mammaplasty. Methods ChatGPT (OpenAI, February Version, San Francisco, Calif.) was used to query 20 common patient concerns that arise in the perioperative period of a reduction mammaplasty. Searches were performed in duplicate for both a general term and a specific clinical question. Query outputs were analyzed both objectively and subjectively. Descriptive statistics, t tests, and chi-square tests were performed where appropriate with a predetermined level of significance of P less than 0.05. Results From a total of 40 AI-generated outputs, mean word length was 191.8 words. Readability was at the thirteenth grade level. Regarding content, of all query outputs, 97.5% were on the appropriate topic. Medical advice was deemed to be reasonable in 100% of cases. General queries more frequently reported overarching background information, whereas specific queries more frequently reported prescriptive information (P < 0.0001). AI outputs specifically recommended following surgeon provided postoperative instructions in 82.5% of instances. Conclusions Currently available AI tools, in their nascent form, can provide recommendations for common perioperative questions and concerns for reduction mammaplasty. With further calibration, AI interfaces may serve as a tool for fielding patient queries in the future; however, patients must always retain the ability to bypass technology and be able to contact their surgeon.
Affiliation(s)
- Carter J Boyd: Hansjörg Wyss Department of Plastic Surgery, NYU Langone, New York, NY
- Kshipra Hemal: Hansjörg Wyss Department of Plastic Surgery, NYU Langone, New York, NY
- Thomas J Sorenson: Hansjörg Wyss Department of Plastic Surgery, NYU Langone, New York, NY
- Jonathan M Bekisz: Hansjörg Wyss Department of Plastic Surgery, NYU Langone, New York, NY
- Mihye Choi: Hansjörg Wyss Department of Plastic Surgery, NYU Langone, New York, NY
- Nolan S Karp: Hansjörg Wyss Department of Plastic Surgery, NYU Langone, New York, NY