1. Pohl NB, Derector E, Rivlin M, Bachoura A, Tosti R, Kachooei AR, Beredjiklian PK, Fletcher DJ. A quality and readability comparison of artificial intelligence and popular health website education materials for common hand surgery procedures. Hand Surg Rehabil 2024; 43:101723. PMID: 38782361; DOI: 10.1016/j.hansur.2024.101723.
Abstract
INTRODUCTION The application of ChatGPT to producing patient education materials for orthopaedic hand disorders has not been extensively studied. This study evaluated the quality and readability of educational information pertaining to common hand surgeries from patient education websites and information produced by ChatGPT. METHODS Patient education information for four hand surgeries (carpal tunnel release, trigger finger release, Dupuytren's contracture release, and ganglion cyst surgery) was extracted from ChatGPT (at a scientific and a fourth-grade reading level), WebMD, and Mayo Clinic. In a blinded and randomized fashion, five fellowship-trained orthopaedic hand surgeons evaluated the quality of information using modified DISCERN criteria. Readability and reading grade level were assessed using the Flesch Reading Ease (FRE) and Flesch-Kincaid Grade Level (FKGL) equations. RESULTS The Mayo Clinic website scored higher in quality for carpal tunnel release information (p = 0.004). WebMD scored higher for Dupuytren's contracture release (p < 0.001), ganglion cyst surgery (p = 0.003), and overall quality (p < 0.001). ChatGPT - 4th Grade Reading Level, ChatGPT - Scientific Reading Level, WebMD, and Mayo Clinic written materials on average exceeded the recommended reading grade levels (4th-6th grade) by at least four grade levels (10th, 14th, 13th, and 11th grade, respectively). CONCLUSIONS ChatGPT provides inferior education materials compared with patient-friendly websites. When prompted to provide more easily read materials, ChatGPT generates less robust information than patient-friendly websites and does not adequately simplify the educational information. ChatGPT has the potential to improve the quality and readability of patient education materials, but currently, patient-friendly websites provide superior quality at similar reading comprehension levels.
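For orientation, the two readability instruments used in this and several of the following studies are computed from average sentence length and average syllables per word. The standard published formulas are reproduced below for reference; they are not restated in the article itself:

```latex
\mathrm{FRE} = 206.835 - 1.015\left(\frac{\text{total words}}{\text{total sentences}}\right) - 84.6\left(\frac{\text{total syllables}}{\text{total words}}\right)
\qquad
\mathrm{FKGL} = 0.39\left(\frac{\text{total words}}{\text{total sentences}}\right) + 11.8\left(\frac{\text{total syllables}}{\text{total words}}\right) - 15.59
```

Higher FRE scores indicate easier text, while FKGL maps directly onto US school grade levels, which is why targets such as "4th-6th grade" are expressed in FKGL terms.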
Affiliation(s)
- Nicholas B Pohl, Evan Derector, Michael Rivlin, Rick Tosti, Pedro K Beredjiklian, Daniel J Fletcher: Department of Orthopaedic Surgery, Rothman Orthopaedic Institute, Philadelphia, PA, USA
- Abdo Bachoura, Amir R Kachooei: Department of Orthopaedic Surgery, Rothman Orthopaedics Florida, Orlando, FL, USA
2. Morya VK, Lee HW, Shahid H, Magar AG, Lee JH, Kim JH, Jun L, Noh KC. Application of ChatGPT for Orthopedic Surgeries and Patient Care. Clin Orthop Surg 2024; 16:347-356. PMID: 38827766; PMCID: PMC11130626; DOI: 10.4055/cios23181.
Abstract
Artificial intelligence (AI) has rapidly transformed various aspects of life, and the launch of the chatbot "ChatGPT" by OpenAI in November 2022 has garnered significant attention and user appreciation. ChatGPT utilizes natural language processing based on a "generative pre-trained transformer" (GPT) model, specifically the transformer architecture, to generate human-like responses to a wide range of questions and topics. Equipped with approximately 57 billion words and 175 billion parameters from online data, ChatGPT has potential applications in medicine and orthopedics. One of its key strengths is its personalized, easy-to-understand, and adaptive responses, which allow it to learn continuously through user interaction. This article discusses how AI, especially ChatGPT, presents numerous opportunities in orthopedics, ranging from preoperative planning and surgical techniques to patient education and medical support. Although ChatGPT's user-friendly responses and adaptive capabilities are laudable, its limitations, including biased responses and ethical concerns, necessitate its cautious and responsible use. Surgeons and healthcare providers should leverage the strengths of ChatGPT while recognizing its current limitations and verifying critical information through independent research and expert opinions. As AI technology continues to evolve, ChatGPT may become a valuable tool in orthopedic education and patient care, leading to improved outcomes and efficiency in healthcare delivery. The integration of AI into orthopedics offers substantial benefits but requires careful consideration and continuous improvement.
Affiliation(s)
- Vivek Kumar Morya, Ho-Won Lee, Hamzah Shahid, Anuja Gajanan Magar, Ju-Hyung Lee, Jae-Hyung Kim, Lang Jun, Kyu-Cheol Noh: Department of Orthopedic Surgery, Hallym University Kangnam Sacred Heart Hospital, Seoul, Korea
3. Baldwin AJ. An artificial intelligence language model improves readability of burns first aid information. Burns 2024; 50:1122-1127. PMID: 38492982; DOI: 10.1016/j.burns.2024.03.005.
Abstract
AIMS This study aimed to assess the potential of using an artificial intelligence (AI) large language model to improve the readability of burns first aid information. METHODS An AI language model (ChatGPT-3) was used to rewrite content from the top 50 English-language webpages containing burns first aid information so that it would be understandable by an individual with the literacy level of an 11-year-old, as recommended by the American Medical Association and Health Education England. Readability was assessed using five validated tools. RESULTS In their original form, only 4% of the patient education materials (PEMs) met the target readability level across all tools. The median grade was 6.9 (SD = 1.1); a one-sample one-tailed t-test revealed that this was not significantly below the target (p = .31). After AI modification, 18% of PEMs reached the target level using all tools, with a median grade of 6 (SD = 0.9), which was significantly below the target level (p < .001). Once the PEMs were rewritten using AI, paired t-tests demonstrated that all readability scores improved significantly (p < .001). CONCLUSION Utilising an AI language model proved an effective and viable method for enhancing the readability of burns first aid information.
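The before/after comparison reported here can be reproduced programmatically. Below is a minimal sketch assuming the third-party textstat Python package; the sample sentences are invented for illustration, and the abstract does not name the five tools the author actually used:

```python
# pip install textstat
import textstat

def readability_report(text: str) -> dict:
    """Score a passage with several commonly used readability tools."""
    return {
        "flesch_reading_ease": textstat.flesch_reading_ease(text),
        "flesch_kincaid_grade": textstat.flesch_kincaid_grade(text),
        "smog_index": textstat.smog_index(text),
        "gunning_fog": textstat.gunning_fog(text),
    }

# Hypothetical original and AI-simplified burns first aid sentences
original = ("Immediately irrigate the affected area with copious quantities "
            "of tepid water for a minimum duration of twenty minutes.")
rewritten = "Cool the burn under cool running water for at least 20 minutes."

for label, text in [("original", original), ("rewritten", rewritten)]:
    scores = readability_report(text)
    # Reading age ~11 corresponds roughly to a US 6th-grade level
    print(label, scores, "meets 6th-grade target:",
          scores["flesch_kincaid_grade"] <= 6.0)
```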
Affiliation(s)
- Alexander J Baldwin: Department of Burns and Plastic Surgery, Buckinghamshire Healthcare NHS Trust, Buckinghamshire, UK
4. Woo KMC, Simon GW, Akindutire O, Aphinyanaphongs Y, Austrian JS, Kim JG, Genes N, Goldenring JA, Major VJ, Pariente CS, Pineda EG, Kang SK. Evaluation of GPT-4 ability to identify and generate patient instructions for actionable incidental radiology findings. J Am Med Inform Assoc 2024:ocae117. PMID: 38778578; DOI: 10.1093/jamia/ocae117.
Abstract
OBJECTIVES To evaluate the proficiency of a HIPAA-compliant version of GPT-4 in identifying actionable, incidental findings from unstructured radiology reports of Emergency Department patients, and to assess the appropriateness of artificial intelligence (AI)-generated, patient-facing summaries of these findings. MATERIALS AND METHODS Radiology reports extracted from the electronic health record of a large academic medical center were manually reviewed to identify non-emergent, incidental findings with a high likelihood of requiring follow-up, further sub-stratified as "definitely actionable" (DA) or "possibly actionable-clinical correlation" (PA-CC). Instruction prompts to GPT-4 were developed and iteratively optimized using a validation set of 50 reports. The optimized prompt was then applied to a test set of 430 unseen reports. GPT-4 performance was primarily graded on accuracy in identifying either DA or PA-CC findings, and secondarily on DA findings alone. Outputs were reviewed for hallucinations. AI-generated patient-facing summaries were assessed for appropriateness via Likert scale. RESULTS For the primary outcome (DA or PA-CC), GPT-4 achieved 99.3% recall, 73.6% precision, and an F1 of 84.5%. For the secondary outcome (DA only), GPT-4 demonstrated 95.2% recall, 77.3% precision, and an F1 of 85.3%. No findings were "hallucinated" outright; however, 2.8% of cases included generated text about recommendations that were inferred without specific reference. The majority of true-positive AI-generated summaries required no or minor revision. CONCLUSION GPT-4 demonstrates proficiency in detecting actionable, incidental findings after refined instruction prompting. AI-generated patient instructions were most often appropriate, but rarely included inferred recommendations. While this technology shows promise to augment diagnostics, active clinician oversight via "human-in-the-loop" workflows remains critical for clinical implementation.
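For readers cross-checking the reported metrics: F1 is the harmonic mean of precision and recall, and the snippet below recovers the published primary-outcome F1 from the stated precision and recall (the raw counts behind these percentages are not given in the abstract):

```python
def f1_score(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Reported primary-outcome performance (DA or PA-CC findings)
precision, recall = 0.736, 0.993
print(f"F1 = {f1_score(precision, recall):.1%}")  # -> F1 = 84.5%, as reported
```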
Affiliation(s)
- Kar-Mun C Woo, Gregory W Simon, Olumide Akindutire, Jacob A Goldenring: Ronald O. Perelman Department of Emergency Medicine, NYU Grossman School of Medicine, New York, NY, USA
- Yindalon Aphinyanaphongs, Vincent J Major: Department of Population Health, NYU Grossman School of Medicine, New York, NY, USA; Department of Health Informatics, Medical Center IT, NYU Langone Health, New York, NY, USA
- Jonathan S Austrian: Department of Health Informatics, Medical Center IT, NYU Langone Health, New York, NY, USA; Department of Medicine, NYU Grossman School of Medicine, New York, NY, USA
- Jung G Kim: Ronald O. Perelman Department of Emergency Medicine, NYU Grossman School of Medicine, New York, NY, USA; Institute for Innovations in Medical Education, NYU Langone Health, New York, NY, USA
- Nicholas Genes: Ronald O. Perelman Department of Emergency Medicine, NYU Grossman School of Medicine, New York, NY, USA; Department of Health Informatics, Medical Center IT, NYU Langone Health, New York, NY, USA
- Chloé S Pariente: Department of Health Informatics, Medical Center IT, NYU Langone Health, New York, NY, USA
- Edwin G Pineda: MCIT Clinical Systems-ASAP application, NYU Langone Health, New York, NY, USA
- Stella K Kang: Department of Population Health, NYU Grossman School of Medicine, New York, NY, USA; Department of Radiology, NYU Grossman School of Medicine, New York, NY, USA
5. Ghanem YK, Rouhi AD, Al-Houssan A, Saleh Z, Moccia MC, Joshi H, Dumon KR, Hong Y, Spitz F, Joshi AR, Kwiatt M. Dr. Google to Dr. ChatGPT: assessing the content and quality of artificial intelligence-generated medical information on appendicitis. Surg Endosc 2024; 38:2887-2893. PMID: 38443499; PMCID: PMC11078845; DOI: 10.1007/s00464-024-10739-5.
Abstract
INTRODUCTION Generative artificial intelligence (AI) chatbots have recently been posited as potential sources of online medical information for patients making medical decisions. Existing online patient-oriented medical information has repeatedly been shown to be of variable quality and difficult readability. Therefore, we sought to evaluate the content and quality of AI-generated medical information on acute appendicitis. METHODS A modified DISCERN assessment tool, comprising 16 distinct criteria each scored on a 5-point Likert scale (score range 16-80), was used to assess AI-generated content. Readability was determined using the Flesch Reading Ease (FRE) and Flesch-Kincaid Grade Level (FKGL) scores. Four popular chatbots (ChatGPT-3.5, ChatGPT-4, Bard, and Claude-2) were prompted to generate medical information about appendicitis. Three investigators independently scored the generated texts, blinded to the identity of the AI platforms. RESULTS ChatGPT-3.5, ChatGPT-4, Bard, and Claude-2 had overall mean (SD) quality scores of 60.7 (1.2), 62.0 (1.0), 62.3 (1.2), and 51.3 (2.3), respectively, on a scale of 16-80. Inter-rater reliability was 0.81, 0.75, 0.81, and 0.72, respectively, indicating substantial agreement. Claude-2 demonstrated a significantly lower mean quality score than ChatGPT-4 (p = 0.001), ChatGPT-3.5 (p = 0.005), and Bard (p = 0.001). Bard was the only AI platform that listed verifiable sources, while Claude-2 provided fabricated sources. All chatbots except Claude-2 advised readers to consult a physician if experiencing symptoms. Regarding readability, the FKGL and FRE scores of ChatGPT-3.5, ChatGPT-4, Bard, and Claude-2 were 14.6 and 23.8, 11.9 and 33.9, 8.6 and 52.8, and 11.0 and 36.6, respectively, indicating difficult readability at a college reading level. CONCLUSION AI-generated medical information on appendicitis scored favorably on quality assessment, but most platforms either fabricated sources or did not provide any at all. Additionally, overall readability far exceeded recommended levels for the public. Generative AI platforms demonstrate measured potential for patient education and engagement about appendicitis.
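A sketch of the modified DISCERN scoring described here: 16 criteria, each rated on a 1-5 Likert scale and summed to a 16-80 total. The per-rater scores below are invented for illustration; the abstract does not list the individual criteria:

```python
def modified_discern_total(ratings: list[int]) -> int:
    """Sum a modified DISCERN assessment: 16 criteria, each scored 1-5."""
    if len(ratings) != 16:
        raise ValueError("the modified instrument uses exactly 16 criteria")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("each criterion is scored 1 (poor) to 5 (excellent)")
    return sum(ratings)  # possible total: 16 (all poor) to 80 (all excellent)

# One rater's hypothetical scores for a chatbot-generated appendicitis text
ratings = [4, 4, 3, 5, 4, 4, 3, 4, 4, 5, 3, 4, 4, 4, 3, 4]
print(modified_discern_total(ratings))  # -> 62, near the reported Bard mean
```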
Affiliation(s)
- Yazid K Ghanem, Young Hong, Francis Spitz, Amit R Joshi, Michael Kwiatt: Department of Surgery, Cooper University Hospital, Camden, NJ, USA; Cooper Medical School of Rowan University, Camden, NJ, USA
- Zena Saleh, Matthew C Moccia, Hansa Joshi: Department of Surgery, Cooper University Hospital, Camden, NJ, USA
- Armaun D Rouhi, Kristoffel R Dumon: Department of Surgery, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Ammr Al-Houssan: Department of Surgery, University of Connecticut, Hartford, CT, USA
6. Browne R, Gull K, Hurley CM, Sugrue RM, O'Sullivan JB. ChatGPT-4 Can Help Hand Surgeons Communicate Better With Patients. J Hand Surg Glob Online 2024; 6:436-438. PMID: 38817773; PMCID: PMC11133925; DOI: 10.1016/j.jhsg.2024.03.008.
Abstract
The American Society for Surgery of the Hand and the British Society for Surgery of the Hand produce patient-focused information above the sixth-grade reading level recommended by the American Medical Association. To promote health equity, patient-focused content should be aimed at an appropriate level of health literacy. Artificial intelligence-driven large language models may be able to assist hand surgery societies in improving the readability of the information provided to patients. Readability was calculated for all the articles written in English on the American Society for Surgery of the Hand and British Society for Surgery of the Hand websites, using seven of the most common readability formulas. Chat Generative Pre-Trained Transformer version 4 (ChatGPT-4) was then asked to rewrite each article at a sixth-grade readability level. The readability of each response was calculated and compared with that of the unedited articles. ChatGPT-4 improved readability across all chosen readability formulas and achieved a mean sixth-grade readability level on the Flesch-Kincaid Grade Level and Simple Measure of Gobbledygook calculations. It also increased the mean Flesch Reading Ease score, with higher scores representing more readable material. This study demonstrated that ChatGPT-4 can be used to improve the readability of patient-focused material in hand surgery. However, ChatGPT-4 is interested primarily in sounding natural, not in seeking truth, and hence each response must be evaluated by the surgeon to ensure that information accuracy is not being sacrificed for the sake of readability by this powerful tool.
Affiliation(s)
- Robert Browne: Royal College of Surgeons in Ireland, Dublin, Ireland
- Khadija Gull: Department of Reconstructive and Plastic Surgery, Connolly Hospital Blanchardstown, Dublin, Ireland
- John Barry O'Sullivan: Royal College of Surgeons in Ireland, Dublin, Ireland; Department of Reconstructive and Plastic Surgery, Connolly Hospital Blanchardstown, Dublin, Ireland
7. Yang J, Ardavanis KS, Slack KE, Fernando ND, Della Valle CJ, Hernandez NM. Chat Generative Pretrained Transformer (ChatGPT) and Bard: Artificial Intelligence Does Not Yet Provide Clinically Supported Answers for Hip and Knee Osteoarthritis. J Arthroplasty 2024; 39:1184-1190. PMID: 38237878; DOI: 10.1016/j.arth.2024.01.029.
Abstract
BACKGROUND Advancements in artificial intelligence (AI) have led to the creation of large language models (LLMs), such as Chat Generative Pretrained Transformer (ChatGPT) and Bard, that analyze online resources to synthesize responses to user queries. Despite their popularity, the accuracy of LLM responses to medical questions remains unknown. This study aimed to compare the responses of ChatGPT and Bard regarding treatments for hip and knee osteoarthritis with the American Academy of Orthopaedic Surgeons (AAOS) Evidence-Based Clinical Practice Guidelines (CPGs) recommendations. METHODS Both ChatGPT (OpenAI) and Bard (Google) were queried regarding 20 treatments (10 for hip and 10 for knee osteoarthritis) from the AAOS CPGs. Responses were classified by 2 reviewers as being in "Concordance," "Discordance," or "No Concordance" with the AAOS CPGs. Cohen's kappa coefficient was used to assess inter-rater reliability, and chi-squared analyses were used to compare responses between LLMs. RESULTS Overall, ChatGPT and Bard provided responses that were concordant with the AAOS CPGs for 16 (80%) and 12 (60%) treatments, respectively. Notably, ChatGPT and Bard encouraged the use of non-recommended treatments in 30% and 60% of queries, respectively. There were no differences in performance when evaluating by joint or by recommended versus non-recommended treatments. Studies were referenced in 6 (30%) of the Bard responses and none (0%) of the ChatGPT responses. Of the 6 Bard responses, the cited studies could be identified for only 1 (16.7%); of the remaining responses, 2 (33.3%) cited studies in journals that did not exist, 2 (33.3%) cited studies that could not be found with the information given, and 1 (16.7%) provided links to unrelated studies. CONCLUSIONS Neither ChatGPT nor Bard consistently provides responses that align with the AAOS CPGs. Consequently, physicians and patients should temper expectations on the guidance AI platforms can currently provide.
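The inter-rater statistic used here, Cohen's kappa, corrects the reviewers' observed agreement for agreement expected by chance. The standard definition, stated for orientation rather than drawn from the article, is:

```latex
\kappa = \frac{p_o - p_e}{1 - p_e}
```

where p_o is the observed proportion of agreement between the two reviewers and p_e the proportion expected if both classified at random according to their marginal rates; kappa = 1 indicates perfect agreement and kappa = 0 chance-level agreement.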
Affiliation(s)
- JaeWon Yang, Navin D Fernando, Nicholas M Hernandez: Department of Orthopaedic Surgery, University of Washington, Seattle, WA, USA
- Kyle S Ardavanis: Department of Orthopaedic Surgery, Madigan Medical Center, Tacoma, WA, USA
- Katherine E Slack: Elson S. Floyd College of Medicine, Washington State University, Spokane, WA, USA
- Craig J Della Valle: Department of Orthopaedic Surgery, Rush University Medical Center, Chicago, IL, USA
8. Sacca L, Lobaina D, Burgoa S, Lotharius K, Moothedan E, Gilmore N, Xie J, Mohler R, Scharf G, Knecht M, Kitsantas P. Promoting Artificial Intelligence for Global Breast Cancer Risk Prediction and Screening in Adult Women: A Scoping Review. J Clin Med 2024; 13:2525. PMID: 38731054; PMCID: PMC11084581; DOI: 10.3390/jcm13092525.
Abstract
Background: Artificial intelligence (AI) algorithms can be applied in breast cancer risk prediction and prevention by using patient history, scans, imaging information, and analysis of specific genes for cancer classification to reduce overdiagnosis and overtreatment. This scoping review aimed to identify the barriers encountered in applying innovative AI techniques and models in developing breast cancer risk prediction scores and promoting screening behaviors among adult females. Findings may inform and guide future global recommendations for AI application in breast cancer prevention and care for female populations. Methods: PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews) was used as a reference checklist throughout this study. The Arksey and O'Malley methodology was used as a framework to guide this review, consisting of five steps: (1) identify research questions; (2) search for relevant studies; (3) select studies relevant to the research questions; (4) chart the data; and (5) collate, summarize, and report the results. Results: In the field of breast cancer risk detection and prevention, the following AI techniques and models have been applied: a machine and deep learning model (ML-DL model) (n = 1), academic algorithms (n = 2), the Breast Cancer Surveillance Consortium (BCSC) Clinical 5-Year Risk Prediction Model (n = 2), deep-learning computer vision AI algorithms (n = 2), an AI-based thermal imaging solution (Thermalytix) (n = 1), RealRisks (n = 2), Breast Cancer Risk NAVIgation (n = 1), MammoRisk (an ML-based tool) (n = 1), various ML models (n = 1), and various machine/deep learning, decision aid, and commercial algorithms (n = 7). In the 11 included studies, a total of 39 barriers to AI applications in breast cancer risk prediction and screening efforts were identified. The most common barriers were lack of external validity and limited generalizability (n = 6), as AI was used in studies with either a small sample size or datasets with missing data. Many studies (n = 5) also encountered selection bias due to exclusion of certain populations based on characteristics such as race/ethnicity, family history, or past medical history. Several recommendations for future research should be considered: AI models need to include a broader spectrum of, and more complete, predictive variables for risk assessment; investigating long-term outcomes with improved follow-up periods is critical to assess the impact of AI on clinical decisions beyond the immediate outcomes; and utilizing AI to improve communication strategies at both the local and organizational levels can support informed decision-making and compliance, especially in populations with limited literacy levels. Conclusions: The use of AI in patient education and as an adjunctive tool for providers is still early in its incorporation, and future research should explore the implementation of AI-driven resources to enhance understanding and decision-making regarding breast cancer screening, especially in vulnerable populations with limited literacy.
Affiliation(s)
- Lea Sacca (and all co-authors): Charles E. Schmidt College of Medicine, Florida Atlantic University, Boca Raton, FL, USA
9. Fiedler B, Azua EN, Phillips T, Ahmed AS. ChatGPT performance on the American Shoulder and Elbow Surgeons maintenance of certification exam. J Shoulder Elbow Surg 2024:S1058-2746(24)00231-3. PMID: 38580067; DOI: 10.1016/j.jse.2024.02.029.
Abstract
BACKGROUND While multiple studies have tested the ability of large language models (LLMs), such as ChatGPT, to pass standardized medical exams at different levels of training, LLMs have never been tested on surgical sub-specialty examinations, such as the American Shoulder and Elbow Surgeons (ASES) Maintenance of Certification (MOC). The purpose of this study was to compare the results of ChatGPT 3.5, GPT-4, and fellowship-trained surgeons on the 2023 ASES MOC self-assessment exam. METHODS ChatGPT 3.5 and GPT-4 were subjected to the same set of text-only questions from the ASES MOC exam, and GPT-4 was additionally subjected to image-based MOC exam questions. Question responses from both models were compared against the correct answers, and the performance of both models was compared with the corresponding average human performance on the same question subsets. One-sided proportion z-tests were used to analyze the data. RESULTS Humans performed significantly better than ChatGPT 3.5 on exclusively text-based questions (76.4% vs. 60.8%, P = .044). Humans also performed significantly better than GPT-4 on image-based questions (73.9% vs. 53.2%, P = .019). There was no significant difference between humans and GPT-4 on text-based questions (76.4% vs. 66.7%, P = .136). Accounting for all questions, humans significantly outperformed GPT-4 (75.3% vs. 60.2%, P = .012). GPT-4 did not perform significantly better than ChatGPT 3.5 on text-only questions (66.7% vs. 60.8%, P = .268). DISCUSSION Although human performance was superior overall, ChatGPT demonstrated the capacity to analyze orthopedic information and answer specialty-specific questions on the ASES MOC exam for both text and image-based questions. With continued advancements in deep learning, LLMs may someday rival the exam performance of fellowship-trained surgeons.
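A sketch of the one-sided two-proportion z-test used here, assuming the statsmodels Python package. The counts below are hypothetical (chosen to match the reported text-only percentages; the abstract gives percentages, not the underlying counts, so the p-value printed will not match the published one):

```python
# pip install statsmodels
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical counts: humans ~76.4% vs. ChatGPT 3.5 ~60.8% correct
successes = np.array([113, 90])   # correct answers: humans, GPT-3.5
trials = np.array([148, 148])     # questions attempted by each

# One-sided test: is the humans' proportion correct larger than the model's?
z_stat, p_value = proportions_ztest(successes, trials, alternative="larger")
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
```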
Affiliation(s)
- Benjamin Fiedler, Eric N Azua, Todd Phillips, Adil Shahzad Ahmed: Joseph Barnhart Department of Orthopedic Surgery, Baylor College of Medicine, Houston, TX, USA
10. Arif HA, LeBrun G, Moore ST, Friscia DA. Analysis of the Most Popular Online Ankle Fracture-Related Patient Education Materials. Foot Ankle Orthop 2024; 9:24730114241241310. PMID: 38577700; PMCID: PMC10989055; DOI: 10.1177/24730114241241310.
Abstract
Background Given the increasing availability of Internet access, it is critical to ensure that the informational material available online for patient education is both accurate and readable, to promote a greater degree of health literacy. This study sought to investigate the quality and readability of the most popular online resources for ankle fractures. Methods After conducting a Google search using 6 terms related to ankle fractures, we collected the first 20 nonsponsored results for each term. Readability was evaluated using the Flesch Reading Ease (FRE), Flesch-Kincaid Grade Level (FKGL), and Gunning Fog Index (GFI) instruments. Quality was evaluated using a custom-created Ankle Fracture Index (AFI). Results A total of 46 of 120 articles met the inclusion criteria. The mean FKGL, FRE, and GFI scores were 8.4 ± 0.5, 57.5 ± 3.2, and 10.5 ± 0.5, respectively. The average AFI score was 15.4 ± 1.4, corresponding to an "acceptable" quality rating. Almost 70% of articles (n = 32) were written at or below the recommended eighth-grade reading level. Most articles discussed the need for imaging in diagnosis and treatment planning while neglecting to discuss the risks of surgery or potential future operations. Conclusion We found that online patient-facing materials on ankle fractures demonstrated an eighth-grade average reading level and acceptable quality on content analysis. Future work should expand coverage of risk factors, surgical complications, and long-term recovery while keeping readability at or below the eighth-grade level.
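The third instrument used here, the Gunning Fog Index, follows the same pattern as FRE and FKGL but counts complex (three-or-more-syllable) words. The standard formula, stated for orientation rather than taken from the article, is:

```latex
\mathrm{GFI} = 0.4\left[\frac{\text{total words}}{\text{total sentences}} + 100\left(\frac{\text{complex words}}{\text{total words}}\right)\right]
```

Like FKGL, GFI approximates the US school grade needed to follow the text on first reading, so the reported mean of 10.5 corresponds roughly to a mid-high-school level.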
Affiliation(s)
- Haad A. Arif, Simon T. Moore: School of Medicine, University of California Riverside, Riverside, CA, USA
- David A. Friscia: School of Medicine, University of California Riverside, Riverside, CA, USA; Eisenhower Desert Orthopedic Center, Rancho Mirage, CA, USA
11. Lin MX, Li G, Cui D, Mathews PM, Akpek EK. Usability of Patient Education-Oriented Cataract Surgery Websites. Ophthalmology 2024; 131:499-506. PMID: 37852419; DOI: 10.1016/j.ophtha.2023.10.019.
Abstract
PURPOSE To assess the web accessibility and readability of patient-oriented educational websites for cataract surgery. DESIGN Cross-sectional electronic survey. PARTICIPANTS Websites with information dedicated to educating patients about cataract surgery. METHODS An incognito search for "cataract surgery" was performed using a popular search engine. The top 100 patient-oriented cataract surgery websites returned were included and categorized as institutional, private practice, or medical organization according to authorship. Each site was assessed for readability using 4 standardized reading grade-level formulas. Accessibility was assessed through multilingual availability, accessibility menu availability, complementary educational video availability, and conformance with the Web Content Accessibility Guidelines (WCAG) 2.0. A standard t test and chi-square analysis were performed to assess the significance of differences in readability and accessibility among the 3 authorship categories. MAIN OUTCOME MEASURES The main outcome measures were each website's average reading grade level, number of accessibility violations, multilingual availability, accessibility menu availability, complementary educational video availability, accessibility conformance level, and violations of the perceivable, operable, understandable, and robust (POUR) principles of the WCAG 2.0. RESULTS A total of 32, 55, and 13 sites were affiliated with institutions, private practices, and other medical organizations, respectively. The overall mean reading grade was 11.8 ± 1.6, with higher reading levels observed on private practice websites compared with institution and medical organization websites combined (12.1 vs. 11.4; P = 0.03). Fewer private practice websites had multiple language options compared with institutional and medical organization websites combined (5.5% vs. 20.0%; P = 0.03). More private practice websites had accessibility menus than institutions and medical organizations combined (27.3% vs. 8.9%; P = 0.038). The overall mean number of WCAG 2.0 POUR principle violations was 17.1 ± 23.1, with no significant difference among groups. Eighty-five percent of websites violated the perceivable principle. CONCLUSIONS Available patient-oriented online information for cataract surgery may not be comprehensible to the general public. Readability and accessibility should be considered when designing these resources. FINANCIAL DISCLOSURE(S) The author(s) have no proprietary or commercial interest in any materials discussed in this article.
Affiliation(s)
- Michael X Lin, Esen K Akpek: The Ocular Surface Disease Clinic, The Wilmer Eye Institute, Johns Hopkins University School of Medicine, Baltimore, MD, USA
- Gavin Li: The Ocular Surface Disease Clinic, The Wilmer Eye Institute, Johns Hopkins University School of Medicine, Baltimore, MD, USA; Icahn School of Medicine at Mount Sinai, New York, NY, USA
- David Cui: The Ocular Surface Disease Clinic, The Wilmer Eye Institute, Johns Hopkins University School of Medicine, Baltimore, MD, USA; Krieger Eye Institute, Sinai Hospital of Baltimore, Baltimore, MD, USA
- Priya M Mathews: The Ocular Surface Disease Clinic, The Wilmer Eye Institute, Johns Hopkins University School of Medicine, Baltimore, MD, USA; Center for Sight, Sarasota, FL, USA
12. Weddell J, Jawad D, Buckley T, Redfern J, Mansur Z, Elliott N, Hanson CL, Gallagher R. Online information for spontaneous coronary artery dissection (SCAD) survivors and their families: A systematic appraisal of content and quality of websites. Int J Med Inform 2024; 184:105372. PMID: 38350180; DOI: 10.1016/j.ijmedinf.2024.105372.
Abstract
BACKGROUND Spontaneous coronary artery dissection (SCAD) survivors often seek information online. However, the quality and content of websites for SCAD survivors are uncertain. This review aimed to systematically identify and appraise websites for SCAD survivors. METHODS A systematic review approach was adapted for websites. A comprehensive search of SCAD key-phrases was performed using an internet search engine during January 2023, and websites targeting SCAD survivors were included. Websites were appraised for quality using the Quality Component Scoring System (QCSS) and the Health Related Website Evaluation Form (HRWEF), for suitability using the Suitability Assessment Method (SAM), for readability using a readability generator, and for interactivity. Content was appraised using a tool based on the SCAD international consensus literature. Raw scores from the tools were converted to percentages and then classified on scales ranging from excellent to poor. RESULTS A total of 50 websites were identified and included from 600 screened. Overall, content accuracy/scope (53.3 ± 23.3) and interactivity (67.1 ± 11.5) were poor, quality was fair (59.1 ± 22.3, QCSS) to average (83.1 ± 5.8, HRWEF), and suitability was adequate (54.9 ± 13.8, SAM). The mean readability grade was 11.6 (± 2.3), far exceeding the recommended level of ≤ 8. By website type, survivor-affiliated and medically peer-reviewed health information websites scored highest. The appraisal tools had limitations, such as overlapping assessment of similar constructs and items made less relevant by the modern internet. CONCLUSION Many websites are available for SCAD survivors, but they often have limited and/or inaccurate content and poor quality, are not tailored to the demographic, and are difficult to read. Appraisal tools for health websites require consolidation and further development.
Affiliation(s)
- Joseph Weddell, Thomas Buckley, Zarin Mansur, Robyn Gallagher: Sydney Nursing School, Faculty of Medicine and Health, The University of Sydney, Sydney, Australia; Charles Perkins Centre, The University of Sydney, Sydney, Australia
- Danielle Jawad: Sydney School of Public Health, Faculty of Medicine and Health, The University of Sydney, Sydney, Australia; Health Promotion Unit, Population Health Research & Evaluation Hub, Sydney Local Health District, Sydney, Australia
- Julie Redfern: Charles Perkins Centre, The University of Sydney, Sydney, Australia; Sydney School of Health Sciences, Faculty of Medicine and Health, The University of Sydney, Sydney, Australia
- Natalie Elliott, Coral L Hanson: School of Health and Social Care, Edinburgh Napier University, Edinburgh, UK
13. Arango SD, Flynn JC, Zeitlin J, Lorenzana DJ, Miller AJ, Wilson MS, Strohl AB, Weiss LE, Weir TB. The Performance of ChatGPT on the American Society for Surgery of the Hand Self-Assessment Examination. Cureus 2024; 16:e58950. PMID: 38800302; PMCID: PMC11126365; DOI: 10.7759/cureus.58950.
Abstract
BACKGROUND This study aims to compare the performance of ChatGPT-3.5 (GPT-3.5) and ChatGPT-4 (GPT-4) on the American Society for Surgery of the Hand (ASSH) Self-Assessment Examination (SAE) to determine their potential as educational tools. METHODS This study compared the proportion of correct answers to text-based questions on the 2021 and 2022 ASSH SAE between untrained ChatGPT versions. Secondary analyses assessed the performance of ChatGPT by question difficulty and question category. The outcomes of ChatGPT were compared with the performance of actual examinees on the ASSH SAE. RESULTS A total of 238 questions were included in the analysis. Compared with GPT-3.5, GPT-4 provided significantly more correct answers overall (58.0% versus 68.9%, respectively; P = 0.013), on the 2022 SAE (55.9% versus 72.9%; P = 0.007), and on more difficult questions (48.8% versus 63.6%; P = 0.02). In a multivariable logistic regression analysis, correct answers were predicted by GPT-4 (odds ratio [OR], 1.66; P = 0.011), increased question difficulty (OR, 0.59; P = 0.009), Bone and Joint questions (OR, 0.18; P < 0.001), and Soft Tissue questions (OR, 0.30; P = 0.013). Actual examinees scored a mean of 21.6% above GPT-3.5 and 10.7% above GPT-4. The mean percentage of correct answers by actual examinees was significantly higher for questions ChatGPT answered correctly than for those it answered incorrectly. CONCLUSIONS GPT-4 demonstrated improved performance over GPT-3.5 on the ASSH SAE, especially on more difficult questions. Actual examinees scored higher than both versions of ChatGPT, but the margin was cut in half by GPT-4.
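For orientation, the odds ratios in the multivariable logistic regression reported here are exponentiated model coefficients. The standard relationship, not restated in the article, is:

```latex
\log\frac{P(\text{correct})}{1 - P(\text{correct})} = \beta_0 + \sum_j \beta_j x_j, \qquad \mathrm{OR}_j = e^{\beta_j}
```

so the OR of 1.66 for GPT-4 means the odds of a correct answer were about 66% higher than for GPT-3.5 with the other predictors held fixed, while ORs below 1 (e.g., 0.59 for question difficulty) indicate variables that lowered the odds of a correct answer.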
Affiliation(s)
- Sebastian D Arango, Jacob Zeitlin, Daniel J Lorenzana, Andrew J Miller, Matthew S Wilson, Adam B Strohl, Tristan B Weir: Department of Orthopaedic Surgery, Philadelphia Hand to Shoulder Center, Philadelphia, PA, USA
- Jason C Flynn: Department of Orthopaedic Surgery, Sidney Kimmel Medical College, Philadelphia, PA, USA
- Lawrence E Weiss: Division of Orthopaedic Hand Surgery, OAA Orthopaedic Specialists, Allentown, PA, USA
14. Parekh AS, McCahon JAS, Nghe A, Pedowitz DI, Daniel JN, Parekh SG. Foot and Ankle Patient Education Materials and Artificial Intelligence Chatbots: A Comparative Analysis. Foot Ankle Spec 2024:19386400241235834. PMID: 38504411; DOI: 10.1177/19386400241235834.
Abstract
BACKGROUND The purpose of this study was to perform a comparative analysis of foot and ankle patient education material generated by AI chatbots as it compares to the American Orthopaedic Foot and Ankle Society (AOFAS)-recommended patient education website, FootCareMD.org. METHODS ChatGPT, Google Bard, and Bing AI were used to generate patient education materials on 10 of the most common foot and ankle conditions. The content from these AI language model platforms was analyzed and compared with that on FootCareMD.org for accuracy of included information. Accuracy was determined for each of the 10 conditions on the basis of included information regarding background, symptoms, causes, diagnosis, treatments, surgical options, recovery, and risks or prevention. RESULTS When compared to the reference standard of the AOFAS website FootCareMD.org, the AI language model platforms consistently scored below 60% accuracy in all categories of the articles analyzed. ChatGPT was found to contain an average of 46.2% of key content across all included conditions when compared to FootCareMD.org. Comparatively, Google Bard and Bing AI contained 36.5% and 28.0% of the information included on FootCareMD.org, respectively (P < .005). CONCLUSION Patient education regarding common foot and ankle conditions generated by AI language models shows limited content accuracy across all 3 AI chatbot platforms. LEVEL OF EVIDENCE Level IV.
Affiliation(s)
- Aarav S Parekh, Amy Nghe: Rothman Orthopaedic Institute, Philadelphia, PA, USA
15. Moons P, Van Bulck L. Using ChatGPT and Google Bard to improve the readability of written patient information: a proof of concept. Eur J Cardiovasc Nurs 2024; 23:122-126. PMID: 37603843; DOI: 10.1093/eurjcn/zvad087.
Abstract
Patient information materials tend to be written at a reading level that is too advanced for many patients. In this proof-of-concept study, we used ChatGPT and Google Bard to reduce the reading level of three selected patient information sections from scientific journals. ChatGPT successfully improved readability but could not achieve the recommended 6th-grade reading level. Bard reached a 6th-grade reading level but oversimplified the texts, omitting up to 83% of the content. Despite the present limitations, developers of patient information are encouraged to employ large language models, preferably ChatGPT, to optimize their materials.
Affiliation(s)
- Philip Moons: KU Leuven Department of Public Health and Primary Care, KU Leuven-University of Leuven, Leuven, Belgium; Institute of Health and Care Sciences, University of Gothenburg, Gothenburg, Sweden; Department of Paediatrics and Child Health, University of Cape Town, Cape Town, South Africa
- Liesbet Van Bulck: KU Leuven Department of Public Health and Primary Care, KU Leuven-University of Leuven, Leuven, Belgium; Research Foundation Flanders (FWO), Brussels, Belgium
16. Lum ZC, Collins DP, Dennison S, Guntupalli L, Choudhary S, Saiz AM, Randall RL. Generative Artificial Intelligence Performs at a Second-Year Orthopedic Resident Level. Cureus 2024; 16:e56104. PMID: 38618358; PMCID: PMC11014641; DOI: 10.7759/cureus.56104.
Abstract
Introduction Artificial intelligence (AI) models using large language models (LLMs) and non-specific domains have gained attention for their innovative information processing. As AI advances, it is essential to regularly evaluate these tools' competency to maintain high standards, prevent errors or biases, and avoid flawed reasoning or misinformation that could harm patients or spread inaccuracies. Our study aimed to determine the performance of Chat Generative Pre-trained Transformer (ChatGPT) by OpenAI and Google BARD (BARD) in orthopedic surgery, assess performance by question type, contrast performance between the two AIs, and compare AI performance to that of orthopedic residents. Methods We administered 757 Orthopedic In-Training Examination (OITE) questions to ChatGPT and BARD. After excluding image-related questions, the AIs answered 390 multiple-choice questions, all categorized within 10 sub-specialties (basic science, trauma, sports medicine, spine, hip and knee, pediatrics, oncology, shoulder and elbow, hand, and foot and ankle) and three taxonomy classes (recall, interpretation, and application of knowledge). Statistical analysis compared the number of questions answered correctly by each AI model, each model's performance within each sub-specialty, and each model's performance against the results of orthopedic residents classified by post-graduate year (PGY) level. Results BARD answered more questions correctly overall (58% vs 54%, p < 0.001). ChatGPT performed better in sports medicine and basic science and worse in hand surgery, while BARD performed better in basic science (p < 0.05). The AIs performed better on recall questions than on application-of-knowledge questions (p < 0.05). Based on previous data, AI performance ranked in the 42nd-96th percentile for post-graduate year ones (PGY1s), 27th-58th for PGY2s, 3rd-29th for PGY3s, 1st-21st for PGY4s, and 1st-17th for PGY5s. Discussion ChatGPT excelled in sports medicine but fell short in hand surgery, while both AIs performed well in the basic science sub-specialty but poorly on application-of-knowledge taxonomy questions. BARD performed better than ChatGPT overall. Although AI reached the second-year (PGY2) orthopedic resident level, it fell short of passing the American Board of Orthopedic Surgery (ABOS). Its strength on recall-based inquiries highlights its potential as an orthopedic learning and educational tool.
Affiliation(s)
- Zachary C Lum: Orthopedic Surgery, University of California (UC) Davis School of Medicine, Sacramento, CA, USA; Orthopedic Surgery, Nova Southeastern University, Pembroke Pines, FL, USA
- Dylon P Collins, Stanley Dennison: College of Medicine, Nova Southeastern University Dr. Kiran C. Patel College of Osteopathic Medicine, Fort Lauderdale, FL, USA
- Lohitha Guntupalli: Osteopathic Medicine, Nova Southeastern University Dr. Kiran C. Patel College of Osteopathic Medicine, Clearwater, FL, USA
- Soham Choudhary: Orthopedic Surgery, University of California, Davis, Davis, CA, USA
- Augustine M Saiz, Robert L Randall: Orthopedic Surgery, University of California (UC) Davis Health, Sacramento, CA, USA
17. Huffman N, Pasqualini I, Khan ST, Klika AK, Deren ME, Jin Y, Kunze KN, Piuzzi NS. Enabling Personalized Medicine in Orthopaedic Surgery Through Artificial Intelligence: A Critical Analysis Review. JBJS Rev 2024; 12:01874474-202403000-00006. PMID: 38466797; DOI: 10.2106/jbjs.rvw.23.00232.
Abstract
» The application of artificial intelligence (AI) in the field of orthopaedic surgery holds potential for revolutionizing health care delivery across 3 crucial domains: (I) personalized prediction of clinical outcomes and adverse events, which may optimize patient selection and surgical planning and enhance patient safety and outcomes; (II) automated and semiautomated diagnostic imaging analyses, which may reduce time burden and facilitate precise and timely diagnoses; and (III) forecasting of resource utilization, which may reduce health care costs and increase value for patients and institutions.
» Computer vision is one of the most highly studied areas of AI within orthopaedics, with applications pertaining to fracture classification, identification of the manufacturer and model of prosthetic implants, and surveillance of prosthesis loosening and failure.
» Prognostic applications of AI within orthopaedics include identifying patients who will likely benefit from a specified treatment and predicting prosthetic implant size, postoperative length of stay, discharge disposition, and surgical complications. Not only may these applications be beneficial to patients, but also to institutions and payors, because they may inform potential cost expenditure, improve overall hospital efficiency, and help anticipate resource utilization.
» AI infrastructure development requires institutional financial commitment and a team of clinicians and data scientists with expertise in AI who can complement one another's skill sets and knowledge. Once a team is established and a goal is determined, teams (1) obtain, curate, and label data; (2) establish a reference standard; (3) develop an AI model; (4) evaluate the performance of the AI model; (5) externally validate the model; and (6) reinforce, improve, and evaluate the model's performance until clinical implementation is possible.
» Understanding the implications of AI in orthopaedics may eventually lead to wide-ranging improvements in patient care. However, AI, while holding tremendous promise, is not without methodological and ethical limitations that are essential to address. First, it is important to ensure the external validity of programs before their use in a clinical setting. Investigators should maintain high-quality data records and registry surveillance, exercise caution when evaluating others' reported AI applications, and increase the transparency of the methodological conduct of current models to improve external validity and avoid propagating bias. By addressing these challenges and responsibly embracing the potential of AI, the medical field may eventually be able to harness its power to improve patient care and outcomes.
Affiliation(s)
- Nickelas Huffman, Shujaa T Khan, Alison K Klika, Matthew E Deren, Yuxuan Jin: Department of Orthopaedic Surgery, Cleveland Clinic, Cleveland, OH, USA
- Kyle N Kunze: Department of Orthopaedic Surgery, Hospital for Special Surgery, New York, NY, USA
- Nicolas S Piuzzi: Department of Orthopaedic Surgery, Cleveland Clinic, Cleveland, OH, USA; Department of Biomedical Engineering, Cleveland Clinic Foundation, Cleveland, OH, USA
18. Rouhi AD, Ghanem YK, Yolchieva L, Saleh Z, Joshi H, Moccia MC, Suarez-Pierre A, Han JJ. Can Artificial Intelligence Improve the Readability of Patient Education Materials on Aortic Stenosis? A Pilot Study. Cardiol Ther 2024; 13:137-147. PMID: 38194058; PMCID: PMC10899139; DOI: 10.1007/s40119-023-00347-0.
Abstract
INTRODUCTION The advent of generative artificial intelligence (AI) dialogue platforms and large language models (LLMs) may help facilitate ongoing efforts to improve health literacy. Additionally, recent studies have highlighted inadequate health literacy among patients with cardiac disease. The aim of the present study was to ascertain whether two freely available generative AI dialogue platforms could rewrite online aortic stenosis (AS) patient education materials (PEMs) to meet recommended reading skill levels for the public. METHODS Online PEMs were gathered from a professional cardiothoracic surgical society and academic institutions in the USA. PEMs were then entered into two AI-powered LLMs, ChatGPT-3.5 and Bard, with the prompt "translate to 5th-grade reading level". Readability of PEMs before and after AI conversion was measured using the validated Flesch Reading Ease (FRE), Flesch-Kincaid Grade Level (FKGL), Simple Measure of Gobbledygook Index (SMOGI), and Gunning-Fog Index (GFI) scores. RESULTS Overall, 21 PEMs on AS were gathered. Baseline readability measures indicated difficult readability, at the 10th-12th grade reading level. ChatGPT-3.5 successfully improved readability across all four measures (p < 0.001) to approximately the 6th-7th grade reading level. Bard successfully improved readability across all measures (p < 0.001) except SMOGI (p = 0.729), to approximately the 8th-9th grade level. Neither platform generated PEMs written below the recommended 6th-grade reading level. ChatGPT-3.5 demonstrated significantly more favorable post-conversion readability scores, percentage change in readability scores, and conversion time compared with Bard (all p < 0.001). CONCLUSION AI dialogue platforms can enhance the readability of PEMs for patients with AS but may not fully meet recommended reading skill levels, highlighting potential tools to help strengthen cardiac health literacy in the future.
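The four readability indices named here can be computed with off-the-shelf tooling, which makes this kind of before/after comparison easy to reproduce in outline. Below is a minimal sketch using the Python textstat package; the two sample passages are invented stand-ins, not the study's PEMs.

```python
# Scoring a passage before and after simplification with the same four
# indices used in the study. Requires: pip install textstat.
# Note: SMOG is formally defined for texts of 30+ sentences, so its
# value on short snippets like these is only indicative.
import textstat

original = (
    "Aortic stenosis is a progressive narrowing of the aortic valve "
    "orifice that obstructs left ventricular outflow and may culminate "
    "in syncope, angina, or congestive heart failure."
)
simplified = (
    "Aortic stenosis means the valve leading out of your heart gets "
    "narrow. Your heart has to work harder. This can make you faint, "
    "have chest pain, or feel short of breath."
)

for label, text in [("original", original), ("simplified", simplified)]:
    print(
        f"{label:>10}: "
        f"FRE={textstat.flesch_reading_ease(text):.1f}  "    # higher = easier
        f"FKGL={textstat.flesch_kincaid_grade(text):.1f}  "  # US grade level
        f"SMOG={textstat.smog_index(text):.1f}  "
        f"GFI={textstat.gunning_fog(text):.1f}"
    )
```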
Affiliation(s)
- Armaun D Rouhi: Department of Surgery, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Yazid K Ghanem: Department of Surgery, Cooper University Hospital, Camden, NJ, USA
- Laman Yolchieva: College of Arts and Sciences, University of Pennsylvania, Philadelphia, PA, USA
- Zena Saleh: Department of Surgery, Cooper University Hospital, Camden, NJ, USA
- Hansa Joshi: Department of Surgery, Cooper University Hospital, Camden, NJ, USA
- Matthew C Moccia: Department of Surgery, Cooper University Hospital, Camden, NJ, USA
- Jason J Han: Division of Cardiovascular Surgery, Department of Surgery, Perelman School of Medicine, Hospital of the University of Pennsylvania, Philadelphia, PA, USA
19
Mayol J. Transforming Abdominal Wall Surgery With Generative Artificial Intelligence. JOURNAL OF ABDOMINAL WALL SURGERY: JAWS 2023; 2:12419. [PMID: 38312403 PMCID: PMC10831645 DOI: 10.3389/jaws.2023.12419] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Key Words] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 11/14/2023] [Accepted: 11/16/2023] [Indexed: 02/06/2024]
Affiliation(s)
- Julio Mayol: Hospital Clinico San Carlos, Instituto de Investigación Sanitaria San Carlos, Universidad Complutense de Madrid, Madrid, Spain
20
Crook BS, Park CN, Hurley ET, Richard MJ, Pidgeon TS. Evaluation of Online Artificial Intelligence-Generated Information on Common Hand Procedures. J Hand Surg Am 2023; 48:1122-1127. [PMID: 37690015 DOI: 10.1016/j.jhsa.2023.08.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 05/06/2023] [Revised: 07/25/2023] [Accepted: 08/02/2023] [Indexed: 09/11/2023]
Abstract
PURPOSE The purpose of this study was to analyze the quality and readability of the information generated by an online artificial intelligence (AI) platform regarding 4 common hand surgeries and to compare AI-generated responses with those provided in the informational articles published by the American Society for Surgery of the Hand (ASSH) HandCare website. METHODS An open AI model (ChatGPT) was used to answer questions commonly asked by patients about 4 common hand surgeries (carpal tunnel release, cubital tunnel release, trigger finger release, and distal radius fracture fixation). These answers were evaluated for medical accuracy, quality, and readability and compared with answers derived from the ASSH HandCare materials. RESULTS For the AI model, the Journal of the American Medical Association benchmark criteria score was 0/4, and the DISCERN score was 58 (considered good). The areas in which the AI model lost points were primarily related to the lack of attribution, reliability, and currency of the source material. For AI responses, the mean Flesch Reading Ease score was 34 and the mean Flesch-Kincaid Grade Level was 15, which is considered college level. For comparison, ASSH HandCare materials scored 3/4 on the Journal of the American Medical Association benchmark, 71 on DISCERN (excellent), 9 on Flesch-Kincaid Grade Level, and 60 on Flesch Reading Ease (an eighth/ninth grade reading level). CONCLUSION An AI language model (ChatGPT) provided generally high-quality answers to frequently asked questions about the common hand procedures queried, but without citations to source material it is unclear when or where these answers originated. Furthermore, a high reading level was required to comprehend the information presented. The AI software repeatedly referenced the need to discuss these questions with a surgeon, the importance of shared decision-making and individualized care, and compliance with surgeon treatment recommendations. CLINICAL RELEVANCE As novel AI applications become increasingly mainstream, hand surgeons must understand the limitations and ramifications of these technologies for patient care.
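For reference, the two Flesch measures used in this study and the ones above are fixed formulas over average sentence length and syllable density. With $W$ = total words, $S$ = total sentences, and $Y$ = total syllables:

$$\mathrm{FRE} = 206.835 - 1.015\,\frac{W}{S} - 84.6\,\frac{Y}{W}$$

$$\mathrm{FKGL} = 0.39\,\frac{W}{S} + 11.8\,\frac{Y}{W} - 15.59$$

Higher FRE means easier text (scores of 60 to 70 correspond roughly to the eighth/ninth grade level reported for the ASSH materials), whereas FKGL reads directly as a US school grade, so a value near 15 sits at the college level.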
Affiliation(s)
- Bryan S Crook: Department of Orthopaedic Surgery, Duke University Hospital, Durham, NC
- Caroline N Park: Department of Orthopaedic Surgery, Duke University Hospital, Durham, NC
- Eoghan T Hurley: Department of Orthopaedic Surgery, Duke University Hospital, Durham, NC
- Marc J Richard: Department of Orthopaedic Surgery, Duke University Hospital, Durham, NC
- Tyler S Pidgeon: Department of Orthopaedic Surgery, Duke University Hospital, Durham, NC
21
Makiev KG, Asimakidou M, Vasios IS, Keskinis A, Petkidis G, Tilkeridis K, Ververidis A, Iliopoulos E. A Study on Distinguishing ChatGPT-Generated and Human-Written Orthopaedic Abstracts by Reviewers: Decoding the Discrepancies. Cureus 2023; 15:e49166. [PMID: 38130535 PMCID: PMC10733892 DOI: 10.7759/cureus.49166] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 11/21/2023] [Indexed: 12/23/2023] Open
Abstract
BACKGROUND ChatGPT (OpenAI Incorporated, Mission District, San Francisco, United States) is an artificial intelligence (AI)-based language model that generates human-like text. Its output is comprehensible, contextually relevant, and difficult to distinguish from human-written content. ChatGPT has recently risen in popularity and is widely used in scholarly manuscript drafting. The aim of this study was to determine whether (1) human reviewers can differentiate between AI-generated and human-written abstracts and (2) AI detectors are currently reliable in detecting AI-generated abstracts. METHODS Seven blinded reviewers were asked to read 21 abstracts and judge which were AI-generated and which were human-written. The first group consisted of three orthopaedic residents with limited research experience (OR). The second group included three orthopaedic professors with extensive research experience (OP). The seventh reviewer was a non-orthopaedic doctor who acted as a control for expertise. All abstracts were scanned by a plagiarism detection program. The performance of two different AI detectors in identifying AI-generated abstracts was also analyzed. A structured interview was conducted at the end of the survey to evaluate the decision-making process used by each reviewer. RESULTS The OR group correctly identified the authorship of 34.9% of the abstracts and the OP group 31.7%; the non-orthopaedic control correctly identified 76.2%. All AI-generated abstracts were 100% unique (0% plagiarism). The first AI detector correctly identified the authors of only 9/21 (42.9%) abstracts, whereas the second identified 14/21 (66.6%). CONCLUSION The inability to correctly identify AI-generated content poses a significant scientific risk, as "false" abstracts can end up in scientific conferences or publications. Neither expertise nor research background was shown to have any meaningful impact on the predictive outcome. A focus on how statistical data are presented may help the differentiation process. Further research is warranted to highlight which elements could help reveal an AI-generated abstract.
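Each reviewer judged 21 abstracts, so the reported group percentages imply concrete counts (each three-person group made 63 judgments in total). The back-calculation below, with an added two-sided binomial test against 50% chance guessing, is an illustration only; the study does not report this particular analysis.

```python
# Back-calculating implied correct-identification counts from the
# reported group accuracies; the binomial test vs. chance is our
# illustrative addition, not the study's analysis.
from scipy.stats import binomtest

groups = {
    "OR residents (3 reviewers x 21)":  (0.349, 63),
    "OP professors (3 reviewers x 21)": (0.317, 63),
    "non-orthopaedic control (1 x 21)": (0.762, 21),
}
for name, (accuracy, n) in groups.items():
    k = round(accuracy * n)  # implied number of correct calls
    p = binomtest(k, n, 0.5).pvalue
    print(f"{name}: {k}/{n} correct, p = {p:.3f} vs. coin-flipping")
```

Note that accuracies below 50% on a binary judgment suggest the orthopaedic groups were systematically misled rather than merely guessing, which sharpens the abstract's point that expertise did not help.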
Affiliation(s)
- Konstantinos G Makiev: Department of Orthopaedics, University General Hospital of Alexandroupolis, Democritus University of Thrace, Alexandroupoli, GRC
- Maria Asimakidou: School of Medicine, University General Hospital of Alexandroupolis, Democritus University of Thrace, Alexandroupoli, GRC
- Ioannis S Vasios: Department of Orthopaedics, University General Hospital of Alexandroupolis, Democritus University of Thrace, Alexandroupoli, GRC
- Anthimos Keskinis: Department of Orthopaedics, University General Hospital of Alexandroupolis, Democritus University of Thrace, Alexandroupoli, GRC
- Georgios Petkidis: Department of Orthopaedics, University General Hospital of Alexandroupolis, Democritus University of Thrace, Alexandroupoli, GRC
- Konstantinos Tilkeridis: Department of Orthopaedics, University General Hospital of Alexandroupolis, Democritus University of Thrace, Alexandroupoli, GRC
- Athanasios Ververidis: Department of Orthopaedics, University General Hospital of Alexandroupolis, Democritus University of Thrace, Alexandroupoli, GRC
- Efthymios Iliopoulos: Department of Orthopaedics, University General Hospital of Alexandroupolis, Democritus University of Thrace, Alexandroupoli, GRC
22
Bernstein J. CORR Insights®: Can Artificial Intelligence Improve the Readability of Patient Education Materials? Clin Orthop Relat Res 2023; 481:2268-2270. [PMID: 37192346 PMCID: PMC10566765 DOI: 10.1097/corr.0000000000002702] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 04/11/2023] [Accepted: 04/25/2023] [Indexed: 05/18/2023]
Affiliation(s)
- Joseph Bernstein: Clinical Professor of Orthopaedic Surgery, University of Pennsylvania, Philadelphia, PA, USA
23
Alowais SA, Alghamdi SS, Alsuhebany N, Alqahtani T, Alshaya AI, Almohareb SN, Aldairem A, Alrashed M, Bin Saleh K, Badreldin HA, Al Yami MS, Al Harbi S, Albekairy AM. Revolutionizing healthcare: the role of artificial intelligence in clinical practice. BMC MEDICAL EDUCATION 2023; 23:689. [PMID: 37740191 PMCID: PMC10517477 DOI: 10.1186/s12909-023-04698-z] [Citation(s) in RCA: 65] [Impact Index Per Article: 65.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 06/23/2023] [Accepted: 09/19/2023] [Indexed: 09/24/2023]
Abstract
INTRODUCTION Healthcare systems are complex and challenging for all stakeholders. Artificial intelligence (AI) has transformed various fields, including healthcare, and has the potential to improve patient care and quality of life. Rapid advancements in AI could revolutionize healthcare through its integration into clinical practice. Reporting AI's role in clinical practice is crucial for successful implementation, because it equips healthcare providers with essential knowledge and tools. RESEARCH SIGNIFICANCE This review article provides a comprehensive and up-to-date overview of the current state of AI in clinical practice, including its potential applications in disease diagnosis, treatment recommendations, and patient engagement. It also discusses the associated challenges, covering ethical and legal considerations and the need for human expertise. In doing so, it enhances understanding of AI's significance in healthcare and supports healthcare organizations in adopting AI technologies effectively. MATERIALS AND METHODS The investigation analyzed the use of AI in the healthcare system through a comprehensive review of relevant indexed literature in PubMed/Medline, Scopus, and EMBASE, with no time constraints but limited to articles published in English. The focused question explored the impact of applying AI in healthcare settings and the potential outcomes of this application. RESULTS Integrating AI into healthcare holds great potential for improving disease diagnosis, treatment selection, and clinical laboratory testing. AI tools can leverage large datasets and identify patterns, surpassing human performance in several aspects of healthcare. AI offers increased accuracy, reduced costs, and time savings while minimizing human error. It can revolutionize personalized medicine, optimize medication dosing, enhance population health management, establish guidelines, provide virtual health assistants, support mental health care, improve patient education, and influence patient-physician trust. CONCLUSION AI can be used to diagnose diseases, develop personalized treatment plans, and assist clinicians with decision-making. Rather than simply automating tasks, AI is about developing technologies that can enhance patient care across healthcare settings. However, challenges related to data privacy, bias, and the need for human expertise must be addressed for the responsible and effective implementation of AI in healthcare.
Affiliation(s)
- Shuroug A Alowais: Department of Pharmacy Practice, College of Pharmacy, King Saud bin Abdulaziz University for Health Sciences, Prince Mutib Ibn Abdullah Ibn Abdulaziz Rd, Riyadh, 14611, Saudi Arabia; King Abdullah International Medical Research Center, Riyadh, Saudi Arabia; Pharmaceutical Care Department, King Abdulaziz Medical City, National Guard Health Affairs, Riyadh, Saudi Arabia
- Sahar S Alghamdi: King Abdullah International Medical Research Center, Riyadh, Saudi Arabia; Pharmaceutical Care Department, King Abdulaziz Medical City, National Guard Health Affairs, Riyadh, Saudi Arabia; Department of Pharmaceutical Sciences, College of Pharmacy, King Saud bin Abdulaziz University for Health Sciences, Riyadh, Saudi Arabia
- Nada Alsuhebany: Department of Pharmacy Practice, College of Pharmacy, King Saud bin Abdulaziz University for Health Sciences, Prince Mutib Ibn Abdullah Ibn Abdulaziz Rd, Riyadh, 14611, Saudi Arabia; King Abdullah International Medical Research Center, Riyadh, Saudi Arabia; Pharmaceutical Care Department, King Abdulaziz Medical City, National Guard Health Affairs, Riyadh, Saudi Arabia
- Tariq Alqahtani: King Abdullah International Medical Research Center, Riyadh, Saudi Arabia; Pharmaceutical Care Department, King Abdulaziz Medical City, National Guard Health Affairs, Riyadh, Saudi Arabia; Department of Pharmaceutical Sciences, College of Pharmacy, King Saud bin Abdulaziz University for Health Sciences, Riyadh, Saudi Arabia
- Abdulrahman I Alshaya: Department of Pharmacy Practice, College of Pharmacy, King Saud bin Abdulaziz University for Health Sciences, Prince Mutib Ibn Abdullah Ibn Abdulaziz Rd, Riyadh, 14611, Saudi Arabia; King Abdullah International Medical Research Center, Riyadh, Saudi Arabia; Pharmaceutical Care Department, King Abdulaziz Medical City, National Guard Health Affairs, Riyadh, Saudi Arabia
- Sumaya N Almohareb: Department of Pharmacy Practice, College of Pharmacy, King Saud bin Abdulaziz University for Health Sciences, Prince Mutib Ibn Abdullah Ibn Abdulaziz Rd, Riyadh, 14611, Saudi Arabia; King Abdullah International Medical Research Center, Riyadh, Saudi Arabia; Pharmaceutical Care Department, King Abdulaziz Medical City, National Guard Health Affairs, Riyadh, Saudi Arabia
- Atheer Aldairem: Department of Pharmacy Practice, College of Pharmacy, King Saud bin Abdulaziz University for Health Sciences, Prince Mutib Ibn Abdullah Ibn Abdulaziz Rd, Riyadh, 14611, Saudi Arabia; King Abdullah International Medical Research Center, Riyadh, Saudi Arabia; Pharmaceutical Care Department, King Abdulaziz Medical City, National Guard Health Affairs, Riyadh, Saudi Arabia
- Mohammed Alrashed: Department of Pharmacy Practice, College of Pharmacy, King Saud bin Abdulaziz University for Health Sciences, Prince Mutib Ibn Abdullah Ibn Abdulaziz Rd, Riyadh, 14611, Saudi Arabia; King Abdullah International Medical Research Center, Riyadh, Saudi Arabia; Pharmaceutical Care Department, King Abdulaziz Medical City, National Guard Health Affairs, Riyadh, Saudi Arabia
- Khalid Bin Saleh: Department of Pharmacy Practice, College of Pharmacy, King Saud bin Abdulaziz University for Health Sciences, Prince Mutib Ibn Abdullah Ibn Abdulaziz Rd, Riyadh, 14611, Saudi Arabia; King Abdullah International Medical Research Center, Riyadh, Saudi Arabia; Pharmaceutical Care Department, King Abdulaziz Medical City, National Guard Health Affairs, Riyadh, Saudi Arabia
- Hisham A Badreldin: Department of Pharmacy Practice, College of Pharmacy, King Saud bin Abdulaziz University for Health Sciences, Prince Mutib Ibn Abdullah Ibn Abdulaziz Rd, Riyadh, 14611, Saudi Arabia; King Abdullah International Medical Research Center, Riyadh, Saudi Arabia; Pharmaceutical Care Department, King Abdulaziz Medical City, National Guard Health Affairs, Riyadh, Saudi Arabia
- Majed S Al Yami: Department of Pharmacy Practice, College of Pharmacy, King Saud bin Abdulaziz University for Health Sciences, Prince Mutib Ibn Abdullah Ibn Abdulaziz Rd, Riyadh, 14611, Saudi Arabia; King Abdullah International Medical Research Center, Riyadh, Saudi Arabia; Pharmaceutical Care Department, King Abdulaziz Medical City, National Guard Health Affairs, Riyadh, Saudi Arabia
- Shmeylan Al Harbi: Department of Pharmacy Practice, College of Pharmacy, King Saud bin Abdulaziz University for Health Sciences, Prince Mutib Ibn Abdullah Ibn Abdulaziz Rd, Riyadh, 14611, Saudi Arabia; King Abdullah International Medical Research Center, Riyadh, Saudi Arabia; Pharmaceutical Care Department, King Abdulaziz Medical City, National Guard Health Affairs, Riyadh, Saudi Arabia
- Abdulkareem M Albekairy: Department of Pharmacy Practice, College of Pharmacy, King Saud bin Abdulaziz University for Health Sciences, Prince Mutib Ibn Abdullah Ibn Abdulaziz Rd, Riyadh, 14611, Saudi Arabia; King Abdullah International Medical Research Center, Riyadh, Saudi Arabia; Pharmaceutical Care Department, King Abdulaziz Medical City, National Guard Health Affairs, Riyadh, Saudi Arabia
24
Cite this article: Bone Joint Res 2023;12(8):494–496.
Affiliation(s)
- A. H. R. W. Simpson: Department of Orthopaedics and Trauma, University of Edinburgh Queen's Medical Research Institute, Edinburgh, UK
25
Lum ZC. Can Artificial Intelligence Pass the American Board of Orthopaedic Surgery Examination? Orthopaedic Residents Versus ChatGPT. Clin Orthop Relat Res 2023; 481:1623-1630. [PMID: 37220190 PMCID: PMC10344569 DOI: 10.1097/corr.0000000000002704] [Citation(s) in RCA: 16] [Impact Index Per Article: 16.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 01/31/2023] [Accepted: 04/28/2023] [Indexed: 05/25/2023]
Abstract
BACKGROUND Advances in neural networks, deep learning, and artificial intelligence (AI) have progressed rapidly in recent years. Previous deep learning AI has been domain-specific: trained on datasets covering narrow areas of interest, it yields high accuracy and precision. A new AI model built on large language models (LLMs) and not restricted to specific domains, ChatGPT (OpenAI), has gained attention. Although AI has demonstrated proficiency in managing vast amounts of data, implementation of that knowledge remains a challenge. QUESTIONS/PURPOSES (1) What percentage of Orthopaedic In-Training Examination questions can a generative, pretrained transformer chatbot (ChatGPT) answer correctly? (2) How does that percentage compare with results achieved by orthopaedic residents of different levels, and, given that scoring lower than the 10th percentile relative to 5th-year residents likely corresponds to a failing American Board of Orthopaedic Surgery score, is this LLM likely to pass the orthopaedic surgery written boards? (3) Does increasing question taxonomy affect the LLM's ability to select the correct answer choices? METHODS This study randomly selected 400 of 3840 publicly available questions based on the Orthopaedic In-Training Examination and compared the mean score with that of residents who took the test over a 5-year period. Questions with figures, diagrams, or charts were excluded, as were five questions the LLM could not answer, leaving 207 administered questions with raw scores recorded. The LLM's results were compared with the Orthopaedic In-Training Examination ranking of orthopaedic surgery residents. Based on the findings of an earlier study, a pass-fail cutoff was set at the 10th percentile. Questions were then categorized based on the Buckwalter taxonomy of recall, which deals with increasingly complex levels of interpretation and application of knowledge; the LLM's performance across taxonomic levels was compared using a chi-square test. RESULTS ChatGPT selected the correct answer 47% (97 of 207) of the time and answered incorrectly 53% (110 of 207) of the time. Based on prior Orthopaedic In-Training Examination testing, the LLM scored in the 40th percentile for postgraduate year (PGY) 1 residents, the eighth percentile for PGY2s, and the first percentile for PGY3s, PGY4s, and PGY5s; based on the latter finding (and using a predefined cutoff of the 10th percentile of PGY5s as the threshold for a passing score), it seems unlikely that the LLM would pass the written board examination. The LLM's performance decreased as question taxonomy level increased (it answered 54% [54 of 101] of Tax 1 questions correctly, 51% [18 of 35] of Tax 2 questions correctly, and 34% [24 of 71] of Tax 3 questions correctly; p = 0.034). CONCLUSION Although this general-domain LLM has a low likelihood of passing the orthopaedic surgery board examination, its testing performance and knowledge are comparable to those of a first-year orthopaedic surgery resident. The LLM's ability to provide accurate answers declines with increasing question taxonomy and complexity, indicating a deficiency in applying knowledge. CLINICAL RELEVANCE Current AI appears to perform better at knowledge- and interpretation-based inquiries, and, based on this study and other areas of opportunity, it may become an additional tool for orthopaedic learning and education.
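The taxonomy finding can be checked directly from the counts in the abstract: arrange correct and incorrect answers per taxonomy level into a 2 x 3 contingency table and run a chi-square test of independence. The sketch below approximately reproduces the reported p = 0.034; the small residual difference plausibly reflects rounding in the published counts.

```python
# Chi-square test of independence on ChatGPT's per-taxonomy results:
# Tax 1: 54/101, Tax 2: 18/35, Tax 3: 24/71 correct (from the abstract).
from scipy.stats import chi2_contingency

correct = [54, 18, 24]
incorrect = [101 - 54, 35 - 18, 71 - 24]
chi2, p, dof, _ = chi2_contingency([correct, incorrect])
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}")  # ~0.031 vs. 0.034 reported
```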