1. Pohl NB, Derector E, Rivlin M, Bachoura A, Tosti R, Kachooei AR, Beredjiklian PK, Fletcher DJ. A quality and readability comparison of artificial intelligence and popular health website education materials for common hand surgery procedures. Hand Surg Rehabil 2024; 43:101723. PMID: 38782361; DOI: 10.1016/j.hansur.2024.101723.
Abstract
INTRODUCTION The application of ChatGPT to producing patient education materials for orthopaedic hand disorders has not been extensively studied. This study evaluated the quality and readability of educational information pertaining to common hand surgeries from patient education websites and information produced by ChatGPT. METHODS Patient education information for four hand surgeries (carpal tunnel release, trigger finger release, Dupuytren's contracture release, and ganglion cyst surgery) was extracted from ChatGPT (at a scientific and a fourth-grade reading level), WebMD, and Mayo Clinic. In a blinded and randomized fashion, five fellowship-trained orthopaedic hand surgeons evaluated the quality of information using modified DISCERN criteria. Readability and reading grade level were assessed using the Flesch Reading Ease (FRE) and Flesch-Kincaid Grade Level (FKGL) equations. RESULTS The Mayo Clinic website scored higher in quality for carpal tunnel release information (p = 0.004). WebMD scored higher for Dupuytren's contracture release (p < 0.001), ganglion cyst surgery (p = 0.003), and overall quality (p < 0.001). ChatGPT - 4th Grade Reading Level, ChatGPT - Scientific Reading Level, WebMD, and Mayo Clinic written materials on average exceeded the recommended reading grade levels (4th-6th grade) by at least four grade levels (10th, 14th, 13th, and 11th grade, respectively). CONCLUSIONS ChatGPT provides inferior education materials compared with patient-friendly websites. When prompted to provide more easily read materials, ChatGPT generates less robust information than patient-friendly websites and does not adequately simplify the educational information. ChatGPT has the potential to improve the quality and readability of patient education materials, but currently, patient-friendly websites provide superior quality at similar reading comprehension levels.
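For orientation, the two readability instruments used in this and several of the following studies are computed from average sentence length and average syllables per word. The standard published formulas are reproduced below for reference; they are not restated in the article itself:

```latex
\mathrm{FRE} = 206.835 - 1.015\left(\frac{\text{total words}}{\text{total sentences}}\right) - 84.6\left(\frac{\text{total syllables}}{\text{total words}}\right)
\qquad
\mathrm{FKGL} = 0.39\left(\frac{\text{total words}}{\text{total sentences}}\right) + 11.8\left(\frac{\text{total syllables}}{\text{total words}}\right) - 15.59
```

Higher FRE scores indicate easier text, while FKGL maps directly onto US school grade levels, which is why targets such as "4th-6th grade" are expressed in FKGL terms.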
Affiliation(s)
- Nicholas B Pohl, Evan Derector, Michael Rivlin, Rick Tosti, Pedro K Beredjiklian, Daniel J Fletcher: Department of Orthopaedic Surgery, Rothman Orthopaedic Institute, Philadelphia, PA, USA
- Abdo Bachoura, Amir R Kachooei: Department of Orthopaedic Surgery, Rothman Orthopaedics Florida, Orlando, FL, USA
2. Morya VK, Lee HW, Shahid H, Magar AG, Lee JH, Kim JH, Jun L, Noh KC. Application of ChatGPT for Orthopedic Surgeries and Patient Care. Clin Orthop Surg 2024; 16:347-356. PMID: 38827766; PMCID: PMC11130626; DOI: 10.4055/cios23181.
Abstract
Artificial intelligence (AI) has rapidly transformed various aspects of life, and the launch of the chatbot "ChatGPT" by OpenAI in November 2022 has garnered significant attention and user appreciation. ChatGPT utilizes natural language processing based on a "generative pre-trained transformer" (GPT) model, specifically the transformer architecture, to generate human-like responses to a wide range of questions and topics. Equipped with approximately 57 billion words and 175 billion parameters from online data, ChatGPT has potential applications in medicine and orthopedics. One of its key strengths is its personalized, easy-to-understand, and adaptive responses, which allow it to learn continuously through user interaction. This article discusses how AI, especially ChatGPT, presents numerous opportunities in orthopedics, ranging from preoperative planning and surgical techniques to patient education and medical support. Although ChatGPT's user-friendly responses and adaptive capabilities are laudable, its limitations, including biased responses and ethical concerns, necessitate its cautious and responsible use. Surgeons and healthcare providers should leverage the strengths of ChatGPT while recognizing its current limitations and verifying critical information through independent research and expert opinions. As AI technology continues to evolve, ChatGPT may become a valuable tool in orthopedic education and patient care, leading to improved outcomes and efficiency in healthcare delivery. The integration of AI into orthopedics offers substantial benefits but requires careful consideration and continuous improvement.
Affiliation(s)
- Vivek Kumar Morya, Ho-Won Lee, Hamzah Shahid, Anuja Gajanan Magar, Ju-Hyung Lee, Jae-Hyung Kim, Lang Jun, Kyu-Cheol Noh: Department of Orthopedic Surgery, Hallym University Kangnam Sacred Heart Hospital, Seoul, Korea
3. Baldwin AJ. An artificial intelligence language model improves readability of burns first aid information. Burns 2024; 50:1122-1127. PMID: 38492982; DOI: 10.1016/j.burns.2024.03.005.
Abstract
AIMS This study aimed to assess the potential of using an artificial intelligence (AI) large language model to improve the readability of burns first aid information. METHODS An AI language model (ChatGPT-3) was used to rewrite content from the top 50 English-language webpages containing burns first aid information so that it would be understandable by an individual with the literacy level of an 11-year-old, as recommended by the American Medical Association and Health Education England. Readability was assessed using five validated tools. RESULTS In their original form, only 4% of the patient education materials (PEMs) met the target readability level across all tools. The median grade was 6.9 (SD = 1.1); a one-sample one-tailed t-test revealed that this was not significantly below the target (p = .31). After AI modification, 18% of PEMs reached the target level using all tools, with a median grade of 6 (SD = 0.9), which was significantly below the target level (p < .001). Once the PEMs were rewritten using AI, paired t-tests demonstrated that all readability scores improved significantly (p < .001). CONCLUSION Utilising an AI language model proved an effective and viable method for enhancing the readability of burns first aid information.
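The before/after comparison reported here can be reproduced programmatically. Below is a minimal sketch assuming the third-party textstat Python package; the sample sentences are invented for illustration, and the abstract does not name the five tools the author actually used:

```python
# pip install textstat
import textstat

def readability_report(text: str) -> dict:
    """Score a passage with several commonly used readability tools."""
    return {
        "flesch_reading_ease": textstat.flesch_reading_ease(text),
        "flesch_kincaid_grade": textstat.flesch_kincaid_grade(text),
        "smog_index": textstat.smog_index(text),
        "gunning_fog": textstat.gunning_fog(text),
    }

# Hypothetical original and AI-simplified burns first aid sentences
original = ("Immediately irrigate the affected area with copious quantities "
            "of tepid water for a minimum duration of twenty minutes.")
rewritten = "Cool the burn under cool running water for at least 20 minutes."

for label, text in [("original", original), ("rewritten", rewritten)]:
    scores = readability_report(text)
    # Reading age ~11 corresponds roughly to a US 6th-grade level
    print(label, scores, "meets 6th-grade target:",
          scores["flesch_kincaid_grade"] <= 6.0)
```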
Affiliation(s)
- Alexander J Baldwin: Department of Burns and Plastic Surgery, Buckinghamshire Healthcare NHS Trust, Buckinghamshire, UK
4. Woo KMC, Simon GW, Akindutire O, Aphinyanaphongs Y, Austrian JS, Kim JG, Genes N, Goldenring JA, Major VJ, Pariente CS, Pineda EG, Kang SK. Evaluation of GPT-4 ability to identify and generate patient instructions for actionable incidental radiology findings. J Am Med Inform Assoc 2024:ocae117. PMID: 38778578; DOI: 10.1093/jamia/ocae117.
Abstract
OBJECTIVES To evaluate the proficiency of a HIPAA-compliant version of GPT-4 in identifying actionable, incidental findings from unstructured radiology reports of Emergency Department patients, and to assess the appropriateness of artificial intelligence (AI)-generated, patient-facing summaries of these findings. MATERIALS AND METHODS Radiology reports extracted from the electronic health record of a large academic medical center were manually reviewed to identify non-emergent, incidental findings with a high likelihood of requiring follow-up, further sub-stratified as "definitely actionable" (DA) or "possibly actionable-clinical correlation" (PA-CC). Instruction prompts to GPT-4 were developed and iteratively optimized using a validation set of 50 reports. The optimized prompt was then applied to a test set of 430 unseen reports. GPT-4 performance was primarily graded on accuracy in identifying either DA or PA-CC findings, and secondarily on DA findings alone. Outputs were reviewed for hallucinations. AI-generated patient-facing summaries were assessed for appropriateness via Likert scale. RESULTS For the primary outcome (DA or PA-CC), GPT-4 achieved 99.3% recall, 73.6% precision, and an F1 of 84.5%. For the secondary outcome (DA only), GPT-4 demonstrated 95.2% recall, 77.3% precision, and an F1 of 85.3%. No findings were "hallucinated" outright; however, 2.8% of cases included generated text about recommendations that were inferred without specific reference. The majority of true-positive AI-generated summaries required no or minor revision. CONCLUSION GPT-4 demonstrates proficiency in detecting actionable, incidental findings after refined instruction prompting. AI-generated patient instructions were most often appropriate, but rarely included inferred recommendations. While this technology shows promise to augment diagnostics, active clinician oversight via "human-in-the-loop" workflows remains critical for clinical implementation.
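For readers cross-checking the reported metrics: F1 is the harmonic mean of precision and recall, and the snippet below recovers the published primary-outcome F1 from the stated precision and recall (the raw counts behind these percentages are not given in the abstract):

```python
def f1_score(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Reported primary-outcome performance (DA or PA-CC findings)
precision, recall = 0.736, 0.993
print(f"F1 = {f1_score(precision, recall):.1%}")  # -> F1 = 84.5%, as reported
```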
Affiliation(s)
- Kar-Mun C Woo, Gregory W Simon, Olumide Akindutire, Jacob A Goldenring: Ronald O. Perelman Department of Emergency Medicine, NYU Grossman School of Medicine, New York, NY, USA
- Yindalon Aphinyanaphongs, Vincent J Major: Department of Population Health, NYU Grossman School of Medicine, New York, NY, USA; Department of Health Informatics, Medical Center IT, NYU Langone Health, New York, NY, USA
- Jonathan S Austrian: Department of Health Informatics, Medical Center IT, NYU Langone Health, New York, NY, USA; Department of Medicine, NYU Grossman School of Medicine, New York, NY, USA
- Jung G Kim: Ronald O. Perelman Department of Emergency Medicine, NYU Grossman School of Medicine, New York, NY, USA; Institute for Innovations in Medical Education, NYU Langone Health, New York, NY, USA
- Nicholas Genes: Ronald O. Perelman Department of Emergency Medicine, NYU Grossman School of Medicine, New York, NY, USA; Department of Health Informatics, Medical Center IT, NYU Langone Health, New York, NY, USA
- Chloé S Pariente: Department of Health Informatics, Medical Center IT, NYU Langone Health, New York, NY, USA
- Edwin G Pineda: MCIT Clinical Systems-ASAP application, NYU Langone Health, New York, NY, USA
- Stella K Kang: Department of Population Health, NYU Grossman School of Medicine, New York, NY, USA; Department of Radiology, NYU Grossman School of Medicine, New York, NY, USA
5. Ghanem YK, Rouhi AD, Al-Houssan A, Saleh Z, Moccia MC, Joshi H, Dumon KR, Hong Y, Spitz F, Joshi AR, Kwiatt M. Dr. Google to Dr. ChatGPT: assessing the content and quality of artificial intelligence-generated medical information on appendicitis. Surg Endosc 2024; 38:2887-2893. PMID: 38443499; PMCID: PMC11078845; DOI: 10.1007/s00464-024-10739-5.
Abstract
INTRODUCTION Generative artificial intelligence (AI) chatbots have recently been posited as potential sources of online medical information for patients making medical decisions. Existing online patient-oriented medical information has repeatedly been shown to be of variable quality and difficult readability. Therefore, we sought to evaluate the content and quality of AI-generated medical information on acute appendicitis. METHODS A modified DISCERN assessment tool, comprising 16 distinct criteria each scored on a 5-point Likert scale (score range 16-80), was used to assess AI-generated content. Readability was determined using the Flesch Reading Ease (FRE) and Flesch-Kincaid Grade Level (FKGL) scores. Four popular chatbots (ChatGPT-3.5, ChatGPT-4, Bard, and Claude-2) were prompted to generate medical information about appendicitis. Three investigators independently scored the generated texts, blinded to the identity of the AI platforms. RESULTS ChatGPT-3.5, ChatGPT-4, Bard, and Claude-2 had overall mean (SD) quality scores of 60.7 (1.2), 62.0 (1.0), 62.3 (1.2), and 51.3 (2.3), respectively, on a scale of 16-80. Inter-rater reliability was 0.81, 0.75, 0.81, and 0.72, respectively, indicating substantial agreement. Claude-2 demonstrated a significantly lower mean quality score than ChatGPT-4 (p = 0.001), ChatGPT-3.5 (p = 0.005), and Bard (p = 0.001). Bard was the only AI platform that listed verifiable sources, while Claude-2 provided fabricated sources. All chatbots except Claude-2 advised readers to consult a physician if experiencing symptoms. Regarding readability, the FKGL and FRE scores of ChatGPT-3.5, ChatGPT-4, Bard, and Claude-2 were 14.6 and 23.8, 11.9 and 33.9, 8.6 and 52.8, and 11.0 and 36.6, respectively, indicating difficult readability at a college reading level. CONCLUSION AI-generated medical information on appendicitis scored favorably on quality assessment, but most platforms either fabricated sources or did not provide any at all. Additionally, overall readability far exceeded recommended levels for the public. Generative AI platforms demonstrate measured potential for patient education and engagement about appendicitis.
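A sketch of the modified DISCERN scoring described here: 16 criteria, each rated on a 1-5 Likert scale and summed to a 16-80 total. The per-rater scores below are invented for illustration; the abstract does not list the individual criteria:

```python
def modified_discern_total(ratings: list[int]) -> int:
    """Sum a modified DISCERN assessment: 16 criteria, each scored 1-5."""
    if len(ratings) != 16:
        raise ValueError("the modified instrument uses exactly 16 criteria")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("each criterion is scored 1 (poor) to 5 (excellent)")
    return sum(ratings)  # possible total: 16 (all poor) to 80 (all excellent)

# One rater's hypothetical scores for a chatbot-generated appendicitis text
ratings = [4, 4, 3, 5, 4, 4, 3, 4, 4, 5, 3, 4, 4, 4, 3, 4]
print(modified_discern_total(ratings))  # -> 62, near the reported Bard mean
```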
Affiliation(s)
- Yazid K Ghanem, Young Hong, Francis Spitz, Amit R Joshi, Michael Kwiatt: Department of Surgery, Cooper University Hospital, Camden, NJ, USA; Cooper Medical School of Rowan University, Camden, NJ, USA
- Zena Saleh, Matthew C Moccia, Hansa Joshi: Department of Surgery, Cooper University Hospital, Camden, NJ, USA
- Armaun D Rouhi, Kristoffel R Dumon: Department of Surgery, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Ammr Al-Houssan: Department of Surgery, University of Connecticut, Hartford, CT, USA
6. Browne R, Gull K, Hurley CM, Sugrue RM, O'Sullivan JB. ChatGPT-4 Can Help Hand Surgeons Communicate Better With Patients. J Hand Surg Glob Online 2024; 6:436-438. PMID: 38817773; PMCID: PMC11133925; DOI: 10.1016/j.jhsg.2024.03.008.
Abstract
The American Society for Surgery of the Hand and the British Society for Surgery of the Hand produce patient-focused information above the sixth-grade reading level recommended by the American Medical Association. To promote health equity, patient-focused content should be aimed at an appropriate level of health literacy. Artificial intelligence-driven large language models may be able to assist hand surgery societies in improving the readability of the information provided to patients. Readability was calculated for all the articles written in English on the American Society for Surgery of the Hand and British Society for Surgery of the Hand websites, using seven of the most common readability formulas. Chat Generative Pre-Trained Transformer version 4 (ChatGPT-4) was then asked to rewrite each article at a sixth-grade readability level. The readability of each response was calculated and compared with that of the unedited articles. ChatGPT-4 improved readability across all chosen readability formulas and achieved a mean sixth-grade readability level on the Flesch-Kincaid Grade Level and Simple Measure of Gobbledygook calculations. It also increased the mean Flesch Reading Ease score, with higher scores representing more readable material. This study demonstrated that ChatGPT-4 can be used to improve the readability of patient-focused material in hand surgery. However, ChatGPT-4 is interested primarily in sounding natural, not in seeking truth, and hence each response must be evaluated by the surgeon to ensure that information accuracy is not being sacrificed for the sake of readability by this powerful tool.
Affiliation(s)
- Robert Browne: Royal College of Surgeons in Ireland, Dublin, Ireland
- Khadija Gull: Department of Reconstructive and Plastic Surgery, Connolly Hospital Blanchardstown, Dublin, Ireland
- John Barry O'Sullivan: Royal College of Surgeons in Ireland, Dublin, Ireland; Department of Reconstructive and Plastic Surgery, Connolly Hospital Blanchardstown, Dublin, Ireland
7. Yang J, Ardavanis KS, Slack KE, Fernando ND, Della Valle CJ, Hernandez NM. Chat Generative Pretrained Transformer (ChatGPT) and Bard: Artificial Intelligence Does Not Yet Provide Clinically Supported Answers for Hip and Knee Osteoarthritis. J Arthroplasty 2024; 39:1184-1190. PMID: 38237878; DOI: 10.1016/j.arth.2024.01.029.
Abstract
BACKGROUND Advancements in artificial intelligence (AI) have led to the creation of large language models (LLMs), such as Chat Generative Pretrained Transformer (ChatGPT) and Bard, that analyze online resources to synthesize responses to user queries. Despite their popularity, the accuracy of LLM responses to medical questions remains unknown. This study aimed to compare the responses of ChatGPT and Bard regarding treatments for hip and knee osteoarthritis with the American Academy of Orthopaedic Surgeons (AAOS) Evidence-Based Clinical Practice Guidelines (CPGs) recommendations. METHODS Both ChatGPT (OpenAI) and Bard (Google) were queried regarding 20 treatments (10 for hip and 10 for knee osteoarthritis) from the AAOS CPGs. Responses were classified by 2 reviewers as being in "Concordance," "Discordance," or "No Concordance" with the AAOS CPGs. Cohen's kappa coefficient was used to assess inter-rater reliability, and chi-squared analyses were used to compare responses between LLMs. RESULTS Overall, ChatGPT and Bard provided responses that were concordant with the AAOS CPGs for 16 (80%) and 12 (60%) treatments, respectively. Notably, ChatGPT and Bard encouraged the use of non-recommended treatments in 30% and 60% of queries, respectively. There were no differences in performance when evaluating by joint or by recommended versus non-recommended treatments. Studies were referenced in 6 (30%) of the Bard responses and none (0%) of the ChatGPT responses. Of the 6 Bard responses, the cited studies could be identified for only 1 (16.7%); of the remaining responses, 2 (33.3%) cited studies in journals that did not exist, 2 (33.3%) cited studies that could not be found with the information given, and 1 (16.7%) provided links to unrelated studies. CONCLUSIONS Neither ChatGPT nor Bard consistently provides responses that align with the AAOS CPGs. Consequently, physicians and patients should temper expectations on the guidance AI platforms can currently provide.
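The inter-rater statistic used here, Cohen's kappa, corrects the reviewers' observed agreement for agreement expected by chance. The standard definition, stated for orientation rather than drawn from the article, is:

```latex
\kappa = \frac{p_o - p_e}{1 - p_e}
```

where p_o is the observed proportion of agreement between the two reviewers and p_e the proportion expected if both classified at random according to their marginal rates; kappa = 1 indicates perfect agreement and kappa = 0 chance-level agreement.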
Affiliation(s)
- JaeWon Yang, Navin D Fernando, Nicholas M Hernandez: Department of Orthopaedic Surgery, University of Washington, Seattle, WA, USA
- Kyle S Ardavanis: Department of Orthopaedic Surgery, Madigan Medical Center, Tacoma, WA, USA
- Katherine E Slack: Elson S. Floyd College of Medicine, Washington State University, Spokane, WA, USA
- Craig J Della Valle: Department of Orthopaedic Surgery, Rush University Medical Center, Chicago, IL, USA
8. Sacca L, Lobaina D, Burgoa S, Lotharius K, Moothedan E, Gilmore N, Xie J, Mohler R, Scharf G, Knecht M, Kitsantas P. Promoting Artificial Intelligence for Global Breast Cancer Risk Prediction and Screening in Adult Women: A Scoping Review. J Clin Med 2024; 13:2525. PMID: 38731054; PMCID: PMC11084581; DOI: 10.3390/jcm13092525.
Abstract
Background: Artificial intelligence (AI) algorithms can be applied in breast cancer risk prediction and prevention by using patient history, scans, imaging information, and analysis of specific genes for cancer classification to reduce overdiagnosis and overtreatment. This scoping review aimed to identify the barriers encountered in applying innovative AI techniques and models in developing breast cancer risk prediction scores and promoting screening behaviors among adult females. Findings may inform and guide future global recommendations for AI application in breast cancer prevention and care for female populations. Methods: PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews) was used as a reference checklist throughout this study. The Arksey and O'Malley methodology was used as a framework to guide this review, consisting of five steps: (1) identify research questions; (2) search for relevant studies; (3) select studies relevant to the research questions; (4) chart the data; and (5) collate, summarize, and report the results. Results: In the field of breast cancer risk detection and prevention, the following AI techniques and models have been applied: a machine and deep learning model (ML-DL model) (n = 1), academic algorithms (n = 2), the Breast Cancer Surveillance Consortium (BCSC) Clinical 5-Year Risk Prediction Model (n = 2), deep-learning computer vision AI algorithms (n = 2), an AI-based thermal imaging solution (Thermalytix) (n = 1), RealRisks (n = 2), Breast Cancer Risk NAVIgation (n = 1), MammoRisk (an ML-based tool) (n = 1), various ML models (n = 1), and various machine/deep learning, decision aid, and commercial algorithms (n = 7). In the 11 included studies, a total of 39 barriers to AI applications in breast cancer risk prediction and screening efforts were identified. The most common barriers were lack of external validity and limited generalizability (n = 6), as AI was used in studies with either a small sample size or datasets with missing data. Many studies (n = 5) also encountered selection bias due to exclusion of certain populations based on characteristics such as race/ethnicity, family history, or past medical history. Several recommendations for future research should be considered: AI models need to include a broader spectrum of, and more complete, predictive variables for risk assessment; investigating long-term outcomes with improved follow-up periods is critical to assess the impact of AI on clinical decisions beyond the immediate outcomes; and utilizing AI to improve communication strategies at both the local and organizational levels can support informed decision-making and compliance, especially in populations with limited literacy levels. Conclusions: The use of AI in patient education and as an adjunctive tool for providers is still early in its incorporation, and future research should explore the implementation of AI-driven resources to enhance understanding and decision-making regarding breast cancer screening, especially in vulnerable populations with limited literacy.
Affiliation(s)
- Lea Sacca (and all co-authors): Charles E. Schmidt College of Medicine, Florida Atlantic University, Boca Raton, FL, USA
9. Fiedler B, Azua EN, Phillips T, Ahmed AS. ChatGPT performance on the American Shoulder and Elbow Surgeons maintenance of certification exam. J Shoulder Elbow Surg 2024:S1058-2746(24)00231-3. PMID: 38580067; DOI: 10.1016/j.jse.2024.02.029.
Abstract
BACKGROUND While multiple studies have tested the ability of large language models (LLMs), such as ChatGPT, to pass standardized medical exams at different levels of training, LLMs have never been tested on surgical sub-specialty examinations, such as the American Shoulder and Elbow Surgeons (ASES) Maintenance of Certification (MOC). The purpose of this study was to compare the results of ChatGPT 3.5, GPT-4, and fellowship-trained surgeons on the 2023 ASES MOC self-assessment exam. METHODS ChatGPT 3.5 and GPT-4 were subjected to the same set of text-only questions from the ASES MOC exam, and GPT-4 was additionally subjected to image-based MOC exam questions. Question responses from both models were compared against the correct answers, and the performance of both models was compared with the corresponding average human performance on the same question subsets. One-sided proportion z-tests were used to analyze the data. RESULTS Humans performed significantly better than ChatGPT 3.5 on exclusively text-based questions (76.4% vs. 60.8%, P = .044). Humans also performed significantly better than GPT-4 on image-based questions (73.9% vs. 53.2%, P = .019). There was no significant difference between humans and GPT-4 on text-based questions (76.4% vs. 66.7%, P = .136). Accounting for all questions, humans significantly outperformed GPT-4 (75.3% vs. 60.2%, P = .012). GPT-4 did not perform significantly better than ChatGPT 3.5 on text-only questions (66.7% vs. 60.8%, P = .268). DISCUSSION Although human performance was superior overall, ChatGPT demonstrated the capacity to analyze orthopedic information and answer specialty-specific questions on the ASES MOC exam for both text and image-based questions. With continued advancements in deep learning, LLMs may someday rival the exam performance of fellowship-trained surgeons.
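A sketch of the one-sided two-proportion z-test used here, assuming the statsmodels Python package. The counts below are hypothetical (chosen to match the reported text-only percentages; the abstract gives percentages, not the underlying counts, so the p-value printed will not match the published one):

```python
# pip install statsmodels
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical counts: humans ~76.4% vs. ChatGPT 3.5 ~60.8% correct
successes = np.array([113, 90])   # correct answers: humans, GPT-3.5
trials = np.array([148, 148])     # questions attempted by each

# One-sided test: is the humans' proportion correct larger than the model's?
z_stat, p_value = proportions_ztest(successes, trials, alternative="larger")
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
```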
Affiliation(s)
- Benjamin Fiedler, Eric N Azua, Todd Phillips, Adil Shahzad Ahmed: Joseph Barnhart Department of Orthopedic Surgery, Baylor College of Medicine, Houston, TX, USA
10. Arif HA, LeBrun G, Moore ST, Friscia DA. Analysis of the Most Popular Online Ankle Fracture-Related Patient Education Materials. Foot Ankle Orthop 2024; 9:24730114241241310. PMID: 38577700; PMCID: PMC10989055; DOI: 10.1177/24730114241241310.
Abstract
Background Given the increasing availability of Internet access, it is critical to ensure that the informational material available online for patient education is both accurate and readable, to promote a greater degree of health literacy. This study sought to investigate the quality and readability of the most popular online resources for ankle fractures. Methods After conducting a Google search using 6 terms related to ankle fractures, we collected the first 20 nonsponsored results for each term. Readability was evaluated using the Flesch Reading Ease (FRE), Flesch-Kincaid Grade Level (FKGL), and Gunning Fog Index (GFI) instruments. Quality was evaluated using a custom-created Ankle Fracture Index (AFI). Results A total of 46 of 120 articles met the inclusion criteria. The mean FKGL, FRE, and GFI scores were 8.4 ± 0.5, 57.5 ± 3.2, and 10.5 ± 0.5, respectively. The average AFI score was 15.4 ± 1.4, corresponding to an "acceptable" quality rating. Almost 70% of articles (n = 32) were written at or below the recommended eighth-grade reading level. Most articles discussed the need for imaging in diagnosis and treatment planning while neglecting to discuss the risks of surgery or potential future operations. Conclusion We found that online patient-facing materials on ankle fractures demonstrated an eighth-grade average reading level and acceptable quality on content analysis. Future work should expand coverage of risk factors, surgical complications, and long-term recovery while keeping readability at or below the eighth-grade level.
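The third instrument used here, the Gunning Fog Index, follows the same pattern as FRE and FKGL but counts complex (three-or-more-syllable) words. The standard formula, stated for orientation rather than taken from the article, is:

```latex
\mathrm{GFI} = 0.4\left[\frac{\text{total words}}{\text{total sentences}} + 100\left(\frac{\text{complex words}}{\text{total words}}\right)\right]
```

Like FKGL, GFI approximates the US school grade needed to follow the text on first reading, so the reported mean of 10.5 corresponds roughly to a mid-high-school level.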
Affiliation(s)
- Haad A. Arif, Simon T. Moore: School of Medicine, University of California Riverside, Riverside, CA, USA
- David A. Friscia: School of Medicine, University of California Riverside, Riverside, CA, USA; Eisenhower Desert Orthopedic Center, Rancho Mirage, CA, USA
11. Lin MX, Li G, Cui D, Mathews PM, Akpek EK. Usability of Patient Education-Oriented Cataract Surgery Websites. Ophthalmology 2024; 131:499-506. PMID: 37852419; DOI: 10.1016/j.ophtha.2023.10.019.
Abstract
PURPOSE To assess the web accessibility and readability of patient-oriented educational websites for cataract surgery. DESIGN Cross-sectional electronic survey. PARTICIPANTS Websites with information dedicated to educating patients about cataract surgery. METHODS An incognito search for "cataract surgery" was performed using a popular search engine. The top 100 patient-oriented cataract surgery websites returned were included and categorized as institutional, private practice, or medical organization according to authorship. Each site was assessed for readability using 4 standardized reading grade-level formulas. Accessibility was assessed through multilingual availability, accessibility menu availability, complementary educational video availability, and conformance with the Web Content Accessibility Guidelines (WCAG) 2.0. A standard t test and chi-square analysis were performed to assess the significance of differences in readability and accessibility among the 3 authorship categories. MAIN OUTCOME MEASURES The main outcome measures were each website's average reading grade level, number of accessibility violations, multilingual availability, accessibility menu availability, complementary educational video availability, accessibility conformance level, and violations of the perceivable, operable, understandable, and robust (POUR) principles of the WCAG 2.0. RESULTS A total of 32, 55, and 13 sites were affiliated with institutions, private practices, and other medical organizations, respectively. The overall mean reading grade was 11.8 ± 1.6, with higher reading levels observed on private practice websites compared with institution and medical organization websites combined (12.1 vs. 11.4; P = 0.03). Fewer private practice websites had multiple language options compared with institutional and medical organization websites combined (5.5% vs. 20.0%; P = 0.03). More private practice websites had accessibility menus than institutions and medical organizations combined (27.3% vs. 8.9%; P = 0.038). The overall mean number of WCAG 2.0 POUR principle violations was 17.1 ± 23.1, with no significant difference among groups. Eighty-five percent of websites violated the perceivable principle. CONCLUSIONS Available patient-oriented online information for cataract surgery may not be comprehensible to the general public. Readability and accessibility should be considered when designing these resources. FINANCIAL DISCLOSURE(S) The author(s) have no proprietary or commercial interest in any materials discussed in this article.
Affiliation(s)
- Michael X Lin, Esen K Akpek: The Ocular Surface Disease Clinic, The Wilmer Eye Institute, Johns Hopkins University School of Medicine, Baltimore, MD, USA
- Gavin Li: The Ocular Surface Disease Clinic, The Wilmer Eye Institute, Johns Hopkins University School of Medicine, Baltimore, MD, USA; Icahn School of Medicine at Mount Sinai, New York, NY, USA
- David Cui: The Ocular Surface Disease Clinic, The Wilmer Eye Institute, Johns Hopkins University School of Medicine, Baltimore, MD, USA; Krieger Eye Institute, Sinai Hospital of Baltimore, Baltimore, MD, USA
- Priya M Mathews: The Ocular Surface Disease Clinic, The Wilmer Eye Institute, Johns Hopkins University School of Medicine, Baltimore, MD, USA; Center for Sight, Sarasota, FL, USA
12. Weddell J, Jawad D, Buckley T, Redfern J, Mansur Z, Elliott N, Hanson CL, Gallagher R. Online information for spontaneous coronary artery dissection (SCAD) survivors and their families: A systematic appraisal of content and quality of websites. Int J Med Inform 2024; 184:105372. PMID: 38350180; DOI: 10.1016/j.ijmedinf.2024.105372.
Abstract
BACKGROUND Spontaneous coronary artery dissection (SCAD) survivors often seek information online. However, the quality and content of websites for SCAD survivors are uncertain. This review aimed to systematically identify and appraise websites for SCAD survivors. METHODS A systematic review approach was adapted for websites. A comprehensive search of SCAD key-phrases was performed using an internet search engine during January 2023, and websites targeting SCAD survivors were included. Websites were appraised for quality using the Quality Component Scoring System (QCSS) and the Health Related Website Evaluation Form (HRWEF), for suitability using the Suitability Assessment Method (SAM), for readability using a readability generator, and for interactivity. Content was appraised using a tool based on the SCAD international consensus literature. Raw scores from the tools were converted to percentages and then classified on scales ranging from excellent to poor. RESULTS A total of 50 websites were identified and included from 600 screened. Overall, content accuracy/scope (53.3 ± 23.3) and interactivity (67.1 ± 11.5) were poor, quality was fair (59.1 ± 22.3, QCSS) to average (83.1 ± 5.8, HRWEF), and suitability was adequate (54.9 ± 13.8, SAM). The mean readability grade was 11.6 (± 2.3), far exceeding the recommended level of ≤ 8. By website type, survivor-affiliated and medically peer-reviewed health information websites scored highest. The appraisal tools had limitations, such as overlapping assessment of similar constructs and items made less relevant by the modern internet. CONCLUSION Many websites are available for SCAD survivors, but they often have limited and/or inaccurate content and poor quality, are not tailored to the demographic, and are difficult to read. Appraisal tools for health websites require consolidation and further development.
Affiliation(s)
- Joseph Weddell, Thomas Buckley, Zarin Mansur, Robyn Gallagher: Sydney Nursing School, Faculty of Medicine and Health, The University of Sydney, Sydney, Australia; Charles Perkins Centre, The University of Sydney, Sydney, Australia
- Danielle Jawad: Sydney School of Public Health, Faculty of Medicine and Health, The University of Sydney, Sydney, Australia; Health Promotion Unit, Population Health Research & Evaluation Hub, Sydney Local Health District, Sydney, Australia
- Julie Redfern: Charles Perkins Centre, The University of Sydney, Sydney, Australia; Sydney School of Health Sciences, Faculty of Medicine and Health, The University of Sydney, Sydney, Australia
- Natalie Elliott, Coral L Hanson: School of Health and Social Care, Edinburgh Napier University, Edinburgh, UK
13. Arango SD, Flynn JC, Zeitlin J, Lorenzana DJ, Miller AJ, Wilson MS, Strohl AB, Weiss LE, Weir TB. The Performance of ChatGPT on the American Society for Surgery of the Hand Self-Assessment Examination. Cureus 2024; 16:e58950. PMID: 38800302; PMCID: PMC11126365; DOI: 10.7759/cureus.58950.
Abstract
BACKGROUND This study aims to compare the performance of ChatGPT-3.5 (GPT-3.5) and ChatGPT-4 (GPT-4) on the American Society for Surgery of the Hand (ASSH) Self-Assessment Examination (SAE) to determine their potential as educational tools. METHODS This study compared the proportion of correct answers to text-based questions on the 2021 and 2022 ASSH SAE between untrained ChatGPT versions. Secondary analyses assessed the performance of ChatGPT by question difficulty and question category. The outcomes of ChatGPT were compared with the performance of actual examinees on the ASSH SAE. RESULTS A total of 238 questions were included in the analysis. Compared with GPT-3.5, GPT-4 provided significantly more correct answers overall (58.0% versus 68.9%, respectively; P = 0.013), on the 2022 SAE (55.9% versus 72.9%; P = 0.007), and on more difficult questions (48.8% versus 63.6%; P = 0.02). In a multivariable logistic regression analysis, correct answers were predicted by GPT-4 (odds ratio [OR], 1.66; P = 0.011), increased question difficulty (OR, 0.59; P = 0.009), Bone and Joint questions (OR, 0.18; P < 0.001), and Soft Tissue questions (OR, 0.30; P = 0.013). Actual examinees scored a mean of 21.6% above GPT-3.5 and 10.7% above GPT-4. The mean percentage of correct answers by actual examinees was significantly higher for questions ChatGPT answered correctly than for those it answered incorrectly. CONCLUSIONS GPT-4 demonstrated improved performance over GPT-3.5 on the ASSH SAE, especially on more difficult questions. Actual examinees scored higher than both versions of ChatGPT, but the margin was cut in half by GPT-4.
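For orientation, the odds ratios in the multivariable logistic regression reported here are exponentiated model coefficients. The standard relationship, not restated in the article, is:

```latex
\log\frac{P(\text{correct})}{1 - P(\text{correct})} = \beta_0 + \sum_j \beta_j x_j, \qquad \mathrm{OR}_j = e^{\beta_j}
```

so the OR of 1.66 for GPT-4 means the odds of a correct answer were about 66% higher than for GPT-3.5 with the other predictors held fixed, while ORs below 1 (e.g., 0.59 for question difficulty) indicate variables that lowered the odds of a correct answer.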
Affiliation(s)
- Sebastian D Arango, Jacob Zeitlin, Daniel J Lorenzana, Andrew J Miller, Matthew S Wilson, Adam B Strohl, Tristan B Weir: Department of Orthopaedic Surgery, Philadelphia Hand to Shoulder Center, Philadelphia, PA, USA
- Jason C Flynn: Department of Orthopaedic Surgery, Sidney Kimmel Medical College, Philadelphia, PA, USA
- Lawrence E Weiss: Division of Orthopaedic Hand Surgery, OAA Orthopaedic Specialists, Allentown, PA, USA
14. Parekh AS, McCahon JAS, Nghe A, Pedowitz DI, Daniel JN, Parekh SG. Foot and Ankle Patient Education Materials and Artificial Intelligence Chatbots: A Comparative Analysis. Foot Ankle Spec 2024:19386400241235834. PMID: 38504411; DOI: 10.1177/19386400241235834.
Abstract
BACKGROUND The purpose of this study was to perform a comparative analysis of foot and ankle patient education material generated by AI chatbots as it compares to the American Orthopaedic Foot and Ankle Society (AOFAS)-recommended patient education website, FootCareMD.org. METHODS ChatGPT, Google Bard, and Bing AI were used to generate patient education materials on 10 of the most common foot and ankle conditions. The content from these AI language model platforms was analyzed and compared with that on FootCareMD.org for accuracy of included information. Accuracy was determined for each of the 10 conditions on the basis of included information regarding background, symptoms, causes, diagnosis, treatments, surgical options, recovery, and risks or prevention. RESULTS When compared to the reference standard of the AOFAS website FootCareMD.org, the AI language model platforms consistently scored below 60% accuracy in all categories of the articles analyzed. ChatGPT was found to contain an average of 46.2% of key content across all included conditions when compared to FootCareMD.org. Comparatively, Google Bard and Bing AI contained 36.5% and 28.0% of the information included on FootCareMD.org, respectively (P < .005). CONCLUSION Patient education regarding common foot and ankle conditions generated by AI language models shows limited content accuracy across all 3 AI chatbot platforms. LEVEL OF EVIDENCE Level IV.
Affiliation(s)
- Aarav S Parekh, Amy Nghe: Rothman Orthopaedic Institute, Philadelphia, PA, USA
15. Moons P, Van Bulck L. Using ChatGPT and Google Bard to improve the readability of written patient information: a proof of concept. Eur J Cardiovasc Nurs 2024; 23:122-126. PMID: 37603843; DOI: 10.1093/eurjcn/zvad087.
Abstract
Patient information materials tend to be written at a reading level that is too advanced for many patients. In this proof-of-concept study, we used ChatGPT and Google Bard to reduce the reading level of three selected patient information sections from scientific journals. ChatGPT successfully improved readability but could not achieve the recommended 6th-grade reading level. Bard reached a 6th-grade reading level but oversimplified the texts, omitting up to 83% of the content. Despite the present limitations, developers of patient information are encouraged to employ large language models, preferably ChatGPT, to optimize their materials.
Affiliation(s)
- Philip Moons: KU Leuven Department of Public Health and Primary Care, KU Leuven-University of Leuven, Leuven, Belgium; Institute of Health and Care Sciences, University of Gothenburg, Gothenburg, Sweden; Department of Paediatrics and Child Health, University of Cape Town, Cape Town, South Africa
- Liesbet Van Bulck: KU Leuven Department of Public Health and Primary Care, KU Leuven-University of Leuven, Leuven, Belgium; Research Foundation Flanders (FWO), Brussels, Belgium
16. Lum ZC, Collins DP, Dennison S, Guntupalli L, Choudhary S, Saiz AM, Randall RL. Generative Artificial Intelligence Performs at a Second-Year Orthopedic Resident Level. Cureus 2024; 16:e56104. PMID: 38618358; PMCID: PMC11014641; DOI: 10.7759/cureus.56104.
Abstract
Introduction Artificial intelligence (AI) models using large language models (LLMs) and non-specific domains have gained attention for their innovative information processing. As AI advances, it is essential to regularly evaluate these tools' competency to maintain high standards, prevent errors or biases, and avoid flawed reasoning or misinformation that could harm patients or spread inaccuracies. Our study aimed to determine the performance of Chat Generative Pre-trained Transformer (ChatGPT) by OpenAI and Google BARD (BARD) in orthopedic surgery, assess performance by question type, contrast performance between the two AIs, and compare AI performance to that of orthopedic residents. Methods We administered 757 Orthopedic In-Training Examination (OITE) questions to ChatGPT and BARD. After excluding image-related questions, the AIs answered 390 multiple-choice questions, all categorized within 10 sub-specialties (basic science, trauma, sports medicine, spine, hip and knee, pediatrics, oncology, shoulder and elbow, hand, and foot and ankle) and three taxonomy classes (recall, interpretation, and application of knowledge). Statistical analysis compared the number of questions answered correctly by each AI model, each model's performance within each sub-specialty, and each model's performance against the results of orthopedic residents classified by post-graduate year (PGY) level. Results BARD answered more questions correctly overall (58% vs 54%, p < 0.001). ChatGPT performed better in sports medicine and basic science and worse in hand surgery, while BARD performed better in basic science (p < 0.05). The AIs performed better on recall questions than on application-of-knowledge questions (p < 0.05). Based on previous data, AI performance ranked in the 42nd-96th percentile for post-graduate year ones (PGY1s), 27th-58th for PGY2s, 3rd-29th for PGY3s, 1st-21st for PGY4s, and 1st-17th for PGY5s. Discussion ChatGPT excelled in sports medicine but fell short in hand surgery, while both AIs performed well in the basic science sub-specialty but poorly on application-of-knowledge taxonomy questions. BARD performed better than ChatGPT overall. Although AI reached the second-year (PGY2) orthopedic resident level, it fell short of passing the American Board of Orthopedic Surgery (ABOS). Its strength on recall-based inquiries highlights its potential as an orthopedic learning and educational tool.
Affiliation(s)
- Zachary C Lum: Orthopedic Surgery, University of California (UC) Davis School of Medicine, Sacramento, CA, USA; Orthopedic Surgery, Nova Southeastern University, Pembroke Pines, FL, USA
- Dylon P Collins, Stanley Dennison: College of Medicine, Nova Southeastern University Dr. Kiran C. Patel College of Osteopathic Medicine, Fort Lauderdale, FL, USA
- Lohitha Guntupalli: Osteopathic Medicine, Nova Southeastern University Dr. Kiran C. Patel College of Osteopathic Medicine, Clearwater, FL, USA
- Soham Choudhary: Orthopedic Surgery, University of California, Davis, Davis, CA, USA
- Augustine M Saiz, Robert L Randall: Orthopedic Surgery, University of California (UC) Davis Health, Sacramento, CA, USA
17. Huffman N, Pasqualini I, Khan ST, Klika AK, Deren ME, Jin Y, Kunze KN, Piuzzi NS. Enabling Personalized Medicine in Orthopaedic Surgery Through Artificial Intelligence: A Critical Analysis Review. JBJS Rev 2024; 12:01874474-202403000-00006. PMID: 38466797; DOI: 10.2106/jbjs.rvw.23.00232.
Abstract
» The application of artificial intelligence (AI) in the field of orthopaedic surgery holds potential for revolutionizing health care delivery across 3 crucial domains: (I) personalized prediction of clinical outcomes and adverse events, which may optimize patient selection and surgical planning and enhance patient safety and outcomes; (II) automated and semiautomated diagnostic imaging analyses, which may reduce time burden and facilitate precise and timely diagnoses; and (III) forecasting of resource utilization, which may reduce health care costs and increase value for patients and institutions.
» Computer vision is one of the most highly studied areas of AI within orthopaedics, with applications pertaining to fracture classification, identification of the manufacturer and model of prosthetic implants, and surveillance of prosthesis loosening and failure.
» Prognostic applications of AI within orthopaedics include identifying patients who will likely benefit from a specified treatment and predicting prosthetic implant size, postoperative length of stay, discharge disposition, and surgical complications. Not only may these applications be beneficial to patients, but also to institutions and payors, because they may inform potential cost expenditure, improve overall hospital efficiency, and help anticipate resource utilization.
» AI infrastructure development requires institutional financial commitment and a team of clinicians and data scientists with expertise in AI who can complement one another's skill sets and knowledge. Once a team is established and a goal is determined, teams (1) obtain, curate, and label data; (2) establish a reference standard; (3) develop an AI model; (4) evaluate the performance of the AI model; (5) externally validate the model; and (6) reinforce, improve, and evaluate the model's performance until clinical implementation is possible.
» Understanding the implications of AI in orthopaedics may eventually lead to wide-ranging improvements in patient care. However, AI, while holding tremendous promise, is not without methodological and ethical limitations that are essential to address. First, it is important to ensure the external validity of programs before their use in a clinical setting. Investigators should maintain high-quality data records and registry surveillance, exercise caution when evaluating others' reported AI applications, and increase the transparency of the methodological conduct of current models to improve external validity and avoid propagating bias. By addressing these challenges and responsibly embracing the potential of AI, the medical field may eventually be able to harness its power to improve patient care and outcomes.
Affiliation(s)
- Nickelas Huffman, Shujaa T Khan, Alison K Klika, Matthew E Deren, Yuxuan Jin: Department of Orthopaedic Surgery, Cleveland Clinic, Cleveland, OH, USA
- Kyle N Kunze: Department of Orthopaedic Surgery, Hospital for Special Surgery, New York, NY, USA
- Nicolas S Piuzzi: Department of Orthopaedic Surgery, Cleveland Clinic, Cleveland, OH, USA; Department of Biomedical Engineering, Cleveland Clinic Foundation, Cleveland, OH, USA
18. Rouhi AD, Ghanem YK, Yolchieva L, Saleh Z, Joshi H, Moccia MC, Suarez-Pierre A, Han JJ. Can Artificial Intelligence Improve the Readability of Patient Education Materials on Aortic Stenosis? A Pilot Study. Cardiol Ther 2024; 13:137-147. PMID: 38194058; PMCID: PMC10899139; DOI: 10.1007/s40119-023-00347-0.
Abstract
INTRODUCTION The advent of generative artificial intelligence (AI) dialogue platforms and large language models (LLMs) may help facilitate ongoing efforts to improve health literacy. Additionally, recent studies have highlighted inadequate health literacy among patients with cardiac disease. The aim of the present study was to ascertain whether two freely available generative AI dialogue platforms could rewrite online aortic stenosis (AS) patient education materials (PEMs) to meet recommended reading skill levels for the public. METHODS Online PEMs were gathered from a professional cardiothoracic surgical society and academic institutions in the USA. PEMs were then entered into two AI-powered LLMs, ChatGPT-3.5 and Bard, with the prompt "translate to 5th-grade reading level". Readability of PEMs before and after AI conversion was measured using the validated Flesch Reading Ease (FRE), Flesch-Kincaid Grade Level (FKGL), Simple Measure of Gobbledygook Index (SMOGI), and Gunning-Fog Index (GFI) scores. RESULTS Overall, 21 PEMs on AS were gathered. Baseline readability measures indicated difficult readability, at the 10th-12th grade reading level. ChatGPT-3.5 successfully improved readability across all four measures (p < 0.001) to approximately the 6th-7th grade reading level. Bard successfully improved readability across all measures (p < 0.001) except SMOGI (p = 0.729), to approximately the 8th-9th grade level. Neither platform generated PEMs written below the recommended 6th-grade reading level. ChatGPT-3.5 demonstrated significantly more favorable post-conversion readability scores, percentage change in readability scores, and conversion time compared with Bard (all p < 0.001). CONCLUSION AI dialogue platforms can enhance the readability of PEMs for patients with AS but may not fully meet recommended reading skill levels, highlighting potential tools to help strengthen cardiac health literacy in the future.
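The four readability indices named here can be computed with off-the-shelf tooling, which makes this kind of before/after comparison easy to reproduce in outline. Below is a minimal sketch using the Python textstat package; the two sample passages are invented stand-ins, not the study's PEMs.

```python
# Scoring a passage before and after simplification with the same four
# indices used in the study. Requires: pip install textstat.
# Note: SMOG is formally defined for texts of 30+ sentences, so its
# value on short snippets like these is only indicative.
import textstat

original = (
    "Aortic stenosis is a progressive narrowing of the aortic valve "
    "orifice that obstructs left ventricular outflow and may culminate "
    "in syncope, angina, or congestive heart failure."
)
simplified = (
    "Aortic stenosis means the valve leading out of your heart gets "
    "narrow. Your heart has to work harder. This can make you faint, "
    "have chest pain, or feel short of breath."
)

for label, text in [("original", original), ("simplified", simplified)]:
    print(
        f"{label:>10}: "
        f"FRE={textstat.flesch_reading_ease(text):.1f}  "    # higher = easier
        f"FKGL={textstat.flesch_kincaid_grade(text):.1f}  "  # US grade level
        f"SMOG={textstat.smog_index(text):.1f}  "
        f"GFI={textstat.gunning_fog(text):.1f}"
    )
```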
Affiliation(s)
- Armaun D Rouhi: Department of Surgery, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Yazid K Ghanem: Department of Surgery, Cooper University Hospital, Camden, NJ, USA
- Laman Yolchieva: College of Arts and Sciences, University of Pennsylvania, Philadelphia, PA, USA
- Zena Saleh: Department of Surgery, Cooper University Hospital, Camden, NJ, USA
- Hansa Joshi: Department of Surgery, Cooper University Hospital, Camden, NJ, USA
- Matthew C Moccia: Department of Surgery, Cooper University Hospital, Camden, NJ, USA
- Jason J Han: Division of Cardiovascular Surgery, Department of Surgery, Perelman School of Medicine, Hospital of the University of Pennsylvania, Philadelphia, PA, USA
19
Mayol J. Transforming Abdominal Wall Surgery With Generative Artificial Intelligence. JOURNAL OF ABDOMINAL WALL SURGERY: JAWS 2023; 2:12419. [PMID: 38312403 PMCID: PMC10831645 DOI: 10.3389/jaws.2023.12419] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Key Words] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 11/14/2023] [Accepted: 11/16/2023] [Indexed: 02/06/2024]
Affiliation(s)
- Julio Mayol: Hospital Clinico San Carlos, Instituto de Investigación Sanitaria San Carlos, Universidad Complutense de Madrid, Madrid, Spain
20
Crook BS, Park CN, Hurley ET, Richard MJ, Pidgeon TS. Evaluation of Online Artificial Intelligence-Generated Information on Common Hand Procedures. J Hand Surg Am 2023; 48:1122-1127. [PMID: 37690015 DOI: 10.1016/j.jhsa.2023.08.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 05/06/2023] [Revised: 07/25/2023] [Accepted: 08/02/2023] [Indexed: 09/11/2023]
Abstract
PURPOSE The purpose of this study was to analyze the quality and readability of the information generated by an online artificial intelligence (AI) platform regarding 4 common hand surgeries and to compare AI-generated responses with those provided in the informational articles published by the American Society for Surgery of the Hand (ASSH) HandCare website. METHODS An open AI model (ChatGPT) was used to answer questions commonly asked by patients about 4 common hand surgeries (carpal tunnel release, cubital tunnel release, trigger finger release, and distal radius fracture fixation). These answers were evaluated for medical accuracy, quality, and readability and compared with answers derived from the ASSH HandCare materials. RESULTS For the AI model, the Journal of the American Medical Association benchmark criteria score was 0/4, and the DISCERN score was 58 (considered good). The areas in which the AI model lost points were primarily related to the lack of attribution, reliability, and currency of the source material. For AI responses, the mean Flesch Reading Ease score was 34 and the mean Flesch-Kincaid Grade Level was 15, which is considered college level. For comparison, ASSH HandCare materials scored 3/4 on the Journal of the American Medical Association benchmark, 71 on DISCERN (excellent), 9 on Flesch-Kincaid Grade Level, and 60 on Flesch Reading Ease (an eighth/ninth grade reading level). CONCLUSION An AI language model (ChatGPT) provided generally high-quality answers to frequently asked questions about the common hand procedures queried, but without citations to source material it is unclear when or where these answers originated. Furthermore, a high reading level was required to comprehend the information presented. The AI software repeatedly referenced the need to discuss these questions with a surgeon, the importance of shared decision-making and individualized care, and compliance with surgeon treatment recommendations. CLINICAL RELEVANCE As novel AI applications become increasingly mainstream, hand surgeons must understand the limitations and ramifications of these technologies for patient care.
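For reference, the two Flesch measures used in this study and the ones above are fixed formulas over average sentence length and syllable density. With $W$ = total words, $S$ = total sentences, and $Y$ = total syllables:

$$\mathrm{FRE} = 206.835 - 1.015\,\frac{W}{S} - 84.6\,\frac{Y}{W}$$

$$\mathrm{FKGL} = 0.39\,\frac{W}{S} + 11.8\,\frac{Y}{W} - 15.59$$

Higher FRE means easier text (scores of 60 to 70 correspond roughly to the eighth/ninth grade level reported for the ASSH materials), whereas FKGL reads directly as a US school grade, so a value near 15 sits at the college level.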
Affiliation(s)
- Bryan S Crook: Department of Orthopaedic Surgery, Duke University Hospital, Durham, NC
- Caroline N Park: Department of Orthopaedic Surgery, Duke University Hospital, Durham, NC
- Eoghan T Hurley: Department of Orthopaedic Surgery, Duke University Hospital, Durham, NC
- Marc J Richard: Department of Orthopaedic Surgery, Duke University Hospital, Durham, NC
- Tyler S Pidgeon: Department of Orthopaedic Surgery, Duke University Hospital, Durham, NC
21
Makiev KG, Asimakidou M, Vasios IS, Keskinis A, Petkidis G, Tilkeridis K, Ververidis A, Iliopoulos E. A Study on Distinguishing ChatGPT-Generated and Human-Written Orthopaedic Abstracts by Reviewers: Decoding the Discrepancies. Cureus 2023; 15:e49166. [PMID: 38130535 PMCID: PMC10733892 DOI: 10.7759/cureus.49166] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 11/21/2023] [Indexed: 12/23/2023] Open
Abstract
BACKGROUND ChatGPT (OpenAI Incorporated, Mission District, San Francisco, United States) is an artificial intelligence (AI)-based language model that generates human-like text. Its output is comprehensible, contextually relevant, and difficult to distinguish from human-written content. ChatGPT has recently risen in popularity and is widely used in scholarly manuscript drafting. The aim of this study was to determine whether (1) human reviewers can differentiate between AI-generated and human-written abstracts and (2) AI detectors are currently reliable in detecting AI-generated abstracts. METHODS Seven blinded reviewers were asked to read 21 abstracts and judge which were AI-generated and which were human-written. The first group consisted of three orthopaedic residents with limited research experience (OR). The second group included three orthopaedic professors with extensive research experience (OP). The seventh reviewer was a non-orthopaedic doctor who acted as a control for expertise. All abstracts were scanned by a plagiarism detection program. The performance of two different AI detectors in identifying AI-generated abstracts was also analyzed. A structured interview was conducted at the end of the survey to evaluate the decision-making process used by each reviewer. RESULTS The OR group correctly identified the authorship of 34.9% of the abstracts and the OP group 31.7%; the non-orthopaedic control correctly identified 76.2%. All AI-generated abstracts were 100% unique (0% plagiarism). The first AI detector correctly identified the authors of only 9/21 (42.9%) abstracts, whereas the second identified 14/21 (66.6%). CONCLUSION The inability to correctly identify AI-generated content poses a significant scientific risk, as "false" abstracts can end up in scientific conferences or publications. Neither expertise nor research background was shown to have any meaningful impact on the predictive outcome. A focus on how statistical data are presented may help the differentiation process. Further research is warranted to highlight which elements could help reveal an AI-generated abstract.
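Each reviewer judged 21 abstracts, so the reported group percentages imply concrete counts (each three-person group made 63 judgments in total). The back-calculation below, with an added two-sided binomial test against 50% chance guessing, is an illustration only; the study does not report this particular analysis.

```python
# Back-calculating implied correct-identification counts from the
# reported group accuracies; the binomial test vs. chance is our
# illustrative addition, not the study's analysis.
from scipy.stats import binomtest

groups = {
    "OR residents (3 reviewers x 21)":  (0.349, 63),
    "OP professors (3 reviewers x 21)": (0.317, 63),
    "non-orthopaedic control (1 x 21)": (0.762, 21),
}
for name, (accuracy, n) in groups.items():
    k = round(accuracy * n)  # implied number of correct calls
    p = binomtest(k, n, 0.5).pvalue
    print(f"{name}: {k}/{n} correct, p = {p:.3f} vs. coin-flipping")
```

Note that accuracies below 50% on a binary judgment suggest the orthopaedic groups were systematically misled rather than merely guessing, which sharpens the abstract's point that expertise did not help.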
Affiliation(s)
- Konstantinos G Makiev: Department of Orthopaedics, University General Hospital of Alexandroupolis, Democritus University of Thrace, Alexandroupoli, GRC
- Maria Asimakidou: School of Medicine, University General Hospital of Alexandroupolis, Democritus University of Thrace, Alexandroupoli, GRC
- Ioannis S Vasios: Department of Orthopaedics, University General Hospital of Alexandroupolis, Democritus University of Thrace, Alexandroupoli, GRC
- Anthimos Keskinis: Department of Orthopaedics, University General Hospital of Alexandroupolis, Democritus University of Thrace, Alexandroupoli, GRC
- Georgios Petkidis: Department of Orthopaedics, University General Hospital of Alexandroupolis, Democritus University of Thrace, Alexandroupoli, GRC
- Konstantinos Tilkeridis: Department of Orthopaedics, University General Hospital of Alexandroupolis, Democritus University of Thrace, Alexandroupoli, GRC
- Athanasios Ververidis: Department of Orthopaedics, University General Hospital of Alexandroupolis, Democritus University of Thrace, Alexandroupoli, GRC
- Efthymios Iliopoulos: Department of Orthopaedics, University General Hospital of Alexandroupolis, Democritus University of Thrace, Alexandroupoli, GRC
22
Bernstein J. CORR Insights®: Can Artificial Intelligence Improve the Readability of Patient Education Materials? Clin Orthop Relat Res 2023; 481:2268-2270. [PMID: 37192346 PMCID: PMC10566765 DOI: 10.1097/corr.0000000000002702] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 04/11/2023] [Accepted: 04/25/2023] [Indexed: 05/18/2023]
Affiliation(s)
- Joseph Bernstein: Clinical Professor of Orthopaedic Surgery, University of Pennsylvania, Philadelphia, PA, USA
23
Alowais SA, Alghamdi SS, Alsuhebany N, Alqahtani T, Alshaya AI, Almohareb SN, Aldairem A, Alrashed M, Bin Saleh K, Badreldin HA, Al Yami MS, Al Harbi S, Albekairy AM. Revolutionizing healthcare: the role of artificial intelligence in clinical practice. BMC MEDICAL EDUCATION 2023; 23:689. [PMID: 37740191 PMCID: PMC10517477 DOI: 10.1186/s12909-023-04698-z] [Citation(s) in RCA: 65] [Impact Index Per Article: 65.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 06/23/2023] [Accepted: 09/19/2023] [Indexed: 09/24/2023]
Abstract
INTRODUCTION Healthcare systems are complex and challenging for all stakeholders. Artificial intelligence (AI) has transformed various fields, including healthcare, and has the potential to improve patient care and quality of life. Rapid advancements in AI could revolutionize healthcare through its integration into clinical practice. Reporting AI's role in clinical practice is crucial for successful implementation, because it equips healthcare providers with essential knowledge and tools. RESEARCH SIGNIFICANCE This review article provides a comprehensive and up-to-date overview of the current state of AI in clinical practice, including its potential applications in disease diagnosis, treatment recommendations, and patient engagement. It also discusses the associated challenges, covering ethical and legal considerations and the need for human expertise. In doing so, it enhances understanding of AI's significance in healthcare and supports healthcare organizations in adopting AI technologies effectively. MATERIALS AND METHODS The investigation analyzed the use of AI in the healthcare system through a comprehensive review of relevant indexed literature in PubMed/Medline, Scopus, and EMBASE, with no time constraints but limited to articles published in English. The focused question explored the impact of applying AI in healthcare settings and the potential outcomes of this application. RESULTS Integrating AI into healthcare holds great potential for improving disease diagnosis, treatment selection, and clinical laboratory testing. AI tools can leverage large datasets and identify patterns, surpassing human performance in several aspects of healthcare. AI offers increased accuracy, reduced costs, and time savings while minimizing human error. It can revolutionize personalized medicine, optimize medication dosing, enhance population health management, establish guidelines, provide virtual health assistants, support mental health care, improve patient education, and influence patient-physician trust. CONCLUSION AI can be used to diagnose diseases, develop personalized treatment plans, and assist clinicians with decision-making. Rather than simply automating tasks, AI is about developing technologies that can enhance patient care across healthcare settings. However, challenges related to data privacy, bias, and the need for human expertise must be addressed for the responsible and effective implementation of AI in healthcare.
Affiliation(s)
- Shuroug A Alowais: Department of Pharmacy Practice, College of Pharmacy, King Saud bin Abdulaziz University for Health Sciences, Prince Mutib Ibn Abdullah Ibn Abdulaziz Rd, Riyadh, 14611, Saudi Arabia; King Abdullah International Medical Research Center, Riyadh, Saudi Arabia; Pharmaceutical Care Department, King Abdulaziz Medical City, National Guard Health Affairs, Riyadh, Saudi Arabia
- Sahar S Alghamdi: King Abdullah International Medical Research Center, Riyadh, Saudi Arabia; Pharmaceutical Care Department, King Abdulaziz Medical City, National Guard Health Affairs, Riyadh, Saudi Arabia; Department of Pharmaceutical Sciences, College of Pharmacy, King Saud bin Abdulaziz University for Health Sciences, Riyadh, Saudi Arabia
- Nada Alsuhebany: Department of Pharmacy Practice, College of Pharmacy, King Saud bin Abdulaziz University for Health Sciences, Prince Mutib Ibn Abdullah Ibn Abdulaziz Rd, Riyadh, 14611, Saudi Arabia; King Abdullah International Medical Research Center, Riyadh, Saudi Arabia; Pharmaceutical Care Department, King Abdulaziz Medical City, National Guard Health Affairs, Riyadh, Saudi Arabia
- Tariq Alqahtani: King Abdullah International Medical Research Center, Riyadh, Saudi Arabia; Pharmaceutical Care Department, King Abdulaziz Medical City, National Guard Health Affairs, Riyadh, Saudi Arabia; Department of Pharmaceutical Sciences, College of Pharmacy, King Saud bin Abdulaziz University for Health Sciences, Riyadh, Saudi Arabia
- Abdulrahman I Alshaya: Department of Pharmacy Practice, College of Pharmacy, King Saud bin Abdulaziz University for Health Sciences, Prince Mutib Ibn Abdullah Ibn Abdulaziz Rd, Riyadh, 14611, Saudi Arabia; King Abdullah International Medical Research Center, Riyadh, Saudi Arabia; Pharmaceutical Care Department, King Abdulaziz Medical City, National Guard Health Affairs, Riyadh, Saudi Arabia
- Sumaya N Almohareb: Department of Pharmacy Practice, College of Pharmacy, King Saud bin Abdulaziz University for Health Sciences, Prince Mutib Ibn Abdullah Ibn Abdulaziz Rd, Riyadh, 14611, Saudi Arabia; King Abdullah International Medical Research Center, Riyadh, Saudi Arabia; Pharmaceutical Care Department, King Abdulaziz Medical City, National Guard Health Affairs, Riyadh, Saudi Arabia
- Atheer Aldairem: Department of Pharmacy Practice, College of Pharmacy, King Saud bin Abdulaziz University for Health Sciences, Prince Mutib Ibn Abdullah Ibn Abdulaziz Rd, Riyadh, 14611, Saudi Arabia; King Abdullah International Medical Research Center, Riyadh, Saudi Arabia; Pharmaceutical Care Department, King Abdulaziz Medical City, National Guard Health Affairs, Riyadh, Saudi Arabia
- Mohammed Alrashed: Department of Pharmacy Practice, College of Pharmacy, King Saud bin Abdulaziz University for Health Sciences, Prince Mutib Ibn Abdullah Ibn Abdulaziz Rd, Riyadh, 14611, Saudi Arabia; King Abdullah International Medical Research Center, Riyadh, Saudi Arabia; Pharmaceutical Care Department, King Abdulaziz Medical City, National Guard Health Affairs, Riyadh, Saudi Arabia
- Khalid Bin Saleh: Department of Pharmacy Practice, College of Pharmacy, King Saud bin Abdulaziz University for Health Sciences, Prince Mutib Ibn Abdullah Ibn Abdulaziz Rd, Riyadh, 14611, Saudi Arabia; King Abdullah International Medical Research Center, Riyadh, Saudi Arabia; Pharmaceutical Care Department, King Abdulaziz Medical City, National Guard Health Affairs, Riyadh, Saudi Arabia
- Hisham A Badreldin: Department of Pharmacy Practice, College of Pharmacy, King Saud bin Abdulaziz University for Health Sciences, Prince Mutib Ibn Abdullah Ibn Abdulaziz Rd, Riyadh, 14611, Saudi Arabia; King Abdullah International Medical Research Center, Riyadh, Saudi Arabia; Pharmaceutical Care Department, King Abdulaziz Medical City, National Guard Health Affairs, Riyadh, Saudi Arabia
- Majed S Al Yami: Department of Pharmacy Practice, College of Pharmacy, King Saud bin Abdulaziz University for Health Sciences, Prince Mutib Ibn Abdullah Ibn Abdulaziz Rd, Riyadh, 14611, Saudi Arabia; King Abdullah International Medical Research Center, Riyadh, Saudi Arabia; Pharmaceutical Care Department, King Abdulaziz Medical City, National Guard Health Affairs, Riyadh, Saudi Arabia
- Shmeylan Al Harbi: Department of Pharmacy Practice, College of Pharmacy, King Saud bin Abdulaziz University for Health Sciences, Prince Mutib Ibn Abdullah Ibn Abdulaziz Rd, Riyadh, 14611, Saudi Arabia; King Abdullah International Medical Research Center, Riyadh, Saudi Arabia; Pharmaceutical Care Department, King Abdulaziz Medical City, National Guard Health Affairs, Riyadh, Saudi Arabia
- Abdulkareem M Albekairy: Department of Pharmacy Practice, College of Pharmacy, King Saud bin Abdulaziz University for Health Sciences, Prince Mutib Ibn Abdullah Ibn Abdulaziz Rd, Riyadh, 14611, Saudi Arabia; King Abdullah International Medical Research Center, Riyadh, Saudi Arabia; Pharmaceutical Care Department, King Abdulaziz Medical City, National Guard Health Affairs, Riyadh, Saudi Arabia
24
Cite this article: Bone Joint Res 2023;12(8):494–496.
Affiliation(s)
- A. H. R. W. Simpson: Department of Orthopaedics and Trauma, University of Edinburgh Queen's Medical Research Institute, Edinburgh, UK
25
Lum ZC. Can Artificial Intelligence Pass the American Board of Orthopaedic Surgery Examination? Orthopaedic Residents Versus ChatGPT. Clin Orthop Relat Res 2023; 481:1623-1630. [PMID: 37220190 PMCID: PMC10344569 DOI: 10.1097/corr.0000000000002704] [Citation(s) in RCA: 16] [Impact Index Per Article: 16.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 01/31/2023] [Accepted: 04/28/2023] [Indexed: 05/25/2023]
Abstract
BACKGROUND Advances in neural networks, deep learning, and artificial intelligence (AI) have progressed rapidly in recent years. Previous deep learning AI has been domain-specific: trained on datasets covering narrow areas of interest, it yields high accuracy and precision. A new AI model built on large language models (LLMs) and not restricted to specific domains, ChatGPT (OpenAI), has gained attention. Although AI has demonstrated proficiency in managing vast amounts of data, implementation of that knowledge remains a challenge. QUESTIONS/PURPOSES (1) What percentage of Orthopaedic In-Training Examination questions can a generative, pretrained transformer chatbot (ChatGPT) answer correctly? (2) How does that percentage compare with results achieved by orthopaedic residents of different levels, and, given that scoring lower than the 10th percentile relative to 5th-year residents likely corresponds to a failing American Board of Orthopaedic Surgery score, is this LLM likely to pass the orthopaedic surgery written boards? (3) Does increasing question taxonomy affect the LLM's ability to select the correct answer choices? METHODS This study randomly selected 400 of 3840 publicly available questions based on the Orthopaedic In-Training Examination and compared the mean score with that of residents who took the test over a 5-year period. Questions with figures, diagrams, or charts were excluded, as were five questions the LLM could not answer, leaving 207 administered questions with raw scores recorded. The LLM's results were compared with the Orthopaedic In-Training Examination ranking of orthopaedic surgery residents. Based on the findings of an earlier study, a pass-fail cutoff was set at the 10th percentile. Questions were then categorized based on the Buckwalter taxonomy of recall, which deals with increasingly complex levels of interpretation and application of knowledge; the LLM's performance across taxonomic levels was compared using a chi-square test. RESULTS ChatGPT selected the correct answer 47% (97 of 207) of the time and answered incorrectly 53% (110 of 207) of the time. Based on prior Orthopaedic In-Training Examination testing, the LLM scored in the 40th percentile for postgraduate year (PGY) 1 residents, the eighth percentile for PGY2s, and the first percentile for PGY3s, PGY4s, and PGY5s; based on the latter finding (and using a predefined cutoff of the 10th percentile of PGY5s as the threshold for a passing score), it seems unlikely that the LLM would pass the written board examination. The LLM's performance decreased as question taxonomy level increased (it answered 54% [54 of 101] of Tax 1 questions correctly, 51% [18 of 35] of Tax 2 questions correctly, and 34% [24 of 71] of Tax 3 questions correctly; p = 0.034). CONCLUSION Although this general-domain LLM has a low likelihood of passing the orthopaedic surgery board examination, its testing performance and knowledge are comparable to those of a first-year orthopaedic surgery resident. The LLM's ability to provide accurate answers declines with increasing question taxonomy and complexity, indicating a deficiency in applying knowledge. CLINICAL RELEVANCE Current AI appears to perform better at knowledge- and interpretation-based inquiries, and, based on this study and other areas of opportunity, it may become an additional tool for orthopaedic learning and education.
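The taxonomy finding can be checked directly from the counts in the abstract: arrange correct and incorrect answers per taxonomy level into a 2 x 3 contingency table and run a chi-square test of independence. The sketch below approximately reproduces the reported p = 0.034; the small residual difference plausibly reflects rounding in the published counts.

```python
# Chi-square test of independence on ChatGPT's per-taxonomy results:
# Tax 1: 54/101, Tax 2: 18/35, Tax 3: 24/71 correct (from the abstract).
from scipy.stats import chi2_contingency

correct = [54, 18, 24]
incorrect = [101 - 54, 35 - 18, 71 - 24]
chi2, p, dof, _ = chi2_contingency([correct, incorrect])
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}")  # ~0.031 vs. 0.034 reported
```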