1. Glicksberg BS, Timsina P, Patel D, Sawant A, Vaid A, Raut G, Charney AW, Apakama D, Carr BG, Freeman R, Nadkarni GN, Klang E. Evaluating the accuracy of a state-of-the-art large language model for prediction of admissions from the emergency room. J Am Med Inform Assoc 2024;31:1921-1928. PMID: 38771093; PMCID: PMC11339523; DOI: 10.1093/jamia/ocae103.
Abstract
BACKGROUND Artificial intelligence (AI) and large language models (LLMs) could play a critical role in emergency room operations by augmenting decision-making about patient admission. However, no studies have evaluated LLMs on real-world data and scenarios in comparison to, and informed by, traditional supervised machine learning (ML) models. We evaluated the performance of GPT-4 for predicting patient admissions from emergency department (ED) visits, comparing it to traditional ML models both naively and when informed by few-shot examples and/or numerical probabilities.
METHODS We conducted a retrospective study using electronic health records from 7 NYC hospitals. We trained Bio-Clinical-BERT and XGBoost (XGB) models on unstructured and structured data, respectively, and created an ensemble model reflecting ML performance. We then assessed GPT-4's capabilities in several scenarios: zero-shot; few-shot with and without retrieval-augmented generation (RAG); and with and without underlying ML numerical probabilities.
RESULTS The ensemble ML model achieved an area under the receiver operating characteristic curve (AUC) of 0.88, an area under the precision-recall curve (AUPRC) of 0.72, and an accuracy of 82.9%. Naive GPT-4 (0.79 AUC, 0.48 AUPRC, and 77.5% accuracy) improved substantially when given limited, relevant examples to learn from (ie, RAG) and underlying ML probabilities (0.87 AUC, 0.71 AUPRC, and 83.1% accuracy). Notably, RAG alone boosted performance to near-peak levels (0.82 AUC, 0.56 AUPRC, and 81.3% accuracy).
CONCLUSIONS The naive LLM had limited performance on its own but improved significantly at predicting ED admissions when supplemented with real-world examples, particularly through RAG, and/or numerical probabilities from traditional ML models. Its peak performance, although slightly below that of the pure ML model, is noteworthy given its potential to provide reasoning behind predictions. Further refinement of LLMs with real-world data is necessary for successful integration as decision-support tools in care settings.
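The evaluation above hinges on AUC and AUPRC. A minimal, dependency-free sketch of both metrics on synthetic admission predictions (the labels and probabilities below are invented for illustration, not study data):

```python
def roc_auc(y_true, y_prob):
    """AUC: probability that a random positive case outranks a random negative one."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(y_true, y_prob):
    """AUPRC via average precision: mean precision at each positive's rank."""
    order = sorted(range(len(y_true)), key=lambda i: -y_prob[i])
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)

# Synthetic example: 1 = admitted, 0 = discharged.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.2, 0.7, 0.35, 0.4, 0.1, 0.8, 0.3]
auc = roc_auc(y_true, y_prob)              # 0.9375
auprc = average_precision(y_true, y_prob)  # 0.95
```

In practice a library such as scikit-learn would be used; the pairwise formulation above just makes the definition of AUC explicit.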
Affiliation(s)
- Benjamin S Glicksberg: Division of Data-Driven and Digital Medicine, Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
- Prem Timsina: Division of Data-Driven and Digital Medicine, Department of Medicine, and Institute for Healthcare Delivery Science, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
- Dhaval Patel: Institute for Healthcare Delivery Science, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
- Ashwin Sawant: Division of Data-Driven and Digital Medicine, Department of Medicine, and The Charles Bronfman Department of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
- Akhil Vaid: Division of Data-Driven and Digital Medicine, Department of Medicine, and The Charles Bronfman Department of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
- Ganesh Raut: Institute for Healthcare Delivery Science, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
- Alexander W Charney: The Charles Bronfman Department of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
- Donald Apakama: Department of Emergency Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
- Brendan G Carr: Department of Emergency Medicine, and Department of Population Health Science and Policy, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
- Robert Freeman: Division of Data-Driven and Digital Medicine, Department of Medicine; Institute for Healthcare Delivery Science; and The Charles Bronfman Department of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
- Girish N Nadkarni: Division of Data-Driven and Digital Medicine, Department of Medicine, and The Charles Bronfman Department of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
- Eyal Klang: Division of Data-Driven and Digital Medicine, Department of Medicine, and The Charles Bronfman Department of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
2. Alasker A, Alsalamah S, Alshathri N, Almansour N, Alsalamah F, Alghafees M, AlKhamees M, Alsaikhan B. Performance of large language models (LLMs) in providing prostate cancer information. BMC Urol 2024;24:177. PMID: 39180045; PMCID: PMC11342655; DOI: 10.1186/s12894-024-01570-0.
Abstract
PURPOSE The diagnosis and management of prostate cancer (PCa), the second most common cancer in men worldwide, are highly complex. Hence, patients often seek knowledge through additional resources, including AI chatbots such as ChatGPT and Google Bard. This study aimed to evaluate the performance of LLMs in providing education on PCa.
METHODS Common patient questions about PCa were collected from reliable educational websites and evaluated for accuracy, comprehensiveness, readability, and stability by two independent board-certified urologists, with a third resolving discrepancies. Accuracy was measured on a 3-point scale, comprehensiveness on a 5-point Likert scale, and readability using the Flesch Reading Ease (FRE) score and Flesch-Kincaid (FK) Grade Level.
RESULTS A total of 52 questions on general knowledge, diagnosis, treatment, and prevention of PCa were provided to three LLMs. Although there was no significant difference in overall accuracy, ChatGPT-3.5 outperformed the other LLMs on general knowledge of PCa (p = 0.018). ChatGPT-4 achieved greater overall comprehensiveness than ChatGPT-3.5 and Bard (p = 0.028). For readability, Bard generated the simplest sentences, with the highest FRE score (54.7, p < 0.001) and the lowest FK reading level (10.2, p < 0.001).
CONCLUSION ChatGPT-3.5, ChatGPT-4, and Bard generate accurate, comprehensive, and easily readable PCa material. These AI models may not replace healthcare professionals but can assist in patient education and guidance.
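Both readability measures above are closed-form formulas over word, sentence, and syllable counts. A minimal sketch of the two formulas (robust syllable counting is the hard part and is omitted here; the counts in the example are illustrative, not taken from the study):

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher scores (roughly 0-100) mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level: approximate US school grade required."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# Illustrative passage: 100 words, 5 sentences, 130 syllables.
fre = flesch_reading_ease(100, 5, 130)    # 76.555: "fairly easy" range
fkgl = flesch_kincaid_grade(100, 5, 130)  # 7.55: about a 7th-8th grade level
```

A score such as Bard's FRE of 54.7 therefore reflects longer sentences and/or more polysyllabic words than this example.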
Affiliation(s)
- Ahmed Alasker: Division of Urology, Department of Surgery, Ministry of National Guard - Health Affairs; King Abdullah International Medical Research Center (KAIMRC); and College of Medicine, King Saud bin Abdulaziz University for Health Sciences, Riyadh, Saudi Arabia
- Seham Alsalamah: King Abdullah International Medical Research Center (KAIMRC), and College of Medicine, King Saud bin Abdulaziz University for Health Sciences, Riyadh, Saudi Arabia
- Nada Alshathri: King Abdullah International Medical Research Center (KAIMRC), and College of Medicine, King Saud bin Abdulaziz University for Health Sciences, Riyadh, Saudi Arabia
- Nura Almansour: King Abdullah International Medical Research Center (KAIMRC), and College of Medicine, King Saud bin Abdulaziz University for Health Sciences, Riyadh, Saudi Arabia
- Faris Alsalamah: King Abdullah International Medical Research Center (KAIMRC), and College of Medicine, King Saud bin Abdulaziz University for Health Sciences, Riyadh, Saudi Arabia
- Mohammad Alghafees: Division of Urology, Department of Surgery, Ministry of National Guard - Health Affairs; King Abdullah International Medical Research Center (KAIMRC); and College of Medicine, King Saud bin Abdulaziz University for Health Sciences, Riyadh, Saudi Arabia
- Mohammad AlKhamees: Division of Urology, Department of Surgery, Ministry of National Guard - Health Affairs; King Abdullah International Medical Research Center (KAIMRC); College of Medicine, King Saud bin Abdulaziz University for Health Sciences, Riyadh; and Department of Surgical Specialities, College of Medicine, Majmaah University, Majmaah, Saudi Arabia
- Bader Alsaikhan: Division of Urology, Department of Surgery, Ministry of National Guard - Health Affairs; King Abdullah International Medical Research Center (KAIMRC); and College of Medicine, King Saud bin Abdulaziz University for Health Sciences, Riyadh, Saudi Arabia
3. Cao JJ, Kwon DH, Ghaziani TT, Kwo P, Tse G, Kesselman A, Kamaya A, Tse JR. Large language models' responses to liver cancer surveillance, diagnosis, and management questions: accuracy, reliability, readability. Abdom Radiol (NY) 2024. PMID: 39088019; DOI: 10.1007/s00261-024-04501-7.
Abstract
PURPOSE To assess the accuracy, reliability, and readability of publicly available large language models in answering fundamental questions on hepatocellular carcinoma diagnosis and management.
METHODS Twenty questions on liver cancer diagnosis and management were asked in triplicate to ChatGPT-3.5 (OpenAI), Gemini (Google), and Bing (Microsoft). Responses were assessed by six fellowship-trained physicians from three academic liver transplant centers who actively diagnose and/or treat liver cancer. Responses were categorized as accurate (score 1; all information is true and relevant), inadequate (score 0; all information is true but does not fully answer the question or provides irrelevant information), or inaccurate (score -1; any information is false). Means with standard deviations were recorded. An answer was considered accurate overall if its mean score was > 0, and reliable if the mean score was > 0 across all responses to the single question. Responses were also quantified for readability using the Flesch Reading Ease Score and Flesch-Kincaid Grade Level. Readability and accuracy across the 60 responses were compared using one-way ANOVAs with Tukey's multiple comparison tests.
RESULTS Of the twenty questions, ChatGPT answered nine (45%), Gemini 12 (60%), and Bing six (30%) accurately; however, only six (30%), eight (40%), and three (15%), respectively, were both accurate and reliable. There were no significant differences in accuracy between the chatbots. ChatGPT responses were the least readable (mean Flesch Reading Ease Score 29; college graduate), followed by Gemini (30; college) and Bing (40; college; p < 0.001).
CONCLUSION Large language models provide complex responses to basic questions on hepatocellular carcinoma diagnosis and management that are seldom accurate, reliable, or readable.
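The accurate/reliable determination above can be encoded in a few lines. A sketch under the interpretation that each of the three responses to a question carries one consensus score in {-1, 0, 1} (the rule names follow the abstract; the example scores are invented):

```python
def judge_question(scores):
    """scores: one consensus grade per response (1 accurate, 0 inadequate, -1 inaccurate).

    The answer as a whole counts as accurate if the mean score is > 0,
    and as reliable only if every individual response scores > 0.
    """
    mean = sum(scores) / len(scores)
    return {"mean": mean, "accurate": mean > 0, "reliable": all(s > 0 for s in scores)}

judge_question([1, 1, 0])   # accurate overall, but not reliable
judge_question([1, -1, 0])  # neither: the mean is 0, which is not > 0
```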
Affiliation(s)
- Jennie J Cao: Department of Radiology, Stanford University School of Medicine, 300 Pasteur Drive, Room H-1307, Stanford, CA, 94305, USA
- Daniel H Kwon: Department of Medicine, San Francisco School of Medicine, University of California, 505 Parnassus Ave, MC1286C, San Francisco, CA, 94144, USA
- Tara T Ghaziani: Department of Medicine, Stanford University School of Medicine, 430 Broadway St, MC 6341, Redwood City, CA, 94063, USA
- Paul Kwo: Department of Medicine, Stanford University School of Medicine, 430 Broadway St, MC 6341, Redwood City, CA, 94063, USA
- Gary Tse: Department of Radiological Sciences, David Geffen School of Medicine, University of California Los Angeles, 757 Westwood Plaza, Los Angeles, CA, 90095, USA
- Andrew Kesselman: Department of Radiology, Stanford University School of Medicine, 875 Blake Wilbur Drive, Palo Alto, Stanford, CA, 94304, USA
- Aya Kamaya: Department of Radiology, Stanford University School of Medicine, 300 Pasteur Drive, Room H-1307, Stanford, CA, 94305, USA
- Justin R Tse: Department of Radiology, Stanford University School of Medicine, 300 Pasteur Drive, Room H-1307, Stanford, CA, 94305, USA
4. Levin C, Kagan T, Rosen S, Saban M. An evaluation of the capabilities of language models and nurses in providing neonatal clinical decision support. Int J Nurs Stud 2024;155:104771. PMID: 38688103; DOI: 10.1016/j.ijnurstu.2024.104771.
Abstract
AIM To assess the clinical reasoning capabilities of two large language models, ChatGPT-4 and Claude-2.0, compared with those of neonatal nurses in neonatal care scenarios.
DESIGN A cross-sectional comparative evaluation using a survey instrument comprising six neonatal intensive care unit clinical scenarios.
PARTICIPANTS 32 neonatal intensive care nurses with 5-10 years of experience working in the neonatal intensive care units of three medical centers.
METHODS Participants responded to six written clinical scenarios. In parallel, ChatGPT-4 and Claude-2.0 were asked to provide initial assessments and treatment recommendations for the same scenarios. The responses from ChatGPT-4 and Claude-2.0 were then scored by certified neonatal nurse practitioners for accuracy, completeness, and response time.
RESULTS Both models demonstrated clinical reasoning capabilities in neonatal care, with Claude-2.0 significantly outperforming ChatGPT-4 in clinical accuracy and speed. However, limitations were identified across the cases in diagnostic precision, treatment specificity, and response lag.
CONCLUSIONS While promising, the current limitations reinforce the need for substantial refinement before ChatGPT-4 and Claude-2.0 can be considered for integration into clinical practice. Additional validation of these tools is important to safely leverage this artificial intelligence technology for enhancing clinical decision-making.
IMPACT The study provides an understanding of the reasoning accuracy of new artificial intelligence models in neonatal clinical care. The current accuracy gaps of ChatGPT-4 and Claude-2.0 need to be addressed prior to clinical use.
Affiliation(s)
- Chedva Levin: Faculty of School of Life and Health Sciences, Nursing Department, The Jerusalem College of Technology-Lev Academic Center, Jerusalem, Israel; and The Department of Vascular Surgery, The Chaim Sheba Medical Center, Tel Hashomer, Ramat Gan, Tel Aviv, Israel
- Shani Rosen: Department of Nursing, School of Health Professions, Faculty of Medical and Health Sciences, Tel Aviv University, Israel
- Mor Saban: Department of Nursing, School of Health Professions, Faculty of Medical and Health Sciences, Tel Aviv University, Israel
5. Varady NH, Lu AZ, Mazzucco M, Dines JS, Altchek DW, Williams RJ, Kunze KN. Understanding How ChatGPT May Become a Clinical Administrative Tool Through an Investigation on the Ability to Answer Common Patient Questions Concerning Ulnar Collateral Ligament Injuries. Orthop J Sports Med 2024;12:23259671241257516. PMID: 39139744; PMCID: PMC11320692; DOI: 10.1177/23259671241257516.
Abstract
Background The consumer availability and automated response functions of Chat Generative Pre-trained Transformer (ChatGPT-4), a large language model, position this application to be utilized for patient health queries and suggest a possible role as an adjunct to reduce administrative and clinical burden.
Purpose To evaluate the ability of ChatGPT-4 to respond to patient inquiries concerning ulnar collateral ligament (UCL) injuries and compare these results with the performance of Google.
Study Design Cross-sectional study.
Methods Google Web Search was used as a benchmark, as it is the most widely used search engine worldwide and the only one that generates frequently asked questions (FAQs) when prompted with a query, allowing comparisons through a systematic approach. The query "ulnar collateral ligament reconstruction" was entered into Google, and the top 10 FAQs, answers, and their sources were recorded. ChatGPT-4 was prompted to perform a Google search of FAQs with the same query and to record the sources of answers for comparison. This process was then replicated to obtain 10 new questions requiring numeric rather than open-ended responses. Finally, responses were graded independently for clinical accuracy (grade 0 = inaccurate, grade 1 = somewhat accurate, grade 2 = accurate) by 2 fellowship-trained sports medicine surgeons (D.W.A., J.S.D.) blinded to the search engine and answer source.
Results ChatGPT-4 used a greater proportion of academic sources than Google to answer the top 10 FAQs, although the difference was not statistically significant (90% vs 50%; P = .14). In terms of question overlap, 40% of the most common questions on Google and ChatGPT-4 were the same. For FAQs with numeric responses, 20% of answers overlapped completely, 30% overlapped partially, and the remaining 50% did not overlap at all. All sources used by ChatGPT-4 to answer these FAQs were academic, while only 20% of sources used by Google were academic (P = .0007). The remaining Google sources included social media (40%), medical practices (20%), single-surgeon websites (10%), and commercial websites (10%). The mean (± standard deviation) accuracy of ChatGPT-4 answers was significantly greater than Google's for both the top 10 FAQs (1.9 ± 0.2 vs 1.2 ± 0.6; P = .001) and the top 10 questions with numeric answers (1.8 ± 0.4 vs 1 ± 0.8; P = .013).
Conclusion ChatGPT-4 is capable of providing responses with clinically relevant content concerning UCL injuries and reconstruction. ChatGPT-4 utilized a greater proportion of academic websites than Google Web Search to answer FAQs representative of patient inquiries and provided significantly more accurate answers. Moving forward, ChatGPT has the potential to be used as a clinical adjunct for queries about UCL injuries and reconstruction, but further validation is warranted before integrated or autonomous use in clinical settings.
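The 90% vs 50% academic-source comparison above is a two-sided Fisher's exact test on a 2x2 table of ten FAQs per engine. A stdlib-only reconstruction, with counts inferred from the reported percentages (illustrative, not the authors' code):

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    n, row1, col1 = a + b + c + d, a + b, a + c
    denom = comb(n, col1)
    def p_table(k):  # hypergeometric probability of a table with top-left cell = k
        return comb(row1, k) * comb(n - row1, col1 - k) / denom
    p_obs = p_table(a)
    lo, hi = max(0, col1 - (n - row1)), min(row1, col1)
    # Sum probabilities of all tables at least as extreme as the observed one.
    return sum(p_table(k) for k in range(lo, hi + 1) if p_table(k) <= p_obs + 1e-12)

# 9/10 academic sources for ChatGPT-4 vs 5/10 for Google (from the 90% vs 50% figures).
p = fisher_exact_two_sided(9, 1, 5, 5)  # ~0.141, matching the reported P = .14
```

`scipy.stats.fisher_exact` would give the same two-sided p-value; the explicit hypergeometric sum shows where the P = .14 comes from.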
Affiliation(s)
- Nathan H. Varady: Department of Orthopaedic Surgery, Hospital for Special Surgery, New York, New York, USA
- Amy Z. Lu: Weill Cornell Medical College, New York, New York, USA
- Joshua S. Dines: Department of Orthopaedic Surgery, Hospital for Special Surgery, New York, New York, USA
- Riley J. Williams: Department of Orthopaedic Surgery, Hospital for Special Surgery, New York, New York, USA
- Kyle N. Kunze: Department of Orthopaedic Surgery, Hospital for Special Surgery, New York, New York, USA
6. Kunze KN, Varady NH, Mazzucco M, Lu AZ, Chahla J, Martin RK, Ranawat AS, Pearle AD, Williams RJ. The Large Language Model ChatGPT-4 Exhibits Excellent Triage Capabilities and Diagnostic Performance for Patients Presenting With Various Causes of Knee Pain. Arthroscopy 2024. PMID: 38925234; DOI: 10.1016/j.arthro.2024.06.021.
Abstract
PURPOSE To provide a proof-of-concept analysis of the appropriateness and performance of ChatGPT-4 in triaging, synthesizing differential diagnoses, and generating treatment plans for common presentations of knee pain.
METHODS Twenty knee complaints warranting triage, plus expanded scenarios, were input into ChatGPT-4, with memory cleared before each new input to mitigate bias. For the 10 triage complaints, ChatGPT-4 was asked to generate a differential diagnosis, which was graded for accuracy and suitability against a differential created by 2 orthopaedic sports medicine physicians. For the 10 clinical scenarios, ChatGPT-4 was prompted to provide treatment guidance for the patient, which was again graded. To test the higher-order capabilities of ChatGPT-4, further inquiry into these specific management recommendations was performed and graded.
RESULTS All ChatGPT-4 diagnoses were deemed appropriate within the spectrum of potential pathologies on a differential. The top diagnosis was identical between surgeons and ChatGPT-4 in 70% of scenarios, and the surgeon's top diagnosis appeared as either the first or second diagnosis in 90% of scenarios. Overall, 16 of 30 diagnoses (53.3%) in the differential were identical. When provided with 10 expanded vignettes with a single diagnosis, ChatGPT-4's accuracy increased to 100%, with the suitability of management graded as appropriate in 90% of cases. Specific information pertaining to conservative management, surgical approaches, and related treatments was appropriate and accurate in 100% of cases.
CONCLUSIONS ChatGPT-4 provided clinically reasonable diagnoses for triaging patient complaints of knee pain due to various underlying conditions, generally consistent with the differentials provided by sports medicine physicians. Diagnostic performance was enhanced by additional information, allowing ChatGPT-4 to reach high predictive accuracy for recommendations concerning management and treatment options. However, ChatGPT-4 may show clinically important error rates for diagnosis depending on the prompting strategy and information provided; further refinements are therefore necessary before implementation into clinical workflows.
CLINICAL RELEVANCE Although ChatGPT-4 is increasingly used by patients for health information, its potential to serve as a clinical support tool is unclear. In this study, ChatGPT-4 was frequently able to diagnose and triage knee complaints appropriately, as rated by sports medicine surgeons, suggesting that it may eventually be a useful clinical support tool.
Affiliation(s)
- Kyle N Kunze: Department of Orthopaedic Surgery, and Sports Medicine and Shoulder Service, Hospital for Special Surgery, New York, New York, U.S.A.
- Nathan H Varady: Department of Orthopaedic Surgery, and Sports Medicine and Shoulder Service, Hospital for Special Surgery, New York, New York, U.S.A.
- Amy Z Lu: Weill Cornell College of Medicine, New York, New York, U.S.A.
- Jorge Chahla: Department of Orthopaedic Surgery, Rush University Medical Center, Chicago, Illinois, U.S.A.
- R Kyle Martin: Department of Orthopedic Surgery, University of Minnesota, Minneapolis, Minnesota, U.S.A.
- Anil S Ranawat: Department of Orthopaedic Surgery, and Sports Medicine and Shoulder Service, Hospital for Special Surgery, New York, New York, U.S.A.
- Andrew D Pearle: Department of Orthopaedic Surgery, and Sports Medicine and Shoulder Service, Hospital for Special Surgery, New York, New York, U.S.A.
- Riley J Williams: Department of Orthopaedic Surgery, and Sports Medicine and Shoulder Service, Hospital for Special Surgery, New York, New York, U.S.A.
7. Lim B, Seth I, Cuomo R, Kenney PS, Ross RJ, Sofiadellis F, Pentangelo P, Ceccaroni A, Alfano C, Rozen WM. Can AI Answer My Questions? Utilizing Artificial Intelligence in the Perioperative Assessment for Abdominoplasty Patients. Aesthetic Plast Surg 2024. PMID: 38898239; DOI: 10.1007/s00266-024-04157-0.
Abstract
BACKGROUND Abdominoplasty is a common operation used for a range of cosmetic and functional issues, often in the context of divarication of recti, significant weight loss, and after pregnancy. Despite this, patient-surgeon communication gaps can hinder informed decision-making. The integration of large language models (LLMs) in healthcare offers potential for enhancing patient information. This study evaluated the feasibility of using LLMs to answer perioperative queries.
METHODS This study assessed the efficacy of four leading LLMs (OpenAI's ChatGPT-3.5, Anthropic's Claude, Google's Gemini, and Bing's CoPilot) using fifteen unique prompts. All outputs were evaluated for readability using the Flesch-Kincaid Grade Level, Flesch Reading Ease score, and Coleman-Liau index, and for quality using the DISCERN score and a Likert scale. Scores were assigned by two plastic surgery residents and then reviewed and discussed until a consensus was reached by five plastic surgeon specialists.
RESULTS ChatGPT-3.5 required the highest reading level for comprehension, followed by Gemini, Claude, then CoPilot. Claude provided the most appropriate and actionable advice. In terms of patient-friendliness, CoPilot outperformed the rest, enhancing engagement and information comprehensiveness. ChatGPT-3.5 and Gemini offered adequate, though unremarkable, advice in more professional language. CoPilot uniquely included visual aids and was the only model to use hyperlinks, although these were of limited helpfulness, and it faced limitations in responding to certain queries.
CONCLUSION ChatGPT-3.5, Gemini, Claude, and Bing's CoPilot showed differences in readability and reliability. LLMs offer unique advantages for patient care but require careful selection. Future research should integrate LLM strengths and address weaknesses for optimal patient education.
LEVEL OF EVIDENCE V This journal requires that authors assign a level of evidence to each article. For a full description of these Evidence-Based Medicine ratings, please refer to the Table of Contents or the online Instructions to Authors at www.springer.com/00266.
Affiliation(s)
- Bryan Lim: Department of Plastic Surgery, Peninsula Health, Melbourne, Victoria, 3199, Australia
- Ishith Seth: Department of Plastic Surgery, Peninsula Health, Melbourne, Victoria, 3199, Australia
- Roberto Cuomo: Plastic Surgery Unit, Department of Medicine, Surgery and Neuroscience, University of Siena, Siena, Italy
- Peter Sinkjær Kenney: Department of Plastic Surgery, Vejle Hospital, Beriderbakken 4, 7100, Vejle, Denmark; and Department of Plastic and Breast Surgery, Aarhus University Hospital, Aarhus, Denmark
- Richard J Ross: Department of Plastic Surgery, Peninsula Health, Melbourne, Victoria, 3199, Australia
- Foti Sofiadellis: Department of Plastic Surgery, Peninsula Health, Melbourne, Victoria, 3199, Australia
- Warren Matthew Rozen: Department of Plastic Surgery, Peninsula Health, Melbourne, Victoria, 3199, Australia
8. Amacher SA, Arpagaus A, Sahmer C, Becker C, Gross S, Urben T, Tisljar K, Sutter R, Marsch S, Hunziker S. Prediction of outcomes after cardiac arrest by a generative artificial intelligence model. Resusc Plus 2024;18:100587. PMID: 38433764; PMCID: PMC10906512; DOI: 10.1016/j.resplu.2024.100587.
Abstract
Aims To investigate the prognostic accuracy of a non-medical generative artificial intelligence model (Chat Generative Pre-Trained Transformer 4, ChatGPT-4) in predicting death and poor neurological outcome at hospital discharge, based on real-life data from cardiac arrest patients.
Methods This prospective cohort study investigates the prognostic performance of ChatGPT-4 in predicting outcomes at hospital discharge of adult cardiac arrest patients admitted to intensive care at a large Swiss tertiary academic medical center (COMMUNICATE/PROPHETIC cohort study). We prompted ChatGPT-4 with sixteen prognostic parameters derived from established post-cardiac arrest scores for each patient. We compared ChatGPT-4 with three cardiac arrest scores (Out-of-Hospital Cardiac Arrest [OHCA], Cardiac Arrest Hospital Prognosis [CAHP], and PROgnostication using LOGistic regression model for Unselected adult cardiac arrest patients in the Early stages [PROLOGUE]) in terms of area under the curve (AUC), sensitivity, specificity, positive and negative predictive values, and likelihood ratios for in-hospital mortality and poor neurological outcome.
Results Mortality at hospital discharge was 43% (n = 309/713), and 54% of patients (n = 387/713) had a poor neurological outcome. ChatGPT-4 showed good discrimination for in-hospital mortality, with an AUC of 0.85, similar to the OHCA, CAHP, and PROLOGUE scores (AUCs of 0.82, 0.83, and 0.84, respectively). For poor neurological outcome, ChatGPT-4's prediction was likewise similar to the post-cardiac arrest scores (AUC 0.83).
Conclusions ChatGPT-4 performed similarly to validated post-cardiac arrest scores in predicting mortality and poor neurological outcome. However, more research is needed, particularly on occasional illogical answers, before an LLM can be incorporated into multimodal outcome prognostication after cardiac arrest.
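The score comparison above reports sensitivity, specificity, predictive values, and likelihood ratios alongside AUC; all of these derive from a single confusion matrix. A sketch with toy counts (not study data):

```python
def binary_test_metrics(tp, fp, tn, fn):
    """Standard diagnostic-test metrics from confusion-matrix counts."""
    sens = tp / (tp + fn)  # sensitivity (true positive rate)
    spec = tn / (tn + fp)  # specificity (true negative rate)
    return {
        "sensitivity": sens,
        "specificity": spec,
        "ppv": tp / (tp + fp),     # positive predictive value
        "npv": tn / (tn + fn),     # negative predictive value
        "lr+": sens / (1 - spec),  # positive likelihood ratio
        "lr-": (1 - sens) / spec,  # negative likelihood ratio
    }

m = binary_test_metrics(tp=40, fp=5, tn=45, fn=10)
# sensitivity 0.80, specificity 0.90, LR+ 8.0, LR- ~0.22
```

Note that likelihood ratios, unlike predictive values, do not depend on outcome prevalence, which is why prognostic-score papers often report both.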
Affiliation(s)
- Simon A. Amacher: Intensive Care Medicine and Emergency Medicine, Department of Acute Medical Care, and Medical Communication and Psychosomatic Medicine, University Hospital Basel, Basel, Switzerland
- Armon Arpagaus: Medical Communication and Psychosomatic Medicine, University Hospital Basel, Basel, Switzerland
- Christian Sahmer: Medical Communication and Psychosomatic Medicine, University Hospital Basel, Basel, Switzerland
- Christoph Becker: Medical Communication and Psychosomatic Medicine, and Emergency Medicine, Department of Acute Medical Care, University Hospital Basel, Basel, Switzerland
- Sebastian Gross: Medical Communication and Psychosomatic Medicine, University Hospital Basel, Basel, Switzerland
- Tabita Urben: Medical Communication and Psychosomatic Medicine, University Hospital Basel, Basel, Switzerland
- Kai Tisljar: Intensive Care Medicine, Department of Acute Medical Care, University Hospital Basel, Basel, Switzerland
- Raoul Sutter: Intensive Care Medicine, Department of Acute Medical Care, and Division of Neurophysiology, Department of Neurology, University Hospital Basel; Medical Faculty, University of Basel, Basel, Switzerland
- Stephan Marsch: Intensive Care Medicine, Department of Acute Medical Care, University Hospital Basel; Medical Faculty, University of Basel, Basel, Switzerland
- Sabina Hunziker: Medical Communication and Psychosomatic Medicine, and Post-Intensive Care Clinic, University Hospital Basel; Medical Faculty, University of Basel, Basel, Switzerland
Collapse
9
Pressman SM, Borna S, Gomez-Cabello CA, Haider SA, Haider CR, Forte AJ. Clinical and Surgical Applications of Large Language Models: A Systematic Review. J Clin Med 2024; 13:3041. [PMID: 38892752] [PMCID: PMC11172607] [DOI: 10.3390/jcm13113041]
Abstract
Background: Large language models (LLMs) represent a recent advancement in artificial intelligence with medical applications across various healthcare domains. The objective of this review is to highlight how LLMs can be utilized by clinicians and surgeons in their everyday practice. Methods: A systematic review was conducted following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines. Six databases were searched to identify relevant articles. Eligibility criteria emphasized articles focused primarily on clinical and surgical applications of LLMs. Results: The literature search yielded 333 results, with 34 meeting eligibility criteria. All articles were from 2023. There were 14 original research articles, four letters, one interview, and 15 review articles. These articles covered a wide variety of medical specialties, including various surgical subspecialties. Conclusions: LLMs have the potential to enhance healthcare delivery. In clinical settings, LLMs can assist in diagnosis, treatment guidance, patient triage, physician knowledge augmentation, and administrative tasks. In surgical settings, LLMs can assist surgeons with documentation, surgical planning, and intraoperative guidance. However, addressing their limitations and concerns, particularly those related to accuracy and biases, is crucial. LLMs should be viewed as tools to complement, not replace, the expertise of healthcare professionals.
Affiliation(s)
- Sahar Borna: Division of Plastic Surgery, Mayo Clinic, Jacksonville, FL 32224, USA
- Syed Ali Haider: Division of Plastic Surgery, Mayo Clinic, Jacksonville, FL 32224, USA
- Clifton R. Haider: Department of Physiology and Biomedical Engineering, Mayo Clinic, Rochester, MN 55905, USA
- Antonio Jorge Forte: Division of Plastic Surgery, Mayo Clinic, Jacksonville, FL 32224, USA; Center for Digital Health, Mayo Clinic, Rochester, MN 55905, USA
10
Preiksaitis C, Ashenburg N, Bunney G, Chu A, Kabeer R, Riley F, Ribeira R, Rose C. The Role of Large Language Models in Transforming Emergency Medicine: Scoping Review. JMIR Med Inform 2024; 12:e53787. [PMID: 38728687] [PMCID: PMC11127144] [DOI: 10.2196/53787]
Abstract
BACKGROUND Artificial intelligence (AI), more specifically large language models (LLMs), holds significant potential in revolutionizing emergency care delivery by optimizing clinical workflows and enhancing the quality of decision-making. Although enthusiasm for integrating LLMs into emergency medicine (EM) is growing, the existing literature is characterized by a disparate collection of individual studies, conceptual analyses, and preliminary implementations. Given these complexities and gaps in understanding, a cohesive framework is needed to comprehend the existing body of knowledge on the application of LLMs in EM. OBJECTIVE Given the absence of a comprehensive framework for exploring the roles of LLMs in EM, this scoping review aims to systematically map the existing literature on LLMs' potential applications within EM and identify directions for future research. Addressing this gap will allow for informed advancements in the field. METHODS Using PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews) criteria, we searched Ovid MEDLINE, Embase, Web of Science, and Google Scholar for papers published between January 2018 and August 2023 that discussed LLMs' use in EM. We excluded other forms of AI. A total of 1994 unique titles and abstracts were screened, and each full-text paper was independently reviewed by 2 authors. Data were abstracted independently, and 5 authors performed a collaborative quantitative and qualitative synthesis of the data. RESULTS A total of 43 papers were included. Studies were predominantly from 2022 to 2023 and conducted in the United States and China. 
We uncovered four major themes: (1) clinical decision-making and support was highlighted as a pivotal area, with LLMs playing a substantial role in enhancing patient care, notably through their application in real-time triage, allowing early recognition of patient urgency; (2) efficiency, workflow, and information management demonstrated the capacity of LLMs to significantly boost operational efficiency, particularly through the automation of patient record synthesis, which could reduce administrative burden and enhance patient-centric care; (3) risks, ethics, and transparency were identified as areas of concern, especially regarding the reliability of LLMs' outputs, and specific studies highlighted the challenges of ensuring unbiased decision-making amidst potentially flawed training data sets, stressing the importance of thorough validation and ethical oversight; and (4) education and communication possibilities included LLMs' capacity to enrich medical training, such as through using simulated patient interactions that enhance communication skills. CONCLUSIONS LLMs have the potential to fundamentally transform EM, enhancing clinical decision-making, optimizing workflows, and improving patient outcomes. This review sets the stage for future advancements by identifying key research areas: prospective validation of LLM applications, establishing standards for responsible use, understanding provider and patient perceptions, and improving physicians' AI literacy. Effective integration of LLMs into EM will require collaborative efforts and thorough evaluation to ensure these technologies can be safely and effectively applied.
Affiliation(s)
- Carl Preiksaitis, Nicholas Ashenburg, Gabrielle Bunney, Andrew Chu, Rana Kabeer, Fran Riley, Ryan Ribeira, Christian Rose: Department of Emergency Medicine, Stanford University School of Medicine, Palo Alto, CA, United States
11
Scott IA, Zuccon G. The new paradigm in machine learning - foundation models, large language models and beyond: a primer for physicians. Intern Med J 2024; 54:705-715. [PMID: 38715436] [DOI: 10.1111/imj.16393]
Abstract
Foundation machine learning models are deep learning models capable of performing many different tasks using different data modalities such as text, audio, images and video. They represent a major shift from traditional task-specific machine learning prediction models. Large language models (LLMs), brought to wide public prominence in the form of ChatGPT, are text-based foundation models that have the potential to transform medicine by enabling automation of a range of tasks, including writing discharge summaries, answering patients' questions and assisting in clinical decision-making. However, such models are not without risk and can potentially cause harm if their development, evaluation and use are not subjected to proper scrutiny. This narrative review describes the different types of LLMs, their emerging applications, potential limitations and biases, and their likely future translation into clinical practice.
Affiliation(s)
- Ian A Scott: Centre for Health Services Research, University of Queensland, Woolloongabba, Australia
- Guido Zuccon: School of Electrical Engineering and Computer Sciences, The University of Queensland, St Lucia, Queensland, Australia
12
Frosolini A, Catarzi L, Benedetti S, Latini L, Chisci G, Franz L, Gennaro P, Gabriele G. The Role of Large Language Models (LLMs) in Providing Triage for Maxillofacial Trauma Cases: A Preliminary Study. Diagnostics (Basel) 2024; 14:839. [PMID: 38667484] [PMCID: PMC11048758] [DOI: 10.3390/diagnostics14080839]
Abstract
BACKGROUND In the evolving field of maxillofacial surgery, integrating advanced technologies like Large Language Models (LLMs) into medical practices, especially for trauma triage, presents a promising yet largely unexplored potential. This study aimed to evaluate the feasibility of using LLMs for triaging complex maxillofacial trauma cases by comparing their performance against the expertise of a tertiary referral center. METHODS Utilizing a comprehensive review of patient records in a tertiary referral center over a year-long period, standardized prompts detailing patient demographics, injury characteristics, and medical histories were created. These prompts were used to assess the triage suggestions of ChatGPT 4.0 and Google GEMINI against the center's recommendations, supplemented by evaluating the AI's performance using the QAMAI and AIPI questionnaires. RESULTS The results in 10 cases of major maxillofacial trauma indicated moderate agreement rates between LLM recommendations and the referral center, with some variances in the suggestion of appropriate examinations (70% ChatGPT and 50% GEMINI) and treatment plans (60% ChatGPT and 45% GEMINI). Notably, the study found no statistically significant differences in several areas of the questionnaires, except in the diagnosis accuracy (GEMINI: 3.30, ChatGPT: 2.30; p = 0.032) and relevance of the recommendations (GEMINI: 2.90, ChatGPT: 3.50; p = 0.021). A Spearman correlation analysis highlighted significant correlations within the two questionnaires, specifically between the QAMAI total score and AIPI treatment scores (rho = 0.767, p = 0.010). CONCLUSIONS This exploratory investigation underscores the potential of LLMs in enhancing clinical decision making for maxillofacial trauma cases, indicating a need for further research to refine their application in healthcare settings.
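The Spearman correlation reported in this abstract (e.g., rho = 0.767 between QAMAI and AIPI treatment scores) can be sketched for the tie-free case. The paired questionnaire scores below are invented for illustration, not the study's data.

```python
# Minimal sketch of a Spearman rank correlation like the one reported
# above. Assumes no tied values (ties would require averaged ranks);
# the paired questionnaire totals below are invented.

def spearman_rho(x, y):
    """Spearman's rho via the classic formula 1 - 6*sum(d^2)/(n*(n^2-1)),
    valid when neither sequence contains ties."""
    n = len(x)

    def ranks(values):
        order = sorted(range(n), key=lambda i: values[i])
        r = [0] * n
        for rank, i in enumerate(order, start=1):
            r[i] = rank  # rank 1 = smallest value
        return r

    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Invented paired total scores for five hypothetical trauma cases.
qamai = [12, 17, 9, 21, 14]
aipi = [10, 16, 11, 20, 13]
print(spearman_rho(qamai, aipi))
```

Because the coefficient depends only on ranks, it captures monotone agreement between the two questionnaires even when their scales differ.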
Affiliation(s)
- Andrea Frosolini, Lisa Catarzi, Simone Benedetti, Linda Latini, Glauco Chisci, Paolo Gennaro, Guido Gabriele: Maxillofacial Surgery Unit, Department of Medical Biotechnologies, University of Siena, 53100 Siena, Italy
- Leonardo Franz: Phoniatrics and Audiology Unit, Department of Neuroscience DNS, University of Padova, 35122 Treviso, Italy; Artificial Intelligence in Medicine and Innovation in Clinical Research and Methodology (PhD Program), Department of Clinical and Experimental Sciences, University of Brescia, 25121 Brescia, Italy
13
Paslı S, Şahin AS, Beşer MF, Topçuoğlu H, Yadigaroğlu M, İmamoğlu M. Assessing the precision of artificial intelligence in ED triage decisions: Insights from a study with ChatGPT. Am J Emerg Med 2024; 78:170-175. [PMID: 38295466] [DOI: 10.1016/j.ajem.2024.01.037]
Abstract
BACKGROUND The rise in emergency department presentations globally poses challenges for efficient patient management. To address this, various strategies aim to expedite patient management. Artificial intelligence's (AI) consistent performance and rapid data interpretation extend its healthcare applications, especially in emergencies. The introduction of a robust AI tool like ChatGPT, based on GPT-4 developed by OpenAI, can benefit patients and healthcare professionals by improving the speed and accuracy of resource allocation. This study examines ChatGPT's capability to predict triage outcomes based on local emergency department rules. METHODS This study is a single-center prospective observational study. The study population consists of all patients who presented to the emergency department with any symptoms and agreed to participate. The study was conducted on three non-consecutive days for a total of 72 h. Patients' chief complaints, vital parameters, medical history and the area to which they were directed by the triage team in the emergency department were recorded. Concurrently, an emergency medicine physician inputted the same data into previously trained GPT-4, according to local rules. According to this data, the triage decisions made by GPT-4 were recorded. In the same process, an emergency medicine specialist determined where the patient should be directed based on the data collected, and this decision was considered the gold standard. Accuracy rates and reliability for directing patients to specific areas by the triage team and GPT-4 were evaluated using Cohen's kappa test. Furthermore, the accuracy of the patient triage process performed by the triage team and GPT-4 was assessed by receiver operating characteristic (ROC) analysis. Statistical analysis considered a value of p < 0.05 as significant. RESULTS The study was carried out on 758 patients. Among the participants, 416 (54.9%) were male and 342 (45.1%) were female. 
Evaluating the primary endpoints of our study - the agreement between the decisions of the triage team, GPT-4 decisions in emergency department triage, and the gold standard - we observed almost perfect agreement both between the triage team and the gold standard and between GPT-4 and the gold standard (Cohen's kappa 0.893 and 0.899, respectively; p < 0.001 for each). CONCLUSION Our findings suggest GPT-4 possesses outstanding predictive skills in triaging patients in an emergency setting and can serve as an effective tool to support the triage process.
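The agreement statistic this study reports, Cohen's kappa, compares observed agreement against the agreement expected by chance from each rater's marginal frequencies. A minimal sketch follows; the triage labels below are invented toy data, not the study's 758-patient sample.

```python
# Minimal sketch of Cohen's kappa, the agreement statistic used above to
# compare triage decisions against the gold standard. Toy data invented.

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length label sequences."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    labels = sorted(set(rater_a) | set(rater_b))
    # Observed agreement: fraction of identical decisions.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement under independence, from marginal label frequencies.
    p_e = sum(
        (rater_a.count(lab) / n) * (rater_b.count(lab) / n) for lab in labels
    )
    return (p_o - p_e) / (1 - p_e)

# Invented toy example: triage areas from the gold standard vs. a model.
gold = ["red", "yellow", "green", "green", "yellow", "red", "green", "green"]
model = ["red", "yellow", "green", "yellow", "yellow", "red", "green", "green"]
print(cohens_kappa(gold, model))
```

Values near 0.9, as reported in the study, fall in the "almost perfect" band of the conventional Landis-Koch interpretation of kappa.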
Affiliation(s)
- Sinan Paslı: Karadeniz Technical University, Faculty of Medicine, Department of Emergency Medicine, Trabzon, Turkey
- Hazal Topçuoğlu: Siirt Education & Research Hospital, Department of Emergency Medicine, Siirt, Turkey
- Metin Yadigaroğlu: Samsun University, Faculty of Medicine, Department of Emergency Medicine, Samsun, Turkey
- Melih İmamoğlu: Karadeniz Technical University, Faculty of Medicine, Department of Emergency Medicine, Trabzon, Turkey
14
Mu Y, He D. The Potential Applications and Challenges of ChatGPT in the Medical Field. Int J Gen Med 2024; 17:817-826. [PMID: 38476626] [PMCID: PMC10929156] [DOI: 10.2147/ijgm.s456659]
Abstract
ChatGPT, an AI-driven conversational large language model (LLM), has garnered significant scholarly attention since its inception, owing to its manifold applications in the realm of medical science. This study primarily examines the merits, limitations, anticipated developments, and practical applications of ChatGPT in clinical practice, healthcare, medical education, and medical research. It underscores the necessity for further research and development to enhance its performance and deployment. Moreover, future research avenues encompass ongoing enhancements and standardization of ChatGPT, mitigating its limitations, and exploring its integration and applicability in translational and personalized medicine. Reflecting the narrative nature of this review, a focused literature search was performed to identify relevant publications on ChatGPT's use in medicine. This process was aimed at gathering a broad spectrum of insights to provide a comprehensive overview of the current state and future prospects of ChatGPT in the medical domain. The objective is to aid healthcare professionals in understanding the groundbreaking advancements associated with the latest artificial intelligence tools, while also acknowledging the opportunities and challenges presented by ChatGPT.
Affiliation(s)
- Yonglin Mu: Department of Urology, Children’s Hospital of Chongqing Medical University, Chongqing, People’s Republic of China
- Dawei He: Department of Urology, Children’s Hospital of Chongqing Medical University, Chongqing, People’s Republic of China
15
Wang Z, Zhang Z, Traverso A, Dekker A, Qian L, Sun P. Assessing the role of GPT-4 in thyroid ultrasound diagnosis and treatment recommendations: enhancing interpretability with a chain of thought approach. Quant Imaging Med Surg 2024; 14:1602-1615. [PMID: 38415150] [PMCID: PMC10895085] [DOI: 10.21037/qims-23-1180]
Abstract
Background As artificial intelligence (AI) becomes increasingly prevalent in the medical field, the effectiveness of AI-generated medical reports in disease diagnosis remains to be evaluated. ChatGPT is a large language model developed by OpenAI with a notable capacity for text abstraction and comprehension. This study aimed to explore the capabilities, limitations, and potential of Generative Pre-trained Transformer (GPT)-4 in analyzing thyroid cancer ultrasound reports, providing diagnoses, and recommending treatment plans. Methods Using 109 diverse thyroid cancer cases, we evaluated GPT-4's performance by comparing its generated reports to those from doctors with various levels of experience. We also conducted a Turing Test and a consistency analysis. To enhance the interpretability of the model, we applied the Chain of Thought (CoT) method to deconstruct the decision-making chain of the GPT model. Results GPT-4 demonstrated proficiency in report structuring, professional terminology, and clarity of expression, but showed limitations in diagnostic accuracy. In addition, our consistency analysis highlighted certain discrepancies in the AI's performance. The CoT method effectively enhanced the interpretability of the AI's decision-making process. Conclusions GPT-4 exhibits potential as a supplementary tool in healthcare, especially for generating thyroid gland diagnostic reports. Our proposed online platform, "ThyroAIGuide", alongside the CoT method, underscores the potential of AI to augment diagnostic processes, elevate healthcare accessibility, and advance patient education. However, the journey towards fully integrating AI into healthcare is ongoing, requiring continuous research, development, and careful monitoring by medical professionals to ensure patient safety and quality of care.
Affiliation(s)
- Zhixiang Wang: Department of Ultrasound, Beijing Friendship Hospital, Capital Medical University, Beijing, China; Department of Radiation Oncology (Maastro), GROW-School for Oncology, Maastricht University Medical Centre+, Maastricht, The Netherlands
- Zhen Zhang: Department of Radiation Oncology (Maastro), GROW-School for Oncology, Maastricht University Medical Centre+, Maastricht, The Netherlands
- Alberto Traverso: Department of Radiation Oncology (Maastro), GROW-School for Oncology, Maastricht University Medical Centre+, Maastricht, The Netherlands
- Andre Dekker: Department of Radiation Oncology (Maastro), GROW-School for Oncology, Maastricht University Medical Centre+, Maastricht, The Netherlands
- Linxue Qian: Department of Ultrasound, Beijing Friendship Hospital, Capital Medical University, Beijing, China
- Pengfei Sun: Department of Ultrasound, Beijing Friendship Hospital, Capital Medical University, Beijing, China
16
Ray PP. ChatGPT's competence in addressing urolithiasis: myth or reality? Int Urol Nephrol 2024; 56:149-150. [PMID: 37726510] [DOI: 10.1007/s11255-023-03802-y]
17
dos Santos ML, Victória VNG. Critical evaluation of applications of artificial intelligence based linguistic models in Occupational Health. Rev Bras Med Trab 2024; 22:e20231241. [PMID: 39165532] [PMCID: PMC11333049] [DOI: 10.47626/1679-4435-2023-1241]
Abstract
This article explores the impact and potential applications of large language models in Occupational Medicine. Large language models have the ability to provide support for medical decision-making, patient screening, summarization and creation of technical, scientific, and legal documents, training and education for doctors and occupational health teams, as well as patient education, potentially leading to lower costs, reduced time expenditure, and a lower incidence of human errors. Despite promising results and a wide range of applications, large language models also have significant limitations in terms of their accuracy, the risk of generating false information, and incorrect recommendations. Various ethical aspects that have not been well elucidated by the medical and academic communities should also be considered, and the lack of regulation by government entities can create areas of legal uncertainty regarding their use in Occupational Medicine and in the legal environment. Significant future improvements can be expected in these models in the coming years, and further studies on the applications of large language models in Occupational Medicine should be encouraged.
Affiliation(s)
- Mateus Lins dos Santos: 6ª Vara, Justiça Federal em Sergipe, Itabaiana, SE, Brazil; 9ª Vara, Justiça Federal em Sergipe, Propriá, SE, Brazil
18
Alotaibi SS, Rehman A, Hasnain M. Revolutionizing ocular cancer management: a narrative review on exploring the potential role of ChatGPT. Front Public Health 2023; 11:1338215. [PMID: 38192545] [PMCID: PMC10773849] [DOI: 10.3389/fpubh.2023.1338215]
Abstract
This paper pioneers the exploration of ocular cancer and its management with the help of Artificial Intelligence (AI) technology. Existing literature reports a significant increase in new eye cancer cases in 2023, with a higher incidence rate. Extensive research was conducted using online databases such as PubMed, ACM Digital Library, ScienceDirect, and Springer. The review followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. Of the 62 studies collected, only 20 documents met the inclusion criteria. The review identifies seven ocular cancer types and highlights important challenges associated with ocular cancer, including limited awareness about eye cancer, restricted healthcare access, financial barriers, and insufficient infrastructure support. Financial barriers are among the most widely examined ocular cancer challenges in the literature. The potential role and limitations of ChatGPT are discussed, emphasizing its usefulness in providing general information to physicians while noting its inability to deliver up-to-date information. The paper concludes by presenting potential future applications of ChatGPT to advance research on ocular cancer globally.
Affiliation(s)
- Saud S. Alotaibi: Information Systems Department, Umm Al-Qura University, Makkah, Saudi Arabia
- Amna Rehman: Department of Computer Science, Lahore Leads University, Lahore, Pakistan
- Muhammad Hasnain: Department of Computer Science, Lahore Leads University, Lahore, Pakistan
19
Wang X, Liu XQ. Potential and limitations of ChatGPT and generative artificial intelligence in medical safety education. World J Clin Cases 2023; 11:7935-7939. [DOI: 10.12998/wjcc.v11.i32.7935]
Abstract
The primary objectives of medical safety education are to provide the public with essential knowledge about medications and to foster a scientific approach to drug usage. The era of using artificial intelligence to revolutionize medical safety education has already dawned, and ChatGPT and other generative artificial intelligence models have immense potential in this domain. Notably, they offer a wealth of knowledge, anonymity, continuous availability, and personalized services. However, the practical implementation of generative artificial intelligence models such as ChatGPT in medical safety education still faces several challenges, including concerns about the accuracy of information, legal responsibilities, and ethical obligations. Moving forward, it is crucial to intelligently upgrade ChatGPT by leveraging the strengths of existing medical practices. This task involves further integrating the model with real-life scenarios and proactively addressing ethical and security issues with the ultimate goal of providing the public with comprehensive, convenient, efficient, and personalized medical services.
Affiliation(s)
- Xin Wang: School of Education, Tianjin University, Tianjin 300350, China
- Xin-Qiao Liu: School of Education, Tianjin University, Tianjin 300350, China