1. Glicksberg BS, Timsina P, Patel D, Sawant A, Vaid A, Raut G, Charney AW, Apakama D, Carr BG, Freeman R, Nadkarni GN, Klang E. Evaluating the accuracy of a state-of-the-art large language model for prediction of admissions from the emergency room. J Am Med Inform Assoc 2024;31:1921-1928. PMID: 38771093; PMCID: PMC11339523; DOI: 10.1093/jamia/ocae103.
Abstract
BACKGROUND Artificial intelligence (AI) and large language models (LLMs) could play a critical role in emergency room operations by augmenting decision-making about patient admission. However, no studies have evaluated LLMs on real-world data and scenarios in comparison to, and informed by, traditional supervised machine learning (ML) models. We evaluated the performance of GPT-4 for predicting patient admissions from emergency department (ED) visits, comparing it to traditional ML models both naively and when informed by few-shot examples and/or numerical probabilities.
METHODS We conducted a retrospective study using electronic health records from 7 NYC hospitals. We trained Bio-Clinical-BERT and XGBoost (XGB) models on unstructured and structured data, respectively, and created an ensemble model reflecting ML performance. We then assessed GPT-4's capabilities in several scenarios: zero-shot; few-shot with and without retrieval-augmented generation (RAG); and with and without underlying ML numerical probabilities.
RESULTS The ensemble ML model achieved an area under the receiver operating characteristic curve (AUC) of 0.88, an area under the precision-recall curve (AUPRC) of 0.72, and an accuracy of 82.9%. Naive GPT-4 (0.79 AUC, 0.48 AUPRC, and 77.5% accuracy) improved substantially when given limited, relevant examples to learn from (ie, RAG) and underlying ML probabilities (0.87 AUC, 0.71 AUPRC, and 83.1% accuracy). Notably, RAG alone boosted performance to near-peak levels (0.82 AUC, 0.56 AUPRC, and 81.3% accuracy).
CONCLUSIONS The naive LLM had limited performance on its own but improved significantly at predicting ED admissions when supplemented with real-world examples, particularly through RAG, and/or numerical probabilities from traditional ML models. Its peak performance, although slightly below that of the pure ML model, is noteworthy given its potential to provide reasoning behind predictions. Further refinement of LLMs with real-world data is necessary for successful integration as decision-support tools in care settings.
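The evaluation above hinges on AUC and AUPRC. A minimal, dependency-free sketch of both metrics on synthetic admission predictions (the labels and probabilities below are invented for illustration, not study data):

```python
def roc_auc(y_true, y_prob):
    """AUC: probability that a random positive case outranks a random negative one."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(y_true, y_prob):
    """AUPRC via average precision: mean precision at each positive's rank."""
    order = sorted(range(len(y_true)), key=lambda i: -y_prob[i])
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)

# Synthetic example: 1 = admitted, 0 = discharged.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.2, 0.7, 0.35, 0.4, 0.1, 0.8, 0.3]
auc = roc_auc(y_true, y_prob)              # 0.9375
auprc = average_precision(y_true, y_prob)  # 0.95
```

In practice a library such as scikit-learn would be used; the pairwise formulation above just makes the definition of AUC explicit.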
Affiliation(s)
- Benjamin S Glicksberg: Division of Data-Driven and Digital Medicine, Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
- Prem Timsina: Division of Data-Driven and Digital Medicine, Department of Medicine, and Institute for Healthcare Delivery Science, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
- Dhaval Patel: Institute for Healthcare Delivery Science, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
- Ashwin Sawant: Division of Data-Driven and Digital Medicine, Department of Medicine, and The Charles Bronfman Department of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
- Akhil Vaid: Division of Data-Driven and Digital Medicine, Department of Medicine, and The Charles Bronfman Department of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
- Ganesh Raut: Institute for Healthcare Delivery Science, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
- Alexander W Charney: The Charles Bronfman Department of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
- Donald Apakama: Department of Emergency Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
- Brendan G Carr: Department of Emergency Medicine, and Department of Population Health Science and Policy, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
- Robert Freeman: Division of Data-Driven and Digital Medicine, Department of Medicine; Institute for Healthcare Delivery Science; and The Charles Bronfman Department of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
- Girish N Nadkarni: Division of Data-Driven and Digital Medicine, Department of Medicine, and The Charles Bronfman Department of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
- Eyal Klang: Division of Data-Driven and Digital Medicine, Department of Medicine, and The Charles Bronfman Department of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
2. Alasker A, Alsalamah S, Alshathri N, Almansour N, Alsalamah F, Alghafees M, AlKhamees M, Alsaikhan B. Performance of large language models (LLMs) in providing prostate cancer information. BMC Urol 2024;24:177. PMID: 39180045; PMCID: PMC11342655; DOI: 10.1186/s12894-024-01570-0.
Abstract
PURPOSE The diagnosis and management of prostate cancer (PCa), the second most common cancer in men worldwide, are highly complex. Hence, patients often seek knowledge through additional resources, including AI chatbots such as ChatGPT and Google Bard. This study aimed to evaluate the performance of LLMs in providing education on PCa.
METHODS Common patient questions about PCa were collected from reliable educational websites and evaluated for accuracy, comprehensiveness, readability, and stability by two independent board-certified urologists, with a third resolving discrepancies. Accuracy was measured on a 3-point scale, comprehensiveness on a 5-point Likert scale, and readability using the Flesch Reading Ease (FRE) score and Flesch-Kincaid (FK) Grade Level.
RESULTS A total of 52 questions on general knowledge, diagnosis, treatment, and prevention of PCa were provided to three LLMs. Although there was no significant difference in overall accuracy, ChatGPT-3.5 outperformed the other LLMs on general knowledge of PCa (p = 0.018). ChatGPT-4 achieved greater overall comprehensiveness than ChatGPT-3.5 and Bard (p = 0.028). For readability, Bard generated the simplest sentences, with the highest FRE score (54.7, p < 0.001) and the lowest FK reading level (10.2, p < 0.001).
CONCLUSION ChatGPT-3.5, ChatGPT-4, and Bard generate accurate, comprehensive, and easily readable PCa material. These AI models may not replace healthcare professionals but can assist in patient education and guidance.
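Both readability measures above are closed-form formulas over word, sentence, and syllable counts. A minimal sketch of the two formulas (robust syllable counting is the hard part and is omitted here; the counts in the example are illustrative, not taken from the study):

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher scores (roughly 0-100) mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level: approximate US school grade required."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# Illustrative passage: 100 words, 5 sentences, 130 syllables.
fre = flesch_reading_ease(100, 5, 130)    # 76.555: "fairly easy" range
fkgl = flesch_kincaid_grade(100, 5, 130)  # 7.55: about a 7th-8th grade level
```

A score such as Bard's FRE of 54.7 therefore reflects longer sentences and/or more polysyllabic words than this example.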
Affiliation(s)
- Ahmed Alasker: Division of Urology, Department of Surgery, Ministry of National Guard - Health Affairs; King Abdullah International Medical Research Center (KAIMRC); and College of Medicine, King Saud bin Abdulaziz University for Health Sciences, Riyadh, Saudi Arabia
- Seham Alsalamah: King Abdullah International Medical Research Center (KAIMRC), and College of Medicine, King Saud bin Abdulaziz University for Health Sciences, Riyadh, Saudi Arabia
- Nada Alshathri: King Abdullah International Medical Research Center (KAIMRC), and College of Medicine, King Saud bin Abdulaziz University for Health Sciences, Riyadh, Saudi Arabia
- Nura Almansour: King Abdullah International Medical Research Center (KAIMRC), and College of Medicine, King Saud bin Abdulaziz University for Health Sciences, Riyadh, Saudi Arabia
- Faris Alsalamah: King Abdullah International Medical Research Center (KAIMRC), and College of Medicine, King Saud bin Abdulaziz University for Health Sciences, Riyadh, Saudi Arabia
- Mohammad Alghafees: Division of Urology, Department of Surgery, Ministry of National Guard - Health Affairs; King Abdullah International Medical Research Center (KAIMRC); and College of Medicine, King Saud bin Abdulaziz University for Health Sciences, Riyadh, Saudi Arabia
- Mohammad AlKhamees: Division of Urology, Department of Surgery, Ministry of National Guard - Health Affairs; King Abdullah International Medical Research Center (KAIMRC); College of Medicine, King Saud bin Abdulaziz University for Health Sciences, Riyadh; and Department of Surgical Specialities, College of Medicine, Majmaah University, Majmaah, Saudi Arabia
- Bader Alsaikhan: Division of Urology, Department of Surgery, Ministry of National Guard - Health Affairs; King Abdullah International Medical Research Center (KAIMRC); and College of Medicine, King Saud bin Abdulaziz University for Health Sciences, Riyadh, Saudi Arabia
3. Cao JJ, Kwon DH, Ghaziani TT, Kwo P, Tse G, Kesselman A, Kamaya A, Tse JR. Large language models' responses to liver cancer surveillance, diagnosis, and management questions: accuracy, reliability, readability. Abdom Radiol (NY) 2024. PMID: 39088019; DOI: 10.1007/s00261-024-04501-7.
Abstract
PURPOSE To assess the accuracy, reliability, and readability of publicly available large language models in answering fundamental questions on hepatocellular carcinoma diagnosis and management.
METHODS Twenty questions on liver cancer diagnosis and management were asked in triplicate to ChatGPT-3.5 (OpenAI), Gemini (Google), and Bing (Microsoft). Responses were assessed by six fellowship-trained physicians from three academic liver transplant centers who actively diagnose and/or treat liver cancer. Responses were categorized as accurate (score 1; all information is true and relevant), inadequate (score 0; all information is true but does not fully answer the question or provides irrelevant information), or inaccurate (score -1; any information is false). Means with standard deviations were recorded. An answer was considered accurate overall if its mean score was > 0, and reliable if the mean score was > 0 across all responses to the single question. Responses were also quantified for readability using the Flesch Reading Ease Score and Flesch-Kincaid Grade Level. Readability and accuracy across the 60 responses were compared using one-way ANOVAs with Tukey's multiple comparison tests.
RESULTS Of the twenty questions, ChatGPT answered nine (45%), Gemini 12 (60%), and Bing six (30%) accurately; however, only six (30%), eight (40%), and three (15%), respectively, were both accurate and reliable. There were no significant differences in accuracy between the chatbots. ChatGPT responses were the least readable (mean Flesch Reading Ease Score 29; college graduate), followed by Gemini (30; college) and Bing (40; college; p < 0.001).
CONCLUSION Large language models provide complex responses to basic questions on hepatocellular carcinoma diagnosis and management that are seldom accurate, reliable, or readable.
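The accurate/reliable determination above can be encoded in a few lines. A sketch under the interpretation that each of the three responses to a question carries one consensus score in {-1, 0, 1} (the rule names follow the abstract; the example scores are invented):

```python
def judge_question(scores):
    """scores: one consensus grade per response (1 accurate, 0 inadequate, -1 inaccurate).

    The answer as a whole counts as accurate if the mean score is > 0,
    and as reliable only if every individual response scores > 0.
    """
    mean = sum(scores) / len(scores)
    return {"mean": mean, "accurate": mean > 0, "reliable": all(s > 0 for s in scores)}

judge_question([1, 1, 0])   # accurate overall, but not reliable
judge_question([1, -1, 0])  # neither: the mean is 0, which is not > 0
```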
Affiliation(s)
- Jennie J Cao: Department of Radiology, Stanford University School of Medicine, 300 Pasteur Drive, Room H-1307, Stanford, CA, 94305, USA
- Daniel H Kwon: Department of Medicine, San Francisco School of Medicine, University of California, 505 Parnassus Ave, MC1286C, San Francisco, CA, 94144, USA
- Tara T Ghaziani: Department of Medicine, Stanford University School of Medicine, 430 Broadway St, MC 6341, Redwood City, CA, 94063, USA
- Paul Kwo: Department of Medicine, Stanford University School of Medicine, 430 Broadway St, MC 6341, Redwood City, CA, 94063, USA
- Gary Tse: Department of Radiological Sciences, David Geffen School of Medicine, University of California Los Angeles, 757 Westwood Plaza, Los Angeles, CA, 90095, USA
- Andrew Kesselman: Department of Radiology, Stanford University School of Medicine, 875 Blake Wilbur Drive, Palo Alto, Stanford, CA, 94304, USA
- Aya Kamaya: Department of Radiology, Stanford University School of Medicine, 300 Pasteur Drive, Room H-1307, Stanford, CA, 94305, USA
- Justin R Tse: Department of Radiology, Stanford University School of Medicine, 300 Pasteur Drive, Room H-1307, Stanford, CA, 94305, USA
4. Levin C, Kagan T, Rosen S, Saban M. An evaluation of the capabilities of language models and nurses in providing neonatal clinical decision support. Int J Nurs Stud 2024;155:104771. PMID: 38688103; DOI: 10.1016/j.ijnurstu.2024.104771.
Abstract
AIM To assess the clinical reasoning capabilities of two large language models, ChatGPT-4 and Claude-2.0, compared with those of neonatal nurses in neonatal care scenarios.
DESIGN A cross-sectional comparative evaluation using a survey instrument comprising six neonatal intensive care unit clinical scenarios.
PARTICIPANTS 32 neonatal intensive care nurses with 5-10 years of experience working in the neonatal intensive care units of three medical centers.
METHODS Participants responded to six written clinical scenarios. In parallel, ChatGPT-4 and Claude-2.0 were asked to provide initial assessments and treatment recommendations for the same scenarios. The responses from ChatGPT-4 and Claude-2.0 were then scored by certified neonatal nurse practitioners for accuracy, completeness, and response time.
RESULTS Both models demonstrated clinical reasoning capabilities in neonatal care, with Claude-2.0 significantly outperforming ChatGPT-4 in clinical accuracy and speed. However, limitations were identified across the cases in diagnostic precision, treatment specificity, and response lag.
CONCLUSIONS While promising, the current limitations reinforce the need for substantial refinement before ChatGPT-4 and Claude-2.0 can be considered for integration into clinical practice. Additional validation of these tools is important to safely leverage this artificial intelligence technology for enhancing clinical decision-making.
IMPACT The study provides an understanding of the reasoning accuracy of new artificial intelligence models in neonatal clinical care. The current accuracy gaps of ChatGPT-4 and Claude-2.0 need to be addressed prior to clinical use.
Affiliation(s)
- Chedva Levin: Faculty of School of Life and Health Sciences, Nursing Department, The Jerusalem College of Technology-Lev Academic Center, Jerusalem, Israel; and The Department of Vascular Surgery, The Chaim Sheba Medical Center, Tel Hashomer, Ramat Gan, Tel Aviv, Israel
- Shani Rosen: Department of Nursing, School of Health Professions, Faculty of Medical and Health Sciences, Tel Aviv University, Israel
- Mor Saban: Department of Nursing, School of Health Professions, Faculty of Medical and Health Sciences, Tel Aviv University, Israel
5. Varady NH, Lu AZ, Mazzucco M, Dines JS, Altchek DW, Williams RJ, Kunze KN. Understanding How ChatGPT May Become a Clinical Administrative Tool Through an Investigation on the Ability to Answer Common Patient Questions Concerning Ulnar Collateral Ligament Injuries. Orthop J Sports Med 2024;12:23259671241257516. PMID: 39139744; PMCID: PMC11320692; DOI: 10.1177/23259671241257516.
Abstract
Background The consumer availability and automated response functions of Chat Generative Pre-trained Transformer (ChatGPT-4), a large language model, position this application to be utilized for patient health queries and suggest a possible role as an adjunct to reduce administrative and clinical burden.
Purpose To evaluate the ability of ChatGPT-4 to respond to patient inquiries concerning ulnar collateral ligament (UCL) injuries and compare these results with the performance of Google.
Study Design Cross-sectional study.
Methods Google Web Search was used as a benchmark, as it is the most widely used search engine worldwide and the only one that generates frequently asked questions (FAQs) when prompted with a query, allowing comparisons through a systematic approach. The query "ulnar collateral ligament reconstruction" was entered into Google, and the top 10 FAQs, answers, and their sources were recorded. ChatGPT-4 was prompted to perform a Google search of FAQs with the same query and to record the sources of answers for comparison. This process was then replicated to obtain 10 new questions requiring numeric rather than open-ended responses. Finally, responses were graded independently for clinical accuracy (grade 0 = inaccurate, grade 1 = somewhat accurate, grade 2 = accurate) by 2 fellowship-trained sports medicine surgeons (D.W.A., J.S.D.) blinded to the search engine and answer source.
Results ChatGPT-4 used a greater proportion of academic sources than Google to answer the top 10 FAQs, although the difference was not statistically significant (90% vs 50%; P = .14). In terms of question overlap, 40% of the most common questions on Google and ChatGPT-4 were the same. For FAQs with numeric responses, 20% of answers overlapped completely, 30% overlapped partially, and the remaining 50% did not overlap at all. All sources used by ChatGPT-4 to answer these FAQs were academic, while only 20% of sources used by Google were academic (P = .0007). The remaining Google sources included social media (40%), medical practices (20%), single-surgeon websites (10%), and commercial websites (10%). The mean (± standard deviation) accuracy of ChatGPT-4 answers was significantly greater than Google's for both the top 10 FAQs (1.9 ± 0.2 vs 1.2 ± 0.6; P = .001) and the top 10 questions with numeric answers (1.8 ± 0.4 vs 1 ± 0.8; P = .013).
Conclusion ChatGPT-4 is capable of providing responses with clinically relevant content concerning UCL injuries and reconstruction. ChatGPT-4 utilized a greater proportion of academic websites than Google Web Search to answer FAQs representative of patient inquiries and provided significantly more accurate answers. Moving forward, ChatGPT has the potential to be used as a clinical adjunct for queries about UCL injuries and reconstruction, but further validation is warranted before integrated or autonomous use in clinical settings.
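The 90% vs 50% academic-source comparison above is a two-sided Fisher's exact test on a 2x2 table of ten FAQs per engine. A stdlib-only reconstruction, with counts inferred from the reported percentages (illustrative, not the authors' code):

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    n, row1, col1 = a + b + c + d, a + b, a + c
    denom = comb(n, col1)
    def p_table(k):  # hypergeometric probability of a table with top-left cell = k
        return comb(row1, k) * comb(n - row1, col1 - k) / denom
    p_obs = p_table(a)
    lo, hi = max(0, col1 - (n - row1)), min(row1, col1)
    # Sum probabilities of all tables at least as extreme as the observed one.
    return sum(p_table(k) for k in range(lo, hi + 1) if p_table(k) <= p_obs + 1e-12)

# 9/10 academic sources for ChatGPT-4 vs 5/10 for Google (from the 90% vs 50% figures).
p = fisher_exact_two_sided(9, 1, 5, 5)  # ~0.141, matching the reported P = .14
```

`scipy.stats.fisher_exact` would give the same two-sided p-value; the explicit hypergeometric sum shows where the P = .14 comes from.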
Affiliation(s)
- Nathan H. Varady: Department of Orthopaedic Surgery, Hospital for Special Surgery, New York, New York, USA
- Amy Z. Lu: Weill Cornell Medical College, New York, New York, USA
- Joshua S. Dines: Department of Orthopaedic Surgery, Hospital for Special Surgery, New York, New York, USA
- Riley J. Williams: Department of Orthopaedic Surgery, Hospital for Special Surgery, New York, New York, USA
- Kyle N. Kunze: Department of Orthopaedic Surgery, Hospital for Special Surgery, New York, New York, USA
6. Kunze KN, Varady NH, Mazzucco M, Lu AZ, Chahla J, Martin RK, Ranawat AS, Pearle AD, Williams RJ. The Large Language Model ChatGPT-4 Exhibits Excellent Triage Capabilities and Diagnostic Performance for Patients Presenting With Various Causes of Knee Pain. Arthroscopy 2024. PMID: 38925234; DOI: 10.1016/j.arthro.2024.06.021.
Abstract
PURPOSE To provide a proof-of-concept analysis of the appropriateness and performance of ChatGPT-4 in triaging, synthesizing differential diagnoses, and generating treatment plans for common presentations of knee pain.
METHODS Twenty knee complaints warranting triage, plus expanded scenarios, were input into ChatGPT-4, with memory cleared before each new input to mitigate bias. For the 10 triage complaints, ChatGPT-4 was asked to generate a differential diagnosis, which was graded for accuracy and suitability against a differential created by 2 orthopaedic sports medicine physicians. For the 10 clinical scenarios, ChatGPT-4 was prompted to provide treatment guidance for the patient, which was again graded. To test the higher-order capabilities of ChatGPT-4, further inquiry into these specific management recommendations was performed and graded.
RESULTS All ChatGPT-4 diagnoses were deemed appropriate within the spectrum of potential pathologies on a differential. The top diagnosis was identical between surgeons and ChatGPT-4 in 70% of scenarios, and the surgeon's top diagnosis appeared as either the first or second diagnosis in 90% of scenarios. Overall, 16 of 30 diagnoses (53.3%) in the differential were identical. When provided with 10 expanded vignettes with a single diagnosis, ChatGPT-4's accuracy increased to 100%, with the suitability of management graded as appropriate in 90% of cases. Specific information pertaining to conservative management, surgical approaches, and related treatments was appropriate and accurate in 100% of cases.
CONCLUSIONS ChatGPT-4 provided clinically reasonable diagnoses for triaging patient complaints of knee pain due to various underlying conditions, generally consistent with the differentials provided by sports medicine physicians. Diagnostic performance was enhanced by additional information, allowing ChatGPT-4 to reach high predictive accuracy for recommendations concerning management and treatment options. However, ChatGPT-4 may show clinically important error rates for diagnosis depending on the prompting strategy and information provided; further refinements are therefore necessary before implementation into clinical workflows.
CLINICAL RELEVANCE Although ChatGPT-4 is increasingly used by patients for health information, its potential to serve as a clinical support tool is unclear. In this study, ChatGPT-4 was frequently able to diagnose and triage knee complaints appropriately, as rated by sports medicine surgeons, suggesting that it may eventually be a useful clinical support tool.
Affiliation(s)
- Kyle N Kunze: Department of Orthopaedic Surgery, and Sports Medicine and Shoulder Service, Hospital for Special Surgery, New York, New York, U.S.A.
- Nathan H Varady: Department of Orthopaedic Surgery, and Sports Medicine and Shoulder Service, Hospital for Special Surgery, New York, New York, U.S.A.
- Amy Z Lu: Weill Cornell College of Medicine, New York, New York, U.S.A.
- Jorge Chahla: Department of Orthopaedic Surgery, Rush University Medical Center, Chicago, Illinois, U.S.A.
- R Kyle Martin: Department of Orthopedic Surgery, University of Minnesota, Minneapolis, Minnesota, U.S.A.
- Anil S Ranawat: Department of Orthopaedic Surgery, and Sports Medicine and Shoulder Service, Hospital for Special Surgery, New York, New York, U.S.A.
- Andrew D Pearle: Department of Orthopaedic Surgery, and Sports Medicine and Shoulder Service, Hospital for Special Surgery, New York, New York, U.S.A.
- Riley J Williams: Department of Orthopaedic Surgery, and Sports Medicine and Shoulder Service, Hospital for Special Surgery, New York, New York, U.S.A.
7. Lim B, Seth I, Cuomo R, Kenney PS, Ross RJ, Sofiadellis F, Pentangelo P, Ceccaroni A, Alfano C, Rozen WM. Can AI Answer My Questions? Utilizing Artificial Intelligence in the Perioperative Assessment for Abdominoplasty Patients. Aesthetic Plast Surg 2024. PMID: 38898239; DOI: 10.1007/s00266-024-04157-0.
Abstract
BACKGROUND Abdominoplasty is a common operation used for a range of cosmetic and functional issues, often in the context of divarication of recti, significant weight loss, and after pregnancy. Despite this, patient-surgeon communication gaps can hinder informed decision-making. The integration of large language models (LLMs) in healthcare offers potential for enhancing patient information. This study evaluated the feasibility of using LLMs to answer perioperative queries.
METHODS This study assessed the efficacy of four leading LLMs (OpenAI's ChatGPT-3.5, Anthropic's Claude, Google's Gemini, and Bing's CoPilot) using fifteen unique prompts. All outputs were evaluated for readability using the Flesch-Kincaid Grade Level, Flesch Reading Ease score, and Coleman-Liau index, and for quality using the DISCERN score and a Likert scale. Scores were assigned by two plastic surgery residents and then reviewed and discussed until a consensus was reached by five plastic surgeon specialists.
RESULTS ChatGPT-3.5 required the highest reading level for comprehension, followed by Gemini, Claude, then CoPilot. Claude provided the most appropriate and actionable advice. In terms of patient-friendliness, CoPilot outperformed the rest, enhancing engagement and information comprehensiveness. ChatGPT-3.5 and Gemini offered adequate, though unremarkable, advice in more professional language. CoPilot uniquely included visual aids and was the only model to use hyperlinks, although these were of limited helpfulness, and it faced limitations in responding to certain queries.
CONCLUSION ChatGPT-3.5, Gemini, Claude, and Bing's CoPilot showed differences in readability and reliability. LLMs offer unique advantages for patient care but require careful selection. Future research should integrate LLM strengths and address weaknesses for optimal patient education.
LEVEL OF EVIDENCE V This journal requires that authors assign a level of evidence to each article. For a full description of these Evidence-Based Medicine ratings, please refer to the Table of Contents or the online Instructions to Authors at www.springer.com/00266.
Affiliation(s)
- Bryan Lim: Department of Plastic Surgery, Peninsula Health, Melbourne, Victoria, 3199, Australia
- Ishith Seth: Department of Plastic Surgery, Peninsula Health, Melbourne, Victoria, 3199, Australia
- Roberto Cuomo: Plastic Surgery Unit, Department of Medicine, Surgery and Neuroscience, University of Siena, Siena, Italy
- Peter Sinkjær Kenney: Department of Plastic Surgery, Vejle Hospital, Beriderbakken 4, 7100, Vejle, Denmark; and Department of Plastic and Breast Surgery, Aarhus University Hospital, Aarhus, Denmark
- Richard J Ross: Department of Plastic Surgery, Peninsula Health, Melbourne, Victoria, 3199, Australia
- Foti Sofiadellis: Department of Plastic Surgery, Peninsula Health, Melbourne, Victoria, 3199, Australia
- Warren Matthew Rozen: Department of Plastic Surgery, Peninsula Health, Melbourne, Victoria, 3199, Australia
8. Amacher SA, Arpagaus A, Sahmer C, Becker C, Gross S, Urben T, Tisljar K, Sutter R, Marsch S, Hunziker S. Prediction of outcomes after cardiac arrest by a generative artificial intelligence model. Resusc Plus 2024;18:100587. PMID: 38433764; PMCID: PMC10906512; DOI: 10.1016/j.resplu.2024.100587.
Abstract
Aims To investigate the prognostic accuracy of a non-medical generative artificial intelligence model (Chat Generative Pre-Trained Transformer 4, ChatGPT-4) in predicting death and poor neurological outcome at hospital discharge, based on real-life data from cardiac arrest patients.
Methods This prospective cohort study investigates the prognostic performance of ChatGPT-4 in predicting outcomes at hospital discharge of adult cardiac arrest patients admitted to intensive care at a large Swiss tertiary academic medical center (COMMUNICATE/PROPHETIC cohort study). We prompted ChatGPT-4 with sixteen prognostic parameters derived from established post-cardiac arrest scores for each patient. We compared ChatGPT-4 with three cardiac arrest scores (Out-of-Hospital Cardiac Arrest [OHCA], Cardiac Arrest Hospital Prognosis [CAHP], and PROgnostication using LOGistic regression model for Unselected adult cardiac arrest patients in the Early stages [PROLOGUE]) in terms of area under the curve (AUC), sensitivity, specificity, positive and negative predictive values, and likelihood ratios for in-hospital mortality and poor neurological outcome.
Results Mortality at hospital discharge was 43% (n = 309/713), and 54% of patients (n = 387/713) had a poor neurological outcome. ChatGPT-4 showed good discrimination for in-hospital mortality, with an AUC of 0.85, similar to the OHCA, CAHP, and PROLOGUE scores (AUCs of 0.82, 0.83, and 0.84, respectively). For poor neurological outcome, ChatGPT-4's prediction was likewise similar to the post-cardiac arrest scores (AUC 0.83).
Conclusions ChatGPT-4 performed similarly to validated post-cardiac arrest scores in predicting mortality and poor neurological outcome. However, more research is needed, particularly on occasional illogical answers, before an LLM can be incorporated into multimodal outcome prognostication after cardiac arrest.
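The score comparison above reports sensitivity, specificity, predictive values, and likelihood ratios alongside AUC; all of these derive from a single confusion matrix. A sketch with toy counts (not study data):

```python
def binary_test_metrics(tp, fp, tn, fn):
    """Standard diagnostic-test metrics from confusion-matrix counts."""
    sens = tp / (tp + fn)  # sensitivity (true positive rate)
    spec = tn / (tn + fp)  # specificity (true negative rate)
    return {
        "sensitivity": sens,
        "specificity": spec,
        "ppv": tp / (tp + fp),     # positive predictive value
        "npv": tn / (tn + fn),     # negative predictive value
        "lr+": sens / (1 - spec),  # positive likelihood ratio
        "lr-": (1 - sens) / spec,  # negative likelihood ratio
    }

m = binary_test_metrics(tp=40, fp=5, tn=45, fn=10)
# sensitivity 0.80, specificity 0.90, LR+ 8.0, LR- ~0.22
```

Note that likelihood ratios, unlike predictive values, do not depend on outcome prevalence, which is why prognostic-score papers often report both.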
Affiliation(s)
- Simon A. Amacher: Intensive Care Medicine and Emergency Medicine, Department of Acute Medical Care, and Medical Communication and Psychosomatic Medicine, University Hospital Basel, Basel, Switzerland
- Armon Arpagaus: Medical Communication and Psychosomatic Medicine, University Hospital Basel, Basel, Switzerland
- Christian Sahmer: Medical Communication and Psychosomatic Medicine, University Hospital Basel, Basel, Switzerland
- Christoph Becker: Medical Communication and Psychosomatic Medicine, and Emergency Medicine, Department of Acute Medical Care, University Hospital Basel, Basel, Switzerland
- Sebastian Gross: Medical Communication and Psychosomatic Medicine, University Hospital Basel, Basel, Switzerland
- Tabita Urben: Medical Communication and Psychosomatic Medicine, University Hospital Basel, Basel, Switzerland
- Kai Tisljar: Intensive Care Medicine, Department of Acute Medical Care, University Hospital Basel, Basel, Switzerland
- Raoul Sutter: Intensive Care Medicine, Department of Acute Medical Care, and Division of Neurophysiology, Department of Neurology, University Hospital Basel; Medical Faculty, University of Basel, Basel, Switzerland
- Stephan Marsch: Intensive Care Medicine, Department of Acute Medical Care, University Hospital Basel; Medical Faculty, University of Basel, Basel, Switzerland
- Sabina Hunziker: Medical Communication and Psychosomatic Medicine, and Post-Intensive Care Clinic, University Hospital Basel; Medical Faculty, University of Basel, Basel, Switzerland
Collapse
9
Pressman SM, Borna S, Gomez-Cabello CA, Haider SA, Haider CR, Forte AJ. Clinical and Surgical Applications of Large Language Models: A Systematic Review. J Clin Med 2024; 13:3041. [PMID: 38892752] [PMCID: PMC11172607] [DOI: 10.3390/jcm13113041]
Abstract
Background: Large language models (LLMs) represent a recent advancement in artificial intelligence with medical applications across various healthcare domains. The objective of this review is to highlight how LLMs can be utilized by clinicians and surgeons in their everyday practice. Methods: A systematic review was conducted following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines. Six databases were searched to identify relevant articles. Eligibility criteria emphasized articles focused primarily on clinical and surgical applications of LLMs. Results: The literature search yielded 333 results, with 34 meeting eligibility criteria. All articles were from 2023. There were 14 original research articles, four letters, one interview, and 15 review articles. These articles covered a wide variety of medical specialties, including various surgical subspecialties. Conclusions: LLMs have the potential to enhance healthcare delivery. In clinical settings, LLMs can assist in diagnosis, treatment guidance, patient triage, physician knowledge augmentation, and administrative tasks. In surgical settings, LLMs can assist surgeons with documentation, surgical planning, and intraoperative guidance. However, addressing their limitations and concerns, particularly those related to accuracy and biases, is crucial. LLMs should be viewed as tools to complement, not replace, the expertise of healthcare professionals.
Affiliation(s)
- Sahar Borna: Division of Plastic Surgery, Mayo Clinic, Jacksonville, FL 32224, USA
- Syed Ali Haider: Division of Plastic Surgery, Mayo Clinic, Jacksonville, FL 32224, USA
- Clifton R. Haider: Department of Physiology and Biomedical Engineering, Mayo Clinic, Rochester, MN 55905, USA
- Antonio Jorge Forte: Division of Plastic Surgery, Mayo Clinic, Jacksonville, FL 32224, USA; Center for Digital Health, Mayo Clinic, Rochester, MN 55905, USA
10
Preiksaitis C, Ashenburg N, Bunney G, Chu A, Kabeer R, Riley F, Ribeira R, Rose C. The Role of Large Language Models in Transforming Emergency Medicine: Scoping Review. JMIR Med Inform 2024; 12:e53787. [PMID: 38728687] [PMCID: PMC11127144] [DOI: 10.2196/53787]
Abstract
BACKGROUND Artificial intelligence (AI), more specifically large language models (LLMs), holds significant potential in revolutionizing emergency care delivery by optimizing clinical workflows and enhancing the quality of decision-making. Although enthusiasm for integrating LLMs into emergency medicine (EM) is growing, the existing literature is characterized by a disparate collection of individual studies, conceptual analyses, and preliminary implementations. Given these complexities and gaps in understanding, a cohesive framework is needed to comprehend the existing body of knowledge on the application of LLMs in EM. OBJECTIVE Given the absence of a comprehensive framework for exploring the roles of LLMs in EM, this scoping review aims to systematically map the existing literature on LLMs' potential applications within EM and identify directions for future research. Addressing this gap will allow for informed advancements in the field. METHODS Using PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews) criteria, we searched Ovid MEDLINE, Embase, Web of Science, and Google Scholar for papers published between January 2018 and August 2023 that discussed LLMs' use in EM. We excluded other forms of AI. A total of 1994 unique titles and abstracts were screened, and each full-text paper was independently reviewed by 2 authors. Data were abstracted independently, and 5 authors performed a collaborative quantitative and qualitative synthesis of the data. RESULTS A total of 43 papers were included. Studies were predominantly from 2022 to 2023 and conducted in the United States and China. 
We uncovered four major themes: (1) clinical decision-making and support was highlighted as a pivotal area, with LLMs playing a substantial role in enhancing patient care, notably through their application in real-time triage, allowing early recognition of patient urgency; (2) efficiency, workflow, and information management demonstrated the capacity of LLMs to significantly boost operational efficiency, particularly through the automation of patient record synthesis, which could reduce administrative burden and enhance patient-centric care; (3) risks, ethics, and transparency were identified as areas of concern, especially regarding the reliability of LLMs' outputs, and specific studies highlighted the challenges of ensuring unbiased decision-making amidst potentially flawed training data sets, stressing the importance of thorough validation and ethical oversight; and (4) education and communication possibilities included LLMs' capacity to enrich medical training, such as through using simulated patient interactions that enhance communication skills. CONCLUSIONS LLMs have the potential to fundamentally transform EM, enhancing clinical decision-making, optimizing workflows, and improving patient outcomes. This review sets the stage for future advancements by identifying key research areas: prospective validation of LLM applications, establishing standards for responsible use, understanding provider and patient perceptions, and improving physicians' AI literacy. Effective integration of LLMs into EM will require collaborative efforts and thorough evaluation to ensure these technologies can be safely and effectively applied.
Affiliation(s)
- Carl Preiksaitis, Nicholas Ashenburg, Gabrielle Bunney, Andrew Chu, Rana Kabeer, Fran Riley, Ryan Ribeira, Christian Rose: Department of Emergency Medicine, Stanford University School of Medicine, Palo Alto, CA, United States
11
Scott IA, Zuccon G. The new paradigm in machine learning - foundation models, large language models and beyond: a primer for physicians. Intern Med J 2024; 54:705-715. [PMID: 38715436] [DOI: 10.1111/imj.16393]
Abstract
Foundation machine learning models are deep learning models capable of performing many different tasks using different data modalities such as text, audio, images and video. They represent a major shift from traditional task-specific machine learning prediction models. Large language models (LLMs), brought to wide public prominence in the form of ChatGPT, are text-based foundation models that have the potential to transform medicine by enabling automation of a range of tasks, including writing discharge summaries, answering patients' questions and assisting in clinical decision-making. However, such models are not without risk and can potentially cause harm if their development, evaluation and use are not subjected to proper scrutiny. This narrative review describes the different types of LLMs, their emerging applications, potential limitations and biases, and their likely future translation into clinical practice.
Affiliation(s)
- Ian A Scott: Centre for Health Services Research, University of Queensland, Woolloongabba, Australia
- Guido Zuccon: School of Electrical Engineering and Computer Sciences, The University of Queensland, St Lucia, Queensland, Australia
12
Frosolini A, Catarzi L, Benedetti S, Latini L, Chisci G, Franz L, Gennaro P, Gabriele G. The Role of Large Language Models (LLMs) in Providing Triage for Maxillofacial Trauma Cases: A Preliminary Study. Diagnostics (Basel) 2024; 14:839. [PMID: 38667484] [PMCID: PMC11048758] [DOI: 10.3390/diagnostics14080839]
Abstract
BACKGROUND In the evolving field of maxillofacial surgery, integrating advanced technologies like Large Language Models (LLMs) into medical practices, especially for trauma triage, presents a promising yet largely unexplored potential. This study aimed to evaluate the feasibility of using LLMs for triaging complex maxillofacial trauma cases by comparing their performance against the expertise of a tertiary referral center. METHODS Utilizing a comprehensive review of patient records in a tertiary referral center over a year-long period, standardized prompts detailing patient demographics, injury characteristics, and medical histories were created. These prompts were used to assess the triage suggestions of ChatGPT 4.0 and Google GEMINI against the center's recommendations, supplemented by evaluating the AI's performance using the QAMAI and AIPI questionnaires. RESULTS The results in 10 cases of major maxillofacial trauma indicated moderate agreement rates between LLM recommendations and the referral center, with some variances in the suggestion of appropriate examinations (70% ChatGPT and 50% GEMINI) and treatment plans (60% ChatGPT and 45% GEMINI). Notably, the study found no statistically significant differences in several areas of the questionnaires, except in the diagnosis accuracy (GEMINI: 3.30, ChatGPT: 2.30; p = 0.032) and relevance of the recommendations (GEMINI: 2.90, ChatGPT: 3.50; p = 0.021). A Spearman correlation analysis highlighted significant correlations within the two questionnaires, specifically between the QAMAI total score and AIPI treatment scores (rho = 0.767, p = 0.010). CONCLUSIONS This exploratory investigation underscores the potential of LLMs in enhancing clinical decision making for maxillofacial trauma cases, indicating a need for further research to refine their application in healthcare settings.
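The Spearman correlation reported in this abstract (e.g., rho = 0.767 between QAMAI and AIPI treatment scores) can be sketched for the tie-free case. The paired questionnaire scores below are invented for illustration, not the study's data.

```python
# Minimal sketch of a Spearman rank correlation like the one reported
# above. Assumes no tied values (ties would require averaged ranks);
# the paired questionnaire totals below are invented.

def spearman_rho(x, y):
    """Spearman's rho via the classic formula 1 - 6*sum(d^2)/(n*(n^2-1)),
    valid when neither sequence contains ties."""
    n = len(x)

    def ranks(values):
        order = sorted(range(n), key=lambda i: values[i])
        r = [0] * n
        for rank, i in enumerate(order, start=1):
            r[i] = rank  # rank 1 = smallest value
        return r

    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Invented paired total scores for five hypothetical trauma cases.
qamai = [12, 17, 9, 21, 14]
aipi = [10, 16, 11, 20, 13]
print(spearman_rho(qamai, aipi))
```

Because the coefficient depends only on ranks, it captures monotone agreement between the two questionnaires even when their scales differ.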
Affiliation(s)
- Andrea Frosolini, Lisa Catarzi, Simone Benedetti, Linda Latini, Glauco Chisci, Paolo Gennaro, Guido Gabriele: Maxillofacial Surgery Unit, Department of Medical Biotechnologies, University of Siena, 53100 Siena, Italy
- Leonardo Franz: Phoniatrics and Audiology Unit, Department of Neuroscience DNS, University of Padova, 35122 Treviso, Italy; Artificial Intelligence in Medicine and Innovation in Clinical Research and Methodology (PhD Program), Department of Clinical and Experimental Sciences, University of Brescia, 25121 Brescia, Italy
13
Paslı S, Şahin AS, Beşer MF, Topçuoğlu H, Yadigaroğlu M, İmamoğlu M. Assessing the precision of artificial intelligence in ED triage decisions: Insights from a study with ChatGPT. Am J Emerg Med 2024; 78:170-175. [PMID: 38295466] [DOI: 10.1016/j.ajem.2024.01.037]
Abstract
BACKGROUND The rise in emergency department presentations globally poses challenges for efficient patient management. To address this, various strategies aim to expedite patient management. Artificial intelligence's (AI) consistent performance and rapid data interpretation extend its healthcare applications, especially in emergencies. The introduction of a robust AI tool like ChatGPT, based on GPT-4 developed by OpenAI, can benefit patients and healthcare professionals by improving the speed and accuracy of resource allocation. This study examines ChatGPT's capability to predict triage outcomes based on local emergency department rules. METHODS This study is a single-center prospective observational study. The study population consists of all patients who presented to the emergency department with any symptoms and agreed to participate. The study was conducted on three non-consecutive days for a total of 72 h. Patients' chief complaints, vital parameters, medical history and the area to which they were directed by the triage team in the emergency department were recorded. Concurrently, an emergency medicine physician inputted the same data into previously trained GPT-4, according to local rules. According to this data, the triage decisions made by GPT-4 were recorded. In the same process, an emergency medicine specialist determined where the patient should be directed based on the data collected, and this decision was considered the gold standard. Accuracy rates and reliability for directing patients to specific areas by the triage team and GPT-4 were evaluated using Cohen's kappa test. Furthermore, the accuracy of the patient triage process performed by the triage team and GPT-4 was assessed by receiver operating characteristic (ROC) analysis. Statistical analysis considered a value of p < 0.05 as significant. RESULTS The study was carried out on 758 patients. Among the participants, 416 (54.9%) were male and 342 (45.1%) were female. 
Evaluating the primary endpoints of our study - the agreement between the decisions of the triage team, GPT-4 decisions in emergency department triage, and the gold standard - we observed almost perfect agreement both between the triage team and the gold standard and between GPT-4 and the gold standard (Cohen's kappa 0.893 and 0.899, respectively; p < 0.001 for each). CONCLUSION Our findings suggest GPT-4 possesses outstanding predictive skills in triaging patients in an emergency setting and can serve as an effective tool to support the triage process.
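The agreement statistic this study reports, Cohen's kappa, compares observed agreement against the agreement expected by chance from each rater's marginal frequencies. A minimal sketch follows; the triage labels below are invented toy data, not the study's 758-patient sample.

```python
# Minimal sketch of Cohen's kappa, the agreement statistic used above to
# compare triage decisions against the gold standard. Toy data invented.

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length label sequences."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    labels = sorted(set(rater_a) | set(rater_b))
    # Observed agreement: fraction of identical decisions.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement under independence, from marginal label frequencies.
    p_e = sum(
        (rater_a.count(lab) / n) * (rater_b.count(lab) / n) for lab in labels
    )
    return (p_o - p_e) / (1 - p_e)

# Invented toy example: triage areas from the gold standard vs. a model.
gold = ["red", "yellow", "green", "green", "yellow", "red", "green", "green"]
model = ["red", "yellow", "green", "yellow", "yellow", "red", "green", "green"]
print(cohens_kappa(gold, model))
```

Values near 0.9, as reported in the study, fall in the "almost perfect" band of the conventional Landis-Koch interpretation of kappa.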
Affiliation(s)
- Sinan Paslı: Karadeniz Technical University, Faculty of Medicine, Department of Emergency Medicine, Trabzon, Turkey
- Hazal Topçuoğlu: Siirt Education & Research Hospital, Department of Emergency Medicine, Siirt, Turkey
- Metin Yadigaroğlu: Samsun University, Faculty of Medicine, Department of Emergency Medicine, Samsun, Turkey
- Melih İmamoğlu: Karadeniz Technical University, Faculty of Medicine, Department of Emergency Medicine, Trabzon, Turkey
14
Mu Y, He D. The Potential Applications and Challenges of ChatGPT in the Medical Field. Int J Gen Med 2024; 17:817-826. [PMID: 38476626] [PMCID: PMC10929156] [DOI: 10.2147/ijgm.s456659]
Abstract
ChatGPT, an AI-driven conversational large language model (LLM), has garnered significant scholarly attention since its inception, owing to its manifold applications in the realm of medical science. This study primarily examines the merits, limitations, anticipated developments, and practical applications of ChatGPT in clinical practice, healthcare, medical education, and medical research. It underscores the necessity for further research and development to enhance its performance and deployment. Moreover, future research avenues encompass ongoing enhancements and standardization of ChatGPT, mitigating its limitations, and exploring its integration and applicability in translational and personalized medicine. Reflecting the narrative nature of this review, a focused literature search was performed to identify relevant publications on ChatGPT's use in medicine. This process was aimed at gathering a broad spectrum of insights to provide a comprehensive overview of the current state and future prospects of ChatGPT in the medical domain. The objective is to aid healthcare professionals in understanding the groundbreaking advancements associated with the latest artificial intelligence tools, while also acknowledging the opportunities and challenges presented by ChatGPT.
Affiliation(s)
- Yonglin Mu: Department of Urology, Children’s Hospital of Chongqing Medical University, Chongqing, People’s Republic of China
- Dawei He: Department of Urology, Children’s Hospital of Chongqing Medical University, Chongqing, People’s Republic of China
15
Wang Z, Zhang Z, Traverso A, Dekker A, Qian L, Sun P. Assessing the role of GPT-4 in thyroid ultrasound diagnosis and treatment recommendations: enhancing interpretability with a chain of thought approach. Quant Imaging Med Surg 2024; 14:1602-1615. [PMID: 38415150] [PMCID: PMC10895085] [DOI: 10.21037/qims-23-1180]
Abstract
Background As artificial intelligence (AI) becomes increasingly prevalent in the medical field, the effectiveness of AI-generated medical reports in disease diagnosis remains to be evaluated. ChatGPT is a large language model developed by OpenAI with a notable capacity for text abstraction and comprehension. This study aimed to explore the capabilities, limitations, and potential of Generative Pre-trained Transformer (GPT)-4 in analyzing thyroid cancer ultrasound reports, providing diagnoses, and recommending treatment plans. Methods Using 109 diverse thyroid cancer cases, we evaluated GPT-4's performance by comparing its generated reports to those from doctors with various levels of experience. We also conducted a Turing Test and a consistency analysis. To enhance the interpretability of the model, we applied the Chain of Thought (CoT) method to deconstruct the decision-making chain of the GPT model. Results GPT-4 demonstrated proficiency in report structuring, professional terminology, and clarity of expression, but showed limitations in diagnostic accuracy. In addition, our consistency analysis highlighted certain discrepancies in the AI's performance. The CoT method effectively enhanced the interpretability of the AI's decision-making process. Conclusions GPT-4 exhibits potential as a supplementary tool in healthcare, especially for generating thyroid gland diagnostic reports. Our proposed online platform, "ThyroAIGuide", alongside the CoT method, underscores the potential of AI to augment diagnostic processes, elevate healthcare accessibility, and advance patient education. However, the journey towards fully integrating AI into healthcare is ongoing, requiring continuous research, development, and careful monitoring by medical professionals to ensure patient safety and quality of care.
Affiliation(s)
- Zhixiang Wang: Department of Ultrasound, Beijing Friendship Hospital, Capital Medical University, Beijing, China; Department of Radiation Oncology (Maastro), GROW-School for Oncology, Maastricht University Medical Centre+, Maastricht, The Netherlands
- Zhen Zhang: Department of Radiation Oncology (Maastro), GROW-School for Oncology, Maastricht University Medical Centre+, Maastricht, The Netherlands
- Alberto Traverso: Department of Radiation Oncology (Maastro), GROW-School for Oncology, Maastricht University Medical Centre+, Maastricht, The Netherlands
- Andre Dekker: Department of Radiation Oncology (Maastro), GROW-School for Oncology, Maastricht University Medical Centre+, Maastricht, The Netherlands
- Linxue Qian: Department of Ultrasound, Beijing Friendship Hospital, Capital Medical University, Beijing, China
- Pengfei Sun: Department of Ultrasound, Beijing Friendship Hospital, Capital Medical University, Beijing, China
16
Ray PP. ChatGPT's competence in addressing urolithiasis: myth or reality? Int Urol Nephrol 2024; 56:149-150. [PMID: 37726510] [DOI: 10.1007/s11255-023-03802-y]
17
dos Santos ML, Victória VNG. Critical evaluation of applications of artificial intelligence based linguistic models in Occupational Health. Rev Bras Med Trab 2024; 22:e20231241. [PMID: 39165532] [PMCID: PMC11333049] [DOI: 10.47626/1679-4435-2023-1241]
Abstract
This article explores the impact and potential applications of large language models in Occupational Medicine. Large language models have the ability to provide support for medical decision-making, patient screening, summarization and creation of technical, scientific, and legal documents, training and education for doctors and occupational health teams, as well as patient education, potentially leading to lower costs, reduced time expenditure, and a lower incidence of human errors. Despite promising results and a wide range of applications, large language models also have significant limitations in terms of their accuracy, the risk of generating false information, and incorrect recommendations. Various ethical aspects that have not been well elucidated by the medical and academic communities should also be considered, and the lack of regulation by government entities can create areas of legal uncertainty regarding their use in Occupational Medicine and in the legal environment. Significant future improvements can be expected in these models in the coming years, and further studies on the applications of large language models in Occupational Medicine should be encouraged.
Affiliation(s)
- Mateus Lins dos Santos: 6ª Vara, Justiça Federal em Sergipe, Itabaiana, SE, Brazil; 9ª Vara, Justiça Federal em Sergipe, Propriá, SE, Brazil
18
Alotaibi SS, Rehman A, Hasnain M. Revolutionizing ocular cancer management: a narrative review on exploring the potential role of ChatGPT. Front Public Health 2023; 11:1338215. [PMID: 38192545] [PMCID: PMC10773849] [DOI: 10.3389/fpubh.2023.1338215]
Abstract
This paper pioneers the exploration of ocular cancer and its management with the help of Artificial Intelligence (AI) technology. Existing literature reports a significant increase in new eye cancer cases in 2023, with a higher incidence rate. Extensive research was conducted using online databases such as PubMed, ACM Digital Library, ScienceDirect, and Springer. The review followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. Of the 62 studies collected, only 20 documents met the inclusion criteria. The review identifies seven ocular cancer types and highlights important challenges associated with ocular cancer, including limited awareness about eye cancer, restricted healthcare access, financial barriers, and insufficient infrastructure support. Financial barriers are among the most widely examined ocular cancer challenges in the literature. The potential role and limitations of ChatGPT are discussed, emphasizing its usefulness in providing general information to physicians while noting its inability to deliver up-to-date information. The paper concludes by presenting potential future applications of ChatGPT to advance research on ocular cancer globally.
Affiliation(s)
- Saud S. Alotaibi: Information Systems Department, Umm Al-Qura University, Makkah, Saudi Arabia
- Amna Rehman: Department of Computer Science, Lahore Leads University, Lahore, Pakistan
- Muhammad Hasnain: Department of Computer Science, Lahore Leads University, Lahore, Pakistan
19
Wang X, Liu XQ. Potential and limitations of ChatGPT and generative artificial intelligence in medical safety education. World J Clin Cases 2023; 11:7935-7939. [DOI: 10.12998/wjcc.v11.i32.7935]
Abstract
The primary objectives of medical safety education are to provide the public with essential knowledge about medications and to foster a scientific approach to drug usage. The era of using artificial intelligence to revolutionize medical safety education has already dawned, and ChatGPT and other generative artificial intelligence models have immense potential in this domain. Notably, they offer a wealth of knowledge, anonymity, continuous availability, and personalized services. However, the practical implementation of generative artificial intelligence models such as ChatGPT in medical safety education still faces several challenges, including concerns about the accuracy of information, legal responsibilities, and ethical obligations. Moving forward, it is crucial to intelligently upgrade ChatGPT by leveraging the strengths of existing medical practices. This task involves further integrating the model with real-life scenarios and proactively addressing ethical and security issues with the ultimate goal of providing the public with comprehensive, convenient, efficient, and personalized medical services.
Affiliation(s)
- Xin Wang: School of Education, Tianjin University, Tianjin 300350, China
- Xin-Qiao Liu: School of Education, Tianjin University, Tianjin 300350, China