Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Jiang LY, Liu XC, Nejatian NP, Nasir-Moin M, Wang D, Abidin A, Eaton K, Riina HA, Laufer I, Punjabi P, Miceli M, Kim NC, Orillac C, Schnurman Z, Livia C, Weiss H, Kurland D, Neifert S, Dastagirzada Y, Kondziolka D, Cheung ATM, Yang G, Cao M, Flores M, Costa AB, Aphinyanaphongs Y, Cho K, Oermann EK. Health system-scale language models are all-purpose prediction engines. Nature 2023;619:357-362. [PMID: 37286606 PMCID: PMC10338337 DOI: 10.1038/s41586-023-06160-y] [Citation(s) in RCA: 35] [Impact Index Per Article: 35.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/14/2022] [Accepted: 05/02/2023] [Indexed: 06/09/2023]

For:	Jiang LY, Liu XC, Nejatian NP, Nasir-Moin M, Wang D, Abidin A, Eaton K, Riina HA, Laufer I, Punjabi P, Miceli M, Kim NC, Orillac C, Schnurman Z, Livia C, Weiss H, Kurland D, Neifert S, Dastagirzada Y, Kondziolka D, Cheung ATM, Yang G, Cao M, Flores M, Costa AB, Aphinyanaphongs Y, Cho K, Oermann EK. Health system-scale language models are all-purpose prediction engines. Nature 2023;619:357-362. [PMID: 37286606 PMCID: PMC10338337 DOI: 10.1038/s41586-023-06160-y] [Citation(s) in RCA: 35] [Impact Index Per Article: 35.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/14/2022] [Accepted: 05/02/2023] [Indexed: 06/09/2023]

Number

Cited by Other Article(s)

Beil M, Moreno R, Fronczek J, Kogan Y, Moreno RPJ, Flaatten H, Guidet B, de Lange D, Leaver S, Nachshon A, van Heerden PV, Joskowicz L, Sviri S, Jung C, Szczeklik W. Prognosticating the outcome of intensive care in older patients-a narrative review. Ann Intensive Care 2024;14:97. [PMID: 38907141 PMCID: PMC11192712 DOI: 10.1186/s13613-024-01330-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/16/2024] [Accepted: 06/10/2024] [Indexed: 06/23/2024] Open

Affiliation(s)

Michael Beil Department of Medical Intensive Care, Hadassah Medical Center and Faculty of Medicine, Hebrew University of Jerusalem, Jerusalem, Israel
Rui Moreno Unidade Local de Saúde São José, Hospital de São José, Lisbon, Portugal Centro Clínico Académico de Lisboa, Lisbon, Portugal Faculdade de Ciências da Saúde, Universidade da Beira Interior, Covilhã, Portugal
Jakub Fronczek Center for Intensive Care and Perioperative Medicine, Jagiellonian University Medical College, Krakow, Poland
Yuri Kogan Institute for Medical Biomathematics, Bene Ataroth, Israel
Rui Paulo Jorge Moreno Imperial College Business School, London, UK
Hans Flaatten Department of Research and Development, Haukeland University Hospital, Bergen, Norway
Bertrand Guidet INSERM, Institut Pierre Louis d'Epidémiologie Et de Santé Publique, AP-HP, Hôpital Saint Antoine, Sorbonne Université, Service MIR, Paris, France
Dylan de Lange Department of Intensive Care Medicine, University Medical Center, University Utrecht, Utrecht, The Netherlands
Susannah Leaver General Intensive Care, St George's University Hospitals NHS Foundation Trust, London, UK
Akiva Nachshon General Intensive Care Unit, Department of Anaesthesiology, Critical Care and Pain Medicine, Faculty of Medicine, Hebrew University and, Hadassah University Medical Center, Jerusalem, Israel
Peter Vernon van Heerden General Intensive Care Unit, Department of Anaesthesiology, Critical Care and Pain Medicine, Faculty of Medicine, Hebrew University and, Hadassah University Medical Center, Jerusalem, Israel
Leo Joskowicz School of Computer Science and Engineering and Center for Computational Medicine, The Hebrew University of Jerusalem, Jerusalem, Israel
Sigal Sviri Department of Medical Intensive Care, Hadassah Medical Center and Faculty of Medicine, Hebrew University of Jerusalem, Jerusalem, Israel
Christian Jung Department of Cardiology, Pulmonology and Vascular Medicine, Faculty of Medicine, Heinrich-Heine-University, University Duesseldorf, Moorenstraße 5, 40225, Düsseldorf, Germany.
Wojciech Szczeklik Center for Intensive Care and Perioperative Medicine, Jagiellonian University Medical College, Krakow, Poland

Collapse

Yao JJ, Aggarwal M, Lopez RD, Namdari S. Current Concepts Review: Large Language Models in Orthopaedics: Definitions, Uses, and Limitations. J Bone Joint Surg Am 2024:00004623-990000000-01136. [PMID: 38896652 DOI: 10.2106/jbjs.23.01417] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 06/21/2024]

Volkmer S, Meyer-Lindenberg A, Schwarz E. Large language models in psychiatry: Opportunities and challenges. Psychiatry Res 2024;339:116026. [PMID: 38909412 DOI: 10.1016/j.psychres.2024.116026] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 10/31/2023] [Revised: 05/17/2024] [Accepted: 06/10/2024] [Indexed: 06/25/2024]

Longwell JB, Hirsch I, Binder F, Gonzalez Conchas GA, Mau D, Jang R, Krishnan RG, Grant RC. Performance of Large Language Models on Medical Oncology Examination Questions. JAMA Netw Open 2024;7:e2417641. [PMID: 38888919 PMCID: PMC11185976 DOI: 10.1001/jamanetworkopen.2024.17641] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 02/01/2024] [Accepted: 04/18/2024] [Indexed: 06/20/2024] Open

Abstract

Importance

Large language models (LLMs) recently developed an unprecedented ability to answer questions. Studies of LLMs from other fields may not generalize to medical oncology, a high-stakes clinical setting requiring rapid integration of new information.

Objective

To evaluate the accuracy and safety of LLM answers on medical oncology examination questions.

Design, Setting, and Participants

This cross-sectional study was conducted between May 28 and October 11, 2023. The American Society of Clinical Oncology (ASCO) Oncology Self-Assessment Series on ASCO Connection, the European Society of Medical Oncology (ESMO) Examination Trial questions, and an original set of board-style medical oncology multiple-choice questions were presented to 8 LLMs.

Main Outcomes and Measures

The primary outcome was the percentage of correct answers. Medical oncologists evaluated the explanations provided by the best LLM for accuracy, classified the types of errors, and estimated the likelihood and extent of potential clinical harm.

Results

Proprietary LLM 2 correctly answered 125 of 147 questions (85.0%; 95% CI, 78.2%-90.4%; P < .001 vs random answering). Proprietary LLM 2 outperformed an earlier version, proprietary LLM 1, which correctly answered 89 of 147 questions (60.5%; 95% CI, 52.2%-68.5%; P < .001), and the best open-source LLM, Mixtral-8x7B-v0.1, which correctly answered 87 of 147 questions (59.2%; 95% CI, 50.0%-66.4%; P < .001). The explanations provided by proprietary LLM 2 contained no or minor errors for 138 of 147 questions (93.9%; 95% CI, 88.7%-97.2%). Incorrect responses were most commonly associated with errors in information retrieval, particularly with recent publications, followed by erroneous reasoning and reading comprehension. If acted upon in clinical practice, 18 of 22 incorrect answers (81.8%; 95% CI, 59.7%-94.8%) would have a medium or high likelihood of moderate to severe harm.

Conclusions and Relevance

In this cross-sectional study of the performance of LLMs on medical oncology examination questions, the best LLM answered questions with remarkable performance, although errors raised safety concerns. These results demonstrated an opportunity to develop and evaluate LLMs to improve health care clinician experiences and patient care, considering the potential impact on capabilities and safety.

Collapse

Amacher SA, Arpagaus A, Sahmer C, Becker C, Gross S, Urben T, Tisljar K, Sutter R, Marsch S, Hunziker S. Prediction of outcomes after cardiac arrest by a generative artificial intelligence model. Resusc Plus 2024;18:100587. [PMID: 38433764 PMCID: PMC10906512 DOI: 10.1016/j.resplu.2024.100587] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/27/2023] [Revised: 02/01/2024] [Accepted: 02/11/2024] [Indexed: 03/05/2024] Open

Abstract

Aims

To investigate the prognostic accuracy of a non-medical generative artificial intelligence model (Chat Generative Pre-Trained Transformer 4 - ChatGPT-4) as a novel aspect in predicting death and poor neurological outcome at hospital discharge based on real-life data from cardiac arrest patients.

Methods

This prospective cohort study investigates the prognostic performance of ChatGPT-4 to predict outcomes at hospital discharge of adult cardiac arrest patients admitted to intensive care at a large Swiss tertiary academic medical center (COMMUNICATE/PROPHETIC cohort study). We prompted ChatGPT-4 with sixteen prognostic parameters derived from established post-cardiac arrest scores for each patient. We compared the prognostic performance of ChatGPT-4 regarding the area under the curve (AUC), sensitivity, specificity, positive and negative predictive values, and likelihood ratios of three cardiac arrest scores (Out-of-Hospital Cardiac Arrest [OHCA], Cardiac Arrest Hospital Prognosis [CAHP], and PROgnostication using LOGistic regression model for Unselected adult cardiac arrest patients in the Early stages [PROLOGUE score]) for in-hospital mortality and poor neurological outcome.

Results

Mortality at hospital discharge was 43% (n = 309/713), 54% of patients (n = 387/713) had a poor neurological outcome. ChatGPT-4 showed good discrimination regarding in-hospital mortality with an AUC of 0.85, similar to the OHCA, CAHP, and PROLOGUE (AUCs of 0.82, 0.83, and 0.84, respectively) scores. For poor neurological outcome, ChatGPT-4 showed a similar prediction to the post-cardiac arrest scores (AUC 0.83).

Conclusions

ChatGPT-4 showed a similar performance in predicting mortality and poor neurological outcome compared to validated post-cardiac arrest scores. However, more research is needed regarding illogical answers for potential incorporation of an LLM in the multimodal outcome prognostication after cardiac arrest.

Collapse

Perez-Lopez R, Ghaffari Laleh N, Mahmood F, Kather JN. A guide to artificial intelligence for cancer researchers. Nat Rev Cancer 2024;24:427-441. [PMID: 38755439 DOI: 10.1038/s41568-024-00694-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Accepted: 04/09/2024] [Indexed: 05/18/2024]

Xu H, Usuyama N, Bagga J, Zhang S, Rao R, Naumann T, Wong C, Gero Z, González J, Gu Y, Xu Y, Wei M, Wang W, Ma S, Wei F, Yang J, Li C, Gao J, Rosemon J, Bower T, Lee S, Weerasinghe R, Wright BJ, Robicsek A, Piening B, Bifulco C, Wang S, Poon H. A whole-slide foundation model for digital pathology from real-world data. Nature 2024;630:181-188. [PMID: 38778098 PMCID: PMC11153137 DOI: 10.1038/s41586-024-07441-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/30/2023] [Accepted: 04/19/2024] [Indexed: 05/25/2024]

Akhondi-Asl A, Yang Y, Luchette M, Burns JP, Mehta NM, Geva A. Comparing the Quality of Domain-Specific Versus General Language Models for Artificial Intelligence-Generated Differential Diagnoses in PICU Patients. Pediatr Crit Care Med 2024;25:e273-e282. [PMID: 38329382 DOI: 10.1097/pcc.0000000000003468] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 02/09/2024]

Abstract

OBJECTIVES

Generative language models (LMs) are being evaluated in a variety of tasks in healthcare, but pediatric critical care studies are scant. Our objective was to evaluate the utility of generative LMs in the pediatric critical care setting and to determine whether domain-adapted LMs can outperform much larger general-domain LMs in generating a differential diagnosis from the admission notes of PICU patients.

DESIGN

Single-center retrospective cohort study.

SETTING

Quaternary 40-bed PICU.

PATIENTS

Notes from all patients admitted to the PICU between January 2012 and April 2023 were used for model development. One hundred thirty randomly selected admission notes were used for evaluation.

INTERVENTIONS

None.

MEASUREMENTS AND MAIN RESULTS

Five experts in critical care used a 5-point Likert scale to independently evaluate the overall quality of differential diagnoses: 1) written by the clinician in the original notes, 2) generated by two general LMs (BioGPT-Large and LLaMa-65B), and 3) generated by two fine-tuned models (fine-tuned BioGPT-Large and fine-tuned LLaMa-7B). Differences among differential diagnoses were compared using mixed methods regression models. We used 1,916,538 notes from 32,454 unique patients for model development and validation. The mean quality scores of the differential diagnoses generated by the clinicians and fine-tuned LLaMa-7B, the best-performing LM, were 3.43 and 2.88, respectively (absolute difference 0.54 units [95% CI, 0.37-0.72], p < 0.001). Fine-tuned LLaMa-7B performed better than LLaMa-65B (absolute difference 0.23 unit [95% CI, 0.06-0.41], p = 0.009) and BioGPT-Large (absolute difference 0.86 unit [95% CI, 0.69-1.0], p < 0.001). The differential diagnosis generated by clinicians and fine-tuned LLaMa-7B were ranked as the highest quality in 144 (55%) and 74 cases (29%), respectively.

CONCLUSIONS

A smaller LM fine-tuned using notes of PICU patients outperformed much larger models trained on general-domain data. Currently, LMs remain inferior but may serve as an adjunct to human clinicians in real-world tasks using real-world data.

Collapse

Affiliation(s)

Alireza Akhondi-Asl Division of Critical Care Medicine, Department of Anesthesiology, Critical Care and Pain Medicine, Boston Children's Hospital, Boston, MA Perioperative and Critical Care-Center for Outcomes (PC-CORE), Boston Children's Hospital, Boston, MA Department of Anaesthesia, Harvard Medical School, Boston, MA Computational Health Informatics Program, Boston Children's Hospital, Boston, MA
Youyang Yang Division of Critical Care Medicine, Department of Anesthesiology, Critical Care and Pain Medicine, Boston Children's Hospital, Boston, MA Perioperative and Critical Care-Center for Outcomes (PC-CORE), Boston Children's Hospital, Boston, MA Department of Anaesthesia, Harvard Medical School, Boston, MA Computational Health Informatics Program, Boston Children's Hospital, Boston, MA
Matthew Luchette Division of Critical Care Medicine, Department of Anesthesiology, Critical Care and Pain Medicine, Boston Children's Hospital, Boston, MA Perioperative and Critical Care-Center for Outcomes (PC-CORE), Boston Children's Hospital, Boston, MA Department of Anaesthesia, Harvard Medical School, Boston, MA Computational Health Informatics Program, Boston Children's Hospital, Boston, MA
Jeffrey P Burns Division of Critical Care Medicine, Department of Anesthesiology, Critical Care and Pain Medicine, Boston Children's Hospital, Boston, MA Perioperative and Critical Care-Center for Outcomes (PC-CORE), Boston Children's Hospital, Boston, MA Department of Anaesthesia, Harvard Medical School, Boston, MA Computational Health Informatics Program, Boston Children's Hospital, Boston, MA
Nilesh M Mehta Division of Critical Care Medicine, Department of Anesthesiology, Critical Care and Pain Medicine, Boston Children's Hospital, Boston, MA Perioperative and Critical Care-Center for Outcomes (PC-CORE), Boston Children's Hospital, Boston, MA
Alon Geva Division of Critical Care Medicine, Department of Anesthesiology, Critical Care and Pain Medicine, Boston Children's Hospital, Boston, MA Perioperative and Critical Care-Center for Outcomes (PC-CORE), Boston Children's Hospital, Boston, MA Department of Anaesthesia, Harvard Medical School, Boston, MA Computational Health Informatics Program, Boston Children's Hospital, Boston, MA

Collapse

Bryant AK, Zamora‐Resendiz R, Dai X, Morrow D, Lin Y, Jungles KM, Rae JM, Tate A, Pearson AN, Jiang R, Fritsche L, Lawrence TS, Zou W, Schipper M, Ramnath N, Yoo S, Crivelli S, Green MD. Artificial intelligence to unlock real-world evidence in clinical oncology: A primer on recent advances. Cancer Med 2024;13:e7253. [PMID: 38899720 PMCID: PMC11187737 DOI: 10.1002/cam4.7253] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/17/2023] [Revised: 02/05/2024] [Accepted: 04/28/2024] [Indexed: 06/21/2024] Open

Affiliation(s)

Alex K. Bryant Department of Radiation OncologyUniversity of Michigan School of MedicineAnn ArborMichiganUSA Department of Radiation Oncology, Veterans Affairs Ann Arbor Healthcare SystemAnn ArborMichiganUSA
Rafael Zamora‐Resendiz Applied Mathematics and Computational Research Division, Lawrence Berkeley National LaboratoryBerkeleyCaliforniaUSA
Xin Dai Computational Science Initiative, Brookhaven National LaboratoryUptonNew YorkUSA
Destinee Morrow Applied Mathematics and Computational Research Division, Lawrence Berkeley National LaboratoryBerkeleyCaliforniaUSA
Yuewei Lin Computational Science Initiative, Brookhaven National LaboratoryUptonNew YorkUSA
Kassidy M. Jungles Department of PharmacologyUniversity of Michigan School of MedicineAnn ArborMichiganUSA
James M. Rae Department of PharmacologyUniversity of Michigan School of MedicineAnn ArborMichiganUSA Department of Internal MedicineUniversity of Michigan School of MedicineAnn ArborMichiganUSA
Akshay Tate Department of Radiation OncologyUniversity of Michigan School of MedicineAnn ArborMichiganUSA
Ashley N. Pearson Department of Radiation OncologyUniversity of Michigan School of MedicineAnn ArborMichiganUSA
Ralph Jiang Department of Radiation OncologyUniversity of Michigan School of MedicineAnn ArborMichiganUSA Department of StatisticsUniversity of MichiganAnn ArborMichiganUSA
Lars Fritsche Department of StatisticsUniversity of MichiganAnn ArborMichiganUSA
Theodore S. Lawrence Department of Radiation OncologyUniversity of Michigan School of MedicineAnn ArborMichiganUSA
Weiping Zou Department of StatisticsUniversity of MichiganAnn ArborMichiganUSA Center of Excellence for Cancer Immunology and ImmunotherapyUniversity of Michigan Rogel Cancer CenterAnn ArborMichiganUSA Department of PathologyUniversity of MichiganAnn ArborMichiganUSA Graduate Program in ImmunologyUniversity of MichiganAnn ArborMichiganUSA
Matthew Schipper Department of Radiation OncologyUniversity of Michigan School of MedicineAnn ArborMichiganUSA Department of PharmacologyUniversity of Michigan School of MedicineAnn ArborMichiganUSA
Nithya Ramnath Division of Hematology Oncology, Department of MedicineUniversity of Michigan School of MedicineAnn ArborMichiganUSA Division of Hematology Oncology, Department of MedicineVeterans Affairs Ann Arbor Healthcare SystemAnn ArborMichiganUSA
Shinjae Yoo Computational Science Initiative, Brookhaven National LaboratoryUptonNew YorkUSA
Silvia Crivelli Applied Mathematics and Computational Research Division, Lawrence Berkeley National LaboratoryBerkeleyCaliforniaUSA
Michael D. Green Department of Radiation OncologyUniversity of Michigan School of MedicineAnn ArborMichiganUSA Department of Radiation Oncology, Veterans Affairs Ann Arbor Healthcare SystemAnn ArborMichiganUSA Graduate Program in ImmunologyUniversity of MichiganAnn ArborMichiganUSA Graduate Program in Cancer BiologyUniversity of MichiganAnn ArborMichiganUSA Department of Microbiology and ImmunologyUniversity of Michigan School of MedicineAnn ArborMichiganUSA

Collapse

Pais C, Liu J, Voigt R, Gupta V, Wade E, Bayati M. Large language models for preventing medication direction errors in online pharmacies. Nat Med 2024;30:1574-1582. [PMID: 38664535 PMCID: PMC11186789 DOI: 10.1038/s41591-024-02933-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/16/2023] [Accepted: 03/20/2024] [Indexed: 05/04/2024]

Glicksberg BS, Timsina P, Patel D, Sawant A, Vaid A, Raut G, Charney AW, Apakama D, Carr BG, Freeman R, Nadkarni GN, Klang E. Evaluating the accuracy of a state-of-the-art large language model for prediction of admissions from the emergency room. J Am Med Inform Assoc 2024:ocae103. [PMID: 38771093 DOI: 10.1093/jamia/ocae103] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/29/2023] [Accepted: 04/22/2024] [Indexed: 05/22/2024] Open

Abstract

BACKGROUND

Artificial intelligence (AI) and large language models (LLMs) can play a critical role in emergency room operations by augmenting decision-making about patient admission. However, there are no studies for LLMs using real-world data and scenarios, in comparison to and being informed by traditional supervised machine learning (ML) models. We evaluated the performance of GPT-4 for predicting patient admissions from emergency department (ED) visits. We compared performance to traditional ML models both naively and when informed by few-shot examples and/or numerical probabilities.

METHODS

We conducted a retrospective study using electronic health records across 7 NYC hospitals. We trained Bio-Clinical-BERT and XGBoost (XGB) models on unstructured and structured data, respectively, and created an ensemble model reflecting ML performance. We then assessed GPT-4 capabilities in many scenarios: through Zero-shot, Few-shot with and without retrieval-augmented generation (RAG), and with and without ML numerical probabilities.

RESULTS

The Ensemble ML model achieved an area under the receiver operating characteristic curve (AUC) of 0.88, an area under the precision-recall curve (AUPRC) of 0.72 and an accuracy of 82.9%. The naïve GPT-4's performance (0.79 AUC, 0.48 AUPRC, and 77.5% accuracy) showed substantial improvement when given limited, relevant data to learn from (ie, RAG) and underlying ML probabilities (0.87 AUC, 0.71 AUPRC, and 83.1% accuracy). Interestingly, RAG alone boosted performance to near peak levels (0.82 AUC, 0.56 AUPRC, and 81.3% accuracy).

CONCLUSIONS

The naïve LLM had limited performance but showed significant improvement in predicting ED admissions when supplemented with real-world examples to learn from, particularly through RAG, and/or numerical probabilities from traditional ML models. Its peak performance, although slightly lower than the pure ML model, is noteworthy given its potential for providing reasoning behind predictions. Further refinement of LLMs with real-world data is necessary for successful integration as decision-support tools in care settings.

Collapse

Affiliation(s)

Benjamin S Glicksberg Division of Data-Driven and Digital Medicine, Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
Prem Timsina Division of Data-Driven and Digital Medicine, Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States Institute for Healthcare Delivery Science, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
Dhaval Patel Institute for Healthcare Delivery Science, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
Ashwin Sawant Division of Data-Driven and Digital Medicine, Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States The Charles Bronfman Department of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
Akhil Vaid Division of Data-Driven and Digital Medicine, Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States The Charles Bronfman Department of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
Ganesh Raut Institute for Healthcare Delivery Science, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
Alexander W Charney The Charles Bronfman Department of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
Donald Apakama Department of Emergency Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
Brendan G Carr Department of Emergency Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States Department of Population Health Science and Policy, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
Robert Freeman Division of Data-Driven and Digital Medicine, Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States Institute for Healthcare Delivery Science, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States The Charles Bronfman Department of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
Girish N Nadkarni Division of Data-Driven and Digital Medicine, Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States The Charles Bronfman Department of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
Eyal Klang Division of Data-Driven and Digital Medicine, Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States The Charles Bronfman Department of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States

Collapse

Murugan M, Yuan B, Venner E, Ballantyne CM, Robinson KM, Coons JC, Wang L, Empey PE, Gibbs RA. Empowering personalized pharmacogenomics with generative AI solutions. J Am Med Inform Assoc 2024;31:1356-1366. [PMID: 38447590 PMCID: PMC11105140 DOI: 10.1093/jamia/ocae039] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/29/2023] [Revised: 02/06/2024] [Accepted: 02/19/2024] [Indexed: 03/08/2024] Open

Abstract

OBJECTIVE

This study evaluates an AI assistant developed using OpenAI's GPT-4 for interpreting pharmacogenomic (PGx) testing results, aiming to improve decision-making and knowledge sharing in clinical genetics and to enhance patient care with equitable access.

MATERIALS AND METHODS

The AI assistant employs retrieval-augmented generation (RAG), which combines retrieval and generative techniques, by harnessing a knowledge base (KB) that comprises data from the Clinical Pharmacogenetics Implementation Consortium (CPIC). It uses context-aware GPT-4 to generate tailored responses to user queries from this KB, further refined through prompt engineering and guardrails.

RESULTS

Evaluated against a specialized PGx question catalog, the AI assistant showed high efficacy in addressing user queries. Compared with OpenAI's ChatGPT 3.5, it demonstrated better performance, especially in provider-specific queries requiring specialized data and citations. Key areas for improvement include enhancing accuracy, relevancy, and representative language in responses.

DISCUSSION

The integration of context-aware GPT-4 with RAG significantly enhanced the AI assistant's utility. RAG's ability to incorporate domain-specific CPIC data, including recent literature, proved beneficial. Challenges persist, such as the need for specialized genetic/PGx models to improve accuracy and relevancy and addressing ethical, regulatory, and safety concerns.

CONCLUSION

This study underscores generative AI's potential for transforming healthcare provider support and patient accessibility to complex pharmacogenomic information. While careful implementation of large language models like GPT-4 is necessary, it is clear that they can substantially improve understanding of pharmacogenomic data. With further development, these tools could augment healthcare expertise, provider productivity, and the delivery of equitable, patient-centered healthcare services.

Collapse

Bitterman DS, Downing A, Maués J, Lustberg M. Promise and Perils of Large Language Models for Cancer Survivorship and Supportive Care. J Clin Oncol 2024;42:1607-1611. [PMID: 38452323 PMCID: PMC11095890 DOI: 10.1200/jco.23.02439] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/10/2023] [Revised: 12/07/2023] [Accepted: 01/17/2024] [Indexed: 03/09/2024] Open

Friedman AB, Delgado MK, Weissman GE. Artificial Intelligence for Emergency Care Triage-Much Promise, but Still Much to Learn. JAMA Netw Open 2024;7:e248857. [PMID: 38713470 DOI: 10.1001/jamanetworkopen.2024.8857] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 05/08/2024] Open

Alter IL, Chan K, Lechien J, Rameau A. An introduction to machine learning and generative artificial intelligence for otolaryngologists-head and neck surgeons: a narrative review. Eur Arch Otorhinolaryngol 2024;281:2723-2731. [PMID: 38393353 DOI: 10.1007/s00405-024-08512-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/22/2023] [Accepted: 01/25/2024] [Indexed: 02/25/2024]

Wu J, Wu X, Qiu Z, Li M, Lin S, Zhang Y, Zheng Y, Yuan C, Yang J. Large language models leverage external knowledge to extend clinical insight beyond language boundaries. J Am Med Inform Assoc 2024:ocae079. [PMID: 38684792 DOI: 10.1093/jamia/ocae079] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/19/2023] [Revised: 03/14/2024] [Accepted: 04/02/2024] [Indexed: 05/02/2024] Open

Abstract

OBJECTIVES

Large Language Models (LLMs) such as ChatGPT and Med-PaLM have excelled in various medical question-answering tasks. However, these English-centric models encounter challenges in non-English clinical settings, primarily due to limited clinical knowledge in respective languages, a consequence of imbalanced training corpora. We systematically evaluate LLMs in the Chinese medical context and develop a novel in-context learning framework to enhance their performance.

MATERIALS AND METHODS

The latest China National Medical Licensing Examination (CNMLE-2022) served as the benchmark. We collected 53 medical books and 381 149 medical questions to construct the medical knowledge base and question bank. The proposed Knowledge and Few-shot Enhancement In-context Learning (KFE) framework leverages the in-context learning ability of LLMs to integrate diverse external clinical knowledge sources. We evaluated KFE with ChatGPT (GPT-3.5), GPT-4, Baichuan2-7B, Baichuan2-13B, and QWEN-72B in CNMLE-2022 and further investigated the effectiveness of different pathways for incorporating LLMs with medical knowledge from 7 distinct perspectives.

RESULTS

Directly applying ChatGPT failed to qualify for the CNMLE-2022 at a score of 51. Cooperated with the KFE framework, the LLMs with varying sizes yielded consistent and significant improvements. The ChatGPT's performance surged to 70.04 and GPT-4 achieved the highest score of 82.59. This surpasses the qualification threshold (60) and exceeds the average human score of 68.70, affirming the effectiveness and robustness of the framework. It also enabled a smaller Baichuan2-13B to pass the examination, showcasing the great potential in low-resource settings.

DISCUSSION AND CONCLUSION

This study shed light on the optimal practices to enhance the capabilities of LLMs in non-English medical scenarios. By synergizing medical knowledge through in-context learning, LLMs can extend clinical insight beyond language barriers in healthcare, significantly reducing language-related disparities of LLM applications and ensuring global benefit in this field.

Collapse

Wang L, Wan Z, Ni C, Song Q, Li Y, Clayton EW, Malin BA, Yin Z. A Systematic Review of ChatGPT and Other Conversational Large Language Models in Healthcare. MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2024:2024.04.26.24306390. [PMID: 38712148 PMCID: PMC11071576 DOI: 10.1101/2024.04.26.24306390] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/08/2024]

Abstract

Background

The launch of the Chat Generative Pre-trained Transformer (ChatGPT) in November 2022 has attracted public attention and academic interest to large language models (LLMs), facilitating the emergence of many other innovative LLMs. These LLMs have been applied in various fields, including healthcare. Numerous studies have since been conducted regarding how to employ state-of-the-art LLMs in health-related scenarios to assist patients, doctors, and public health administrators.

Objective

This review aims to summarize the applications and concerns of applying conversational LLMs in healthcare and provide an agenda for future research on LLMs in healthcare.

Methods

We utilized PubMed, ACM, and IEEE digital libraries as primary sources for this review. We followed the guidance of Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRIMSA) to screen and select peer-reviewed research articles that (1) were related to both healthcare applications and conversational LLMs and (2) were published before September 1st, 2023, the date when we started paper collection and screening. We investigated these papers and classified them according to their applications and concerns.

Results

Our search initially identified 820 papers according to targeted keywords, out of which 65 papers met our criteria and were included in the review. The most popular conversational LLM was ChatGPT from OpenAI (60), followed by Bard from Google (1), Large Language Model Meta AI (LLaMA) from Meta (1), and other LLMs (5). These papers were classified into four categories in terms of their applications: 1) summarization, 2) medical knowledge inquiry, 3) prediction, and 4) administration, and four categories of concerns: 1) reliability, 2) bias, 3) privacy, and 4) public acceptability. There are 49 (75%) research papers using LLMs for summarization and/or medical knowledge inquiry, and 58 (89%) research papers expressing concerns about reliability and/or bias. We found that conversational LLMs exhibit promising results in summarization and providing medical knowledge to patients with a relatively high accuracy. However, conversational LLMs like ChatGPT are not able to provide reliable answers to complex health-related tasks that require specialized domain expertise. Additionally, no experiments in our reviewed papers have been conducted to thoughtfully examine how conversational LLMs lead to bias or privacy issues in healthcare research.

Conclusions

Future studies should focus on improving the reliability of LLM applications in complex health-related tasks, as well as investigating the mechanisms of how LLM applications brought bias and privacy issues. Considering the vast accessibility of LLMs, legal, social, and technical efforts are all needed to address concerns about LLMs to promote, improve, and regularize the application of LLMs in healthcare.

Collapse

Gilbert S, Kather JN, Hogan A. Augmented non-hallucinating large language models as medical information curators. NPJ Digit Med 2024;7:100. [PMID: 38654142 DOI: 10.1038/s41746-024-01081-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/18/2024] [Accepted: 03/14/2024] [Indexed: 04/25/2024] Open

Qu Y, Wei C, Du P, Che W, Zhang C, Ouyang W, Bian Y, Xu F, Hu B, Du K, Wu H, Liu J, Liu Q. Integration of cognitive tasks into artificial general intelligence test for large models. iScience 2024;27:109550. [PMID: 38595796 PMCID: PMC11001637 DOI: 10.1016/j.isci.2024.109550] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 04/11/2024] Open

Frosolini A, Catarzi L, Benedetti S, Latini L, Chisci G, Franz L, Gennaro P, Gabriele G. The Role of Large Language Models (LLMs) in Providing Triage for Maxillofacial Trauma Cases: A Preliminary Study. Diagnostics (Basel) 2024;14:839. [PMID: 38667484 PMCID: PMC11048758 DOI: 10.3390/diagnostics14080839] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/29/2024] [Revised: 04/10/2024] [Accepted: 04/16/2024] [Indexed: 04/28/2024] Open

Abstract

BACKGROUND

In the evolving field of maxillofacial surgery, integrating advanced technologies like Large Language Models (LLMs) into medical practices, especially for trauma triage, presents a promising yet largely unexplored potential. This study aimed to evaluate the feasibility of using LLMs for triaging complex maxillofacial trauma cases by comparing their performance against the expertise of a tertiary referral center.

METHODS

Utilizing a comprehensive review of patient records in a tertiary referral center over a year-long period, standardized prompts detailing patient demographics, injury characteristics, and medical histories were created. These prompts were used to assess the triage suggestions of ChatGPT 4.0 and Google GEMINI against the center's recommendations, supplemented by evaluating the AI's performance using the QAMAI and AIPI questionnaires.

RESULTS

The results in 10 cases of major maxillofacial trauma indicated moderate agreement rates between LLM recommendations and the referral center, with some variances in the suggestion of appropriate examinations (70% ChatGPT and 50% GEMINI) and treatment plans (60% ChatGPT and 45% GEMINI). Notably, the study found no statistically significant differences in several areas of the questionnaires, except in the diagnosis accuracy (GEMINI: 3.30, ChatGPT: 2.30; p = 0.032) and relevance of the recommendations (GEMINI: 2.90, ChatGPT: 3.50; p = 0.021). A Spearman correlation analysis highlighted significant correlations within the two questionnaires, specifically between the QAMAI total score and AIPI treatment scores (rho = 0.767, p = 0.010).

CONCLUSIONS

This exploratory investigation underscores the potential of LLMs in enhancing clinical decision making for maxillofacial trauma cases, indicating a need for further research to refine their application in healthcare settings.

Collapse

Ma C, Tan W, He R, Yan B. Pretraining a foundation model for generalizable fluorescence microscopy-based image restoration. Nat Methods 2024:10.1038/s41592-024-02244-3. [PMID: 38609490 DOI: 10.1038/s41592-024-02244-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/27/2023] [Accepted: 03/13/2024] [Indexed: 04/14/2024]

Kim C, Gadgil SU, DeGrave AJ, Omiye JA, Cai ZR, Daneshjou R, Lee SI. Transparent medical image AI via an image-text foundation model grounded in medical literature. Nat Med 2024;30:1154-1165. [PMID: 38627560 DOI: 10.1038/s41591-024-02887-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/09/2023] [Accepted: 02/27/2024] [Indexed: 04/21/2024]

Yurkovich JT, Evans SJ, Rappaport N, Boore JL, Lovejoy JC, Price ND, Hood LE. The transition from genomics to phenomics in personalized population health. Nat Rev Genet 2024;25:286-302. [PMID: 38093095 DOI: 10.1038/s41576-023-00674-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 11/03/2023] [Indexed: 03/21/2024]

Truhn D, Eckardt JN, Ferber D, Kather JN. Large language models and multimodal foundation models for precision oncology. NPJ Precis Oncol 2024;8:72. [PMID: 38519519 PMCID: PMC10959931 DOI: 10.1038/s41698-024-00573-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/21/2023] [Accepted: 03/12/2024] [Indexed: 03/25/2024] Open

Ge J, Chen IY, Pletcher MJ, Lai JC. Prompt Engineering for Generative Artificial Intelligence in Gastroenterology and Hepatology. Am J Gastroenterol 2024:00000434-990000000-01003. [PMID: 38294157 DOI: 10.14309/ajg.0000000000002689] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Grants] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 10/11/2023] [Accepted: 12/28/2023] [Indexed: 02/01/2024]

Sorin V, Glicksberg BS, Artsi Y, Barash Y, Konen E, Nadkarni GN, Klang E. Utilizing large language models in breast cancer management: systematic review. J Cancer Res Clin Oncol 2024;150:140. [PMID: 38504034 PMCID: PMC10950983 DOI: 10.1007/s00432-024-05678-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/30/2023] [Accepted: 03/01/2024] [Indexed: 03/21/2024]

Ge J, Sun S, Owens J, Galvez V, Gologorskaya O, Lai JC, Pletcher MJ, Lai K. Development of a liver disease-specific large language model chat interface using retrieval-augmented generation. Hepatology 2024:01515467-990000000-00791. [PMID: 38451962 DOI: 10.1097/hep.0000000000000834] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 12/13/2023] [Accepted: 02/24/2024] [Indexed: 03/09/2024]

Abstract

BACKGROUND AND AIMS

Large language models (LLMs) have significant capabilities in clinical information processing tasks. Commercially available LLMs, however, are not optimized for clinical uses and are prone to generating hallucinatory information. Retrieval-augmented generation (RAG) is an enterprise architecture that allows the embedding of customized data into LLMs. This approach "specializes" the LLMs and is thought to reduce hallucinations.

APPROACH AND RESULTS

We developed "LiVersa," a liver disease-specific LLM, by using our institution's protected health information-complaint text embedding and LLM platform, "Versa." We conducted RAG on 30 publicly available American Association for the Study of Liver Diseases guidance documents to be incorporated into LiVersa. We evaluated LiVersa's performance by conducting 2 rounds of testing. First, we compared LiVersa's outputs versus those of trainees from a previously published knowledge assessment. LiVersa answered all 10 questions correctly. Second, we asked 15 hepatologists to evaluate the outputs of 10 hepatology topic questions generated by LiVersa, OpenAI's ChatGPT 4, and Meta's Large Language Model Meta AI 2. LiVersa's outputs were more accurate but were rated less comprehensive and safe compared to those of ChatGPT 4.

RESULTS

We evaluated LiVersa's performance by conducting 2 rounds of testing. First, we compared LiVersa's outputs versus those of trainees from a previously published knowledge assessment. LiVersa answered all 10 questions correctly. Second, we asked 15 hepatologists to evaluate the outputs of 10 hepatology topic questions generated by LiVersa, OpenAI's ChatGPT 4, and Meta's Large Language Model Meta AI 2. LiVersa's outputs were more accurate but were rated less comprehensive and safe compared to those of ChatGPT 4.

CONCLUSIONS

In this demonstration, we built disease-specific and protected health information-compliant LLMs using RAG. While LiVersa demonstrated higher accuracy in answering questions related to hepatology, there were some deficiencies due to limitations set by the number of documents used for RAG. LiVersa will likely require further refinement before potential live deployment. The LiVersa prototype, however, is a proof of concept for utilizing RAG to customize LLMs for clinical use cases.

Collapse

Raghu Subramanian C, Yang DA, Khanna R. Enhancing Health Care Communication With Large Language Models-The Role, Challenges, and Future Directions. JAMA Netw Open 2024;7:e240347. [PMID: 38466311 DOI: 10.1001/jamanetworkopen.2024.0347] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 03/12/2024] Open

Obradovich N, Johnson T, Paulus MP. Managerial and Organizational Challenges in the Age of AI. JAMA Psychiatry 2024;81:219-220. [PMID: 38265819 DOI: 10.1001/jamapsychiatry.2023.5247] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 01/25/2024]

Valizadeh A, Moassefi M, Nakhostin-Ansari A, Heidari Some'eh S, Hosseini-Asl H, Saghab Torbati M, Aghajani R, Maleki Ghorbani Z, Menbari-Oskouie I, Aghajani F, Mirzamohamadi A, Ghafouri M, Faghani S, Memari AH. Automated diagnosis of autism with artificial intelligence: State of the art. Rev Neurosci 2024;35:141-163. [PMID: 37678819 DOI: 10.1515/revneuro-2023-0050] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/26/2023] [Accepted: 07/28/2023] [Indexed: 09/09/2023]

Affiliation(s)

Amir Valizadeh Neuroscience Institute, Tehran University of Medical Sciences, PO: 1419733141, Tehran, Iran
Mana Moassefi Neuroscience Institute, Tehran University of Medical Sciences, PO: 1419733141, Tehran, Iran
Amin Nakhostin-Ansari Sports Medicine Research Center, Neuroscience Institute, Tehran University of Medical Sciences, PO: 14395578, Tehran, Iran
Soheil Heidari Some'eh Sports Medicine Research Center, Neuroscience Institute, Tehran University of Medical Sciences, PO: 14395578, Tehran, Iran Students' Scientific Research Center, Tehran University of Medical Sciences, PO: 1417755331, Tehran, Iran
Hossein Hosseini-Asl Sports Medicine Research Center, Neuroscience Institute, Tehran University of Medical Sciences, PO: 14395578, Tehran, Iran Students' Scientific Research Center, Tehran University of Medical Sciences, PO: 1417755331, Tehran, Iran
Mehrnush Saghab Torbati Islamic Azad University of Zahedan, PO: 9816743545, Zahedan, Iran
Reyhaneh Aghajani Sports Medicine Research Center, Neuroscience Institute, Tehran University of Medical Sciences, PO: 14395578, Tehran, Iran Students' Scientific Research Center, Tehran University of Medical Sciences, PO: 1417755331, Tehran, Iran
Zahra Maleki Ghorbani Sports Medicine Research Center, Neuroscience Institute, Tehran University of Medical Sciences, PO: 14395578, Tehran, Iran Students' Scientific Research Center, Tehran University of Medical Sciences, PO: 1417755331, Tehran, Iran
Iman Menbari-Oskouie Sports Medicine Research Center, Neuroscience Institute, Tehran University of Medical Sciences, PO: 14395578, Tehran, Iran
Faezeh Aghajani Sports Medicine Research Center, Neuroscience Institute, Tehran University of Medical Sciences, PO: 14395578, Tehran, Iran Research Development Center, Arash Women's Hospital, Tehran University of Medical Sciences, PO: 14695542, Tehran, Iran
Alireza Mirzamohamadi Sports Medicine Research Center, Neuroscience Institute, Tehran University of Medical Sciences, PO: 14395578, Tehran, Iran Students' Scientific Research Center, Tehran University of Medical Sciences, PO: 1417755331, Tehran, Iran
Mohammad Ghafouri Sports Medicine Research Center, Neuroscience Institute, Tehran University of Medical Sciences, PO: 14395578, Tehran, Iran
Shahriar Faghani Shariati Hospital, Department of Radiology, Tehran University of Medical Sciences, PO: 1411713135, Tehran, Iran Interdisciplinary Neuroscience Research Program (INRP), Tehran University of Medical Sciences, PO: 1416634793, Tehran, Iran
Amir Hossein Memari Sports Medicine Research Center, Neuroscience Institute, Tehran University of Medical Sciences, PO: 14395578, Tehran, Iran

Collapse

Shahab O, El Kurdi B, Shaukat A, Nadkarni G, Soroush A. Large language models: a primer and gastroenterology applications. Therap Adv Gastroenterol 2024;17:17562848241227031. [PMID: 38390029 PMCID: PMC10883116 DOI: 10.1177/17562848241227031] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 10/04/2023] [Accepted: 01/02/2024] [Indexed: 02/24/2024] Open

Han C, Kim DW, Kim S, Chan You S, Park JY, Bae S, Yoon D. Evaluation of GPT-4 for 10-year cardiovascular risk prediction: Insights from the UK Biobank and KoGES data. iScience 2024;27:109022. [PMID: 38357664 PMCID: PMC10865411 DOI: 10.1016/j.isci.2024.109022] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/11/2023] [Revised: 11/28/2023] [Accepted: 01/22/2024] [Indexed: 02/16/2024] Open

Bey R, Cohen A, Trebossen V, Dura B, Geoffroy PA, Jean C, Landman B, Petit-Jean T, Chatellier G, Sallah K, Tannier X, Bourmaud A, Delorme R. Natural language processing of multi-hospital electronic health records for public health surveillance of suicidality. NPJ MENTAL HEALTH RESEARCH 2024;3:6. [PMID: 38609541 PMCID: PMC10955903 DOI: 10.1038/s44184-023-00046-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 02/01/2023] [Accepted: 12/06/2023] [Indexed: 04/14/2024]

Abstract

There is an urgent need to monitor the mental health of large populations, especially during crises such as the COVID-19 pandemic, to timely identify the most at-risk subgroups and to design targeted prevention campaigns. We therefore developed and validated surveillance indicators related to suicidality: the monthly number of hospitalisations caused by suicide attempts and the prevalence among them of five known risks factors. They were automatically computed analysing the electronic health records of fifteen university hospitals of the Paris area, France, using natural language processing algorithms based on artificial intelligence. We evaluated the relevance of these indicators conducting a retrospective cohort study. Considering 2,911,920 records contained in a common data warehouse, we tested for changes after the pandemic outbreak in the slope of the monthly number of suicide attempts by conducting an interrupted time-series analysis. We segmented the assessment time in two sub-periods: before (August 1, 2017, to February 29, 2020) and during (March 1, 2020, to June 31, 2022) the COVID-19 pandemic. We detected 14,023 hospitalisations caused by suicide attempts. Their monthly number accelerated after the COVID-19 outbreak with an estimated trend variation reaching 3.7 (95%CI 2.1-5.3), mainly driven by an increase among girls aged 8-17 (trend variation 1.8, 95%CI 1.2-2.5). After the pandemic outbreak, acts of domestic, physical and sexual violence were more often reported (prevalence ratios: 1.3, 95%CI 1.16-1.48; 1.3, 95%CI 1.10-1.64 and 1.7, 95%CI 1.48-1.98), fewer patients died (p = 0.007) and stays were shorter (p < 0.001). Our study demonstrates that textual clinical data collected in multiple hospitals can be jointly analysed to compute timely indicators describing mental health conditions of populations. Our findings also highlight the need to better take into account the violence imposed on women, especially at early ages and in the aftermath of the COVID-19 pandemic.

Collapse

Affiliation(s)

Romain Bey Innovation and Data unit, IT Department, Assistance Publique-Hôpitaux de Paris, Paris, France
Ariel Cohen Innovation and Data unit, IT Department, Assistance Publique-Hôpitaux de Paris, Paris, France.
Vincent Trebossen Child and Adolescent Psychiatry Department, Robert Debré University Hospital, Assistance Publique-Hôpitaux de Paris, Paris, France
Basile Dura Innovation and Data unit, IT Department, Assistance Publique-Hôpitaux de Paris, Paris, France
Pierre-Alexis Geoffroy Département de psychiatrie et d'addictologie, GHU Paris Nord, DMU neurosciences, Bichat - Claude Bernard Hospital, Assistance Publique-Hôpitaux de Paris, 75018, Paris, France GHU Paris - psychiatry & neurosciences, 1, rue Cabanis, 75014, Paris, France NeuroDiderot, Inserm, FHU I2-D2, université Paris Cité, 75019, Paris, France CNRS UPR 3212, Institute for cellular and integrative neurosciences, 67000, Strasbourg, France
Charline Jean Innovation and Data unit, IT Department, Assistance Publique-Hôpitaux de Paris, Paris, France Université Paris-Est Créteil, INSERM, IMRB U955, Créteil, France Service Santé Publique & URC, Hôpital Henri Mondor, Assistance Publique-Hôpitaux de Paris, Créteil, France
Benjamin Landman Child and Adolescent Psychiatry Department, Robert Debré University Hospital, Assistance Publique-Hôpitaux de Paris, Paris, France
Thomas Petit-Jean Innovation and Data unit, IT Department, Assistance Publique-Hôpitaux de Paris, Paris, France
Gilles Chatellier Innovation and Data unit, IT Department, Assistance Publique-Hôpitaux de Paris, Paris, France Université Paris Cité, Paris, France
Kankoe Sallah URC PNVS, CIC-EC 1425, INSERM, Bichat - Claude Bernard Hospital, Assistance Publique-Hôpitaux de Paris, Paris, France
Xavier Tannier Sorbonne Université, Inserm, Université Sorbonne Paris Nord, Laboratoire d'Informatique Médicale et d'Ingénierie des Connaissances pour la e-Santé (LIMICS), Paris, France
Aurelie Bourmaud Université Paris Cité, Paris, France Clinical Epidemiology Unit, Robert Debré University Hospital, Assistance Publique-Hôpitaux de Paris, Paris, France CIC 1426, Inserm, Paris, France
Richard Delorme Child and Adolescent Psychiatry Department, Robert Debré University Hospital, Assistance Publique-Hôpitaux de Paris, Paris, France Human Genetics and Cognitive Functions, Institut Pasteur, Paris, France

Collapse

Boonstra MJ, Weissenbacher D, Moore JH, Gonzalez-Hernandez G, Asselbergs FW. Artificial intelligence: revolutionizing cardiology with large language models. Eur Heart J 2024;45:332-345. [PMID: 38170821 PMCID: PMC10834163 DOI: 10.1093/eurheartj/ehad838] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 06/20/2023] [Revised: 12/01/2023] [Accepted: 12/05/2023] [Indexed: 01/05/2024] Open

Liu X, Wu J, Shao A, Shen W, Ye P, Wang Y, Ye J, Jin K, Yang J. Uncovering Language Disparity of ChatGPT on Retinal Vascular Disease Classification: Cross-Sectional Study. J Med Internet Res 2024;26:e51926. [PMID: 38252483 PMCID: PMC10845019 DOI: 10.2196/51926] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/17/2023] [Revised: 10/07/2023] [Accepted: 11/30/2023] [Indexed: 01/23/2024] Open

Abstract

BACKGROUND

Benefiting from rich knowledge and the exceptional ability to understand text, large language models like ChatGPT have shown great potential in English clinical environments. However, the performance of ChatGPT in non-English clinical settings, as well as its reasoning, have not been explored in depth.

OBJECTIVE

This study aimed to evaluate ChatGPT's diagnostic performance and inference abilities for retinal vascular diseases in a non-English clinical environment.

METHODS

In this cross-sectional study, we collected 1226 fundus fluorescein angiography reports and corresponding diagnoses written in Chinese and tested ChatGPT with 4 prompting strategies (direct diagnosis or diagnosis with a step-by-step reasoning process and in Chinese or English).

RESULTS

Compared with ChatGPT using Chinese prompts for direct diagnosis that achieved an F1-score of 70.47%, ChatGPT using English prompts for direct diagnosis achieved the best diagnostic performance (80.05%), which was inferior to ophthalmologists (89.35%) but close to ophthalmologist interns (82.69%). As for its inference abilities, although ChatGPT can derive a reasoning process with a low error rate (0.4 per report) for both Chinese and English prompts, ophthalmologists identified that the latter brought more reasoning steps with less incompleteness (44.31%), misinformation (1.96%), and hallucinations (0.59%) (all P<.001). Also, analysis of the robustness of ChatGPT with different language prompts indicated significant differences in the recall (P=.03) and F1-score (P=.04) between Chinese and English prompts. In short, when prompted in English, ChatGPT exhibited enhanced diagnostic and inference capabilities for retinal vascular disease classification based on Chinese fundus fluorescein angiography reports.

CONCLUSIONS

ChatGPT can serve as a helpful medical assistant to provide diagnosis in non-English clinical environments, but there are still performance gaps, language disparities, and errors compared to professionals, which demonstrate the potential limitations and the need to continually explore more robust large language models in ophthalmology practice.

Collapse

Ong JCL, Seng BJJ, Law JZF, Low LL, Kwa ALH, Giacomini KM, Ting DSW. Artificial intelligence, ChatGPT, and other large language models for social determinants of health: Current state and future directions. Cell Rep Med 2024;5:101356. [PMID: 38232690 PMCID: PMC10829781 DOI: 10.1016/j.xcrm.2023.101356] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/03/2023] [Revised: 10/12/2023] [Accepted: 12/10/2023] [Indexed: 01/19/2024]

Schonfeld E, Mordekai N, Berg A, Johnstone T, Shah A, Shah V, Haider G, Marianayagam NJ, Veeravagu A. Machine Learning in Neurosurgery: Toward Complex Inputs, Actionable Predictions, and Generalizable Translations. Cureus 2024;16:e51963. [PMID: 38333513 PMCID: PMC10851045 DOI: 10.7759/cureus.51963] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/27/2023] [Accepted: 01/08/2024] [Indexed: 02/10/2024] Open

Wu X, Zhang B. ChatGPT promotes healthcare: current applications and potential challenges. Int J Surg 2024;110:606-608. [PMID: 37816164 PMCID: PMC10793836 DOI: 10.1097/js9.0000000000000802] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/10/2023] [Accepted: 09/17/2023] [Indexed: 10/12/2023]

Zack T, Lehman E, Suzgun M, Rodriguez JA, Celi LA, Gichoya J, Jurafsky D, Szolovits P, Bates DW, Abdulnour REE, Butte AJ, Alsentzer E. Assessing the potential of GPT-4 to perpetuate racial and gender biases in health care: a model evaluation study. Lancet Digit Health 2024;6:e12-e22. [PMID: 38123252 DOI: 10.1016/s2589-7500(23)00225-x] [Citation(s) in RCA: 17] [Impact Index Per Article: 17.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/17/2023] [Revised: 09/30/2023] [Accepted: 10/26/2023] [Indexed: 12/23/2023]

Abstract

BACKGROUND

Large language models (LLMs) such as GPT-4 hold great promise as transformative tools in health care, ranging from automating administrative tasks to augmenting clinical decision making. However, these models also pose a danger of perpetuating biases and delivering incorrect medical diagnoses, which can have a direct, harmful impact on medical care. We aimed to assess whether GPT-4 encodes racial and gender biases that impact its use in health care.

METHODS

Using the Azure OpenAI application interface, this model evaluation study tested whether GPT-4 encodes racial and gender biases and examined the impact of such biases on four potential applications of LLMs in the clinical domain-namely, medical education, diagnostic reasoning, clinical plan generation, and subjective patient assessment. We conducted experiments with prompts designed to resemble typical use of GPT-4 within clinical and medical education applications. We used clinical vignettes from NEJM Healer and from published research on implicit bias in health care. GPT-4 estimates of the demographic distribution of medical conditions were compared with true US prevalence estimates. Differential diagnosis and treatment planning were evaluated across demographic groups using standard statistical tests for significance between groups.

FINDINGS

We found that GPT-4 did not appropriately model the demographic diversity of medical conditions, consistently producing clinical vignettes that stereotype demographic presentations. The differential diagnoses created by GPT-4 for standardised clinical vignettes were more likely to include diagnoses that stereotype certain races, ethnicities, and genders. Assessment and plans created by the model showed significant association between demographic attributes and recommendations for more expensive procedures as well as differences in patient perception.

INTERPRETATION

Our findings highlight the urgent need for comprehensive and transparent bias assessments of LLM tools such as GPT-4 for intended use cases before they are integrated into clinical care. We discuss the potential sources of these biases and potential mitigation strategies before clinical implementation.

FUNDING

Priscilla Chan and Mark Zuckerberg.

Collapse

Affiliation(s)

Travis Zack Bakar Computational Health Sciences Institute, University of California San Francisco, San Francisco, CA, USA; Helen Diller Family Comprehensive Cancer Center, University of California San Francisco, San Francisco, CA, USA
Eric Lehman Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
Mirac Suzgun Department of Computer Science, Stanford University, Stanford, CA, USA; Stanford Law School, Stanford University, Stanford, CA, USA
Jorge A Rodriguez Division of General Internal Medicine, Brigham and Women's Hospital, Boston, MA, USA
Leo Anthony Celi Laboratory for Computational Physiology, Massachusetts Institute of Technology, Cambridge, MA, USA; Division of Pulmonary, Critical Care and Sleep Medicine, Beth Israel Deaconess Medical Center, Boston, MA, USA; Department of Biostatistics, Harvard T H Chan School of Public Health, Boston, MA, USA
Judy Gichoya Department of Radiology, Emory University, Atlanta, GA, USA
Dan Jurafsky Department of Computer Science, Stanford University, Stanford, CA, USA; Department of Linguistics, Stanford University, Stanford, CA, USA
Peter Szolovits Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
David W Bates Division of General Internal Medicine, Brigham and Women's Hospital, Boston, MA, USA; Department of Health Policy and Management, Harvard T H Chan School of Public Health, Boston, MA, USA
Raja-Elie E Abdulnour Division of Pulmonary and Critical Care Medicine, Brigham and Women's Hospital, Boston, MA, USA; Harvard Medical School, Boston, MA, USA
Atul J Butte Bakar Computational Health Sciences Institute, University of California San Francisco, San Francisco, CA, USA; Center for Data-Driven Insights and Innovation, University of California, Office of the President, Oakland, CA, USA
Emily Alsentzer Division of General Internal Medicine, Brigham and Women's Hospital, Boston, MA, USA; Harvard Medical School, Boston, MA, USA.

Collapse

Wang C, Liu S, Li A, Liu J. Text Dialogue Analysis for Primary Screening of Mild Cognitive Impairment: Development and Validation Study. J Med Internet Res 2023;25:e51501. [PMID: 38157230 PMCID: PMC10787336 DOI: 10.2196/51501] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/02/2023] [Revised: 09/28/2023] [Accepted: 11/27/2023] [Indexed: 01/03/2024] Open

Abstract

BACKGROUND

Artificial intelligence models tailored to diagnose cognitive impairment have shown excellent results. However, it is unclear whether large linguistic models can rival specialized models by text alone.

OBJECTIVE

In this study, we explored the performance of ChatGPT for primary screening of mild cognitive impairment (MCI) and standardized the design steps and components of the prompts.

METHODS

We gathered a total of 174 participants from the DementiaBank screening and classified 70% of them into the training set and 30% of them into the test set. Only text dialogues were kept. Sentences were cleaned using a macro code, followed by a manual check. The prompt consisted of 5 main parts, including character setting, scoring system setting, indicator setting, output setting, and explanatory information setting. Three dimensions of variables from published studies were included: vocabulary (ie, word frequency and word ratio, phrase frequency and phrase ratio, and lexical complexity), syntax and grammar (ie, syntactic complexity and grammatical components), and semantics (ie, semantic density and semantic coherence). We used R 4.3.0. for the analysis of variables and diagnostic indicators.

RESULTS

Three additional indicators related to the severity of MCI were incorporated into the final prompt for the model. These indicators were effective in discriminating between MCI and cognitively normal participants: tip-of-the-tongue phenomenon (P<.001), difficulty with complex ideas (P<.001), and memory issues (P<.001). The final GPT-4 model achieved a sensitivity of 0.8636, a specificity of 0.9487, and an area under the curve of 0.9062 on the training set; on the test set, the sensitivity, specificity, and area under the curve reached 0.7727, 0.8333, and 0.8030, respectively.

CONCLUSIONS

ChatGPT was effective in the primary screening of participants with possible MCI. Improved standardization of prompts by clinicians would also improve the performance of the model. It is important to note that ChatGPT is not a substitute for a clinician making a diagnosis.

Collapse

Koranteng E, Rao A, Flores E, Lev M, Landman A, Dreyer K, Succi M. Empathy and Equity: Key Considerations for Large Language Model Adoption in Health Care. JMIR MEDICAL EDUCATION 2023;9:e51199. [PMID: 38153778 PMCID: PMC10884892 DOI: 10.2196/51199] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/24/2023] [Revised: 10/01/2023] [Accepted: 10/14/2023] [Indexed: 12/29/2023]

Beaulieu-Jones BK, Villamar MF, Scordis P, Bartmann AP, Ali W, Wissel BD, Alsentzer E, de Jong J, Patra A, Kohane I. Predicting seizure recurrence after an initial seizure-like episode from routine clinical notes using large language models: a retrospective cohort study. Lancet Digit Health 2023;5:e882-e894. [PMID: 38000873 PMCID: PMC10695164 DOI: 10.1016/s2589-7500(23)00179-6] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/10/2023] [Revised: 08/08/2023] [Accepted: 08/31/2023] [Indexed: 11/26/2023]

Abstract

BACKGROUND

The evaluation and management of first-time seizure-like events in children can be difficult because these episodes are not always directly observed and might be epileptic seizures or other conditions (seizure mimics). We aimed to evaluate whether machine learning models using real-world data could predict seizure recurrence after an initial seizure-like event.

METHODS

This retrospective cohort study compared models trained and evaluated on two separate datasets between Jan 1, 2010, and Jan 1, 2020: electronic medical records (EMRs) at Boston Children's Hospital and de-identified, patient-level, administrative claims data from the IBM MarketScan research database. The study population comprised patients with an initial diagnosis of either epilepsy or convulsions before the age of 21 years, based on International Classification of Diseases, Clinical Modification (ICD-CM) codes. We compared machine learning-based predictive modelling using structured data (logistic regression and XGBoost) with emerging techniques in natural language processing by use of large language models.

FINDINGS

The primary cohort comprised 14 021 patients at Boston Children's Hospital matching inclusion criteria with an initial seizure-like event and the comparison cohort comprised 15 062 patients within the IBM MarketScan research database. Seizure recurrence based on a composite expert-derived definition occurred in 57% of patients at Boston Children's Hospital and 63% of patients within IBM MarketScan. Large language models with additional domain-specific and location-specific pre-training on patients excluded from the study (F1-score 0·826 [95% CI 0·817-0·835], AUC 0·897 [95% CI 0·875-0·913]) performed best. All large language models, including the base model without additional pre-training (F1-score 0·739 [95% CI 0·738-0·741], AUROC 0·846 [95% CI 0·826-0·861]) outperformed models trained with structured data. With structured data only, XGBoost outperformed logistic regression and XGBoost models trained with the Boston Children's Hospital EMR (logistic regression: F1-score 0·650 [95% CI 0·643-0·657], AUC 0·694 [95% CI 0·685-0·705], XGBoost: F1-score 0·679 [0·676-0·683], AUC 0·725 [0·717-0·734]) performed similarly to models trained on the IBM MarketScan database (logistic regression: F1-score 0·596 [0·590-0·601], AUC 0·670 [0·664-0·675], XGBoost: F1-score 0·678 [0·668-0·687], AUC 0·710 [0·703-0·714]).

INTERPRETATION

Physician's clinical notes about an initial seizure-like event include substantial signals for prediction of seizure recurrence, and additional domain-specific and location-specific pre-training can significantly improve the performance of clinical large language models, even for specialised cohorts.

FUNDING

UCB, National Institute of Neurological Disorders and Stroke (US National Institutes of Health).

Collapse

Overgaard SM, Graham MG, Brereton T, Pencina MJ, Halamka JD, Vidal DE, Economou-Zavlanos NJ. Implementing quality management systems to close the AI translation gap and facilitate safe, ethical, and effective health AI solutions. NPJ Digit Med 2023;6:218. [PMID: 38007604 PMCID: PMC10676432 DOI: 10.1038/s41746-023-00968-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/14/2023] [Accepted: 11/15/2023] [Indexed: 11/27/2023] Open

Peng C, Yang X, Chen A, Smith KE, PourNejatian N, Costa AB, Martin C, Flores MG, Zhang Y, Magoc T, Lipori G, Mitchell DA, Ospina NS, Ahmed MM, Hogan WR, Shenkman EA, Guo Y, Bian J, Wu Y. A study of generative large language model for medical research and healthcare. NPJ Digit Med 2023;6:210. [PMID: 37973919 PMCID: PMC10654385 DOI: 10.1038/s41746-023-00958-w] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/05/2023] [Accepted: 11/01/2023] [Indexed: 11/19/2023] Open

Affiliation(s)

Cheng Peng Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA
Xi Yang Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, FL, USA
Aokun Chen Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, FL, USA
Kaleb E Smith NVIDIA, Santa Clara, CA, USA
Nima PourNejatian NVIDIA, Santa Clara, CA, USA
Anthony B Costa NVIDIA, Santa Clara, CA, USA
Cheryl Martin NVIDIA, Santa Clara, CA, USA
Mona G Flores NVIDIA, Santa Clara, CA, USA
Ying Zhang Research Computing, University of Florida, Gainesville, FL, USA
Tanja Magoc Integrated Data Repository Research Services, University of Florida, Gainesville, FL, USA
Gloria Lipori Integrated Data Repository Research Services, University of Florida, Gainesville, FL, USA Lillian S. Wells Department of Neurosurgery, Clinical and Translational Science Institute, University of Florida, Gainesville, FL, USA
Duane A Mitchell Lillian S. Wells Department of Neurosurgery, Clinical and Translational Science Institute, University of Florida, Gainesville, FL, USA
Naykky S Ospina Division of Endocrinology, Department of Medicine, College of Medicine, University of Florida, Gainesville, FL, USA
Mustafa M Ahmed Division of Cardiovascular Medicine, Department of Medicine, College of Medicine, University of Florida, Gainesville, FL, USA
William R Hogan Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA
Elizabeth A Shenkman Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA
Yi Guo Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, FL, USA
Jiang Bian Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, FL, USA
Yonghui Wu Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA. Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, FL, USA.

Collapse

Ge J, Sun S, Owens J, Galvez V, Gologorskaya O, Lai JC, Pletcher MJ, Lai K. Development of a Liver Disease-Specific Large Language Model Chat Interface using Retrieval Augmented Generation. MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2023:2023.11.10.23298364. [PMID: 37986764 PMCID: PMC10659484 DOI: 10.1101/2023.11.10.23298364] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/22/2023]

Decker H, Trang K, Ramirez J, Colley A, Pierce L, Coleman M, Bongiovanni T, Melton GB, Wick E. Large Language Model-Based Chatbot vs Surgeon-Generated Informed Consent Documentation for Common Procedures. JAMA Netw Open 2023;6:e2336997. [PMID: 37812419 PMCID: PMC10562939 DOI: 10.1001/jamanetworkopen.2023.36997] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 07/08/2023] [Accepted: 08/29/2023] [Indexed: 10/10/2023] Open

Abstract

Importance

Informed consent is a critical component of patient care before invasive procedures, yet it is frequently inadequate. Electronic consent forms have the potential to facilitate patient comprehension if they provide information that is readable, accurate, and complete; it is not known if large language model (LLM)-based chatbots may improve informed consent documentation by generating accurate and complete information that is easily understood by patients.

Objective

To compare the readability, accuracy, and completeness of LLM-based chatbot- vs surgeon-generated information on the risks, benefits, and alternatives (RBAs) of common surgical procedures.

Design, Setting, and Participants

This cross-sectional study compared randomly selected surgeon-generated RBAs used in signed electronic consent forms at an academic referral center in San Francisco with LLM-based chatbot-generated (ChatGPT-3.5, OpenAI) RBAs for 6 surgical procedures (colectomy, coronary artery bypass graft, laparoscopic cholecystectomy, inguinal hernia repair, knee arthroplasty, and spinal fusion).

Main Outcomes and Measures

Readability was measured using previously validated scales (Flesh-Kincaid grade level, Gunning Fog index, the Simple Measure of Gobbledygook, and the Coleman-Liau index). Scores range from 0 to greater than 20 to indicate the years of education required to understand a text. Accuracy and completeness were assessed using a rubric developed with recommendations from LeapFrog, the Joint Commission, and the American College of Surgeons. Both composite and RBA subgroup scores were compared.

Results

The total sample consisted of 36 RBAs, with 1 RBA generated by the LLM-based chatbot and 5 RBAs generated by a surgeon for each of the 6 surgical procedures. The mean (SD) readability score for the LLM-based chatbot RBAs was 12.9 (2.0) vs 15.7 (4.0) for surgeon-generated RBAs (P = .10). The mean (SD) composite completeness and accuracy score was lower for surgeons' RBAs at 1.6 (0.5) than for LLM-based chatbot RBAs at 2.2 (0.4) (P < .001). The LLM-based chatbot scores were higher than the surgeon-generated scores for descriptions of the benefits of surgery (2.3 [0.7] vs 1.4 [0.7]; P < .001) and alternatives to surgery (2.7 [0.5] vs 1.4 [0.7]; P < .001). There was no significant difference in chatbot vs surgeon RBA scores for risks of surgery (1.7 [0.5] vs 1.7 [0.4]; P = .38).

Conclusions and Relevance

The findings of this cross-sectional study suggest that despite not being perfect, LLM-based chatbots have the potential to enhance informed consent documentation. If an LLM were embedded in electronic health records in a manner compliant with the Health Insurance Portability and Accountability Act, it could be used to provide personalized risk information while easing documentation burden for physicians.

Collapse

Jung KH. Uncover This Tech Term: Foundation Model. Korean J Radiol 2023;24:1038-1041. [PMID: 37793672 PMCID: PMC10550749 DOI: 10.3348/kjr.2023.0790] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/21/2023] [Accepted: 08/23/2023] [Indexed: 10/06/2023] Open

Brin D, Sorin V, Vaid A, Soroush A, Glicksberg BS, Charney AW, Nadkarni G, Klang E. Comparing ChatGPT and GPT-4 performance in USMLE soft skill assessments. Sci Rep 2023;13:16492. [PMID: 37779171 PMCID: PMC10543445 DOI: 10.1038/s41598-023-43436-9] [Citation(s) in RCA: 30] [Impact Index Per Article: 30.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/26/2023] [Accepted: 09/23/2023] [Indexed: 10/03/2023] Open

Cheung ATM, Nasir-Moin M, Oermann EK. ChatGPT and the Law of the Horse. THE AMERICAN JOURNAL OF BIOETHICS : AJOB 2023;23:55-57. [PMID: 37812113 DOI: 10.1080/15265161.2023.2250279] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 10/10/2023]

Robinson ML, Garibaldi BT, Lindquist MA. When Clinical Prediction Is Steering the Ship, Beware the Drift of Its Wake. Ann Intern Med 2023;176:1424-1425. [PMID: 37812777 DOI: 10.7326/m23-2345] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 10/11/2023] Open