1. Beil M, Moreno R, Fronczek J, Kogan Y, Moreno RPJ, Flaatten H, Guidet B, de Lange D, Leaver S, Nachshon A, van Heerden PV, Joskowicz L, Sviri S, Jung C, Szczeklik W. Prognosticating the outcome of intensive care in older patients - a narrative review. Ann Intensive Care 2024; 14:97. PMID: 38907141; PMCID: PMC11192712; DOI: 10.1186/s13613-024-01330-1.
Abstract
Prognosis determines major decisions regarding treatment for critically ill patients. Statistical models have been developed to predict the probability of survival and other outcomes of intensive care. Although they were trained on the characteristics of large patient cohorts, they often do not represent very old patients (age ≥ 80 years) appropriately. Moreover, the heterogeneity within this particular group impairs the utility of statistical predictions for informing decision-making in very old individuals. In addition to these methodological problems, the diversity of cultural attitudes, available resources as well as variations of legal and professional norms limit the generalisability of prediction models, especially in patients with complex multi-morbidity and pre-existing functional impairments. Thus, current approaches to prognosticating outcomes in very old patients are imperfect and can generate substantial uncertainty about optimal trajectories of critical care in the individual. This article presents the state of the art and new approaches to predicting outcomes of intensive care for these patients. Special emphasis has been given to the integration of predictions into the decision-making for individual patients. This requires quantification of prognostic uncertainty and a careful alignment of decisions with the preferences of patients, who might prioritise functional outcomes over survival. Since the performance of outcome predictions for the individual patient may improve over time, time-limited trials in intensive care may be an appropriate way to increase the confidence in decisions about life-sustaining treatment.
2. Yao JJ, Aggarwal M, Lopez RD, Namdari S. Current Concepts Review: Large Language Models in Orthopaedics: Definitions, Uses, and Limitations. J Bone Joint Surg Am 2024:00004623-990000000-01136. PMID: 38896652; DOI: 10.2106/jbjs.23.01417.
Abstract
➤ Large language models are a subset of artificial intelligence. Large language models are powerful tools that excel in natural language text processing and generation.
➤ There are many potential clinical, research, and educational applications of large language models in orthopaedics, but the development of these applications needs to be focused on patient safety and the maintenance of high standards.
➤ There are numerous methodological, ethical, and regulatory concerns with regard to the use of large language models. Orthopaedic surgeons need to be aware of the controversies and advocate for an alignment of these models with patient and caregiver priorities.
3. Volkmer S, Meyer-Lindenberg A, Schwarz E. Large language models in psychiatry: Opportunities and challenges. Psychiatry Res 2024; 339:116026. PMID: 38909412; DOI: 10.1016/j.psychres.2024.116026.
Abstract
The ability of Large Language Models (LLMs) to analyze and respond to freely written text is causing increasing excitement in the field of psychiatry; the application of such models presents unique opportunities and challenges for psychiatric applications. This review article seeks to offer a comprehensive overview of LLMs in psychiatry, covering their model architecture, potential use cases, and clinical considerations. LLM frameworks such as ChatGPT/GPT-4 are trained on huge amounts of text data and are sometimes fine-tuned for specific tasks. This opens up a wide range of possible psychiatric applications, such as accurately predicting individual patient risk factors for specific disorders, engaging in therapeutic intervention, and analyzing therapeutic material, to name a few. However, adoption in the psychiatric setting presents many challenges, including inherent limitations and biases in LLMs, concerns about explainability and privacy, and the potential damage resulting from produced misinformation. This review covers potential opportunities and limitations and highlights potential considerations when these models are applied in a real-world psychiatric context.
4. Longwell JB, Hirsch I, Binder F, Gonzalez Conchas GA, Mau D, Jang R, Krishnan RG, Grant RC. Performance of Large Language Models on Medical Oncology Examination Questions. JAMA Netw Open 2024; 7:e2417641. PMID: 38888919; PMCID: PMC11185976; DOI: 10.1001/jamanetworkopen.2024.17641.
Abstract
Importance: Large language models (LLMs) recently developed an unprecedented ability to answer questions. Studies of LLMs from other fields may not generalize to medical oncology, a high-stakes clinical setting requiring rapid integration of new information.
Objective: To evaluate the accuracy and safety of LLM answers on medical oncology examination questions.
Design, Setting, and Participants: This cross-sectional study was conducted between May 28 and October 11, 2023. The American Society of Clinical Oncology (ASCO) Oncology Self-Assessment Series on ASCO Connection, the European Society of Medical Oncology (ESMO) Examination Trial questions, and an original set of board-style medical oncology multiple-choice questions were presented to 8 LLMs.
Main Outcomes and Measures: The primary outcome was the percentage of correct answers. Medical oncologists evaluated the explanations provided by the best LLM for accuracy, classified the types of errors, and estimated the likelihood and extent of potential clinical harm.
Results: Proprietary LLM 2 correctly answered 125 of 147 questions (85.0%; 95% CI, 78.2%-90.4%; P < .001 vs random answering). Proprietary LLM 2 outperformed an earlier version, proprietary LLM 1, which correctly answered 89 of 147 questions (60.5%; 95% CI, 52.2%-68.5%; P < .001), and the best open-source LLM, Mixtral-8x7B-v0.1, which correctly answered 87 of 147 questions (59.2%; 95% CI, 50.0%-66.4%; P < .001). The explanations provided by proprietary LLM 2 contained no or minor errors for 138 of 147 questions (93.9%; 95% CI, 88.7%-97.2%). Incorrect responses were most commonly associated with errors in information retrieval, particularly with recent publications, followed by erroneous reasoning and reading comprehension. If acted upon in clinical practice, 18 of 22 incorrect answers (81.8%; 95% CI, 59.7%-94.8%) would have a medium or high likelihood of moderate to severe harm.
Conclusions and Relevance: In this cross-sectional study of the performance of LLMs on medical oncology examination questions, the best LLM answered questions with remarkable performance, although errors raised safety concerns. These results demonstrated an opportunity to develop and evaluate LLMs to improve health care clinician experiences and patient care, considering the potential impact on capabilities and safety.
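The statistical comparison above can be made concrete with an exact binomial test of accuracy against random answering. The sketch below is illustrative only; the 4-option assumption is mine (the question format is not stated in the abstract), and only the counts come from the abstract.

```python
# Sketch: testing an LLM's exam accuracy against random answering,
# using the counts reported in the abstract above.
from scipy.stats import binomtest

n_questions, n_correct = 147, 125   # proprietary LLM 2, from the abstract
p_random = 1 / 4                    # assumes 4-option questions (my assumption)

res = binomtest(n_correct, n_questions, p_random)
ci = res.proportion_ci(confidence_level=0.95)   # exact (Clopper-Pearson) CI
print(f"accuracy = {n_correct / n_questions:.1%}")          # 85.0%
print(f"95% CI   = ({ci.low:.3f}, {ci.high:.3f})")          # ~ (0.782, 0.904)
print(f"p-value vs random answering = {res.pvalue:.1e}")    # P < .001
```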
5. Amacher SA, Arpagaus A, Sahmer C, Becker C, Gross S, Urben T, Tisljar K, Sutter R, Marsch S, Hunziker S. Prediction of outcomes after cardiac arrest by a generative artificial intelligence model. Resusc Plus 2024; 18:100587. PMID: 38433764; PMCID: PMC10906512; DOI: 10.1016/j.resplu.2024.100587.
Abstract
Aims: To investigate the prognostic accuracy of a non-medical generative artificial intelligence model (Chat Generative Pre-Trained Transformer 4 - ChatGPT-4) as a novel aspect in predicting death and poor neurological outcome at hospital discharge, based on real-life data from cardiac arrest patients.
Methods: This prospective cohort study investigates the prognostic performance of ChatGPT-4 in predicting outcomes at hospital discharge of adult cardiac arrest patients admitted to intensive care at a large Swiss tertiary academic medical center (COMMUNICATE/PROPHETIC cohort study). We prompted ChatGPT-4 with sixteen prognostic parameters derived from established post-cardiac arrest scores for each patient. We compared ChatGPT-4 with three cardiac arrest scores (Out-of-Hospital Cardiac Arrest [OHCA], Cardiac Arrest Hospital Prognosis [CAHP], and PROgnostication using LOGistic regression model for Unselected adult cardiac arrest patients in the Early stages [PROLOGUE]) in terms of the area under the curve (AUC), sensitivity, specificity, positive and negative predictive values, and likelihood ratios for in-hospital mortality and poor neurological outcome.
Results: Mortality at hospital discharge was 43% (n = 309/713); 54% of patients (n = 387/713) had a poor neurological outcome. ChatGPT-4 showed good discrimination regarding in-hospital mortality with an AUC of 0.85, similar to the OHCA, CAHP, and PROLOGUE scores (AUCs of 0.82, 0.83, and 0.84, respectively). For poor neurological outcome, ChatGPT-4 showed a prediction similar to the post-cardiac arrest scores (AUC 0.83).
Conclusions: ChatGPT-4 showed a performance similar to validated post-cardiac arrest scores in predicting mortality and poor neurological outcome. However, more research is needed regarding illogical answers before an LLM could be incorporated into multimodal outcome prognostication after cardiac arrest.
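The discrimination comparison described above reduces to computing AUCs for each predictor against the observed outcome. The sketch below illustrates that step on synthetic stand-in data; only the cohort size and mortality rate are taken from the abstract, and the simulated "LLM probability" and "CAHP" values are invented placeholders, not study data.

```python
# Sketch: comparing discrimination (AUC) of an LLM-elicited risk estimate
# with a clinical score, mirroring the evaluation described above.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 713                                  # cohort size from the abstract
died = rng.binomial(1, 0.43, n)          # 43% in-hospital mortality

# Simulated stand-ins, constructed to correlate with the outcome.
llm_prob = np.clip(0.4 * died + rng.normal(0.3, 0.2, n), 0, 1)
cahp_score = 150 * died + rng.normal(100, 40, n)

print(f"LLM  AUC: {roc_auc_score(died, llm_prob):.2f}")
print(f"CAHP AUC: {roc_auc_score(died, cahp_score):.2f}")
```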
6. Perez-Lopez R, Ghaffari Laleh N, Mahmood F, Kather JN. A guide to artificial intelligence for cancer researchers. Nat Rev Cancer 2024; 24:427-441. PMID: 38755439; DOI: 10.1038/s41568-024-00694-7.
Abstract
Artificial intelligence (AI) has been commoditized. It has evolved from a specialty resource to a readily accessible tool for cancer researchers. AI-based tools can boost research productivity in daily workflows, but can also extract hidden information from existing data, thereby enabling new scientific discoveries. Building a basic literacy in these tools is useful for every cancer researcher. Researchers with a traditional biological science focus can use AI-based tools through off-the-shelf software, whereas those who are more computationally inclined can develop their own AI-based software pipelines. In this article, we provide a practical guide for non-computational cancer researchers to understand how AI-based tools can benefit them. We convey general principles of AI for applications in image analysis, natural language processing and drug discovery. In addition, we give examples of how non-computational researchers can get started on the journey to productively use AI in their own work.
7. Xu H, Usuyama N, Bagga J, Zhang S, Rao R, Naumann T, Wong C, Gero Z, González J, Gu Y, Xu Y, Wei M, Wang W, Ma S, Wei F, Yang J, Li C, Gao J, Rosemon J, Bower T, Lee S, Weerasinghe R, Wright BJ, Robicsek A, Piening B, Bifulco C, Wang S, Poon H. A whole-slide foundation model for digital pathology from real-world data. Nature 2024; 630:181-188. PMID: 38778098; PMCID: PMC11153137; DOI: 10.1038/s41586-024-07441-w.
Abstract
Digital pathology poses unique computational challenges, as a standard gigapixel slide may comprise tens of thousands of image tiles. Prior models have often resorted to subsampling a small portion of tiles for each slide, thus missing the important slide-level context. Here we present Prov-GigaPath, a whole-slide pathology foundation model pretrained on 1.3 billion 256 × 256 pathology image tiles in 171,189 whole slides from Providence, a large US health network comprising 28 cancer centres. The slides originated from more than 30,000 patients covering 31 major tissue types. To pretrain Prov-GigaPath, we propose GigaPath, a novel vision transformer architecture for pretraining gigapixel pathology slides. To scale GigaPath for slide-level learning with tens of thousands of image tiles, GigaPath adapts the newly developed LongNet method to digital pathology. To evaluate Prov-GigaPath, we construct a digital pathology benchmark comprising 9 cancer subtyping tasks and 17 pathomics tasks, using both Providence and TCGA data. With large-scale pretraining and ultra-large-context modelling, Prov-GigaPath attains state-of-the-art performance on 25 out of 26 tasks, with significant improvement over the second-best method on 18 tasks. We further demonstrate the potential of Prov-GigaPath on vision-language pretraining for pathology by incorporating the pathology reports. In sum, Prov-GigaPath is an open-weight foundation model that achieves state-of-the-art performance on various digital pathology tasks, demonstrating the importance of real-world data and whole-slide modelling.
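The 256 × 256 tiling step described above is the common preprocessing stage for whole-slide models. The sketch below illustrates the tiling logic only, under stated assumptions: a real pipeline would read the slide pyramid with a library such as OpenSlide, and the near-white background filter and its threshold are illustrative heuristics, not the paper's method.

```python
# Sketch: cutting a whole-slide image region into non-overlapping 256x256
# tiles. A NumPy array stands in for one level of a slide pyramid.
import numpy as np

TILE = 256

def tile_slide(slide: np.ndarray, tile: int = TILE):
    """Yield (row, col, tile_array) for every full tile in the region."""
    h, w = slide.shape[:2]
    for r in range(0, h - tile + 1, tile):
        for c in range(0, w - tile + 1, tile):
            patch = slide[r:r + tile, c:c + tile]
            # Skip mostly-background (near-white) tiles, a common heuristic.
            if patch.mean() < 220:
                yield r, c, patch

slide = np.random.randint(0, 255, (4096, 4096, 3), dtype=np.uint8)
tiles = list(tile_slide(slide))
print(f"{len(tiles)} tissue tiles from one {slide.shape[0]}x{slide.shape[1]} region")
```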
8. Akhondi-Asl A, Yang Y, Luchette M, Burns JP, Mehta NM, Geva A. Comparing the Quality of Domain-Specific Versus General Language Models for Artificial Intelligence-Generated Differential Diagnoses in PICU Patients. Pediatr Crit Care Med 2024; 25:e273-e282. PMID: 38329382; DOI: 10.1097/pcc.0000000000003468.
Abstract
OBJECTIVES: Generative language models (LMs) are being evaluated in a variety of tasks in healthcare, but pediatric critical care studies are scant. Our objective was to evaluate the utility of generative LMs in the pediatric critical care setting and to determine whether domain-adapted LMs can outperform much larger general-domain LMs in generating a differential diagnosis from the admission notes of PICU patients.
DESIGN: Single-center retrospective cohort study.
SETTING: Quaternary 40-bed PICU.
PATIENTS: Notes from all patients admitted to the PICU between January 2012 and April 2023 were used for model development. One hundred thirty randomly selected admission notes were used for evaluation.
INTERVENTIONS: None.
MEASUREMENTS AND MAIN RESULTS: Five experts in critical care used a 5-point Likert scale to independently evaluate the overall quality of differential diagnoses: 1) written by the clinician in the original notes, 2) generated by two general LMs (BioGPT-Large and LLaMa-65B), and 3) generated by two fine-tuned models (fine-tuned BioGPT-Large and fine-tuned LLaMa-7B). Differences among differential diagnoses were compared using mixed methods regression models. We used 1,916,538 notes from 32,454 unique patients for model development and validation. The mean quality scores of the differential diagnoses generated by the clinicians and fine-tuned LLaMa-7B, the best-performing LM, were 3.43 and 2.88, respectively (absolute difference 0.54 units [95% CI, 0.37-0.72], p < 0.001). Fine-tuned LLaMa-7B performed better than LLaMa-65B (absolute difference 0.23 unit [95% CI, 0.06-0.41], p = 0.009) and BioGPT-Large (absolute difference 0.86 unit [95% CI, 0.69-1.0], p < 0.001). The differential diagnoses generated by clinicians and fine-tuned LLaMa-7B were ranked as the highest quality in 144 (55%) and 74 (29%) cases, respectively.
CONCLUSIONS: A smaller LM fine-tuned using notes of PICU patients outperformed much larger models trained on general-domain data. Currently, LMs remain inferior but may serve as an adjunct to human clinicians in real-world tasks using real-world data.
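A minimal sketch of the kind of domain adaptation described above, assuming the Hugging Face transformers/peft stack. This is not the study's training recipe: distilgpt2 stands in for the LLaMa-7B/BioGPT-Large models actually fine-tuned, low-rank adapters (LoRA) are one common parameter-efficient choice rather than the paper's stated method, and the training pair is invented.

```python
# Sketch: parameter-efficient fine-tuning of a small causal LM on
# note -> differential-diagnosis pairs.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

MODEL_ID = "distilgpt2"   # small public stand-in for a 7B-class model

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# Low-rank adapters keep the trainable parameter count small enough for
# single-GPU adaptation. fan_in_fan_out=True matches GPT-2's Conv1D layers.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["c_attn"], fan_in_fan_out=True,
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()

# One invented, de-identified training pair in a simple instruction format;
# a real run would iterate a causal-LM loss over many such pairs.
text = ("Admission note: 3-year-old with fever, barking cough, and stridor.\n"
        "Differential diagnosis: croup; bacterial tracheitis; epiglottitis")
batch = tokenizer(text, return_tensors="pt")
```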
9. Bryant AK, Zamora-Resendiz R, Dai X, Morrow D, Lin Y, Jungles KM, Rae JM, Tate A, Pearson AN, Jiang R, Fritsche L, Lawrence TS, Zou W, Schipper M, Ramnath N, Yoo S, Crivelli S, Green MD. Artificial intelligence to unlock real-world evidence in clinical oncology: A primer on recent advances. Cancer Med 2024; 13:e7253. PMID: 38899720; PMCID: PMC11187737; DOI: 10.1002/cam4.7253.
Abstract
PURPOSE: Real-world evidence is crucial to understanding the diffusion of new oncologic therapies, monitoring cancer outcomes, and detecting unexpected toxicities. In practice, real-world evidence is challenging to collect rapidly and comprehensively, often requiring expensive and time-consuming manual case-finding and annotation of clinical text. In this review, we summarize recent developments in the use of artificial intelligence to collect and analyze real-world evidence in oncology.
METHODS: We performed a narrative review of the major current trends and recent literature on artificial intelligence applications in oncology.
RESULTS: Artificial intelligence (AI) approaches are increasingly used to efficiently phenotype patients and tumors at large scale. These tools may also provide novel biological insights and improve risk prediction through multimodal integration of radiographic, pathological, and genomic datasets. Custom language-processing pipelines and large language models hold great promise for clinical prediction and phenotyping.
CONCLUSIONS: Despite rapid advances, continued progress in computation, generalizability, interpretability, and reliability, as well as prospective validation, is needed to integrate AI approaches into routine clinical care and real-time monitoring of novel therapies.
10. Pais C, Liu J, Voigt R, Gupta V, Wade E, Bayati M. Large language models for preventing medication direction errors in online pharmacies. Nat Med 2024; 30:1574-1582. PMID: 38664535; PMCID: PMC11186789; DOI: 10.1038/s41591-024-02933-8.
Abstract
Errors in pharmacy medication directions, such as incorrect instructions for dosage or frequency, can increase patient safety risk substantially by raising the chances of adverse drug events. This study explores how integrating domain knowledge with large language models (LLMs)-capable of sophisticated text interpretation and generation-can reduce these errors. We introduce MEDIC (medication direction copilot), a system that emulates the reasoning of pharmacists by prioritizing precise communication of core clinical components of a prescription, such as dosage and frequency. It fine-tunes a first-generation LLM using 1,000 expert-annotated and augmented directions from Amazon Pharmacy to extract the core components and assembles them into complete directions using pharmacy logic and safety guardrails. We compared MEDIC against two LLM-based benchmarks: one leveraging 1.5 million medication directions and the other using state-of-the-art LLMs. On 1,200 expert-reviewed prescriptions, the two benchmarks respectively recorded 1.51 (confidence interval (CI) 1.03, 2.31) and 4.38 (CI 3.13, 6.64) times more near-miss events-errors caught and corrected before reaching the patient-than MEDIC. Additionally, we tested MEDIC by deploying within the production system of an online pharmacy, and during this experimental period, it reduced near-miss events by 33% (CI 26%, 40%). This study shows that LLMs, with domain expertise and safeguards, improve the accuracy and efficiency of pharmacy operations.
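The "extract core components, then assemble with safety guardrails" pattern described above might look roughly like the sketch below. This is not the MEDIC implementation: in the real system an LLM extracts the components, whereas here a parsed example stands in, and the dataclass, the frequency threshold, and the assembly rules are illustrative assumptions rather than clinical logic.

```python
# Sketch: assembling a medication direction from validated core components.
from dataclasses import dataclass

@dataclass
class Direction:
    verb: str              # e.g., "take"
    dose_qty: float        # e.g., 1
    dose_unit: str         # e.g., "tablet"
    frequency_per_day: int

MAX_DAILY_FREQ = 4   # illustrative guardrail threshold, not a clinical rule

def assemble(d: Direction) -> str:
    # Guardrails: reject directions whose core components are implausible.
    if d.dose_qty <= 0:
        raise ValueError("dose must be positive")
    if d.frequency_per_day > MAX_DAILY_FREQ:
        raise ValueError("frequency exceeds guardrail; route to pharmacist")
    times = {1: "once", 2: "twice"}.get(d.frequency_per_day,
                                        f"{d.frequency_per_day} times")
    unit = d.dose_unit + ("s" if d.dose_qty != 1 else "")
    return f"{d.verb.capitalize()} {d.dose_qty:g} {unit} by mouth {times} daily."

print(assemble(Direction("take", 1, "tablet", 2)))
# -> "Take 1 tablet by mouth twice daily."
```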
11. Glicksberg BS, Timsina P, Patel D, Sawant A, Vaid A, Raut G, Charney AW, Apakama D, Carr BG, Freeman R, Nadkarni GN, Klang E. Evaluating the accuracy of a state-of-the-art large language model for prediction of admissions from the emergency room. J Am Med Inform Assoc 2024:ocae103. PMID: 38771093; DOI: 10.1093/jamia/ocae103.
Abstract
BACKGROUND: Artificial intelligence (AI) and large language models (LLMs) can play a critical role in emergency room operations by augmenting decision-making about patient admission. However, there have been no studies of LLMs on real-world data and scenarios that compare them with, and inform them by, traditional supervised machine learning (ML) models. We evaluated the performance of GPT-4 for predicting patient admissions from emergency department (ED) visits, comparing its performance to traditional ML models both naively and when informed by few-shot examples and/or numerical probabilities.
METHODS: We conducted a retrospective study using electronic health records across 7 NYC hospitals. We trained Bio-Clinical-BERT and XGBoost (XGB) models on unstructured and structured data, respectively, and created an ensemble model reflecting ML performance. We then assessed GPT-4's capabilities in several scenarios: zero-shot, few-shot with and without retrieval-augmented generation (RAG), and with and without ML numerical probabilities.
RESULTS: The ensemble ML model achieved an area under the receiver operating characteristic curve (AUC) of 0.88, an area under the precision-recall curve (AUPRC) of 0.72, and an accuracy of 82.9%. The naïve GPT-4's performance (0.79 AUC, 0.48 AUPRC, and 77.5% accuracy) showed substantial improvement when given limited, relevant data to learn from (ie, RAG) and underlying ML probabilities (0.87 AUC, 0.71 AUPRC, and 83.1% accuracy). Interestingly, RAG alone boosted performance to near peak levels (0.82 AUC, 0.56 AUPRC, and 81.3% accuracy).
CONCLUSIONS: The naïve LLM had limited performance but showed significant improvement in predicting ED admissions when supplemented with real-world examples to learn from, particularly through RAG, and/or numerical probabilities from traditional ML models. Its peak performance, although slightly lower than that of the pure ML model, is noteworthy given its potential for providing reasoning behind predictions. Further refinement of LLMs with real-world data is necessary for successful integration as decision-support tools in care settings.
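A schematic of the few-shot RAG setup described above: retrieve the most similar past visits, assemble them as in-context examples, and hand the resulting prompt to an LLM. Everything here is an illustrative assumption rather than the study's pipeline: the TF-IDF retriever, the toy notes, and the `call_llm` placeholder (any chat-completion client) are mine.

```python
# Sketch: retrieval-augmented few-shot prompting for admission prediction.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

past_notes = [
    ("elderly patient, chest pain radiating to arm, diaphoresis", "ADMIT"),
    ("young adult, ankle sprain after sport, stable vitals", "DISCHARGE"),
    ("shortness of breath, O2 sat 88%, history of COPD", "ADMIT"),
]
new_note = "acute chest pain, sweating, nausea"

vec = TfidfVectorizer().fit([n for n, _ in past_notes] + [new_note])
sims = cosine_similarity(vec.transform([new_note]),
                         vec.transform([n for n, _ in past_notes]))[0]
top = sorted(zip(sims, past_notes), reverse=True)[:2]   # few-shot examples

shots = "\n".join(f"Note: {n}\nOutcome: {y}" for _, (n, y) in top)
prompt = (f"{shots}\n\nNote: {new_note}\n"
          "Outcome (ADMIT or DISCHARGE), with a probability:")
print(prompt)
# response = call_llm(prompt)   # placeholder for any chat-completion client
```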
12. Murugan M, Yuan B, Venner E, Ballantyne CM, Robinson KM, Coons JC, Wang L, Empey PE, Gibbs RA. Empowering personalized pharmacogenomics with generative AI solutions. J Am Med Inform Assoc 2024; 31:1356-1366. PMID: 38447590; PMCID: PMC11105140; DOI: 10.1093/jamia/ocae039.
Abstract
OBJECTIVE: This study evaluates an AI assistant developed using OpenAI's GPT-4 for interpreting pharmacogenomic (PGx) testing results, aiming to improve decision-making and knowledge sharing in clinical genetics and to enhance patient care with equitable access.
MATERIALS AND METHODS: The AI assistant employs retrieval-augmented generation (RAG), which combines retrieval and generative techniques, by harnessing a knowledge base (KB) that comprises data from the Clinical Pharmacogenetics Implementation Consortium (CPIC). It uses context-aware GPT-4 to generate tailored responses to user queries from this KB, further refined through prompt engineering and guardrails.
RESULTS: Evaluated against a specialized PGx question catalog, the AI assistant showed high efficacy in addressing user queries. Compared with OpenAI's ChatGPT 3.5, it demonstrated better performance, especially in provider-specific queries requiring specialized data and citations. Key areas for improvement include enhancing accuracy, relevancy, and representative language in responses.
DISCUSSION: The integration of context-aware GPT-4 with RAG significantly enhanced the AI assistant's utility. RAG's ability to incorporate domain-specific CPIC data, including recent literature, proved beneficial. Challenges persist, such as the need for specialized genetic/PGx models to improve accuracy and relevancy and addressing ethical, regulatory, and safety concerns.
CONCLUSION: This study underscores generative AI's potential for transforming healthcare provider support and patient accessibility to complex pharmacogenomic information. While careful implementation of large language models like GPT-4 is necessary, it is clear that they can substantially improve understanding of pharmacogenomic data. With further development, these tools could augment healthcare expertise, provider productivity, and the delivery of equitable, patient-centered healthcare services.
13. Bitterman DS, Downing A, Maués J, Lustberg M. Promise and Perils of Large Language Models for Cancer Survivorship and Supportive Care. J Clin Oncol 2024; 42:1607-1611. PMID: 38452323; PMCID: PMC11095890; DOI: 10.1200/jco.23.02439.
Abstract
A call to action to bring stakeholders together to plan for the future of LLM-enhanced cancer survivorship.
14. Friedman AB, Delgado MK, Weissman GE. Artificial Intelligence for Emergency Care Triage - Much Promise, but Still Much to Learn. JAMA Netw Open 2024; 7:e248857. PMID: 38713470; DOI: 10.1001/jamanetworkopen.2024.8857.
15. Alter IL, Chan K, Lechien J, Rameau A. An introduction to machine learning and generative artificial intelligence for otolaryngologists-head and neck surgeons: a narrative review. Eur Arch Otorhinolaryngol 2024; 281:2723-2731. PMID: 38393353; DOI: 10.1007/s00405-024-08512-4.
Abstract
PURPOSE: Despite the robust expansion of research surrounding artificial intelligence (AI) and machine learning (ML) and their applications to medicine, these methodologies often remain opaque and inaccessible to many otolaryngologists. In particular, with the increasing ubiquity of large language models (LLMs) such as ChatGPT and their potential implementation in clinical practice, clinicians may benefit from a baseline understanding of AI. In this narrative review, we seek to clarify underlying concepts, illustrate applications to otolaryngology, and highlight the future directions and limitations of these tools.
METHODS: Recent literature regarding AI principles and otolaryngologic applications of ML and LLMs was reviewed via searches in PubMed and Google Scholar.
RESULTS: Significant recent strides have been made in otolaryngology research utilizing AI and ML, across all subspecialties, including neurotology, head and neck oncology, laryngology, rhinology, and sleep surgery. Potential applications suggested by recent publications include screening and diagnosis, predictive tools, clinical decision support, and clinical workflow improvement via LLMs. Ongoing concerns regarding AI in medicine include ethical concerns around bias and data sharing, as well as the "black box" problem and limitations in explainability.
CONCLUSIONS: Potential implementations of AI in otolaryngology are rapidly expanding. While implementation in clinical practice remains theoretical for most of these tools, their potential power to influence the practice of otolaryngology is substantial.
LEVEL OF EVIDENCE: 4
16. Wu J, Wu X, Qiu Z, Li M, Lin S, Zhang Y, Zheng Y, Yuan C, Yang J. Large language models leverage external knowledge to extend clinical insight beyond language boundaries. J Am Med Inform Assoc 2024:ocae079. PMID: 38684792; DOI: 10.1093/jamia/ocae079.
Abstract
OBJECTIVES: Large Language Models (LLMs) such as ChatGPT and Med-PaLM have excelled in various medical question-answering tasks. However, these English-centric models encounter challenges in non-English clinical settings, primarily due to limited clinical knowledge in the respective languages, a consequence of imbalanced training corpora. We systematically evaluate LLMs in the Chinese medical context and develop a novel in-context learning framework to enhance their performance.
MATERIALS AND METHODS: The latest China National Medical Licensing Examination (CNMLE-2022) served as the benchmark. We collected 53 medical books and 381,149 medical questions to construct the medical knowledge base and question bank. The proposed Knowledge and Few-shot Enhancement In-context Learning (KFE) framework leverages the in-context learning ability of LLMs to integrate diverse external clinical knowledge sources. We evaluated KFE with ChatGPT (GPT-3.5), GPT-4, Baichuan2-7B, Baichuan2-13B, and QWEN-72B on CNMLE-2022 and further investigated the effectiveness of different pathways for incorporating LLMs with medical knowledge from 7 distinct perspectives.
RESULTS: Directly applying ChatGPT failed to qualify for the CNMLE-2022 at a score of 51. Combined with the KFE framework, LLMs of varying sizes yielded consistent and significant improvements. ChatGPT's performance surged to 70.04 and GPT-4 achieved the highest score of 82.59. This surpasses the qualification threshold (60) and exceeds the average human score of 68.70, affirming the effectiveness and robustness of the framework. It also enabled a smaller Baichuan2-13B to pass the examination, showcasing great potential in low-resource settings.
DISCUSSION AND CONCLUSION: This study sheds light on optimal practices for enhancing the capabilities of LLMs in non-English medical scenarios. By synergizing medical knowledge through in-context learning, LLMs can extend clinical insight beyond language barriers in healthcare, significantly reducing the language-related disparities of LLM applications and ensuring global benefit in this field.
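In the spirit of the knowledge-and-few-shot prompting described above, the sketch below concatenates retrieved knowledge snippets and solved exemplars ahead of a target exam question. All strings are invented placeholders; the actual KFE framework's retrieval and prompt formatting are richer than this illustration.

```python
# Sketch: assembling a knowledge- and few-shot-enhanced exam prompt.
knowledge = [
    "Textbook: first-line treatment for uncomplicated hypertension includes...",
    "Textbook: ACE inhibitors are contraindicated in pregnancy...",
]
exemplars = [
    "Q: Which drug class is avoided in pregnant hypertensive patients?\n"
    "Options: A) ACEI B) Methyldopa C) Labetalol D) Nifedipine\nAnswer: A",
]
question = ("Q: A pregnant patient with hypertension should NOT receive:\n"
            "Options: A) Methyldopa B) Enalapril C) Labetalol D) Nifedipine")

prompt = "\n\n".join(
    ["Relevant knowledge:"] + knowledge
    + ["Solved examples:"] + exemplars
    + [question, "Answer with a single letter."]
)
print(prompt)   # send to the LLM of choice via its chat API
```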
17. Wang L, Wan Z, Ni C, Song Q, Li Y, Clayton EW, Malin BA, Yin Z. A Systematic Review of ChatGPT and Other Conversational Large Language Models in Healthcare. medRxiv [preprint] 2024:2024.04.26.24306390. PMID: 38712148; PMCID: PMC11071576; DOI: 10.1101/2024.04.26.24306390.
Abstract
Background: The launch of the Chat Generative Pre-trained Transformer (ChatGPT) in November 2022 has attracted public attention and academic interest to large language models (LLMs), facilitating the emergence of many other innovative LLMs. These LLMs have been applied in various fields, including healthcare. Numerous studies have since been conducted regarding how to employ state-of-the-art LLMs in health-related scenarios to assist patients, doctors, and public health administrators.
Objective: This review aims to summarize the applications and concerns of applying conversational LLMs in healthcare and provide an agenda for future research on LLMs in healthcare.
Methods: We utilized PubMed, ACM, and IEEE digital libraries as primary sources for this review. We followed the guidance of Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) to screen and select peer-reviewed research articles that (1) were related to both healthcare applications and conversational LLMs and (2) were published before September 1, 2023, the date when we started paper collection and screening. We investigated these papers and classified them according to their applications and concerns.
Results: Our search initially identified 820 papers according to targeted keywords, out of which 65 papers met our criteria and were included in the review. The most popular conversational LLM was ChatGPT from OpenAI (60), followed by Bard from Google (1), Large Language Model Meta AI (LLaMA) from Meta (1), and other LLMs (5). These papers were classified into four categories of applications: (1) summarization, (2) medical knowledge inquiry, (3) prediction, and (4) administration, and four categories of concerns: (1) reliability, (2) bias, (3) privacy, and (4) public acceptability. There were 49 (75%) research papers using LLMs for summarization and/or medical knowledge inquiry, and 58 (89%) research papers expressing concerns about reliability and/or bias. We found that conversational LLMs exhibit promising results in summarization and in providing medical knowledge to patients with relatively high accuracy. However, conversational LLMs like ChatGPT are not able to provide reliable answers to complex health-related tasks that require specialized domain expertise. Additionally, no experiments in the reviewed papers were conducted to thoughtfully examine how conversational LLMs lead to bias or privacy issues in healthcare research.
Conclusions: Future studies should focus on improving the reliability of LLM applications in complex health-related tasks, as well as investigating the mechanisms by which LLM applications introduce bias and privacy issues. Considering the broad accessibility of LLMs, legal, social, and technical efforts are all needed to address these concerns and to promote, improve, and regulate the application of LLMs in healthcare.
Affiliation(s)
- Leyao Wang
  - Department of Computer Science, Vanderbilt University, Nashville, TN, USA, 37212
- Zhiyu Wan
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, TN, USA, 37203
- Congning Ni
  - Department of Computer Science, Vanderbilt University, Nashville, TN, USA, 37212
- Qingyuan Song
  - Department of Computer Science, Vanderbilt University, Nashville, TN, USA, 37212
- Yang Li
  - Department of Computer Science, Vanderbilt University, Nashville, TN, USA, 37212
- Ellen Wright Clayton
  - Department of Pediatrics, Vanderbilt University Medical Center, Nashville, Tennessee, USA, 37203
  - Center for Biomedical Ethics and Society, Vanderbilt University Medical Center, Nashville, Tennessee, USA, 37203
- Bradley A. Malin
  - Department of Computer Science, Vanderbilt University, Nashville, TN, USA, 37212
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, TN, USA, 37203
  - Department of Biostatistics, Vanderbilt University Medical Center, TN, USA, 37203
- Zhijun Yin
  - Department of Computer Science, Vanderbilt University, Nashville, TN, USA, 37212
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, TN, USA, 37203
18
Gilbert S, Kather JN, Hogan A. Augmented non-hallucinating large language models as medical information curators. NPJ Digit Med 2024; 7:100. [PMID: 38654142 DOI: 10.1038/s41746-024-01081-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/18/2024] [Accepted: 03/14/2024] [Indexed: 04/25/2024] Open
Affiliation(s)
- Stephen Gilbert
  - Else Kröner Fresenius Center for Digital Health, TUD Dresden University of Technology, Dresden, Germany
- Jakob Nikolas Kather
  - Else Kröner Fresenius Center for Digital Health, TUD Dresden University of Technology, Dresden, Germany
- Aidan Hogan
  - Department of Computer Science, Universidad de Chile, Santiago, Chile
  - Millennium Institute for Foundational Research on Data, DCC, Universidad de Chile, Santiago, Chile
19
Qu Y, Wei C, Du P, Che W, Zhang C, Ouyang W, Bian Y, Xu F, Hu B, Du K, Wu H, Liu J, Liu Q. Integration of cognitive tasks into artificial general intelligence test for large models. iScience 2024; 27:109550. [PMID: 38595796 PMCID: PMC11001637 DOI: 10.1016/j.isci.2024.109550] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 04/11/2024] Open
Abstract
During the evolution of large models, performance evaluation is necessary for assessing their capabilities. However, current model evaluations mainly rely on specific tasks and datasets, lacking a unified framework for assessing the multidimensional intelligence of large models. In this perspective, we advocate for a comprehensive framework of cognitive science-inspired artificial general intelligence (AGI) tests, covering crystallized, fluid, social, and embodied intelligence. The AGI tests consist of well-designed cognitive tests adapted from human intelligence tests, which are then encapsulated in an immersive virtual community. We propose increasing the complexity of AGI testing tasks commensurate with advancements in large models and emphasize the necessity of interpreting test results to avoid false negatives and false positives. We believe that cognitive science-inspired AGI tests will effectively guide the targeted improvement of large models in specific dimensions of intelligence and accelerate their integration into human society.
Affiliation(s)
- Youzhi Qu
  - Department of Biomedical Engineering, Southern University of Science and Technology, Shenzhen 518055, China
- Chen Wei
  - Department of Biomedical Engineering, Southern University of Science and Technology, Shenzhen 518055, China
- Penghui Du
  - Department of Biomedical Engineering, Southern University of Science and Technology, Shenzhen 518055, China
- Wenxin Che
  - Department of Biomedical Engineering, Southern University of Science and Technology, Shenzhen 518055, China
- Chi Zhang
  - Department of Biomedical Engineering, Southern University of Science and Technology, Shenzhen 518055, China
- Feiyang Xu
  - iFLYTEK AI Research, Hefei 230088, China
- Bin Hu
  - School of Medical Technology, Beijing Institute of Technology, Beijing 100081, China
- Kai Du
  - Institute for Artificial Intelligence, Peking University, Beijing 100871, China
- Haiyan Wu
  - Centre for Cognitive and Brain Sciences and Department of Psychology, University of Macau, Macau 999078, China
- Jia Liu
  - Department of Psychology, Tsinghua University, Beijing 100084, China
- Quanying Liu
  - Department of Biomedical Engineering, Southern University of Science and Technology, Shenzhen 518055, China
20
Frosolini A, Catarzi L, Benedetti S, Latini L, Chisci G, Franz L, Gennaro P, Gabriele G. The Role of Large Language Models (LLMs) in Providing Triage for Maxillofacial Trauma Cases: A Preliminary Study. Diagnostics (Basel) 2024; 14:839. [PMID: 38667484 PMCID: PMC11048758 DOI: 10.3390/diagnostics14080839] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/29/2024] [Revised: 04/10/2024] [Accepted: 04/16/2024] [Indexed: 04/28/2024] Open
Abstract
BACKGROUND In the evolving field of maxillofacial surgery, integrating advanced technologies such as Large Language Models (LLMs) into medical practice, especially for trauma triage, presents a promising yet largely unexplored potential. This study aimed to evaluate the feasibility of using LLMs to triage complex maxillofacial trauma cases by comparing their performance against the expertise of a tertiary referral center. METHODS Based on a comprehensive review of patient records in a tertiary referral center over a one-year period, standardized prompts detailing patient demographics, injury characteristics, and medical histories were created. These prompts were used to assess the triage suggestions of ChatGPT 4.0 and Google GEMINI against the center's recommendations, supplemented by an evaluation of the AI's performance using the QAMAI and AIPI questionnaires. RESULTS In 10 cases of major maxillofacial trauma, the results indicated moderate agreement between LLM recommendations and the referral center, with some variance in the suggested examinations (70% ChatGPT and 50% GEMINI) and treatment plans (60% ChatGPT and 45% GEMINI). Notably, the study found no statistically significant differences in several areas of the questionnaires, except for diagnostic accuracy (GEMINI: 3.30, ChatGPT: 2.30; p = 0.032) and relevance of the recommendations (GEMINI: 2.90, ChatGPT: 3.50; p = 0.021). A Spearman correlation analysis highlighted significant correlations between the two questionnaires, specifically between the QAMAI total score and AIPI treatment scores (rho = 0.767, p = 0.010). CONCLUSIONS This exploratory investigation underscores the potential of LLMs to enhance clinical decision making for maxillofacial trauma cases and indicates the need for further research to refine their application in healthcare settings.
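The Spearman analysis reported above (e.g., rho = 0.767 between the QAMAI total and AIPI treatment scores) can be reproduced mechanically as follows; the per-case values below are invented placeholders, not the study's data.

```python
# Illustrative re-computation of a Spearman rank correlation between two
# questionnaire scores. The scores below are hypothetical.
from scipy.stats import spearmanr

qamai_total = [18, 22, 25, 19, 30, 27, 21, 24, 28, 23]  # hypothetical totals, 10 cases
aipi_treatment = [3, 4, 4, 3, 5, 5, 3, 4, 5, 4]         # hypothetical treatment scores

rho, p_value = spearmanr(qamai_total, aipi_treatment)
print(f"Spearman rho = {rho:.3f}, p = {p_value:.3f}")
```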
Affiliation(s)
- Andrea Frosolini
  - Maxillofacial Surgery Unit, Department of Medical Biotechnologies, University of Siena, 53100 Siena, Italy
- Lisa Catarzi
  - Maxillofacial Surgery Unit, Department of Medical Biotechnologies, University of Siena, 53100 Siena, Italy
- Simone Benedetti
  - Maxillofacial Surgery Unit, Department of Medical Biotechnologies, University of Siena, 53100 Siena, Italy
- Linda Latini
  - Maxillofacial Surgery Unit, Department of Medical Biotechnologies, University of Siena, 53100 Siena, Italy
- Glauco Chisci
  - Maxillofacial Surgery Unit, Department of Medical Biotechnologies, University of Siena, 53100 Siena, Italy
- Leonardo Franz
  - Phoniatrics and Audiology Unit, Department of Neuroscience DNS, University of Padova, 35122 Treviso, Italy
  - Artificial Intelligence in Medicine and Innovation in Clinical Research and Methodology (PhD Program), Department of Clinical and Experimental Sciences, University of Brescia, 25121 Brescia, Italy
- Paolo Gennaro
  - Maxillofacial Surgery Unit, Department of Medical Biotechnologies, University of Siena, 53100 Siena, Italy
- Guido Gabriele
  - Maxillofacial Surgery Unit, Department of Medical Biotechnologies, University of Siena, 53100 Siena, Italy
21
Ma C, Tan W, He R, Yan B. Pretraining a foundation model for generalizable fluorescence microscopy-based image restoration. Nat Methods 2024:10.1038/s41592-024-02244-3. [PMID: 38609490 DOI: 10.1038/s41592-024-02244-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/27/2023] [Accepted: 03/13/2024] [Indexed: 04/14/2024]
Abstract
Fluorescence microscopy-based image restoration has received widespread attention in the life sciences and has led to significant progress, benefiting from deep learning technology. However, most current task-specific methods have limited generalizability to different fluorescence microscopy-based image restoration problems. Here, we seek to improve generalizability and explore the potential of applying a pretrained foundation model to fluorescence microscopy-based image restoration. We provide a universal fluorescence microscopy-based image restoration (UniFMIR) model to address different restoration problems, and show that UniFMIR offers higher image restoration precision, better generalization and increased versatility. Evaluations on five tasks and 14 datasets covering a wide range of microscopy imaging modalities and biological samples demonstrate that the pretrained UniFMIR can effectively transfer knowledge to a specific situation via fine-tuning, uncover clear nanoscale biomolecular structures and facilitate high-quality imaging. This work has the potential to inspire and trigger new research highlights for fluorescence microscopy-based image restoration.
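As a rough sketch of the transfer strategy the abstract describes (pretrain once, then fine-tune on a specific restoration task), the PyTorch fragment below freezes a placeholder backbone and fine-tunes a small head on synthetic image pairs. The architecture, checkpoint name, and data are assumptions for illustration, not the actual UniFMIR code.

```python
# Hedged sketch: freeze a "pretrained" backbone, fine-tune only the head on
# a small set of (noisy, clean) image pairs. Everything here is a placeholder.
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(16, 16, 3, padding=1), nn.ReLU())
head = nn.Conv2d(16, 1, 3, padding=1)
# backbone.load_state_dict(torch.load("unifmir_pretrained.pt"))  # hypothetical checkpoint

for p in backbone.parameters():          # freeze pretrained features
    p.requires_grad = False

optimizer = torch.optim.Adam(head.parameters(), lr=1e-4)
loss_fn = nn.L1Loss()

noisy = torch.randn(8, 1, 64, 64)        # placeholder fine-tuning batch
clean = torch.randn(8, 1, 64, 64)
for _ in range(5):                        # a few fine-tuning steps
    optimizer.zero_grad()
    loss = loss_fn(head(backbone(noisy)), clean)
    loss.backward()
    optimizer.step()
print(float(loss))
```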
Affiliation(s)
- Chenxi Ma
  - School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, Shanghai, China
- Weimin Tan
  - School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, Shanghai, China
- Ruian He
  - School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, Shanghai, China
- Bo Yan
  - School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, Shanghai, China
22
Kim C, Gadgil SU, DeGrave AJ, Omiye JA, Cai ZR, Daneshjou R, Lee SI. Transparent medical image AI via an image-text foundation model grounded in medical literature. Nat Med 2024; 30:1154-1165. [PMID: 38627560 DOI: 10.1038/s41591-024-02887-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/09/2023] [Accepted: 02/27/2024] [Indexed: 04/21/2024]
Abstract
Building trustworthy and transparent image-based medical artificial intelligence (AI) systems requires the ability to interrogate data and models at all stages of the development pipeline, from training models to post-deployment monitoring. Ideally, the data and associated AI systems could be described using terms already familiar to physicians, but this requires medical datasets densely annotated with semantically meaningful concepts. Here, we present a foundation model approach, named MONET (medical concept retriever), which learns how to connect medical images with text and densely scores images on concept presence to enable important tasks in medical AI development and deployment such as data auditing, model auditing and model interpretation. Dermatology provides a demanding use case for the versatility of MONET, due to the heterogeneity in diseases, skin tones and imaging modalities. We trained MONET on 105,550 dermatological images paired with natural language descriptions from a large collection of medical literature. MONET can accurately annotate concepts across dermatology images, as verified by board-certified dermatologists, performing competitively with supervised models built on previously concept-annotated dermatology datasets of clinical images. We demonstrate how MONET enables AI transparency across the entire AI system development pipeline, from building inherently interpretable models to dataset and model auditing, including a case study dissecting the results of an AI clinical trial.
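MONET's concept scoring follows the general CLIP-style recipe of embedding images and concept phrases in a shared space and scoring each image on concept presence by similarity. A schematic version, with random arrays standing in for a trained image-text encoder, might look like this:

```python
# Schematic CLIP-style concept scoring. The embeddings below are random
# placeholders, not outputs of the actual MONET encoder.
import numpy as np

rng = np.random.default_rng(0)
image_embeddings = rng.normal(size=(4, 512))     # 4 images, 512-d embeddings
concept_embeddings = rng.normal(size=(3, 512))   # e.g. "ulcer", "erythema", "nodule"

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Concept-presence scores: one row per image, one column per concept.
scores = normalize(image_embeddings) @ normalize(concept_embeddings).T
print(scores.shape, scores.round(2))
```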
Affiliation(s)
- Chanwoo Kim
  - Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA, USA
- Soham U Gadgil
  - Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA, USA
- Alex J DeGrave
  - Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA, USA
  - Medical Scientist Training Program, University of Washington, Seattle, WA, USA
- Jesutofunmi A Omiye
  - Department of Dermatology, Stanford School of Medicine, Stanford, CA, USA
  - Department of Biomedical Data Science, Stanford School of Medicine, Stanford, CA, USA
- Zhuo Ran Cai
  - Program for Clinical Research and Technology, Stanford University, Stanford, CA, USA
- Roxana Daneshjou
  - Department of Dermatology, Stanford School of Medicine, Stanford, CA, USA
  - Department of Biomedical Data Science, Stanford School of Medicine, Stanford, CA, USA
- Su-In Lee
  - Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA, USA
23
Yurkovich JT, Evans SJ, Rappaport N, Boore JL, Lovejoy JC, Price ND, Hood LE. The transition from genomics to phenomics in personalized population health. Nat Rev Genet 2024; 25:286-302. [PMID: 38093095 DOI: 10.1038/s41576-023-00674-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 11/03/2023] [Indexed: 03/21/2024]
Abstract
Modern health care faces several serious challenges, including an ageing population and its inherent burden of chronic diseases, rising costs and marginal quality metrics. By assessing and optimizing the health trajectory of each individual using a data-driven personalized approach that reflects their genetics, behaviour and environment, we can start to address these challenges. This assessment includes longitudinal phenome measures, such as the blood proteome and metabolome, gut microbiome composition and function, and lifestyle and behaviour captured through wearables and questionnaires. Here, we review ongoing large-scale genomics and longitudinal phenomics efforts and the powerful insights they provide into wellness. We describe our vision for the transformation of current health care from disease-oriented to data-driven, wellness-oriented and personalized population health.
Affiliation(s)
- James T Yurkovich
  - Phenome Health, Seattle, WA, USA
  - Center for Phenomic Health, The Buck Institute for Research on Aging, Novato, CA, USA
  - Department of Bioengineering, University of Texas at Dallas, Richardson, TX, USA
- Simon J Evans
  - Phenome Health, Seattle, WA, USA
  - Center for Phenomic Health, The Buck Institute for Research on Aging, Novato, CA, USA
- Noa Rappaport
  - Center for Phenomic Health, The Buck Institute for Research on Aging, Novato, CA, USA
  - Institute for Systems Biology, Seattle, WA, USA
- Jeffrey L Boore
  - Phenome Health, Seattle, WA, USA
  - Center for Phenomic Health, The Buck Institute for Research on Aging, Novato, CA, USA
- Jennifer C Lovejoy
  - Phenome Health, Seattle, WA, USA
  - Center for Phenomic Health, The Buck Institute for Research on Aging, Novato, CA, USA
  - Institute for Systems Biology, Seattle, WA, USA
- Nathan D Price
  - Institute for Systems Biology, Seattle, WA, USA
  - Thorne HealthTech, New York, NY, USA
  - Department of Bioengineering, University of Washington, Seattle, WA, USA
  - Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA, USA
- Leroy E Hood
  - Phenome Health, Seattle, WA, USA
  - Center for Phenomic Health, The Buck Institute for Research on Aging, Novato, CA, USA
  - Institute for Systems Biology, Seattle, WA, USA
  - Department of Bioengineering, University of Washington, Seattle, WA, USA
  - Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA, USA
  - Department of Immunology, University of Washington, Seattle, WA, USA
24
Truhn D, Eckardt JN, Ferber D, Kather JN. Large language models and multimodal foundation models for precision oncology. NPJ Precis Oncol 2024; 8:72. [PMID: 38519519 PMCID: PMC10959931 DOI: 10.1038/s41698-024-00573-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/21/2023] [Accepted: 03/12/2024] [Indexed: 03/25/2024] Open
Abstract
The technological progress in artificial intelligence (AI) has massively accelerated since 2022, with far-reaching implications for oncology and cancer research. Large language models (LLMs) now perform at human-level competency in text processing. Notably, both text and image processing networks are increasingly based on transformer neural networks. This convergence enables the development of multimodal AI models that take diverse types of data as an input simultaneously, marking a qualitative shift from specialized niche models which were prevalent in the 2010s. This editorial summarizes these developments, which are expected to impact precision oncology in the coming years.
Affiliation(s)
- Daniel Truhn
  - Department of Diagnostic and Interventional Radiology, University Hospital Aachen, Aachen, Germany
- Jan-Niklas Eckardt
  - Department of Internal Medicine I, University Hospital Carl Gustav Carus, Technical University Dresden, Dresden, Germany
  - Else Kroener Fresenius Center for Digital Health, Technical University Dresden, Dresden, Germany
- Dyke Ferber
  - National Center for Tumor Diseases (NCT), Heidelberg University Hospital, Heidelberg, Germany
  - Department of Medical Oncology, Heidelberg University Hospital, Heidelberg, Germany
- Jakob Nikolas Kather
  - Department of Internal Medicine I, University Hospital Carl Gustav Carus, Technical University Dresden, Dresden, Germany
  - Else Kroener Fresenius Center for Digital Health, Technical University Dresden, Dresden, Germany
  - National Center for Tumor Diseases (NCT), Heidelberg University Hospital, Heidelberg, Germany
  - Department of Medical Oncology, Heidelberg University Hospital, Heidelberg, Germany
25
Ge J, Chen IY, Pletcher MJ, Lai JC. Prompt Engineering for Generative Artificial Intelligence in Gastroenterology and Hepatology. Am J Gastroenterol 2024:00000434-990000000-01003. [PMID: 38294157 DOI: 10.14309/ajg.0000000000002689] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Grants] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 10/11/2023] [Accepted: 12/28/2023] [Indexed: 02/01/2024]
Affiliation(s)
- Jin Ge
  - Division of Gastroenterology and Hepatology, Department of Medicine, University of California, San Francisco, San Francisco, California, USA
- Irene Y Chen
  - UCSF and UC Berkeley Joint Program in Computational Precision Health, Berkeley, California, USA
- Mark J Pletcher
  - Department of Epidemiology and Biostatistics, University of California, San Francisco, San Francisco, California, USA
- Jennifer C Lai
  - Division of Gastroenterology and Hepatology, Department of Medicine, University of California, San Francisco, San Francisco, California, USA
26
Sorin V, Glicksberg BS, Artsi Y, Barash Y, Konen E, Nadkarni GN, Klang E. Utilizing large language models in breast cancer management: systematic review. J Cancer Res Clin Oncol 2024; 150:140. [PMID: 38504034 PMCID: PMC10950983 DOI: 10.1007/s00432-024-05678-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/30/2023] [Accepted: 03/01/2024] [Indexed: 03/21/2024]
Abstract
PURPOSE Despite advanced technologies in breast cancer management, challenges remain in efficiently interpreting vast clinical data for patient-specific insights. We reviewed the literature on how large language models (LLMs) such as ChatGPT might offer solutions in this field. METHODS We searched MEDLINE for relevant studies published before December 22, 2023. Keywords included: "large language models", "LLM", "GPT", "ChatGPT", "OpenAI", and "breast". The risk of bias was evaluated using the QUADAS-2 tool. RESULTS Six studies, evaluating either ChatGPT-3.5 or GPT-4, met our inclusion criteria. They explored clinical notes analysis, guideline-based question-answering, and patient management recommendations. Accuracy varied between studies, ranging from 50% to 98%. Higher accuracy was seen in structured tasks like information retrieval. Half of the studies used real patient data, adding practical clinical value. Challenges included inconsistent accuracy, dependency on how questions are posed (prompt dependency), and, in some cases, missing critical clinical information. CONCLUSION LLMs hold potential in breast cancer care, especially in textual information extraction and guideline-driven clinical question-answering. Yet, their inconsistent accuracy underscores the need for careful validation of these models and for ongoing supervision.
Affiliation(s)
- Vera Sorin
  - Department of Diagnostic Imaging, Chaim Sheba Medical Center, Affiliated to the Sackler School of Medicine, Tel-Aviv University, Emek Haela St. 1, 52621, Ramat Gan, Israel
  - DeepVision Lab, Chaim Sheba Medical Center, Tel Hashomer, Israel
- Benjamin S Glicksberg
  - Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Yaara Artsi
  - Azrieli Faculty of Medicine, Bar-Ilan University, Zefat, Israel
- Yiftach Barash
  - Department of Diagnostic Imaging, Chaim Sheba Medical Center, Affiliated to the Sackler School of Medicine, Tel-Aviv University, Emek Haela St. 1, 52621, Ramat Gan, Israel
  - DeepVision Lab, Chaim Sheba Medical Center, Tel Hashomer, Israel
- Eli Konen
  - Department of Diagnostic Imaging, Chaim Sheba Medical Center, Affiliated to the Sackler School of Medicine, Tel-Aviv University, Emek Haela St. 1, 52621, Ramat Gan, Israel
- Girish N Nadkarni
  - Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, USA
  - The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Eyal Klang
  - Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, USA
  - The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA
27
Ge J, Sun S, Owens J, Galvez V, Gologorskaya O, Lai JC, Pletcher MJ, Lai K. Development of a liver disease-specific large language model chat interface using retrieval-augmented generation. Hepatology 2024:01515467-990000000-00791. [PMID: 38451962 DOI: 10.1097/hep.0000000000000834] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 12/13/2023] [Accepted: 02/24/2024] [Indexed: 03/09/2024]
Abstract
BACKGROUND AND AIMS Large language models (LLMs) have significant capabilities in clinical information processing tasks. Commercially available LLMs, however, are not optimized for clinical uses and are prone to generating hallucinatory information. Retrieval-augmented generation (RAG) is an enterprise architecture that allows the embedding of customized data into LLMs. This approach "specializes" the LLMs and is thought to reduce hallucinations. APPROACH AND RESULTS We developed "LiVersa," a liver disease-specific LLM, using our institution's protected health information-compliant text embedding and LLM platform, "Versa." We conducted RAG on 30 publicly available American Association for the Study of Liver Diseases guidance documents to be incorporated into LiVersa. We evaluated LiVersa's performance in 2 rounds of testing. First, we compared LiVersa's outputs against those of trainees from a previously published knowledge assessment; LiVersa answered all 10 questions correctly. Second, we asked 15 hepatologists to evaluate the outputs of 10 hepatology topic questions generated by LiVersa, OpenAI's ChatGPT 4, and Meta's Large Language Model Meta AI 2; LiVersa's outputs were more accurate but were rated less comprehensive and safe compared with those of ChatGPT 4. CONCLUSIONS In this demonstration, we built disease-specific and protected health information-compliant LLMs using RAG. While LiVersa demonstrated higher accuracy in answering questions related to hepatology, there were some deficiencies due to limitations set by the number of documents used for RAG. LiVersa will likely require further refinement before potential live deployment. The LiVersa prototype, however, is a proof of concept for utilizing RAG to customize LLMs for clinical use cases.
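The RAG pattern the authors describe can be sketched in a few lines: embed the guidance documents, retrieve the passages most similar to the user's question, and prepend them to the model prompt. In the sketch below, TF-IDF stands in for the institutional text-embedding service, the guidance snippets are paraphrased placeholders, and `call_llm` is a hypothetical client; none of this is the actual LiVersa/Versa code.

```python
# Minimal retrieval-augmented generation (RAG) sketch: retrieve the most
# similar guidance passages and stuff them into the prompt. TF-IDF here is a
# stand-in for a real embedding model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

guidance_chunks = [
    "Patients with cirrhosis should be screened for varices at diagnosis.",
    "Hepatitis B reactivation prophylaxis is recommended during rituximab therapy.",
    "HCC surveillance with ultrasound every 6 months is advised in cirrhosis.",
]

vectorizer = TfidfVectorizer().fit(guidance_chunks)
chunk_vectors = vectorizer.transform(guidance_chunks)

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k guidance chunks most similar to the question."""
    sims = cosine_similarity(vectorizer.transform([question]), chunk_vectors)[0]
    return [guidance_chunks[i] for i in sims.argsort()[::-1][:k]]

question = "How often should patients with cirrhosis be screened for HCC?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this guidance:\n{context}\n\nQuestion: {question}"
print(prompt)  # this prompt would then go to the LLM, e.g. call_llm(prompt)
```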
Affiliation(s)
- Jin Ge
  - Department of Medicine, Division of Gastroenterology and Hepatology, University of California-San Francisco, San Francisco, California, USA
- Steve Sun
  - UCSF Health Information Technology, University of California-San Francisco, San Francisco, California, USA
- Joseph Owens
  - UCSF Health Information Technology, University of California-San Francisco, San Francisco, California, USA
- Victor Galvez
  - UCSF Health Information Technology, University of California-San Francisco, San Francisco, California, USA
- Oksana Gologorskaya
  - UCSF Health Information Technology, University of California-San Francisco, San Francisco, California, USA
  - Bakar Computational Health Sciences Institute, University of California-San Francisco, San Francisco, California, USA
- Jennifer C Lai
  - Department of Medicine, Division of Gastroenterology and Hepatology, University of California-San Francisco, San Francisco, California, USA
- Mark J Pletcher
  - Department of Epidemiology and Biostatistics, University of California-San Francisco, San Francisco, California, USA
- Ki Lai
  - UCSF Health Information Technology, University of California-San Francisco, San Francisco, California, USA
28
Raghu Subramanian C, Yang DA, Khanna R. Enhancing Health Care Communication With Large Language Models-The Role, Challenges, and Future Directions. JAMA Netw Open 2024; 7:e240347. [PMID: 38466311 DOI: 10.1001/jamanetworkopen.2024.0347] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 03/12/2024] Open
Affiliation(s)
- Raman Khanna
  - Division of Clinical Informatics and Digital Transformation, University of California, San Francisco
29
Obradovich N, Johnson T, Paulus MP. Managerial and Organizational Challenges in the Age of AI. JAMA Psychiatry 2024; 81:219-220. [PMID: 38265819 DOI: 10.1001/jamapsychiatry.2023.5247] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 01/25/2024]
Abstract
This Viewpoint discusses the managerial and organizational challenges that could result from the use of artificial intelligence systems in psychiatric research and care.
Affiliation(s)
- Nick Obradovich
  - Laureate Institute for Brain Research, University of Tulsa, Tulsa, Oklahoma
- Tim Johnson
  - Atkinson Graduate School of Management, Willamette University, Salem, Oregon
- Martin P Paulus
  - Laureate Institute for Brain Research, University of Tulsa, Tulsa, Oklahoma
  - Department of Psychiatry, University of California, San Diego
  - Deputy Editor, JAMA Psychiatry
30
Valizadeh A, Moassefi M, Nakhostin-Ansari A, Heidari Some'eh S, Hosseini-Asl H, Saghab Torbati M, Aghajani R, Maleki Ghorbani Z, Menbari-Oskouie I, Aghajani F, Mirzamohamadi A, Ghafouri M, Faghani S, Memari AH. Automated diagnosis of autism with artificial intelligence: State of the art. Rev Neurosci 2024; 35:141-163. [PMID: 37678819 DOI: 10.1515/revneuro-2023-0050] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/26/2023] [Accepted: 07/28/2023] [Indexed: 09/09/2023]
Abstract
Autism spectrum disorder (ASD) represents a panel of conditions that begin during the developmental period and result in impairments of personal, social, academic, or occupational functioning. Early diagnosis is directly related to a better prognosis. Unfortunately, diagnosing ASD requires a long and exhausting subjective process. In this research, we aimed to review the state of the art in automated autism diagnosis and recognition. In February 2022, we searched multiple databases and sources of gray literature for eligible studies. We used an adapted version of the QUADAS-2 tool to assess the risk of bias in the studies. A brief report of the methods and results of each study is presented. Data were synthesized for each modality separately using the Split Component Synthesis (SCS) method. We assessed heterogeneity using the I² statistic and evaluated publication bias using trim-and-fill tests combined with ln(DOR). Confidence in cumulative evidence was assessed using the GRADE approach for diagnostic studies. We included 344 studies from 186,020 participants (an estimated 51,129 of whom are unique) covering nine different modalities in this review, of which 232 reported sufficient data for meta-analysis. The area under the curve was in the range of 0.71-0.90 for all the modalities. The studies on EEG data provided the best accuracy, with the area under the curve ranging between 0.85 and 0.93. We found that the literature is rife with bias and methodological/reporting flaws. Recommendations are provided for future research to provide better studies and fill in the current knowledge gaps.
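The I² heterogeneity measure used in this meta-analysis is derived from Cochran's Q under a fixed-effect model, with I² = max(0, (Q - df) / Q). A minimal sketch with invented per-study effects follows.

```python
# Hedged sketch of Cochran's Q and the I^2 statistic. Effect sizes and
# variances below are invented placeholders, not the review's data.
import numpy as np

effects = np.array([0.80, 0.92, 0.75, 0.88, 0.70])    # hypothetical per-study log-DOR
variances = np.array([0.04, 0.05, 0.03, 0.06, 0.04])  # hypothetical sampling variances

weights = 1.0 / variances
pooled = np.sum(weights * effects) / np.sum(weights)
q = np.sum(weights * (effects - pooled) ** 2)          # Cochran's Q
df = len(effects) - 1
i_squared = max(0.0, (q - df) / q) * 100

print(f"Q = {q:.2f}, I^2 = {i_squared:.1f}%")
```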
Affiliation(s)
- Amir Valizadeh
  - Neuroscience Institute, Tehran University of Medical Sciences, PO: 1419733141, Tehran, Iran
- Mana Moassefi
  - Neuroscience Institute, Tehran University of Medical Sciences, PO: 1419733141, Tehran, Iran
- Amin Nakhostin-Ansari
  - Sports Medicine Research Center, Neuroscience Institute, Tehran University of Medical Sciences, PO: 14395578, Tehran, Iran
- Soheil Heidari Some'eh
  - Sports Medicine Research Center, Neuroscience Institute, Tehran University of Medical Sciences, PO: 14395578, Tehran, Iran
  - Students' Scientific Research Center, Tehran University of Medical Sciences, PO: 1417755331, Tehran, Iran
- Hossein Hosseini-Asl
  - Sports Medicine Research Center, Neuroscience Institute, Tehran University of Medical Sciences, PO: 14395578, Tehran, Iran
  - Students' Scientific Research Center, Tehran University of Medical Sciences, PO: 1417755331, Tehran, Iran
- Reyhaneh Aghajani
  - Sports Medicine Research Center, Neuroscience Institute, Tehran University of Medical Sciences, PO: 14395578, Tehran, Iran
  - Students' Scientific Research Center, Tehran University of Medical Sciences, PO: 1417755331, Tehran, Iran
- Zahra Maleki Ghorbani
  - Sports Medicine Research Center, Neuroscience Institute, Tehran University of Medical Sciences, PO: 14395578, Tehran, Iran
  - Students' Scientific Research Center, Tehran University of Medical Sciences, PO: 1417755331, Tehran, Iran
- Iman Menbari-Oskouie
  - Sports Medicine Research Center, Neuroscience Institute, Tehran University of Medical Sciences, PO: 14395578, Tehran, Iran
- Faezeh Aghajani
  - Sports Medicine Research Center, Neuroscience Institute, Tehran University of Medical Sciences, PO: 14395578, Tehran, Iran
  - Research Development Center, Arash Women's Hospital, Tehran University of Medical Sciences, PO: 14695542, Tehran, Iran
- Alireza Mirzamohamadi
  - Sports Medicine Research Center, Neuroscience Institute, Tehran University of Medical Sciences, PO: 14395578, Tehran, Iran
  - Students' Scientific Research Center, Tehran University of Medical Sciences, PO: 1417755331, Tehran, Iran
- Mohammad Ghafouri
  - Sports Medicine Research Center, Neuroscience Institute, Tehran University of Medical Sciences, PO: 14395578, Tehran, Iran
- Shahriar Faghani
  - Shariati Hospital, Department of Radiology, Tehran University of Medical Sciences, PO: 1411713135, Tehran, Iran
  - Interdisciplinary Neuroscience Research Program (INRP), Tehran University of Medical Sciences, PO: 1416634793, Tehran, Iran
- Amir Hossein Memari
  - Sports Medicine Research Center, Neuroscience Institute, Tehran University of Medical Sciences, PO: 14395578, Tehran, Iran
31
Shahab O, El Kurdi B, Shaukat A, Nadkarni G, Soroush A. Large language models: a primer and gastroenterology applications. Therap Adv Gastroenterol 2024; 17:17562848241227031. [PMID: 38390029 PMCID: PMC10883116 DOI: 10.1177/17562848241227031] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 10/04/2023] [Accepted: 01/02/2024] [Indexed: 02/24/2024] Open
Abstract
Over the past year, the emergence of state-of-the-art large language models (LLMs) in tools like ChatGPT has ushered in a rapid acceleration in artificial intelligence (AI) innovation. These powerful AI models can generate tailored and high-quality text responses to instructions and questions without the need for labor-intensive task-specific training data or complex software engineering. As the technology continues to mature, LLMs hold immense potential for transforming clinical workflows, enhancing patient outcomes, improving medical education, and optimizing medical research. In this review, we provide a practical discussion of LLMs, tailored to gastroenterologists. We highlight the technical foundations of LLMs, emphasizing their key strengths and limitations as well as how to interact with them safely and effectively. We discuss some potential LLM use cases for clinical gastroenterology practice, education, and research. Finally, we review critical barriers to implementation and ongoing work to address these issues. This review aims to equip gastroenterologists with a foundational understanding of LLMs to facilitate a more active clinician role in the development and implementation of this rapidly emerging technology.
Affiliation(s)
- Omer Shahab
  - Division of Gastroenterology, Department of Medicine, VHC Health, Arlington, VA, USA
- Bara El Kurdi
  - Division of Gastroenterology and Hepatology, Department of Medicine, Virginia Tech Carilion School of Medicine, Roanoke, VA, USA
- Aasma Shaukat
  - Division of Gastroenterology, Department of Medicine, NYU Grossman School of Medicine, New York, NY, USA
  - VA New York Harbor Veterans Affairs Healthcare System, New York City, NY, USA
- Girish Nadkarni
  - Division of Data-Driven and Digital Medicine, Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA
  - The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Ali Soroush
  - Division of Data-Driven and Digital Medicine, Icahn School of Medicine at Mount Sinai, 1 Gustave L. Levy Place, New York, NY 10029-6574, USA
  - The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA
  - Henry D. Janowitz Division of Gastroenterology, Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA
32
Han C, Kim DW, Kim S, Chan You S, Park JY, Bae S, Yoon D. Evaluation of GPT-4 for 10-year cardiovascular risk prediction: Insights from the UK Biobank and KoGES data. iScience 2024; 27:109022. [PMID: 38357664 PMCID: PMC10865411 DOI: 10.1016/j.isci.2024.109022] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/11/2023] [Revised: 11/28/2023] [Accepted: 01/22/2024] [Indexed: 02/16/2024] Open
Abstract
Cardiovascular disease (CVD) remains a pressing global health concern. While traditional risk prediction methods such as the Framingham and American College of Cardiology/American Heart Association (ACC/AHA) risk scores have been widely used in practice, artificial intelligence (AI), especially GPT-4, offers new opportunities. Utilizing large-scale, multi-center data from 47,468 UK Biobank participants and 5,718 KoGES participants, this study quantitatively evaluated the predictive capabilities of GPT-4 in comparison with traditional models. Our results suggest that the GPT-based score showed performance comparable to traditional models in CVD prediction (AUROC on UKB: 0.725 for GPT-4, 0.733 for ACC/AHA, 0.728 for Framingham; KoGES: 0.664 for GPT-4, 0.674 for ACC/AHA, 0.675 for Framingham). Even with the omission of certain variables, GPT-4's performance was robust, demonstrating its adaptability to data-scarce situations. In conclusion, this study emphasizes the promising role of GPT-4 in predicting CVD risk across varied ethnic datasets, pointing toward its expansive future applications in medical practice.
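The AUROC comparison underlying these results can be illustrated with synthetic data; the outcome labels and risk scores below are simulated, not UK Biobank or KoGES data.

```python
# Sketch of comparing two risk scores against observed 10-year CVD outcomes
# by AUROC. All data below are synthetic placeholders.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
outcomes = rng.integers(0, 2, size=500)                     # 0/1 CVD event
gpt_scores = np.clip(outcomes * 0.20 + rng.random(500), 0, 1)
framingham_scores = np.clip(outcomes * 0.22 + rng.random(500), 0, 1)

print(f"GPT-based AUROC:  {roc_auc_score(outcomes, gpt_scores):.3f}")
print(f"Framingham AUROC: {roc_auc_score(outcomes, framingham_scores):.3f}")
```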
Affiliation(s)
- Changho Han
  - Department of Biomedical Systems Informatics, Yonsei University College of Medicine, Yongin, Republic of Korea
- Dong Won Kim
  - Department of Biomedical Systems Informatics, Yonsei University College of Medicine, Yongin, Republic of Korea
- Songsoo Kim
  - Department of Biomedical Systems Informatics, Yonsei University College of Medicine, Yongin, Republic of Korea
- Seng Chan You
  - Department of Biomedical Systems Informatics, Yonsei University College of Medicine, Yongin, Republic of Korea
  - Institute for Innovation in Digital Healthcare, Severance Hospital, Seoul, Republic of Korea
- Jin Young Park
  - Center for Digital Health, Yongin Severance Hospital, Yonsei University Health System, Yongin, Republic of Korea
  - Department of Psychiatry, Yongin Severance Hospital, Yonsei University College of Medicine, Yongin, Republic of Korea
  - Institute of Behavioral Science in Medicine, Yonsei University College of Medicine, Yonsei University Health System, Seoul, Republic of Korea
- SungA Bae
  - Center for Digital Health, Yongin Severance Hospital, Yonsei University Health System, Yongin, Republic of Korea
  - Department of Cardiology, Yongin Severance Hospital, Yonsei University College of Medicine, Yongin, Republic of Korea
- Dukyong Yoon
  - Department of Biomedical Systems Informatics, Yonsei University College of Medicine, Yongin, Republic of Korea
  - Institute for Innovation in Digital Healthcare, Severance Hospital, Seoul, Republic of Korea
  - Center for Digital Health, Yongin Severance Hospital, Yonsei University Health System, Yongin, Republic of Korea
33
Bey R, Cohen A, Trebossen V, Dura B, Geoffroy PA, Jean C, Landman B, Petit-Jean T, Chatellier G, Sallah K, Tannier X, Bourmaud A, Delorme R. Natural language processing of multi-hospital electronic health records for public health surveillance of suicidality. npj Mental Health Research 2024; 3:6. [PMID: 38609541 PMCID: PMC10955903 DOI: 10.1038/s44184-023-00046-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 02/01/2023] [Accepted: 12/06/2023] [Indexed: 04/14/2024]
Abstract
There is an urgent need to monitor the mental health of large populations, especially during crises such as the COVID-19 pandemic, to timely identify the most at-risk subgroups and to design targeted prevention campaigns. We therefore developed and validated surveillance indicators related to suicidality: the monthly number of hospitalisations caused by suicide attempts and the prevalence among them of five known risk factors. They were automatically computed by analysing the electronic health records of fifteen university hospitals in the Paris area, France, using natural language processing algorithms based on artificial intelligence. We evaluated the relevance of these indicators by conducting a retrospective cohort study. Considering 2,911,920 records contained in a common data warehouse, we tested for changes after the pandemic outbreak in the slope of the monthly number of suicide attempts by conducting an interrupted time-series analysis. We segmented the assessment time into two sub-periods: before (August 1, 2017, to February 29, 2020) and during (March 1, 2020, to June 30, 2022) the COVID-19 pandemic. We detected 14,023 hospitalisations caused by suicide attempts. Their monthly number accelerated after the COVID-19 outbreak, with an estimated trend variation reaching 3.7 (95%CI 2.1-5.3), mainly driven by an increase among girls aged 8-17 (trend variation 1.8, 95%CI 1.2-2.5). After the pandemic outbreak, acts of domestic, physical and sexual violence were more often reported (prevalence ratios: 1.3, 95%CI 1.16-1.48; 1.3, 95%CI 1.10-1.64 and 1.7, 95%CI 1.48-1.98), fewer patients died (p = 0.007) and stays were shorter (p < 0.001). Our study demonstrates that textual clinical data collected in multiple hospitals can be jointly analysed to compute timely indicators describing the mental health conditions of populations. Our findings also highlight the need to better take into account the violence imposed on women, especially at early ages and in the aftermath of the COVID-19 pandemic.
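The interrupted time-series analysis described here is, at its core, a segmented regression with level-change and slope-change terms around the breakpoint (the March 2020 outbreak). A sketch on simulated monthly counts, with all values invented:

```python
# Hedged sketch of segmented regression for an interrupted time series.
# The monthly counts are simulated, not the study's data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
months = np.arange(60)               # study months
breakpoint = 31                      # index of March 2020 in this toy series
post = (months >= breakpoint).astype(float)
time_since = np.where(post == 1, months - breakpoint, 0)

# Simulated monthly counts with a post-breakpoint slope increase.
counts = 200 + 0.5 * months + 3.7 * time_since + rng.normal(0, 8, 60)

X = sm.add_constant(np.column_stack([months, post, time_since]))
fit = sm.OLS(counts, X).fit()
print(fit.params)  # const, pre-slope, level change, slope change ("trend variation")
```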
Affiliation(s)
- Romain Bey
  - Innovation and Data unit, IT Department, Assistance Publique-Hôpitaux de Paris, Paris, France
- Ariel Cohen
  - Innovation and Data unit, IT Department, Assistance Publique-Hôpitaux de Paris, Paris, France
- Vincent Trebossen
  - Child and Adolescent Psychiatry Department, Robert Debré University Hospital, Assistance Publique-Hôpitaux de Paris, Paris, France
- Basile Dura
  - Innovation and Data unit, IT Department, Assistance Publique-Hôpitaux de Paris, Paris, France
- Pierre-Alexis Geoffroy
  - Département de psychiatrie et d'addictologie, GHU Paris Nord, DMU neurosciences, Bichat - Claude Bernard Hospital, Assistance Publique-Hôpitaux de Paris, 75018, Paris, France
  - GHU Paris - psychiatry & neurosciences, 1, rue Cabanis, 75014, Paris, France
  - NeuroDiderot, Inserm, FHU I2-D2, université Paris Cité, 75019, Paris, France
  - CNRS UPR 3212, Institute for cellular and integrative neurosciences, 67000, Strasbourg, France
- Charline Jean
  - Innovation and Data unit, IT Department, Assistance Publique-Hôpitaux de Paris, Paris, France
  - Université Paris-Est Créteil, INSERM, IMRB U955, Créteil, France
  - Service Santé Publique & URC, Hôpital Henri Mondor, Assistance Publique-Hôpitaux de Paris, Créteil, France
- Benjamin Landman
  - Child and Adolescent Psychiatry Department, Robert Debré University Hospital, Assistance Publique-Hôpitaux de Paris, Paris, France
- Thomas Petit-Jean
  - Innovation and Data unit, IT Department, Assistance Publique-Hôpitaux de Paris, Paris, France
- Gilles Chatellier
  - Innovation and Data unit, IT Department, Assistance Publique-Hôpitaux de Paris, Paris, France
  - Université Paris Cité, Paris, France
- Kankoe Sallah
  - URC PNVS, CIC-EC 1425, INSERM, Bichat - Claude Bernard Hospital, Assistance Publique-Hôpitaux de Paris, Paris, France
- Xavier Tannier
  - Sorbonne Université, Inserm, Université Sorbonne Paris Nord, Laboratoire d'Informatique Médicale et d'Ingénierie des Connaissances pour la e-Santé (LIMICS), Paris, France
- Aurelie Bourmaud
  - Université Paris Cité, Paris, France
  - Clinical Epidemiology Unit, Robert Debré University Hospital, Assistance Publique-Hôpitaux de Paris, Paris, France
  - CIC 1426, Inserm, Paris, France
- Richard Delorme
  - Child and Adolescent Psychiatry Department, Robert Debré University Hospital, Assistance Publique-Hôpitaux de Paris, Paris, France
  - Human Genetics and Cognitive Functions, Institut Pasteur, Paris, France
34
Boonstra MJ, Weissenbacher D, Moore JH, Gonzalez-Hernandez G, Asselbergs FW. Artificial intelligence: revolutionizing cardiology with large language models. Eur Heart J 2024; 45:332-345. [PMID: 38170821 PMCID: PMC10834163 DOI: 10.1093/eurheartj/ehad838] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 06/20/2023] [Revised: 12/01/2023] [Accepted: 12/05/2023] [Indexed: 01/05/2024] Open
Abstract
Natural language processing techniques are having an increasing impact on clinical care from the patient, clinician, administrator, and research perspectives. Applications include automated generation of clinical notes and discharge letters, medical term coding for billing, medical chatbots for both patients and clinicians, data enrichment in the identification of disease symptoms or diagnoses, cohort selection for clinical trials, and auditing. This review presents an overview of the history of natural language processing techniques, with brief technical background. It then discusses implementation strategies for natural language processing tools, focusing specifically on large language models, and concludes with future opportunities for the application of such techniques in the field of cardiology.
Affiliation(s)
- Machteld J Boonstra
  - Department of Cardiology, Amsterdam Cardiovascular Sciences, Amsterdam University Medical Centre, University of Amsterdam, Amsterdam, Netherlands
- Davy Weissenbacher
  - Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, CA, USA
- Jason H Moore
  - Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, CA, USA
- Folkert W Asselbergs
  - Department of Cardiology, Amsterdam Cardiovascular Sciences, Amsterdam University Medical Centre, University of Amsterdam, Amsterdam, Netherlands
  - Institute of Health Informatics, University College London, London, UK
  - The National Institute for Health Research University College London Hospitals Biomedical Research Centre, University College London, London, UK
35
Liu X, Wu J, Shao A, Shen W, Ye P, Wang Y, Ye J, Jin K, Yang J. Uncovering Language Disparity of ChatGPT on Retinal Vascular Disease Classification: Cross-Sectional Study. J Med Internet Res 2024; 26:e51926. [PMID: 38252483 PMCID: PMC10845019 DOI: 10.2196/51926] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/17/2023] [Revised: 10/07/2023] [Accepted: 11/30/2023] [Indexed: 01/23/2024] Open
Abstract
BACKGROUND Benefiting from rich knowledge and the exceptional ability to understand text, large language models like ChatGPT have shown great potential in English clinical environments. However, the performance of ChatGPT in non-English clinical settings, as well as its reasoning, have not been explored in depth. OBJECTIVE This study aimed to evaluate ChatGPT's diagnostic performance and inference abilities for retinal vascular diseases in a non-English clinical environment. METHODS In this cross-sectional study, we collected 1226 fundus fluorescein angiography reports and corresponding diagnoses written in Chinese and tested ChatGPT with 4 prompting strategies (direct diagnosis or diagnosis with a step-by-step reasoning process and in Chinese or English). RESULTS Compared with ChatGPT using Chinese prompts for direct diagnosis that achieved an F1-score of 70.47%, ChatGPT using English prompts for direct diagnosis achieved the best diagnostic performance (80.05%), which was inferior to ophthalmologists (89.35%) but close to ophthalmologist interns (82.69%). As for its inference abilities, although ChatGPT can derive a reasoning process with a low error rate (0.4 per report) for both Chinese and English prompts, ophthalmologists identified that the latter brought more reasoning steps with less incompleteness (44.31%), misinformation (1.96%), and hallucinations (0.59%) (all P<.001). Also, analysis of the robustness of ChatGPT with different language prompts indicated significant differences in the recall (P=.03) and F1-score (P=.04) between Chinese and English prompts. In short, when prompted in English, ChatGPT exhibited enhanced diagnostic and inference capabilities for retinal vascular disease classification based on Chinese fundus fluorescein angiography reports. CONCLUSIONS ChatGPT can serve as a helpful medical assistant to provide diagnosis in non-English clinical environments, but there are still performance gaps, language disparities, and errors compared to professionals, which demonstrate the potential limitations and the need to continually explore more robust large language models in ophthalmology practice.
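The study's 2x2 design (direct diagnosis vs. step-by-step reasoning, Chinese vs. English prompts) and its weighted-F1 scoring can be sketched as below; `call_chatgpt` is a hypothetical stand-in for the actual API client, and the prompt wording is illustrative, not the study's exact text.

```python
# Sketch of the four prompting strategies and weighted-F1 evaluation.
# `call_chatgpt` is a placeholder for a real API client.
from sklearn.metrics import f1_score

def build_prompt(report: str, language: str, reasoning: bool) -> str:
    if language == "en":
        task = "Diagnose the retinal vascular disease from this FFA report."
        steps = " Think step by step before giving the final diagnosis." if reasoning else ""
    else:  # "zh"
        task = "请根据这份FFA报告诊断视网膜血管疾病。"
        steps = "请先逐步推理，再给出最终诊断。" if reasoning else ""
    return f"{task}{steps}\n\n{report}"

def evaluate(reports, labels, language, reasoning, call_chatgpt):
    predictions = [call_chatgpt(build_prompt(r, language, reasoning)) for r in reports]
    return f1_score(labels, predictions, average="weighted")

# Toy usage with a dummy model that always answers "BRVO".
dummy = lambda prompt: "BRVO"
print(evaluate(["report 1", "report 2"], ["BRVO", "CRVO"], "en", True, dummy))
```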
Affiliation(s)
- Xiaocong Liu
  - Eye Center, The Second Affiliated Hospital, Zhejiang University, Zhejiang, China
  - School of Public Health, Zhejiang University School of Medicine, Zhejiang, China
- Jiageng Wu
  - School of Public Health, Zhejiang University School of Medicine, Zhejiang, China
- An Shao
  - Eye Center, The Second Affiliated Hospital, Zhejiang University, Zhejiang, China
- Wenyue Shen
  - Eye Center, The Second Affiliated Hospital, Zhejiang University, Zhejiang, China
- Panpan Ye
  - Eye Center, The Second Affiliated Hospital, Zhejiang University, Zhejiang, China
- Yao Wang
  - Eye Center, The Second Affiliated Hospital, Zhejiang University, Zhejiang, China
- Juan Ye
  - Eye Center, The Second Affiliated Hospital, Zhejiang University, Zhejiang, China
- Kai Jin
  - Eye Center, The Second Affiliated Hospital, Zhejiang University, Zhejiang, China
- Jie Yang
  - School of Public Health, Zhejiang University School of Medicine, Zhejiang, China
36
Ong JCL, Seng BJJ, Law JZF, Low LL, Kwa ALH, Giacomini KM, Ting DSW. Artificial intelligence, ChatGPT, and other large language models for social determinants of health: Current state and future directions. Cell Rep Med 2024; 5:101356. [PMID: 38232690 PMCID: PMC10829781 DOI: 10.1016/j.xcrm.2023.101356] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/03/2023] [Revised: 10/12/2023] [Accepted: 12/10/2023] [Indexed: 01/19/2024]
Abstract
This perspective highlights the importance of addressing social determinants of health (SDOH) in patient health outcomes and health inequity, a global problem exacerbated by the COVID-19 pandemic. We provide a broad discussion on current developments in digital health and artificial intelligence (AI), including large language models (LLMs), as transformative tools in addressing SDOH factors, offering new capabilities for disease surveillance and patient care. Simultaneously, we bring attention to challenges, such as data standardization, infrastructure limitations, digital literacy, and algorithmic bias, that could hinder equitable access to AI benefits. For LLMs, we highlight potential unique challenges and risks including environmental impact, unfair labor practices, inadvertent disinformation or "hallucinations," proliferation of bias, and infringement of copyrights. We propose the need for a multitiered approach to digital inclusion as an SDOH and the development of ethical and responsible AI practice frameworks globally and provide suggestions on bridging the gap from development to implementation of equitable AI technologies.
Affiliation(s)
- Jasmine Chiat Ling Ong
  - Division of Pharmacy, Singapore General Hospital, Singapore, Singapore
  - SingHealth Duke-NUS Medicine Academic Clinical Programme, Singapore, Singapore
- Benjamin Jun Jie Seng
  - MOHH Holdings (Singapore) Pte., Ltd., Singapore, Singapore
  - SingHealth Duke-NUS Family Medicine Academic Clinical Programme, Singapore, Singapore
- Lian Leng Low
  - SingHealth Duke-NUS Family Medicine Academic Clinical Programme, Singapore, Singapore
  - Population Health and Integrated Care Office, Singapore General Hospital, Singapore, Singapore
  - Centre for Population Health Research and Implementation, SingHealth Regional Health System, Singapore, Singapore
  - Outram Community Hospital, SingHealth Community Hospitals, Singapore, Singapore
- Andrea Lay Hoon Kwa
  - Division of Pharmacy, Singapore General Hospital, Singapore, Singapore
  - SingHealth Duke-NUS Medicine Academic Clinical Programme, Singapore, Singapore
  - Emerging Infectious Diseases, Duke-NUS Medical School, Singapore, Singapore
- Kathleen M Giacomini
  - Department of Bioengineering and Therapeutic Sciences, Schools of Pharmacy and Medicine, University of California, San Francisco, San Francisco, CA, USA
- Daniel Shu Wei Ting
  - Artificial Intelligence and Digital Innovation Research Group, Singapore Eye Research, Singapore, Singapore
  - Duke-NUS Medical School, National University of Singapore, Singapore, Singapore
  - Byers Eye Institute, Stanford University, Stanford, CA, USA
37
Schonfeld E, Mordekai N, Berg A, Johnstone T, Shah A, Shah V, Haider G, Marianayagam NJ, Veeravagu A. Machine Learning in Neurosurgery: Toward Complex Inputs, Actionable Predictions, and Generalizable Translations. Cureus 2024; 16:e51963. [PMID: 38333513 PMCID: PMC10851045 DOI: 10.7759/cureus.51963] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/27/2023] [Accepted: 01/08/2024] [Indexed: 02/10/2024] Open
Abstract
Machine learning can predict neurosurgical diagnosis and outcomes, power imaging analysis, and perform robotic navigation and tumor labeling. State-of-the-art models can reconstruct and generate images, predict surgical events from video, and assist in intraoperative decision-making. In this review, we will detail the neurosurgical applications of machine learning, ranging from simple to advanced models, and their potential to transform patient care. As machine learning techniques, outputs, and methods become increasingly complex, their performance is often more impactful yet increasingly difficult to evaluate. We aim to introduce these advancements to the neurosurgical audience while suggesting major potential roadblocks to their safe and effective translation. Unlike the previous generation of machine learning in neurosurgery, the safe translation of recent advancements will be contingent on neurosurgeons' involvement in model development and validation.
Affiliation(s)
- Ethan Schonfeld: Neurosurgery, Stanford University School of Medicine, Stanford, USA
- Alex Berg: Neurosurgery, Stanford University School of Medicine, Stanford, USA
- Thomas Johnstone: Neurosurgery, Stanford University School of Medicine, Stanford, USA
- Aaryan Shah: School of Humanities and Sciences, Stanford University, Stanford, USA
- Vaibhavi Shah: Neurosurgery, Stanford University School of Medicine, Stanford, USA
- Ghani Haider: Neurosurgery, Stanford University School of Medicine, Stanford, USA
- Anand Veeravagu: Neurosurgery, Stanford University School of Medicine, Stanford, USA
38
Wu X, Zhang B. ChatGPT promotes healthcare: current applications and potential challenges. Int J Surg 2024; 110:606-608. [PMID: 37816164 PMCID: PMC10793836 DOI: 10.1097/js9.0000000000000802]
Affiliation(s)
- Bin Zhang: Department of Radiology, The First Affiliated Hospital of Jinan University, Guangzhou, Guangdong, People's Republic of China
39
Zack T, Lehman E, Suzgun M, Rodriguez JA, Celi LA, Gichoya J, Jurafsky D, Szolovits P, Bates DW, Abdulnour REE, Butte AJ, Alsentzer E. Assessing the potential of GPT-4 to perpetuate racial and gender biases in health care: a model evaluation study. Lancet Digit Health 2024; 6:e12-e22. [PMID: 38123252 DOI: 10.1016/s2589-7500(23)00225-x]
Abstract
BACKGROUND: Large language models (LLMs) such as GPT-4 hold great promise as transformative tools in health care, ranging from automating administrative tasks to augmenting clinical decision making. However, these models also risk perpetuating biases and delivering incorrect medical diagnoses, which can have a direct, harmful impact on medical care. We aimed to assess whether GPT-4 encodes racial and gender biases that impact its use in health care.

METHODS: Using the Azure OpenAI application programming interface, this model evaluation study tested whether GPT-4 encodes racial and gender biases and examined the impact of such biases on four potential applications of LLMs in the clinical domain, namely medical education, diagnostic reasoning, clinical plan generation, and subjective patient assessment. We conducted experiments with prompts designed to resemble typical use of GPT-4 within clinical and medical education applications. We used clinical vignettes from NEJM Healer and from published research on implicit bias in health care. GPT-4 estimates of the demographic distribution of medical conditions were compared with true US prevalence estimates. Differential diagnosis and treatment planning were evaluated across demographic groups using standard statistical tests for significance between groups.

FINDINGS: We found that GPT-4 did not appropriately model the demographic diversity of medical conditions, consistently producing clinical vignettes that stereotype demographic presentations. The differential diagnoses created by GPT-4 for standardised clinical vignettes were more likely to include diagnoses that stereotype certain races, ethnicities, and genders. Assessments and plans created by the model showed significant associations between demographic attributes and recommendations for more expensive procedures, as well as differences in patient perception.

INTERPRETATION: Our findings highlight the urgent need for comprehensive and transparent bias assessments of LLM tools such as GPT-4 for intended use cases before they are integrated into clinical care. We discuss the potential sources of these biases and potential mitigation strategies before clinical implementation.

FUNDING: Priscilla Chan and Mark Zuckerberg.
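The core quantitative step in the methods is a comparison between the demographic distribution implied by GPT-4's generated vignettes and a reference prevalence estimate. The following minimal Python sketch shows one way such a comparison could be run; the groups, counts, and prevalence split are invented placeholders, not data from the paper.

```python
# Hedged sketch: compare the demographic distribution implied by a model's
# generated vignettes against a reference prevalence estimate, in the spirit
# of the study's methods. All numbers below are invented for illustration.
from scipy.stats import chisquare

# Hypothetical tallies: demographic groups depicted across 100 generated
# vignettes for one condition (placeholder values).
model_counts = [88, 12]          # [group A, group B]
true_proportions = [0.45, 0.55]  # assumed reference US prevalence split

expected = [p * sum(model_counts) for p in true_proportions]
stat, p_value = chisquare(f_obs=model_counts, f_exp=expected)
print(f"chi-square = {stat:.2f}, p = {p_value:.4g}")
# A small p-value indicates the model's depictions diverge from the reference
# prevalence, i.e. a stereotyped demographic presentation.
```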
Affiliation(s)
- Travis Zack: Bakar Computational Health Sciences Institute, University of California San Francisco, San Francisco, CA, USA; Helen Diller Family Comprehensive Cancer Center, University of California San Francisco, San Francisco, CA, USA
- Eric Lehman: Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
- Mirac Suzgun: Department of Computer Science, Stanford University, Stanford, CA, USA; Stanford Law School, Stanford University, Stanford, CA, USA
- Jorge A Rodriguez: Division of General Internal Medicine, Brigham and Women's Hospital, Boston, MA, USA
- Leo Anthony Celi: Laboratory for Computational Physiology, Massachusetts Institute of Technology, Cambridge, MA, USA; Division of Pulmonary, Critical Care and Sleep Medicine, Beth Israel Deaconess Medical Center, Boston, MA, USA; Department of Biostatistics, Harvard T H Chan School of Public Health, Boston, MA, USA
- Judy Gichoya: Department of Radiology, Emory University, Atlanta, GA, USA
- Dan Jurafsky: Department of Computer Science, Stanford University, Stanford, CA, USA; Department of Linguistics, Stanford University, Stanford, CA, USA
- Peter Szolovits: Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
- David W Bates: Division of General Internal Medicine, Brigham and Women's Hospital, Boston, MA, USA; Department of Health Policy and Management, Harvard T H Chan School of Public Health, Boston, MA, USA
- Raja-Elie E Abdulnour: Division of Pulmonary and Critical Care Medicine, Brigham and Women's Hospital, Boston, MA, USA; Harvard Medical School, Boston, MA, USA
- Atul J Butte: Bakar Computational Health Sciences Institute, University of California San Francisco, San Francisco, CA, USA; Center for Data-Driven Insights and Innovation, University of California, Office of the President, Oakland, CA, USA
- Emily Alsentzer: Division of General Internal Medicine, Brigham and Women's Hospital, Boston, MA, USA; Harvard Medical School, Boston, MA, USA
40
Wang C, Liu S, Li A, Liu J. Text Dialogue Analysis for Primary Screening of Mild Cognitive Impairment: Development and Validation Study. J Med Internet Res 2023; 25:e51501. [PMID: 38157230 PMCID: PMC10787336 DOI: 10.2196/51501]
Abstract
BACKGROUND: Artificial intelligence models tailored to diagnose cognitive impairment have shown excellent results. However, it is unclear whether large language models can rival specialized models using text alone.

OBJECTIVE: In this study, we explored the performance of ChatGPT for the primary screening of mild cognitive impairment (MCI) and standardized the design steps and components of the prompts.

METHODS: We gathered 174 participants from the DementiaBank screening dataset and assigned 70% to the training set and 30% to the test set. Only text dialogues were kept. Sentences were cleaned using a macro, followed by a manual check. The prompt consisted of 5 main parts: character setting, scoring system setting, indicator setting, output setting, and explanatory information setting. Three dimensions of variables from published studies were included: vocabulary (word frequency and word ratio, phrase frequency and phrase ratio, and lexical complexity), syntax and grammar (syntactic complexity and grammatical components), and semantics (semantic density and semantic coherence). We used R 4.3.0 for the analysis of variables and diagnostic indicators.

RESULTS: Three additional indicators related to the severity of MCI were incorporated into the final prompt for the model. These indicators were effective in discriminating between MCI and cognitively normal participants: the tip-of-the-tongue phenomenon (P<.001), difficulty with complex ideas (P<.001), and memory issues (P<.001). The final GPT-4 model achieved a sensitivity of 0.8636, a specificity of 0.9487, and an area under the curve (AUC) of 0.9062 on the training set; on the test set, the sensitivity, specificity, and AUC reached 0.7727, 0.8333, and 0.8030, respectively.

CONCLUSIONS: ChatGPT was effective in the primary screening of participants with possible MCI. Improved standardization of prompts by clinicians would further improve the performance of the model. It is important to note that ChatGPT is not a substitute for a clinician making a diagnosis.
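The five-part prompt structure described in the methods (character, scoring system, indicators, output, and explanatory information) can be made concrete with a short sketch. Every string below is an invented placeholder, not the study's actual prompt wording.

```python
# Hedged sketch of a five-part screening prompt in the structure the study
# describes. All wording is hypothetical.
PROMPT_PARTS = {
    "character": "You are a neurologist screening dialogue transcripts for mild cognitive impairment (MCI).",
    "scoring": "Rate each indicator from 0 (absent) to 2 (marked).",
    "indicators": ("Indicators: tip-of-the-tongue phenomenon; difficulty with "
                   "complex ideas; memory issues; lexical complexity; "
                   "syntactic complexity; semantic coherence."),
    "output": 'Return JSON: {"label": "MCI" or "normal", "scores": {...}}.',
    "explanation": "Base every rating only on the dialogue text provided.",
}

def build_prompt(dialogue: str) -> str:
    # Assemble the five parts in a fixed order, then append the transcript.
    header = "\n".join(PROMPT_PARTS[k] for k in
                       ("character", "scoring", "indicators", "output", "explanation"))
    return f"{header}\n\nDialogue:\n{dialogue}"

print(build_prompt("P: I went to the... oh, what do you call it... the market."))
```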
Affiliation(s)
- Changyu Wang: Department of Medical Informatics, West China Medical School, Sichuan University, Chengdu, China; West China College of Stomatology, Sichuan University, Chengdu, China
- Siru Liu: Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States
- Aiqing Li: Department of Neurology, West China Hospital, Sichuan University, Chengdu, China
- Jialin Liu: Department of Medical Informatics, West China Medical School, Sichuan University, Chengdu, China; Information Center, West China Hospital, Sichuan University, Chengdu, China; Department of Otolaryngology-Head and Neck Surgery, West China Hospital, Sichuan University, Chengdu, China
41
Koranteng E, Rao A, Flores E, Lev M, Landman A, Dreyer K, Succi M. Empathy and Equity: Key Considerations for Large Language Model Adoption in Health Care. JMIR Med Educ 2023; 9:e51199. [PMID: 38153778 PMCID: PMC10884892 DOI: 10.2196/51199]
Abstract
The growing presence of large language models (LLMs) in health care applications holds significant promise for innovative advancements in patient care. However, various stakeholders have raised concerns about ethical implications and potential biases. Here, we evaluate the ethics of LLMs in medicine along 2 key axes: empathy and equity. We outline the importance of these factors in novel models of care and develop frameworks for addressing them alongside LLM deployment.
Affiliation(s)
- Arya Rao: Harvard Medical School, Boston, MA, United States
- Efren Flores: Harvard Medical School, Boston, MA, United States
- Michael Lev: Harvard Medical School, Boston, MA, United States
- Adam Landman: Harvard Medical School, Boston, MA, United States
- Keith Dreyer: Harvard Medical School, Boston, MA, United States
- Marc Succi: Massachusetts General Hospital, Boston, United States
42
Beaulieu-Jones BK, Villamar MF, Scordis P, Bartmann AP, Ali W, Wissel BD, Alsentzer E, de Jong J, Patra A, Kohane I. Predicting seizure recurrence after an initial seizure-like episode from routine clinical notes using large language models: a retrospective cohort study. Lancet Digit Health 2023; 5:e882-e894. [PMID: 38000873 PMCID: PMC10695164 DOI: 10.1016/s2589-7500(23)00179-6]
Abstract
BACKGROUND: The evaluation and management of first-time seizure-like events in children can be difficult because these episodes are not always directly observed and might be epileptic seizures or other conditions (seizure mimics). We aimed to evaluate whether machine learning models using real-world data could predict seizure recurrence after an initial seizure-like event.

METHODS: This retrospective cohort study compared models trained and evaluated on two separate datasets between Jan 1, 2010, and Jan 1, 2020: electronic medical records (EMRs) at Boston Children's Hospital and de-identified, patient-level, administrative claims data from the IBM MarketScan research database. The study population comprised patients with an initial diagnosis of either epilepsy or convulsions before the age of 21 years, based on International Classification of Diseases, Clinical Modification (ICD-CM) codes. We compared machine learning-based predictive modelling using structured data (logistic regression and XGBoost) with emerging techniques in natural language processing using large language models.

FINDINGS: The primary cohort comprised 14 021 patients at Boston Children's Hospital matching inclusion criteria with an initial seizure-like event; the comparison cohort comprised 15 062 patients within the IBM MarketScan research database. Seizure recurrence based on a composite expert-derived definition occurred in 57% of patients at Boston Children's Hospital and 63% of patients within IBM MarketScan. Large language models with additional domain-specific and location-specific pre-training on patients excluded from the study performed best (F1-score 0·826 [95% CI 0·817-0·835], AUROC 0·897 [95% CI 0·875-0·913]). All large language models, including the base model without additional pre-training (F1-score 0·739 [95% CI 0·738-0·741], AUROC 0·846 [95% CI 0·826-0·861]), outperformed models trained with structured data. With structured data only, XGBoost outperformed logistic regression, and models trained with the Boston Children's Hospital EMR (logistic regression: F1-score 0·650 [95% CI 0·643-0·657], AUROC 0·694 [0·685-0·705]; XGBoost: F1-score 0·679 [0·676-0·683], AUROC 0·725 [0·717-0·734]) performed similarly to models trained on the IBM MarketScan database (logistic regression: F1-score 0·596 [0·590-0·601], AUROC 0·670 [0·664-0·675]; XGBoost: F1-score 0·678 [0·668-0·687], AUROC 0·710 [0·703-0·714]).

INTERPRETATION: Physicians' clinical notes about an initial seizure-like event contain substantial signal for predicting seizure recurrence, and additional domain-specific and location-specific pre-training can significantly improve the performance of clinical large language models, even for specialised cohorts.

FUNDING: UCB, National Institute of Neurological Disorders and Stroke (US National Institutes of Health).
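For readers who want to reproduce the structured-data baselines, the sketch below trains logistic regression and XGBoost and reports F1 and AUROC, the metrics the study uses. It runs on synthetic features; nothing here is the paper's data or exact pipeline.

```python
# Hedged sketch of the structured-data baselines (logistic regression vs
# gradient-boosted trees) evaluated with F1 and AUROC. Synthetic features
# stand in for the claims/EMR variables.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Class balance loosely mirrors the reported ~57-63% recurrence rates.
X, y = make_classification(n_samples=5000, n_features=40,
                           weights=[0.4], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                    ("XGBoost", XGBClassifier(n_estimators=300, eval_metric="logloss"))]:
    model.fit(X_tr, y_tr)
    proba = model.predict_proba(X_te)[:, 1]
    print(name,
          "F1:", round(f1_score(y_te, proba > 0.5), 3),
          "AUROC:", round(roc_auc_score(y_te, proba), 3))
```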
Affiliation(s)
- Brett K Beaulieu-Jones: Department of Medicine, University of Chicago, Chicago, IL, USA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Mauricio F Villamar: Department of Neurology, The Warren Alpert Medical School of Brown University, Providence, RI, USA
- Benjamin D Wissel: Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH, USA
- Emily Alsentzer: Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Isaac Kohane: Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
43
Overgaard SM, Graham MG, Brereton T, Pencina MJ, Halamka JD, Vidal DE, Economou-Zavlanos NJ. Implementing quality management systems to close the AI translation gap and facilitate safe, ethical, and effective health AI solutions. NPJ Digit Med 2023; 6:218. [PMID: 38007604 PMCID: PMC10676432 DOI: 10.1038/s41746-023-00968-8]
44
Peng C, Yang X, Chen A, Smith KE, PourNejatian N, Costa AB, Martin C, Flores MG, Zhang Y, Magoc T, Lipori G, Mitchell DA, Ospina NS, Ahmed MM, Hogan WR, Shenkman EA, Guo Y, Bian J, Wu Y. A study of generative large language model for medical research and healthcare. NPJ Digit Med 2023; 6:210. [PMID: 37973919 PMCID: PMC10654385 DOI: 10.1038/s41746-023-00958-w]
Abstract
There is enormous enthusiasm about, and there are serious concerns over, applying large language models (LLMs) to healthcare. Yet current assumptions are based on general-purpose LLMs such as ChatGPT, which were not developed for medical use. This study develops a generative clinical LLM, GatorTronGPT, using 277 billion words of text including (1) 82 billion words of clinical text from 126 clinical departments and approximately 2 million patients at the University of Florida Health and (2) 195 billion words of diverse general English text. We train GatorTronGPT using a GPT-3 architecture with up to 20 billion parameters and evaluate its utility for biomedical natural language processing (NLP) and healthcare text generation. GatorTronGPT improves biomedical NLP. We apply GatorTronGPT to generate 20 billion words of synthetic text; NLP models trained on this synthetic text outperform models trained on real-world clinical text. A physicians' Turing test using a 1 (worst) to 9 (best) scale shows no significant differences in linguistic readability (p = 0.22; 6.57 for GatorTronGPT compared with 6.93 for human text) or clinical relevance (p = 0.91; 7.0 for GatorTronGPT compared with 6.97 for human text), and physicians cannot differentiate the two (p < 0.001). This study provides insights into the opportunities and challenges of LLMs for medical research and healthcare.
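The synthetic-text step, sampling clinical notes from a causal language model so that downstream NLP models can be trained on them, can be sketched with off-the-shelf tooling. GatorTronGPT itself is not assumed to be downloadable here; the public gpt2 checkpoint stands in so the sketch runs anywhere, and the prompt is invented.

```python
# Hedged sketch of synthetic clinical-text generation with a causal LM.
# "gpt2" is a stand-in for GatorTronGPT; the prompt is a hypothetical example.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
prompt = "CHIEF COMPLAINT: shortness of breath. HISTORY OF PRESENT ILLNESS:"
synthetic_notes = generator(prompt, max_new_tokens=80, num_return_sequences=3,
                            do_sample=True, temperature=0.9)
for note in synthetic_notes:
    print(note["generated_text"], "\n---")
# Downstream, such synthetic notes would be pooled to train NLP models, as the
# study reports doing at far larger scale (20 billion words).
```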
Affiliation(s)
- Cheng Peng: Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA
- Xi Yang: Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA; Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, FL, USA
- Aokun Chen: Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA; Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, FL, USA
- Ying Zhang: Research Computing, University of Florida, Gainesville, FL, USA
- Tanja Magoc: Integrated Data Repository Research Services, University of Florida, Gainesville, FL, USA
- Gloria Lipori: Integrated Data Repository Research Services, University of Florida, Gainesville, FL, USA; Lillian S. Wells Department of Neurosurgery, Clinical and Translational Science Institute, University of Florida, Gainesville, FL, USA
- Duane A Mitchell: Lillian S. Wells Department of Neurosurgery, Clinical and Translational Science Institute, University of Florida, Gainesville, FL, USA
- Naykky S Ospina: Division of Endocrinology, Department of Medicine, College of Medicine, University of Florida, Gainesville, FL, USA
- Mustafa M Ahmed: Division of Cardiovascular Medicine, Department of Medicine, College of Medicine, University of Florida, Gainesville, FL, USA
- William R Hogan: Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA
- Elizabeth A Shenkman: Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA
- Yi Guo: Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA; Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, FL, USA
- Jiang Bian: Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA; Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, FL, USA
- Yonghui Wu: Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA; Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, FL, USA
45
Ge J, Sun S, Owens J, Galvez V, Gologorskaya O, Lai JC, Pletcher MJ, Lai K. Development of a Liver Disease-Specific Large Language Model Chat Interface using Retrieval Augmented Generation. medRxiv 2023:2023.11.10.23298364 [Preprint]. [PMID: 37986764 PMCID: PMC10659484 DOI: 10.1101/2023.11.10.23298364]
Abstract
Background: Large language models (LLMs) have significant capabilities in clinical information processing tasks. Commercially available LLMs, however, are not optimized for clinical use and are prone to generating incorrect or hallucinatory information. Retrieval-augmented generation (RAG) is an architecture that allows customized data to be retrieved and supplied to an LLM. This approach "specializes" the LLM and is thought to reduce hallucinations.

Methods: We developed "LiVersa," a liver disease-specific LLM, using our institution's protected health information (PHI)-compliant text embedding and LLM platform, "Versa." We conducted RAG on 30 publicly available American Association for the Study of Liver Diseases (AASLD) guidelines and guidance documents to incorporate them into LiVersa. We evaluated LiVersa's performance by comparing its responses with those of trainees from a previously published knowledge assessment study regarding hepatitis B (HBV) treatment and hepatocellular carcinoma (HCC) surveillance.

Results: LiVersa answered all 10 questions correctly when forced to provide a "yes" or "no" answer. Full detailed responses with justifications and rationales, however, were not completely correct for three of the questions.

Discussion: In this study, we demonstrated the ability to build disease-specific and PHI-compliant LLMs using RAG. While our LLM, LiVersa, demonstrated more specificity in answering questions related to clinical hepatology, there were some knowledge deficiencies due to limitations in the number and types of documents used for RAG. The LiVersa prototype, however, is a proof of concept for using RAG to customize LLMs for clinical use and a potential strategy to realize personalized medicine in the future.
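The RAG pattern the study describes, retrieving relevant guideline passages and prepending them to the model prompt, can be sketched as follows. TF-IDF retrieval stands in for the learned text embeddings a production platform such as Versa would use, and the passages are invented placeholders rather than AASLD text.

```python
# Hedged sketch of retrieval-augmented generation: retrieve the most relevant
# passage for a question, then build a grounded prompt for the LLM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

guideline_passages = [
    "Hypothetical passage: HBV treatment is recommended when ALT is elevated ...",
    "Hypothetical passage: HCC surveillance with ultrasound every 6 months ...",
    "Hypothetical passage: management of ascites in decompensated cirrhosis ...",
]
question = "How often should patients with cirrhosis undergo HCC surveillance?"

vectorizer = TfidfVectorizer().fit(guideline_passages + [question])
scores = cosine_similarity(vectorizer.transform([question]),
                           vectorizer.transform(guideline_passages))[0]
context = guideline_passages[scores.argmax()]  # top-1 retrieval

prompt = (f"Answer using only the guideline excerpt below.\n\n"
          f"Excerpt: {context}\n\nQuestion: {question}")
print(prompt)  # this grounded prompt would then be sent to the LLM
```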
Affiliation(s)
- Jin Ge: Division of Gastroenterology and Hepatology, Department of Medicine, University of California – San Francisco, San Francisco, CA
- Steve Sun: UCSF Health Information Technology, University of California – San Francisco, San Francisco, CA
- Joseph Owens: UCSF Health Information Technology, University of California – San Francisco, San Francisco, CA
- Victor Galvez: UCSF Health Information Technology, University of California – San Francisco, San Francisco, CA
- Oksana Gologorskaya: UCSF Health Information Technology, University of California – San Francisco, San Francisco, CA; Bakar Computational Health Sciences Institute, University of California – San Francisco, San Francisco, CA
- Jennifer C. Lai: Division of Gastroenterology and Hepatology, Department of Medicine, University of California – San Francisco, San Francisco, CA
- Mark J. Pletcher: Department of Epidemiology and Biostatistics, University of California – San Francisco, San Francisco, CA
- Ki Lai: UCSF Health Information Technology, University of California – San Francisco, San Francisco, CA
46
Decker H, Trang K, Ramirez J, Colley A, Pierce L, Coleman M, Bongiovanni T, Melton GB, Wick E. Large Language Model-Based Chatbot vs Surgeon-Generated Informed Consent Documentation for Common Procedures. JAMA Netw Open 2023; 6:e2336997. [PMID: 37812419 PMCID: PMC10562939 DOI: 10.1001/jamanetworkopen.2023.36997]
Abstract
Importance: Informed consent is a critical component of patient care before invasive procedures, yet it is frequently inadequate. Electronic consent forms have the potential to facilitate patient comprehension if they provide information that is readable, accurate, and complete; it is not known whether large language model (LLM)-based chatbots may improve informed consent documentation by generating accurate and complete information that is easily understood by patients.

Objective: To compare the readability, accuracy, and completeness of LLM-based chatbot- vs surgeon-generated information on the risks, benefits, and alternatives (RBAs) of common surgical procedures.

Design, Setting, and Participants: This cross-sectional study compared randomly selected surgeon-generated RBAs used in signed electronic consent forms at an academic referral center in San Francisco with LLM-based chatbot-generated (ChatGPT-3.5, OpenAI) RBAs for 6 surgical procedures (colectomy, coronary artery bypass graft, laparoscopic cholecystectomy, inguinal hernia repair, knee arthroplasty, and spinal fusion).

Main Outcomes and Measures: Readability was measured using previously validated scales (Flesch-Kincaid grade level, Gunning Fog index, the Simple Measure of Gobbledygook, and the Coleman-Liau index). Scores range from 0 to greater than 20, indicating the years of education required to understand a text. Accuracy and completeness were assessed using a rubric developed with recommendations from Leapfrog, the Joint Commission, and the American College of Surgeons. Both composite and RBA subgroup scores were compared.

Results: The total sample consisted of 36 RBAs, with 1 RBA generated by the LLM-based chatbot and 5 RBAs generated by surgeons for each of the 6 surgical procedures. The mean (SD) readability score for the LLM-based chatbot RBAs was 12.9 (2.0) vs 15.7 (4.0) for surgeon-generated RBAs (P = .10). The mean (SD) composite completeness and accuracy score was lower for surgeons' RBAs at 1.6 (0.5) than for LLM-based chatbot RBAs at 2.2 (0.4) (P < .001). The LLM-based chatbot scores were higher than the surgeon-generated scores for descriptions of the benefits of surgery (2.3 [0.7] vs 1.4 [0.7]; P < .001) and alternatives to surgery (2.7 [0.5] vs 1.4 [0.7]; P < .001). There was no significant difference in chatbot vs surgeon RBA scores for risks of surgery (1.7 [0.5] vs 1.7 [0.4]; P = .38).

Conclusions and Relevance: The findings of this cross-sectional study suggest that, despite not being perfect, LLM-based chatbots have the potential to enhance informed consent documentation. If an LLM were embedded in electronic health records in a manner compliant with the Health Insurance Portability and Accountability Act, it could be used to provide personalized risk information while easing the documentation burden for physicians.
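The four readability scales are straightforward to compute; the textstat Python package implements each of them. The sample text below is an invented consent-style sentence, not one of the study's RBAs.

```python
# Hedged sketch: scoring a consent-document excerpt on the four validated
# readability scales named in the study. The sample text is invented.
import textstat

rba_text = ("Risks include bleeding, infection, and injury to nearby organs; "
            "benefits include relief of symptoms; alternatives include "
            "continued observation.")

print("Flesch-Kincaid grade:", textstat.flesch_kincaid_grade(rba_text))
print("Gunning Fog index:", textstat.gunning_fog(rba_text))
print("SMOG index:", textstat.smog_index(rba_text))
print("Coleman-Liau index:", textstat.coleman_liau_index(rba_text))
# Each score approximates the years of education needed to understand the
# text, putting chatbot- and surgeon-generated documents on one axis.
```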
Affiliation(s)
- Hannah Decker: Department of Surgery, University of California, San Francisco
- Karen Trang: Department of Surgery, University of California, San Francisco
- Joel Ramirez: Department of Surgery, University of California, San Francisco
- Alexis Colley: Department of Surgery, University of California, San Francisco
- Logan Pierce: Department of Medicine, University of California, San Francisco
- Melissa Coleman: Department of Surgery, University of California, San Francisco
- Genevieve B. Melton: Department of Surgery, Institute for Health Informatics, and Center for Learning Health System Sciences, University of Minnesota, Minneapolis
- Elizabeth Wick: Department of Surgery, University of California, San Francisco
47
Jung KH. Uncover This Tech Term: Foundation Model. Korean J Radiol 2023; 24:1038-1041. [PMID: 37793672 PMCID: PMC10550749 DOI: 10.3348/kjr.2023.0790]
Affiliation(s)
- Kyu-Hwan Jung: Department of Medical Device Management and Research, Samsung Advanced Institute for Health Sciences and Technology, Sungkyunkwan University, Seoul, Republic of Korea; Dataset Science Research Institute, Research Institute for Future Medicine, Samsung Medical Center, Seoul, Republic of Korea
48
Brin D, Sorin V, Vaid A, Soroush A, Glicksberg BS, Charney AW, Nadkarni G, Klang E. Comparing ChatGPT and GPT-4 performance in USMLE soft skill assessments. Sci Rep 2023; 13:16492. [PMID: 37779171 PMCID: PMC10543445 DOI: 10.1038/s41598-023-43436-9]
Abstract
The United States Medical Licensing Examination (USMLE) has been widely used to benchmark the performance of artificial intelligence (AI) models. However, model performance on questions involving USMLE soft skills remains unexplored. This study aimed to evaluate ChatGPT and GPT-4 on USMLE questions involving communication skills, ethics, empathy, and professionalism. We used 80 USMLE-style questions involving soft skills, taken from the USMLE website and the AMBOSS question bank. A follow-up query was used to assess the models' consistency. The performance of the AI models was compared with that of previous AMBOSS users. GPT-4 outperformed ChatGPT, answering 90% of questions correctly compared with ChatGPT's 62.5%. GPT-4 also showed more confidence, revising none of its responses, whereas ChatGPT modified its original answers 82.5% of the time. GPT-4's performance exceeded that of past AMBOSS users. Both AI models, notably GPT-4, showed a capacity for empathy, indicating AI's potential to meet the complex interpersonal, ethical, and professional demands intrinsic to the practice of medicine.
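The follow-up consistency check, asking a question, then challenging the model and seeing whether it revises its answer, can be sketched against a chat API. The model name, challenge wording, and question are assumptions for illustration, not the study's exact protocol.

```python
# Hedged sketch of a follow-up consistency check against a chat API.
# The question and challenge phrasing are invented placeholders.
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

def answer_with_followup(question: str, model: str = "gpt-4") -> tuple[str, str]:
    messages = [{"role": "user", "content": question + " Answer with one letter."}]
    first = client.chat.completions.create(model=model, messages=messages)
    first_answer = first.choices[0].message.content
    # Challenge the model and see whether it stands by its original answer.
    messages += [{"role": "assistant", "content": first_answer},
                 {"role": "user", "content": "Are you sure? Reply with one letter."}]
    second = client.chat.completions.create(model=model, messages=messages)
    return first_answer, second.choices[0].message.content

a1, a2 = answer_with_followup(
    "A capacitated adult refuses a recommended blood transfusion for religious "
    "reasons. What is the most appropriate next step? (A-E)")
print("revised" if a1.strip() != a2.strip() else "consistent")
```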
Affiliation(s)
- Dana Brin: Department of Diagnostic Imaging, Chaim Sheba Medical Center, Ramat Gan, Israel; Faculty of Medicine, Tel-Aviv University, Tel-Aviv, Israel
- Vera Sorin: Department of Diagnostic Imaging, Chaim Sheba Medical Center, Ramat Gan, Israel; Faculty of Medicine, Tel-Aviv University, Tel-Aviv, Israel
- Akhil Vaid: The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Ali Soroush: Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Benjamin S Glicksberg: Hasso Plattner Institute for Digital Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Alexander W Charney: The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Girish Nadkarni: Division of Data-Driven and Digital Medicine (D3M), The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Eyal Klang: Department of Diagnostic Imaging, Chaim Sheba Medical Center, Ramat Gan, Israel; Faculty of Medicine, Tel-Aviv University, Tel-Aviv, Israel
49
Cheung ATM, Nasir-Moin M, Oermann EK. ChatGPT and the Law of the Horse. Am J Bioeth 2023; 23:55-57. [PMID: 37812113 DOI: 10.1080/15265161.2023.2250279]
50
Robinson ML, Garibaldi BT, Lindquist MA. When Clinical Prediction Is Steering the Ship, Beware the Drift of Its Wake. Ann Intern Med 2023; 176:1424-1425. [PMID: 37812777 DOI: 10.7326/m23-2345]
Affiliation(s)
- Matthew L Robinson: Division of Infectious Diseases, Johns Hopkins University School of Medicine, Baltimore, Maryland
- Brian T Garibaldi: Division of Pulmonary and Critical Care Medicine, Johns Hopkins University School of Medicine, Baltimore, Maryland
- Martin A Lindquist: Department of Biostatistics, Johns Hopkins University, Baltimore, Maryland