1
Akhondi-Asl A, Yang Y, Luchette M, Burns JP, Mehta NM, Geva A. Comparing the Quality of Domain-Specific Versus General Language Models for Artificial Intelligence-Generated Differential Diagnoses in PICU Patients. Pediatr Crit Care Med 2024; 25:e273-e282. [PMID: 38329382] [DOI: 10.1097/pcc.0000000000003468]
Abstract
OBJECTIVES Generative language models (LMs) are being evaluated in a variety of tasks in healthcare, but pediatric critical care studies are scant. Our objective was to evaluate the utility of generative LMs in the pediatric critical care setting and to determine whether domain-adapted LMs can outperform much larger general-domain LMs in generating a differential diagnosis from the admission notes of PICU patients. DESIGN Single-center retrospective cohort study. SETTING Quaternary 40-bed PICU. PATIENTS Notes from all patients admitted to the PICU between January 2012 and April 2023 were used for model development. One hundred thirty randomly selected admission notes were used for evaluation. INTERVENTIONS None. MEASUREMENTS AND MAIN RESULTS Five experts in critical care used a 5-point Likert scale to independently evaluate the overall quality of differential diagnoses: 1) written by the clinician in the original notes, 2) generated by two general LMs (BioGPT-Large and LLaMa-65B), and 3) generated by two fine-tuned models (fine-tuned BioGPT-Large and fine-tuned LLaMa-7B). Differences among differential diagnoses were compared using mixed methods regression models. We used 1,916,538 notes from 32,454 unique patients for model development and validation. The mean quality scores of the differential diagnoses generated by the clinicians and by fine-tuned LLaMa-7B, the best-performing LM, were 3.43 and 2.88, respectively (absolute difference 0.54 units [95% CI, 0.37-0.72], p < 0.001). Fine-tuned LLaMa-7B performed better than LLaMa-65B (absolute difference 0.23 units [95% CI, 0.06-0.41], p = 0.009) and BioGPT-Large (absolute difference 0.86 units [95% CI, 0.69-1.0], p < 0.001). The differential diagnoses generated by clinicians and by fine-tuned LLaMa-7B were ranked highest in quality in 144 (55%) and 74 (29%) cases, respectively. CONCLUSIONS A smaller LM fine-tuned on notes of PICU patients outperformed much larger models trained on general-domain data. LMs currently remain inferior to clinicians but may serve as an adjunct to human clinicians in real-world tasks using real-world data.
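The study reports mean Likert-score differences with 95% CIs. As a rough illustration of how such an interval can be estimated, the sketch below uses a percentile bootstrap on invented paired ratings; the authors used mixed-model regression, not this method, and all numbers here are hypothetical.

```python
import random
import statistics

def bootstrap_mean_diff_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap 95% CI for mean(a) - mean(b) over paired ratings."""
    rng = random.Random(seed)
    n = len(a)
    boot_means = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample case indices with replacement
        boot_means.append(statistics.mean(a[i] - b[i] for i in idx))
    boot_means.sort()
    lo = boot_means[int(n_boot * alpha / 2)]
    hi = boot_means[int(n_boot * (1 - alpha / 2)) - 1]
    return statistics.mean(x - y for x, y in zip(a, b)), (lo, hi)

# Hypothetical 1-5 Likert quality scores for the same 10 cases
clinician = [4, 3, 4, 5, 3, 4, 2, 4, 3, 4]
model = [3, 3, 4, 4, 2, 3, 2, 3, 3, 3]
diff, (lo, hi) = bootstrap_mean_diff_ci(clinician, model)
```

With these invented scores the point estimate is a 0.6-unit advantage for the clinician-written diagnoses; the interval narrows as the number of rated cases grows.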
Affiliation(s)
- Alireza Akhondi-Asl
- Division of Critical Care Medicine, Department of Anesthesiology, Critical Care and Pain Medicine, Boston Children's Hospital, Boston, MA
- Perioperative and Critical Care-Center for Outcomes (PC-CORE), Boston Children's Hospital, Boston, MA
- Department of Anaesthesia, Harvard Medical School, Boston, MA
- Computational Health Informatics Program, Boston Children's Hospital, Boston, MA
- Youyang Yang
- Division of Critical Care Medicine, Department of Anesthesiology, Critical Care and Pain Medicine, Boston Children's Hospital, Boston, MA
- Perioperative and Critical Care-Center for Outcomes (PC-CORE), Boston Children's Hospital, Boston, MA
- Department of Anaesthesia, Harvard Medical School, Boston, MA
- Computational Health Informatics Program, Boston Children's Hospital, Boston, MA
- Matthew Luchette
- Division of Critical Care Medicine, Department of Anesthesiology, Critical Care and Pain Medicine, Boston Children's Hospital, Boston, MA
- Perioperative and Critical Care-Center for Outcomes (PC-CORE), Boston Children's Hospital, Boston, MA
- Department of Anaesthesia, Harvard Medical School, Boston, MA
- Computational Health Informatics Program, Boston Children's Hospital, Boston, MA
- Jeffrey P Burns
- Division of Critical Care Medicine, Department of Anesthesiology, Critical Care and Pain Medicine, Boston Children's Hospital, Boston, MA
- Perioperative and Critical Care-Center for Outcomes (PC-CORE), Boston Children's Hospital, Boston, MA
- Department of Anaesthesia, Harvard Medical School, Boston, MA
- Computational Health Informatics Program, Boston Children's Hospital, Boston, MA
- Nilesh M Mehta
- Division of Critical Care Medicine, Department of Anesthesiology, Critical Care and Pain Medicine, Boston Children's Hospital, Boston, MA
- Perioperative and Critical Care-Center for Outcomes (PC-CORE), Boston Children's Hospital, Boston, MA
- Alon Geva
- Division of Critical Care Medicine, Department of Anesthesiology, Critical Care and Pain Medicine, Boston Children's Hospital, Boston, MA
- Perioperative and Critical Care-Center for Outcomes (PC-CORE), Boston Children's Hospital, Boston, MA
- Department of Anaesthesia, Harvard Medical School, Boston, MA
- Computational Health Informatics Program, Boston Children's Hospital, Boston, MA
2
Majdik ZP, Graham SS, Shiva Edward JC, Rodriguez SN, Karnes MS, Jensen JT, Barbour JB, Rousseau JF. Sample Size Considerations for Fine-Tuning Large Language Models for Named Entity Recognition Tasks: Methodological Study. JMIR AI 2024; 3:e52095. [PMID: 38875593] [PMCID: PMC11140272] [DOI: 10.2196/52095]
Abstract
BACKGROUND Large language models (LLMs) have the potential to support promising new applications in health informatics. However, practical data on sample size considerations for fine-tuning LLMs to perform specific tasks in biomedical and health policy contexts are lacking. OBJECTIVE This study aims to evaluate sample size and sample selection techniques for fine-tuning LLMs to support improved named entity recognition (NER) for a custom data set of conflicts of interest disclosure statements. METHODS A random sample of 200 disclosure statements was prepared for annotation. All "PERSON" and "ORG" entities were identified by each of the 2 raters, and once appropriate agreement was established, the annotators independently annotated an additional 290 disclosure statements. From the 490 annotated documents, 2500 stratified random samples in different size ranges were drawn. The 2500 training set subsamples were used to fine-tune a selection of language models across 2 model architectures (Bidirectional Encoder Representations from Transformers [BERT] and Generative Pre-trained Transformer [GPT]) for improved NER, and multiple regression was used to assess the relationship between sample size (sentences), entity density (entities per sentence [EPS]), and trained model performance (F1-score). Additionally, single-predictor threshold regression models were used to evaluate the possibility of diminishing marginal returns from increased sample size or entity density. RESULTS Fine-tuned models ranged in topline NER performance from F1-score=0.79 to F1-score=0.96 across architectures. Two-predictor multiple linear regression models were statistically significant, with multiple R2 ranging from 0.6057 to 0.7896 (all P<.001). EPS and the number of sentences were significant predictors of F1-scores in all cases (P<.001), except for the GPT-2_large model, where EPS was not a significant predictor (P=.184). Model thresholds indicate points of diminishing marginal return from increased training data set sample size measured by the number of sentences, with point estimates ranging from 439 sentences for RoBERTa_large to 527 sentences for GPT-2_large. Likewise, the threshold regression models indicate a diminishing marginal return for EPS, with point estimates between 1.36 and 1.38. CONCLUSIONS Relatively modest sample sizes can be used to fine-tune LLMs for NER tasks applied to biomedical text, and training data entity density should representatively approximate entity density in production data. Training data quality and a model architecture's intended use (text generation vs text processing or classification) may be as important as, or more important than, training data volume and model parameter size.
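The two-predictor regression described above can be sketched with ordinary least squares. The data below are synthetic, generated with a known slope purely to illustrate the model form F1 ~ sentences + EPS; this is not the authors' data or their exact procedure.

```python
import numpy as np

def fit_f1_model(sentences, eps, f1):
    """OLS fit of f1 ~ b0 + b1*sentences + b2*eps; returns (coefficients, R^2)."""
    X = np.column_stack([np.ones(len(sentences)), sentences, eps])
    beta, *_ = np.linalg.lstsq(X, f1, rcond=None)
    resid = f1 - X @ beta
    r2 = 1 - np.sum(resid ** 2) / np.sum((f1 - f1.mean()) ** 2)
    return beta, r2

# Synthetic "training runs": sample size in sentences, entity density, resulting F1
rng = np.random.default_rng(0)
sentences = rng.uniform(50, 1000, 200)
eps = rng.uniform(0.5, 2.0, 200)
f1 = 0.6 + 0.0002 * sentences + 0.05 * eps + rng.normal(0, 0.01, 200)

beta, r2 = fit_f1_model(sentences, eps, f1)  # recovers roughly [0.6, 0.0002, 0.05]
```

A threshold (segmented) regression, as used in the study, would extend this by fitting separate slopes below and above a breakpoint and choosing the breakpoint that minimizes residual error.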
Affiliation(s)
- Zoltan P Majdik
- Department of Communication, North Dakota State University, Fargo, ND, United States
- S Scott Graham
- Department of Rhetoric & Writing, The University of Texas at Austin, Austin, TX, United States
- Jade C Shiva Edward
- Department of Rhetoric & Writing, The University of Texas at Austin, Austin, TX, United States
- Sabrina N Rodriguez
- Department of Neurology, The Dell Medical School, The University of Texas at Austin, Austin, TX, United States
- Martha S Karnes
- Department of Rhetoric & Writing, University of Arkansas Little Rock, Little Rock, AR, United States
- Jared T Jensen
- Department of Rhetoric & Writing, The University of Texas at Austin, Austin, TX, United States
- Joshua B Barbour
- Department of Communication, The University of Illinois at Urbana-Champaign, Urbana, IL, United States
- Justin F Rousseau
- Statistical Planning and Analysis Section, Department of Neurology, The University of Texas Southwestern Medical Center, Dallas, TX, United States
- Peter O'Donnell Jr. Brain Institute, The University of Texas Southwestern Medical Center, Dallas, TX, United States
3
McManus KF, Stringer JM, Corson N, Fodeh S, Steinhardt S, Levin FL, Shotqara AS, D'Auria J, Fielstein EM, Gobbel GT, Scott J, Trafton JA, Taddei TH, Erdos J, Tamang SR. Deploying a national clinical text processing infrastructure. J Am Med Inform Assoc 2024; 31:727-731. [PMID: 38146986] [PMCID: PMC10873837] [DOI: 10.1093/jamia/ocad249]
Abstract
OBJECTIVES Clinical text processing offers a promising avenue for improving multiple aspects of healthcare, though operational deployment remains a substantial challenge. This case report details the implementation of a national clinical text processing infrastructure within the Department of Veterans Affairs (VA). METHODS Two foundational use cases, cancer case management and suicide and overdose prevention, illustrate how text processing can be practically implemented at scale for diverse clinical applications using shared services. RESULTS Insights from these use cases underline both commonalities and differences, providing a replicable model for future text processing applications. CONCLUSIONS This project enables more efficient initiation, testing, and future deployment of text processing models, streamlining the integration of these use cases into healthcare operations. The infrastructure was implemented in a large integrated health delivery system in the United States, but we expect the lessons learned to be relevant to any health system, including smaller local and regional health systems in the United States.
Affiliation(s)
- Kimberly F McManus
- Department of Veterans Affairs, Office of the CTO, Washington, DC 20571, United States
- Johnathon Michael Stringer
- Division of Immunology and Rheumatology, Department of Medicine, Stanford University, Stanford, CA 94304, United States
- Neal Corson
- Department of Veterans Affairs, San Diego, CA 92108, United States
- Samah Fodeh
- Department of Veterans Affairs, West Haven, CT 06516, United States
- Yale School of Medicine, New Haven, CT 06510, United States
- Asqar S Shotqara
- Department of Veterans Affairs, Center for Innovation to Implementation (Ci2i), Palo Alto, CA 94304, United States
- Joseph D'Auria
- Product Engineering, Department of Veterans Affairs, Austin, TX 78741, United States
- Elliot M Fielstein
- Department of Veterans Affairs, Office of Mental Health and Suicide Prevention, Veterans Health Administration, Nashville, TN 37212, United States
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Glenn T Gobbel
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- John Scott
- Department of Veterans Affairs, Clinical Informatics and Data Management Office, Veterans Health Administration, Washington, DC 20571, United States
- Jodie A Trafton
- Department of Veterans Affairs, Office of Mental Health and Suicide Prevention, Program Evaluation Resource Center, Palo Alto, CA 94304, United States
- Tamar H Taddei
- Department of Veterans Affairs, West Haven, CT 06516, United States
- Yale School of Medicine, New Haven, CT 06510, United States
- Joseph Erdos
- Department of Veterans Affairs, West Haven, CT 06516, United States
- Yale School of Medicine, New Haven, CT 06510, United States
- Suzanne R Tamang
- Division of Immunology and Rheumatology, Department of Medicine, Stanford University, Stanford, CA 94304, United States
- Department of Veterans Affairs, Office of Mental Health and Suicide Prevention, Program Evaluation Resource Center, Palo Alto, CA 94304, United States
4
Unlu O, Shin J, Mailly CJ, Oates MF, Tucci MR, Varugheese M, Wagholikar K, Wang F, Scirica BM, Blood AJ, Aronson SJ. Retrieval Augmented Generation Enabled Generative Pre-Trained Transformer 4 (GPT-4) Performance for Clinical Trial Screening. medRxiv [Preprint] 2024:2024.02.08.24302376. [PMID: 38370719] [PMCID: PMC10871450] [DOI: 10.1101/2024.02.08.24302376]
Abstract
Background Subject screening is a key aspect of all clinical trials; however, it has traditionally been a labor-intensive and error-prone task, demanding significant time and resources. With the advent of large language models (LLMs) and related technologies, a paradigm shift in natural language processing capabilities offers a promising avenue for increasing both the quality and efficiency of screening efforts. This study aimed to test whether a Retrieval-Augmented Generation (RAG)-enabled Generative Pre-trained Transformer 4 (GPT-4) workflow can accurately identify and report on inclusion and exclusion criteria for a clinical trial. Methods The Co-Operative Program for Implementation of Optimal Therapy in Heart Failure (COPILOT-HF) trial aims to recruit patients with symptomatic heart failure. As part of the screening process, a list of potentially eligible patients is created through an electronic health record (EHR) query. Currently, structured data in the EHR can only be used to determine 5 of 6 inclusion and 5 of 17 exclusion criteria. Trained, but non-licensed, study staff complete manual chart review to determine patient eligibility and record their assessment of the inclusion and exclusion criteria. We obtained the structured assessments completed by the study staff and clinical notes for the past two years and developed a clinical note-based question-answering workflow powered by a RAG architecture and GPT-4, which we named RECTIFIER (RAG-Enabled Clinical Trial Infrastructure for Inclusion Exclusion Review). We used notes from 100 patients as a development dataset, 282 patients as a validation dataset, and 1894 patients as a test set. An expert clinician completed a blinded review of patients' charts to answer the eligibility questions and determine the "gold standard" answers. We calculated the sensitivity, specificity, accuracy, and Matthews correlation coefficient (MCC) for each question and screening method, and performed bootstrapping to calculate confidence intervals for each statistic. Results Both RECTIFIER and study staff answers closely aligned with the expert clinician answers across criteria, with accuracy ranging between 97.9% and 100% (MCC 0.837 and 1) for RECTIFIER and 91.7% and 100% (MCC 0.644 and 1) for study staff. RECTIFIER performed better than study staff in determining the inclusion criterion of "symptomatic heart failure," with an accuracy of 97.9% vs 91.7% and an MCC of 0.924 vs 0.721, respectively. Overall, the sensitivity and specificity of eligibility determination were 92.3% (CI) and 93.9% (CI) for RECTIFIER and 90.1% (CI) and 83.6% (CI) for study staff, respectively. Conclusion GPT-4 based solutions have the potential to improve efficiency and reduce costs in clinical trial screening. When incorporating new tools such as RECTIFIER, it is important to consider the potential hazards of automating the screening process and to set up appropriate mitigation strategies, such as final clinician review before patient engagement.
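The evaluation metrics named above all follow from a 2x2 confusion matrix. A minimal sketch on invented binary labels (1 = eligible), not the study's data:

```python
import math

def screening_metrics(y_true, y_pred):
    """Sensitivity, specificity, accuracy, and Matthews correlation coefficient."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # true positives
    tn = sum(1 for t, p in pairs if not t and not p)  # true negatives
    fp = sum(1 for t, p in pairs if not t and p)      # false positives
    fn = sum(1 for t, p in pairs if t and not p)      # false negatives
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    acc = (tp + tn) / len(pairs)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sens, spec, acc, mcc

gold = [1, 1, 1, 0, 0, 0, 1, 0]    # hypothetical clinician gold-standard answers
screen = [1, 1, 0, 0, 0, 1, 1, 0]  # hypothetical screener answers
sens, spec, acc, mcc = screening_metrics(gold, screen)
```

Bootstrap confidence intervals, as used in the study, would come from recomputing these metrics over resampled patient sets.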
Affiliation(s)
- Ozan Unlu
- Accelerator for Clinical Transformation, Brigham and Women's Hospital, Boston, MA
- Division of Cardiovascular Medicine, Brigham and Women's Hospital, Boston, MA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA
- Harvard Medical School, Boston, MA
- Jiyeon Shin
- Accelerator for Clinical Transformation, Brigham and Women's Hospital, Boston, MA
- Mass General Brigham Personalized Medicine, Cambridge, MA
- Charlotte J Mailly
- Accelerator for Clinical Transformation, Brigham and Women's Hospital, Boston, MA
- Mass General Brigham Personalized Medicine, Cambridge, MA
- Michael F Oates
- Accelerator for Clinical Transformation, Brigham and Women's Hospital, Boston, MA
- Mass General Brigham Personalized Medicine, Cambridge, MA
- Michela R Tucci
- Accelerator for Clinical Transformation, Brigham and Women's Hospital, Boston, MA
- Matthew Varugheese
- Accelerator for Clinical Transformation, Brigham and Women's Hospital, Boston, MA
- Kavishwar Wagholikar
- Accelerator for Clinical Transformation, Brigham and Women's Hospital, Boston, MA
- Research Information Science and Computing, Mass General Brigham, Somerville, MA
- Fei Wang
- Accelerator for Clinical Transformation, Brigham and Women's Hospital, Boston, MA
- Mass General Brigham Personalized Medicine, Cambridge, MA
- Benjamin M Scirica
- Accelerator for Clinical Transformation, Brigham and Women's Hospital, Boston, MA
- Division of Cardiovascular Medicine, Brigham and Women's Hospital, Boston, MA
- Harvard Medical School, Boston, MA
- Alexander J Blood
- Accelerator for Clinical Transformation, Brigham and Women's Hospital, Boston, MA
- Division of Cardiovascular Medicine, Brigham and Women's Hospital, Boston, MA
- Harvard Medical School, Boston, MA
- Samuel J Aronson
- Accelerator for Clinical Transformation, Brigham and Women's Hospital, Boston, MA
- Mass General Brigham Personalized Medicine, Cambridge, MA
5
Teh BW, Mikulska M, Averbuch D, de la Camara R, Hirsch HH, Akova M, Ostrosky-Zeichner L, Baddley JW, Tan BH, Mularoni A, Subramanian AK, La Hoz RM, Marinelli T, Boan P, Aguado JM, Grossi PA, Maertens J, Mueller NJ, Slavin MA. Consensus position statement on advancing the standardised reporting of infection events in immunocompromised patients. Lancet Infect Dis 2024; 24:e59-e68. [PMID: 37683684] [DOI: 10.1016/s1473-3099(23)00377-8]
Abstract
Patients can be immunocompromised by a diverse range of disease and treatment factors, including malignancies, autoimmune disorders and their treatments, and organ and stem-cell transplantation. Infections are a leading cause of morbidity and mortality in immunocompromised patients, and the disease treatment landscape is continually evolving. Although infections are critical but often preventable and curable adverse events, the reporting of infection events in randomised trials lacks sufficient detail, and inconsistent categorisation and definition of infections in observational and registry studies limit comparability and future pooling of data. A core reporting dataset consisting of category, site, severity, organism, and endpoints was developed as a minimum standard for reporting of infection events in immunocompromised patients across study types. Further additional information is recommended depending on study type. The standardised reporting of infectious events and attributable complications in immunocompromised patients will improve diagnostic, treatment, and prevention approaches and facilitate future research in this patient group.
Affiliation(s)
- Benjamin W Teh
- Department of Infectious Diseases, Peter MacCallum Cancer Centre, Melbourne, VIC, Australia; Sir Peter MacCallum Department of Oncology, University of Melbourne, VIC, Australia.
- Malgorzata Mikulska
- Division of Infectious Diseases, Department of Health Sciences, University of Genoa, Genoa, Italy; IRCCS Ospedale Policlinico San Martino, Genoa, Italy
- Dina Averbuch
- Pediatric Infectious Diseases, Faculty of Medicine, Hebrew University of Jerusalem, Jerusalem, Israel; Hadassah Medical Center, Jerusalem, Israel
- Hans H Hirsch
- Transplantation & Clinical Virology, Department of Biomedicine, University of Basel, Basel, Switzerland; Infectious Diseases & Hospital Epidemiology, University Hospital Basel, Basel, Switzerland
- Murat Akova
- Department of Infectious Diseases, Hacettepe University School of Medicine, Ankara, Turkey
- Luis Ostrosky-Zeichner
- Division of Infectious Diseases, McGovern Medical School, University of Texas, Houston, TX, USA
- John W Baddley
- Department of Medicine, Division of Infectious Diseases, University of Maryland School of Medicine, Baltimore, MD, USA
- Ban Hock Tan
- Department of Infectious Diseases, Singapore General Hospital, Singapore
- Alessandra Mularoni
- Department of Infectious Diseases, Istituto Mediterraneo per i Trapianti e Terapie ad Alta Specializzazione (IRCCS), Palermo, Italy
- Aruna K Subramanian
- Division of Infectious Diseases and Geographic Medicine, Stanford University School of Medicine, Stanford, CA, USA
- Ricardo M La Hoz
- Division of Infectious Diseases and Geographic Medicine, University of Texas Southwestern Medical Center, Dallas, TX, USA
- Tina Marinelli
- Department of Infectious Diseases, Royal Prince Alfred Hospital, Sydney, NSW, Australia
- Peter Boan
- Department of Infectious Diseases, Fiona Stanley Hospital, Murdoch, WA, Australia; Department of Microbiology, PathWest Laboratory Medicine WA, Fiona Stanley Hospital, Murdoch, WA, Australia
- Jose Maria Aguado
- Unit of Infectious Diseases, Hospital Universitario "12 de Octubre", Instituto de Investigación Sanitaria Hospital "12 de Octubre" (imas12), CIBERINFEC, Universidad Complutense, Madrid, Spain
- Paolo A Grossi
- Infectious and Tropical Diseases Unit, Department of Medicine and Surgery, University of Insubria-ASST-Sette Laghi, Varese, Italy
- Johan Maertens
- Department of Haematology, Universitaire Ziekenhuizen Leuven, Leuven, Belgium
- Nicolas J Mueller
- Division of Infectious Diseases and Hospital Epidemiology, University Hospital Zurich, Zürich, Switzerland
- Monica A Slavin
- Department of Infectious Diseases, Peter MacCallum Cancer Centre, Melbourne, VIC, Australia; Sir Peter MacCallum Department of Oncology, University of Melbourne, VIC, Australia; Victorian Infectious Diseases Service, Royal Melbourne Hospital, Parkville, VIC, Australia
6
Schopow N, Osterhoff G, Baur D. Applications of the Natural Language Processing Tool ChatGPT in Clinical Practice: Comparative Study and Augmented Systematic Review. JMIR Med Inform 2023; 11:e48933. [PMID: 38015610] [DOI: 10.2196/48933]
Abstract
BACKGROUND This research integrates a comparative analysis of the performance of human researchers and OpenAI's ChatGPT in systematic review tasks and describes an assessment of the application of natural language processing (NLP) models in clinical practice through a review of 5 studies. OBJECTIVE This study aimed to evaluate the agreement between ChatGPT and human researchers in extracting key information from clinical articles and to investigate the practical use of NLP in clinical settings as evidenced by the selected studies. METHODS The study design comprised a systematic review of clinical articles executed independently by human researchers and ChatGPT. The level of agreement between and within raters for parameter extraction was assessed using the Fleiss and Cohen κ statistics. RESULTS The comparative analysis revealed a high degree of concordance between ChatGPT and human researchers for most parameters, with less agreement for study design, clinical task, and clinical implementation. The review identified 5 studies that demonstrated the diverse applications of NLP in clinical settings. Their findings highlight the potential of NLP to improve clinical efficiency and patient outcomes in various contexts, from enhancing allergy detection and classification to improving quality metrics in psychotherapy treatments for veterans with posttraumatic stress disorder. CONCLUSIONS Our findings underscore the potential of NLP models, including ChatGPT, in performing systematic reviews and other clinical tasks. Despite certain limitations, NLP models present a promising avenue for enhancing health care efficiency and accuracy. Future studies should focus on broadening the range of clinical applications and exploring the ethical considerations of implementing NLP applications in health care settings.
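Cohen's κ, used above for two-rater agreement, corrects raw agreement for the agreement expected by chance. A minimal sketch on invented extraction labels (the study's actual parameters and ratings are not reproduced here):

```python
from collections import Counter

def cohens_kappa(r1, r2):
    """Cohen's kappa for two raters' categorical labels on the same items."""
    n = len(r1)
    p_obs = sum(a == b for a, b in zip(r1, r2)) / n            # observed agreement
    c1, c2 = Counter(r1), Counter(r2)
    p_exp = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / (n * n)  # chance agreement
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical yes/no extraction judgments by a human rater and ChatGPT
human = ["yes", "yes", "no", "yes", "no", "no"]
chatgpt = ["yes", "no", "no", "yes", "no", "yes"]
kappa = cohens_kappa(human, chatgpt)  # 4/6 observed agreement, 1/2 by chance
```

Fleiss κ, also cited in the abstract, generalizes the same idea to more than two raters.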
Affiliation(s)
- Nikolas Schopow
- Department for Orthopedics, Trauma Surgery and Plastic Surgery, University Hospital Leipzig, Leipzig, Germany
- Georg Osterhoff
- Department for Orthopedics, Trauma Surgery and Plastic Surgery, University Hospital Leipzig, Leipzig, Germany
- David Baur
- Department for Orthopedics, Trauma Surgery and Plastic Surgery, University Hospital Leipzig, Leipzig, Germany
7
King AJ, Angus DC, Cooper GF, Mowery DL, Seaman JB, Potter KM, Bukowski LA, Al-Khafaji A, Gunn SR, Kahn JM. A voice-based digital assistant for intelligent prompting of evidence-based practices during ICU rounds. J Biomed Inform 2023; 146:104483. [PMID: 37657712] [PMCID: PMC10591951] [DOI: 10.1016/j.jbi.2023.104483]
Abstract
OBJECTIVE To evaluate the technical feasibility and potential value of a digital assistant that prompts intensive care unit (ICU) rounding teams to use evidence-based practices based on analysis of their real-time discussions. METHODS We evaluated a novel voice-based digital assistant that audio-records and processes the ICU care team's rounding discussions to determine which evidence-based practices are applicable to the patient but have yet to be addressed by the team. The system then prompts the team to consider indicated but not yet delivered practices, thereby reducing cognitive burden compared with traditional rigid rounding checklists. In a retrospective analysis, we applied automatic transcription, natural language processing, and a rule-based expert system to generate personalized prompts for each patient in 106 audio-recorded ICU rounding discussions. To assess technical feasibility, we compared the system's prompts to those created by experienced critical care nurses who directly observed rounds. To assess potential value, we also compared the system's prompts to a hypothetical paper checklist containing all evidence-based practices. RESULTS The positive predictive value, negative predictive value, true positive rate, and true negative rate of the system's prompts were 0.45 ± 0.06, 0.83 ± 0.04, 0.68 ± 0.07, and 0.66 ± 0.04, respectively. If implemented in lieu of a paper checklist, the system would generate 56% fewer prompts per patient, with 50% ± 17% greater precision. CONCLUSION A voice-based digital assistant can reduce prompts per patient compared with traditional approaches for improving evidence uptake on ICU rounds. Additional work is needed to evaluate field performance and team acceptance.
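Once the NLP pipeline has labeled which practices are applicable and which the team has already discussed, the prompting step described above reduces to a set difference. The practice names below are illustrative, not the system's actual vocabulary:

```python
def pending_prompts(applicable, addressed):
    """Evidence-based practices indicated for the patient but not yet discussed on rounds."""
    return sorted(set(applicable) - set(addressed))

# Hypothetical outputs of the applicability rules and the transcript analysis
applicable = {"spontaneous breathing trial", "sedation interruption",
              "DVT prophylaxis", "head-of-bed elevation"}
addressed = {"DVT prophylaxis", "head-of-bed elevation"}

prompts = pending_prompts(applicable, addressed)
# prompts == ['sedation interruption', 'spontaneous breathing trial']
```

The contrast with a paper checklist is that the checklist would surface all four items regardless of what the team already covered; the assistant prompts only the remainder.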
Affiliation(s)
- Andrew J King
- Department of Critical Care Medicine, University of Pittsburgh School of Medicine, Scaife Hall Suite 600, 3550 Terrace Street, Pittsburgh, PA 15261, USA.
- Derek C Angus
- Department of Critical Care Medicine, University of Pittsburgh School of Medicine, Scaife Hall Suite 600, 3550 Terrace Street, Pittsburgh, PA 15261, USA.
- Gregory F Cooper
- Department of Biomedical Informatics, University of Pittsburgh School of Medicine, Offices at Baum 4th Floor, 5607 Baum Blvd, Pittsburgh, PA 15206, USA.
- Danielle L Mowery
- Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania School of Medicine, Blockley Hall 8th Floor, 423 Guardian Drive, Philadelphia, PA 19104, USA.
- Jennifer B Seaman
- Department of Acute & Tertiary Care, University of Pittsburgh School of Nursing, 336 Victoria Building, 3500 Victoria Street, Pittsburgh, PA 15261, USA.
- Kelly M Potter
- Department of Critical Care Medicine, University of Pittsburgh School of Medicine, Scaife Hall Suite 600, 3550 Terrace Street, Pittsburgh, PA 15261, USA.
- Leigh A Bukowski
- Department of Critical Care Medicine, University of Pittsburgh School of Medicine, Scaife Hall Suite 600, 3550 Terrace Street, Pittsburgh, PA 15261, USA.
- Ali Al-Khafaji
- Department of Critical Care Medicine, University of Pittsburgh School of Medicine, Scaife Hall Suite 600, 3550 Terrace Street, Pittsburgh, PA 15261, USA.
- Scott R Gunn
- Department of Critical Care Medicine, University of Pittsburgh School of Medicine, Scaife Hall Suite 600, 3550 Terrace Street, Pittsburgh, PA 15261, USA.
- Jeremy M Kahn
- Department of Critical Care Medicine, University of Pittsburgh School of Medicine, Scaife Hall Suite 600, 3550 Terrace Street, Pittsburgh, PA 15261, USA.
8
Gao Y, Dligach D, Miller T, Churpek MM, Afshar M. Overview of the Problem List Summarization (ProbSum) 2023 Shared Task on Summarizing Patients' Active Diagnoses and Problems from Electronic Health Record Progress Notes. Proc Conf Assoc Comput Linguist Meet 2023; 2023:461-467. [PMID: 37583489] [PMCID: PMC10426335] [DOI: 10.18653/v1/2023.bionlp-1.43]
Abstract
The BioNLP Workshop 2023 launched a shared task on Problem List Summarization (ProbSum) in January 2023. The aim of this shared task is to attract future research efforts in building NLP models for real-world diagnostic decision support applications, where a system generating relevant and accurate diagnoses will augment the healthcare providers' decision-making process and improve the quality of care for patients. The goal for participants was to develop models that generate a list of diagnoses and problems using input from the daily care notes collected during the hospitalization of critically ill patients. Eight teams submitted their final systems to the shared task leaderboard. In this paper, we describe the tasks, datasets, evaluation metrics, and baseline systems, and summarize the techniques and evaluation results of the approaches tried by the participating teams.
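Summarization shared tasks of this kind are commonly scored with ROUGE-style overlap metrics. The sketch below computes a ROUGE-L-like F-score from the longest common subsequence of tokens; it illustrates the metric family only and is not the official ProbSum scorer.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f(candidate, reference):
    """ROUGE-L-style F1 between a generated and a reference problem list."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

score = rouge_l_f("acute respiratory failure sepsis",
                  "sepsis acute respiratory failure")  # LCS of 3 of 4 tokens -> 0.75
```

Because ROUGE-L rewards in-order overlap, a generated problem list that names the right diagnoses in a different order still receives partial credit.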
9
Gao Y, Dligach D, Miller T, Churpek MM, Uzuner O, Afshar M. Progress Note Understanding - Assessment and Plan Reasoning: Overview of the 2022 N2C2 Track 3 shared task. J Biomed Inform 2023; 142:104346. [PMID: 37061012] [PMCID: PMC11178099] [DOI: 10.1016/j.jbi.2023.104346]
Abstract
Daily progress notes are a common note type in the electronic health record (EHR) in which healthcare providers document the patient's daily progress and treatment plans. The EHR is designed to document all the care provided to patients, but it also enables note bloat with extraneous information that distracts from the diagnoses and treatment plans. Applications of natural language processing (NLP) in the EHR are a growing field, with the majority of methods focused on information extraction. Few tasks use NLP methods for downstream diagnostic decision support. We introduced the 2022 National NLP Clinical Challenge (N2C2) Track 3: Progress Note Understanding - Assessment and Plan Reasoning as one step towards a new suite of tasks. The Assessment and Plan Reasoning task focuses on the most critical components of progress notes: the Assessment and Plan subsections, where health problems and diagnoses are documented. The goal of the task was to develop and evaluate NLP systems that automatically predict causal relations between the overall status of the patient, contained in the Assessment section, and each component of the Plan section, which contains the diagnoses and treatment plans. Identifying and prioritizing diagnoses in this way is a first step in diagnostic decision support for finding the most relevant information in long documents like daily progress notes. We present the results of the 2022 N2C2 Track 3 and provide a description of the data, evaluation, participation, and system performance.
Affiliation(s)
- Yanjun Gao
- ICU Data Science Lab, Department of Medicine, University of Wisconsin Madison, United States of America.
- Dmitriy Dligach
- Department of Computer Science, Loyola University Chicago, United States of America
- Timothy Miller
- Boston Children's Hospital, Harvard University, United States of America
- Matthew M Churpek
- ICU Data Science Lab, Department of Medicine, University of Wisconsin Madison, United States of America
- Ozlem Uzuner
- Department of Information Sciences and Technology, George Mason University, United States of America
- Majid Afshar
- ICU Data Science Lab, Department of Medicine, University of Wisconsin Madison, United States of America
10
Afshar M, Adelaine S, Resnik F, Mundt MP, Long J, Leaf M, Ampian T, Wills GJ, Schnapp B, Chao M, Brown R, Joyce C, Sharma B, Dligach D, Burnside ES, Mahoney J, Churpek MM, Patterson BW, Liao F. Deployment of Real-time Natural Language Processing and Deep Learning Clinical Decision Support in the Electronic Health Record: Pipeline Implementation for an Opioid Misuse Screener in Hospitalized Adults. JMIR Med Inform 2023; 11:e44977. [PMID: 37079367] [PMCID: PMC10160938] [DOI: 10.2196/44977]
Abstract
BACKGROUND The clinical narrative in electronic health records (EHRs) carries valuable information for predictive analytics; however, its free-text form is difficult to mine and analyze for clinical decision support (CDS). Large-scale clinical natural language processing (NLP) pipelines have focused on data warehouse applications for retrospective research efforts. There remains a paucity of evidence for implementing NLP pipelines at the bedside for health care delivery. OBJECTIVE We aimed to detail a hospital-wide, operational pipeline to implement a real-time NLP-driven CDS tool and describe a protocol for an implementation framework with a user-centered design of the CDS tool. METHODS The pipeline integrated a previously trained open-source convolutional neural network model for screening opioid misuse that leveraged EHR notes mapped to standardized medical vocabularies in the Unified Medical Language System. A sample of 100 adult encounters was reviewed by a physician informaticist for silent testing of the deep learning algorithm before deployment. An end-user interview survey was developed to examine user acceptability of a best practice alert (BPA) that provides the screening results with recommendations. The planned implementation also included a human-centered design with user feedback on the BPA, an implementation framework with a cost-effectiveness analysis, and a noninferiority patient outcome analysis plan. RESULTS The pipeline was a reproducible workflow with shared pseudocode for a cloud service to ingest, process, and store clinical notes as Health Level 7 messages from a major EHR vendor in an elastic cloud computing environment. Feature engineering of the notes used an open-source NLP engine, and the features were fed into the deep learning algorithm, with the results returned as a BPA in the EHR.
On-site silent testing of the deep learning algorithm demonstrated a sensitivity of 93% (95% CI 66%-99%) and specificity of 92% (95% CI 84%-96%), similar to published validation studies. Before deployment, approvals were received across hospital committees for inpatient operations. Five interviews were conducted; they informed the development of an educational flyer and further modified the BPA to exclude certain patients and allow the refusal of recommendations. The longest delay in pipeline development was due to cybersecurity approvals, particularly the exchange of protected health information between the Microsoft (Microsoft Corp) and Epic (Epic Systems Corp) cloud vendors. In silent testing, the resultant pipeline delivered a BPA to the bedside within minutes of a provider entering a note in the EHR. CONCLUSIONS The components of the real-time NLP pipeline were detailed with open-source tools and pseudocode for other health systems to benchmark. The deployment of medical artificial intelligence systems in routine clinical care presents an important yet unfulfilled opportunity, and our protocol aims to close the gap in the implementation of artificial intelligence-driven CDS. TRIAL REGISTRATION ClinicalTrials.gov NCT05745480; https://www.clinicaltrials.gov/ct2/show/NCT05745480.
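Operating characteristics like those reported for the silent test come directly from a 2x2 confusion matrix. A minimal sketch, with hypothetical counts chosen only so the arithmetic lands near the reported 93%/92% on a 100-encounter sample (these are not the study's actual counts):

```python
def sens_spec(tp: int, fn: int, tn: int, fp: int) -> tuple[float, float]:
    """Sensitivity = TP/(TP+FN); specificity = TN/(TN+FP)."""
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical 2x2 counts for a 100-encounter silent test
sensitivity, specificity = sens_spec(tp=13, fn=1, tn=79, fp=7)
```

The wide sensitivity CI in the abstract (66%-99%) reflects how few true positives a 100-encounter sample contains, which is why silent testing is usually followed by larger prospective evaluation.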
Affiliation(s)
- Majid Afshar
- University of Wisconsin - Madison, Madison, WI, United States
- Felice Resnik
- University of Wisconsin - Madison, Madison, WI, United States
- Marlon P Mundt
- University of Wisconsin - Madison, Madison, WI, United States
- John Long
- University of Wisconsin - Madison, Madison, WI, United States
- Margaret Leaf
- University of Wisconsin - Madison, Madison, WI, United States
- Theodore Ampian
- University of Wisconsin - Madison, Madison, WI, United States
- Graham J Wills
- University of Wisconsin - Madison, Madison, WI, United States
- Michael Chao
- University of Wisconsin - Madison, Madison, WI, United States
- Randy Brown
- University of Wisconsin - Madison, Madison, WI, United States
- Cara Joyce
- Loyola University Chicago, Chicago, IL, United States
- Brihat Sharma
- University of Wisconsin - Madison, Madison, WI, United States
- Jane Mahoney
- University of Wisconsin - Madison, Madison, WI, United States
- Frank Liao
- University of Wisconsin - Madison, Madison, WI, United States
11
Gao Y, Dligach D, Miller T, Caskey J, Sharma B, Churpek MM, Afshar M. DR.BENCH: Diagnostic Reasoning Benchmark for Clinical Natural Language Processing. J Biomed Inform 2023; 138:104286. [PMID: 36706848] [PMCID: PMC9993808] [DOI: 10.1016/j.jbi.2023.104286]
Abstract
The meaningful use of electronic health records (EHR) continues to progress in the digital era with clinical decision support systems augmented by artificial intelligence. A priority in improving the provider experience is to overcome information overload and reduce cognitive burden so that fewer medical errors and cognitive biases are introduced during patient care. One major type of medical error is diagnostic error due to systematic or predictable errors in judgment that rely on heuristics. The potential for clinical natural language processing (cNLP) to model diagnostic reasoning in humans with forward reasoning from data to diagnosis, and thereby reduce cognitive burden and medical error, has not been investigated. Existing tasks to advance the science in cNLP have largely focused on information extraction and named entity recognition through classification tasks. We introduce a novel suite of tasks, coined the Diagnostic Reasoning Benchmark (DR.BENCH), as a new benchmark for developing and evaluating cNLP models with clinical diagnostic reasoning ability. The suite includes six tasks from ten publicly available datasets addressing clinical text understanding, medical knowledge reasoning, and diagnosis generation. DR.BENCH is the first clinical suite of tasks designed as a natural language generation framework to evaluate pre-trained language models for diagnostic reasoning. The goal of DR.BENCH is to advance the science in cNLP to support downstream applications in computerized diagnostic decision support and improve the efficiency and accuracy of healthcare providers during patient care. We fine-tune and evaluate state-of-the-art generative models on DR.BENCH. Experiments show that, even with domain-adaptation pre-training on medical knowledge, models demonstrate substantial opportunities for improvement when evaluated on DR.BENCH. We share DR.BENCH as a publicly available GitLab repository with a systematic approach to load and evaluate models for the cNLP community. We also discuss the carbon footprint produced during the experiments and encourage future work on DR.BENCH to report the carbon footprint.
Affiliation(s)
- Yanjun Gao
- ICU Data Science Lab, Department of Medicine, University of Wisconsin Madison, 1685 Highland Ave, Madison, 53792, WI, USA.
- Dmitriy Dligach
- Department of Computer Science, Loyola University Chicago, 1032 W Sheridan Rd, Chicago, 60660, IL, USA
- Timothy Miller
- Boston Children's Hospital, Harvard University, 300 Longwood Ave, Boston, 02115, MA, USA
- John Caskey
- ICU Data Science Lab, Department of Medicine, University of Wisconsin Madison, 1685 Highland Ave, Madison, 53792, WI, USA
- Brihat Sharma
- ICU Data Science Lab, Department of Medicine, University of Wisconsin Madison, 1685 Highland Ave, Madison, 53792, WI, USA
- Matthew M Churpek
- ICU Data Science Lab, Department of Medicine, University of Wisconsin Madison, 1685 Highland Ave, Madison, 53792, WI, USA
- Majid Afshar
- ICU Data Science Lab, Department of Medicine, University of Wisconsin Madison, 1685 Highland Ave, Madison, 53792, WI, USA
12
Taira RK, Garlid AO, Speier W. Design considerations for a hierarchical semantic compositional framework for medical natural language understanding. PLoS One 2023; 18:e0282882. [PMID: 36928721] [PMCID: PMC10019629] [DOI: 10.1371/journal.pone.0282882]
Abstract
Medical natural language processing (NLP) systems are a key enabling technology for transforming Big Data from clinical report repositories into information used to support disease models and validate intervention methods. However, current medical NLP systems fall considerably short when faced with the task of logically interpreting clinical text. In this paper, we describe a framework inspired by mechanisms of human cognition in an attempt to jump the NLP performance curve. The design centers on a hierarchical semantic compositional model (HSCM), which provides an internal substrate for guiding the interpretation process. The paper describes insights from four key cognitive aspects: semantic memory, semantic composition, semantic activation, and hierarchical predictive coding. We discuss the design of a generative semantic model and an associated semantic parser used to transform a free-text sentence into a logical representation of its meaning. The paper discusses supportive and antagonistic arguments for the key features of the architecture as a long-term foundational framework.
Affiliation(s)
- Ricky K. Taira
- Medical and Imaging Informatics (MII) Group, Department of Radiological Sciences, University of California, Los Angeles, Los Angeles, California, United States of America
- Anders O. Garlid
- Medical and Imaging Informatics (MII) Group, Department of Radiological Sciences, University of California, Los Angeles, Los Angeles, California, United States of America
- William Speier
- Medical and Imaging Informatics (MII) Group, Department of Radiological Sciences, University of California, Los Angeles, Los Angeles, California, United States of America
- Department of Bioengineering, University of California, Los Angeles, Los Angeles, California, United States of America
13
Joyce C, Markossian TW, Nikolaides J, Ramsey E, Thompson HM, Rojas JC, Sharma B, Dligach D, Oguss MK, Cooper RS, Afshar M. The Evaluation of a Clinical Decision Support Tool Using Natural Language Processing to Screen Hospitalized Adults for Unhealthy Substance Use: Protocol for a Quasi-Experimental Design. JMIR Res Protoc 2022; 11:e42971. [PMID: 36534461] [PMCID: PMC9808720] [DOI: 10.2196/42971]
Abstract
BACKGROUND Automated and data-driven methods for screening using natural language processing (NLP) and machine learning may replace resource-intensive manual approaches in the usual care of patients hospitalized with conditions related to unhealthy substance use. The rigorous evaluation of tools that use artificial intelligence (AI) is necessary to demonstrate effectiveness before system-wide implementation. An NLP tool that uses routinely collected data in the electronic health record was previously validated for diagnostic accuracy in a retrospective study of screening for unhealthy substance use. Our next step is a noninferiority design incorporated into a research protocol for clinical implementation with prospective evaluation of clinical effectiveness in a large health system. OBJECTIVE This study aims to provide a study protocol to evaluate health outcomes and the costs and benefits of an AI-driven automated screener compared to manual human screening for unhealthy substance use. METHODS A pre-post design is proposed to evaluate 12 months of manual screening followed by 12 months of automated screening across surgical and medical wards at a single medical center. The preintervention period consists of usual care with manual screening by nurses and social workers and referrals to a multidisciplinary Substance Use Intervention Team (SUIT). Facilitated by an NLP pipeline in the postintervention period, clinical notes from the first 24 hours of hospitalization will be processed and scored by a machine learning model, and the SUIT will be similarly alerted to patients who flag positive for substance misuse. Flowsheets within the electronic health record have been updated to capture rates of interventions for the primary outcome (brief intervention/motivational interviewing, medication-assisted treatment, naloxone dispensing, and referral to outpatient care).
Effectiveness in terms of patient outcomes will be determined by noninferior rates of interventions (primary outcome), as well as rates of readmission within 6 months, average time to consult, and rates of discharge against medical advice (secondary outcomes), in the postintervention period compared to the preintervention period. A separate analysis will assess the costs and benefits to the health system of using automated screening. Changes from the pre- to postintervention period will be assessed in covariate-adjusted generalized linear mixed-effects models. RESULTS The study will begin in September 2022. Monthly data monitoring and Data Safety Monitoring Board reporting are scheduled every 6 months throughout the study period. We anticipate reporting final results by June 2025. CONCLUSIONS The use of augmented intelligence for clinical decision support is growing, with an increasing number of AI tools. We provide a research protocol for the prospective evaluation of an automated NLP system for screening for unhealthy substance use, using a noninferiority design to demonstrate that comprehensive automated screening may be as effective as manual screening but less costly. TRIAL REGISTRATION ClinicalTrials.gov NCT03833804; https://clinicaltrials.gov/ct2/show/NCT03833804. INTERNATIONAL REGISTERED REPORT IDENTIFIER (IRRID) DERR1-10.2196/42971.
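A noninferiority comparison of intervention rates is commonly framed as a one-sided test of the difference in two proportions against a prespecified margin. A minimal sketch of that arithmetic, in which the counts and the 10-percentage-point margin are hypothetical illustrations rather than values from this protocol:

```python
import math

def noninferiority_z(x_new: int, n_new: int, x_old: int, n_old: int, margin: float) -> float:
    """One-sided z statistic for H0: p_new <= p_old - margin.

    Rejecting H0 (z > z_alpha) supports non-inferiority of the new screener.
    """
    p_new, p_old = x_new / n_new, x_old / n_old
    # Unpooled standard error of the difference in proportions
    se = math.sqrt(p_new * (1 - p_new) / n_new + p_old * (1 - p_old) / n_old)
    return (p_new - p_old + margin) / se

# Hypothetical intervention counts: 180/400 post vs. 190/400 pre, 10-point margin
z = noninferiority_z(x_new=180, n_new=400, x_old=190, n_old=400, margin=0.10)
# z > 1.645 would reject H0 at one-sided alpha = 0.05
```

The protocol's actual analysis uses covariate-adjusted generalized linear mixed-effects models rather than this unadjusted test; the sketch only shows why a slightly lower post-period rate can still satisfy noninferiority when it stays inside the margin.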
Affiliation(s)
- Cara Joyce
- Department of Computer Science, Loyola University Chicago, Chicago, IL, United States
- Talar W Markossian
- Department of Public Health Sciences, Loyola University Chicago, Maywood, IL, United States
- Jenna Nikolaides
- Department of Psychiatry, Rush University Medical Center, Chicago, IL, United States
- Elisabeth Ramsey
- Department of Psychiatry, Rush University Medical Center, Chicago, IL, United States
- Hale M Thompson
- Department of Psychiatry, Rush University Medical Center, Chicago, IL, United States
- Juan C Rojas
- Department of Psychiatry, Rush University Medical Center, Chicago, IL, United States
- Brihat Sharma
- Department of Psychiatry, Rush University Medical Center, Chicago, IL, United States
- Dmitriy Dligach
- Department of Computer Science, Loyola University Chicago, Chicago, IL, United States
- Madeline K Oguss
- Department of Medicine, University of Wisconsin-Madison, Madison, WI, United States
- Richard S Cooper
- Department of Public Health Sciences, Loyola University Chicago, Maywood, IL, United States
- Majid Afshar
- Department of Medicine, University of Wisconsin-Madison, Madison, WI, United States