1
Akhondi-Asl A, Yang Y, Luchette M, Burns JP, Mehta NM, Geva A. Comparing the Quality of Domain-Specific Versus General Language Models for Artificial Intelligence-Generated Differential Diagnoses in PICU Patients. Pediatr Crit Care Med 2024; 25:e273-e282. [PMID: 38329382] [DOI: 10.1097/pcc.0000000000003468]
Abstract
OBJECTIVES Generative language models (LMs) are being evaluated in a variety of tasks in healthcare, but pediatric critical care studies are scant. Our objective was to evaluate the utility of generative LMs in the pediatric critical care setting and to determine whether domain-adapted LMs can outperform much larger general-domain LMs in generating a differential diagnosis from the admission notes of PICU patients. DESIGN Single-center retrospective cohort study. SETTING Quaternary 40-bed PICU. PATIENTS Notes from all patients admitted to the PICU between January 2012 and April 2023 were used for model development. One hundred thirty randomly selected admission notes were used for evaluation. INTERVENTIONS None. MEASUREMENTS AND MAIN RESULTS Five experts in critical care used a 5-point Likert scale to independently evaluate the overall quality of differential diagnoses: 1) written by the clinician in the original notes, 2) generated by two general LMs (BioGPT-Large and LLaMa-65B), and 3) generated by two fine-tuned models (fine-tuned BioGPT-Large and fine-tuned LLaMa-7B). Differences among differential diagnoses were compared using mixed methods regression models. We used 1,916,538 notes from 32,454 unique patients for model development and validation. The mean quality scores of the differential diagnoses generated by the clinicians and by fine-tuned LLaMa-7B, the best-performing LM, were 3.43 and 2.88, respectively (absolute difference 0.54 units [95% CI, 0.37-0.72], p < 0.001). Fine-tuned LLaMa-7B performed better than LLaMa-65B (absolute difference 0.23 units [95% CI, 0.06-0.41], p = 0.009) and BioGPT-Large (absolute difference 0.86 units [95% CI, 0.69-1.0], p < 0.001). The differential diagnoses generated by clinicians and by fine-tuned LLaMa-7B were ranked highest in quality in 144 (55%) and 74 (29%) cases, respectively. CONCLUSIONS A smaller LM fine-tuned on notes of PICU patients outperformed much larger models trained on general-domain data. LMs currently remain inferior to clinicians but may serve as an adjunct to human clinicians in real-world tasks using real-world data.
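The study reports mean Likert-score differences with 95% CIs. As a rough illustration of how such an interval can be estimated, the sketch below uses a percentile bootstrap on invented paired ratings; the authors used mixed-model regression, not this method, and all numbers here are hypothetical.

```python
import random
import statistics

def bootstrap_mean_diff_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap 95% CI for mean(a) - mean(b) over paired ratings."""
    rng = random.Random(seed)
    n = len(a)
    boot_means = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample case indices with replacement
        boot_means.append(statistics.mean(a[i] - b[i] for i in idx))
    boot_means.sort()
    lo = boot_means[int(n_boot * alpha / 2)]
    hi = boot_means[int(n_boot * (1 - alpha / 2)) - 1]
    return statistics.mean(x - y for x, y in zip(a, b)), (lo, hi)

# Hypothetical 1-5 Likert quality scores for the same 10 cases
clinician = [4, 3, 4, 5, 3, 4, 2, 4, 3, 4]
model = [3, 3, 4, 4, 2, 3, 2, 3, 3, 3]
diff, (lo, hi) = bootstrap_mean_diff_ci(clinician, model)
```

With these invented scores the point estimate is a 0.6-unit advantage for the clinician-written diagnoses; the interval narrows as the number of rated cases grows.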
Affiliation(s)
- Alireza Akhondi-Asl
- Division of Critical Care Medicine, Department of Anesthesiology, Critical Care and Pain Medicine, Boston Children's Hospital, Boston, MA
- Perioperative and Critical Care-Center for Outcomes (PC-CORE), Boston Children's Hospital, Boston, MA
- Department of Anaesthesia, Harvard Medical School, Boston, MA
- Computational Health Informatics Program, Boston Children's Hospital, Boston, MA
- Youyang Yang
- Division of Critical Care Medicine, Department of Anesthesiology, Critical Care and Pain Medicine, Boston Children's Hospital, Boston, MA
- Perioperative and Critical Care-Center for Outcomes (PC-CORE), Boston Children's Hospital, Boston, MA
- Department of Anaesthesia, Harvard Medical School, Boston, MA
- Computational Health Informatics Program, Boston Children's Hospital, Boston, MA
- Matthew Luchette
- Division of Critical Care Medicine, Department of Anesthesiology, Critical Care and Pain Medicine, Boston Children's Hospital, Boston, MA
- Perioperative and Critical Care-Center for Outcomes (PC-CORE), Boston Children's Hospital, Boston, MA
- Department of Anaesthesia, Harvard Medical School, Boston, MA
- Computational Health Informatics Program, Boston Children's Hospital, Boston, MA
- Jeffrey P Burns
- Division of Critical Care Medicine, Department of Anesthesiology, Critical Care and Pain Medicine, Boston Children's Hospital, Boston, MA
- Perioperative and Critical Care-Center for Outcomes (PC-CORE), Boston Children's Hospital, Boston, MA
- Department of Anaesthesia, Harvard Medical School, Boston, MA
- Computational Health Informatics Program, Boston Children's Hospital, Boston, MA
- Nilesh M Mehta
- Division of Critical Care Medicine, Department of Anesthesiology, Critical Care and Pain Medicine, Boston Children's Hospital, Boston, MA
- Perioperative and Critical Care-Center for Outcomes (PC-CORE), Boston Children's Hospital, Boston, MA
- Alon Geva
- Division of Critical Care Medicine, Department of Anesthesiology, Critical Care and Pain Medicine, Boston Children's Hospital, Boston, MA
- Perioperative and Critical Care-Center for Outcomes (PC-CORE), Boston Children's Hospital, Boston, MA
- Department of Anaesthesia, Harvard Medical School, Boston, MA
- Computational Health Informatics Program, Boston Children's Hospital, Boston, MA
2
Majdik ZP, Graham SS, Shiva Edward JC, Rodriguez SN, Karnes MS, Jensen JT, Barbour JB, Rousseau JF. Sample Size Considerations for Fine-Tuning Large Language Models for Named Entity Recognition Tasks: Methodological Study. JMIR AI 2024; 3:e52095. [PMID: 38875593] [PMCID: PMC11140272] [DOI: 10.2196/52095]
Abstract
BACKGROUND Large language models (LLMs) have the potential to support promising new applications in health informatics. However, practical data on sample size considerations for fine-tuning LLMs to perform specific tasks in biomedical and health policy contexts are lacking. OBJECTIVE This study aims to evaluate sample size and sample selection techniques for fine-tuning LLMs to support improved named entity recognition (NER) for a custom data set of conflicts of interest disclosure statements. METHODS A random sample of 200 disclosure statements was prepared for annotation. All "PERSON" and "ORG" entities were identified by each of the 2 raters, and once appropriate agreement was established, the annotators independently annotated an additional 290 disclosure statements. From the 490 annotated documents, 2500 stratified random samples in different size ranges were drawn. The 2500 training set subsamples were used to fine-tune a selection of language models across 2 model architectures (Bidirectional Encoder Representations from Transformers [BERT] and Generative Pre-trained Transformer [GPT]) for improved NER, and multiple regression was used to assess the relationship between sample size (sentences), entity density (entities per sentence [EPS]), and trained model performance (F1-score). Additionally, single-predictor threshold regression models were used to evaluate the possibility of diminishing marginal returns from increased sample size or entity density. RESULTS Fine-tuned models ranged in topline NER performance from F1-score=0.79 to F1-score=0.96 across architectures. Two-predictor multiple linear regression models were statistically significant, with multiple R2 ranging from 0.6057 to 0.7896 (all P<.001). EPS and the number of sentences were significant predictors of F1-scores in all cases (P<.001), except for the GPT-2_large model, where EPS was not a significant predictor (P=.184). Model thresholds indicate points of diminishing marginal return from increased training data set sample size measured by the number of sentences, with point estimates ranging from 439 sentences for RoBERTa_large to 527 sentences for GPT-2_large. Likewise, the threshold regression models indicate a diminishing marginal return for EPS, with point estimates between 1.36 and 1.38. CONCLUSIONS Relatively modest sample sizes can be used to fine-tune LLMs for NER tasks applied to biomedical text, and training data entity density should representatively approximate entity density in production data. Training data quality and a model architecture's intended use (text generation vs text processing or classification) may be as important as, or more important than, training data volume and model parameter size.
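The two-predictor regression described above can be sketched with ordinary least squares. The data below are synthetic, generated with a known slope purely to illustrate the model form F1 ~ sentences + EPS; this is not the authors' data or their exact procedure.

```python
import numpy as np

def fit_f1_model(sentences, eps, f1):
    """OLS fit of f1 ~ b0 + b1*sentences + b2*eps; returns (coefficients, R^2)."""
    X = np.column_stack([np.ones(len(sentences)), sentences, eps])
    beta, *_ = np.linalg.lstsq(X, f1, rcond=None)
    resid = f1 - X @ beta
    r2 = 1 - np.sum(resid ** 2) / np.sum((f1 - f1.mean()) ** 2)
    return beta, r2

# Synthetic "training runs": sample size in sentences, entity density, resulting F1
rng = np.random.default_rng(0)
sentences = rng.uniform(50, 1000, 200)
eps = rng.uniform(0.5, 2.0, 200)
f1 = 0.6 + 0.0002 * sentences + 0.05 * eps + rng.normal(0, 0.01, 200)

beta, r2 = fit_f1_model(sentences, eps, f1)  # recovers roughly [0.6, 0.0002, 0.05]
```

A threshold (segmented) regression, as used in the study, would extend this by fitting separate slopes below and above a breakpoint and choosing the breakpoint that minimizes residual error.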
Affiliation(s)
- Zoltan P Majdik
- Department of Communication, North Dakota State University, Fargo, ND, United States
- S Scott Graham
- Department of Rhetoric & Writing, The University of Texas at Austin, Austin, TX, United States
- Jade C Shiva Edward
- Department of Rhetoric & Writing, The University of Texas at Austin, Austin, TX, United States
- Sabrina N Rodriguez
- Department of Neurology, The Dell Medical School, The University of Texas at Austin, Austin, TX, United States
- Martha S Karnes
- Department of Rhetoric & Writing, University of Arkansas Little Rock, Little Rock, AR, United States
- Jared T Jensen
- Department of Rhetoric & Writing, The University of Texas at Austin, Austin, TX, United States
- Joshua B Barbour
- Department of Communication, The University of Illinois at Urbana-Champaign, Urbana, IL, United States
- Justin F Rousseau
- Statistical Planning and Analysis Section, Department of Neurology, The University of Texas Southwestern Medical Center, Dallas, TX, United States
- Peter O'Donnell Jr. Brain Institute, The University of Texas Southwestern Medical Center, Dallas, TX, United States
3
McManus KF, Stringer JM, Corson N, Fodeh S, Steinhardt S, Levin FL, Shotqara AS, D'Auria J, Fielstein EM, Gobbel GT, Scott J, Trafton JA, Taddei TH, Erdos J, Tamang SR. Deploying a national clinical text processing infrastructure. J Am Med Inform Assoc 2024; 31:727-731. [PMID: 38146986] [PMCID: PMC10873837] [DOI: 10.1093/jamia/ocad249]
Abstract
OBJECTIVES Clinical text processing offers a promising avenue for improving multiple aspects of healthcare, though operational deployment remains a substantial challenge. This case report details the implementation of a national clinical text processing infrastructure within the Department of Veterans Affairs (VA). METHODS Two foundational use cases, cancer case management and suicide and overdose prevention, illustrate how text processing can be practically implemented at scale for diverse clinical applications using shared services. RESULTS Insights from these use cases underline both commonalities and differences, providing a replicable model for future text processing applications. CONCLUSIONS This project enables more efficient initiation, testing, and future deployment of text processing models, streamlining the integration of these use cases into healthcare operations. The infrastructure was implemented in a large integrated health delivery system in the United States, but we expect the lessons learned to be relevant to any health system, including smaller local and regional health systems in the United States.
Affiliation(s)
- Kimberly F McManus
- Department of Veterans Affairs, Office of the CTO, Washington, DC 20571, United States
- Johnathon Michael Stringer
- Division of Immunology and Rheumatology, Department of Medicine, Stanford University, Stanford, CA 94304, United States
- Neal Corson
- Department of Veterans Affairs, San Diego, CA 92108, United States
- Samah Fodeh
- Department of Veterans Affairs, West Haven, CT 06516, United States
- Yale School of Medicine, New Haven, CT 06510, United States
- Asqar S Shotqara
- Department of Veterans Affairs, Center for Innovation to Implementation (Ci2i), Palo Alto, CA 94304, United States
- Joseph D'Auria
- Product Engineering, Department of Veterans Affairs, Austin, TX 78741, United States
- Elliot M Fielstein
- Department of Veterans Affairs, Office of Mental Health and Suicide Prevention, Veterans Health Administration, Nashville, TN 37212, United States
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Glenn T Gobbel
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- John Scott
- Department of Veterans Affairs, Clinical Informatics and Data Management Office, Veterans Health Administration, Washington, DC 20571, United States
- Jodie A Trafton
- Department of Veterans Affairs, Office of Mental Health and Suicide Prevention, Program Evaluation Resource Center, Palo Alto, CA 94304, United States
- Tamar H Taddei
- Department of Veterans Affairs, West Haven, CT 06516, United States
- Yale School of Medicine, New Haven, CT 06510, United States
- Joseph Erdos
- Department of Veterans Affairs, West Haven, CT 06516, United States
- Yale School of Medicine, New Haven, CT 06510, United States
- Suzanne R Tamang
- Division of Immunology and Rheumatology, Department of Medicine, Stanford University, Stanford, CA 94304, United States
- Department of Veterans Affairs, Office of Mental Health and Suicide Prevention, Program Evaluation Resource Center, Palo Alto, CA 94304, United States
4
Unlu O, Shin J, Mailly CJ, Oates MF, Tucci MR, Varugheese M, Wagholikar K, Wang F, Scirica BM, Blood AJ, Aronson SJ. Retrieval Augmented Generation Enabled Generative Pre-Trained Transformer 4 (GPT-4) Performance for Clinical Trial Screening. medRxiv [Preprint] 2024:2024.02.08.24302376. [PMID: 38370719] [PMCID: PMC10871450] [DOI: 10.1101/2024.02.08.24302376]
Abstract
Background Subject screening is a key aspect of all clinical trials; however, it has traditionally been a labor-intensive and error-prone task, demanding significant time and resources. With the advent of large language models (LLMs) and related technologies, a paradigm shift in natural language processing capabilities offers a promising avenue for increasing both the quality and efficiency of screening efforts. This study aimed to test whether a Retrieval-Augmented Generation (RAG)-enabled Generative Pre-trained Transformer 4 (GPT-4) workflow can accurately identify and report on inclusion and exclusion criteria for a clinical trial. Methods The Co-Operative Program for Implementation of Optimal Therapy in Heart Failure (COPILOT-HF) trial aims to recruit patients with symptomatic heart failure. As part of the screening process, a list of potentially eligible patients is created through an electronic health record (EHR) query. Currently, structured data in the EHR can only be used to determine 5 of 6 inclusion and 5 of 17 exclusion criteria. Trained, but non-licensed, study staff complete manual chart review to determine patient eligibility and record their assessment of the inclusion and exclusion criteria. We obtained the structured assessments completed by the study staff and clinical notes for the past two years and developed a clinical note-based question-answering workflow powered by a RAG architecture and GPT-4, which we named RECTIFIER (RAG-Enabled Clinical Trial Infrastructure for Inclusion Exclusion Review). We used notes from 100 patients as a development dataset, 282 patients as a validation dataset, and 1894 patients as a test set. An expert clinician completed a blinded review of patients' charts to answer the eligibility questions and determine the "gold standard" answers. We calculated the sensitivity, specificity, accuracy, and Matthews correlation coefficient (MCC) for each question and screening method, and performed bootstrapping to calculate confidence intervals for each statistic. Results Both RECTIFIER and study staff answers closely aligned with the expert clinician answers across criteria, with accuracy ranging between 97.9% and 100% (MCC 0.837 and 1) for RECTIFIER and 91.7% and 100% (MCC 0.644 and 1) for study staff. RECTIFIER performed better than study staff in determining the inclusion criterion of "symptomatic heart failure," with an accuracy of 97.9% vs 91.7% and an MCC of 0.924 vs 0.721, respectively. Overall, the sensitivity and specificity of eligibility determination were 92.3% (CI) and 93.9% (CI) for RECTIFIER and 90.1% (CI) and 83.6% (CI) for study staff, respectively. Conclusion GPT-4 based solutions have the potential to improve efficiency and reduce costs in clinical trial screening. When incorporating new tools such as RECTIFIER, it is important to consider the potential hazards of automating the screening process and to set up appropriate mitigation strategies, such as final clinician review before patient engagement.
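The evaluation metrics named above all follow from a 2x2 confusion matrix. A minimal sketch on invented binary labels (1 = eligible), not the study's data:

```python
import math

def screening_metrics(y_true, y_pred):
    """Sensitivity, specificity, accuracy, and Matthews correlation coefficient."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # true positives
    tn = sum(1 for t, p in pairs if not t and not p)  # true negatives
    fp = sum(1 for t, p in pairs if not t and p)      # false positives
    fn = sum(1 for t, p in pairs if t and not p)      # false negatives
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    acc = (tp + tn) / len(pairs)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sens, spec, acc, mcc

gold = [1, 1, 1, 0, 0, 0, 1, 0]    # hypothetical clinician gold-standard answers
screen = [1, 1, 0, 0, 0, 1, 1, 0]  # hypothetical screener answers
sens, spec, acc, mcc = screening_metrics(gold, screen)
```

Bootstrap confidence intervals, as used in the study, would come from recomputing these metrics over resampled patient sets.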
Affiliation(s)
- Ozan Unlu
- Accelerator for Clinical Transformation, Brigham and Women's Hospital, Boston, MA
- Division of Cardiovascular Medicine, Brigham and Women's Hospital, Boston, MA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA
- Harvard Medical School, Boston, MA
- Jiyeon Shin
- Accelerator for Clinical Transformation, Brigham and Women's Hospital, Boston, MA
- Mass General Brigham Personalized Medicine, Cambridge, MA
- Charlotte J Mailly
- Accelerator for Clinical Transformation, Brigham and Women's Hospital, Boston, MA
- Mass General Brigham Personalized Medicine, Cambridge, MA
- Michael F Oates
- Accelerator for Clinical Transformation, Brigham and Women's Hospital, Boston, MA
- Mass General Brigham Personalized Medicine, Cambridge, MA
- Michela R Tucci
- Accelerator for Clinical Transformation, Brigham and Women's Hospital, Boston, MA
- Matthew Varugheese
- Accelerator for Clinical Transformation, Brigham and Women's Hospital, Boston, MA
- Kavishwar Wagholikar
- Accelerator for Clinical Transformation, Brigham and Women's Hospital, Boston, MA
- Research Information Science and Computing, Mass General Brigham, Somerville, MA
- Fei Wang
- Accelerator for Clinical Transformation, Brigham and Women's Hospital, Boston, MA
- Mass General Brigham Personalized Medicine, Cambridge, MA
- Benjamin M Scirica
- Accelerator for Clinical Transformation, Brigham and Women's Hospital, Boston, MA
- Division of Cardiovascular Medicine, Brigham and Women's Hospital, Boston, MA
- Harvard Medical School, Boston, MA
- Alexander J Blood
- Accelerator for Clinical Transformation, Brigham and Women's Hospital, Boston, MA
- Division of Cardiovascular Medicine, Brigham and Women's Hospital, Boston, MA
- Harvard Medical School, Boston, MA
- Samuel J Aronson
- Accelerator for Clinical Transformation, Brigham and Women's Hospital, Boston, MA
- Mass General Brigham Personalized Medicine, Cambridge, MA
5
Teh BW, Mikulska M, Averbuch D, de la Camara R, Hirsch HH, Akova M, Ostrosky-Zeichner L, Baddley JW, Tan BH, Mularoni A, Subramanian AK, La Hoz RM, Marinelli T, Boan P, Aguado JM, Grossi PA, Maertens J, Mueller NJ, Slavin MA. Consensus position statement on advancing the standardised reporting of infection events in immunocompromised patients. Lancet Infect Dis 2024; 24:e59-e68. [PMID: 37683684] [DOI: 10.1016/s1473-3099(23)00377-8]
Abstract
Patients can be immunocompromised by a diverse range of disease and treatment factors, including malignancies, autoimmune disorders and their treatments, and organ and stem-cell transplantation. Infections are a leading cause of morbidity and mortality in immunocompromised patients, and the disease treatment landscape is continually evolving. Although infections are critical but often preventable and curable adverse events, the reporting of infection events in randomised trials lacks sufficient detail, and inconsistent categorisation and definition of infections in observational and registry studies limit comparability and future pooling of data. A core reporting dataset consisting of category, site, severity, organism, and endpoints was developed as a minimum standard for reporting of infection events in immunocompromised patients across study types. Further additional information is recommended depending on study type. The standardised reporting of infectious events and attributable complications in immunocompromised patients will improve diagnostic, treatment, and prevention approaches and facilitate future research in this patient group.
Affiliation(s)
- Benjamin W Teh
- Department of Infectious Diseases, Peter MacCallum Cancer Centre, Melbourne, VIC, Australia; Sir Peter MacCallum Department of Oncology, University of Melbourne, VIC, Australia.
- Malgorzata Mikulska
- Division of Infectious Diseases, Department of Health Sciences, University of Genoa, Genoa, Italy; IRCCS Ospedale Policlinico San Martino, Genoa, Italy
- Dina Averbuch
- Pediatric Infectious Diseases, Faculty of Medicine, Hebrew University of Jerusalem, Jerusalem, Israel; Hadassah Medical Center, Jerusalem, Israel
- Hans H Hirsch
- Transplantation & Clinical Virology, Department of Biomedicine, University of Basel, Basel, Switzerland; Infectious Diseases & Hospital Epidemiology, University Hospital Basel, Basel, Switzerland
- Murat Akova
- Department of Infectious Diseases, Hacettepe University School of Medicine, Ankara, Turkey
- Luis Ostrosky-Zeichner
- Division of Infectious Diseases, McGovern Medical School, University of Texas, Houston, TX, USA
- John W Baddley
- Department of Medicine, Division of Infectious Diseases, University of Maryland School of Medicine, Baltimore, MD, USA
- Ban Hock Tan
- Department of Infectious Diseases, Singapore General Hospital, Singapore
- Alessandra Mularoni
- Department of Infectious Diseases, Istituto Mediterraneo per i Trapianti e Terapie ad Alta Specializzazione (IRCCS), Palermo, Italy
- Aruna K Subramanian
- Division of Infectious Diseases and Geographic Medicine, Stanford University School of Medicine, Stanford, CA, USA
- Ricardo M La Hoz
- Division of Infectious Diseases and Geographic Medicine, University of Texas Southwestern Medical Center, Dallas, TX, USA
- Tina Marinelli
- Department of Infectious Diseases, Royal Prince Alfred Hospital, Sydney, NSW, Australia
- Peter Boan
- Department of Infectious Diseases, Fiona Stanley Hospital, Murdoch, WA, Australia; Department of Microbiology, PathWest Laboratory Medicine WA, Fiona Stanley Hospital, Murdoch, WA, Australia
- Jose Maria Aguado
- Unit of Infectious Diseases, Hospital Universitario "12 de Octubre", Instituto de Investigación Sanitaria Hospital "12 de Octubre" (imas12), CIBERINFEC, Universidad Complutense, Madrid, Spain
- Paolo A Grossi
- Infectious and Tropical Diseases Unit, Department of Medicine and Surgery, University of Insubria-ASST-Sette Laghi, Varese, Italy
- Johan Maertens
- Department of Haematology, Universitaire Ziekenhuizen Leuven, Leuven, Belgium
- Nicolas J Mueller
- Division of Infectious Diseases and Hospital Epidemiology, University Hospital Zurich, Zürich, Switzerland
- Monica A Slavin
- Department of Infectious Diseases, Peter MacCallum Cancer Centre, Melbourne, VIC, Australia; Sir Peter MacCallum Department of Oncology, University of Melbourne, VIC, Australia; Victorian Infectious Diseases Service, Royal Melbourne Hospital, Parkville, VIC, Australia
6
Schopow N, Osterhoff G, Baur D. Applications of the Natural Language Processing Tool ChatGPT in Clinical Practice: Comparative Study and Augmented Systematic Review. JMIR Med Inform 2023; 11:e48933. [PMID: 38015610] [DOI: 10.2196/48933]
Abstract
BACKGROUND This research integrates a comparative analysis of the performance of human researchers and OpenAI's ChatGPT in systematic review tasks and describes an assessment of the application of natural language processing (NLP) models in clinical practice through a review of 5 studies. OBJECTIVE This study aimed to evaluate the agreement between ChatGPT and human researchers in extracting key information from clinical articles and to investigate the practical use of NLP in clinical settings as evidenced by the selected studies. METHODS The study design comprised a systematic review of clinical articles executed independently by human researchers and ChatGPT. The level of agreement between and within raters for parameter extraction was assessed using the Fleiss and Cohen κ statistics. RESULTS The comparative analysis revealed a high degree of concordance between ChatGPT and human researchers for most parameters, with less agreement for study design, clinical task, and clinical implementation. The review identified 5 studies that demonstrated the diverse applications of NLP in clinical settings. Their findings highlight the potential of NLP to improve clinical efficiency and patient outcomes in various contexts, from enhancing allergy detection and classification to improving quality metrics in psychotherapy treatments for veterans with posttraumatic stress disorder. CONCLUSIONS Our findings underscore the potential of NLP models, including ChatGPT, in performing systematic reviews and other clinical tasks. Despite certain limitations, NLP models present a promising avenue for enhancing health care efficiency and accuracy. Future studies should focus on broadening the range of clinical applications and exploring the ethical considerations of implementing NLP applications in health care settings.
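Cohen's κ, used above for two-rater agreement, corrects raw agreement for the agreement expected by chance. A minimal sketch on invented extraction labels (the study's actual parameters and ratings are not reproduced here):

```python
from collections import Counter

def cohens_kappa(r1, r2):
    """Cohen's kappa for two raters' categorical labels on the same items."""
    n = len(r1)
    p_obs = sum(a == b for a, b in zip(r1, r2)) / n            # observed agreement
    c1, c2 = Counter(r1), Counter(r2)
    p_exp = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / (n * n)  # chance agreement
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical yes/no extraction judgments by a human rater and ChatGPT
human = ["yes", "yes", "no", "yes", "no", "no"]
chatgpt = ["yes", "no", "no", "yes", "no", "yes"]
kappa = cohens_kappa(human, chatgpt)  # 4/6 observed agreement, 1/2 by chance
```

Fleiss κ, also cited in the abstract, generalizes the same idea to more than two raters.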
Affiliation(s)
- Nikolas Schopow
- Department for Orthopedics, Trauma Surgery and Plastic Surgery, University Hospital Leipzig, Leipzig, Germany
- Georg Osterhoff
- Department for Orthopedics, Trauma Surgery and Plastic Surgery, University Hospital Leipzig, Leipzig, Germany
- David Baur
- Department for Orthopedics, Trauma Surgery and Plastic Surgery, University Hospital Leipzig, Leipzig, Germany
7
King AJ, Angus DC, Cooper GF, Mowery DL, Seaman JB, Potter KM, Bukowski LA, Al-Khafaji A, Gunn SR, Kahn JM. A voice-based digital assistant for intelligent prompting of evidence-based practices during ICU rounds. J Biomed Inform 2023; 146:104483. [PMID: 37657712] [PMCID: PMC10591951] [DOI: 10.1016/j.jbi.2023.104483]
Abstract
OBJECTIVE To evaluate the technical feasibility and potential value of a digital assistant that prompts intensive care unit (ICU) rounding teams to use evidence-based practices based on analysis of their real-time discussions. METHODS We evaluated a novel voice-based digital assistant that audio-records and processes the ICU care team's rounding discussions to determine which evidence-based practices are applicable to the patient but have yet to be addressed by the team. The system then prompts the team to consider indicated but not yet delivered practices, thereby reducing cognitive burden compared with traditional rigid rounding checklists. In a retrospective analysis, we applied automatic transcription, natural language processing, and a rule-based expert system to generate personalized prompts for each patient in 106 audio-recorded ICU rounding discussions. To assess technical feasibility, we compared the system's prompts to those created by experienced critical care nurses who directly observed rounds. To assess potential value, we also compared the system's prompts to a hypothetical paper checklist containing all evidence-based practices. RESULTS The positive predictive value, negative predictive value, true positive rate, and true negative rate of the system's prompts were 0.45 ± 0.06, 0.83 ± 0.04, 0.68 ± 0.07, and 0.66 ± 0.04, respectively. If implemented in lieu of a paper checklist, the system would generate 56% fewer prompts per patient, with 50% ± 17% greater precision. CONCLUSION A voice-based digital assistant can reduce prompts per patient compared with traditional approaches for improving evidence uptake on ICU rounds. Additional work is needed to evaluate field performance and team acceptance.
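Once the NLP pipeline has labeled which practices are applicable and which the team has already discussed, the prompting step described above reduces to a set difference. The practice names below are illustrative, not the system's actual vocabulary:

```python
def pending_prompts(applicable, addressed):
    """Evidence-based practices indicated for the patient but not yet discussed on rounds."""
    return sorted(set(applicable) - set(addressed))

# Hypothetical outputs of the applicability rules and the transcript analysis
applicable = {"spontaneous breathing trial", "sedation interruption",
              "DVT prophylaxis", "head-of-bed elevation"}
addressed = {"DVT prophylaxis", "head-of-bed elevation"}

prompts = pending_prompts(applicable, addressed)
# prompts == ['sedation interruption', 'spontaneous breathing trial']
```

The contrast with a paper checklist is that the checklist would surface all four items regardless of what the team already covered; the assistant prompts only the remainder.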
Affiliation(s)
- Andrew J King
- Department of Critical Care Medicine, University of Pittsburgh School of Medicine, Scaife Hall Suite 600, 3550 Terrace Street, Pittsburgh, PA 15261, USA.
- Derek C Angus
- Department of Critical Care Medicine, University of Pittsburgh School of Medicine, Scaife Hall Suite 600, 3550 Terrace Street, Pittsburgh, PA 15261, USA.
- Gregory F Cooper
- Department of Biomedical Informatics, University of Pittsburgh School of Medicine, Offices at Baum 4th Floor, 5607 Baum Blvd, Pittsburgh, PA 15206, USA.
- Danielle L Mowery
- Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania School of Medicine, Blockley Hall 8th Floor, 423 Guardian Drive, Philadelphia, PA 19104, USA.
- Jennifer B Seaman
- Department of Acute & Tertiary Care, University of Pittsburgh School of Nursing, 336 Victoria Building, 3500 Victoria Street, Pittsburgh, PA 15261, USA.
- Kelly M Potter
- Department of Critical Care Medicine, University of Pittsburgh School of Medicine, Scaife Hall Suite 600, 3550 Terrace Street, Pittsburgh, PA 15261, USA.
- Leigh A Bukowski
- Department of Critical Care Medicine, University of Pittsburgh School of Medicine, Scaife Hall Suite 600, 3550 Terrace Street, Pittsburgh, PA 15261, USA.
- Ali Al-Khafaji
- Department of Critical Care Medicine, University of Pittsburgh School of Medicine, Scaife Hall Suite 600, 3550 Terrace Street, Pittsburgh, PA 15261, USA.
- Scott R Gunn
- Department of Critical Care Medicine, University of Pittsburgh School of Medicine, Scaife Hall Suite 600, 3550 Terrace Street, Pittsburgh, PA 15261, USA.
- Jeremy M Kahn
- Department of Critical Care Medicine, University of Pittsburgh School of Medicine, Scaife Hall Suite 600, 3550 Terrace Street, Pittsburgh, PA 15261, USA.
8
Gao Y, Dligach D, Miller T, Churpek MM, Afshar M. Overview of the Problem List Summarization (ProbSum) 2023 Shared Task on Summarizing Patients' Active Diagnoses and Problems from Electronic Health Record Progress Notes. Proc Conf Assoc Comput Linguist Meet 2023; 2023:461-467. [PMID: 37583489] [PMCID: PMC10426335] [DOI: 10.18653/v1/2023.bionlp-1.43]
Abstract
The BioNLP Workshop 2023 launched a shared task on Problem List Summarization (ProbSum) in January 2023. The aim of this shared task is to attract future research efforts in building NLP models for real-world diagnostic decision support applications, where a system generating relevant and accurate diagnoses will augment the healthcare providers' decision-making process and improve the quality of care for patients. The goal for participants was to develop models that generate a list of diagnoses and problems using input from the daily care notes collected during the hospitalization of critically ill patients. Eight teams submitted their final systems to the shared task leaderboard. In this paper, we describe the tasks, datasets, evaluation metrics, and baseline systems, and summarize the techniques and evaluation results of the approaches tried by the participating teams.
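Summarization shared tasks of this kind are commonly scored with ROUGE-style overlap metrics. The sketch below computes a ROUGE-L-like F-score from the longest common subsequence of tokens; it illustrates the metric family only and is not the official ProbSum scorer.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f(candidate, reference):
    """ROUGE-L-style F1 between a generated and a reference problem list."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

score = rouge_l_f("acute respiratory failure sepsis",
                  "sepsis acute respiratory failure")  # LCS of 3 of 4 tokens -> 0.75
```

Because ROUGE-L rewards in-order overlap, a generated problem list that names the right diagnoses in a different order still receives partial credit.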
9
Gao Y, Dligach D, Miller T, Churpek MM, Uzuner O, Afshar M. Progress Note Understanding - Assessment and Plan Reasoning: Overview of the 2022 N2C2 Track 3 shared task. J Biomed Inform 2023; 142:104346. [PMID: 37061012] [PMCID: PMC11178099] [DOI: 10.1016/j.jbi.2023.104346]
Abstract
Daily progress notes are a common note type in the electronic health record (EHR) in which healthcare providers document the patient's daily progress and treatment plans. The EHR is designed to document all the care provided to patients, but it also enables note bloat with extraneous information that distracts from the diagnoses and treatment plans. Applications of natural language processing (NLP) in the EHR are a growing field, with the majority of methods focused on information extraction. Few tasks use NLP methods for downstream diagnostic decision support. We introduced the 2022 National NLP Clinical Challenge (N2C2) Track 3: Progress Note Understanding - Assessment and Plan Reasoning as one step towards a new suite of tasks. The Assessment and Plan Reasoning task focuses on the most critical components of progress notes: the Assessment and Plan subsections, where health problems and diagnoses are documented. The goal of the task was to develop and evaluate NLP systems that automatically predict causal relations between the overall status of the patient, contained in the Assessment section, and each component of the Plan section, which contains the diagnoses and treatment plans. Identifying and prioritizing diagnoses in this way is a first step in diagnostic decision support for finding the most relevant information in long documents like daily progress notes. We present the results of the 2022 N2C2 Track 3 and provide a description of the data, evaluation, participation, and system performance.
Affiliation(s)
- Yanjun Gao
- ICU Data Science Lab, Department of Medicine, University of Wisconsin Madison, United States of America.
- Dmitriy Dligach
- Department of Computer Science, Loyola University Chicago, United States of America
- Timothy Miller
- Boston Children's Hospital, Harvard University, United States of America
- Matthew M Churpek
- ICU Data Science Lab, Department of Medicine, University of Wisconsin Madison, United States of America
- Ozlem Uzuner
- Department of Information Sciences and Technology, George Mason University, United States of America
- Majid Afshar
- ICU Data Science Lab, Department of Medicine, University of Wisconsin Madison, United States of America
10
Afshar M, Adelaine S, Resnik F, Mundt MP, Long J, Leaf M, Ampian T, Wills GJ, Schnapp B, Chao M, Brown R, Joyce C, Sharma B, Dligach D, Burnside ES, Mahoney J, Churpek MM, Patterson BW, Liao F. Deployment of Real-time Natural Language Processing and Deep Learning Clinical Decision Support in the Electronic Health Record: Pipeline Implementation for an Opioid Misuse Screener in Hospitalized Adults. JMIR Med Inform 2023; 11:e44977. [PMID: 37079367] [PMCID: PMC10160938] [DOI: 10.2196/44977]
Abstract
BACKGROUND The clinical narrative in electronic health records (EHRs) carries valuable information for predictive analytics; however, its free-text form is difficult to mine and analyze for clinical decision support (CDS). Large-scale clinical natural language processing (NLP) pipelines have focused on data warehouse applications for retrospective research efforts. There remains a paucity of evidence for implementing NLP pipelines at the bedside for health care delivery. OBJECTIVE We aimed to detail a hospital-wide, operational pipeline to implement a real-time NLP-driven CDS tool and describe a protocol for an implementation framework with a user-centered design of the CDS tool. METHODS The pipeline integrated a previously trained open-source convolutional neural network model for screening opioid misuse that leveraged EHR notes mapped to standardized medical vocabularies in the Unified Medical Language System. A sample of 100 adult encounters was reviewed by a physician informaticist for silent testing of the deep learning algorithm before deployment. An end-user interview survey was developed to examine user acceptability of a best practice alert (BPA) that provides the screening results with recommendations. The planned implementation also included a human-centered design with user feedback on the BPA, an implementation framework with a cost-effectiveness analysis, and a noninferiority patient outcome analysis plan. RESULTS The pipeline was a reproducible workflow with shared pseudocode for a cloud service to ingest, process, and store clinical notes as Health Level 7 messages from a major EHR vendor in an elastic cloud computing environment. Feature engineering of the notes used an open-source NLP engine, and the features were fed into the deep learning algorithm, with the results returned as a BPA in the EHR.
On-site silent testing of the deep learning algorithm demonstrated a sensitivity of 93% (95% CI 66%-99%) and specificity of 92% (95% CI 84%-96%), similar to published validation studies. Before deployment, approvals were received across hospital committees for inpatient operations. Five interviews were conducted; they informed the development of an educational flyer and further modified the BPA to exclude certain patients and allow the refusal of recommendations. The longest delay in pipeline development was due to cybersecurity approvals, particularly the exchange of protected health information between the Microsoft (Microsoft Corp) and Epic (Epic Systems Corp) cloud vendors. In silent testing, the resultant pipeline delivered a BPA to the bedside within minutes of a provider entering a note in the EHR. CONCLUSIONS The components of the real-time NLP pipeline were detailed with open-source tools and pseudocode for other health systems to benchmark. The deployment of medical artificial intelligence systems in routine clinical care presents an important yet unfulfilled opportunity, and our protocol aims to close the gap in the implementation of artificial intelligence-driven CDS. TRIAL REGISTRATION ClinicalTrials.gov NCT05745480; https://www.clinicaltrials.gov/ct2/show/NCT05745480.
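Operating characteristics like those reported for the silent test come directly from a 2x2 confusion matrix. A minimal sketch, with hypothetical counts chosen only so the arithmetic lands near the reported 93%/92% on a 100-encounter sample (these are not the study's actual counts):

```python
def sens_spec(tp: int, fn: int, tn: int, fp: int) -> tuple[float, float]:
    """Sensitivity = TP/(TP+FN); specificity = TN/(TN+FP)."""
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical 2x2 counts for a 100-encounter silent test
sensitivity, specificity = sens_spec(tp=13, fn=1, tn=79, fp=7)
```

The wide sensitivity CI in the abstract (66%-99%) reflects how few true positives a 100-encounter sample contains, which is why silent testing is usually followed by larger prospective evaluation.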
Affiliation(s)
- Majid Afshar
- University of Wisconsin - Madison, Madison, WI, United States
- Felice Resnik
- University of Wisconsin - Madison, Madison, WI, United States
- Marlon P Mundt
- University of Wisconsin - Madison, Madison, WI, United States
- John Long
- University of Wisconsin - Madison, Madison, WI, United States
- Margaret Leaf
- University of Wisconsin - Madison, Madison, WI, United States
- Theodore Ampian
- University of Wisconsin - Madison, Madison, WI, United States
- Graham J Wills
- University of Wisconsin - Madison, Madison, WI, United States
- Michael Chao
- University of Wisconsin - Madison, Madison, WI, United States
- Randy Brown
- University of Wisconsin - Madison, Madison, WI, United States
- Cara Joyce
- Loyola University Chicago, Chicago, IL, United States
- Brihat Sharma
- University of Wisconsin - Madison, Madison, WI, United States
- Jane Mahoney
- University of Wisconsin - Madison, Madison, WI, United States
- Frank Liao
- University of Wisconsin - Madison, Madison, WI, United States
11
Gao Y, Dligach D, Miller T, Caskey J, Sharma B, Churpek MM, Afshar M. DR.BENCH: Diagnostic Reasoning Benchmark for Clinical Natural Language Processing. J Biomed Inform 2023; 138:104286. [PMID: 36706848] [PMCID: PMC9993808] [DOI: 10.1016/j.jbi.2023.104286]
Abstract
The meaningful use of electronic health records (EHR) continues to progress in the digital era with clinical decision support systems augmented by artificial intelligence. A priority in improving the provider experience is to overcome information overload and reduce cognitive burden so that fewer medical errors and cognitive biases are introduced during patient care. One major type of medical error is diagnostic error due to systematic or predictable errors in judgment that rely on heuristics. The potential for clinical natural language processing (cNLP) to model diagnostic reasoning in humans with forward reasoning from data to diagnosis, and thereby reduce cognitive burden and medical error, has not been investigated. Existing tasks to advance the science in cNLP have largely focused on information extraction and named entity recognition through classification tasks. We introduce a novel suite of tasks, coined the Diagnostic Reasoning Benchmark (DR.BENCH), as a new benchmark for developing and evaluating cNLP models with clinical diagnostic reasoning ability. The suite includes six tasks from ten publicly available datasets addressing clinical text understanding, medical knowledge reasoning, and diagnosis generation. DR.BENCH is the first clinical suite of tasks designed as a natural language generation framework to evaluate pre-trained language models for diagnostic reasoning. The goal of DR.BENCH is to advance the science in cNLP to support downstream applications in computerized diagnostic decision support and improve the efficiency and accuracy of healthcare providers during patient care. We fine-tune and evaluate state-of-the-art generative models on DR.BENCH. Experiments show that, even with domain-adaptation pre-training on medical knowledge, models demonstrate substantial opportunities for improvement when evaluated on DR.BENCH. We share DR.BENCH as a publicly available GitLab repository with a systematic approach to load and evaluate models for the cNLP community. We also discuss the carbon footprint produced during the experiments and encourage future work on DR.BENCH to report the carbon footprint.
Affiliation(s)
- Yanjun Gao
- ICU Data Science Lab, Department of Medicine, University of Wisconsin Madison, 1685 Highland Ave, Madison, 53792, WI, USA.
- Dmitriy Dligach
- Department of Computer Science, Loyola University Chicago, 1032 W Sheridan Rd, Chicago, 60660, IL, USA
- Timothy Miller
- Boston Children's Hospital, Harvard University, 300 Longwood Ave, Boston, 02115, MA, USA
- John Caskey
- ICU Data Science Lab, Department of Medicine, University of Wisconsin Madison, 1685 Highland Ave, Madison, 53792, WI, USA
- Brihat Sharma
- ICU Data Science Lab, Department of Medicine, University of Wisconsin Madison, 1685 Highland Ave, Madison, 53792, WI, USA
- Matthew M Churpek
- ICU Data Science Lab, Department of Medicine, University of Wisconsin Madison, 1685 Highland Ave, Madison, 53792, WI, USA
- Majid Afshar
- ICU Data Science Lab, Department of Medicine, University of Wisconsin Madison, 1685 Highland Ave, Madison, 53792, WI, USA
12
Taira RK, Garlid AO, Speier W. Design considerations for a hierarchical semantic compositional framework for medical natural language understanding. PLoS One 2023; 18:e0282882. [PMID: 36928721] [PMCID: PMC10019629] [DOI: 10.1371/journal.pone.0282882]
Abstract
Medical natural language processing (NLP) systems are a key enabling technology for transforming Big Data from clinical report repositories into information used to support disease models and validate intervention methods. However, current medical NLP systems fall considerably short when faced with the task of logically interpreting clinical text. In this paper, we describe a framework inspired by mechanisms of human cognition in an attempt to jump the NLP performance curve. The design centers on a hierarchical semantic compositional model (HSCM), which provides an internal substrate for guiding the interpretation process. The paper describes insights from four key cognitive aspects: semantic memory, semantic composition, semantic activation, and hierarchical predictive coding. We discuss the design of a generative semantic model and an associated semantic parser used to transform a free-text sentence into a logical representation of its meaning. The paper discusses supportive and antagonistic arguments for the key features of the architecture as a long-term foundational framework.
Affiliation(s)
- Ricky K. Taira
- Medical and Imaging Informatics (MII) Group, Department of Radiological Sciences, University of California, Los Angeles, Los Angeles, California, United States of America
- Anders O. Garlid
- Medical and Imaging Informatics (MII) Group, Department of Radiological Sciences, University of California, Los Angeles, Los Angeles, California, United States of America
- William Speier
- Medical and Imaging Informatics (MII) Group, Department of Radiological Sciences, University of California, Los Angeles, Los Angeles, California, United States of America
- Department of Bioengineering, University of California, Los Angeles, Los Angeles, California, United States of America
13
Joyce C, Markossian TW, Nikolaides J, Ramsey E, Thompson HM, Rojas JC, Sharma B, Dligach D, Oguss MK, Cooper RS, Afshar M. The Evaluation of a Clinical Decision Support Tool Using Natural Language Processing to Screen Hospitalized Adults for Unhealthy Substance Use: Protocol for a Quasi-Experimental Design. JMIR Res Protoc 2022; 11:e42971. [PMID: 36534461] [PMCID: PMC9808720] [DOI: 10.2196/42971]
Abstract
BACKGROUND Automated and data-driven methods for screening using natural language processing (NLP) and machine learning may replace resource-intensive manual approaches in the usual care of patients hospitalized with conditions related to unhealthy substance use. The rigorous evaluation of tools that use artificial intelligence (AI) is necessary to demonstrate effectiveness before system-wide implementation. An NLP tool that uses routinely collected data in the electronic health record was previously validated for diagnostic accuracy in a retrospective study of screening for unhealthy substance use. Our next step is a noninferiority design incorporated into a research protocol for clinical implementation with prospective evaluation of clinical effectiveness in a large health system. OBJECTIVE This study aims to provide a study protocol to evaluate health outcomes and the costs and benefits of an AI-driven automated screener compared to manual human screening for unhealthy substance use. METHODS A pre-post design is proposed to evaluate 12 months of manual screening followed by 12 months of automated screening across surgical and medical wards at a single medical center. The preintervention period consists of usual care with manual screening by nurses and social workers and referrals to a multidisciplinary Substance Use Intervention Team (SUIT). Facilitated by an NLP pipeline in the postintervention period, clinical notes from the first 24 hours of hospitalization will be processed and scored by a machine learning model, and the SUIT will be similarly alerted to patients who flag positive for substance misuse. Flowsheets within the electronic health record have been updated to capture rates of interventions for the primary outcome (brief intervention/motivational interviewing, medication-assisted treatment, naloxone dispensing, and referral to outpatient care).
Effectiveness in terms of patient outcomes will be determined by noninferior rates of interventions (primary outcome), as well as rates of readmission within 6 months, average time to consult, and rates of discharge against medical advice (secondary outcomes), in the postintervention period compared to the preintervention period. A separate analysis will assess the costs and benefits to the health system of using automated screening. Changes from the pre- to postintervention period will be assessed in covariate-adjusted generalized linear mixed-effects models. RESULTS The study will begin in September 2022. Monthly data monitoring and Data Safety Monitoring Board reporting are scheduled every 6 months throughout the study period. We anticipate reporting final results by June 2025. CONCLUSIONS The use of augmented intelligence for clinical decision support is growing, with an increasing number of AI tools. We provide a research protocol for the prospective evaluation of an automated NLP system for screening for unhealthy substance use, using a noninferiority design to demonstrate that comprehensive automated screening may be as effective as manual screening but less costly. TRIAL REGISTRATION ClinicalTrials.gov NCT03833804; https://clinicaltrials.gov/ct2/show/NCT03833804. INTERNATIONAL REGISTERED REPORT IDENTIFIER (IRRID) DERR1-10.2196/42971.
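A noninferiority comparison of intervention rates is commonly framed as a one-sided test of the difference in two proportions against a prespecified margin. A minimal sketch of that arithmetic, in which the counts and the 10-percentage-point margin are hypothetical illustrations rather than values from this protocol:

```python
import math

def noninferiority_z(x_new: int, n_new: int, x_old: int, n_old: int, margin: float) -> float:
    """One-sided z statistic for H0: p_new <= p_old - margin.

    Rejecting H0 (z > z_alpha) supports non-inferiority of the new screener.
    """
    p_new, p_old = x_new / n_new, x_old / n_old
    # Unpooled standard error of the difference in proportions
    se = math.sqrt(p_new * (1 - p_new) / n_new + p_old * (1 - p_old) / n_old)
    return (p_new - p_old + margin) / se

# Hypothetical intervention counts: 180/400 post vs. 190/400 pre, 10-point margin
z = noninferiority_z(x_new=180, n_new=400, x_old=190, n_old=400, margin=0.10)
# z > 1.645 would reject H0 at one-sided alpha = 0.05
```

The protocol's actual analysis uses covariate-adjusted generalized linear mixed-effects models rather than this unadjusted test; the sketch only shows why a slightly lower post-period rate can still satisfy noninferiority when it stays inside the margin.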
Affiliation(s)
- Cara Joyce
- Department of Computer Science, Loyola University Chicago, Chicago, IL, United States
- Talar W Markossian
- Department of Public Health Sciences, Loyola University Chicago, Maywood, IL, United States
- Jenna Nikolaides
- Department of Psychiatry, Rush University Medical Center, Chicago, IL, United States
- Elisabeth Ramsey
- Department of Psychiatry, Rush University Medical Center, Chicago, IL, United States
- Hale M Thompson
- Department of Psychiatry, Rush University Medical Center, Chicago, IL, United States
- Juan C Rojas
- Department of Psychiatry, Rush University Medical Center, Chicago, IL, United States
- Brihat Sharma
- Department of Psychiatry, Rush University Medical Center, Chicago, IL, United States
- Dmitriy Dligach
- Department of Computer Science, Loyola University Chicago, Chicago, IL, United States
- Madeline K Oguss
- Department of Medicine, University of Wisconsin-Madison, Madison, WI, United States
- Richard S Cooper
- Department of Public Health Sciences, Loyola University Chicago, Maywood, IL, United States
- Majid Afshar
- Department of Medicine, University of Wisconsin-Madison, Madison, WI, United States