1
Yoon W, Chen S, Gao Y, Zhao Z, Dligach D, Bitterman DS, Afshar M, Miller T. LCD Benchmark: Long Clinical Document Benchmark on Mortality Prediction for Language Models. medRxiv : the preprint server for health sciences 2024:2024.03.26.24304920. [PMID: 38585973 PMCID: PMC10996733 DOI: 10.1101/2024.03.26.24304920] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/09/2024]
Abstract
Objective: The application of Natural Language Processing (NLP) in the clinical domain is important due to the rich unstructured information in clinical documents, which often remains inaccessible in structured data. When applying NLP methods to a certain domain, the role of benchmark datasets is crucial as benchmark datasets not only guide the selection of best-performing models but also enable the assessment of the reliability of the generated outputs. Despite the recent availability of language models (LMs) capable of longer context, benchmark datasets targeting long clinical document classification tasks are absent.
Materials and Methods: To address this issue, we propose LCD benchmark, a benchmark for the task of predicting 30-day out-of-hospital mortality using discharge notes of MIMIC-IV and statewide death data. We evaluated this benchmark dataset using baseline models, from bag-of-words and CNN to instruction-tuned large language models. Additionally, we provide a comprehensive analysis of the model outputs, including manual review and visualization of model weights, to offer insights into their predictive capabilities and limitations.
Results and Discussion: Baseline models showed 28.9% for best-performing supervised models and 32.2% for GPT-4 in F1-metrics. Notes in our dataset have a median word count of 1687. Our analysis of the model outputs showed that our dataset is challenging for both models and human experts, but the models can find meaningful signals from the text.
Conclusion: We expect our LCD benchmark to be a resource for the development of advanced supervised models, or prompting methods, tailored for clinical text. The benchmark dataset is available at https://github.com/Machine-Learning-for-Medical-Language/long-clinical-doc.
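The F1 figures quoted in this abstract combine precision and recall into one score, which is the usual choice when the positive class (here, death within 30 days) is rare. A minimal sketch of the standard formula follows; the function name and the example counts are illustrative assumptions and are not taken from the benchmark.

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    denominator = 2 * true_pos + false_pos + false_neg
    # Convention assumed here: report 0.0 when there are no positives at all.
    return 2 * true_pos / denominator if denominator else 0.0


# Made-up counts for illustration only (not values from the LCD benchmark).
example = f1_score(true_pos=2, false_pos=3, false_neg=1)
print(f"F1 = {example:.2f}")
```

True negatives do not appear in the formula, so a model cannot raise its F1 simply by predicting the majority class.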
2
Malecki SL, Loffler A, Tamming D, Dyrby Johansen N, Biering-Sørensen T, Fralick M, Sohail S, Shi J, Roberts SB, Colacci M, Ismail M, Razak F, Verma AA. Development and external validation of tools for categorizing diagnosis codes in international hospital data. Int J Med Inform 2024; 189:105508. [PMID: 38851134 DOI: 10.1016/j.ijmedinf.2024.105508] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/12/2023] [Revised: 03/17/2024] [Accepted: 05/27/2024] [Indexed: 06/10/2024]
Abstract
BACKGROUND: The Clinical Classification Software Refined (CCSR) is a tool that groups many thousands of International Classification of Diseases 10th Revision (ICD-10) diagnosis codes into approximately 500 clinically meaningful categories, simplifying analyses. However, CCSR was developed for use in the United States and may not work well with other country-specific ICD-10 coding systems.
METHOD: We developed an algorithm for semi-automated matching of Canadian ICD-10 codes (ICD-10-CA) to CCSR categories using discharge diagnoses from adult admissions at 7 hospitals between Apr 1, 2010 and Dec 31, 2020, and manually validated the results. We then externally validated our approach using inpatient hospital encounters in Denmark from 2017 to 2018.
KEY RESULTS: There were 383,972 Canadian hospital admissions with 5,186 distinct ICD-10-CA diagnosis codes and 1,855,837 Danish encounters with 4,612 ICD-10 diagnosis codes. Only 46.6% of Canadian codes and 49.4% of Danish codes could be directly categorized using the official CCSR tool. Our algorithm facilitated the mapping of 98.5% of all Canadian codes and 97.7% of Danish codes. Validation of our algorithm by clinicians demonstrated excellent accuracy (97.1% and 97.0% in Canadian and Danish data, respectively). Without our algorithm, many common conditions did not match directly to a CCSR category, such as 96.6% of hospital admissions for heart failure.
CONCLUSION: The GEMINI CCSR matching algorithm (available as an open-source package at https://github.com/GEMINI-Medicine/gemini-ccsr) improves the categorization of Canadian and Danish ICD-10 codes into clinically coherent categories compared to the original CCSR tool. We expect this approach to generalize well to other countries and enable a wide range of research and quality measurement applications.
Affiliation(s)
- Sarah L Malecki
  - Department of Medicine, University of Toronto, Toronto, ON, Canada
- Anne Loffler
  - St. Michael's Hospital, University of Toronto, Toronto, ON, Canada
- Daniel Tamming
  - St. Michael's Hospital, University of Toronto, Toronto, ON, Canada
- Niklas Dyrby Johansen
  - Department of Cardiology, Copenhagen University Hospital - Herlev and Gentofte, Copenhagen, Denmark; Center for Translational Cardiology and Pragmatic Randomized Trials, Department of Biomedical Sciences, Faculty of Health and Medical Sciences, University of Copenhagen, Denmark
- Tor Biering-Sørensen
  - Department of Cardiology, Copenhagen University Hospital - Herlev and Gentofte, Copenhagen, Denmark; Center for Translational Cardiology and Pragmatic Randomized Trials, Department of Biomedical Sciences, Faculty of Health and Medical Sciences, University of Copenhagen, Denmark
- Michael Fralick
  - Division of General Internal Medicine, Sinai Health System, ON, Toronto, Canada; Institute of Health Policy, Management, and Evaluation, University of Toronto, Toronto, ON, Canada
- Shahmir Sohail
  - Department of Medicine, University of Toronto, Toronto, ON, Canada
- Jessica Shi
  - St. Michael's Hospital, University of Toronto, Toronto, ON, Canada
- Surain B Roberts
  - St. Michael's Hospital, University of Toronto, Toronto, ON, Canada
- Michael Colacci
  - Department of Medicine, University of Toronto, Toronto, ON, Canada; St. Michael's Hospital, University of Toronto, Toronto, ON, Canada; Institute of Health Policy, Management, and Evaluation, University of Toronto, Toronto, ON, Canada
- Marwa Ismail
  - St. Michael's Hospital, University of Toronto, Toronto, ON, Canada
- Fahad Razak
  - Department of Medicine, University of Toronto, Toronto, ON, Canada; St. Michael's Hospital, University of Toronto, Toronto, ON, Canada; Institute of Health Policy, Management, and Evaluation, University of Toronto, Toronto, ON, Canada
- Amol A Verma
  - Department of Medicine, University of Toronto, Toronto, ON, Canada; St. Michael's Hospital, University of Toronto, Toronto, ON, Canada; Institute of Health Policy, Management, and Evaluation, University of Toronto, Toronto, ON, Canada; Department of Laboratory Medicine and Pathobiology, University of Toronto, Toronto, ON, Canada
3
Mishra AK, Chong B, Arunachalam SP, Oberg AL, Majumder S. Machine Learning Models for Pancreatic Cancer Risk Prediction Using Electronic Health Record Data-A Systematic Review and Assessment. Am J Gastroenterol 2024:00000434-990000000-01167. [PMID: 38752654 DOI: 10.14309/ajg.0000000000002870] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/01/2023] [Accepted: 05/06/2024] [Indexed: 06/20/2024]
Abstract
INTRODUCTION: Accurate risk prediction can facilitate screening and early detection of pancreatic cancer (PC). We conducted a systematic review to critically evaluate effectiveness of machine learning (ML) and artificial intelligence (AI) techniques applied to electronic health records (EHR) for PC risk prediction.
METHODS: Ovid MEDLINE(R), Ovid EMBASE, Ovid Cochrane Central Register of Controlled Trials, Ovid Cochrane Database of Systematic Reviews, Scopus, and Web of Science were searched for articles that utilized ML/AI techniques to predict PC, published between January 1, 2012, and February 1, 2024. Study selection and data extraction were conducted by 2 independent reviewers. Critical appraisal and data extraction were performed using the CHecklist for critical Appraisal and data extraction for systematic Reviews of prediction Modelling Studies checklist. Risk of bias and applicability were examined using prediction model risk of bias assessment tool.
RESULTS: Thirty studies including 169,149 PC cases were identified. Logistic regression was the most frequent modeling method. Twenty studies utilized a curated set of known PC risk predictors or those identified by clinical experts. ML model discrimination performance (C-index) ranged from 0.57 to 1.0. Missing data were underreported, and most studies did not implement explainable-AI techniques or report exclusion time intervals.
DISCUSSION: AI/ML models for PC risk prediction using known risk factors perform reasonably well and may have near-term applications in identifying cohorts for targeted PC screening if validated in real-world data sets. The combined use of structured and unstructured EHR data using emerging AI models while incorporating explainable-AI techniques has the potential to identify novel PC risk factors, and this approach merits further study.
Affiliation(s)
- Anup Kumar Mishra
  - Department of Gastroenterology and Hepatology, Mayo Clinic, Rochester, Minnesota, USA
- Bradford Chong
  - Department of Gastroenterology and Hepatology, Mayo Clinic, Rochester, Minnesota, USA
- Ann L Oberg
  - Department of Quantitative Health Sciences, Mayo Clinic, Rochester, Minnesota, USA
- Shounak Majumder
  - Department of Gastroenterology and Hepatology, Mayo Clinic, Rochester, Minnesota, USA
4
Swaminathan A, Lopez I, Wang W, Srivastava U, Tran E, Bhargava-Shah A, Wu JY, Ren AL, Caoili K, Bui B, Alkhani L, Lee S, Mohit N, Seo N, Macedo N, Cheng W, Liu C, Thomas R, Chen JH, Gevaert O. Selective prediction for extracting unstructured clinical data. J Am Med Inform Assoc 2023; 31:188-197. [PMID: 37769323 PMCID: PMC10746316 DOI: 10.1093/jamia/ocad182] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/22/2023] [Revised: 08/21/2023] [Accepted: 08/24/2023] [Indexed: 09/30/2023] Open
Abstract
OBJECTIVE: While there are currently approaches to handle unstructured clinical data, such as manual abstraction and structured proxy variables, these methods may be time-consuming, not scalable, and imprecise. This article aims to determine whether selective prediction, which gives a model the option to abstain from generating a prediction, can improve the accuracy and efficiency of unstructured clinical data abstraction.
MATERIALS AND METHODS: We trained selective classifiers (logistic regression, random forest, support vector machine) to extract 5 variables from clinical notes: depression (n = 1563), glioblastoma (GBM, n = 659), rectal adenocarcinoma (DRA, n = 601), and abdominoperineal resection (APR, n = 601) and low anterior resection (LAR, n = 601) of adenocarcinoma. We varied the cost of false positives (FP), false negatives (FN), and abstained notes and measured total misclassification cost.
RESULTS: The depression selective classifiers abstained on anywhere from 0% to 97% of notes, and the change in total misclassification cost ranged from -58% to 9%. Selective classifiers abstained on 5%-43% of notes across the GBM and colorectal cancer models. The GBM selective classifier abstained on 43% of notes, which led to improvements in sensitivity (0.94 to 0.96), specificity (0.79 to 0.96), PPV (0.89 to 0.98), and NPV (0.88 to 0.91) when compared to a non-selective classifier and when compared to structured proxy variables.
DISCUSSION: We showed that selective classifiers outperformed both non-selective classifiers and structured proxy variables for extracting data from unstructured clinical notes.
CONCLUSION: Selective prediction should be considered when abstaining is preferable to making an incorrect prediction.
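The idea of trading abstentions against misclassifications can be illustrated with a small cost calculation. The sketch below is not the authors' implementation: it assumes, purely for illustration, that a classifier abstains whenever its predicted probability falls inside a band between two thresholds, and all names (`Costs`, `selective_predict`, `total_cost`) and numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Costs:
    false_positive: float
    false_negative: float
    abstain: float


def selective_predict(prob: float, lower: float, upper: float) -> Optional[int]:
    """Return 1 or 0, or None to abstain when prob lies strictly between the thresholds."""
    if prob >= upper:
        return 1
    if prob <= lower:
        return 0
    return None  # abstain: the note would go to manual review


def total_cost(probs: Sequence[float], labels: Sequence[int],
               lower: float, upper: float, costs: Costs) -> float:
    """Sum the cost of every false positive, false negative, and abstention."""
    total = 0.0
    for prob, label in zip(probs, labels):
        prediction = selective_predict(prob, lower, upper)
        if prediction is None:
            total += costs.abstain
        elif prediction == 1 and label == 0:
            total += costs.false_positive
        elif prediction == 0 and label == 1:
            total += costs.false_negative
    return total


# Toy data, invented for illustration only.
probs = [0.9, 0.6, 0.4, 0.1]
labels = [1, 0, 1, 0]
costs = Costs(false_positive=1.0, false_negative=1.0, abstain=0.25)

non_selective = total_cost(probs, labels, lower=0.5, upper=0.5, costs=costs)
selective = total_cost(probs, labels, lower=0.3, upper=0.7, costs=costs)
print(non_selective, selective)
```

Setting both thresholds to the same value recovers an ordinary non-selective classifier, which makes the two settings directly comparable under one cost function.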
Affiliation(s)
- Akshay Swaminathan
  - Stanford University School of Medicine, Stanford, CA, United States
  - Cerebral Inc. Claymont, DE, United States
- Ivan Lopez
  - Stanford University School of Medicine, Stanford, CA, United States
  - Cerebral Inc. Claymont, DE, United States
- William Wang
  - Department of Biology, Stanford University, Stanford, CA, United States
  - Department of Bioengineering, Stanford University, Stanford, CA, United States
- Ujwal Srivastava
  - Department of Computer Science, Stanford University, Stanford, CA, United States
- Edward Tran
  - Department of Computer Science, Stanford University, Stanford, CA, United States
  - Department of Management Science and Engineering, Stanford University, Stanford, CA, United States
- Janet Y Wu
  - Stanford University School of Medicine, Stanford, CA, United States
- Alexander L Ren
  - Stanford University School of Medicine, Stanford, CA, United States
- Kaitlin Caoili
  - Stanford University School of Medicine, Stanford, CA, United States
- Brandon Bui
  - Department of Human Biology, Stanford University, Stanford, CA, United States
- Layth Alkhani
  - Department of Bioengineering, Stanford University, Stanford, CA, United States
  - Department of Chemistry, Stanford University, Stanford, CA, United States
- Susan Lee
  - Department of Computer Science, Stanford University, Stanford, CA, United States
- Nathan Mohit
  - Department of Computer Science, Stanford University, Stanford, CA, United States
  - Department of Human Biology, Stanford University, Stanford, CA, United States
- Noel Seo
  - Department of Sociology, Stanford University, Stanford, CA, United States
- Nicholas Macedo
  - Department of Biology, Stanford University, Stanford, CA, United States
  - Department of Radiology, Stanford University School of Medicine, Stanford, CA, United States
- Winson Cheng
  - Department of Computer Science, Stanford University, Stanford, CA, United States
  - Department of Chemistry, Stanford University, Stanford, CA, United States
- Charles Liu
  - Department of Surgery, Stanford University School of Medicine, Stanford, CA, United States
- Reena Thomas
  - Department of Neurology and Neurological Sciences, Stanford Health Care, Stanford, CA, United States
- Jonathan H Chen
  - Stanford Center for Biomedical Informatics Research, Stanford, CA, United States
  - Division of Hospital Medicine, Stanford, CA, United States
  - Clinical Excellence Research Center, Stanford, CA, United States
  - Department of Medicine, Stanford, CA, United States
- Olivier Gevaert
  - Stanford Center for Biomedical Informatics Research, Stanford, CA, United States
  - Department of Medicine, Stanford, CA, United States
5
Liu Y, Deng Y, Wang H, Liu W, He X, Zeng H. A nomogram for predicting echocardiogram prescription in outpatients: an analysis of the NAMCS database. Front Cardiovasc Med 2023; 10:1183504. [PMID: 37908500 PMCID: PMC10613676 DOI: 10.3389/fcvm.2023.1183504] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/10/2023] [Accepted: 09/19/2023] [Indexed: 11/02/2023] Open
Abstract
Background and objective: Cardiovascular disease is the leading cause of morbidity and mortality globally. Echocardiography is a commonly used method for assessing the condition of patients with cardiovascular disease. However, little is known about the population characteristics of patients who are recommended for echocardiographic examinations.
Methods: The National Ambulatory Medical Care Survey was a cross-sectional survey previously undertaken in the USA. In this study, publicly accessible data from the National Ambulatory Medical Care Survey database (for 2007-2016 and 2018-2019; data for 2017 was not published) were utilized to create a nomogram based on significant risk predictors. The study was performed in accordance with the relevant guidelines and regulations stipulated in the National Ambulatory Medical Care Survey database. Patients were randomly assigned to one of two groups: training cohort or validation cohort. The latter was used to assess the reliability of the prediction nomogram. Decision curve analysis was performed to evaluate the net benefit. Propensity score matching analysis was used to evaluate the relevance of echocardiography to clinical decision-making.
Results: A total of 217,178 outpatients were enrolled. Multivariable logistic regression analysis demonstrated that hypertension, hyperlipidemia, coronary artery disease/ischemic heart disease/history of myocardial infarction, congestive heart failure, major reason for visit, metropolitan statistical area, cerebrovascular disease/history of stroke or transient ischemic attack, previously assessed, insurance, referred, diagnosis, and reason for visit were all predictors of echocardiogram prescription in outpatients. The reliability of the predictive nomogram was confirmed in the validation cohort. After propensity score matching, there was a significant difference in new cardiovascular agent prescriptions between the echocardiogram and no echocardiogram groups (P < 0.01).
Conclusion: In this cohort study, a nomogram based on the characteristics of outpatients was developed to predict the possibility of prescribing echocardiography. The echocardiogram group was more likely to be prescribed new cardiovascular agents. These findings may contribute to providing information about the gap between actual utilizations and guidelines and the actual outpatient practice, as well as meeting the needs of outpatients.
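Decision curve analysis, named in the Methods, is usually summarized by the net benefit of acting on a model at a chosen threshold probability. The sketch below uses the commonly cited formula, net benefit = TP/N - (FP/N) * pt / (1 - pt); the function name and the counts are hypothetical and are not drawn from the NAMCS analysis.

```python
def net_benefit(true_pos: int, false_pos: int, n_total: int, threshold: float) -> float:
    """Net benefit at threshold probability `threshold` (must be strictly between 0 and 1)."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must be strictly between 0 and 1")
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    # False positives are down-weighted by the odds of the threshold probability.
    odds = threshold / (1.0 - threshold)
    return true_pos / n_total - (false_pos / n_total) * odds


# Invented counts for illustration only.
print(net_benefit(true_pos=30, false_pos=20, n_total=100, threshold=0.2))
```

A decision curve plots this quantity over a range of thresholds and compares it with the "treat all" and "treat none" strategies.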
Affiliation(s)
- Yujian Liu
  - Division of Cardiology, Department of Internal Medicine, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
  - Hubei Provincial Engineering Research Center of Vascular Interventional Therapy, Wuhan, China
- Yanhan Deng
  - Department of Rheumatology and Immunology, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
- Hongjie Wang
  - Division of Cardiology, Department of Internal Medicine, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
  - Hubei Provincial Engineering Research Center of Vascular Interventional Therapy, Wuhan, China
- Wanjun Liu
  - Division of Cardiology, Department of Internal Medicine, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
  - Hubei Provincial Engineering Research Center of Vascular Interventional Therapy, Wuhan, China
- Xingwei He
  - Division of Cardiology, Department of Internal Medicine, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
  - Hubei Provincial Engineering Research Center of Vascular Interventional Therapy, Wuhan, China
- Hesong Zeng
  - Division of Cardiology, Department of Internal Medicine, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
  - Hubei Provincial Engineering Research Center of Vascular Interventional Therapy, Wuhan, China
6
Amirahmadi A, Ohlsson M, Etminani K. Deep learning prediction models based on EHR trajectories: A systematic review. J Biomed Inform 2023; 144:104430. [PMID: 37380061 DOI: 10.1016/j.jbi.2023.104430] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/07/2022] [Revised: 06/08/2023] [Accepted: 06/17/2023] [Indexed: 06/30/2023]
Abstract
BACKGROUND: Electronic health records (EHRs) are generated at an ever-increasing rate. EHR trajectories, the temporal aspect of health records, facilitate predicting patients' future health-related risks. It enables healthcare systems to increase the quality of care through early identification and primary prevention. Deep learning techniques have shown great capacity for analyzing complex data and have been successful for prediction tasks using complex EHR trajectories. This systematic review aims to analyze recent studies to identify challenges, knowledge gaps, and ongoing research directions.
METHODS: For this systematic review, we searched Scopus, PubMed, IEEE Xplore, and ACM databases from Jan 2016 to April 2022 using search terms centered around EHR, deep learning, and trajectories. Then the selected papers were analyzed according to publication characteristics, objectives, and their solutions regarding existing challenges, such as the model's capacity to deal with intricate data dependencies, data insufficiency, and explainability.
RESULTS: After removing duplicates and out-of-scope papers, 63 papers were selected, which showed rapid growth in the number of research in recent years. Predicting all diseases in the next visit and the onset of cardiovascular diseases were the most common targets. Different contextual and non-contextual representation learning methods are employed to retrieve important information from the sequence of EHR trajectories. Recurrent neural networks and the time-aware attention mechanism for modeling long-term dependencies, self-attentions, convolutional neural networks, graphs for representing inner visit relations, and attention scores for explainability were frequently used among the reviewed publications.
CONCLUSIONS: This systematic review demonstrated how recent breakthroughs in deep learning methods have facilitated the modeling of EHR trajectories. Research on improving the ability of graph neural networks, attention mechanisms, and cross-modal learning to analyze intricate dependencies among EHRs has shown good progress. There is a need to increase the number of publicly available EHR trajectory datasets to allow for easier comparison among different models. Also, very few developed models can handle all aspects of EHR trajectory data.
Affiliation(s)
- Ali Amirahmadi
  - Center for Applied Intelligent Systems Research, Halmstad University, Sweden
- Mattias Ohlsson
  - Center for Applied Intelligent Systems Research, Halmstad University, Sweden; Computational Biology & Biological Physics, Department of Astronomy and Theoretical Physics, Lund University, Sweden
- Kobra Etminani
  - Center for Applied Intelligent Systems Research, Halmstad University, Sweden
7
Ben Miled Z, Dexter PR, Grout RW, Boustani M. Feature engineering from medical notes: A case study of dementia detection. Heliyon 2023; 9:e14636. [PMID: 37020943 PMCID: PMC10068125 DOI: 10.1016/j.heliyon.2023.e14636] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/16/2022] [Revised: 03/12/2023] [Accepted: 03/13/2023] [Indexed: 03/19/2023] Open
Abstract
Background and objectives: Medical notes are narratives that describe the health of the patient in free text format. These notes can be more informative than structured data such as the history of medications or disease conditions. They are routinely collected and can be used to evaluate the patient's risk for developing chronic diseases such as dementia. This study investigates different methodologies for transforming routine care notes into dementia risk classifiers and evaluates the generalizability of these classifiers to new patients and new health care institutions.
Methods: The notes collected over the relevant history of the patient are lengthy. In this study, TF-ICF is used to select keywords with the highest discriminative ability between at risk dementia patients and healthy controls. The medical notes are then summarized in the form of occurrences of the selected keywords. Two different encodings of the summary are compared. The first encoding consists of the average of the vector embedding of each keyword occurrence as produced by the BERT or Clinical BERT pre-trained language models. The second encoding aggregates the keywords according to UMLS concepts and uses each concept as an exposure variable. For both encodings, misspellings of the selected keywords are also considered in an effort to improve the predictive performance of the classifiers. A neural network is developed over the first encoding and a gradient boosted trees model is applied to the second encoding. Patients from a single health care institution are used to develop all the classifiers which are then evaluated on held-out patients from the same health care institution as well as test patients from two other health care institutions.
Results: The results indicate that it is possible to identify patients at risk for dementia one year ahead of the onset of the disease using medical notes with an AUC of 75% when a gradient boosted trees model is used in conjunction with exposure variables derived from UMLS concepts. However, this performance is not maintained with an embedded feature space and when the classifier is applied to patients from other health care institutions. Moreover, an analysis of the top predictors of the gradient boosted trees model indicates that different features inform the classification depending on whether or not spelling variants of the keywords are included.
Conclusion: The present study demonstrates that medical notes can enable risk prediction models for complex chronic diseases such as dementia. However, additional research efforts are needed to improve the generalizability of these models. These efforts should take into consideration the length and localization of the medical notes; the availability of sufficient training data for each disease condition; and the variabilities resulting from different feature engineering techniques.
Affiliation(s)
- Zina Ben Miled
  - Department of Electrical and Computer Engineering, School of Engineering and Technology, Indiana University Purdue University at Indianapolis, 723 W. Michigan Street, Indianapolis, IN, 46202, USA
  - Regenstrief Institute, Inc., 1101 W. 10th Street, Indianapolis, IN, 46202, USA
- Paul R. Dexter
  - Regenstrief Institute, Inc., 1101 W. 10th Street, Indianapolis, IN, 46202, USA
  - Indiana University School of Medicine, 340 W 10th St, Indianapolis, IN, 46202, USA
- Randall W. Grout
  - Indiana University School of Medicine, 340 W 10th St, Indianapolis, IN, 46202, USA
- Malaz Boustani
  - Regenstrief Institute, Inc., 1101 W. 10th Street, Indianapolis, IN, 46202, USA
  - Indiana University School of Medicine, 340 W 10th St, Indianapolis, IN, 46202, USA
8
Park J, Artin MG, Lee KE, May BL, Park M, Hur C, Tatonetti NP. Structured deep embedding model to generate composite clinical indices from electronic health records for early detection of pancreatic cancer. Patterns (New York, N.Y.) 2023; 4:100636. [PMID: 36699740 PMCID: PMC9868652 DOI: 10.1016/j.patter.2022.100636] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/05/2022] [Revised: 08/18/2022] [Accepted: 10/24/2022] [Indexed: 12/12/2022]
Abstract
The high-dimensionality, complexity, and irregularity of electronic health records (EHR) data create significant challenges for both simplified and comprehensive health assessments, prohibiting an efficient extraction of actionable insights by clinicians. If we can provide human decision-makers with a simplified set of interpretable composite indices (i.e., combining information about groups of related measures into single representative values), it will facilitate effective clinical decision-making. In this study, we built a structured deep embedding model aimed at reducing the dimensionality of the input variables by grouping related measurements as determined by domain experts (e.g., clinicians). Our results suggest that composite indices representing liver function may consistently be the most important factor in the early detection of pancreatic cancer (PC). We propose our model as a basis for leveraging deep learning toward developing composite indices from EHR for predicting health outcomes, including but not limited to various cancers, with clinically meaningful interpretations.
Affiliation(s)
- Jiheum Park
  - Department of Medicine, Columbia University Irving Medical Center, New York, NY 10032, USA
- Michael G. Artin
  - Hospital of the University of Pennsylvania, Philadelphia, PA 19104, USA
- Kate E. Lee
  - Duke University Medical Center, Durham, NC 27710, USA
- Benjamin L. May
  - Herbert Irving Comprehensive Cancer Center, Columbia University Irving Medical Center, New York, NY 10032, USA
- Michael Park
  - Applied Info Partners, Inc, Worlds Fair Drive, Somerset, NJ 08873, USA
  - X-Mechanics, Cresskill, NJ 07626, USA
- Chin Hur
  - Department of Medicine, Columbia University Irving Medical Center, New York, NY 10032, USA
9
Gupta M, Gallamoza B, Cutrona N, Dhakal P, Poulain R, Beheshti R. An Extensive Data Processing Pipeline for MIMIC-IV. Proceedings of Machine Learning Research 2022; 193:311-325. [PMID: 36686986 PMCID: PMC9854277] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/24/2023]
Abstract
An increasing amount of research is being devoted to applying machine learning methods to electronic health record (EHR) data for various clinical purposes. This growing area of research has exposed the challenges of the accessibility of EHRs. MIMIC is a popular, public, and free EHR dataset in a raw format that has been used in numerous studies. The absence of standardized preprocessing steps can be, however, a significant barrier to the wider adoption of this rare resource. Additionally, this absence can reduce the reproducibility of the developed tools and limit the ability to compare the results among similar studies. In this work, we provide a greatly customizable pipeline to extract, clean, and preprocess the data available in the fourth version of the MIMIC dataset (MIMIC-IV). The pipeline also presents an end-to-end wizard-like package supporting predictive model creations and evaluations. The pipeline covers a range of clinical prediction tasks which can be broadly classified into four categories - readmission, length of stay, mortality, and phenotype prediction. The tool is publicly available at https://github.com/healthylaife/MIMIC-IV-Data-Pipeline.
10
Kiser AC, Eilbeck K, Ferraro JP, Skarda DE, Samore MH, Bucher B. Standard Vocabularies to Improve Machine Learning Model Transferability With Electronic Health Record Data: Retrospective Cohort Study Using Health Care-Associated Infection. JMIR Med Inform 2022; 10:e39057. [PMID: 36040784 PMCID: PMC9472055 DOI: 10.2196/39057] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/27/2022] [Revised: 08/09/2022] [Accepted: 08/15/2022] [Indexed: 11/13/2022] Open
Abstract
BACKGROUND With the widespread adoption of electronic healthcare records (EHRs) by US hospitals, there is an opportunity to leverage this data for the development of predictive algorithms to improve clinical care. A key barrier in model development and implementation includes the external validation of model discrimination, which is rare and often results in worse performance. One reason why machine learning models are not externally generalizable is data heterogeneity. A potential solution to address the substantial data heterogeneity between health care systems is to use standard vocabularies to map EHR data elements. The advantage of these vocabularies is a hierarchical relationship between elements, which allows the aggregation of specific clinical features to more general grouped concepts. OBJECTIVE This study aimed to evaluate grouping EHR data using standard vocabularies to improve the transferability of machine learning models for the detection of postoperative health care-associated infections across institutions with different EHR systems. METHODS Patients who underwent surgery from the University of Utah Health and Intermountain Healthcare from July 2014 to August 2017 with complete follow-up data were included. The primary outcome was a health care-associated infection within 30 days of the procedure. EHR data from 0-30 days after the operation were mapped to standard vocabularies and grouped using the hierarchical relationships of the vocabularies. Model performance was measured using the area under the receiver operating characteristic curve (AUC) and F1-score in internal and external validations. To evaluate model transferability, a difference-in-difference metric was defined as the difference in performance drop between internal and external validations for the baseline and grouped models. RESULTS A total of 5775 patients from the University of Utah and 15,434 patients from Intermountain Healthcare were included. 
The prevalence of selected outcomes was from 4.9% (761/15,434) to 5% (291/5775) for surgical site infections, from 0.8% (44/5775) to 1.1% (171/15,434) for pneumonia, from 2.6% (400/15,434) to 3% (175/5775) for sepsis, and from 0.8% (125/15,434) to 0.9% (50/5775) for urinary tract infections. In all outcomes, the grouping of data using standard vocabularies resulted in a reduced drop in AUC and F1-score in external validation compared to baseline features (all P<.001, except urinary tract infection AUC: P=.002). The difference-in-difference metrics ranged from 0.005 to 0.248 for AUC and from 0.075 to 0.216 for F1-score. CONCLUSIONS We demonstrated that grouping machine learning model features based on standard vocabularies improved model transferability between data sets across 2 institutions. Improving model transferability using standard vocabularies has the potential to improve the generalization of clinical prediction models across the health care system.
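The difference-in-difference metric described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the sign convention (baseline drop minus grouped drop, so a positive value means the grouped model loses less performance externally) is an assumption consistent with the reported positive values, and the AUC numbers in the example are invented for illustration only.

```python
def performance_drop(internal: float, external: float) -> float:
    """Drop in a metric (e.g., AUC or F1-score) from internal to external validation."""
    return internal - external


def difference_in_difference(
    baseline_internal: float,
    baseline_external: float,
    grouped_internal: float,
    grouped_external: float,
) -> float:
    """Assumed convention: baseline drop minus grouped drop.

    A positive result means grouping features with standard vocabularies
    reduced the performance drop seen in external validation.
    """
    baseline_drop = performance_drop(baseline_internal, baseline_external)
    grouped_drop = performance_drop(grouped_internal, grouped_external)
    return baseline_drop - grouped_drop


# Hypothetical AUC values, not taken from the study.
did_auc = difference_in_difference(
    baseline_internal=0.90,
    baseline_external=0.70,
    grouped_internal=0.88,
    grouped_external=0.80,
)
print(round(did_auc, 3))
```

Under this convention, the reported ranges (0.005 to 0.248 for AUC, 0.075 to 0.216 for F1-score) would all indicate a smaller external drop for the grouped models.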
Affiliation(s)
- Amber C Kiser: Department of Biomedical Informatics, School of Medicine, University of Utah, Salt Lake City, UT, United States
- Karen Eilbeck: Department of Biomedical Informatics, School of Medicine, University of Utah, Salt Lake City, UT, United States
- Jeffrey P Ferraro: Department of Medicine, School of Medicine, University of Utah, Salt Lake City, UT, United States
- David E Skarda: Center for Value-Based Surgery, Intermountain Healthcare, Salt Lake City, UT, United States; Department of Surgery, School of Medicine, University of Utah, Salt Lake City, UT, United States
- Matthew H Samore: Department of Medicine, School of Medicine, University of Utah, Salt Lake City, UT, United States; Informatics, Decision-Enhancement and Analytic Sciences Center 2.0, Veterans Affairs Salt Lake City Health Care System, Salt Lake City, UT, United States
- Brian Bucher: Department of Biomedical Informatics, School of Medicine, University of Utah, Salt Lake City, UT, United States; Department of Surgery, School of Medicine, University of Utah, Salt Lake City, UT, United States
|
11
|
Rasmy L, Nigo M, Kannadath BS, Xie Z, Mao B, Patel K, Zhou Y, Zhang W, Ross A, Xu H, Zhi D. Recurrent neural network models (CovRNN) for predicting outcomes of patients with COVID-19 on admission to hospital: model development and validation using electronic health record data. Lancet Digit Health 2022; 4:e415-e425. [PMID: 35466079 PMCID: PMC9023005 DOI: 10.1016/s2589-7500(22)00049-8] [Citation(s) in RCA: 21] [Impact Index Per Article: 10.5] [Received: 10/01/2021] [Revised: 01/11/2022] [Accepted: 03/07/2022] [Indexed: 02/08/2023]
Abstract
BACKGROUND Predicting outcomes of patients with COVID-19 at an early stage is crucial for optimised clinical care and resource management, especially during a pandemic. Although multiple machine learning models have been proposed to address this issue, because of their requirements for extensive data preprocessing and feature engineering, they have not been validated or implemented outside of their original study site. Therefore, we aimed to develop accurate and transferrable predictive models of outcomes on hospital admission for patients with COVID-19. METHODS In this study, we developed recurrent neural network-based models (CovRNN) to predict the outcomes of patients with COVID-19 by use of available electronic health record data on admission to hospital, without the need for specific feature selection or missing data imputation. CovRNN was designed to predict three outcomes: in-hospital mortality, need for mechanical ventilation, and prolonged hospital stay (>7 days). For in-hospital mortality and mechanical ventilation, CovRNN produced time-to-event risk scores (survival prediction; evaluated by the concordance index) and all-time risk scores (binary prediction; area under the receiver operating characteristic curve [AUROC] was the main metric); we only trained a binary classification model for prolonged hospital stay. For binary classification tasks, we compared CovRNN against traditional machine learning algorithms: logistic regression and light gradient boost machine. Our models were trained and validated on the heterogeneous, deidentified data of 247 960 patients with COVID-19 from 87 US health-care systems derived from the Cerner Real-World COVID-19 Q3 Dataset up to September 2020. We held out the data of 4175 patients from two hospitals for external validation. The remaining 243 785 patients from the 85 health systems were grouped into training (n=170 626), validation (n=24 378), and multi-hospital test (n=48 781) sets. 
Model performance was evaluated in the multi-hospital test set. The transferability of CovRNN was externally validated by use of deidentified data from 36 140 patients derived from the US-based Optum deidentified COVID-19 electronic health record dataset (version 1015; from January, 2007, to Oct 15, 2020). Exact dates of data extraction were masked by the databases to ensure patient data safety. FINDINGS CovRNN binary models achieved AUROCs of 93·0% (95% CI 92·6-93·4) for the prediction of in-hospital mortality, 92·9% (92·6-93·2) for the prediction of mechanical ventilation, and 86·5% (86·2-86·9) for the prediction of a prolonged hospital stay, outperforming light gradient boost machine and logistic regression algorithms. External validation confirmed AUROCs in similar ranges (91·3-97·0% for in-hospital mortality prediction, 91·5-96·0% for the prediction of mechanical ventilation, and 81·0-88·3% for the prediction of prolonged hospital stay). For survival prediction, CovRNN achieved a concordance index of 86·0% (95% CI 85·1-86·9) for in-hospital mortality and 92·6% (92·2-93·0) for mechanical ventilation. INTERPRETATION Trained on a large, heterogeneous, real-world dataset, our CovRNN models showed high prediction accuracy and transferability through consistently good performances on multiple external datasets. Our results show the feasibility of a COVID-19 predictive model that delivers high accuracy without the need for complex feature engineering. FUNDING Cancer Prevention and Research Institute of Texas.
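The cohort arithmetic in the Methods of the abstract above can be checked with a short sketch. The patient counts are taken from the abstract; the variable names are hypothetical, and the approximate split proportions are derived here rather than stated by the authors.

```python
# Counts as reported in the CovRNN abstract (Cerner Real-World COVID-19 data).
total_patients = 247_960
external_holdout = 4_175  # patients from two hospitals held out for external validation

splits = {
    "training": 170_626,
    "validation": 24_378,
    "multi_hospital_test": 48_781,
}

# Patients left after removing the held-out hospitals.
remaining = total_patients - external_holdout

# The three splits should account for every remaining patient.
assert remaining == 243_785
assert sum(splits.values()) == remaining

# Derived (not stated in the abstract): share of the remaining cohort in each split.
fractions = {name: count / remaining for name, count in splits.items()}
for name, share in fractions.items():
    print(f"{name}: {share:.1%}")
```

The counts are internally consistent, and the derived shares correspond to roughly a 7:1:2 training, validation, and test split.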
Affiliation(s)
- Laila Rasmy: School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX, USA
- Masayuki Nigo: McGovern Medical School, University of Texas Health Science Center at Houston, Houston, TX, USA
- Ziqian Xie: School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX, USA
- Bingyu Mao: School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX, USA
- Khush Patel: School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX, USA
- Yujia Zhou: School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX, USA
- Wanheng Zhang: School of Public Health, University of Texas Health Science Center at Houston, Houston, TX, USA
- Angela Ross: School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX, USA
- Hua Xu: School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX, USA
- Degui Zhi: School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX, USA. Correspondence to: Dr Degui Zhi, School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, USA
|
12
|
Integration of Artificial Intelligence and Blockchain Technology in Healthcare and Agriculture. J FOOD QUALITY 2022. [DOI: 10.1155/2022/4228448] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 12/17/2022]
Abstract
Over the last decade, the healthcare sector has accelerated its digitization and electronic health records (EHRs). As information technology progresses, the notion of intelligent health also gathers popularity. By combining technologies such as the internet of things (IoT) and artificial intelligence (AI), innovative healthcare modifies and enhances traditional medical systems in terms of efficiency, service, and personalization. On the other side, intelligent healthcare systems are incredibly vulnerable to data breaches and other malicious assaults. Recently, blockchain technology has emerged as a potentially transformative option for enhancing data management, access control, and integrity inside healthcare systems. Integrating these advanced approaches in agriculture is critical for managing food supply chains, drug supply chains, quality maintenance, and intelligent prediction. This study reviews the literature, formulates a research topic, and analyzes the applicability of blockchain to the agriculture/food industry and healthcare, with a particular emphasis on AI and IoT. This article summarizes research on the newest blockchain solutions paired with AI technologies for strengthening and inventing new technological standards for the healthcare ecosystems and food industry.
|
13
|
Rafee A, Riepenhausen S, Neuhaus P, Meidt A, Dugas M, Varghese J. ELaPro, a LOINC-mapped core dataset for top laboratory procedures of eligibility screening for clinical trials. BMC Med Res Methodol 2022; 22:141. [PMID: 35568796 PMCID: PMC9107639 DOI: 10.1186/s12874-022-01611-y] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 12/01/2021] [Accepted: 04/20/2022] [Indexed: 12/21/2022]
Abstract
Background Screening for eligible patients continues to pose a great challenge for many clinical trials. This has led to a rapidly growing interest in standardizing computable representations of eligibility criteria (EC) in order to develop tools that leverage data from electronic health record (EHR) systems. Although laboratory procedures (LP) represent a common entity of EC that is readily available and retrievable from EHR systems, there is a lack of interoperable data models for this entity of EC. A public, specialized data model that utilizes international, widely-adopted terminology for LP, e.g. Logical Observation Identifiers Names and Codes (LOINC®), is much needed to support automated screening tools. Objective The aim of this study is to establish a core dataset for LP most frequently requested to recruit patients for clinical trials using LOINC terminology. Employing such a core dataset could enhance the interface between study feasibility platforms and EHR systems and significantly improve automatic patient recruitment. Methods We used a semi-automated approach to analyze 10,516 screening forms from the Medical Data Models (MDM) portal’s data repository that are pre-annotated with Unified Medical Language System (UMLS). An automated semantic analysis based on concept frequency is followed by an extensive manual expert review performed by physicians to analyze complex recruitment-relevant concepts not amenable to automatic approach. Results Based on analysis of 138,225 EC from 10,516 screening forms, 55 laboratory procedures represented 77.87% of all UMLS laboratory concept occurrences identified in the selected EC forms. We identified 26,413 unique UMLS concepts from 118 UMLS semantic types and covered the vast majority of Medical Subject Headings (MeSH) disease domains. Conclusions Only a small set of common LP covers the majority of laboratory concepts in screening EC forms which supports the feasibility of establishing a focused core dataset for LP. 
We present ELaPro, a novel, LOINC-mapped, core dataset for the most frequent 55 LP requested in screening for clinical trials. ELaPro is available in multiple machine-readable data formats like CSV, ODM and HL7 FHIR. The extensive manual curation of this large number of free-text EC as well as the combining of UMLS and LOINC terminologies distinguishes this specialized dataset from previous relevant datasets in the literature. Supplementary Information The online version contains supplementary material available at 10.1186/s12874-022-01611-y.
Affiliation(s)
- Ahmed Rafee: Institute of Medical Informatics, University of Münster, Münster, Germany; Department of Internal Medicine (D), University Hospital of Münster, Münster, Germany
- Sarah Riepenhausen: Institute of Medical Informatics, University of Münster, Münster, Germany
- Philipp Neuhaus: Institute of Medical Informatics, University of Münster, Münster, Germany
- Alexandra Meidt: Institute of Medical Informatics, University of Münster, Münster, Germany
- Martin Dugas: Institute of Medical Informatics, Heidelberg University Hospital, Heidelberg, Germany
- Julian Varghese: Institute of Medical Informatics, University of Münster, Münster, Germany
|
14
|
Castro VM, Gainer V, Wattanasin N, Benoit B, Cagan A, Ghosh B, Goryachev S, Metta R, Park H, Wang D, Mendis M, Rees M, Herrick C, Murphy SN. The Mass General Brigham Biobank Portal: an i2b2-based data repository linking disparate and high-dimensional patient data to support multimodal analytics. J Am Med Inform Assoc 2021; 29:643-651. [PMID: 34849976 PMCID: PMC8922162 DOI: 10.1093/jamia/ocab264] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Received: 07/15/2021] [Revised: 10/20/2021] [Accepted: 11/16/2021] [Indexed: 01/07/2023]
Abstract
Objective Integrating and harmonizing disparate patient data sources into one consolidated data portal enables researchers to conduct analysis efficiently and effectively. Materials and Methods We describe an implementation of Informatics for Integrating Biology and the Bedside (i2b2) to create the Mass General Brigham (MGB) Biobank Portal data repository. The repository integrates data from primary and curated data sources and is updated weekly. The data are made readily available to investigators in a data portal where they can easily construct and export customized datasets for analysis. Results As of July 2021, there are 125 645 consented patients enrolled in the MGB Biobank. 88 527 (70.5%) have a biospecimen, 55 121 (43.9%) have completed the health information survey, 43 552 (34.7%) have genomic data and 124 760 (99.3%) have EHR data. Twenty machine learning computed phenotypes are calculated on a weekly basis. There are currently 1220 active investigators who have run 58 793 patient queries and exported 10 257 analysis files. Discussion The Biobank Portal allows noninformatics researchers to conduct study feasibility by querying across many data sources and then extract data that are most useful to them for clinical studies. While institutions require substantial informatics resources to establish and maintain integrated data repositories, they yield significant research value to a wide range of investigators. Conclusion The Biobank Portal and other patient data portals that integrate complex and simple datasets enable diverse research use cases. i2b2 tools to implement these registries and make the data interoperable are open source and freely available.
Affiliation(s)
- Victor M Castro: Research Information Science and Computing, Mass General Brigham, Somerville, Massachusetts, USA
- Vivian Gainer: Research Information Science and Computing, Mass General Brigham, Somerville, Massachusetts, USA
- Nich Wattanasin: Research Information Science and Computing, Mass General Brigham, Somerville, Massachusetts, USA
- Barbara Benoit: Research Information Science and Computing, Mass General Brigham, Somerville, Massachusetts, USA
- Andrew Cagan: Research Information Science and Computing, Mass General Brigham, Somerville, Massachusetts, USA
- Bhaswati Ghosh: Research Information Science and Computing, Mass General Brigham, Somerville, Massachusetts, USA
- Sergey Goryachev: Research Information Science and Computing, Mass General Brigham, Somerville, Massachusetts, USA
- Reeta Metta: Research Information Science and Computing, Mass General Brigham, Somerville, Massachusetts, USA
- Heekyong Park: Research Information Science and Computing, Mass General Brigham, Somerville, Massachusetts, USA
- David Wang: Research Information Science and Computing, Mass General Brigham, Somerville, Massachusetts, USA
- Michael Mendis: Research Information Science and Computing, Mass General Brigham, Somerville, Massachusetts, USA
- Martin Rees: Research Information Science and Computing, Mass General Brigham, Somerville, Massachusetts, USA
- Christopher Herrick: Research Information Science and Computing, Mass General Brigham, Somerville, Massachusetts, USA
- Shawn N Murphy: Research Information Science and Computing, Mass General Brigham, Somerville, Massachusetts, USA; Department of Neurology, Massachusetts General Hospital and Harvard Medical School, Boston, Massachusetts, USA
|
15
|
NE–LP: Normalized entropy- and loss prediction-based sampling for active learning in Chinese word segmentation on EHRs. Neural Comput Appl 2021. [DOI: 10.1007/s00521-021-05896-w] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 10/21/2022]
|
16
|
Abstract
Electronic health records (EHRs) are a rich source of data for researchers, but extracting meaningful information out of this highly complex data source is challenging. Phecodes represent one strategy for defining phenotypes for research using EHR data. They are a high-throughput phenotyping tool based on ICD (International Classification of Diseases) codes that can be used to rapidly define the case/control status of thousands of clinically meaningful diseases and conditions. Phecodes were originally developed to conduct phenome-wide association studies to scan for phenotypic associations with common genetic variants. Since then, phecodes have been used to support a wide range of EHR-based phenotyping methods, including the phenotype risk score. This review aims to comprehensively describe the development, validation, and applications of phecodes and suggest some future directions for phecodes and high-throughput phenotyping.
Affiliation(s)
- Lisa Bastarache: Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee 37232, USA
|
17
|
Ding X, Mower J, Subramanian D, Cohen T. Augmenting aer2vec: Enriching distributed representations of adverse event report data with orthographic and lexical information. J Biomed Inform 2021; 119:103833. [PMID: 34111555 DOI: 10.1016/j.jbi.2021.103833] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/10/2021] [Revised: 05/10/2021] [Accepted: 06/02/2021] [Indexed: 11/29/2022]
Abstract
Adverse Drug Events (ADEs) are prevalent, costly, and sometimes preventable. Post-marketing drug surveillance aims to monitor ADEs that occur after a drug is released to market. Reports of such ADEs are aggregated by reporting systems, such as the Food and Drug Administration (FDA) Adverse Event Reporting System (FAERS). In this paper, we consider the topic of how best to represent data derived from reports in FAERS for the purpose of detecting post-marketing surveillance signals, in order to inform regulatory decision making. In our previous work, we developed aer2vec, a method for deriving distributed representations (concept embeddings) of drugs and side effects from ADE reports, establishing the utility of distributional information for pharmacovigilance signal detection. In this paper, we advance this line of research further by evaluating the utility of encoding orthographic and lexical information. We do so by adapting two Natural Language Processing methods, subword embedding and vector retrofitting, which were developed to encode such information into word embeddings. Models were compared for their ability to distinguish between positive and negative examples in a set of manually curated drug/ADE relationships, with both aer2vec enhancements offering advantages in performances over baseline models, and best performance obtained when retrofitting and subword embeddings were applied in concert. In addition, this work demonstrates that models leveraging distributed representations do not require extensive manual preprocessing to perform well on this pharmacovigilance signal detection task, and may even benefit from information that would otherwise be lost during the normalization and standardization process.
Affiliation(s)
- Xiruo Ding: Department of Biomedical Informatics & Medical Education, University of Washington, Seattle, WA, USA
- Justin Mower: Department of Computer Science, Rice University, Houston, TX, USA
- Trevor Cohen: Department of Biomedical Informatics & Medical Education, University of Washington, Seattle, WA, USA
|
18
|
Humphreys BL, Del Fiol G, Xu H. The UMLS knowledge sources at 30: indispensable to current research and applications in biomedical informatics. J Am Med Inform Assoc 2020; 27:1499-1501. [PMID: 33059366 DOI: 10.1093/jamia/ocaa208] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 08/07/2020] [Indexed: 01/22/2023]
Affiliation(s)
- Guilherme Del Fiol: Department of Biomedical Informatics, University of Utah, Salt Lake City, Utah, USA
- Hua Xu: School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
|