1
Sivarajkumar S, Tam TYC, Mohammad HA, Viggiano S, Oniani D, Visweswaran S, Wang Y. Extraction of sleep information from clinical notes of Alzheimer's disease patients using natural language processing. J Am Med Inform Assoc 2024:ocae177. [PMID: 39001795; DOI: 10.1093/jamia/ocae177] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/29/2024] [Revised: 06/19/2024] [Accepted: 07/01/2024] [Indexed: 07/15/2024] Open
Abstract
OBJECTIVES Alzheimer's disease (AD) is the most common form of dementia in the United States. Sleep is one of the lifestyle-related factors that has been shown to be critical for optimal cognitive function in old age. However, there is a lack of research studying the association between sleep and AD incidence. A major bottleneck for conducting such research is that the traditional way to acquire sleep information is time-consuming, inefficient, non-scalable, and limited to patients' subjective experience. We aim to automate the extraction of specific sleep-related patterns, such as snoring, napping, poor sleep quality, daytime sleepiness, night wakings, other sleep problems, and sleep duration, from clinical notes of AD patients. These sleep patterns are hypothesized to play a role in the incidence of AD, providing insight into the relationship between sleep and AD onset and progression. MATERIALS AND METHODS A gold standard dataset is created from manual annotation of 570 randomly sampled clinical note documents from adSLEEP, a corpus of 192 000 de-identified clinical notes of 7266 AD patients retrieved from the University of Pittsburgh Medical Center (UPMC). We developed a rule-based natural language processing (NLP) algorithm, machine learning models, and large language model (LLM)-based NLP algorithms to automate the extraction of sleep-related concepts, including snoring, napping, sleep problem, bad sleep quality, daytime sleepiness, night wakings, and sleep duration, from the gold standard dataset. RESULTS The annotated dataset of 482 patients comprised a predominantly White (89.2%), older adult population with an average age of 84.7 years, where females represented 64.1%, and a vast majority were non-Hispanic or Latino (94.6%). The rule-based NLP algorithm achieved the best F1 score across all sleep-related concepts. 
In terms of positive predictive value (PPV), the rule-based NLP algorithm achieved the highest PPV scores for daytime sleepiness (1.00) and sleep duration (1.00), while the machine learning models had the highest PPV for napping (0.95) and bad sleep quality (0.86), and LLAMA2 with finetuning had the highest PPV for night wakings (0.93) and sleep problem (0.89). DISCUSSION Although sleep information is infrequently documented in the clinical notes, the proposed rule-based NLP algorithm and LLM-based NLP algorithms still achieved promising results. In comparison, the machine learning-based approaches did not achieve good results, which is due to the small size of sleep information in the training data. CONCLUSION The results show that the rule-based NLP algorithm consistently achieved the best performance for all sleep concepts. This study focused on the clinical notes of patients with AD but could be extended to general sleep information extraction for other diseases.
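The rule-based extraction and the PPV/F1 scoring described in this abstract can be pictured with a minimal sketch. The patterns, the sample note, and the gold labels below are hypothetical and written only for illustration; the study's actual rules are not given in the abstract, and this sketch does not handle negation (for example "denies snoring").

```python
import re

# Hypothetical keyword patterns, one per sleep concept named in the abstract.
# They are illustrative assumptions, not the rules used in the study.
PATTERNS = {
    "snoring": re.compile(r"\bsnor(?:e|es|ed|ing)\b", re.I),
    "napping": re.compile(r"\bnap(?:s|ped|ping)?\b", re.I),
    "daytime sleepiness": re.compile(r"\b(?:daytime|excessive) sleepiness\b", re.I),
    "night wakings": re.compile(r"\bwak(?:e|es|ing) up (?:at|during the) night\b", re.I),
    "sleep duration": re.compile(
        r"\bsleeps? (?:about |around )?\d+(?:\.\d+)? (?:hours|hrs)\b", re.I
    ),
}


def extract_concepts(note):
    """Return the set of concept names whose pattern matches the note text."""
    return {name for name, pattern in PATTERNS.items() if pattern.search(note)}


def ppv_recall_f1(predicted, gold):
    """Score a predicted concept set against a gold (manually annotated) set."""
    true_positives = len(predicted & gold)
    ppv = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = 2 * ppv * recall / (ppv + recall) if (ppv + recall) else 0.0
    return ppv, recall, f1


# Made-up note and gold annotation for demonstration only.
note = "Patient naps daily and snores loudly. Sleeps about 6 hours per night."
gold = {"snoring", "napping", "sleep duration", "night wakings"}

found = extract_concepts(note)
ppv, recall, f1 = ppv_recall_f1(found, gold)
print(sorted(found), ppv, recall, round(f1, 3))
```

In this toy case every extracted concept is correct (PPV of 1.0) but one gold concept is missed, which lowers recall and therefore F1.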
Affiliation(s)
- Sonish Sivarajkumar
  - Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA 15260, United States
- Thomas Yu Chow Tam
  - Department of Health Information Management, University of Pittsburgh, Pittsburgh, PA 15260, United States
- Haneef Ahamed Mohammad
  - Department of Health Information Management, University of Pittsburgh, Pittsburgh, PA 15260, United States
- Samuel Viggiano
  - Department of Health Information Management, University of Pittsburgh, Pittsburgh, PA 15260, United States
- David Oniani
  - Department of Health Information Management, University of Pittsburgh, Pittsburgh, PA 15260, United States
- Shyam Visweswaran
  - Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA 15260, United States
  - Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA 15260, United States
- Yanshan Wang
  - Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA 15260, United States
  - Department of Health Information Management, University of Pittsburgh, Pittsburgh, PA 15260, United States
  - Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA 15260, United States
  - Clinical and Translational Science Institute, University of Pittsburgh, Pittsburgh, PA 15260, United States
2
Bazoge A, Morin E, Daille B, Gourraud PA. Applying Natural Language Processing to Textual Data From Clinical Data Warehouses: Systematic Review. JMIR Med Inform 2023; 11:e42477. [PMID: 38100200; PMCID: PMC10757232; DOI: 10.2196/42477] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/05/2022] [Revised: 01/16/2023] [Accepted: 09/07/2023] [Indexed: 12/17/2023] Open
Abstract
BACKGROUND In recent years, health data collected during the clinical care process have been often repurposed for secondary use through clinical data warehouses (CDWs), which interconnect disparate data from different sources. A large amount of information of high clinical value is stored in unstructured text format. Natural language processing (NLP), which implements algorithms that can operate on massive unstructured textual data, has the potential to structure the data and make clinical information more accessible. OBJECTIVE The aim of this review was to provide an overview of studies applying NLP to textual data from CDWs. It focuses on identifying the (1) NLP tasks applied to data from CDWs and (2) NLP methods used to tackle these tasks. METHODS This review was performed according to the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines. We searched for relevant articles in 3 bibliographic databases: PubMed, Google Scholar, and ACL Anthology. We reviewed the titles and abstracts and included articles according to the following inclusion criteria: (1) focus on NLP applied to textual data from CDWs, (2) articles published between 1995 and 2021, and (3) written in English. RESULTS We identified 1353 articles, of which 194 (14.34%) met the inclusion criteria. Among all identified NLP tasks in the included papers, information extraction from clinical text (112/194, 57.7%) and the identification of patients (51/194, 26.3%) were the most frequent tasks. To address the various tasks, symbolic methods were the most common NLP methods (124/232, 53.4%), showing that some tasks can be partially achieved with classical NLP techniques, such as regular expressions or pattern matching that exploit specialized lexica, such as drug lists and terminologies. Machine learning (70/232, 30.2%) and deep learning (38/232, 16.4%) have been increasingly used in recent years, including the most recent approaches based on transformers. 
NLP methods were mostly applied to English language data (153/194, 78.9%). CONCLUSIONS CDWs are central to the secondary use of clinical texts for research purposes. Although the use of NLP on data from CDWs is growing, there remain challenges in this field, especially with regard to languages other than English. Clinical NLP is an effective strategy for accessing, extracting, and transforming data from CDWs. Information retrieved with NLP can assist in clinical research and have an impact on clinical practice.
Affiliation(s)
- Adrien Bazoge
  - Nantes Université, École Centrale Nantes, CNRS, LS2N, UMR 6004, F-44000 Nantes, France
  - Nantes Université, CHU de Nantes, Pôle Hospitalo-Universitaire 11: Santé Publique, Clinique des données, INSERM, CIC 1413, F-44000 Nantes, France
- Emmanuel Morin
  - Nantes Université, École Centrale Nantes, CNRS, LS2N, UMR 6004, F-44000 Nantes, France
- Béatrice Daille
  - Nantes Université, École Centrale Nantes, CNRS, LS2N, UMR 6004, F-44000 Nantes, France
- Pierre-Antoine Gourraud
  - Nantes Université, CHU de Nantes, Pôle Hospitalo-Universitaire 11: Santé Publique, Clinique des données, INSERM, CIC 1413, F-44000 Nantes, France
  - Nantes Université, INSERM, CHU de Nantes, École Centrale Nantes, Centre de Recherche Translationnelle en Transplantation et Immunologie, CR2TI, F-44000 Nantes, France
3
Boxley C, Fujimoto M, Ratwani RM, Fong A. A text mining approach to categorize patient safety event reports by medication error type. Sci Rep 2023; 13:18354. [PMID: 37884577; PMCID: PMC10603175; DOI: 10.1038/s41598-023-45152-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/18/2022] [Accepted: 10/17/2023] [Indexed: 10/28/2023] Open
Abstract
Patient safety reporting systems give healthcare provider staff the ability to report medication related safety events and errors; however, many of these reports go unanalyzed and safety hazards go undetected. The objective of this study is to examine whether natural language processing can be used to better categorize medication related patient safety event reports. 3,861 medication related patient safety event reports that were previously annotated using a consolidated medication error taxonomy were used to develop three models using the following algorithms: (1) logistic regression, (2) elastic net, and (3) XGBoost. After development, models were tested, and model performance was analyzed. We found the XGBoost model performed best across all medication error categories. 'Wrong Drug', 'Wrong Dosage Form or Technique or Route', and 'Improper Dose/Dose Omission' categories performed best across the three models. In addition, we identified five words most closely associated with each medication error category and which medication error categories were most likely to co-occur. Machine learning techniques offer a semi-automated method for identifying specific medication error types from the free text of patient safety event reports. These algorithms have the potential to improve the categorization of medication related patient safety event reports which may lead to better identification of important medication safety patterns and trends.
Affiliation(s)
- Christian Boxley
  - MedStar Health National Center for Human Factors in Healthcare, 3007 Tilden St., NW Suite 6N, Washington, DC, 20008, USA
- Raj M Ratwani
  - MedStar Health National Center for Human Factors in Healthcare, 3007 Tilden St., NW Suite 6N, Washington, DC, 20008, USA
  - Georgetown University School of Medicine, Washington, USA
- Allan Fong
  - MedStar Health National Center for Human Factors in Healthcare, 3007 Tilden St., NW Suite 6N, Washington, DC, 20008, USA
4
Davis SE, Zabotka L, Desai RJ, Wang SV, Maro JC, Coughlin K, Hernández-Muñoz JJ, Stojanovic D, Shah NH, Smith JC. Use of Electronic Health Record Data for Drug Safety Signal Identification: A Scoping Review. Drug Saf 2023; 46:725-742. [PMID: 37340238; DOI: 10.1007/s40264-023-01325-0] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Accepted: 05/31/2023] [Indexed: 06/22/2023]
Abstract
INTRODUCTION Pharmacovigilance programs protect patient health and safety by identifying adverse event signals through postmarketing surveillance of claims data and spontaneous reports. Electronic health records (EHRs) provide new opportunities to address limitations of traditional approaches and promote discovery-oriented pharmacovigilance. METHODS To evaluate the current state of EHR-based medication safety signal identification, we conducted a scoping literature review of studies aimed at identifying safety signals from routinely collected patient-level EHR data. We extracted information on study design, EHR data elements utilized, analytic methods employed, drugs and outcomes evaluated, and key statistical and data analysis choices. RESULTS We identified 81 eligible studies. Disproportionality methods were the predominant analytic approach, followed by data mining and regression. Variability in study design makes direct comparisons difficult. Studies varied widely in terms of data, confounding adjustment, and statistical considerations. CONCLUSION Despite broad interest in utilizing EHRs for safety signal identification, current efforts fail to leverage the full breadth and depth of available data or to rigorously control for confounding. The development of best practices and application of common data models would promote the expansion of EHR-based pharmacovigilance.
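The abstract names disproportionality methods as the predominant analytic approach without saying which measures the reviewed studies used. As a general illustration only, two standard disproportionality measures, the proportional reporting ratio (PRR) and the reporting odds ratio (ROR), can be computed from a 2x2 table of counts. The counts below are made up.

```python
def proportional_reporting_ratio(a, b, c, d):
    """PRR from a 2x2 table.

    a: records with the drug of interest and the event
    b: records with the drug of interest, without the event
    c: records with other drugs and the event
    d: records with other drugs, without the event
    """
    return (a / (a + b)) / (c / (c + d))


def reporting_odds_ratio(a, b, c, d):
    """ROR from the same 2x2 table: odds of the event with vs. without the drug."""
    return (a * d) / (b * c)


# Hypothetical counts chosen for easy arithmetic, not taken from any study.
a, b, c, d = 20, 980, 100, 19900

prr = proportional_reporting_ratio(a, b, c, d)
ror = reporting_odds_ratio(a, b, c, d)
print(f"PRR = {prr:.2f}, ROR = {ror:.2f}")
```

A value well above 1 for either measure suggests the event is recorded disproportionately often with the drug, which flags a signal for follow-up rather than establishing causation.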
Affiliation(s)
- Sharon E Davis
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, 2525 West End Ave, Suite 1475, Nashville, TN, 37203, USA
  - Vanderbilt University School of Medicine, Nashville, TN, USA
- Rishi J Desai
  - Brigham and Women's Hospital, Boston, MA, USA
  - Harvard Medical School, Boston, MA, USA
- Shirley V Wang
  - Brigham and Women's Hospital, Boston, MA, USA
  - Harvard Medical School, Boston, MA, USA
- Judith C Maro
  - Harvard Medical School, Boston, MA, USA
  - Harvard Pilgrim Health Care Institute, Boston, MA, USA
- Nigam H Shah
  - School of Medicine, Stanford University, Stanford, CA, USA
  - Stanford Health Care, Palo Alto, CA, USA
- Joshua C Smith
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, 2525 West End Ave, Suite 1475, Nashville, TN, 37203, USA
  - Vanderbilt University School of Medicine, Nashville, TN, USA
5
Lee K, Liu Z, Chandran U, Kalsekar I, Laxmanan B, Higashi MK, Jun T, Ma M, Li M, Mai Y, Gilman C, Wang T, Ai L, Aggarwal P, Pan Q, Oh W, Stolovitzky G, Schadt E, Wang X. Detecting Ground Glass Opacity Features in Patients With Lung Cancer: Automated Extraction and Longitudinal Analysis via Deep Learning-Based Natural Language Processing. JMIR AI 2023; 2:e44537. [PMID: 38875565; PMCID: PMC11041451; DOI: 10.2196/44537] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/23/2022] [Revised: 01/30/2023] [Accepted: 03/31/2023] [Indexed: 06/16/2024]
Abstract
BACKGROUND Ground-glass opacities (GGOs) appearing in computed tomography (CT) scans may indicate potential lung malignancy. Proper management of GGOs based on their features can prevent the development of lung cancer. Electronic health records are rich sources of information on GGO nodules and their granular features, but most of the valuable information is embedded in unstructured clinical notes. OBJECTIVE We aimed to develop, test, and validate a deep learning-based natural language processing (NLP) tool that automatically extracts GGO features to inform the longitudinal trajectory of GGO status from large-scale radiology notes. METHODS We developed a bidirectional long short-term memory with a conditional random field-based deep-learning NLP pipeline to extract GGO and granular features of GGO retrospectively from radiology notes of 13,216 lung cancer patients. We evaluated the pipeline with quality assessments and analyzed cohort characterization of the distribution of nodule features longitudinally to assess changes in size and solidity over time. RESULTS Our NLP pipeline built on the GGO ontology we developed achieved between 95% and 100% precision, 89% and 100% recall, and 92% and 100% F1-scores on different GGO features. We deployed this GGO NLP model to extract and structure comprehensive characteristics of GGOs from 29,496 radiology notes of 4521 lung cancer patients. Longitudinal analysis revealed that size increased in 16.8% (240/1424) of patients, decreased in 14.6% (208/1424), and remained unchanged in 68.5% (976/1424) in their last note compared to the first note. Among 1127 patients who had longitudinal radiology notes of GGO status, 815 (72.3%) were reported to have stable status, and 259 (23%) had increased/progressed status in the subsequent notes. 
CONCLUSIONS Our deep learning-based NLP pipeline can automatically extract granular GGO features at scale from electronic health records when this information is documented in radiology notes and help inform the natural history of GGO. This will open the way for a new paradigm in lung cancer prevention and early detection.
Affiliation(s)
- Urmila Chandran
  - Lung Cancer Initiative, Johnson & Johnson, New Brunswick, NJ, United States
- Iftekhar Kalsekar
  - Lung Cancer Initiative, Johnson & Johnson, New Brunswick, NJ, United States
- Balaji Laxmanan
  - Lung Cancer Initiative, Johnson & Johnson, New Brunswick, NJ, United States
- Tomi Jun
  - Sema4, Stamford, CT, United States
- Meng Ma
  - Sema4, Stamford, CT, United States
- Yun Mai
  - Sema4, Stamford, CT, United States
- Lei Ai
  - Sema4, Stamford, CT, United States
- Qi Pan
  - Sema4, Stamford, CT, United States
- William Oh
  - Icahn School of Medicine at Mount Sinai, New York, NY, United States
- Eric Schadt
  - Icahn School of Medicine at Mount Sinai, New York, NY, United States
6
Keloth VK, Zhou S, Lindemann L, Zheng L, Elhanan G, Einstein AJ, Geller J, Perl Y. Mining of EHR for interface terminology concepts for annotating EHRs of COVID patients. BMC Med Inform Decis Mak 2023; 23:40. [PMID: 36829139; PMCID: PMC9951157; DOI: 10.1186/s12911-023-02136-0] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 07/22/2022] [Accepted: 02/09/2023] [Indexed: 02/26/2023] Open
Abstract
BACKGROUND Two years into the COVID-19 pandemic and with more than five million deaths worldwide, the healthcare establishment continues to struggle with every new wave of the pandemic resulting from a new coronavirus variant. Research has demonstrated that there are variations in the symptoms, and even in the order of symptom presentations, in COVID-19 patients infected by different SARS-CoV-2 variants (e.g., Alpha and Omicron). Textual data in the form of admission notes and physician notes in the Electronic Health Records (EHRs) is rich in information regarding the symptoms and their orders of presentation. Unstructured EHR data is often underutilized in research due to the lack of annotations that enable automatic extraction of useful information from the available extensive volumes of textual data. METHODS We present the design of a COVID Interface Terminology (CIT), not just a generic COVID-19 terminology, but one serving a specific purpose of enabling automatic annotation of EHRs of COVID-19 patients. CIT was constructed by integrating existing COVID-related ontologies and mining additional fine granularity concepts from clinical notes. The iterative mining approach utilized the techniques of 'anchoring' and 'concatenation' to identify potential fine granularity concepts to be added to the CIT. We also tested the generalizability of our approach on a hold-out dataset and compared the annotation coverage to the coverage obtained for the dataset used to build the CIT. RESULTS Our experiments demonstrate that this approach results in higher annotation coverage compared to existing ontologies such as SNOMED CT and Coronavirus Infectious Disease Ontology (CIDO). The final version of CIT achieved about 20% more coverage than SNOMED CT and 50% more coverage than CIDO. In the future, the concepts mined and added into CIT could be used as training data for machine learning models for mining even more concepts into CIT and further increasing the annotation coverage. 
CONCLUSION In this paper, we demonstrated the construction of a COVID interface terminology that can be utilized for automatically annotating EHRs of COVID-19 patients. The techniques presented can identify frequently documented fine granularity concepts that are missing in other ontologies thereby increasing the annotation coverage.
Affiliation(s)
- Vipina K Keloth
  - School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX, USA
- Shuxin Zhou
  - Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, USA
- Luke Lindemann
  - School of Medicine and Health Sciences, The George Washington University, Washington (D.C.), USA
- Ling Zheng
  - Computer Science and Software Engineering Department, Monmouth University, West Long Branch, NJ, USA
- Gai Elhanan
  - Renown Institute for Health Innovation, Desert Research Institute, Reno, NV, USA
- Andrew J Einstein
  - Cardiology Division, Department of Medicine, Columbia University Irving Medical Center, New York, NY, USA
  - Department of Radiology, Columbia University Irving Medical Center, New York, NY, USA
- James Geller
  - Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, USA
- Yehoshua Perl
  - Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, USA
7
Whitaker B, Pizarro J, Deady M, Williams A, Ezzeldin H, Belov A, Kanderian S, Billings D, Cook K, Hettinger AZ, Anderson S. Detection of allergic transfusion-related adverse events from electronic medical records. Transfusion 2022; 62:2029-2038. [PMID: 36004803; DOI: 10.1111/trf.17069] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 12/21/2021] [Revised: 07/18/2022] [Accepted: 07/19/2022] [Indexed: 11/29/2022]
Abstract
BACKGROUND Transfusion-related adverse events can be unrecognized and unreported. As part of the US Food and Drug Administration's Center for Biologics Evaluation and Research Biologics Effectiveness and Safety initiative, we explored whether machine learning methods, such as natural language processing (NLP), can identify and report transfusion allergic reactions (ARs) from electronic health records (EHRs). STUDY DESIGN AND METHODS In a 4-year period, all 146 reported transfusion ARs were pulled from a database of 86,764 transfusions in an academic health system, along with a random sample of 605 transfusions without reported ARs. Structured and unstructured EHR data were retrieved, including demographics, new symptoms, medications, and lab results. In unstructured data, evidence from clinicians' notes, test results, and prescriptions fields identified transfusion ARs, which were used to extract NLP features. Clinician reviews of selected validation cases assessed and confirmed model performance. RESULTS Clinician reviews of selected validation cases yielded a sensitivity of 67.9% and a specificity of 97.5% at a threshold of 0.9, with a positive predictive value (PPV) of 84%, estimated to 4.5% when extrapolated to match transfusion AR incidence in the full transfusion dataset. A higher threshold achieved sensitivity of 43% with specificity/PPV of 100% in our validation set. Essential features predicting ARs were recognized transfusion reactions, administration of antihistamines or glucocorticoids, and skin symptoms (e.g., hives and itching). Removal of NLP features decreased model performance. DISCUSSION NLP algorithms can identify transfusion reactions from the EHR with a reasonable level of precision for subsequent clinician review and confirmation.
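The drop from a PPV of 84% in the validation sample to an estimated 4.5% in the full dataset follows from how rare reported ARs are. A minimal sketch of that extrapolation, assuming it is done with Bayes' rule and that the prevalence is the 146 reported ARs out of 86,764 transfusions (the abstract does not state the exact procedure), is:

```python
def ppv_at_prevalence(sensitivity, specificity, prevalence):
    """Positive predictive value implied by sensitivity, specificity, and prevalence."""
    true_positive_rate = sensitivity * prevalence
    false_positive_rate = (1 - specificity) * (1 - prevalence)
    return true_positive_rate / (true_positive_rate + false_positive_rate)


# Figures quoted in the abstract; treating 146 / 86,764 as the prevalence
# is an assumption made for this sketch.
prevalence = 146 / 86764
ppv = ppv_at_prevalence(sensitivity=0.679, specificity=0.975, prevalence=prevalence)
print(f"prevalence = {prevalence:.4%}, extrapolated PPV = {ppv:.1%}")
```

Under these assumptions the result lands close to the 4.5% reported; the small difference is plausibly due to rounding of the published sensitivity and specificity. The example shows why even a specificity of 97.5% yields mostly false positives when fewer than 2 in 1,000 transfusions have a reported AR.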
Affiliation(s)
- Barbee Whitaker
  - Office of Biostatistics and Pharmacovigilance, Center for Biologics Evaluation and Research, US Food and Drug Administration, Silver Spring, Maryland, USA
- Jeno Pizarro
  - International Business Machines (IBM) Corporation, Bethesda, Maryland, USA
- Matthew Deady
  - International Business Machines (IBM) Corporation, Bethesda, Maryland, USA
- Alan Williams
  - Office of Biostatistics and Pharmacovigilance, Center for Biologics Evaluation and Research, US Food and Drug Administration, Silver Spring, Maryland, USA
- Hussein Ezzeldin
  - Office of Biostatistics and Pharmacovigilance, Center for Biologics Evaluation and Research, US Food and Drug Administration, Silver Spring, Maryland, USA
- Artur Belov
  - Office of Biostatistics and Pharmacovigilance, Center for Biologics Evaluation and Research, US Food and Drug Administration, Silver Spring, Maryland, USA
- Sami Kanderian
  - International Business Machines (IBM) Corporation, Bethesda, Maryland, USA
- Douglas Billings
  - International Business Machines (IBM) Corporation, Bethesda, Maryland, USA
- Kerry Cook
  - International Business Machines (IBM) Corporation, Bethesda, Maryland, USA
- Aaron Z Hettinger
  - Center for Biostatistics, Informatics and Data Science, MedStar Health Research Institute, Hyattsville, Maryland, USA
- Steven Anderson
  - Office of Biostatistics and Pharmacovigilance, Center for Biologics Evaluation and Research, US Food and Drug Administration, Silver Spring, Maryland, USA
8
Explainable detection of adverse drug reaction with imbalanced data distribution. PLoS Comput Biol 2022; 18:e1010144. [PMID: 35704662; PMCID: PMC9239481; DOI: 10.1371/journal.pcbi.1010144] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 12/11/2021] [Revised: 06/28/2022] [Accepted: 04/26/2022] [Indexed: 11/18/2022] Open
Abstract
Analysis of health-related texts can be used to detect adverse drug reactions (ADR). The greatest challenge for ADR detection lies in imbalanced data distributions where words related to ADR symptoms are often minority classes. As a result, trained models tend to converge to a point that is strongly biased towards the majority class and ignores the minority class. Since the most commonly used cross-entropy criterion is an approximation to accuracy, the model focuses more readily on the majority class to achieve high accuracy. To address this issue, existing methods apply either oversampling or down-sampling strategies to balance the data distribution and exploit the most difficult samples of the minority class. However, increasing or reducing the number of individual tokens alone in sequence labeling tasks will result in the loss of the syntactic relations of the sentence. This paper proposes a weighted variant of conditional random field (CRF) for data-imbalanced sequence labeling tasks. Such a weighting strategy can alleviate data distribution imbalances between majority and minority classes. Instead of using softmax in the output layer, the CRF can capture the relationship of labels between tokens. The locally interpretable model-agnostic explanations (LIME) algorithm was applied to investigate performance differences between models with and without the weighted loss function. Experimental results on two different ADR tasks show that the proposed model outperforms previously proposed sequence labeling methods.
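The idea of weighting the loss so that rare ADR labels are not drowned out can be shown with a much simpler stand-in than the paper's weighted CRF. The sketch below uses inverse-frequency class weights with a per-token negative log-likelihood; the weighting formula, labels, and probabilities are assumptions chosen for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by total / (n_classes * class_count), so rare classes weigh more."""
    counts = Counter(labels)
    total = len(labels)
    n_classes = len(counts)
    return {label: total / (n_classes * count) for label, count in counts.items()}


def weighted_nll(token_probs, labels, weights):
    """Weighted mean negative log-likelihood over tokens.

    token_probs[i] maps each label to the predicted probability for token i.
    """
    numerator = sum(weights[y] * -math.log(p[y]) for p, y in zip(token_probs, labels))
    denominator = sum(weights[y] for y in labels)
    return numerator / denominator


# Toy sequence: 8 majority tokens ("O") and 2 minority tokens ("ADR").
labels = ["O"] * 8 + ["ADR"] * 2
weights = inverse_frequency_weights(labels)

# A model that always favors the majority class.
majority_biased = [{"O": 0.9, "ADR": 0.1} for _ in labels]

uniform = {"O": 1.0, "ADR": 1.0}
unweighted_loss = weighted_nll(majority_biased, labels, uniform)
weighted_loss = weighted_nll(majority_biased, labels, weights)
print(weights, round(unweighted_loss, 3), round(weighted_loss, 3))
```

With uniform weights the majority-biased model looks acceptable; with inverse-frequency weights its loss rises sharply because its errors on the two ADR tokens now count as much as all eight majority tokens together.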
9
Lindvall C, Deng CY, Agaronnik ND, Kwok A, Samineni S, Umeton R, Mackie-Jenkins W, Kehl KL, Tulsky JA, Enzinger AC. Deep Learning for Cancer Symptoms Monitoring on the Basis of Electronic Health Record Unstructured Clinical Notes. JCO Clin Cancer Inform 2022; 6:e2100136. [PMID: 35714301; PMCID: PMC9232368; DOI: 10.1200/cci.21.00136] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Indexed: 11/20/2022] Open
Abstract
PURPOSE Symptoms are vital outcomes for cancer clinical trials, observational research, and population-level surveillance. Patient-reported outcomes (PROs) are valuable for monitoring symptoms, yet there are many challenges to collecting PROs at scale. We sought to develop, test, and externally validate a deep learning model to extract symptoms from unstructured clinical notes in the electronic health record. METHODS We randomly selected 1,225 outpatient progress notes from among patients treated at the Dana-Farber Cancer Institute between January 2016 and December 2019 and used 1,125 notes as our training/validation data set and 100 notes as our test data set. We evaluated the performance of 10 deep learning models for detecting 80 symptoms included in the National Cancer Institute's Patient-Reported Outcomes version of the Common Terminology Criteria for Adverse Events (PRO-CTCAE) framework. Model performance as compared with manual chart abstraction was assessed using standard metrics, and the highest performer was externally validated on a sample of 100 physician notes from a different clinical context. RESULTS In our training and test data sets, 75 of the 80 candidate symptoms were identified. The ELECTRA-small model had the highest performance for symptom identification at the token level (ie, at the individual symptom level), with an F1 of 0.87 and a processing time of 3.95 seconds per note. For the 10 most common symptoms in the test data set, the F1 score ranged from 0.98 for anxious to 0.86 for fatigue. For external validation of the same symptoms, the note-level performance ranged from F1 = 0.97 for diarrhea and dizziness to F1 = 0.73 for swelling. CONCLUSION Training a deep learning model to identify a wide range of electronic health record-documented symptoms relevant to cancer care is feasible. This approach could be used at the health system scale to complement electronic PROs.
Affiliation(s)
- Charlotta Lindvall
  - Dana-Farber Cancer Institute, Boston, MA
  - Harvard Medical School, Boston, MA
  - Brigham and Women's Hospital, Boston, MA
- Nicole D Agaronnik
  - Dana-Farber Cancer Institute, Boston, MA
  - Harvard Medical School, Boston, MA
- Anne Kwok
  - Dana-Farber Cancer Institute, Boston, MA
- Kenneth L Kehl
  - Dana-Farber Cancer Institute, Boston, MA
  - Harvard Medical School, Boston, MA
  - Brigham and Women's Hospital, Boston, MA
- James A Tulsky
  - Dana-Farber Cancer Institute, Boston, MA
  - Harvard Medical School, Boston, MA
  - Brigham and Women's Hospital, Boston, MA
- Andrea C Enzinger
  - Dana-Farber Cancer Institute, Boston, MA
  - Harvard Medical School, Boston, MA
  - Brigham and Women's Hospital, Boston, MA
10
Using Machine Learning for Pharmacovigilance: A Systematic Review. Pharmaceutics 2022; 14:266. [PMID: 35213998; PMCID: PMC8924891; DOI: 10.3390/pharmaceutics14020266] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 12/17/2021] [Revised: 01/13/2022] [Accepted: 01/21/2022] [Indexed: 02/04/2023] Open
Abstract
Pharmacovigilance is a science that involves the ongoing monitoring of adverse drug reactions to existing medicines. Traditional approaches in this field can be expensive and time-consuming. The application of natural language processing (NLP) to analyze user-generated content is hypothesized as an effective supplemental source of evidence. In this systematic review, a broad and multi-disciplinary literature search was conducted involving four databases. A total of 5318 publications were initially found. Studies were considered relevant if they reported on the application of NLP to understand user-generated text for pharmacovigilance. A total of 16 relevant publications were included in this systematic review. All studies were evaluated to have medium reliability and validity. For all types of drugs, 14 publications reported positive findings with respect to the identification of adverse drug reactions, providing consistent evidence that natural language processing can be used effectively and accurately on user-generated textual content published on the Internet to identify adverse drug reactions for the purpose of pharmacovigilance. The evidence presented in this review suggests that the analysis of textual data has the potential to complement the traditional system of pharmacovigilance.

11
Deady M, Ezzeldin H, Cook K, Billings D, Pizarro J, Plotogea AA, Saunders-Hastings P, Belov A, Whitaker BI, Anderson SA. The Food and Drug Administration Biologics Effectiveness and Safety Initiative Facilitates Detection of Vaccine Administrations From Unstructured Data in Medical Records Through Natural Language Processing. Front Digit Health 2022; 3:777905. [PMID: 35005697 PMCID: PMC8727347 DOI: 10.3389/fdgth.2021.777905] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/16/2021] [Accepted: 12/03/2021] [Indexed: 12/03/2022] Open
Abstract
Introduction: The Food and Drug Administration Center for Biologics Evaluation and Research conducts post-market surveillance of biologic products to ensure their safety and effectiveness. Studies have found that common vaccine exposures may be missing from structured data elements of electronic health records (EHRs), instead being captured in clinical notes. This impacts monitoring of adverse events following immunizations (AEFIs). For example, COVID-19 vaccines have been regularly administered outside of traditional medical settings. We developed a natural language processing (NLP) algorithm to mine unstructured clinical notes for vaccinations not captured in structured EHR data. Methods: A random sample of 1,000 influenza vaccine administrations, representing 995 unique patients, was extracted from a large U.S. EHR database. NLP techniques were used to detect administrations from the clinical notes in the training dataset [80% (N = 797) of patients]. The algorithm was applied to the validation dataset [20% (N = 198) of patients] to assess performance. Full medical charts for 28 randomly selected administration events in the validation dataset were reviewed by clinicians. The NLP algorithm was then applied across the entire dataset (N = 995) to quantify the number of additional events identified. Results: A total of 3,199 administrations were identified in the structured data and clinical notes combined. Of these, 2,740 (85.7%) were identified in the structured data, while the NLP algorithm identified 1,183 (37.0%) administrations in clinical notes; 459 were not also captured in the structured data. This represents a 16.8% increase in the identification of vaccine administrations compared to using structured data alone. The validation of 28 vaccine administrations confirmed 27 (96.4%) as “definite” vaccine administrations; 18 (64.3%) had evidence of a vaccination event in the structured data, while 10 (35.7%) were found solely in the unstructured notes. 
Discussion: We demonstrated the utility of an NLP algorithm to identify vaccine administrations not captured in structured EHR data. NLP techniques have the potential to improve detection of vaccine administrations not otherwise reported without increasing the analysis burden on physicians or practitioners. Future applications could include refining estimates of vaccine coverage and detecting other exposures, population characteristics, and outcomes not reliably captured in structured EHR data.
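The percentages in the Results of this abstract can be reproduced from the reported counts. The short sketch below only re-derives those figures; the variable and function names are illustrative and do not come from the study.

```python
# Counts as reported in the abstract above.
total = 3199        # administrations found in structured data and notes combined
structured = 2740   # found in structured EHR data
nlp_notes = 1183    # found by the NLP algorithm in clinical notes
notes_only = 459    # found by NLP but not captured in structured data


def pct(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


structured_share = pct(structured, total)        # share found in structured data
nlp_share = pct(nlp_notes, total)                # share found in clinical notes
relative_increase = pct(notes_only, structured)  # gain over structured data alone

print(structured_share, nlp_share, relative_increase)
```

Note that the 16.8% increase is relative to the structured-data count (459 of 2,740), not to the combined total.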
Affiliation(s)
- Hussein Ezzeldin
- US Food and Drug Administration, Silver Spring, MD, United States
- Artur Belov
- US Food and Drug Administration, Silver Spring, MD, United States

12
Piscitelli A, Bevilacqua L, Labella B, Parravicini E, Auxilia F. A Keyword Approach to Identify Adverse Events Within Narrative Documents From 4 Italian Institutions. J Patient Saf 2022; 18:e362-e367. [PMID: 32910039 DOI: 10.1097/pts.0000000000000783] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 11/25/2022]
Abstract
OBJECTIVES Existing methods for measuring adverse events in hospitals intercept a restricted number of events. Text mining refers to a range of techniques to extract data from narrative sources. The goal of this study was to evaluate the performance of an automated approach for extracting adverse event keywords from within electronic health records. METHODS The study involved 4 medical centers in the Region of Lombardy. A starting set of keywords was trained in an iterative process to develop queries for 7 adverse events, including those used by the Agency for Healthcare Research and Quality as patient safety indicators. We calculated positive predictive values of the 7 queries and performed an error analysis to detect reasons for false-positive cases of pulmonary embolism, deep vein thrombosis, and urinary tract infection. RESULTS Overall, 397,233 records were collected (34,805 discharge summaries, 292,593 emergency department notes, and 69,835 operation reports). Positive predictive values were higher for postoperative wound dehiscence (83.83%) and urinary tract infection (73.07%), whereas they were lower for deep vein thrombosis (5.37%), pulmonary embolism (13.63%), and postoperative sepsis (12.28%). The most common reasons for false positives were reporting of past events (42.25%), negations (22.80%), and conditions suspected by physicians but not confirmed by a diagnostic test (11.25%). CONCLUSIONS The results of our study demonstrated the feasibility of using an automated approach to detect multiple adverse events in several data sources. More sophisticated techniques, such as natural language processing, should be tested to evaluate the feasibility of using text mining as a routine method for monitoring adverse events in hospitals.
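To make the keyword-query idea and its positive predictive value (PPV) concrete, here is a minimal sketch. The keyword list, the example notes, and the review labels are all invented for illustration; they are not the study's queries or data.

```python
import re

# Hypothetical keyword list for one adverse event (deep vein thrombosis).
KEYWORDS = ["deep vein thrombosis", "dvt"]
PATTERN = re.compile(
    r"\b(?:" + "|".join(map(re.escape, KEYWORDS)) + r")\b", re.IGNORECASE
)

# Invented notes paired with a manual-review label (True = event confirmed).
NOTES = [
    ("Patient developed deep vein thrombosis in the left leg.", True),
    ("No evidence of deep vein thrombosis on ultrasound.", False),  # negation
    ("History of DVT in 2010, currently asymptomatic.", False),     # past event
    ("Admitted for elective hip replacement.", False),              # no keyword
]


def positive_predictive_value(notes):
    """PPV = confirmed events / all notes flagged by the keyword query."""
    flagged = [label for text, label in notes if PATTERN.search(text)]
    return sum(flagged) / len(flagged) if flagged else float("nan")


ppv = positive_predictive_value(NOTES)
print(f"PPV = {ppv:.2f}")
```

In this toy set the two false positives come from a negation and a past event, the same error categories the abstract reports as most common.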
Affiliation(s)
- Antonio Piscitelli
- From the Post-graduate School of Hygiene and Preventive Medicine, University of Milan, Milan

13
Edrees H, Song W, Syrowatka A, Simona A, Amato MG, Bates DW. Intelligent Telehealth in Pharmacovigilance: A Future Perspective. Drug Saf 2022; 45:449-458. [PMID: 35579810 PMCID: PMC9112241 DOI: 10.1007/s40264-022-01172-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 03/02/2022] [Indexed: 01/28/2023]
Abstract
Pharmacovigilance improves patient safety by detecting and preventing adverse drug events. However, challenges exist that limit adverse drug event detection, resulting in many adverse drug events being underreported or inaccurately reported. One challenge includes having access to large data sets from various sources including electronic health records and wearable medical devices. Artificial intelligence, including machine learning methods, such as natural language processing and deep learning, can detect and extract information about adverse drug events, thus automating the pharmacovigilance process and improving the surveillance of known and documented adverse drug events. In addition, with the increased demand for telehealth services, for managing both acute and chronic diseases, artificial intelligence methods can play a role in detecting and preventing adverse drug events. In this review, we discuss two use cases of how artificial intelligence methods may be useful to improve the quality of pharmacovigilance and the role of artificial intelligence in telehealth practices.
Affiliation(s)
- Heba Edrees
- Division of General Internal Medicine, Brigham and Women’s Hospital, Boston, MA USA; Department of Pharmacy Practice, MCPHS University, Boston, MA USA; Harvard Medical School, 1620 Tremont St., 3rd Floor, Boston, MA 02120 USA
- Wenyu Song
- Division of General Internal Medicine, Brigham and Women’s Hospital, Boston, MA USA; Harvard Medical School, 1620 Tremont St., 3rd Floor, Boston, MA 02120 USA
- Ania Syrowatka
- Division of General Internal Medicine, Brigham and Women’s Hospital, Boston, MA USA; Harvard Medical School, 1620 Tremont St., 3rd Floor, Boston, MA 02120 USA
- Aurélien Simona
- Division of General Internal Medicine, Brigham and Women’s Hospital, Boston, MA USA; Harvard Medical School, 1620 Tremont St., 3rd Floor, Boston, MA 02120 USA
- Mary G. Amato
- Division of General Internal Medicine, Brigham and Women’s Hospital, Boston, MA USA
- David W. Bates
- Division of General Internal Medicine, Brigham and Women’s Hospital, Boston, MA USA; Harvard Medical School, 1620 Tremont St., 3rd Floor, Boston, MA 02120 USA; Department of Health Policy and Management, Harvard School of Public Health, Boston, MA USA

14
Chopard D, Treder MS, Corcoran P, Ahmed N, Johnson C, Busse M, Spasic I. Text Mining of Adverse Events in Clinical Trials: Deep Learning Approach. JMIR Med Inform 2021; 9:e28632. [PMID: 34951601 PMCID: PMC8742206 DOI: 10.2196/28632] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/09/2021] [Revised: 08/01/2021] [Accepted: 11/14/2021] [Indexed: 11/28/2022] Open
Abstract
BACKGROUND Pharmacovigilance and safety reporting, which involve processes for monitoring the use of medicines in clinical trials, play a critical role in the identification of previously unrecognized adverse events or changes in the patterns of adverse events. OBJECTIVE This study aims to demonstrate the feasibility of automating the coding of adverse events described in the narrative section of the serious adverse event report forms to enable statistical analysis of the aforementioned patterns. METHODS We used the Unified Medical Language System (UMLS) as the coding scheme, which integrates 217 source vocabularies, thus enabling coding against other relevant terminologies such as the International Classification of Diseases-10th Revision, Medical Dictionary for Regulatory Activities, and Systematized Nomenclature of Medicine. We used MetaMap, a highly configurable dictionary lookup software, to identify the mentions of the UMLS concepts. We trained a binary classifier using Bidirectional Encoder Representations from Transformers (BERT), a transformer-based language model that captures contextual relationships, to differentiate between mentions of the UMLS concepts that represented adverse events and those that did not. RESULTS The model achieved a high F1 score of 0.8080, despite the class imbalance. This is 10.15 percentage points lower than human-like performance but also 17.45 percentage points higher than that of the baseline approach. CONCLUSIONS These results confirmed that automated coding of adverse events described in the narrative section of serious adverse event reports is feasible. Once coded, adverse events can be statistically analyzed so that any correlations with the trialed medicines can be estimated in a timely fashion.
Affiliation(s)
- Daphne Chopard
- School of Computer Science & Informatics, Cardiff University, Cardiff, United Kingdom
- Matthias S Treder
- School of Computer Science & Informatics, Cardiff University, Cardiff, United Kingdom
- Padraig Corcoran
- School of Computer Science & Informatics, Cardiff University, Cardiff, United Kingdom
- Nagheen Ahmed
- Centre for Trials Research, Cardiff University, Cardiff, United Kingdom
- Claire Johnson
- Centre for Trials Research, Cardiff University, Cardiff, United Kingdom
- Monica Busse
- Centre for Trials Research, Cardiff University, Cardiff, United Kingdom
- Irena Spasic
- School of Computer Science & Informatics, Cardiff University, Cardiff, United Kingdom

15
Mao J, Sedrakyan A, Sun T, Guiahi M, Chudnoff S, Kinard M, Johnson SB. Assessing adverse event reports of hysteroscopic sterilization device removal using natural language processing. Pharmacoepidemiol Drug Saf 2021; 31:442-451. [PMID: 34919294 DOI: 10.1002/pds.5402] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/16/2021] [Revised: 12/09/2021] [Accepted: 12/13/2021] [Indexed: 11/07/2022]
Abstract
OBJECTIVE To develop an annotation model to apply natural language processing (NLP) to device adverse event reports and implement the model to evaluate the most frequently experienced events among women reporting a sterilization device removal. METHODS We included adverse event reports from the Manufacturer and User Facility Device Experience database from January 2005 to June 2018 related to device removal following hysteroscopic sterilization. We used an iterative process to develop an annotation model that extracts six categories of desired information and applied the annotation model to train an NLP algorithm. We assessed the model performance using positive predictive value (PPV, also known as precision), sensitivity (also known as recall), and F1 score (a combined measure of PPV and sensitivity). Using extracted variables, we summarized the reporting source, the presence of prespecified and other patient and device events, additional sterilizations and other procedures performed, and time from implantation to removal. RESULTS The overall F1 score was 91.5% for labeled items and 93.9% for distinct events after excluding duplicates. A total of 16 535 reports of device removal were analyzed. The most frequently reported patient and device events were abdominal/pelvic/genital pain (N = 13 166, 79.6%) and device dislocation/migration (N = 3180, 19.2%), respectively. Of those reporting an additional sterilization procedure, the majority had a hysterectomy or salpingectomy (N = 7932). One-fifth of the cases that had device removal timing specified reported a removal after 7 years following implantation (N = 2444/11 293). CONCLUSIONS We present a roadmap to develop an annotation model for NLP to analyze device adverse event reports. The extracted information is informative and complements findings from previous research using administrative data.
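The three metrics named in the Methods of this abstract relate to one another as in the sketch below. The true-positive, false-positive, and false-negative counts are hypothetical and are not taken from the study.

```python
def ppv(tp, fp):
    """Positive predictive value (precision): TP / (TP + FP)."""
    return tp / (tp + fp)


def sensitivity(tp, fn):
    """Sensitivity (recall): TP / (TP + FN)."""
    return tp / (tp + fn)


def f1_score(tp, fp, fn):
    """F1: harmonic mean of PPV and sensitivity, i.e. 2TP / (2TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn)


# Hypothetical annotation counts, for illustration only.
tp, fp, fn = 90, 10, 5
p, r, f1 = ppv(tp, fp), sensitivity(tp, fn), f1_score(tp, fp, fn)
print(f"PPV={p:.3f} sensitivity={r:.3f} F1={f1:.3f}")
```

Because F1 is a harmonic mean, it always lies between PPV and sensitivity and is pulled toward the lower of the two.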
Affiliation(s)
- Jialin Mao
- Department of Population Health Sciences, Weill Cornell Medicine, New York, New York, USA
- Art Sedrakyan
- Department of Population Health Sciences, Weill Cornell Medicine, New York, New York, USA
- Tianyi Sun
- Department of Population Health Sciences, Weill Cornell Medicine, New York, New York, USA
- Maryam Guiahi
- Department of Obstetrics and Gynecology, Division of Family Planning, University of Colorado Anschutz Medical Campus, Aurora, Colorado, USA
- Scott Chudnoff
- Department of Obstetrics and Gynecology, Stamford Hospital, Stamford, Connecticut, USA
- Stephen B Johnson
- Department of Population Health, New York University Langone Health, New York, New York, USA

16
Natural Language Processing to Identify Pulmonary Nodules and Extract Nodule Characteristics From Radiology Reports. Chest 2021; 160:1902-1914. [PMID: 34089738 DOI: 10.1016/j.chest.2021.05.048] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 10/13/2020] [Revised: 03/20/2021] [Accepted: 05/11/2021] [Indexed: 12/17/2022] Open
Abstract
BACKGROUND There is an urgent need for population-based studies on managing patients with pulmonary nodules. RESEARCH QUESTION Is it possible to identify pulmonary nodules and associated characteristics using an automated method? STUDY DESIGN AND METHODS We revised and refined an existing natural language processing (NLP) algorithm to identify radiology transcripts with pulmonary nodules and greatly expanded its functionality to identify the characteristics of the largest nodule, when present, including size, lobe, laterality, attenuation, calcification, and edge. We compared NLP results with a reference standard of manual transcript review in a random test sample of 200 radiology transcripts. We applied the final automated method to a larger cohort of patients who underwent chest CT scan in an integrated health care system from 2006 to 2016, and described their demographic and clinical characteristics. RESULTS In the test sample, the NLP algorithm had very high sensitivity (98.6%; 95% CI, 95.0%-99.8%) and specificity (100%; 95% CI, 93.9%-100%) for identifying pulmonary nodules. For attenuation, edge, and calcification, the NLP algorithm achieved similar accuracies, and it correctly identified the diameter of the largest nodule in 135 of 141 cases (95.7%; 95% CI, 91.0%-98.4%). In the larger cohort, the NLP found 217,771 reports with nodules among 717,304 chest CT reports (30.4%). From 2006 to 2016, the number of reports with nodules increased by 150%, and the mean size of the largest nodule gradually decreased from 11 to 8.9 mm. Radiologists documented the laterality and lobe (90%-95%) more often than the attenuation, calcification, and edge characteristics (11%-14%). INTERPRETATION The NLP algorithm identified pulmonary nodules and associated characteristics with high accuracy. In our community practice settings, the documentation of nodule characteristics is incomplete. Our results call for better documentation of nodule findings. 
The NLP algorithm can be used in population-based studies to identify pulmonary nodules, avoiding labor-intensive chart review.
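The abstract reports proportions with 95% confidence intervals (for example, 135 of 141 diameters correct). One common way to obtain such an interval is the exact Clopper-Pearson method, sketched below with the standard library only. Whether the authors used this particular method is an assumption; the abstract does not say.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson(x, n, alpha=0.05):
    """Exact two-sided confidence interval for a binomial proportion x/n."""

    def solve(cdf_at, target):
        # The binomial CDF decreases as p grows, so bisect on p in [0, 1].
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if cdf_at(mid) > target:
                lo = mid  # p is still too small
            else:
                hi = mid
        return (lo + hi) / 2

    lower = 0.0 if x == 0 else solve(lambda p: binom_cdf(x - 1, n, p), 1 - alpha / 2)
    upper = 1.0 if x == n else solve(lambda p: binom_cdf(x, n, p), alpha / 2)
    return lower, upper


low, high = clopper_pearson(135, 141)
print(f"{135 / 141:.1%} (95% CI, {low:.1%}-{high:.1%})")
```

Under that assumption, the result for 135 of 141 should land close to the interval quoted in the abstract; other methods such as the Wilson interval give slightly narrower bounds.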

17
Koleck TA, Tatonetti NP, Bakken S, Mitha S, Henderson MM, George M, Miaskowski C, Smaldone A, Topaz M. Identifying Symptom Information in Clinical Notes Using Natural Language Processing. Nurs Res 2021; 70:173-183. [PMID: 33196504 PMCID: PMC9109773 DOI: 10.1097/nnr.0000000000000488] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Indexed: 12/22/2022]
Abstract
BACKGROUND Symptoms are a core concept of nursing interest. Large-scale secondary data reuse of notes in electronic health records (EHRs) has the potential to increase the quantity and quality of symptom research. However, the symptom language used in clinical notes is complex. A need exists for methods designed specifically to identify and study symptom information from EHR notes. OBJECTIVES We aim to describe a method that combines standardized vocabularies, clinical expertise, and natural language processing to generate comprehensive symptom vocabularies and identify symptom information in EHR notes. We piloted this method with five diverse symptom concepts: constipation, depressed mood, disturbed sleep, fatigue, and palpitations. METHODS First, we obtained synonym lists for each pilot symptom concept from the Unified Medical Language System. Then, we used two large bodies of text (clinical notes from Columbia University Irving Medical Center and PubMed abstracts containing Medical Subject Headings or key words related to the pilot symptoms) to further expand our initial vocabulary of synonyms for each pilot symptom concept. We used NimbleMiner, an open-source natural language processing tool, to accomplish these tasks and evaluated NimbleMiner symptom identification performance by comparison to a manually annotated set of nurse- and physician-authored common EHR note types. RESULTS Compared to the baseline Unified Medical Language System synonym lists, we identified up to 11 times more additional synonym words or expressions, including abbreviations, misspellings, and unique multiword combinations, for each symptom concept. Natural language processing system symptom identification performance was excellent. DISCUSSION Using our comprehensive symptom vocabularies and NimbleMiner to label symptoms in clinical notes produced excellent performance metrics. 
The ability to extract symptom information from EHR notes in an accurate and scalable manner has the potential to greatly facilitate symptom science research.

18
Malec SA, Wei P, Bernstam EV, Boyce RD, Cohen T. Using computable knowledge mined from the literature to elucidate confounders for EHR-based pharmacovigilance. J Biomed Inform 2021; 117:103719. [PMID: 33716168 PMCID: PMC8559730 DOI: 10.1016/j.jbi.2021.103719] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 08/15/2020] [Revised: 12/31/2020] [Accepted: 01/04/2021] [Indexed: 10/21/2022]
Abstract
INTRODUCTION Drug safety research asks causal questions but relies on observational data. Confounding bias threatens the reliability of studies using such data. The successful control of confounding requires knowledge of variables called confounders affecting both the exposure and outcome of interest. However, causal knowledge of dynamic biological systems is complex and challenging. Fortunately, computable knowledge mined from the literature may hold clues about confounders. In this paper, we tested the hypothesis that incorporating literature-derived confounders can improve causal inference from observational data. METHODS We introduce two methods (semantic vector-based and string-based confounder search) that query literature-derived information for confounder candidates to control, using SemMedDB, a database of computable knowledge mined from the biomedical literature. These methods search SemMedDB for confounders by applying semantic constraint search for indications treated by the drug (exposure) and that are also known to cause the adverse event (outcome). We then include the literature-derived confounder candidates in statistical and causal models derived from free-text clinical notes. For evaluation, we use a reference dataset widely used in drug safety containing labeled pairwise relationships between drugs and adverse events and attempt to rediscover these relationships from a corpus of 2.2 M NLP-processed free-text clinical notes. We employ standard adjustment and causal inference procedures to predict and estimate causal effects by informing the models with varying numbers of literature-derived confounders and instantiating the exposure, outcome, and confounder variables in the models with dichotomous EHR-derived data. Finally, we compare the results from applying these procedures with naive measures of association (χ² and reporting odds ratio) and with each other.
RESULTS AND CONCLUSIONS We found semantic vector-based search to be superior to string-based search at reducing confounding bias. However, the effect of including more rather than fewer literature-derived confounders was inconclusive. We recommend using targeted learning estimation methods that can address treatment-confounder feedback, where confounders also behave as intermediate variables, and engaging subject-matter experts to adjudicate the handling of problematic covariates.
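The two naive measures of association mentioned in the Methods, the reporting odds ratio and the χ² statistic, are both computed from a 2x2 table of dichotomous exposure and outcome counts. The sketch below uses the standard textbook formulas with invented counts; it does not attempt to reproduce the confounder-adjusted models from the paper.

```python
def reporting_odds_ratio(a, b, c, d):
    """ROR for a 2x2 table:
    a = exposed with event,   b = exposed without event,
    c = unexposed with event, d = unexposed without event.
    """
    return (a * d) / (b * c)


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for a 2x2 table."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Hypothetical counts of notes mentioning a drug and an adverse event.
a, b, c, d = 20, 80, 10, 190
ror = reporting_odds_ratio(a, b, c, d)
chi2 = chi_square_2x2(a, b, c, d)
print(f"ROR={ror:.2f} chi2={chi2:.2f}")
```

Neither measure adjusts for confounders, which is why the abstract treats them as a naive baseline for comparison.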
Affiliation(s)
- Scott A Malec
- University of Pittsburgh School of Medicine, Department of Biomedical Informatics, Pittsburgh, PA, United States
- Peng Wei
- The University of Texas MD Anderson Cancer Center, Department of Biostatistics, Houston, TX, United States
- Elmer V Bernstam
- University of Texas Health Science Center at Houston, School of Biomedical Informatics, Houston, TX, United States
- Richard D Boyce
- University of Pittsburgh School of Medicine, Department of Biomedical Informatics, Pittsburgh, PA, United States
- Trevor Cohen
- University of Washington, Department of Biomedical Informatics and Medical Education, Seattle, WA, United States

19
Wei Q, Ji Z, Li Z, Du J, Wang J, Xu J, Xiang Y, Tiryaki F, Wu S, Zhang Y, Tao C, Xu H. A study of deep learning approaches for medication and adverse drug event extraction from clinical text. J Am Med Inform Assoc 2021; 27:13-21. [PMID: 31135882 DOI: 10.1093/jamia/ocz063] [Citation(s) in RCA: 52] [Impact Index Per Article: 17.3] [Received: 01/31/2019] [Revised: 03/23/2019] [Accepted: 04/17/2019] [Indexed: 11/13/2022] Open
Abstract
OBJECTIVE This article presents our approaches to extraction of medications and associated adverse drug events (ADEs) from clinical documents, which is the second track of the 2018 National NLP Clinical Challenges (n2c2) shared task. MATERIALS AND METHODS The clinical corpus used in this study was from the MIMIC-III database and the organizers annotated 303 documents for training and 202 for testing. Our system consists of 2 components: a named entity recognition (NER) component and a relation classification (RC) component. For each component, we implemented deep learning-based approaches (eg, BI-LSTM-CRF) and compared them with traditional machine learning approaches, namely, conditional random fields for NER and support vector machines for RC. In addition, we developed a deep learning-based joint model that recognizes ADEs and their relations to medications in 1 step using a sequence labeling approach. To further improve the performance, we also investigated different ensemble approaches that combine outputs from multiple approaches. RESULTS Our best-performing systems achieved F1 scores of 93.45% for NER, 96.30% for RC, and 89.05% for end-to-end evaluation, which ranked #2, #1, and #1 among all participants, respectively. Additional evaluations show that the deep learning-based approaches did outperform traditional machine learning algorithms in both NER and RC. The joint model that simultaneously recognizes ADEs and their relations to medications also achieved the best performance on RC, indicating its promise for relation extraction. CONCLUSION In this study, we developed deep learning approaches for extracting medications and their attributes such as ADEs, and demonstrated their superior performance compared with traditional machine learning algorithms, indicating their potential use in broader NER and RC tasks in the medical domain.
Affiliation(s)
- Qiang Wei
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
- Zongcheng Ji
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
- Zhiheng Li
- School of Computer Science and Technology, Dalian University of Technology, Dalian, China
- Jingcheng Du
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
- Jingqi Wang
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
- Jun Xu
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
- Yang Xiang
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
- Firat Tiryaki
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
- Stephen Wu
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
- Yaoyun Zhang
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
- Cui Tao
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
- Hua Xu
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA

20
Christopoulou F, Tran TT, Sahu SK, Miwa M, Ananiadou S. Adverse drug events and medication relation extraction in electronic health records with ensemble deep learning methods. J Am Med Inform Assoc 2021; 27:39-46. [PMID: 31390003 PMCID: PMC6913215 DOI: 10.1093/jamia/ocz101] [Citation(s) in RCA: 44] [Impact Index Per Article: 14.7] [Received: 01/30/2019] [Revised: 03/21/2019] [Accepted: 05/24/2019] [Indexed: 01/21/2023] Open
Abstract
Objective: Identification of drugs, associated medication entities, and interactions among them is crucial to prevent unwanted effects of drug therapy, known as adverse drug events. This article describes our participation in the n2c2 shared task on extracting relations between medication-related entities in electronic health records. Materials and Methods: We proposed an ensemble approach for relation extraction and classification between drugs and medication-related entities. We incorporated state-of-the-art named-entity recognition (NER) models based on bidirectional long short-term memory (BiLSTM) networks and conditional random fields (CRF) for end-to-end extraction. We additionally developed separate models for intra- and inter-sentence relation extraction and combined them using an ensemble method. The intra-sentence models rely on bidirectional long short-term memory networks and attention mechanisms and are able to capture dependencies between multiple related pairs in the same sentence. For the inter-sentence relations, we adopted a neural architecture that utilizes the Transformer network to improve performance in longer sequences. Results: Our team ranked third with a micro-averaged F1 score of 94.72% and 87.65% for relation and end-to-end relation extraction, respectively (Tracks 2 and 3). Our ensemble effectively takes advantage of our proposed models. Analysis of the reported results indicated that our proposed approach is more generalizable than the top-performing system, which employs additional training data- and corpus-driven processing techniques. Conclusions: We proposed a relation extraction system to identify relations between drugs and medication-related entities. The proposed approach is independent of external syntactic tools. Analysis showed that by using latent Drug-Drug interactions we were able to significantly improve the performance of non–Drug-Drug pairs in EHRs.
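The Results above are reported as micro-averaged F1. As a brief illustration of what that averaging means, the sketch below pools counts across relation types before computing F1 and contrasts it with the macro average. The relation names and counts are hypothetical and are not taken from the shared task.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn)


# Hypothetical (TP, FP, FN) counts per relation type.
counts = {
    "Strength-Drug": (95, 5, 5),
    "ADE-Drug": (30, 10, 20),
}

# Micro average: sum the counts over all relation types, then compute one F1.
total_tp = sum(tp for tp, _, _ in counts.values())
total_fp = sum(fp for _, fp, _ in counts.values())
total_fn = sum(fn for _, _, fn in counts.values())
micro_f1 = f1_from_counts(total_tp, total_fp, total_fn)

# Macro average: compute F1 per relation type, then take the unweighted mean.
macro_f1 = sum(f1_from_counts(*c) for c in counts.values()) / len(counts)

print(f"micro F1={micro_f1:.3f} macro F1={macro_f1:.3f}")
```

With these counts the micro average is higher than the macro average because the frequent, well-predicted relation type dominates the pooled counts.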
Affiliation(s)
- Fenia Christopoulou
- National Centre for Text Mining, School of Computer Science, The University of Manchester, Manchester, United Kingdom; Artificial Intelligence Research Centre, National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan
- Thy Thy Tran
- National Centre for Text Mining, School of Computer Science, The University of Manchester, Manchester, United Kingdom; Artificial Intelligence Research Centre, National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan
- Sunil Kumar Sahu
- National Centre for Text Mining, School of Computer Science, The University of Manchester, Manchester, United Kingdom
- Makoto Miwa
- Artificial Intelligence Research Centre, National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan; Toyota Technological Institute, Nagoya, Japan
- Sophia Ananiadou
- National Centre for Text Mining, School of Computer Science, The University of Manchester, Manchester, United Kingdom; Artificial Intelligence Research Centre, National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan

21
Nguyen T, Zhang T, Fox G, Zeng S, Cao N, Pan C, Chen JY. Linking clinotypes to phenotypes and genotypes from laboratory test results in comprehensive physical exams. BMC Med Inform Decis Mak 2021; 21:51. [PMID: 33627109 PMCID: PMC7903607 DOI: 10.1186/s12911-021-01387-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/11/2020] [Accepted: 01/06/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND In this work, we aimed to demonstrate how to utilize lab test results and other clinical information to support precision medicine research and clinical decisions on complex diseases, with the support of electronic medical record facilities. We defined "clinotypes" as clinical information that could be observed and measured objectively using biomedical instruments. From well-known 'omic' problem definitions, we defined problems using clinotype information, including stratifying patients (identifying sub-cohorts of interest for future studies), mining significant associations between clinotypes and specific phenotypes (diseases), and discovering potential linkages between clinotype and genomic information. We solved these problems by integrating public omic databases and applying advanced machine learning and visual analytic techniques to two-year health exam records from a large population of healthy southern Chinese individuals (n = 91,354). When developing the solution, we carefully addressed the issues of missing information, imbalance, and non-uniform data annotation. RESULTS We organized the techniques and solutions addressing the problems and issues above into the CPA (Clinotype Prediction and Association-finding) framework. At the data preprocessing step, we handled the missing value issue with a prediction accuracy of 0.760. We curated 12,635 clinotype-gene associations. We found 147 associations between 147 chronic disease phenotypes and clinotypes, which improved the disease predictive performance to an average AUC of 0.967. We mined 182 significant clinotype-clinotype associations among 69 clinotypes. CONCLUSIONS Our results showed strong potential connectivity between the omics information and the clinical lab test information. The results further emphasized the need to utilize and integrate clinical information, especially lab test results, in future PheWAS and omic studies.
Furthermore, they showed that clinotype information could initiate an alternative research direction and serve as an independent field of data to support the well-known 'phenome' and 'genome' research.
Affiliation(s)
- Thanh Nguyen
- Informatics Institute, School of Medicine, The University of Alabama at Birmingham, Birmingham, AL, USA
- Tongbin Zhang
- School of First Clinical Medical Sciences - School of Information and Engineering, Wenzhou Medical University, Zhejiang, China
- Department of Computer Technology and Information Management, The First Affiliated Hospital of Wenzhou Medical University, Zhejiang, China
- Geoffrey Fox
- School of Informatics, Computing, and Engineering, Indiana University, Bloomington, IN, USA
- Sisi Zeng
- School of First Clinical Medical Sciences - School of Information and Engineering, Wenzhou Medical University, Zhejiang, China
- Ni Cao
- School of First Clinical Medical Sciences - School of Information and Engineering, Wenzhou Medical University, Zhejiang, China
- Chuandi Pan
- School of First Clinical Medical Sciences - School of Information and Engineering, Wenzhou Medical University, Zhejiang, China
- Department of Computer Technology and Information Management, The First Affiliated Hospital of Wenzhou Medical University, Zhejiang, China
- Jake Y Chen
- Informatics Institute, School of Medicine, The University of Alabama at Birmingham, Birmingham, AL, USA.
|
22
|
Pandey B, Kumar Pandey D, Pratap Mishra B, Rhmann W. A comprehensive survey of deep learning in the field of medical imaging and medical natural language processing: Challenges and research directions. Journal of King Saud University - Computer and Information Sciences 2021. [DOI: 10.1016/j.jksuci.2021.01.007] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 10/22/2022]
|
23
|
Derington CG, Mueller SR, Glanz JM, Binswanger IA. Identifying naloxone administrations in electronic health record data using a text-mining tool. Subst Abus 2020; 42:806-812. [PMID: 33320803 PMCID: PMC8203755 DOI: 10.1080/08897077.2020.1856288] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/22/2022]
Abstract
Background: Effective and efficient methods are needed to identify naloxone administrations within electronic health record (EHR) data to conduct overdose surveillance and research. The objective of this study was to develop and validate a text-mining tool to identify naloxone administrations in EHR data. Methods: Clinical notes stored in databases between January 2017 and March 2018 were used to iteratively develop a text-mining tool to identify naloxone administrations. The first iteration of the tool used broad search terms. Then, after reviewing clinical notes of overdose encounters, we developed a list of phrases that described naloxone administrations to inform iteration two. While validating iteration two, additional phrases were found, which were then added to inform the final iteration. The comparator was an administrative code query extracted from the EHR. Medical record review was used to identify true positives. The primary outcome was the positive predictive values (PPV) of the second iteration, final iteration, and administrative code query. Results: Iteration two, the final iteration, and the administrative code had PPVs of 84.3% (95% confidence interval [CI] 78.6-89.0%), 83.8% (95% CI 78.6-88.2%), and 57.1% (95% CI 47.1-66.8%), respectively. Both iterations of the tool had a significantly higher PPV than the administrative code (p < 0.001). Conclusions: A text-mining tool improved the identification of naloxone administrations in EHR data from less than 60% with the administrative code to greater than 80% with both versions of the tool. Text-mining tools can inform the use of more sophisticated informatics methods, which often require significant time, resource, and expertise investment.
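The positive predictive value (PPV) reported in this abstract is the share of tool-flagged records that medical record review confirmed as true naloxone administrations. The sketch below shows one way such a figure and its interval could be computed. The counts are hypothetical, and the Wilson score interval is an assumption: the abstract does not say which interval method the authors used.

```python
import math


def ppv_with_wilson_ci(true_pos, false_pos, z=1.96):
    """Return (ppv, lower, upper) using a Wilson score interval.

    true_pos / false_pos are counts of flagged records that chart review
    confirmed / rejected. z=1.96 corresponds to a 95% interval.
    """
    n = true_pos + false_pos
    if n == 0:
        raise ValueError("no flagged records to evaluate")
    p = true_pos / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return p, centre - half_width, centre + half_width


# Hypothetical counts, for illustration only (not taken from the study).
ppv, lower, upper = ppv_with_wilson_ci(true_pos=80, false_pos=20)
print(f"PPV = {ppv:.1%} (95% CI {lower:.1%} to {upper:.1%})")
```

Other interval methods, such as the exact Clopper-Pearson interval, would give slightly different bounds for the same counts.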
Affiliation(s)
- Catherine G. Derington
- Department of Population Health Sciences, University of Utah, 295 Chipeta Way, Salt Lake City UT 84112
- Shane R. Mueller
- Institute for Health Research, Kaiser Permanente Colorado, 2550 S. Parker Road, Suite 200, Aurora CO 80014
- Jason M. Glanz
- Institute for Health Research, Kaiser Permanente Colorado, 2550 S. Parker Road, Suite 200, Aurora CO 80014
- Department of Epidemiology, Colorado School of Public Health, 13001 E 17 Place, Mail Stop B-119, Aurora CO 80045
- Ingrid A. Binswanger
- Institute for Health Research, Kaiser Permanente Colorado, 2550 S. Parker Road, Suite 200, Aurora CO 80014
- Colorado Permanente Medical Group, 10350 E. Dakota Ave, Denver CO 80247
- Division of General Internal Medicine, University of Colorado School of Medicine, 13001 E 17 Place, Aurora CO 80045
|
24
|
Lee EK, Uppal K. CERC: an interactive content extraction, recognition, and construction tool for clinical and biomedical text. BMC Med Inform Decis Mak 2020; 20:306. [PMID: 33323109 PMCID: PMC7739454 DOI: 10.1186/s12911-020-01330-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Accepted: 11/11/2020] [Indexed: 12/24/2022] Open
Abstract
BACKGROUND Automated summarization of scientific literature and patient records is essential for enhancing clinical decision-making and facilitating precision medicine. Most existing summarization methods are based on single indicators of relevance, offer limited capabilities for information visualization, and do not account for user-specific interests. In this work, we develop an interactive content extraction, recognition, and construction system (CERC) that combines machine learning and visualization techniques with domain knowledge for highlighting and extracting salient information from clinical and biomedical text. METHODS A novel sentence-ranking framework, multi-indicator text summarization (MINTS), is developed for extractive summarization. MINTS uses random forests and multiple indicators of importance for relevance evaluation and ranking of sentences. Indicative summarization is performed using weighted term frequency-inverse document frequency scores of over-represented domain-specific terms. A controlled vocabulary dictionary generated using MeSH, SNOMED-CT, and PubTator is used for determining relevant terms. 35 full-text CRAFT articles were used as the training set. The performance of the MINTS algorithm is evaluated on a test set consisting of the remaining 32 full-text CRAFT articles and 30 clinical case reports using the ROUGE toolkit. RESULTS The random forests model classified sentences as "good" or "bad" with 87.5% accuracy on the test set. Summarization results from the MINTS algorithm achieved higher ROUGE-1, ROUGE-2, and ROUGE-SU4 scores when compared to methods based on single indicators such as term frequency distribution, position, eigenvector centrality (LexRank), and random selection, p < 0.01.
The automatic language translator and the customizable information extraction and pre-processing pipeline for EHR demonstrate that CERC can readily be incorporated within clinical decision support systems to improve quality of care and assist in data-driven and evidence-based informed decision making for direct patient care. CONCLUSIONS We have developed a web-based summarization and visualization tool, CERC ( https://newton.isye.gatech.edu/CERC1/ ), for extracting salient information from clinical and biomedical text. The system ranks sentences by relevance and includes features that can facilitate early detection of medical risks in a clinical setting. The interactive interface allows users to filter content and edit/save summaries. The evaluation results on two test corpora show that the newly developed MINTS algorithm outperforms methods based on single characteristics of importance.
Affiliation(s)
- Eva K Lee
- Center for Operations Research in Medicine and HealthCare, School of Industrial and Systems Engineering, School of Biological Sciences, Georgia Institute of Technology, Atlanta, USA.
- Karan Uppal
- School of Medicine, Emory University, Atlanta, GA, USA
|
25
|
Routray R, Tetarenko N, Abu-Assal C, Mockute R, Assuncao B, Chen H, Bao S, Danysz K, Desai S, Cicirello S, Willis V, Alford SH, Krishnamurthy V, Mingle E. Application of Augmented Intelligence for Pharmacovigilance Case Seriousness Determination. Drug Saf 2020; 43:57-66. [PMID: 31605285 PMCID: PMC6965337 DOI: 10.1007/s40264-019-00869-4] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Indexed: 01/17/2023]
Abstract
INTRODUCTION Identification of adverse events and determination of their seriousness ensures timely detection of potential patient safety concerns. Adverse event seriousness is a key factor in defining reporting timelines and is often performed manually by pharmacovigilance experts. The dramatic increase in the volume of safety reports necessitates exploration of scalable solutions that also meet reporting timeline requirements. OBJECTIVE The aim of this study was to develop an augmented intelligence methodology for automatically identifying adverse event seriousness in spontaneous, solicited, and medical literature safety reports. Deep learning models were evaluated for accuracy and/or the F1 score against a ground truth labeled by pharmacovigilance experts. METHODS Using a stratified random sample of safety reports received by Celgene, we developed three neural networks for addressing identification of adverse event seriousness: (1) a binary adverse-event level seriousness classifier; (2) a classifier for determining seriousness categorization at the adverse-event level; and (3) an annotator for identifying seriousness criteria terms to provide supporting evidence at the document level. RESULTS The seriousness classifier achieved an accuracy of 83.0% in post-marketing reports, 92.9% in solicited reports, and 86.3% in medical literature reports. F1 scores for seriousness categorization were 77.7 for death, 78.9 for hospitalization, and 75.5 for important medical events. The seriousness annotator achieved an F1 score of 89.9 in solicited reports, and 75.2 in medical literature reports. CONCLUSIONS The results of this study indicate that a neural network approach can provide an accurate and scalable solution for potentially augmenting pharmacovigilance practitioner determination of adverse event seriousness in spontaneous, solicited, and medical literature reports.
|
26
|
Crowson MG, Hamour A, Lin V, Chen JM, Chan TCY. Machine learning for pattern detection in cochlear implant FDA adverse event reports. Cochlear Implants Int 2020; 21:313-322. [DOI: 10.1080/14670100.2020.1784569] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 02/06/2023]
Affiliation(s)
- Matthew G. Crowson
- Department of Otolaryngology-HNS, Sunnybrook Health Sciences Center, University of Toronto, Toronto, Ontario
- Department of Mechanical & Industrial Engineering, University of Toronto, Toronto, Ontario
- Amr Hamour
- Department of Otolaryngology-HNS, Sunnybrook Health Sciences Center, University of Toronto, Toronto, Ontario
- Vincent Lin
- Department of Otolaryngology-HNS, Sunnybrook Health Sciences Center, University of Toronto, Toronto, Ontario
- Joseph M. Chen
- Department of Otolaryngology-HNS, Sunnybrook Health Sciences Center, University of Toronto, Toronto, Ontario
- Timothy C. Y. Chan
- Department of Mechanical & Industrial Engineering, University of Toronto, Toronto, Ontario
|
27
|
Eskildsen NK, Eriksson R, Christensen SB, Aghassipour TS, Bygsø MJ, Brunak S, Hansen SL. Implementation and comparison of two text mining methods with a standard pharmacovigilance method for signal detection of medication errors. BMC Med Inform Decis Mak 2020; 20:94. [PMID: 32448248 PMCID: PMC7245808 DOI: 10.1186/s12911-020-1097-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 08/01/2019] [Accepted: 04/21/2020] [Indexed: 11/16/2022] Open
Abstract
Background Medication errors have been identified as the most common preventable cause of adverse events. The lack of granularity in medication error terminology has led pharmacovigilance experts to rely on information in individual case safety reports’ (ICSRs) codes and narratives for signal detection, which is both time consuming and labour intensive. Thus, there is a need for complementary methods for the detection of medication errors from ICSRs. The aim of this study is to evaluate the utility of two natural language processing text mining methods as complementary tools to the traditional approach followed by pharmacovigilance experts for medication error signal detection. Methods The safety surveillance advisor (SSA) method, I2E text mining and University of Copenhagen Center for Protein Research (CPR) text mining were evaluated for their ability to extract cases containing a type of medication error where patients extracted insulin from a prefilled pen or cartridge with a syringe. A total of 154,209 ICSRs were retrieved from Novo Nordisk’s safety database from January 1987 to February 2018. Each method was evaluated by recall (sensitivity) and precision (positive predictive value). Results We manually annotated 2533 ICSRs to investigate whether these contained the sought medication error. All these ICSRs were then analysed using the three methods. The recall was 90.4, 88.1 and 78.5% for the CPR text mining, the SSA method and the I2E text mining, respectively. Precision was low for all three methods, ranging from 3.4% for the SSA method to 1.9 and 1.6% for the CPR and I2E text mining methods, respectively. Conclusions Text mining methods can, with advantage, be used for the detection of complex signals relying on information found in unstructured text (e.g., ICSR narratives), as standardised methods that are both less labour-intensive and less time-consuming than traditional pharmacovigilance methods.
The employment of text mining in pharmacovigilance need not be limited to the surveillance of potential medication errors but can be used for the ongoing regulatory requests, e.g., obligations in risk management plans and may thus be utilised broadly for signal detection and ongoing surveillance activities.
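The abstract above evaluates each method by recall (sensitivity) and precision (positive predictive value). The following is a minimal sketch of those two standard definitions. The counts are hypothetical and chosen only to illustrate the pattern the abstract describes, where a method finds most true cases while flagging many false ones; the function name is likewise illustrative.

```python
def recall_and_precision(true_pos, false_pos, false_neg):
    """Compute recall and precision from confusion-matrix counts.

    recall    = TP / (TP + FN): share of real cases the method retrieved
    precision = TP / (TP + FP): share of retrieved cases that were real
    """
    relevant = true_pos + false_neg
    retrieved = true_pos + false_pos
    recall = true_pos / relevant if relevant else 0.0
    precision = true_pos / retrieved if retrieved else 0.0
    return recall, precision


# Hypothetical counts, not taken from the study.
recall, precision = recall_and_precision(true_pos=9, false_pos=291, false_neg=1)
print(f"recall = {recall:.1%}, precision = {precision:.1%}")
```

A rare target event makes this combination common: even a small false-positive rate over a large number of reports can outnumber the few true cases.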
Affiliation(s)
- Nadine Kadi Eskildsen
- Department of Safety Surveillance, Global Safety, Novo Nordisk A/S, Bagsværd, Denmark
- Robert Eriksson
- Disease Systems Biology Program, Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Copenhagen, Denmark.
- Mikael Juul Bygsø
- Department of Safety Surveillance, Global Safety, Novo Nordisk A/S, Bagsværd, Denmark
- Søren Brunak
- Disease Systems Biology Program, Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Copenhagen, Denmark
- Suzanne Lisbet Hansen
- Department of Safety Surveillance, Global Safety, Novo Nordisk A/S, Bagsværd, Denmark.
|
28
|
Zhang Y, Cui S, Gao H. Adverse drug reaction detection on social media with deep linguistic features. J Biomed Inform 2020; 106:103437. [PMID: 32360987 DOI: 10.1016/j.jbi.2020.103437] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 11/16/2019] [Revised: 04/02/2020] [Accepted: 04/26/2020] [Indexed: 11/26/2022]
Abstract
Adverse reactions caused by drugs are one of the most important public health problems. Social media has encouraged more patients to share their drug use experiences and has become a major source for the detection of professionally unreported adverse drug reactions (ADRs). Since a large number of user posts do not mention any ADR, accurate detection of the presence of ADRs in each user post is necessary before further research can be conducted. Previous feature-based methods focus on extracting more shallow linguistic features that are unable to capture deep and subtle information in the context, ultimately failing to provide satisfactory accuracy. To overcome the limitations of previous studies, this paper proposes a novel method that can extract deep linguistic features and then combine them with shallow linguistic features for ADR detection. We first extract predicate-ADR pairs under the guidance of extended syntactic dependencies and ADR lexicon. Then, we extract semantic and part-of-speech (POS) features for each pair and pool the features of different pairs to generate a holistic representation of deep linguistic features. Finally, we use the collection of deep features and several shallow features to train the predictive models. A series of experiments are performed on data sets collected from DailyStrength and Twitter. Our approach can achieve AUCs of 94.44% and 88.97% on the two data sets, respectively, outperforming other state-of-the-art methods. The results demonstrate the potential benefits of deep linguistic features for ADR detection on social data. This method can be applied to multiple other healthcare and text analysis tasks and can be used to support pharmacovigilance research.
Affiliation(s)
- Ying Zhang
- School of Management and Economics, Beijing Institute of Technology, Beijing 100081, China; School of Business, University of Jinan, Jinan 250022, China.
- Shaoze Cui
- School of Economics and Management, Dalian University of Technology, Dalian 116023, China.
- Huiying Gao
- School of Management and Economics, Beijing Institute of Technology, Beijing 100081, China.
|
29
|
Koleck TA, Dreisbach C, Bourne PE, Bakken S. Natural language processing of symptoms documented in free-text narratives of electronic health records: a systematic review. J Am Med Inform Assoc 2020; 26:364-379. [PMID: 30726935 DOI: 10.1093/jamia/ocy173] [Citation(s) in RCA: 200] [Impact Index Per Article: 50.0] [Received: 06/29/2018] [Revised: 11/20/2018] [Accepted: 11/27/2018] [Indexed: 12/26/2022] Open
Abstract
OBJECTIVE Natural language processing (NLP) of symptoms from electronic health records (EHRs) could contribute to the advancement of symptom science. We aim to synthesize the literature on the use of NLP to process or analyze symptom information documented in EHR free-text narratives. MATERIALS AND METHODS Our search of 1964 records from PubMed and EMBASE was narrowed to 27 eligible articles. Data related to the purpose, free-text corpus, patients, symptoms, NLP methodology, evaluation metrics, and quality indicators were extracted for each study. RESULTS Symptom-related information was presented as a primary outcome in 14 studies. EHR narratives represented various inpatient and outpatient clinical specialties, with general, cardiology, and mental health occurring most frequently. Studies encompassed a wide variety of symptoms, including shortness of breath, pain, nausea, dizziness, disturbed sleep, constipation, and depressed mood. NLP approaches included previously developed NLP tools, classification methods, and manually curated rule-based processing. Only one-third (n = 9) of studies reported patient demographic characteristics. DISCUSSION NLP is used to extract information from EHR free-text narratives written by a variety of healthcare providers on an expansive range of symptoms across diverse clinical specialties. The current focus of this field is on the development of methods to extract symptom information and the use of symptom information for disease classification tasks rather than the examination of symptoms themselves. CONCLUSION Future NLP studies should concentrate on the investigation of symptoms and symptom documentation in EHR free-text narratives. Efforts should be undertaken to examine patient characteristics and make symptom-related NLP algorithms or pipelines and vocabularies openly available.
Affiliation(s)
- Caitlin Dreisbach
- School of Nursing, University of Virginia, Charlottesville, Virginia, USA; Data Science Institute, University of Virginia, Charlottesville, Virginia, USA
- Philip E Bourne
- Data Science Institute, University of Virginia, Charlottesville, Virginia, USA
- Suzanne Bakken
- School of Nursing, Columbia University, New York, New York, USA; Department of Biomedical Informatics, Columbia University, New York, New York, USA; Data Science Institute, Columbia University, New York, New York, USA
|
30
|
Chung AE, Shoenbill K, Mitchell SA, Dueck AC, Schrag D, Bruner DW, Minasian LM, St Germain D, O'Mara AM, Baumgartner P, Rogak LJ, Abernethy AP, Griffin AC, Basch EM. Patient free text reporting of symptomatic adverse events in cancer clinical research using the National Cancer Institute's Patient-Reported Outcomes version of the Common Terminology Criteria for Adverse Events (PRO-CTCAE). J Am Med Inform Assoc 2020; 26:276-285. [PMID: 30840079 DOI: 10.1093/jamia/ocy169] [Citation(s) in RCA: 40] [Impact Index Per Article: 10.0] [Received: 05/03/2018] [Revised: 10/17/2018] [Accepted: 11/26/2018] [Indexed: 11/14/2022] Open
Abstract
OBJECTIVE The study sought to describe patient-entered supplemental information on symptomatic adverse events (AEs) in cancer clinical research reported via a National Cancer Institute software system and examine the feasibility of mapping these entries to established terminologies. MATERIALS AND METHODS Patients in 3 multicenter trials electronically completed surveys during cancer treatment. Each survey included a prespecified subset of items from the National Cancer Institute's Patient-Reported Outcomes version of the Common Terminology Criteria for Adverse Events (PRO-CTCAE). Upon completion of the survey items, patients could add supplemental symptomatic AE information in a free text box. As patients typed into the box, structured dropdown terms could be selected from the PRO-CTCAE item library or Medical Dictionary for Regulatory Activities (MedDRA), or patients could type unstructured free text for submission. RESULTS Data were pooled from 1760 participants (48% women; 78% White) who completed 8892 surveys, of which 2387 (26.8%) included supplemental symptomatic AE information. Overall, 1024 (58%) patients entered supplemental information at least once, with an average of 2.3 per patient per study. This encompassed 1474 of 8892 (16.6%) dropdowns and 913 of 8892 (10.3%) unstructured free text entries. One-third of the unstructured free text entries (32%) could be mapped post hoc to a PRO-CTCAE term and 68% to a MedDRA term. DISCUSSION Participants frequently added supplemental information beyond study-specific survey items. Almost half selected a structured dropdown term, although many opted to submit unstructured free text entries. Most free text entries could be mapped post hoc to PRO-CTCAE or MedDRA terms, suggesting opportunities to enhance the system to perform real-time mapping for AE reporting. CONCLUSIONS Patient reporting of symptomatic AEs using a text box functionality with mapping to existing terminologies is both feasible and informative.
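The proportions in the RESULTS paragraph above can be re-derived from the counts the abstract itself reports. The short check below does that arithmetic. The variable and function names are illustrative; the numbers are taken directly from the abstract.

```python
# Counts as reported in the abstract.
participants = 1760
participants_with_supplemental = 1024
surveys_total = 8892
surveys_with_supplemental = 2387
dropdown_entries = 1474
free_text_entries = 913


def pct(part, whole):
    """Percentage of whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Dropdown and free-text entries together account for every survey that
# carried supplemental information.
assert dropdown_entries + free_text_entries == surveys_with_supplemental

print(pct(surveys_with_supplemental, surveys_total))      # share of surveys with supplemental information
print(pct(dropdown_entries, surveys_total))               # share of surveys with a structured dropdown entry
print(pct(free_text_entries, surveys_total))              # share of surveys with unstructured free text
print(pct(participants_with_supplemental, participants))  # share of patients who entered any
```

The two entry types sum exactly to the 2387 surveys with supplemental information, and the percentages agree with the 26.8%, 16.6%, 10.3%, and 58% figures in the abstract.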
Affiliation(s)
- Arlene E Chung
- Department of Medicine, University of North Carolina School of Medicine, Chapel Hill, North Carolina, USA; Program on Health and Clinical Informatics, University of North Carolina School of Medicine, Chapel Hill, North Carolina, USA; Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
- Kimberly Shoenbill
- Program on Health and Clinical Informatics, University of North Carolina School of Medicine, Chapel Hill, North Carolina, USA; Department of Family Medicine, University of North Carolina School of Medicine, Chapel Hill, North Carolina, USA
- Amylou C Dueck
- Alliance Statistics and Data Center, Mayo Clinic, Scottsdale, Arizona, USA
- Deborah Schrag
- Division of Population Sciences, Department of Medical Oncology, Dana-Farber/Harvard Cancer Center, Brookline, Massachusetts, USA
- Deborah W Bruner
- Nell Hodgson Woodruff School of Nursing, Winship Cancer Institute, Emory University, Atlanta, Georgia, USA
- Ann M O'Mara
- National Cancer Institute, Rockville, Maryland, USA
- Lauren J Rogak
- Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, New York, USA
- Amy P Abernethy
- Department of Medicine, Duke Cancer Institute, Durham, North Carolina, USA; Flatiron Health, New York, New York, USA
- Ashley C Griffin
- Program on Health and Clinical Informatics, University of North Carolina School of Medicine, Chapel Hill, North Carolina, USA
- Ethan M Basch
- Department of Medicine, University of North Carolina School of Medicine, Chapel Hill, North Carolina, USA; Program on Health and Clinical Informatics, University of North Carolina School of Medicine, Chapel Hill, North Carolina, USA; Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA; Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, New York, USA
|
31
|
Dang TT, Nguyen TH, Ho TB. Causality Assessment of Adverse Drug Reaction: Controlling Confounding Induced by Polypharmacy. Curr Pharm Des 2020; 25:1134-1143. [PMID: 31038058 DOI: 10.2174/1381612825666190416115714] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/04/2019] [Accepted: 04/01/2019] [Indexed: 11/22/2022]
Abstract
BACKGROUND Post-marketing pharmaceutical surveillance, also known as pragmatic clinical trials (PCT), plays a vital role in preventing accidents in practical treatment. The most important and difficult task in PCT is to assess, from clinical texts, which drug causes adverse drug reactions (ADRs). Confounding (i.e., factors that confuse causality assessment) arises from polypharmacy (i.e., the use of multiple drugs), which makes most existing methods perform poorly at detecting the drugs capable of causing observed ADRs. OBJECTIVE We aim to improve the performance of detecting drug-ADR causal relations from clinical texts. To this end, a mechanism for reducing the impact of confounding on the detection process is needed. METHODS We proposed a novel model, called analogy-based active voting (AAV), for improving the detection of causal drug-ADR pairs when multiple drugs are prescribed to treat comorbidity. This model is inspired by the analogy principle proposed by Bradford Hill. RESULTS The experimental results show improved recognition of causal relations between drugs and ADRs that are confirmed by SIDER. In addition, the proposed model shows promise for detecting infrequently observed causal drug-ADR pairs when the drug is not commonly used. CONCLUSION The proposed model demonstrates its ability to control polypharmacy-induced confounding and thereby improve the quality of causality assessment of ADRs. Additionally, this shows that the analogy principle is applicable to the assessment.
Affiliation(s)
- Tran-Thai Dang
- Japan Advanced Institute of Science and Technology, 1 Chrome-1 Asahidai, Nomi, Ishikawa 92312211, Japan
- Tu-Bao Ho
- Japan Advanced Institute of Science and Technology, 1 Chrome-1 Asahidai, Nomi, Ishikawa 92312211, Japan; John von Neumann Institute, VNU-HCM, Phuong Linh Trung, Thu Duc, Ho Chi Minh City, Vietnam; Vietnam Institute for Advanced Study in Mathematics, Hanoi, Vietnam
|
32
|
Bielinski SJ, St Sauver JL, Olson JE, Larson NB, Black JL, Scherer SE, Bernard ME, Boerwinkle E, Borah BJ, Caraballo PJ, Curry TB, Doddapaneni H, Formea CM, Freimuth RR, Gibbs RA, Giri J, Hathcock MA, Hu J, Jacobson DJ, Jones LA, Kalla S, Koep TH, Korchina V, Kovar CL, Lee S, Liu H, Matey ET, McGree ME, McAllister TM, Moyer AM, Muzny DM, Nicholson WT, Oyen LJ, Qin X, Raj R, Roger VL, Rohrer Vitek CR, Ross JL, Sharp RR, Takahashi PY, Venner E, Walker K, Wang L, Wang Q, Wright JA, Wu TJ, Wang L, Weinshilboum RM. Cohort Profile: The Right Drug, Right Dose, Right Time: Using Genomic Data to Individualize Treatment Protocol (RIGHT Protocol). Int J Epidemiol 2020; 49:23-24k. [PMID: 31378813 PMCID: PMC7124480 DOI: 10.1093/ije/dyz123] [Citation(s) in RCA: 32] [Impact Index Per Article: 8.0] [Accepted: 05/31/2019] [Indexed: 12/29/2022] Open
Affiliation(s)
- Suzette J Bielinski
- Division of Epidemiology, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Jennifer L St Sauver
- Division of Epidemiology, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Robert D and Patricia E Kern Center for the Science of Health Care Delivery, Mayo Clinic, Rochester, MN, USA
- Janet E Olson
- Division of Epidemiology, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Center for Individualized Medicine, Mayo Clinic, Rochester, MN, USA
- Nicholas B Larson
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- John L Black
- Center for Individualized Medicine, Mayo Clinic, Rochester, MN, USA
- Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, MN, USA
- Steven E Scherer
- Human Genome Sequencing Center, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
- Eric Boerwinkle
- Human Genome Sequencing Center, Department of Molecular and Human Genetics, Baylor College of Medicine, School of Public Health, University of Texas Health Science Center at Houston, Houston, TX, USA
- Bijan J Borah
- Robert D and Patricia E Kern Center for the Science of Health Care Delivery, Mayo Clinic, Rochester, MN, USA
- Division of Health Care Policy and Research, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Pedro J Caraballo
- Division of General Internal Medicine, Department of Medicine, Mayo Clinic, Rochester, MN, USA
- Timothy B Curry
- Center for Individualized Medicine, Mayo Clinic, Rochester, MN, USA
- Department of Anesthesia and Perioperative Medicine, Mayo Clinic, Rochester, MN, USA
- Robert R Freimuth
- Division of Digital Health Sciences, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Richard A Gibbs
- Human Genome Sequencing Center, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
- Jyothsna Giri
- Center for Individualized Medicine, Mayo Clinic, Rochester, MN, USA
- Matthew A Hathcock
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Jianhong Hu
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
- Debra J Jacobson
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
| | - Leila A Jones
- Center for Individualized Medicine, Mayo Clinic, Rochester, MN, USA
| | - Sara Kalla
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
| | | | - Viktoriya Korchina
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
| | - Christie L Kovar
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
| | - Sandra Lee
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
| | - Hongfang Liu
- Division of Digital Health Sciences, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
| | - Eric T Matey
- Center for Individualized Medicine, Mayo Clinic, Rochester, MN, USA
- Department of Pharmacy, Mayo Clinic, Rochester, MN, USA
| | - Michaela E McGree
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
| | | | - Ann M Moyer
- Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, MN, USA
| | - Donna M Muzny
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
| | - Wayne T Nicholson
- Department of Anesthesia and Perioperative Medicine, Mayo Clinic, Rochester, MN, USA
| | - Lance J Oyen
- Department of Pharmacy, Mayo Clinic, Rochester, MN, USA
| | - Xiang Qin
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
| | - Ritika Raj
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
| | - Véronique L Roger
- Division of Epidemiology, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Division of Cardiovascular Diseases, Department of Internal Medicine, Mayo Clinic, Rochester, MN, USA
| | | | | | - Richard R Sharp
- Center for Individualized Medicine, Mayo Clinic, Rochester, MN, USA
- Division of Health Care Policy and Research, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
| | - Paul Y Takahashi
- Division of Community Internal Medicine, Department of Medicine, Mayo Clinic, Rochester, MN, USA
| | - Eric Venner
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
| | - Kimberly Walker
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
| | - Liwei Wang
- Division of Digital Health Sciences, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
| | - Qiaoyan Wang
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
| | - Jessica A Wright
- Center for Individualized Medicine, Mayo Clinic, Rochester, MN, USA
- Department of Pharmacy, Mayo Clinic, Rochester, MN, USA
| | - Tsung-Jung Wu
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
| | - Liewei Wang
- Center for Individualized Medicine, Mayo Clinic, Rochester, MN, USA
- Division of Clinical Pharmacology, Department of Molecular Pharmacology and Experimental Therapeutics, Mayo Clinic, Rochester, MN, USA
| | - Richard M Weinshilboum
- Center for Individualized Medicine, Mayo Clinic, Rochester, MN, USA
- Division of Clinical Pharmacology, Department of Molecular Pharmacology and Experimental Therapeutics, Mayo Clinic, Rochester, MN, USA
| |
|
33
|
Mohammadhassanzadeh H, Sketris I, Traynor R, Alexander S, Winquist B, Stewart SA. Using Natural Language Processing to Examine the Uptake, Content, and Readability of Media Coverage of a Pan-Canadian Drug Safety Research Project: Cross-Sectional Observational Study. JMIR Form Res 2020; 4:e13296. [PMID: 31934872 PMCID: PMC6996767 DOI: 10.2196/13296] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 01/03/2019] [Revised: 07/11/2019] [Accepted: 09/26/2019] [Indexed: 11/18/2022] Open
Abstract
Background Isotretinoin, for treating cystic acne, increases the risk of miscarriage and fetal abnormalities when taken during pregnancy. The Health Canada–approved product monograph for isotretinoin includes pregnancy prevention guidelines. A recent study by the Canadian Network for Observational Drug Effect Studies (CNODES) on the occurrence of pregnancy and pregnancy outcomes during isotretinoin therapy estimated poor adherence to these guidelines. Media uptake of this study was unknown; awareness of this uptake could help improve drug safety communication. Objective The aim of this study was to understand how the media present pharmacoepidemiological research using the CNODES isotretinoin study as a case study. Methods Google News was searched (April 25-May 6, 2016), using a predefined set of terms, for mention of the CNODES study. In total, 26 articles and 3 CNODES publications (original article, press release, and podcast) were identified. The article texts were cleaned (eg, advertisements and links removed), and the podcast was transcribed. A dictionary of 1295 unique words was created using natural language processing (NLP) techniques (term frequency-inverse document frequency, Porter stemming, and stop-word filtering) to identify common words and phrases. Similarity between the articles and reference publications was calculated using Euclidean distance; articles were grouped using hierarchical agglomerative clustering. Nine readability scales were applied to measure text readability based on factors such as number of words, difficult words, syllables, sentence counts, and other textual metrics. Results The top 5 dictionary words were pregnancy (250 appearances), isotretinoin (220), study (209), drug (201), and women (185).
Three distinct clusters were identified: Clusters 2 (5 articles) and 3 (4 articles) were from health-related websites and media, respectively; Cluster 1 (18 articles) contained largely media sources; 2 articles fell outside these clusters. Use of the term isotretinoin versus Accutane (a brand name of isotretinoin), discussion of pregnancy complications, and assignment of responsibility for guideline adherence varied between clusters. For example, the term pregnanc appeared most often in Clusters 1 (14.6 average times per article) and 2 (11.4) and relatively infrequently in Cluster 3 (1.8). Average readability for all articles was high (eg, Flesch-Kincaid, 13; Gunning Fog, 15; SMOG Index, 10; Coleman Liau Index, 15; Linsear Write Index, 13; and Text Standard, 13). Readability increased from Cluster 2 (Gunning Fog of 16.9) to 3 (12.2). It varied between clusters (average 13th-15th grade) but exceeded the recommended health information reading level (grade 6th to 8th), overall. Conclusions Media interpretation of the CNODES study varied, with differences in synonym usage and areas of focus. All articles were written above the recommended health information reading level. Analyzing media using NLP techniques can help determine drug safety communication effectiveness. This project is important for understanding how drug safety studies are taken up and redistributed in the media.
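The similarity step described in the Methods above (TF-IDF weighting followed by Euclidean distance) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the documents are invented toy token lists, Porter stemming and stop-word filtering are omitted, and the function names `tfidf_vectors` and `euclidean` are hypothetical. An agglomerative clustering procedure would repeatedly merge the closest items under a distance of this kind.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, vectors) of TF-IDF weights for tokenized docs."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Term frequency (relative count) times inverse document frequency.
        vectors.append([
            (counts[t] / len(doc)) * math.log(n_docs / df[t]) for t in vocab
        ])
    return vocab, vectors

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Toy documents (invented for illustration; not data from the study).
docs = [
    "isotretinoin pregnancy study women".split(),
    "isotretinoin pregnancy study drug".split(),
    "acne cream skin care".split(),
]
vocab, vecs = tfidf_vectors(docs)
d01 = euclidean(vecs[0], vecs[1])  # two documents sharing most terms
d02 = euclidean(vecs[0], vecs[2])  # two documents sharing no terms
print(d01 < d02)
```

With these toy inputs the two documents that share three of four terms end up closer than the pair with no shared terms, which is the property the clustering step relies on.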
|
34
|
Krzhizhanovskaya VV, Závodszky G, Lees MH, Dongarra JJ, Sloot PMA, Brissos S, Teixeira J. Applicability of Machine Learning Methods to Multi-label Medical Text Classification. LECTURE NOTES IN COMPUTER SCIENCE 2020. [PMCID: PMC7303696 DOI: 10.1007/978-3-030-50423-6_38] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/25/2022]
Abstract
Structuring medical text using international standards improves interoperability and the quality of predictive modelling. The medical text classification task facilitates information extraction. In this work we investigate the applicability of several machine learning models and classifier chains (CC) to the classification of unstructured medical text. The experimental study was performed on a corpus of 11671 manually labeled Russian medical notes. The results showed that using the CC strategy improves classification performance. An ensemble of classifier chains based on linear SVC showed the best result: 0.924 micro F-measure, 0.872 micro precision and 0.927 micro recall.
|
35
|
Natural Language Processing Combined with ICD-9-CM Codes as a Novel Method to Study the Epidemiology of Allergic Drug Reactions. THE JOURNAL OF ALLERGY AND CLINICAL IMMUNOLOGY-IN PRACTICE 2019; 8:1032-1038.e1. [PMID: 31857264 DOI: 10.1016/j.jaip.2019.12.007] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 05/24/2019] [Revised: 11/25/2019] [Accepted: 12/02/2019] [Indexed: 11/20/2022]
Abstract
BACKGROUND Allergic drug reaction epidemiologic data are sparse because it remains difficult to identify true cases in large data sets using manual chart review. OBJECTIVE To develop and validate a novel informatics method based on natural language processing (NLP) in combination with International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) codes that identifies allergic drug reactions in the electronic health record. METHODS Previously studied and high-yield ICD-9-CM codes were used to screen for possible allergic drug reactions among all inpatients admitted in 2007 and 2008. A random sample was selected for manual chart review to identify true cases of allergic drug reactions. A rule-based NLP algorithm was then developed to identify allergic drug reactions using free-text clinical notes and discharge summaries from the filtered cases. The performance of using manual chart review of ICD-9-CM codes alone was compared with ICD-9-CM codes in combination with NLP. RESULTS Of 3907 cases identified by ICD-9-CM codes, 725 (19%) were randomly selected for manual chart review; 335 were confirmed as allergic drug reactions, resulting in a positive predictive value (PPV) of 46% (range: 18%-79%) when using ICD-9-CM codes alone. Our NLP algorithm in combination with ICD-9-CM codes achieved a PPV of 86% (range: 69%-100%). Among the 335 confirmed positive cases, NLP identified 259 true cases, resulting in a recall/sensitivity of 77% (range: 26%-100%). Among the 390 negative cases, NLP achieved a specificity of 89% (range: 69%-100%). CONCLUSION Using NLP with ICD-9-CM codes improved identification of allergic drug reactions. The resulting decrease in manual chart review effort will facilitate large epidemiology studies of this understudied area.
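The headline figures in the Results above follow from the reported counts, and the arithmetic can be checked with a short sketch. One value is an assumption: the abstract states specificity as 89% but not the underlying count, so the number of true negatives is back-calculated here by rounding 0.89 × 390. The variable names are illustrative only.

```python
# Counts reported in the abstract.
reviewed = 725          # cases sampled for manual chart review
confirmed = 335         # confirmed allergic drug reactions
negatives = reviewed - confirmed  # 390 negative cases
nlp_true_pos = 259      # confirmed cases identified by NLP

# PPV of ICD-9-CM codes alone: confirmed cases among reviewed cases.
ppv_codes_alone = confirmed / reviewed

# Recall/sensitivity of NLP among the confirmed positives.
sensitivity = nlp_true_pos / confirmed

# ASSUMPTION: true negatives are not reported; derive them from the
# stated 89% specificity over the 390 negative cases.
nlp_true_neg = round(0.89 * negatives)
nlp_false_pos = negatives - nlp_true_neg

# PPV of ICD-9-CM codes combined with NLP.
ppv_with_nlp = nlp_true_pos / (nlp_true_pos + nlp_false_pos)

print(f"PPV, codes alone:  {ppv_codes_alone:.1%}")
print(f"Sensitivity, NLP:  {sensitivity:.1%}")
print(f"PPV, codes + NLP:  {ppv_with_nlp:.1%}")
```

Under that assumption the derived PPV for codes plus NLP rounds to the 86% given in the abstract, so the reported sensitivity, specificity, and PPV are mutually consistent.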
|
36
|
Gefen D, Ben-Assuli O, Shlomo N, Robertson N, Klempfner R. A case study of applying text analysis to identify possible adverse drug interactions: The case of Adalat (Nifedipine). Health Informatics J 2019; 26:1455-1464. [PMID: 31635509 DOI: 10.1177/1460458219882269] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
Adalat (Nifedipine) is a calcium-channel blocker that is also used as an antihypertensive drug. The drug was approved by the US Food and Drug Administration in 1985 but was discontinued in 1996 on account, among other things, of interactions with other medications. Nonetheless, Adalat is still used in other countries to treat congestive heart failure. We examine all the congestive heart failure electronic health records of the largest medical center in Israel to discover whether, possibly, taking Adalat with other medications is associated with patient death. This study examines a semantic space built by running latent semantic analysis on the entire corpus of congestive heart failure electronic health records of that medical center, encompassing 8 years of data on almost 12,000 patients. Through this semantic space, the most highly correlated medications and medical conditions that co-occurred with Adalat were identified. This was done separately for men and women. The results show that Adalat is correlated with different medications and conditions across genders. The data also suggest that taking Adalat with Captopril (angiotensin-converting enzyme inhibitor) or Rulid (antibiotic) might be dangerous in both genders. The study thus demonstrates the potential of applying latent semantic analysis to identify potentially dangerous drug interactions that may have otherwise gone under the radar.
|
37
|
Wang Y, Fan X, Chen L, Chang EIC, Ananiadou S, Tsujii J, Xu Y. Mapping anatomical related entities to human body parts based on wikipedia in discharge summaries. BMC Bioinformatics 2019; 20:430. [PMID: 31419946 PMCID: PMC6697955 DOI: 10.1186/s12859-019-3005-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2019] [Accepted: 07/23/2019] [Indexed: 11/16/2022] Open
Abstract
Background Consisting of dictated free-text documents such as discharge summaries, medical narratives are widely used in medical natural language processing. Relationships between anatomical entities and human body parts are crucial for building medical text mining applications. To achieve this, we establish a mapping system consisting of a Wikipedia-based scoring algorithm and a named entity normalization method (NEN). The mapping system makes full use of information available on Wikipedia, which is a comprehensive Internet medical knowledge base. We also built a new ontology, Tree of Human Body Parts (THBP), from core anatomical parts by referring to anatomical experts and the Unified Medical Language System (UMLS) to make the mapping system efficacious for clinical treatments. Result The gold standard is derived from 50 discharge summaries from our previous work, in which 2,224 anatomical entities are included. The F1-measure of the baseline system is 70.20%, while our algorithm based on Wikipedia achieves 86.67% with the assistance of NEN. Conclusions We construct a framework to map anatomical entities to THBP ontology using normalization and a scoring algorithm based on Wikipedia. The proposed framework is proven to be much more effective and efficient than the main baseline system.
Affiliation(s)
- Yipei Wang
- State Key Laboratory of Software Development Environment and Key Laboratory of Biomechanics and Mechanobiology of Ministry of Education and Research Institute of Beihang University in Shenzhen, Beijing Advanced Innovation Center for Biomedical Engineering, Beihang University, Xueyuan Road No.37, Beijing, 100191 China
| | - Xingyu Fan
- Bioengineering College of Chongqing University, Shazheng Street No. 174, Chongqing, 400044 China
| | - Luoxin Chen
- State Key Laboratory of Software Development Environment and Key Laboratory of Biomechanics and Mechanobiology of Ministry of Education and Research Institute of Beihang University in Shenzhen, Beijing Advanced Innovation Center for Biomedical Engineering, Beihang University, Xueyuan Road No.37, Beijing, 100191 China
| | | | - Sophia Ananiadou
- The National Centre for Text Mining, School of Computer Science, The University of Manchester, Manchester, UK
| | - Junichi Tsujii
- The National Centre for Text Mining, School of Computer Science, The University of Manchester, Manchester, UK
- Artificial Intelligence Research Center (AIRC), Tokyo, Japan
| | - Yan Xu
- State Key Laboratory of Software Development Environment and Key Laboratory of Biomechanics and Mechanobiology of Ministry of Education and Research Institute of Beihang University in Shenzhen, Beijing Advanced Innovation Center for Biomedical Engineering, Beihang University, Xueyuan Road No.37, Beijing, 100191 China
- Microsoft Research, Danling Street No. 5, Beijing, 100080 China
| |
|
38
|
Fan DF, Yu YC, Ding XS, Nie XL, Wei R, Feng XY, Peng XX, Gao MM, Jia LL, Wang XL. Exploring the drug-induced anemia signals in children using electronic medical records. Expert Opin Drug Saf 2019; 18:993-999. [PMID: 31315002 DOI: 10.1080/14740338.2019.1645832] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 10/26/2022]
Abstract
Objectives: The objectives were to identify drugs related with anemia in children and evaluate the novelty of these correlations. Methods: The authors established a two-step method for detecting the relationship between drugs and anemia using electronic medical records (EMRs), which were obtained from 247,136 patients in Beijing Children's Hospital between 2007 and 2017. The authors extracted potential drugs by mining cases for hemoglobin abnormalities from the EMR and then performed a retrospective cohort study to correlate them with anemia by calculating the matched odds ratios and 95% confidence interval using unconditional logistic regression analysis. Results: In total, nine positive drug-anemia associations were identified. Among them, the correlations of drugs fluconazole (OR 3.95; 95%CI: 2.65-5.87) and cefathiamidine (OR 3.49; 95%CI: 2.94-4.15) with anemia were considered new signals in both children and adults. Three associations of drugs, vancomycin, cefoperazone-sulbactam and ibuprofen, with anemia were considered new signals in children. Conclusion: The authors detected nine signals of drug-induced anemia, including two new signals in children and adults and three new signals in children. This study could serve as a model for using EMR and automatic mining to monitor adverse drug reaction signals in the pediatric population.
Affiliation(s)
- Duan-Fang Fan
- Clinical Research Center, National Center for Children's Health, Beijing Children's Hospital, Capital Medical University, Beijing, China
- School of Basic Medicine and Clinical Pharmacy, China Pharmaceutical University, Nanjing, Jiangsu, China
| | - Yun-Cui Yu
- Clinical Research Center, National Center for Children's Health, Beijing Children's Hospital, Capital Medical University, Beijing, China
| | - Xuan-Sheng Ding
- School of Basic Medicine and Clinical Pharmacy, China Pharmaceutical University, Nanjing, Jiangsu, China
| | - Xiao-Lu Nie
- Center for Clinical Epidemiology and Evidence-based Medicine, National Center for Children's Health, Beijing Children's Hospital, Capital Medical University, Beijing, China
| | - Ran Wei
- Clinical Research Center, National Center for Children's Health, Beijing Children's Hospital, Capital Medical University, Beijing, China
| | - Xin-Ying Feng
- School of Basic Medicine and Clinical Pharmacy, China Pharmaceutical University, Nanjing, Jiangsu, China
| | - Xiao-Xia Peng
- Center for Clinical Epidemiology and Evidence-based Medicine, National Center for Children's Health, Beijing Children's Hospital, Capital Medical University, Beijing, China
| | - Miao-Miao Gao
- Department of Pharmacy, First Hospital of Shanxi Medical University, Taiyuan, Shanxi, China
| | - Lu-Lu Jia
- Clinical Research Center, National Center for Children's Health, Beijing Children's Hospital, Capital Medical University, Beijing, China
| | - Xiao-Ling Wang
- Clinical Research Center, National Center for Children's Health, Beijing Children's Hospital, Capital Medical University, Beijing, China
| |
|
39
|
Machine Learning for Feature Selection and Cluster Analysis in Drug Utilisation Research. CURR EPIDEMIOL REP 2019. [DOI: 10.1007/s40471-019-00211-7] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Indexed: 11/27/2022]
|
40
|
Zheng C, Yu W, Xie F, Chen W, Mercado C, Sy LS, Qian L, Glenn S, Lee G, Tseng HF, Duffy J, Jackson LA, Daley MF, Crane B, McLean HQ, Jacobsen SJ. The use of natural language processing to identify Tdap-related local reactions at five health care systems in the Vaccine Safety Datalink. Int J Med Inform 2019; 127:27-34. [PMID: 31128829 PMCID: PMC6645678 DOI: 10.1016/j.ijmedinf.2019.04.009] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 09/11/2018] [Revised: 01/31/2019] [Accepted: 04/12/2019] [Indexed: 01/28/2023]
Abstract
OBJECTIVE Local reactions are the most common vaccine-related adverse event. There is no specific diagnosis code for local reaction due to vaccination. Previous vaccine safety studies used non-specific diagnosis codes to identify potential local reaction cases and confirmed the cases through manual chart review. In this study, a natural language processing (NLP) algorithm was developed to identify local reaction associated with tetanus-diphtheria-acellular pertussis (Tdap) vaccine in the Vaccine Safety Datalink. METHODS Presumptive cases of local reactions were identified among members ≥ 11 years of age using ICD-9-CM codes in all care settings in the 1-6 days following a Tdap vaccination between 2012 and 2014. The clinical notes were searched for signs and symptoms consistent with local reaction. Information on the timing and the location of a sign or symptom was also extracted to help determine whether or not the sign or symptom was vaccine related. Reactions triggered by causes other than Tdap vaccination were excluded. The NLP algorithm was developed at the lead study site and validated on a stratified random sample of 500 patients from five institutions. RESULTS The NLP algorithm achieved an overall weighted sensitivity of 87.9%, specificity of 92.8%, positive predictive value of 82.7%, and negative predictive value of 95.1%. In addition, using data at one site, the NLP algorithm identified 3326 potential Tdap-related local reactions that were not identified through diagnosis codes. CONCLUSION The NLP algorithm achieved high accuracy, and demonstrated the potential of NLP to reduce the efforts of manual chart review in vaccine safety studies.
Affiliation(s)
- Chengyi Zheng
- Kaiser Permanente Southern California, Pasadena, CA, USA.
| | - Wei Yu
- Kaiser Permanente Southern California, Pasadena, CA, USA
| | - Fagen Xie
- Kaiser Permanente Southern California, Pasadena, CA, USA
| | - Wansu Chen
- Kaiser Permanente Southern California, Pasadena, CA, USA
| | - Cheryl Mercado
- Kaiser Permanente Southern California, Pasadena, CA, USA
| | - Lina S Sy
- Kaiser Permanente Southern California, Pasadena, CA, USA
| | - Lei Qian
- Kaiser Permanente Southern California, Pasadena, CA, USA
| | | | - Gina Lee
- Kaiser Permanente Southern California, Pasadena, CA, USA
| | - Hung Fu Tseng
- Kaiser Permanente Southern California, Pasadena, CA, USA
| | - Jonathan Duffy
- Centers for Disease Control and Prevention, Atlanta, GA, USA
| | | | | | - Brad Crane
- Kaiser Permanente Northwest, Portland, OR, USA
| | - Huong Q McLean
- Marshfield Clinic Research Institute, Marshfield, WI, USA
| | | |
|
41
|
Thompson J, Hu J, Mudaranthakam DP, Streeter D, Neums L, Park M, Koestler DC, Gajewski B, Jensen R, Mayo MS. Relevant Word Order Vectorization for Improved Natural Language Processing in Electronic Health Records. Sci Rep 2019; 9:9253. [PMID: 31239489 PMCID: PMC6592944 DOI: 10.1038/s41598-019-45705-y] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 01/16/2019] [Accepted: 06/11/2019] [Indexed: 12/14/2022] Open
Abstract
Electronic health records (EHR) represent a rich resource for conducting observational studies, supporting clinical trials, and more. However, much of the data contains unstructured text, presenting an obstacle to automated extraction. Natural language processing (NLP) can structure and learn from text, but NLP algorithms were not designed for the unique characteristics of EHR. Here, we propose Relevant Word Order Vectorization (RWOV) to aid with structuring. RWOV is based on finding the positional relationships between the words most relevant to predicting the class of a text. This lets machine learning algorithms use not just keywords but also positional dependencies (e.g. a relevant word occurs 5 relevant words before some term of interest). As a proof-of-concept, we attempted to classify the hormone receptor status of breast cancer patients treated at the University of Kansas Medical Center, comparing RWOV to other methods using the F1 score and AUC. RWOV performed as well as, or better than, other methods in all but one case. For F1 score, RWOV had a clear edge on most tasks. AUC tended to be closer, but for HER2, RWOV was significantly better for most comparisons. These results suggest RWOV should be further developed for EHR-related NLP.
Affiliation(s)
- Jeffrey Thompson
- Department of Biostatistics & Data Science, University of Kansas Medical Center, Kansas City, KS, USA.
- University of Kansas Cancer Center, Kansas City, KS, USA.
| | - Jinxiang Hu
- Department of Biostatistics & Data Science, University of Kansas Medical Center, Kansas City, KS, USA
- University of Kansas Cancer Center, Kansas City, KS, USA
| | - Dinesh Pal Mudaranthakam
- Department of Biostatistics & Data Science, University of Kansas Medical Center, Kansas City, KS, USA
- University of Kansas Cancer Center, Kansas City, KS, USA
| | - David Streeter
- Department of Biostatistics & Data Science, University of Kansas Medical Center, Kansas City, KS, USA
- University of Kansas Cancer Center, Kansas City, KS, USA
| | - Lisa Neums
- Department of Biostatistics & Data Science, University of Kansas Medical Center, Kansas City, KS, USA
- University of Kansas Cancer Center, Kansas City, KS, USA
| | - Michele Park
- University of Kansas Cancer Center, Kansas City, KS, USA
| | - Devin C Koestler
- Department of Biostatistics & Data Science, University of Kansas Medical Center, Kansas City, KS, USA
- University of Kansas Cancer Center, Kansas City, KS, USA
| | - Byron Gajewski
- Department of Biostatistics & Data Science, University of Kansas Medical Center, Kansas City, KS, USA
- University of Kansas Cancer Center, Kansas City, KS, USA
| | - Roy Jensen
- University of Kansas Cancer Center, Kansas City, KS, USA
| | - Matthew S Mayo
- Department of Biostatistics & Data Science, University of Kansas Medical Center, Kansas City, KS, USA
- University of Kansas Cancer Center, Kansas City, KS, USA
| |
|
42
|
Beck EM, Hatton ND, Ryan JJ. Novel techniques for advancing our understanding of pulmonary arterial hypertension. Eur Respir J 2019; 53:53/5/1900556. [DOI: 10.1183/13993003.00556-2019] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 03/19/2019] [Accepted: 03/20/2019] [Indexed: 01/18/2023]
|
43
|
Automatic Disease Annotation From Radiology Reports Using Artificial Intelligence Implemented by a Recurrent Neural Network. AJR Am J Roentgenol 2019; 212:734-740. [DOI: 10.2214/ajr.18.19869] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Indexed: 01/20/2023]
|
44
|
Zeng Z, Espino S, Roy A, Li X, Khan SA, Clare SE, Jiang X, Neapolitan R, Luo Y. Using natural language processing and machine learning to identify breast cancer local recurrence. BMC Bioinformatics 2018; 19:498. [PMID: 30591037 PMCID: PMC6309052 DOI: 10.1186/s12859-018-2466-x] [Citation(s) in RCA: 41] [Impact Index Per Article: 6.8] [Indexed: 12/15/2022] Open
Abstract
BACKGROUND Identifying local recurrences in breast cancer from patient data sets is important for clinical research and practice. Developing a model using natural language processing and machine learning to identify local recurrences in breast cancer patients can reduce the time-consuming work of a manual chart review. METHODS We design a novel concept-based filter and a prediction model to detect local recurrences using EHRs. In the training dataset, we manually review a development corpus of 50 progress notes and extract partial sentences that indicate breast cancer local recurrence. We process these partial sentences to obtain a set of Unified Medical Language System (UMLS) concepts using MetaMap, and we call it positive concept set. We apply MetaMap on patients' progress notes and retain only the concepts that fall within the positive concept set. These features combined with the number of pathology reports recorded for each patient are used to train a support vector machine to identify local recurrences. RESULTS We compared our model with three baseline classifiers using either full MetaMap concepts, filtered MetaMap concepts, or bag of words. Our model achieved the best AUC (0.93 in cross-validation, 0.87 in held-out testing). CONCLUSIONS Compared to a labor-intensive chart review, our model provides an automated way to identify breast cancer local recurrences. We expect that by minimally adapting the positive concept set, this study has the potential to be replicated at other institutions with a moderately sized training dataset.
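The concept-based filter described in the Methods above can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' implementation: concept extraction (done with MetaMap in the study) is assumed to have already produced a list of concept identifiers per note, the identifiers shown are placeholders rather than real UMLS concepts, and using per-concept counts as features is a guess, since the abstract does not say how the retained concepts are encoded. The function names are hypothetical, and the support vector machine training step is left out.

```python
from collections import Counter

def filter_concepts(note_concepts, positive_set):
    """Keep only the concepts that fall within the positive concept set."""
    return [c for c in note_concepts if c in positive_set]

def build_features(note_concepts, positive_set, n_pathology_reports):
    """Build one feature vector for a patient.

    ASSUMPTION: features are counts of each positive concept, in sorted
    order, followed by the number of pathology reports for the patient.
    """
    vocab = sorted(positive_set)
    kept = Counter(filter_concepts(note_concepts, positive_set))
    return [kept[c] for c in vocab] + [n_pathology_reports]

# Placeholder identifiers (hypothetical; not real UMLS concepts).
positive_set = {"CONCEPT_A", "CONCEPT_B", "CONCEPT_C"}
note_concepts = ["CONCEPT_A", "CONCEPT_X", "CONCEPT_A", "CONCEPT_B"]

features = build_features(note_concepts, positive_set, n_pathology_reports=2)
print(features)  # [2, 1, 0, 2]
```

Restricting the vocabulary to the positive concept set keeps the feature space small, which is consistent with the abstract's claim that the approach can be replicated with a moderately sized training dataset.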
Affiliation(s)
- Zexian Zeng
  - Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, Chicago, IL, USA
- Sasa Espino
  - Department of Surgery, Feinberg School of Medicine, Northwestern University, Chicago, IL, USA
- Ankita Roy
  - Department of Surgery, Feinberg School of Medicine, Northwestern University, Chicago, IL, USA
- Xiaoyu Li
  - Department of Social and Behavioral Sciences, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Seema A Khan
  - Department of Surgery, Feinberg School of Medicine, Northwestern University, Chicago, IL, USA
- Susan E Clare
  - Department of Surgery, Feinberg School of Medicine, Northwestern University, Chicago, IL, USA
- Xia Jiang
  - Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA
- Richard Neapolitan
  - Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, Chicago, IL, USA
- Yuan Luo
  - Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, Chicago, IL, USA
|
45
|
Ta CN, Dumontier M, Hripcsak G, Tatonetti NP, Weng C. Columbia Open Health Data, clinical concept prevalence and co-occurrence from electronic health records. Sci Data 2018; 5:180273. [PMID: 30480666 PMCID: PMC6257042 DOI: 10.1038/sdata.2018.273] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.8] [Received: 07/09/2018] [Accepted: 10/16/2018] [Indexed: 12/11/2022] Open
Abstract
Columbia Open Health Data (COHD) is a publicly accessible database of electronic health record (EHR) prevalence and co-occurrence frequencies between conditions, drugs, procedures, and demographics. COHD was derived from Columbia University Irving Medical Center's Observational Health Data Sciences and Informatics (OHDSI) database. The lifetime dataset, derived from all records, contains 36,578 single concepts (11,952 conditions, 12,334 drugs, and 10,816 procedures) and 32,788,901 concept pairs from 5,364,781 patients. The 5-year dataset, derived from records from 2013-2017, contains 29,964 single concepts (10,159 conditions, 10,264 drugs, and 8,270 procedures) and 15,927,195 concept pairs from 1,790,431 patients. Exclusion of rare concepts (count ≤ 10) and Poisson randomization enable data sharing by eliminating risks to patient privacy. EHR prevalences are informative of healthcare consumption rates. Analysis of co-occurrence frequencies via relative frequency analysis and the observed-expected frequency ratio is informative of associations between clinical concepts and is useful for biomedical research tasks such as drug repurposing and pharmacovigilance. COHD is publicly accessible through a web application-programming interface (API) and downloadable from the Figshare repository. The code is available on GitHub.
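The two co-occurrence measures named in the abstract can be illustrated with a short sketch. The formulas below are assumptions about how such measures are commonly defined (relative frequency as the pair count over one concept's count, and the expected pair count under independence as count_a × count_b / N, reported on a natural-log scale); the abstract itself does not spell them out, and the function names and counts are made up for illustration.

```python
import math

def relative_frequency(pair_count, base_count):
    """Share of patients with the base concept who also have the paired concept."""
    return pair_count / base_count

def ln_obs_exp_ratio(pair_count, count_a, count_b, n_patients):
    """Natural log of observed vs. expected co-occurrence.

    Expected count assumes the two concepts occur independently.
    """
    expected = count_a * count_b / n_patients
    return math.log(pair_count / expected)

# Toy counts, not taken from COHD.
n_patients = 1000
count_a, count_b, pair_count = 100, 50, 20

rel_freq = relative_frequency(pair_count, count_a)
ln_ratio = ln_obs_exp_ratio(pair_count, count_a, count_b, n_patients)
print(rel_freq, ln_ratio)
```

With these toy counts the expected pair count is 5, so a positive log ratio indicates the pair co-occurs more often than independence would predict.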
Affiliation(s)
- Casey N. Ta
  - Department of Biomedical Informatics, Columbia University, NY, USA
- Michel Dumontier
  - Institute of Data Science, Maastricht University, Maastricht, The Netherlands
- George Hripcsak
  - Department of Biomedical Informatics, Columbia University, NY, USA
- Nicholas P. Tatonetti
  - Department of Biomedical Informatics, Columbia University, NY, USA
  - Department of Systems Biology, Columbia University, NY, USA
  - Department of Medicine, Columbia University, NY, USA
- Chunhua Weng
  - Department of Biomedical Informatics, Columbia University, NY, USA
|
46
|
Wang L, Rastegar-Mojarad M, Ji Z, Liu S, Liu K, Moon S, Shen F, Wang Y, Yao L, Davis III JM, Liu H. Detecting Pharmacovigilance Signals Combining Electronic Medical Records With Spontaneous Reports: A Case Study of Conventional Disease-Modifying Antirheumatic Drugs for Rheumatoid Arthritis. Front Pharmacol 2018; 9:875. [PMID: 30131701 PMCID: PMC6090179 DOI: 10.3389/fphar.2018.00875] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Received: 12/16/2017] [Accepted: 07/19/2018] [Indexed: 12/24/2022] Open
Abstract
Multiple data sources are preferred in adverse drug event (ADE) surveillance owing to the inadequacies of any single source. However, analytic methods to monitor potential ADEs after prolonged drug exposure are still lacking. In this study, we propose a method to screen potential ADEs by combining the FDA Adverse Event Reporting System (FAERS) and electronic medical records (EMRs). The proposed method uses natural language processing (NLP) techniques to extract treatment outcome information captured in unstructured text and adopts a case-crossover design in the EMR. Performance was evaluated using two ADE knowledge bases: the Adverse Drug Reaction Classification System (ADReCS) and SIDER. We tested our method in ADE signal detection for conventional disease-modifying antirheumatic drugs (DMARDs) in rheumatoid arthritis patients. Findings showed that recall greatly increased when combining FAERS with the EMR compared with FAERS alone or the EMR alone, especially for the flexible mapping strategy. Precision (FAERS + EMR) in detecting ADEs improved using ADReCS as the gold standard compared with SIDER. In addition, signals detected from the EMR overlapped considerably with signals detected from FAERS or ADE knowledge bases, implying the importance of the EMR for pharmacovigilance. ADE signals detected from the EMR and/or FAERS but not in existing knowledge bases provide hypotheses for future study.
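The evaluation idea, scoring signals from each source and from their combination against a knowledge base, can be sketched as below. This is only an illustration under stated assumptions: signals are modeled as (drug, event) pairs, the combination is taken to be a plain set union, and all pairs are placeholders rather than findings from the study.

```python
def precision_recall(detected, gold):
    """Precision and recall of detected (drug, event) signals against a gold set."""
    tp = len(detected & gold)
    precision = tp / len(detected) if detected else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

# Placeholder signals; none of these pairs come from the paper.
faers = {("drug_x", "event_1"), ("drug_x", "event_2")}
emr = {("drug_x", "event_1"), ("drug_x", "event_3"), ("drug_x", "event_5")}
gold = {("drug_x", f"event_{i}") for i in (1, 2, 3, 4)}

faers_pr = precision_recall(faers, gold)
emr_pr = precision_recall(emr, gold)
combined_pr = precision_recall(faers | emr, gold)  # union of both sources
print(faers_pr, emr_pr, combined_pr)
```

In this toy setup the union recovers three of the four gold signals, whereas each source alone recovers two, which mirrors the direction of the recall gain reported in the abstract.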
Affiliation(s)
- Liwei Wang
  - Department of Health Sciences Research, Mayo Clinic College of Medicine, Rochester, MN, United States
- Majid Rastegar-Mojarad
  - Department of Health Sciences Research, Mayo Clinic College of Medicine, Rochester, MN, United States
- Zhiliang Ji
  - State Key Laboratory of Cellular Stress Biology, School of Life Sciences, Xiamen University, Xiamen, China
- Sijia Liu
  - Department of Health Sciences Research, Mayo Clinic College of Medicine, Rochester, MN, United States
- Ke Liu
  - State Key Laboratory of Cellular Stress Biology, School of Life Sciences, Xiamen University, Xiamen, China
- Sungrim Moon
  - Department of Health Sciences Research, Mayo Clinic College of Medicine, Rochester, MN, United States
- Feichen Shen
  - Department of Health Sciences Research, Mayo Clinic College of Medicine, Rochester, MN, United States
- Yanshan Wang
  - Department of Health Sciences Research, Mayo Clinic College of Medicine, Rochester, MN, United States
- Lixia Yao
  - Department of Health Sciences Research, Mayo Clinic College of Medicine, Rochester, MN, United States
- John M Davis III
  - Department of Health Sciences Research, Mayo Clinic College of Medicine, Rochester, MN, United States
- Hongfang Liu
  - Department of Health Sciences Research, Mayo Clinic College of Medicine, Rochester, MN, United States
|
47
|
Combi C, Zorzi M, Pozzani G, Arzenton E, Moretti U. Normalizing Spontaneous Reports Into MedDRA: Some Experiments With MagiCoder. IEEE J Biomed Health Inform 2018; 23:95-102. [PMID: 30059326 DOI: 10.1109/jbhi.2018.2861213] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Indexed: 11/09/2022]
Abstract
Text normalization into medical dictionaries is useful to support clinical tasks. A typical setting is pharmacovigilance (PV). The manual detection of suspected adverse drug reactions (ADRs) in narrative reports is time-consuming, and natural language processing (NLP) provides concrete help to PV experts. In this paper, we carry out experiments to test the performance of MagiCoder, an NLP application designed to extract MedDRA terms from narrative clinical text. Given a narrative description, MagiCoder proposes an automatic encoding. The pharmacologist reviews, (possibly) corrects, and then validates the solution. This drastically reduces the time needed to validate reports compared with a completely manual encoding. In previous work, we mainly tested MagiCoder's performance on spontaneous reports written in Italian. In this paper, we include some new features, change the experiment design, and carry out more tests of MagiCoder. Moreover, we change language, moving to English documents. In particular, we tested MagiCoder on the CADEC dataset, a corpus of manually annotated posts about ADRs collected from social media.
|
48
|
Bhasuran B, Natarajan J. Automatic extraction of gene-disease associations from literature using joint ensemble learning. PLoS One 2018; 13:e0200699. [PMID: 30048465 PMCID: PMC6061985 DOI: 10.1371/journal.pone.0200699] [Citation(s) in RCA: 32] [Impact Index Per Article: 5.3] [Received: 01/30/2018] [Accepted: 07/02/2018] [Indexed: 12/26/2022] Open
Abstract
A wealth of knowledge concerning relations between genes and their associated diseases is present in the biomedical literature. Mining these biological associations from the literature can provide immense support to research ranging from drug-targetable pathways to biomarker discovery. However, the time and cost of manual curation heavily slow it down. In this scenario, biomedical text mining is one of the crucial technologies, and relation extraction shows promising results for exploring genes associated with diseases. We addressed this problem from a text mining perspective by developing automatic extraction of gene-disease associations from the literature using joint ensemble learning. In the proposed work, we employ a supervised machine learning approach in which a rich feature set covering conceptual, syntactic, and semantic properties, jointly learned with word embeddings, is used to train an ensemble support vector machine for extracting gene-disease relations from four gold standard corpora. On evaluation, the machine learning approach shows promising results, with F-measures of 85.34%, 83.93%, 87.39%, and 85.57% on the EUADR, GAD, CoMAGC, and PolySearch corpora, respectively. We strongly believe that the presented novel approach, which combines a rich syntactic and semantic feature set with domain-specific word embeddings through ensemble support vector machines and was evaluated on four gold standard corpora, can act as a new baseline for future work in gene-disease relation extraction from the literature.
Affiliation(s)
- Balu Bhasuran
  - DRDO-BU Center for Life Sciences, Bharathiar University Campus, Coimbatore, Tamilnadu, India
- Jeyakumar Natarajan
  - DRDO-BU Center for Life Sciences, Bharathiar University Campus, Coimbatore, Tamilnadu, India
  - Data mining and Text mining Laboratory, Department of Bioinformatics, Bharathiar University, Coimbatore, Tamilnadu, India
|
49
|
Chen X, Faviez C, Schuck S, Lillo-Le-Louët A, Texier N, Dahamna B, Huot C, Foulquié P, Pereira S, Leroux V, Karapetiantz P, Guenegou-Arnoux A, Katsahian S, Bousquet C, Burgun A. Mining Patients' Narratives in Social Media for Pharmacovigilance: Adverse Effects and Misuse of Methylphenidate. Front Pharmacol 2018; 9:541. [PMID: 29881351 PMCID: PMC5978246 DOI: 10.3389/fphar.2018.00541] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.5] [Received: 12/15/2017] [Accepted: 05/04/2018] [Indexed: 12/29/2022] Open
Abstract
Background: The Food and Drug Administration (FDA) in the United States and the European Medicines Agency (EMA) have recognized social media as a new data source to strengthen their activities regarding drug safety. Objective: Our objective in the ADR-PRISM project was to provide text mining and visualization tools to explore a corpus of posts extracted from social media. We evaluated this approach on a corpus of 21 million posts from five patient forums, and conducted a qualitative analysis of the data available on methylphenidate in this corpus. Methods: We applied text mining methods based on named entity recognition and relation extraction in the corpus, followed by signal detection using the proportional reporting ratio (PRR). We also used topic modeling based on the Correlated Topic Model to obtain the list of themes in the corpus and classify the messages based on their topics. Results: We automatically identified 3443 posts about methylphenidate published between 2007 and 2016, among which 61 adverse drug reactions (ADRs) were automatically detected. Two pharmacovigilance experts manually evaluated the quality of automatic identification, and an F-measure of 0.57 was reached. Patients' reports mainly concerned neuro-psychiatric effects. Applying PRR, 67% of the ADRs were signals, including most of the neuro-psychiatric symptoms but also palpitations. Topic modeling showed that the most represented topics were related to Childhood and Treatment initiation, but also Side effects. Cases of misuse were also identified in this corpus, including recreational use and abuse. Conclusion: Named entity recognition combined with signal detection and topic modeling have demonstrated their complementarity in mining social media data. An in-depth analysis focused on methylphenidate showed that this approach was able to detect potential signals and to provide better understanding of patients' behaviors regarding drugs, including misuse.
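For readers unfamiliar with the proportional reporting ratio used for signal detection here, a minimal sketch of the standard 2×2 definition follows. The counts are invented for illustration and do not come from the study, and the abstract does not state which signal threshold the authors applied, so none is assumed.

```python
def proportional_reporting_ratio(a, b, c, d):
    """PRR from a 2x2 table of post counts.

    a: posts mentioning the drug and the event
    b: posts mentioning the drug without the event
    c: posts mentioning other drugs with the event
    d: posts mentioning other drugs without the event
    """
    rate_drug = a / (a + b)    # event rate among posts about the drug
    rate_other = c / (c + d)   # event rate among posts about other drugs
    return rate_drug / rate_other

# Toy counts only.
prr = proportional_reporting_ratio(a=20, b=80, c=100, d=900)
print(prr)
```

A PRR above 1 means the event is reported proportionally more often with the drug of interest than with the comparison drugs.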
Affiliation(s)
- Xiaoyi Chen
  - UMRS 1138, équipe 22, Institut National de la Santé et de la Recherche Médicale, Centre de Recherche des Cordeliers, Université Paris Descartes, Paris, France
- Agnès Lillo-Le-Louët
  - Centre Régional de Pharmacovigilance, Hôpital Européen Georges-Pompidou, AP-HP, Paris, France
- Badisse Dahamna
  - Service d'Informatique Biomédicale, Centre Hospitalier Universitaire de Rouen, Rouen, France
  - Laboratoire d'Informatique, du Traitement de l'Information et des Systèmes-TIBS EA 4108, Rouen, France
- Pierre Karapetiantz
  - UMRS 1138, équipe 22, Institut National de la Santé et de la Recherche Médicale, Centre de Recherche des Cordeliers, Université Paris Descartes, Paris, France
- Armelle Guenegou-Arnoux
  - UMRS 1138, équipe 22, Institut National de la Santé et de la Recherche Médicale, Centre de Recherche des Cordeliers, Université Paris Descartes, Paris, France
- Sandrine Katsahian
  - UMRS 1138, équipe 22, Institut National de la Santé et de la Recherche Médicale, Centre de Recherche des Cordeliers, Université Paris Descartes, Paris, France
  - Département d'Informatique Médicale, Hôpital Européen Georges Pompidou, Paris, France
- Cédric Bousquet
  - Sorbonne Université, Inserm, université Paris 13, Laboratoire d'informatique médicale et d'ingénierie des connaissances en e-santé, LIMICS, Paris, France
- Anita Burgun
  - UMRS 1138, équipe 22, Institut National de la Santé et de la Recherche Médicale, Centre de Recherche des Cordeliers, Université Paris Descartes, Paris, France
  - Département d'Informatique Médicale, Hôpital Européen Georges Pompidou, Paris, France
|
50
|
Smith JC, Chen Q, Denny JC, Roden DM, Johnson KB, Miller RA. Evaluation of a Novel System to Enhance Clinicians' Recognition of Preadmission Adverse Drug Reactions. Appl Clin Inform 2018; 9:313-325. [PMID: 29742757 DOI: 10.1055/s-0038-1646963] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Indexed: 12/17/2022] Open
Abstract
BACKGROUND Often unrecognized by providers, adverse drug reactions (ADRs) diminish patients' quality of life, cause preventable admissions and emergency department visits, and increase health care costs. OBJECTIVE This article evaluates whether an automated system, the Adverse Drug Effect Recognizer (ADER), could assist clinicians in detecting and addressing inpatients' ongoing preadmission ADRs. METHODS ADER uses natural language processing to extract patients' medications, findings, and past diagnoses from admission notes. It compares excerpted information to a database of known medication adverse effects and promptly warns clinicians about potential ongoing ADRs and potential confounders via alerts placed in patients' electronic health records (EHRs). A 3-month intervention trial evaluated ADER's impact on antihypertensive medication ordering behaviors. At the time of patient admission, ADER warned providers on the Internal Medicine wards of Vanderbilt University Hospital about potential ongoing preadmission antihypertensive medication ADRs. A retrospective control group, comprising similar physicians from a period prior to the intervention, received no alerts. The evaluation compared ordering behaviors for each group to determine if preadmission medications changed during hospitalization or at discharge. The study also analyzed intervention group participants' survey responses and user comments. RESULTS ADER identified potential preadmission ADRs for 30% of both groups. Compared with controls, intervention providers more often withheld or discontinued suspected ADR-causing medications during the inpatient stay (p < 0.001). Intervention providers who responded to alert-related surveys held or discontinued suspected ADR-causing medications more often at discharge (p < 0.001). CONCLUSION Results indicate that ADER helped physicians recognize ADRs and reduced ordering of suspected ADR-causing medications. In hospitals using EHRs, ADER-like systems could improve clinicians' recognition and elimination of ongoing ADRs.
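The matching step described in the METHODS, comparing extracted medications and findings against known adverse effects, might look roughly like the sketch below. It is not ADER's actual logic: the function name `potential_adrs`, the tiny lookup table, and the exact-string matching are all simplifying assumptions, and the NLP extraction and confounder handling are omitted.

```python
def potential_adrs(medications, findings, known_effects):
    """Return medications whose known adverse effects appear among the findings."""
    observed = set(findings)
    alerts = {}
    for med in medications:
        matched = known_effects.get(med, set()) & observed
        if matched:
            alerts[med] = sorted(matched)
    return alerts

# Illustrative lookup table only; a real system would use a curated database.
known_effects = {
    "lisinopril": {"cough", "hyperkalemia"},
    "amlodipine": {"peripheral edema"},
}

alerts = potential_adrs(
    medications=["lisinopril", "amlodipine"],
    findings=["cough", "fever"],
    known_effects=known_effects,
)
print(alerts)
```

Each entry in the result corresponds to one candidate alert that a clinician would still need to review.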
Affiliation(s)
- Joshua C Smith
  - Department of Biomedical Informatics, Vanderbilt University Medical Center and Vanderbilt University School of Medicine, Nashville, Tennessee, United States
- Qingxia Chen
  - Department of Biomedical Informatics, Vanderbilt University Medical Center and Vanderbilt University School of Medicine, Nashville, Tennessee, United States
  - Department of Biostatistics, Vanderbilt University School of Medicine, Nashville, Tennessee, United States
- Joshua C Denny
  - Department of Biomedical Informatics, Vanderbilt University Medical Center and Vanderbilt University School of Medicine, Nashville, Tennessee, United States
  - Department of Medicine, Vanderbilt University School of Medicine, Nashville, Tennessee, United States
- Dan M Roden
  - Department of Biomedical Informatics, Vanderbilt University Medical Center and Vanderbilt University School of Medicine, Nashville, Tennessee, United States
  - Department of Medicine, Vanderbilt University School of Medicine, Nashville, Tennessee, United States
  - Department of Pharmacology, Vanderbilt University School of Medicine, Nashville, Tennessee, United States
- Kevin B Johnson
  - Department of Biomedical Informatics, Vanderbilt University Medical Center and Vanderbilt University School of Medicine, Nashville, Tennessee, United States
  - Department of Pediatrics, Vanderbilt University School of Medicine, Nashville, Tennessee, United States
- Randolph A Miller
  - Department of Biomedical Informatics, Vanderbilt University Medical Center and Vanderbilt University School of Medicine, Nashville, Tennessee, United States
  - Department of Medicine, Vanderbilt University School of Medicine, Nashville, Tennessee, United States
  - School of Nursing, Vanderbilt University, Nashville, Tennessee, United States
|