1
|
Lai PT, Wei CH, Luo L, Chen Q, Lu Z. BioREx: Improving biomedical relation extraction by leveraging heterogeneous datasets. J Biomed Inform 2023; 146:104487. [PMID: 37673376 DOI: 10.1016/j.jbi.2023.104487] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/07/2023] [Revised: 08/18/2023] [Accepted: 09/02/2023] [Indexed: 09/08/2023]
Abstract
Biomedical relation extraction (RE) is the task of automatically identifying and characterizing relations between biomedical concepts from free text. RE is a central task in biomedical natural language processing (NLP) research and plays a critical role in many downstream applications, such as literature-based discovery and knowledge graph construction. State-of-the-art methods were used primarily to train machine learning models on individual RE datasets, such as protein-protein interaction and chemical-induced disease relation. Manual dataset annotation, however, is highly expensive and time-consuming, as it requires domain knowledge. Existing RE datasets are usually domain-specific or small, which limits the development of generalized and high-performing RE models. In this work, we present a novel framework for systematically addressing the data heterogeneity of individual datasets and combining them into a large dataset. Based on the framework and dataset, we report on BioREx, a data-centric approach for extracting relations. Our evaluation shows that BioREx achieves significantly higher performance than the benchmark system trained on the individual dataset, setting a new SOTA from 74.4% to 79.6% in F-1 measure on the recently released BioRED corpus. We further demonstrate that the combined dataset can improve performance for five different RE tasks. In addition, we show that on average BioREx compares favorably to current best-performing methods such as transfer learning and multi-task learning. Finally, we demonstrate BioREx's robustness and generalizability in two independent RE tasks not previously seen in training data: drug-drug N-ary combination and document-level gene-disease RE. The integrated dataset and optimized method have been packaged as a stand-alone tool available at https://github.com/ncbi/BioREx.
Collapse
Affiliation(s)
- Po-Ting Lai
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), MD, 20894 Bethesda, USA
| | - Chih-Hsuan Wei
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), MD, 20894 Bethesda, USA
| | - Ling Luo
- School of Computer Science and Technology, Dalian University of Technology, 116024 Dalian, China
| | - Qingyu Chen
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), MD, 20894 Bethesda, USA
| | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), MD, 20894 Bethesda, USA.
| |
Collapse
|
2
|
Lai PT, Wei CH, Luo L, Chen Q, Lu Z. BioREx: Improving Biomedical Relation Extraction by Leveraging Heterogeneous Datasets. ARXIV 2023:arXiv:2306.11189v1. [PMID: 37502629 PMCID: PMC10370213] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Subscribe] [Scholar Register] [Indexed: 07/29/2023]
Abstract
Biomedical relation extraction (RE) is the task of automatically identifying and characterizing relations between biomedical concepts from free text. RE is a central task in biomedical natural language processing (NLP) research and plays a critical role in many downstream applications, such as literature-based discovery and knowledge graph construction. State-of-the-art methods were used primarily to train machine learning models on individual RE datasets, such as protein-protein interaction and chemical-induced disease relation. Manual dataset annotation, however, is highly expensive and time-consuming, as it requires domain knowledge. Existing RE datasets are usually domain-specific or small, which limits the development of generalized and high-performing RE models. In this work, we present a novel framework for systematically addressing the data heterogeneity of individual datasets and combining them into a large dataset. Based on the framework and dataset, we report on BioREx, a data-centric approach for extracting relations. Our evaluation shows that BioREx achieves significantly higher performance than the benchmark system trained on the individual dataset, setting a new SOTA from 74.4% to 79.6% in F-1 measure on the recently released BioRED corpus. We further demonstrate that the combined dataset can improve performance for five different RE tasks. In addition, we show that on average BioREx compares favorably to current best-performing methods such as transfer learning and multi-task learning. Finally, we demonstrate BioREx's robustness and generalizability in two independent RE tasks not previously seen in training data: drug-drug N-ary combination and document-level gene-disease RE. The integrated dataset and optimized method have been packaged as a stand-alone tool available at https://github.com/ncbi/BioREx.
Collapse
Affiliation(s)
- Po-Ting Lai
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), MD, 20894, Bethesda, USA
| | - Chih-Hsuan Wei
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), MD, 20894, Bethesda, USA
| | - Ling Luo
- School of Computer Science and Technology, Dalian University of Technology, 116024, Dalian, China
| | - Qingyu Chen
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), MD, 20894, Bethesda, USA
| | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), MD, 20894, Bethesda, USA
| |
Collapse
|
3
|
Oliveira Dos Santos Á, Sergio da Silva E, Machado Couto L, Valadares Labanca Reis G, Silva Belo V. The use of artificial intelligence for automating or semi-automating biomedical literature analyses: a scoping review. J Biomed Inform 2023; 142:104389. [PMID: 37187321 DOI: 10.1016/j.jbi.2023.104389] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/06/2023] [Revised: 04/11/2023] [Accepted: 05/08/2023] [Indexed: 05/17/2023]
Abstract
OBJECTIVE Evidence-based medicine (EBM) is a decision-making process based on the conscious and judicious use of the best available scientific evidence. However, the exponential increase in the amount of information currently available likely exceeds the capacity of human-only analysis. In this context, artificial intelligence (AI) and its branches such as machine learning (ML) can be used to facilitate human efforts in analyzing the literature to foster EBM. The present scoping review aimed to examine the use of AI in the automation of biomedical literature survey and analysis with a view to establishing the state-of-the-art and identifying knowledge gaps. MATERIALS AND METHODS Comprehensive searches of the main databases were performed for articles published up to June 2022 and studies were selected according to inclusion and exclusion criteria. Data were extracted from the included articles and the findings categorized. RESULTS The total number of records retrieved from the databases was 12,145, of which 273 were included in the review. Classification of the studies according to the use of AI in evaluating the biomedical literature revealed three main application groups, namely assembly of scientific evidence (n=127; 47%), mining the biomedical literature (n=112; 41%) and quality analysis (n=34; 12%). Most studies addressed the preparation of systematic reviews, while articles focusing on the development of guidelines and evidence synthesis were the least frequent. The biggest knowledge gap was identified within the quality analysis group, particularly regarding methods and tools that assess the strength of recommendation and consistency of evidence. CONCLUSION Our review shows that, despite significant progress in the automation of biomedical literature surveys and analyses in recent years, intense research is needed to fill knowledge gaps on more difficult aspects of ML, deep learning and natural language processing, and to consolidate the use of automation by end-users (biomedical researchers and healthcare professionals).
Collapse
Affiliation(s)
| | - Eduardo Sergio da Silva
- Federal University of São João del-Rei, Campus Centro-Oeste Dona Lindu, Divinópolis, Minas Gerais, Brazil.
| | - Letícia Machado Couto
- Federal University of São João del-Rei, Campus Centro-Oeste Dona Lindu, Divinópolis, Minas Gerais, Brazil.
| | | | - Vinícius Silva Belo
- Federal University of São João del-Rei, Campus Centro-Oeste Dona Lindu, Divinópolis, Minas Gerais, Brazil.
| |
Collapse
|
4
|
Kingsmore KM, Puglisi CE, Grammer AC, Lipsky PE. An introduction to machine learning and analysis of its use in rheumatic diseases. Nat Rev Rheumatol 2021; 17:710-730. [PMID: 34728818 DOI: 10.1038/s41584-021-00708-w] [Citation(s) in RCA: 36] [Impact Index Per Article: 12.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 10/04/2021] [Indexed: 02/07/2023]
Abstract
Machine learning (ML) is a computerized analytical technique that is being increasingly employed in biomedicine. ML often provides an advantage over explicitly programmed strategies in the analysis of multidimensional information by recognizing relationships in the data that were not previously appreciated. As such, the use of ML in rheumatology is increasing, and numerous studies have employed ML to classify patients with rheumatic autoimmune inflammatory diseases (RAIDs) from medical records and imaging, biometric or gene expression data. However, these studies are limited by sample size, the accuracy of sample labelling, and absence of datasets for external validation. In addition, there is potential for ML models to overfit or underfit the data and, thereby, these models might produce results that cannot be replicated in an unrelated dataset. In this Review, we introduce the basic principles of ML and discuss its current strengths and weaknesses in the classification of patients with RAIDs. Moreover, we highlight the successful analysis of the same type of input data (for example, medical records) with different algorithms, illustrating the potential plasticity of this analytical approach. Altogether, a better understanding of ML and the future application of advanced analytical techniques based on this approach, coupled with the increasing availability of biomedical data, may facilitate the development of meaningful precision medicine for patients with RAIDs.
Collapse
Affiliation(s)
| | | | - Amrie C Grammer
- AMPEL BioSolutions and RILITE Research Institute, Charlottesville, VA, USA
| | - Peter E Lipsky
- AMPEL BioSolutions and RILITE Research Institute, Charlottesville, VA, USA
| |
Collapse
|
5
|
Deng S, Sun Y, Zhao T, Hu Y, Zang T. A Review of Drug Side Effect Identification Methods. Curr Pharm Des 2020; 26:3096-3104. [PMID: 32532187 DOI: 10.2174/1381612826666200612163819] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/05/2020] [Accepted: 05/18/2020] [Indexed: 11/22/2022]
Abstract
Drug side effects have become an important indicator for evaluating the safety of drugs. There are two main factors in the frequent occurrence of drug safety problems; on the one hand, the clinical understanding of drug side effects is insufficient, leading to frequent adverse drug reactions, while on the other hand, due to the long-term period and complexity of clinical trials, side effects of approved drugs on the market cannot be reported in a timely manner. Therefore, many researchers have focused on developing methods to identify drug side effects. In this review, we summarize the methods of identifying drug side effects and common databases in this field. We classified methods of identifying side effects into four categories: biological experimental, machine learning, text mining and network methods. We point out the key points of each kind of method. In addition, we also explain the advantages and disadvantages of each method. Finally, we propose future research directions.
Collapse
Affiliation(s)
- Shuai Deng
- College of Science, Beijing Forestry University, Beijing, China
| | - Yige Sun
- Microbiology Department, Harbin Medical University, Harbin, 150081, China
| | - Tianyi Zhao
- School of Life Science and Technology, Department of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
| | - Yang Hu
- School of Life Science and Technology, Department of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
| | - Tianyi Zang
- School of Life Science and Technology, Department of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
| |
Collapse
|
6
|
Beeksma M, Verberne S, van den Bosch A, Das E, Hendrickx I, Groenewoud S. Predicting life expectancy with a long short-term memory recurrent neural network using electronic medical records. BMC Med Inform Decis Mak 2019; 19:36. [PMID: 30819172 PMCID: PMC6394008 DOI: 10.1186/s12911-019-0775-2] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/13/2018] [Accepted: 02/18/2019] [Indexed: 01/03/2023] Open
Abstract
BACKGROUND Life expectancy is one of the most important factors in end-of-life decision making. Good prognostication for example helps to determine the course of treatment and helps to anticipate the procurement of health care services and facilities, or more broadly: facilitates Advance Care Planning. Advance Care Planning improves the quality of the final phase of life by stimulating doctors to explore the preferences for end-of-life care with their patients, and people close to the patients. Physicians, however, tend to overestimate life expectancy, and miss the window of opportunity to initiate Advance Care Planning. This research tests the potential of using machine learning and natural language processing techniques for predicting life expectancy from electronic medical records. METHODS We approached the task of predicting life expectancy as a supervised machine learning task. We trained and tested a long short-term memory recurrent neural network on the medical records of deceased patients. We developed the model with a ten-fold cross-validation procedure, and evaluated its performance on a held-out set of test data. We compared the performance of a model which does not use text features (baseline model) to the performance of a model which uses features extracted from the free texts of the medical records (keyword model), and to doctors' performance on a similar task as described in scientific literature. RESULTS Both doctors and the baseline model were correct in 20% of the cases, taking a margin of 33% around the actual life expectancy as the target. The keyword model, in comparison, attained an accuracy of 29% with its prognoses. While doctors overestimated life expectancy in 63% of the incorrect prognoses, which harms anticipation to appropriate end-of-life care, the keyword model overestimated life expectancy in only 31% of the incorrect prognoses. CONCLUSIONS Prognostication of life expectancy is difficult for humans. Our research shows that machine learning and natural language processing techniques offer a feasible and promising approach to predicting life expectancy. The research has potential for real-life applications, such as supporting timely recognition of the right moment to start Advance Care Planning.
Collapse
Affiliation(s)
- Merijn Beeksma
- Centre for Language Studies, Radboud University, Erasmusplein 1, 6525 HT Nijmegen, The Netherlands
| | - Suzan Verberne
- Leiden Institute for Advanced Computer Sciences, Leiden University, Niels Bohrweg 1, 2333 CA Leiden, The Netherlands
| | - Antal van den Bosch
- KNAW Meertens Institute, Oudezijds Achterburgwal 185, 1012 DK Amsterdam, The Netherlands
| | - Enny Das
- Centre for Language Studies, Radboud University, Erasmusplein 1, 6525 HT Nijmegen, The Netherlands
| | - Iris Hendrickx
- Centre for Language Studies, Radboud University, Erasmusplein 1, 6525 HT Nijmegen, The Netherlands
| | - Stef Groenewoud
- IQ Healthcare, Radboudumc, Mailbox 9101, 6500 HB Nijmegen, The Netherlands
| |
Collapse
|
7
|
Inferring Drug-Protein⁻Side Effect Relationships from Biomedical Text. Genes (Basel) 2019; 10:genes10020159. [PMID: 30791472 PMCID: PMC6409686 DOI: 10.3390/genes10020159] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/24/2019] [Revised: 02/13/2019] [Accepted: 02/14/2019] [Indexed: 11/16/2022] Open
Abstract
Background: Although there are many studies of drugs and their side effects, the underlying mechanisms of these side effects are not well understood. It is also difficult to understand the specific pathways between drugs and side effects. Objective: The present study seeks to construct putative paths between drugs and their side effects by applying text-mining techniques to free text of biomedical studies, and to develop ranking metrics that could identify the most-likely paths. Materials and Methods: We extracted three types of relationships—drug-protein, protein-protein, and protein–side effect—from biomedical texts by using text mining and predefined relation-extraction rules. Based on the extracted relationships, we constructed whole drug-protein–side effect paths. For each path, we calculated its ranking score by a new ranking function that combines corpus- and ontology-based semantic similarity as well as co-occurrence frequency. Results: We extracted 13 plausible biomedical paths connecting drugs and their side effects from cancer-related abstracts in the PubMed database. The top 20 paths were examined, and the proposed ranking function outperformed the other methods tested, including co-occurrence, COALS, and UMLS by P@5-P@20. In addition, we confirmed that the paths are novel hypotheses that are worth investigating further. Discussion: The risk of side effects has been an important issue for the US Food and Drug Administration (FDA). However, the causes and mechanisms of such side effects have not been fully elucidated. This study extends previous research on understanding drug side effects by using various techniques such as Named Entity Recognition (NER), Relation Extraction (RE), and semantic similarity. Conclusion: It is not easy to reveal the biomedical mechanisms of side effects due to a huge number of possible paths. However, we automatically generated predictable paths using the proposed approach, which could provide meaningful information to biomedical researchers to generate plausible hypotheses for the understanding of such mechanisms.
Collapse
|
8
|
Lee S, Han J, Park RW, Kim GJ, Rim JH, Cho J, Lee KH, Lee J, Kim S, Kim JH. Development of a Controlled Vocabulary-Based Adverse Drug Reaction Signal Dictionary for Multicenter Electronic Health Record-Based Pharmacovigilance. Drug Saf 2019; 42:657-670. [PMID: 30649749 DOI: 10.1007/s40264-018-0767-7] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/06/2023]
Affiliation(s)
- Suehyun Lee
- Division of Biomedical Informatics, Seoul National University Biomedical Informatics (SNUBI), Seoul National University College of Medicine, Seoul, Korea
- Department of Biomedical Informatics, College of Medicine, Konyang University, Daejeon, Korea
| | - Jongsoo Han
- Division of Biomedical Informatics, Seoul National University Biomedical Informatics (SNUBI), Seoul National University College of Medicine, Seoul, Korea
| | - Rae Woong Park
- Department of Biomedical Informatics, Ajou University School of Medicine, Suwon, Korea
| | - Grace Juyun Kim
- Division of Biomedical Informatics, Seoul National University Biomedical Informatics (SNUBI), Seoul National University College of Medicine, Seoul, Korea
| | - John Hoon Rim
- Department of Laboratory Medicine, Severance Hospital, Yonsei University College of Medicine, Seoul, Korea
- Physician-Scientist Program, Department of Medicine, Yonsei University Graduate School of Medicine, Seoul, Korea
- Department of Pharmacology, Yonsei University College of Medicine, Seoul, Korea
| | - Jooyoung Cho
- Physician-Scientist Program, Department of Medicine, Yonsei University Graduate School of Medicine, Seoul, Korea
- Department of Laboratory Medicine, Wonju Severance Christian Hospital, Yonsei University Wonju College of Medicine, Wonju, Korea
| | - Kye Hwa Lee
- Division of Biomedical Informatics, Seoul National University Biomedical Informatics (SNUBI), Seoul National University College of Medicine, Seoul, Korea
- Precision Medicine Center, Seoul National University Hospital, Seoul, Korea
| | - Jisan Lee
- College of Nursing, Catholic University of Pusan, Busan, Korea
| | - Sujeong Kim
- College of Nursing, Seattle University, Seattle, USA
| | - Ju Han Kim
- Division of Biomedical Informatics, Seoul National University Biomedical Informatics (SNUBI), Seoul National University College of Medicine, Seoul, Korea.
- Precision Medicine Center, Seoul National University Hospital, Seoul, Korea.
| |
Collapse
|
9
|
Abstract
Recent advances in technology have led to the exponential growth of scientific literature in biomedical sciences. This rapid increase in information has surpassed the threshold for manual curation efforts, necessitating the use of text mining approaches in the field of life sciences. One such application of text mining is in fostering in silico drug discovery such as drug target screening, pharmacogenomics, adverse drug event detection, etc. This chapter serves as an introduction to the applications of various text mining approaches in drug discovery. It is divided into two parts with the first half as an overview of text mining in the biosciences. The second half of the chapter reviews strategies and methods for four unique applications of text mining in drug discovery.
Collapse
Affiliation(s)
- Si Zheng
- Institute of Medical Information and Library, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
| | - Shazia Dharssi
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
| | - Meng Wu
- Institute of Medical Information and Library, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
| | - Jiao Li
- Institute of Medical Information and Library, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
| | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA.
| |
Collapse
|
10
|
Li F, Liu W, Yu H. Extraction of Information Related to Adverse Drug Events from Electronic Health Record Notes: Design of an End-to-End Model Based on Deep Learning. JMIR Med Inform 2018; 6:e12159. [PMID: 30478023 PMCID: PMC6288593 DOI: 10.2196/12159] [Citation(s) in RCA: 34] [Impact Index Per Article: 5.7] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/12/2018] [Revised: 10/31/2018] [Accepted: 11/09/2018] [Indexed: 12/26/2022] Open
Abstract
BACKGROUND Pharmacovigilance and drug-safety surveillance are crucial for monitoring adverse drug events (ADEs), but the main ADE-reporting systems such as Food and Drug Administration Adverse Event Reporting System face challenges such as underreporting. Therefore, as complementary surveillance, data on ADEs are extracted from electronic health record (EHR) notes via natural language processing (NLP). As NLP develops, many up-to-date machine-learning techniques are introduced in this field, such as deep learning and multi-task learning (MTL). However, only a few studies have focused on employing such techniques to extract ADEs. OBJECTIVE We aimed to design a deep learning model for extracting ADEs and related information such as medications and indications. Since extraction of ADE-related information includes two steps-named entity recognition and relation extraction-our second objective was to improve the deep learning model using multi-task learning between the two steps. METHODS We employed the dataset from the Medication, Indication and Adverse Drug Events (MADE) 1.0 challenge to train and test our models. This dataset consists of 1089 EHR notes of cancer patients and includes 9 entity types such as Medication, Indication, and ADE and 7 types of relations between these entities. To extract information from the dataset, we proposed a deep-learning model that uses a bidirectional long short-term memory (BiLSTM) conditional random field network to recognize entities and a BiLSTM-Attention network to extract relations. To further improve the deep-learning model, we employed three typical MTL methods, namely, hard parameter sharing, parameter regularization, and task relation learning, to build three MTL models, called HardMTL, RegMTL, and LearnMTL, respectively. RESULTS Since extraction of ADE-related information is a two-step task, the result of the second step (ie, relation extraction) was used to compare all models. We used microaveraged precision, recall, and F1 as evaluation metrics. Our deep learning model achieved state-of-the-art results (F1=65.9%), which is significantly higher than that (F1=61.7%) of the best system in the MADE1.0 challenge. HardMTL further improved the F1 by 0.8%, boosting the F1 to 66.7%, whereas RegMTL and LearnMTL failed to boost the performance. CONCLUSIONS Deep learning models can significantly improve the performance of ADE-related information extraction. MTL may be effective for named entity recognition and relation extraction, but it depends on the methods, data, and other factors. Our results can facilitate research on ADE detection, NLP, and machine learning.
Collapse
Affiliation(s)
- Fei Li
- Department of Computer Science, University of Massachusetts Lowell, Lowell, MA, United States
- Center for Healthcare Organization and Implementation Research, Bedford Veterans Affairs Medical Center, Bedford, MA, United States
- Department of Medicine, University of Massachusetts Medical School, Worcester, MA, United States
| | - Weisong Liu
- Department of Computer Science, University of Massachusetts Lowell, Lowell, MA, United States
- Center for Healthcare Organization and Implementation Research, Bedford Veterans Affairs Medical Center, Bedford, MA, United States
- Department of Medicine, University of Massachusetts Medical School, Worcester, MA, United States
| | - Hong Yu
- Department of Computer Science, University of Massachusetts Lowell, Lowell, MA, United States
- Center for Healthcare Organization and Implementation Research, Bedford Veterans Affairs Medical Center, Bedford, MA, United States
- Department of Medicine, University of Massachusetts Medical School, Worcester, MA, United States
- School of Computer Science, University of Massachusetts, Amherst, MA, United States
| |
Collapse
|
11
|
Li XX, Yin J, Tang J, Li Y, Yang Q, Xiao Z, Zhang R, Wang Y, Hong J, Tao L, Xue W, Zhu F. Determining the Balance Between Drug Efficacy and Safety by the Network and Biological System Profile of Its Therapeutic Target. Front Pharmacol 2018; 9:1245. [PMID: 30429792 PMCID: PMC6220079 DOI: 10.3389/fphar.2018.01245] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/28/2018] [Accepted: 10/12/2018] [Indexed: 12/14/2022] Open
Abstract
One of the most challenging puzzles in drug discovery is the identification and characterization of candidate drug of well-balanced profile between efficacy and safety. So far, extensive efforts have been made to evaluate this balance by estimating the quantitative structure–therapeutic relationship and exploring target profile of adverse drug reaction. Particularly, the therapeutic index (TI) has emerged as a key indicator illustrating this delicate balance, and a clinically successful agent requires a sufficient TI suitable for it corresponding indication. However, the TI information are largely unknown for most drugs, and the mechanism underlying the drugs with narrow TI (NTI drugs) is still elusive. In this study, the collective effects of human protein–protein interaction (PPI) network and biological system profile on the drugs' efficacy–safety balance were systematically evaluated. First, a comprehensive literature review of the FDA approved drugs confirmed their NTI status. Second, a popular feature selection algorithm based on artificial intelligence (AI) was adopted to identify key factors differencing the target mechanism between NTI and non-NTI drugs. Finally, this work revealed that the targets of NTI drugs were highly centralized and connected in human PPI network, and the number of similarity proteins and affiliated signaling pathways of the corresponding targets was much higher than those of non-NTI drugs. These findings together with the newly discovered features or feature groups clarified the key factors indicating drug's narrow TI, and could thus provide a novel direction for determining the delicate drug efficacy-safety balance.
Collapse
Affiliation(s)
- Xiao Xu Li
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China.,School of Pharmaceutical Sciences and Collaborative Innovation Center for Brain Science, Chongqing University, Chongqing, China
| | - Jiayi Yin
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
| | - Jing Tang
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China.,School of Pharmaceutical Sciences and Collaborative Innovation Center for Brain Science, Chongqing University, Chongqing, China
| | - Yinghong Li
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China.,School of Pharmaceutical Sciences and Collaborative Innovation Center for Brain Science, Chongqing University, Chongqing, China
| | - Qingxia Yang
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China.,School of Pharmaceutical Sciences and Collaborative Innovation Center for Brain Science, Chongqing University, Chongqing, China
| | - Ziyu Xiao
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
| | - Runyuan Zhang
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
| | - Yunxia Wang
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
| | - Jiajun Hong
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
| | - Lin Tao
- Key Laboratory of Elemene Class Anti-cancer Chinese Medicine of Zhejiang Province, School of Medicine, Hangzhou Normal University, Hangzhou, China
| | - Weiwei Xue
- School of Pharmaceutical Sciences and Collaborative Innovation Center for Brain Science, Chongqing University, Chongqing, China
| | - Feng Zhu
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China.,School of Pharmaceutical Sciences and Collaborative Innovation Center for Brain Science, Chongqing University, Chongqing, China
| |
Collapse
|
12
|
Wang Q, Xu R. Immunotherapy-related adverse events (irAEs): extraction from FDA drug labels and comparative analysis. JAMIA Open 2018; 2:173-178. [PMID: 30976759 PMCID: PMC6447026 DOI: 10.1093/jamiaopen/ooy045] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/05/2018] [Revised: 09/06/2018] [Accepted: 09/28/2018] [Indexed: 11/26/2022] Open
Abstract
Objectives Immune checkpoint inhibitors (ICIs) have dramatically improved outcomes in cancer patients. However, ICIs are associated with significant immune-related adverse events (irAEs) and the underlying biological mechanisms are not well-understood. To ensure safe cancer treatment, research efforts are needed to comprehensively detect and understand irAEs. Materials and methods We manually extracted and standardized irAEs from The U.S Food and Drug Administration (FDA) drug labels for six FDA-approved ICIs. We compared irAE profile similarities among ICIs and 1507 FDA-approved non-ICI drugs. We investigated how irAEs have differential effects on human organs by classifying irAEs based on their targeted organ systems. Finally, we identified broad-spectrum (nontarget-specific) and narrow-spectrum (target-specific) irAEs. Results A total of 893 irAEs were extracted. 31.4% irAEs were shared among ICIs as compared to the 8.0% between ICIs and non-ICI drugs. irAEs were resulted from both on- and off-target effects: irAE profiles were more similar for ICIs with same target than different targets, demonstrating the on-target effects; irAE profile similarity for ICIs with the same target is less than 50%, demonstrating unknown off-target effects. ICIs significantly target many organ systems, including endocrine system (3.4-fold enrichment), metabolism (3.7-fold enrichment), immune system (3.6-fold enrichment), and autoimmune system (4.8-fold enrichment). We identified 21 broad-spectrum irAEs shared among all ICIs, 20 irAEs specific for PD-L1/PD-1 inhibition, and 28 irAEs specific for CTLA-4 inhibition. Discussion and conclusion Our study presents the first effort toward building a standardized database of irAEs. The extracted irAEs can serve as the gold standard for automatic irAE extractions from other data resources and set a foundation to understand biological mechanisms of irAEs.
Collapse
Affiliation(s)
- QuanQiu Wang
- Department of Population and Quantitative Health Sciences, School of Medicine, Case Western Reserve University, Cleveland, Ohio, USA
| | - Rong Xu
- Department of Population and Quantitative Health Sciences, School of Medicine, Case Western Reserve University, Cleveland, Ohio, USA
| |
Collapse
|
13
|
Onye SC, Akkeleş A, Dimililer N. relSCAN - A system for extracting chemical-induced disease relation from biomedical literature. J Biomed Inform 2018; 87:79-87. [PMID: 30296491 DOI: 10.1016/j.jbi.2018.09.018] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/01/2018] [Revised: 09/17/2018] [Accepted: 09/30/2018] [Indexed: 11/20/2022]
Abstract
This paper proposes an effective and robust approach for Chemical-Induced Disease (CID) relation extraction from PubMed articles. The study was performed on the Chemical Disease Relation (CDR) task of BioCreative V track-3 corpus. The proposed system, named relSCAN, is an efficient CID relation extraction system with two phases to classify relation instances from the Co-occurrence and Non-Co-occurrence mention levels. We describe the case of chemical and disease mentions that occur in the same sentence as 'Co-occurrence', or as 'Non-Co-occurrence' otherwise. In the first phase, the relation instances are constructed on both mention levels. In the second phase, we employ a hybrid feature set to classify the relation instances at both of these mention levels using the combination of two Machine Learning (ML) classifiers (Support Vector Machine (SVM) and J48 Decision tree). This system is entirely corpus dependent and does not rely on information from external resources in order to boost its performance. We achieved good results, which are comparable with the other state-of-the-art CID relation extraction systems on the BioCreative V corpus. Furthermore, our system achieves the best performance on the Non-Co-occurrence mention level.
Collapse
Affiliation(s)
- Stanley Chika Onye
- Department of Applied Mathematics and Computer Science, Faculty of Arts & Sciences, Eastern Mediterranean University, Famagusta, North Cyprus via Mersin 10, Turkey.
| | - Arif Akkeleş
- Department of Mathematics, Faculty of Arts & Sciences, Eastern Mediterranean University, Famagusta, North Cyprus via Mersin 10, Turkey
| | - Nazife Dimililer
- Department of Information Technology, School of Computing and Technology, Eastern Mediterranean University, Famagusta, North Cyprus via Mersin 10, Turkey
| |
Collapse
|
14
|
Li H, Yang M, Chen Q, Tang B, Wang X, Yan J. Chemical-induced disease extraction via recurrent piecewise convolutional neural networks. BMC Med Inform Decis Mak 2018; 18:60. [PMID: 30066652 PMCID: PMC6069297 DOI: 10.1186/s12911-018-0629-3] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/04/2023] Open
Abstract
BACKGROUND Extracting relationships between chemicals and diseases from unstructured literature have attracted plenty of attention since the relationships are very useful for a large number of biomedical applications such as drug repositioning and pharmacovigilance. A number of machine learning methods have been proposed for chemical-induced disease (CID) extraction due to some publicly available annotated corpora. Most of them suffer from time-consuming feature engineering except deep learning methods. In this paper, we propose a novel document-level deep learning method, called recurrent piecewise convolutional neural networks (RPCNN), for CID extraction. RESULTS Experimental results on a benchmark dataset, the CDR (Chemical-induced Disease Relation) dataset of the BioCreative V challenge for CID extraction show that the highest precision, recall and F-score of our RPCNN-based CID extraction system are 65.24, 77.21 and 70.77%, which is competitive with other state-of-the-art systems. CONCLUSIONS A novel deep learning method is proposed for document-level CID extraction, where domain knowledge, piecewise strategy, attention mechanism, and multi-instance learning are combined together. The effectiveness of the method is proved by experiments conducted on a benchmark dataset.
Collapse
Affiliation(s)
- Haodi Li
- Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology, Shenzhen, Guangdong, China.,Shenzhen Calligraphy Digital Simulation Technology Engineering Laboratory, Harbin Institute of Technology, Shenzhen, Guangdong, China
| | - Ming Yang
- Pharmacy Department, Shenzhen Second People's Hospital, First Affiliated Hospital of Shenzhen University, Guandong, Shenzhen, China
| | - Qingcai Chen
- Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology, Shenzhen, Guangdong, China. .,Shenzhen Calligraphy Digital Simulation Technology Engineering Laboratory, Harbin Institute of Technology, Shenzhen, Guangdong, China.
| | - Buzhou Tang
- Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology, Shenzhen, Guangdong, China. .,Shenzhen Calligraphy Digital Simulation Technology Engineering Laboratory, Harbin Institute of Technology, Shenzhen, Guangdong, China.
| | - Xiaolong Wang
- Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology, Shenzhen, Guangdong, China.,Shenzhen Calligraphy Digital Simulation Technology Engineering Laboratory, Harbin Institute of Technology, Shenzhen, Guangdong, China
| | - Jun Yan
- Yidu Cloud (Beijing) Technology Co., Ltd, Beijing, China
| |
Collapse
|
15
|
Taewijit S, Theeramunkong T, Ikeda M. Distant Supervision with Transductive Learning for Adverse Drug Reaction Identification from Electronic Medical Records. JOURNAL OF HEALTHCARE ENGINEERING 2017; 2017:7575280. [PMID: 29090077 PMCID: PMC5635478 DOI: 10.1155/2017/7575280] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 05/05/2017] [Accepted: 07/19/2017] [Indexed: 11/17/2022]
Abstract
Information extraction and knowledge discovery regarding adverse drug reaction (ADR) from large-scale clinical texts are very useful and needy processes. Two major difficulties of this task are the lack of domain experts for labeling examples and intractable processing of unstructured clinical texts. Even though most previous works have been conducted on these issues by applying semisupervised learning for the former and a word-based approach for the latter, they face with complexity in an acquisition of initial labeled data and ignorance of structured sequence of natural language. In this study, we propose automatic data labeling by distant supervision where knowledge bases are exploited to assign an entity-level relation label for each drug-event pair in texts, and then, we use patterns for characterizing ADR relation. The multiple-instance learning with expectation-maximization method is employed to estimate model parameters. The method applies transductive learning to iteratively reassign a probability of unknown drug-event pair at the training time. By investigating experiments with 50,998 discharge summaries, we evaluate our method by varying large number of parameters, that is, pattern types, pattern-weighting models, and initial and iterative weightings of relations for unlabeled data. Based on evaluations, our proposed method outperforms the word-based feature for NB-EM (iEM), MILR, and TSVM with F1 score of 11.3%, 9.3%, and 6.5% improvement, respectively.
Collapse
Affiliation(s)
- Siriwon Taewijit
- The School of Information, Communication and Computer Technologies, Sirindhorn International Institute of Technology, Thammasat University, Pathum Thani 12120, Thailand
- The School of Knowledge Science, Japan Advanced Institute of Science and Technology, Nomi 923-1292, Japan
| | - Thanaruk Theeramunkong
- The School of Information, Communication and Computer Technologies, Sirindhorn International Institute of Technology, Thammasat University, Pathum Thani 12120, Thailand
| | - Mitsuru Ikeda
- The School of Knowledge Science, Japan Advanced Institute of Science and Technology, Nomi 923-1292, Japan
| |
Collapse
|
16
|
Peng Y, Wei CH, Lu Z. Improving chemical disease relation extraction with rich features and weakly labeled data. J Cheminform 2016; 8:53. [PMID: 28316651 PMCID: PMC5054544 DOI: 10.1186/s13321-016-0165-z] [Citation(s) in RCA: 41] [Impact Index Per Article: 5.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/22/2016] [Accepted: 09/28/2016] [Indexed: 01/08/2023] Open
Abstract
Background Due to the importance of identifying relations between chemicals and diseases for new drug discovery and improving chemical safety, there has been a growing interest in developing automatic relation extraction systems for capturing these relations from the rich and rapid-growing biomedical literature. In this work we aim to build on current advances in named entity recognition and a recent BioCreative effort to further improve the state of the art in biomedical relation extraction, in particular for the chemical-induced disease (CID) relations. Results We propose a rich-feature approach with Support Vector Machine to aid in the extraction of CIDs from PubMed articles. Our feature vector includes novel statistical features, linguistic knowledge, and domain resources. We also incorporate the output of a rule-based system as features, thus combining the advantages of rule- and machine learning-based systems. Furthermore, we augment our approach with automatically generated labeled text from an existing knowledge base to improve performance without additional cost for corpus construction. To evaluate our system, we perform experiments on the human-annotated BioCreative V benchmarking dataset and compare with previous results. When trained using only BioCreative V training and development sets, our system achieves an F-score of 57.51 %, which already compares favorably to previous methods. Our system performance was further improved to 61.01 % in F-score when augmented with additional automatically generated weakly labeled data. Conclusions Our text-mining approach demonstrates state-of-the-art performance in disease-chemical relation extraction. More importantly, this work exemplifies the use of (freely available) curated document-level annotations in existing biomedical databases, which are largely overlooked in text-mining system development.
Collapse
Affiliation(s)
- Yifan Peng
- National Center for Biotechnology Information, Bethesda, MD 20894 USA ; Computer and Information Sciences, University of Delaware, Newark, DE 19716 USA
| | - Chih-Hsuan Wei
- National Center for Biotechnology Information, Bethesda, MD 20894 USA
| | - Zhiyong Lu
- National Center for Biotechnology Information, Bethesda, MD 20894 USA
| |
Collapse
|
17
|
Bravo À, Li TS, Su AI, Good BM, Furlong LI. Combining machine learning, crowdsourcing and expert knowledge to detect chemical-induced diseases in text. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2016; 2016:baw094. [PMID: 27307137 PMCID: PMC4908671 DOI: 10.1093/database/baw094] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 12/04/2015] [Accepted: 05/10/2016] [Indexed: 01/13/2023]
Abstract
Drug toxicity is a major concern for both regulatory agencies and the pharmaceutical industry. In this context, text-mining methods for the identification of drug side effects from free text are key for the development of up-to-date knowledge sources on drug adverse reactions. We present a new system for identification of drug side effects from the literature that combines three approaches: machine learning, rule- and knowledge-based approaches. This system has been developed to address the Task 3.B of Biocreative V challenge (BC5) dealing with Chemical-induced Disease (CID) relations. The first two approaches focus on identifying relations at the sentence-level, while the knowledge-based approach is applied both at sentence and abstract levels. The machine learning method is based on the BeFree system using two corpora as training data: the annotated data provided by the CID task organizers and a new CID corpus developed by crowdsourcing. Different combinations of results from the three strategies were selected for each run of the challenge. In the final evaluation setting, the system achieved the highest Recall of the challenge (63%). By performing an error analysis, we identified the main causes of misclassifications and areas for improving of our system, and highlighted the need of consistent gold standard data sets for advancing the state of the art in text mining of drug side effects. Database URL: https://zenodo.org/record/29887?ln¼en#.VsL3yDLWR_V
Collapse
Affiliation(s)
- Àlex Bravo
- Research Programme on Biomedical Informatics (GRIB), IMIM, UPF, Barcelona, Spain and
| | - Tong Shu Li
- Department of Molecular and Experimental Medicine, the Scripps Research Institute, La Jolla, CA, USA
| | - Andrew I Su
- Department of Molecular and Experimental Medicine, the Scripps Research Institute, La Jolla, CA, USA
| | - Benjamin M Good
- Department of Molecular and Experimental Medicine, the Scripps Research Institute, La Jolla, CA, USA
| | - Laura I Furlong
- Research Programme on Biomedical Informatics (GRIB), IMIM, UPF, Barcelona, Spain and
| |
Collapse
|
18
|
Li H, Tang B, Chen Q, Chen K, Wang X, Wang B, Wang Z. HITSZ_CDR: an end-to-end chemical and disease relation extraction system for BioCreative V. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2016; 2016:baw077. [PMID: 27270713 PMCID: PMC4911788 DOI: 10.1093/database/baw077] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 12/04/2015] [Accepted: 04/21/2016] [Indexed: 11/12/2022]
Abstract
In this article, an end-to-end system was proposed for the challenge task of disease named entity recognition (DNER) and chemical-induced disease (CID) relation extraction in BioCreative V, where DNER includes disease mention recognition (DMR) and normalization (DN). Evaluation on the challenge corpus showed that our system achieved the highest F1-scores 86.93% on DMR, 84.11% on DN, 43.04% on CID relation extraction, respectively. The F1-score on DMR is higher than our previous one reported by the challenge organizers (86.76%), the highest F1-score of the challenge.Database URL: http://database.oxfordjournals.org/content/2016/baw077.
Collapse
Affiliation(s)
- Haodi Li
- Intelligent Computing Research Center, Harbin Institute of Technology Shenzhen School, China
| | - Buzhou Tang
- Intelligent Computing Research Center, Harbin Institute of Technology Shenzhen School, China Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, China
| | - Qingcai Chen
- Intelligent Computing Research Center, Harbin Institute of Technology Shenzhen School, China
| | - Kai Chen
- Intelligent Computing Research Center, Harbin Institute of Technology Shenzhen School, China
| | - Xiaolong Wang
- Intelligent Computing Research Center, Harbin Institute of Technology Shenzhen School, China
| | - Baohua Wang
- College of Mathematics and statistics, Shenzhen University, China
| | - Zhe Wang
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, China
| |
Collapse
|
19
|
Li TS, Bravo À, Furlong LI, Good BM, Su AI. A crowdsourcing workflow for extracting chemical-induced disease relations from free text. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2016; 2016:baw051. [PMID: 27087308 PMCID: PMC4834205 DOI: 10.1093/database/baw051] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 12/05/2015] [Accepted: 03/17/2016] [Indexed: 01/05/2023]
Abstract
Relations between chemicals and diseases are one of the most queried biomedical interactions. Although expert manual curation is the standard method for extracting these relations from the literature, it is expensive and impractical to apply to large numbers of documents, and therefore alternative methods are required. We describe here a crowdsourcing workflow for extracting chemical-induced disease relations from free text as part of the BioCreative V Chemical Disease Relation challenge. Five non-expert workers on the CrowdFlower platform were shown each potential chemical-induced disease relation highlighted in the original source text and asked to make binary judgments about whether the text supported the relation. Worker responses were aggregated through voting, and relations receiving four or more votes were predicted as true. On the official evaluation dataset of 500 PubMed abstracts, the crowd attained a 0.505 F-score (0.475 precision, 0.540 recall), with a maximum theoretical recall of 0.751 due to errors with named entity recognition. The total crowdsourcing cost was $1290.67 ($2.58 per abstract) and took a total of 7 h. A qualitative error analysis revealed that 46.66% of sampled errors were due to task limitations and gold standard errors, indicating that performance can still be improved. All code and results are publicly available at https://github.com/SuLab/crowd_cid_relex Database URL: https://github.com/SuLab/crowd_cid_relex
Collapse
Affiliation(s)
- Tong Shu Li
- Department of Molecular and Experimental Medicine, the Scripps Research Institute, La Jolla, CA 92037, USA
| | - Àlex Bravo
- Research Programme on Biomedical Informatics (GRIB), IMIM, UPF, Barcelona, Spain
| | - Laura I Furlong
- Research Programme on Biomedical Informatics (GRIB), IMIM, UPF, Barcelona, Spain
| | - Benjamin M Good
- Department of Molecular and Experimental Medicine, the Scripps Research Institute, La Jolla, CA 92037, USA
| | - Andrew I Su
- Department of Molecular and Experimental Medicine, the Scripps Research Institute, La Jolla, CA 92037, USA
| |
Collapse
|
20
|
Gu J, Qian L, Zhou G. Chemical-induced disease relation extraction with various linguistic features. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2016; 2016:baw042. [PMID: 27052618 PMCID: PMC4822558 DOI: 10.1093/database/baw042] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 12/04/2015] [Accepted: 03/04/2016] [Indexed: 01/06/2023]
Abstract
Understanding the relations between chemicals and diseases is crucial in various biomedical tasks such as new drug discoveries and new therapy developments. While manually mining these relations from the biomedical literature is costly and time-consuming, such a procedure is often difficult to keep up-to-date. To address these issues, the BioCreative-V community proposed a challenging task of automatic extraction of chemical-induced disease (CID) relations in order to benefit biocuration. This article describes our work on the CID relation extraction task on the BioCreative-V tasks. We built a machine learning based system that utilized simple yet effective linguistic features to extract relations with maximum entropy models. In addition to leveraging various features, the hypernym relations between entity concepts derived from the Medical Subject Headings (MeSH)-controlled vocabulary were also employed during both training and testing stages to obtain more accurate classification models and better extraction performance, respectively. We demoted relation extraction between entities in documents to relation extraction between entity mentions. In our system, pairs of chemical and disease mentions at both intra- and inter-sentence levels were first constructed as relation instances for training and testing, then two classification models at both levels were trained from the training examples and applied to the testing examples. Finally, we merged the classification results from mention level to document level to acquire final relations between chemicals and diseases. Our system achieved promising F-scores of 60.4% on the development dataset and 58.3% on the test dataset using gold-standard entity annotations, respectively. Database URL: https://github.com/JHnlp/BC5CIDTask
Collapse
Affiliation(s)
- Jinghang Gu
- Natural Language Processing Lab, School of Computer Science and Technology, Soochow University, 1 Shizi Street, Suzhou, China, 215006
| | - Longhua Qian
- Natural Language Processing Lab, School of Computer Science and Technology, Soochow University, 1 Shizi Street, Suzhou, China, 215006
| | - Guodong Zhou
- Natural Language Processing Lab, School of Computer Science and Technology, Soochow University, 1 Shizi Street, Suzhou, China, 215006
| |
Collapse
|
21
|
Huang CC, Lu Z. Discovering biomedical semantic relations in PubMed queries for information retrieval and database curation. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2016; 2016:baw025. [PMID: 27016698 PMCID: PMC4808250 DOI: 10.1093/database/baw025] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 11/06/2015] [Accepted: 02/14/2016] [Indexed: 01/28/2023]
Abstract
Identifying relevant papers from the literature is a common task in biocuration. Most current biomedical literature search systems primarily rely on matching user keywords. Semantic search, on the other hand, seeks to improve search accuracy by understanding the entities and contextual relations in user keywords. However, past research has mostly focused on semantically identifying biological entities (e.g. chemicals, diseases and genes) with little effort on discovering semantic relations. In this work, we aim to discover biomedical semantic relations in PubMed queries in an automated and unsupervised fashion. Specifically, we focus on extracting and understanding the contextual information (or context patterns) that is used by PubMed users to represent semantic relations between entities such as 'CHEMICAL-1 compared to CHEMICAL-2' With the advances in automatic named entity recognition, we first tag entities in PubMed queries and then use tagged entities as knowledge to recognize pattern semantics. More specifically, we transform PubMed queries into context patterns involving participating entities, which are subsequently projected to latent topics via latent semantic analysis (LSA) to avoid the data sparseness and specificity issues. Finally, we mine semantically similar contextual patterns or semantic relations based on LSA topic distributions. Our two separate evaluation experiments of chemical-chemical (CC) and chemical-disease (CD) relations show that the proposed approach significantly outperforms a baseline method, which simply measures pattern semantics by similarity in participating entities. The highest performance achieved by our approach is nearly 0.9 and 0.85 respectively for the CC and CD task when compared against the ground truth in terms of normalized discounted cumulative gain (nDCG), a standard measure of ranking quality. These results suggest that our approach can effectively identify and return related semantic patterns in a ranked order covering diverse bio-entity relations. To assess the potential utility of our automated top-ranked patterns of a given relation in semantic search, we performed a pilot study on frequently sought semantic relations in PubMed and observed improved literature retrieval effectiveness based on post-hoc human relevance evaluation. Further investigation in larger tests and in real-world scenarios is warranted.
Collapse
Affiliation(s)
- Chung-Chi Huang
- National Center for Biotechnology Information (NCBI), National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
| |
Collapse
|
22
|
Xu J, Wu Y, Zhang Y, Wang J, Lee HJ, Xu H. CD-REST: a system for extracting chemical-induced disease relation in literature. Database (Oxford) 2016; 2016:baw036. [PMID: 27016700 PMCID: PMC4808251 DOI: 10.1093/database/baw036] [Citation(s) in RCA: 43] [Impact Index Per Article: 5.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/04/2015] [Revised: 02/04/2016] [Accepted: 03/01/2016] [Indexed: 11/24/2022]
Abstract
Mining chemical-induced disease relations embedded in the vast biomedical literature could facilitate a wide range of computational biomedical applications, such as pharmacovigilance. The BioCreative V organized a Chemical Disease Relation (CDR) Track regarding chemical-induced disease relation extraction from biomedical literature in 2015. We participated in all subtasks of this challenge. In this article, we present our participation system Chemical Disease Relation Extraction SysTem (CD-REST), an end-to-end system for extracting chemical-induced disease relations in biomedical literature. CD-REST consists of two main components: (1) a chemical and disease named entity recognition and normalization module, which employs the Conditional Random Fields algorithm for entity recognition and a Vector Space Model-based approach for normalization; and (2) a relation extraction module that classifies both sentence-level and document-level candidate drug-disease pairs by support vector machines. Our system achieved the best performance on the chemical-induced disease relation extraction subtask in the BioCreative V CDR Track, demonstrating the effectiveness of our proposed machine learning-based approaches for automatic extraction of chemical-induced disease relations in biomedical literature. The CD-REST system provides web services using HTTP POST request. The web services can be accessed fromhttp://clinicalnlptool.com/cdr The online CD-REST demonstration system is available athttp://clinicalnlptool.com/cdr/cdr.html. Database URL:http://clinicalnlptool.com/cdr;http://clinicalnlptool.com/cdr/cdr.html.
Collapse
Affiliation(s)
- Jun Xu
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
| | - Yonghui Wu
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
| | - Yaoyun Zhang
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
| | - Jingqi Wang
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
| | - Hee-Jin Lee
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
| | - Hua Xu
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
| |
Collapse
|
23
|
Wei CH, Peng Y, Leaman R, Davis AP, Mattingly CJ, Li J, Wiegers TC, Lu Z. Assessing the state of the art in biomedical relation extraction: overview of the BioCreative V chemical-disease relation (CDR) task. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2016; 2016:baw032. [PMID: 26994911 PMCID: PMC4799720 DOI: 10.1093/database/baw032] [Citation(s) in RCA: 94] [Impact Index Per Article: 11.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 11/23/2015] [Accepted: 02/25/2016] [Indexed: 11/14/2022]
Abstract
Manually curating chemicals, diseases and their relationships is significantly important to biomedical research, but it is plagued by its high cost and the rapid growth of the biomedical literature. In recent years, there has been a growing interest in developing computational approaches for automatic chemical-disease relation (CDR) extraction. Despite these attempts, the lack of a comprehensive benchmarking dataset has limited the comparison of different techniques in order to assess and advance the current state-of-the-art. To this end, we organized a challenge task through BioCreative V to automatically extract CDRs from the literature. We designed two challenge tasks: disease named entity recognition (DNER) and chemical-induced disease (CID) relation extraction. To assist system development and assessment, we created a large annotated text corpus that consisted of human annotations of chemicals, diseases and their interactions from 1500 PubMed articles. 34 teams worldwide participated in the CDR task: 16 (DNER) and 18 (CID). The best systems achieved an F-score of 86.46% for the DNER task—a result that approaches the human inter-annotator agreement (0.8875)—and an F-score of 57.03% for the CID task, the highest results ever reported for such tasks. When combining team results via machine learning, the ensemble system was able to further improve over the best team results by achieving 88.89% and 62.80% in F-score for the DNER and CID task, respectively. Additionally, another novel aspect of our evaluation is to test each participating system’s ability to return real-time results: the average response time for each team’s DNER and CID web service systems were 5.6 and 9.3 s, respectively. Most teams used hybrid systems for their submissions based on machining learning. Given the level of participation and results, we found our task to be successful in engaging the text-mining research community, producing a large annotated corpus and improving the results of automatic disease recognition and CDR extraction. Database URL:http://www.biocreative.org/tasks/biocreative-v/track-3-cdr/
Collapse
Affiliation(s)
- Chih-Hsuan Wei
- National Center for Biotechnology Information, Bethesda, MD 20894, USA
| | - Yifan Peng
- National Center for Biotechnology Information, Bethesda, MD 20894, USA Department of Computer and Information Sciences, University of Delaware, Newark, DE 19716, USA
| | - Robert Leaman
- National Center for Biotechnology Information, Bethesda, MD 20894, USA
| | - Allan Peter Davis
- Department of Biological Sciences and the Center for Human Health and the Environment, North Carolina State University, Raleigh, NC 27695, USA and
| | - Carolyn J Mattingly
- Department of Biological Sciences and the Center for Human Health and the Environment, North Carolina State University, Raleigh, NC 27695, USA and
| | - Jiao Li
- Institute of Medical Information, Chinese Academy of Medical Sciences, Beijing 100700, China
| | - Thomas C Wiegers
- Department of Biological Sciences and the Center for Human Health and the Environment, North Carolina State University, Raleigh, NC 27695, USA and
| | - Zhiyong Lu
- National Center for Biotechnology Information, Bethesda, MD 20894, USA
| |
Collapse
|
24
|
Kuhn M, Letunic I, Jensen LJ, Bork P. The SIDER database of drugs and side effects. Nucleic Acids Res 2015; 44:D1075-9. [PMID: 26481350 PMCID: PMC4702794 DOI: 10.1093/nar/gkv1075] [Citation(s) in RCA: 647] [Impact Index Per Article: 71.9] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/14/2015] [Accepted: 10/06/2015] [Indexed: 01/10/2023] Open
Abstract
Unwanted side effects of drugs are a burden on patients and a severe impediment in the development of new drugs. At the same time, adverse drug reactions (ADRs) recorded during clinical trials are an important source of human phenotypic data. It is therefore essential to combine data on drugs, targets and side effects into a more complete picture of the therapeutic mechanism of actions of drugs and the ways in which they cause adverse reactions. To this end, we have created the SIDER (‘Side Effect Resource’, http://sideeffects.embl.de) database of drugs and ADRs. The current release, SIDER 4, contains data on 1430 drugs, 5880 ADRs and 140 064 drug–ADR pairs, which is an increase of 40% compared to the previous version. For more fine-grained analyses, we extracted the frequency with which side effects occur from the package inserts. This information is available for 39% of drug–ADR pairs, 19% of which can be compared to the frequency under placebo treatment. SIDER furthermore contains a data set of drug indications, extracted from the package inserts using Natural Language Processing. These drug indications are used to reduce the rate of false positives by identifying medical terms that do not correspond to ADRs.
Collapse
Affiliation(s)
- Michael Kuhn
- Max Planck Institute of Molecular Cell Biology and Genetics, Pfotenhauerstr. 108, 01307 Dresden, Germany
| | - Ivica Letunic
- Biobyte solutions GmbH, Bothestr. 142, 69117 Heidelberg, Germany
| | - Lars Juhl Jensen
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, 2200 Copenhagen N, Denmark
| | - Peer Bork
- European Molecular Biology Laboratory, Structural and Computational Biology Unit, Molecular Medicine Partnership Unit, Meyerhofstrasse 1, 69117 Heidelberg, Germany Max-Delbrück-Centre for Molecular Medicine, Robert-Rössle-Strasse 10, 13092 Berlin, Germany
| |
Collapse
|
25
|
In silico assessment of adverse drug reactions and associated mechanisms. Drug Discov Today 2015; 21:58-71. [PMID: 26272036 DOI: 10.1016/j.drudis.2015.07.018] [Citation(s) in RCA: 39] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/13/2015] [Revised: 07/15/2015] [Accepted: 07/31/2015] [Indexed: 12/31/2022]
Abstract
During recent years, various in silico approaches have been developed to estimate chemical and biological drug features, for example chemical fragments, protein targets, pathways, among others, that correlate with adverse drug reactions (ADRs) and explain the associated mechanisms. These features have also been used for the creation of predictive models that enable estimation of ADRs during the early stages of drug development. In this review, we discuss various in silico approaches to predict these features for a certain drug, estimate correlations with ADRs, establish causal relationships between selected features and ADR mechanisms and create corresponding predictive models.
Collapse
|
26
|
Xu R, Wang Q. Large-scale automatic extraction of side effects associated with targeted anticancer drugs from full-text oncological articles. J Biomed Inform 2015; 55:64-72. [PMID: 25817969 DOI: 10.1016/j.jbi.2015.03.009] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/29/2014] [Revised: 02/12/2015] [Accepted: 03/20/2015] [Indexed: 10/23/2022]
Abstract
Targeted anticancer drugs such as imatinib, trastuzumab and erlotinib dramatically improved treatment outcomes in cancer patients, however, these innovative agents are often associated with unexpected side effects. The pathophysiological mechanisms underlying these side effects are not well understood. The availability of a comprehensive knowledge base of side effects associated with targeted anticancer drugs has the potential to illuminate complex pathways underlying toxicities induced by these innovative drugs. While side effect association knowledge for targeted drugs exists in multiple heterogeneous data sources, published full-text oncological articles represent an important source of pivotal, investigational, and even failed trials in a variety of patient populations. In this study, we present an automatic process to extract targeted anticancer drug-associated side effects (drug-SE pairs) from a large number of high profile full-text oncological articles. We downloaded 13,855 full-text articles from the Journal of Oncology (JCO) published between 1983 and 2013. We developed text classification, relationship extraction, signaling filtering, and signal prioritization algorithms to extract drug-SE pairs from downloaded articles. We extracted a total of 26,264 drug-SE pairs with an average precision of 0.405, a recall of 0.899, and an F1 score of 0.465. We show that side effect knowledge from JCO articles is largely complementary to that from the US Food and Drug Administration (FDA) drug labels. Through integrative correlation analysis, we show that targeted drug-associated side effects positively correlate with their gene targets and disease indications. In conclusion, this unique database that we built from a large number of high-profile oncological articles could facilitate the development of computational models to understand toxic effects associated with targeted anticancer drugs.
Collapse
Affiliation(s)
- Rong Xu
- Medical Informatics Program, Center for Clinical Investigation, Case Western Reserve University, Cleveland, OH 44106, United States.
| | - QuanQiu Wang
- ThinTek, LLC, Palo Alto, CA 94306, United States.
| |
Collapse
|
27
|
Xu R, Wang Q. Comparing a knowledge-driven approach to a supervised machine learning approach in large-scale extraction of drug-side effect relationships from free-text biomedical literature. BMC Bioinformatics 2015; 16 Suppl 5:S6. [PMID: 25860223 PMCID: PMC4402591 DOI: 10.1186/1471-2105-16-s5-s6] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/16/2022] Open
Abstract
BACKGROUND Systems approaches to studying drug-side-effect (drug-SE) associations are emerging as an active research area for both drug target discovery and drug repositioning. However, a comprehensive drug-SE association knowledge base does not exist. In this study, we present a novel knowledge-driven (KD) approach to effectively extract a large number of drug-SE pairs from published biomedical literature. DATA AND METHODS For the text corpus, we used 21,354,075 MEDLINE records (119,085,682 sentences). First, we used known drug-SE associations derived from FDA drug labels as prior knowledge to automatically find SE-related sentences and abstracts. We then extracted a total of 49,575 drug-SE pairs from MEDLINE sentences and 180,454 pairs from abstracts. RESULTS On average, the KD approach has achieved a precision of 0.335, a recall of 0.509, and an F1 of 0.392, which is significantly better than a SVM-based machine learning approach (precision: 0.135, recall: 0.900, F1: 0.233) with a 73.0% increase in F1 score. Through integrative analysis, we demonstrate that the higher-level phenotypic drug-SE relationships reflects lower-level genetic, genomic, and chemical drug mechanisms. In addition, we show that the extracted drug-SE pairs can be directly used in drug repositioning. CONCLUSION In summary, we automatically constructed a large-scale higher-level drug phenotype relationship knowledge, which can have great potential in computational drug discovery.
Collapse
|
28
|
Skolnick J, Gao M, Roy A, Srinivasan B, Zhou H. Implications of the small number of distinct ligand binding pockets in proteins for drug discovery, evolution and biochemical function. Bioorg Med Chem Lett 2015; 25:1163-70. [PMID: 25690787 DOI: 10.1016/j.bmcl.2015.01.059] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/14/2014] [Revised: 01/23/2015] [Accepted: 01/24/2015] [Indexed: 01/05/2023]
Abstract
Coincidence of the properties of ligand binding pockets in native proteins with those in proteins generated by computer simulations without selection for function shows that pockets are a generic protein feature and the number of distinct pockets is small. Similar pockets occur in unrelated protein structures, an observation successfully employed in pocket-based virtual ligand screening. The small number of pockets suggests that off-target interactions among diverse proteins are inherent; kinases, proteases and phosphatases show this prototypical behavior. The ability to repurpose FDA approved drugs is general, and minor side effects cannot be avoided. Finally, the implications to drug discovery are explored.
Collapse
Affiliation(s)
- Jeffrey Skolnick
- Center for the Study of Systems Biology, Georgia Institute of Technology, 250 14th St NW, Atlanta, GA 30318, USA.
| | - Mu Gao
- Center for the Study of Systems Biology, Georgia Institute of Technology, 250 14th St NW, Atlanta, GA 30318, USA
| | - Ambrish Roy
- Center for the Study of Systems Biology, Georgia Institute of Technology, 250 14th St NW, Atlanta, GA 30318, USA
| | - Bharath Srinivasan
- Center for the Study of Systems Biology, Georgia Institute of Technology, 250 14th St NW, Atlanta, GA 30318, USA
| | - Hongyi Zhou
- Center for the Study of Systems Biology, Georgia Institute of Technology, 250 14th St NW, Atlanta, GA 30318, USA
| |
Collapse
|
29
|
Xu R, Wang Q. Combining automatic table classification and relationship extraction in extracting anticancer drug-side effect pairs from full-text articles. J Biomed Inform 2014; 53:128-35. [PMID: 25445920 DOI: 10.1016/j.jbi.2014.10.002] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/30/2014] [Revised: 08/30/2014] [Accepted: 10/03/2014] [Indexed: 01/09/2023]
Abstract
Anticancer drug-associated side effect knowledge often exists in multiple heterogeneous and complementary data sources. A comprehensive anticancer drug-side effect (drug-SE) relationship knowledge base is important for computation-based drug target discovery, drug toxicity predication and drug repositioning. In this study, we present a two-step approach by combining table classification and relationship extraction to extract drug-SE pairs from a large number of high-profile oncological full-text articles. The data consists of 31,255 tables downloaded from the Journal of Oncology (JCO). We first trained a statistical classifier to classify tables into SE-related and -unrelated categories. We then extracted drug-SE pairs from SE-related tables. We compared drug side effect knowledge extracted from JCO tables to that derived from FDA drug labels. Finally, we systematically analyzed relationships between anti-cancer drug-associated side effects and drug-associated gene targets, metabolism genes, and disease indications. The statistical table classifier is effective in classifying tables into SE-related and -unrelated (precision: 0.711; recall: 0.941; F1: 0.810). We extracted a total of 26,918 drug-SE pairs from SE-related tables with a precision of 0.605, a recall of 0.460, and a F1 of 0.520. Drug-SE pairs extracted from JCO tables is largely complementary to those derived from FDA drug labels; as many as 84.7% of the pairs extracted from JCO tables have not been included a side effect database constructed from FDA drug labels. Side effects associated with anticancer drugs positively correlate with drug target genes, drug metabolism genes, and disease indications.
Collapse
Affiliation(s)
- Rong Xu
- Medical Informatics Program, Center for Clinical Investigation, Case Western Reserve University, Cleveland, OH 44106, United States.
| | - QuanQiu Wang
- ThinTek, LLC, Palo Alto, CA 94306, United States.
| |
Collapse
|