1
Zhang Y, Yang Z, Yang Y, Lin H, Wang J. Location-enhanced syntactic knowledge for biomedical relation extraction. J Biomed Inform 2024; 156:104676. [PMID: 38876451 DOI: 10.1016/j.jbi.2024.104676] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/05/2024] [Revised: 06/08/2024] [Accepted: 06/10/2024] [Indexed: 06/16/2024]
Abstract
Biomedical relation extraction has long been considered a challenging task due to the specialization and complexity of biomedical texts. Syntactic knowledge has been widely employed in existing research to enhance relation extraction, providing guidance for the semantic understanding and text representation of models. However, the utilization of syntactic knowledge in most studies is not exhaustive, and there is often a lack of fine-grained noise reduction, leading to confusion in relation classification. In this paper, we propose an attention generator that comprehensively considers both syntactic dependency type information and syntactic position information to distinguish the importance of different dependency connections. Additionally, we integrate positional information, dependency type information, and word representations together to introduce location-enhanced syntactic knowledge for guiding our biomedical relation extraction. In experiments on three widely used English benchmark datasets in the biomedical domain, our approach consistently outperforms a range of baseline models, demonstrating that it not only makes full use of syntactic knowledge but also effectively reduces the impact of noisy words.
Affiliation(s)
- Yan Zhang
  - School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China.
- Zhihao Yang
  - School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China.
- Yumeng Yang
  - School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China.
- Hongfei Lin
  - School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China.
- Jian Wang
  - School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China.
2
Singla A, Khanna R, Kaur M, Kelm K, Zaiane O, Rosenfelt CS, Bui TA, Rezaei N, Nicholas D, Reformat MZ, Majnemer A, Ogourtsova T, Bolduc F. Developing a Chatbot to Support Individuals With Neurodevelopmental Disorders: Tutorial. J Med Internet Res 2024; 26:e50182. [PMID: 38888947 PMCID: PMC11220430 DOI: 10.2196/50182] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/22/2023] [Revised: 07/27/2023] [Accepted: 04/19/2024] [Indexed: 06/20/2024] Open
Abstract
Families of individuals with neurodevelopmental disabilities or differences (NDDs) often struggle to find reliable health information on the web. NDDs encompass various conditions affecting up to 14% of children in high-income countries, and most individuals present with complex phenotypes and related conditions. It is challenging for their families to develop literacy solely by searching information on the internet. While in-person coaching can enhance care, it is only available to a minority of those with NDDs. Chatbots, or computer programs that simulate conversation, have emerged in the commercial sector as useful tools for answering questions, but their use in health care remains limited. To address this challenge, the researchers developed, in collaboration with individuals with lived experience, a chatbot named CAMI (Coaching Assistant for Medical/Health Information) that can provide information about trusted resources covering core knowledge and services relevant to families of individuals with NDDs. The developers used the Django framework (Django Software Foundation) for the development and used a knowledge graph to depict the key entities in NDDs and their relationships, allowing the chatbot to suggest web resources that may be related to user queries. To identify NDD domain-specific entities from user input, the researchers combined standard sources (the Unified Medical Language System) with additional entities identified by health professionals and collaborators. Although most entities were identified in the text, some were not captured in the system and therefore went undetected. Nonetheless, the chatbot was able to provide resources addressing most user queries related to NDDs.
The researchers found that enriching the vocabulary with synonyms and lay language terms for specific subdomains enhanced entity detection. By using a data set of numerous individuals with NDDs, the researchers developed a knowledge graph that established meaningful connections between entities, allowing the chatbot to present related symptoms, diagnoses, and resources. To the researchers' knowledge, CAMI is the first chatbot to provide resources related to NDDs. The researchers' work highlights the importance of engaging end users to supplement standard generic ontologies with named entities for language recognition. It also demonstrates that complex medical and health-related information can be integrated using knowledge graphs and by leveraging existing large datasets. This has multiple implications: generalizability to other health domains, a reduced need for experts, and optimized use of their input while keeping health care professionals in the loop. The researchers' work also shows how health and computer science domains need to collaborate to achieve the granularity needed to make chatbots truly useful and impactful.
Affiliation(s)
- Ashwani Singla
  - Department of Pediatrics, University of Alberta, Edmonton, AB, Canada
- Ritvik Khanna
  - Department of Pediatrics, University of Alberta, Edmonton, AB, Canada
- Manpreet Kaur
  - Department of Pediatrics, University of Alberta, Edmonton, AB, Canada
- Karen Kelm
  - Department of Pediatrics, University of Alberta, Edmonton, AB, Canada
- Osmar Zaiane
  - Department of Pediatrics, University of Alberta, Edmonton, AB, Canada
- Truong An Bui
  - Department of Pediatrics, University of Alberta, Edmonton, AB, Canada
- Navid Rezaei
  - Department of Pediatrics, University of Alberta, Edmonton, AB, Canada
- David Nicholas
  - Department of Pediatrics, University of Alberta, Edmonton, AB, Canada
- Marek Z Reformat
  - Department of Pediatrics, University of Alberta, Edmonton, AB, Canada
- Annette Majnemer
  - School of Physical & Occupational Therapy, McGill University, Montreal, QC, Canada
- Tatiana Ogourtsova
  - School of Physical & Occupational Therapy, McGill University, Montreal, QC, Canada
- Francois Bolduc
  - Department of Pediatrics, University of Alberta, Edmonton, AB, Canada
3
Gill JK, Chetty M, Lim S, Hallinan J. Large language model based framework for automated extraction of genetic interactions from unstructured data. PLoS One 2024; 19:e0303231. [PMID: 38771886 PMCID: PMC11108146 DOI: 10.1371/journal.pone.0303231] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/26/2023] [Accepted: 04/23/2024] [Indexed: 05/23/2024] Open
Abstract
Extracting biological interactions from published literature helps us understand complex biological systems, accelerate research, and support decision-making in drug or treatment development. Despite efforts to automate the extraction of biological relations using text mining tools and machine learning pipelines, manual curation continues to serve as the gold standard. However, the rapidly increasing volume of literature pertaining to biological relations poses challenges in its manual curation and refinement. These challenges are further compounded because only a small fraction of the published literature is relevant to biological relation extraction, and the embedded sentences of relevant sections have complex structures, which can lead to incorrect inference of relationships. To overcome these challenges, we propose GIX, an automated and robust Gene Interaction Extraction framework, based on pre-trained Large Language models fine-tuned through extensive evaluations on various gene/protein interaction corpora including LLL and RegulonDB. GIX identifies relevant publications with minimal keywords, optimises sentence selection to reduce computational overhead, simplifies sentence structure while preserving meaning, and provides a confidence factor indicating the reliability of extracted relations. GIX's Stage-2 relation extraction method performed well on benchmark protein/gene interaction datasets, assessed using 10-fold cross-validation, surpassing state-of-the-art approaches. We demonstrated that the proposed method, although fully automated, performs as well as manual relation extraction, with enhanced robustness. We also observed GIX's capability to augment existing datasets with new sentences, incorporating newly discovered biological terms and processes. Further, we demonstrated GIX's real-world applicability in inferring E. coli gene circuits.
Affiliation(s)
- Jaskaran Kaur Gill
  - Health Innovation and Transformation Centre, Federation University, Ballarat, Victoria, Australia
- Madhu Chetty
  - Health Innovation and Transformation Centre, Federation University, Ballarat, Victoria, Australia
- Suryani Lim
  - Health Innovation and Transformation Centre, Federation University, Ballarat, Victoria, Australia
- Jennifer Hallinan
  - Health Innovation and Transformation Centre, Federation University, Ballarat, Victoria, Australia
  - BioThink, Brisbane, Queensland, Australia
4
Huang MS, Han JC, Lin PY, You YT, Tsai RTH, Hsu WL. Surveying biomedical relation extraction: a critical examination of current datasets and the proposal of a new resource. Brief Bioinform 2024; 25:bbae132. [PMID: 38609331 PMCID: PMC11014787 DOI: 10.1093/bib/bbae132] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/16/2023] [Revised: 11/06/2023] [Accepted: 03/02/2023] [Indexed: 04/14/2024] Open
Abstract
Natural language processing (NLP) has become an essential technique in various fields, offering a wide range of possibilities for analyzing data and developing diverse NLP tasks. In the biomedical domain, understanding the complex relationships between compounds and proteins is critical, especially in the context of signal transduction and biochemical pathways. Among these relationships, protein-protein interactions (PPIs) are of particular interest, given their potential to trigger a variety of biological reactions. To improve the ability to predict PPI events, we propose the protein event detection dataset (PEDD), which comprises 6823 abstracts, 39 488 sentences and 182 937 gene pairs. Our PEDD dataset has been utilized in the AI CUP Biomedical Paper Analysis competition, where systems are challenged to predict 12 different relation types. In this paper, we review the state-of-the-art relation extraction research and provide an overview of the PEDD's compilation process. Furthermore, we present the results of the PPI extraction competition and evaluate several language models' performances on the PEDD. This paper's outcomes will provide a valuable roadmap for future studies on protein event detection in NLP. By addressing this critical challenge, we hope to enable breakthroughs in drug discovery and enhance our understanding of the molecular mechanisms underlying various diseases.
Affiliation(s)
- Ming-Siang Huang
  - Intelligent Agent Systems Laboratory, Department of Computer Science and Information Engineering, Asia University, New Taipei City, Taiwan
  - National Institute of Cancer Research, National Health Research Institutes, Tainan, Taiwan
  - Department of Computer Science and Information Engineering, College of Information and Electrical Engineering, Asia University, Taichung, Taiwan
- Jen-Chieh Han
  - Intelligent Information Service Research Laboratory, Department of Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan
- Pei-Yen Lin
  - Intelligent Agent Systems Laboratory, Department of Computer Science and Information Engineering, Asia University, New Taipei City, Taiwan
- Yu-Ting You
  - Intelligent Agent Systems Laboratory, Department of Computer Science and Information Engineering, Asia University, New Taipei City, Taiwan
- Richard Tzong-Han Tsai
  - Intelligent Information Service Research Laboratory, Department of Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan
  - Center for Geographic Information Science, Research Center for Humanities and Social Sciences, Academia Sinica, Taipei, Taiwan
- Wen-Lian Hsu
  - Intelligent Agent Systems Laboratory, Department of Computer Science and Information Engineering, Asia University, New Taipei City, Taiwan
  - Department of Computer Science and Information Engineering, College of Information and Electrical Engineering, Asia University, Taichung, Taiwan
5
Zou S, Liu Z, Wang K, Cao J, Liu S, Xiong W, Li S. A study on pharmaceutical text relationship extraction based on heterogeneous graph neural networks. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2024; 21:1489-1507. [PMID: 38303474 DOI: 10.3934/mbe.2024064] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/03/2024]
Abstract
Effective information extraction from pharmaceutical texts is of great significance for clinical research. Ancient Chinese medicine texts have streamlined sentences and complex semantic relationships, and textual relationships may exist between heterogeneous entities. Current mainstream relationship extraction models do not take into account the associations between entities and relationships when extracting, resulting in insufficient semantic information to form an effective structured representation. In this paper, we propose a heterogeneous graph neural network relationship extraction model adapted to traditional Chinese medicine (TCM) text. First, the given sentence and predefined relationships are embedded with fine-tuned Bidirectional Encoder Representations from Transformers (BERT) word embeddings as the model input. Second, a heterogeneous graph network is constructed to associate words, phrases, and relationship nodes to obtain the hidden layer representation. Then, in the decoding stage, a two-stage subject-object entity identification method is adopted: the identifier uses a binary classifier to locate the start and end positions of the TCM entities, identifies all the subject-object entities in the sentence, and finally forms the TCM entity relationship group. In experiments on the TCM relationship extraction dataset, the precision of the heterogeneous graph neural network embedded with BERT is 86.99% and the F1 value reaches 87.40%, which is improved by 8.83% and 10.21% compared with the relationship extraction models CNN, Bert-CNN, and Graph LSTM.
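The abstract describes the decoding step only as a binary classifier that marks entity start and end positions. As a minimal sketch, assuming per-token start and end probabilities are already available and that each start is paired with the nearest end at or after it (a common heuristic, not confirmed as the authors' rule), span decoding might look like the following. The function name `decode_spans`, the 0.5 threshold, and the example probabilities are hypothetical.

```python
def decode_spans(start_probs, end_probs, threshold=0.5):
    """Pair each predicted start token with the nearest predicted end token.

    start_probs / end_probs: per-token probabilities from two binary
    classifiers (hypothetical inputs; the paper's exact outputs are not given).
    Returns a list of (start_index, end_index) tuples, both inclusive.
    """
    starts = [i for i, p in enumerate(start_probs) if p >= threshold]
    ends = [i for i, p in enumerate(end_probs) if p >= threshold]
    spans = []
    for s in starts:
        # Nearest end position at or after this start (assumed pairing rule).
        following = [e for e in ends if e >= s]
        if following:
            spans.append((s, following[0]))
    return spans


# Illustrative probabilities for a five-token sentence.
example_start = [0.9, 0.1, 0.2, 0.8, 0.1]
example_end = [0.1, 0.7, 0.1, 0.1, 0.9]
example_spans = decode_spans(example_start, example_end)
```

A start with no following end is simply dropped here; a real system would need a policy for such unmatched predictions.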
Affiliation(s)
- Shuilong Zou
  - Nanchang Institute of Science & Technology, Nanchang 330004, China
- Zhaoyang Liu
  - School of Computer, Jiangxi University of Chinese Medicine, Nanchang 330004, China
- Kaiqi Wang
  - School of Computer, Jiangxi University of Chinese Medicine, Nanchang 330004, China
- Jun Cao
  - School of Computer, Jiangxi University of Chinese Medicine, Nanchang 330004, China
- Shixiong Liu
  - Nanchang Institute of Science & Technology, Nanchang 330004, China
- Wangping Xiong
  - School of Computer, Jiangxi University of Chinese Medicine, Nanchang 330004, China
- Shaoyi Li
  - Nanchang Institute of Science & Technology, Nanchang 330004, China
6
Nachtegael C, De Stefani J, Lenaerts T. A study of deep active learning methods to reduce labelling efforts in biomedical relation extraction. PLoS One 2023; 18:e0292356. [PMID: 38100453 PMCID: PMC10723703 DOI: 10.1371/journal.pone.0292356] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/06/2023] [Accepted: 09/19/2023] [Indexed: 12/17/2023] Open
Abstract
Automatic biomedical relation extraction (bioRE) is an essential task in biomedical research in order to generate high-quality labelled data that can be used for the development of innovative predictive methods. However, building such fully labelled, high-quality bioRE data sets of adequate size for the training of state-of-the-art relation extraction models is hindered by an annotation bottleneck due to limitations on time and expertise of researchers and curators. We show here how Active Learning (AL) plays an important role in resolving this issue and improves bioRE tasks, effectively overcoming the labelling limits inherent to a data set. Six different AL strategies are benchmarked on seven bioRE data sets, using PubMedBERT as the base model, evaluating their area under the learning curve (AULC) as well as intermediate results measurements. The results demonstrate that uncertainty-based strategies, such as Least-Confident or Margin Sampling, perform statistically better in terms of F1-score, accuracy and precision than other types of AL strategies. However, in terms of recall, a diversity-based strategy, called Core-set, outperforms all strategies. AL strategies are shown to reduce the annotation need (in order to reach a performance on par with training on all data) by 6% to 38%, depending on the data set, with the Margin Sampling and Least-Confident Sampling strategies moreover obtaining the best AULCs compared to the Random Sampling baseline. We show through the experiments the importance of using AL methods to reduce the amount of labelling needed to construct high-quality data sets leading to optimal performance of deep learning models. The code and data sets to reproduce all the results presented in the article are available at https://github.com/oligogenic/Deep_active_learning_bioRE.
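The two uncertainty-based strategies named in the abstract have standard textbook definitions, which can be sketched as below. This is an illustrative sketch only and is not taken from the authors' repository; the function names, the toy probability pool, and the batch size are assumptions made for the example.

```python
def least_confident_score(probs):
    """Uncertainty as 1 minus the highest predicted class probability."""
    return 1.0 - max(probs)


def margin_score(probs):
    """Gap between the two most probable classes; a small gap means high uncertainty."""
    top_two = sorted(probs, reverse=True)[:2]
    return top_two[0] - top_two[1]


def select_batch(pool_probs, batch_size, strategy="margin"):
    """Return indices of the unlabelled examples to send for annotation next."""
    indices = range(len(pool_probs))
    if strategy == "least_confident":
        # Highest uncertainty first.
        ranked = sorted(indices, key=lambda i: least_confident_score(pool_probs[i]), reverse=True)
    elif strategy == "margin":
        # Smallest margin first.
        ranked = sorted(indices, key=lambda i: margin_score(pool_probs[i]))
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return ranked[:batch_size]


# Toy class-probability predictions for four unlabelled examples (hypothetical).
pool = [
    [0.95, 0.03, 0.02],
    [0.40, 0.35, 0.25],
    [0.50, 0.49, 0.01],
    [0.70, 0.20, 0.10],
]
lc_batch = select_batch(pool, 2, "least_confident")
margin_batch = select_batch(pool, 2, "margin")
```

On this toy pool the two strategies pick the same two examples but rank them differently, because example 1 has the lowest top probability while example 2 has the smallest gap between its top two classes.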
Affiliation(s)
- Charlotte Nachtegael
  - Interuniversity Institute of Bioinformatics in Brussels, Université Libre de Bruxelles-Vrije Universiteit Brussel, Bruxelles, Belgium
  - Machine Learning Group, Université Libre de Bruxelles, Bruxelles, Belgium
- Jacopo De Stefani
  - Machine Learning Group, Université Libre de Bruxelles, Bruxelles, Belgium
  - Technology, Policy and Management Faculty, Technische Universiteit Delft, Delft, Netherlands
- Tom Lenaerts
  - Interuniversity Institute of Bioinformatics in Brussels, Université Libre de Bruxelles-Vrije Universiteit Brussel, Bruxelles, Belgium
  - Machine Learning Group, Université Libre de Bruxelles, Bruxelles, Belgium
  - Artificial Intelligence Laboratory, Vrije Universiteit Brussel, Bruxelles, Belgium
7
Hu Y, Chen Y, Qin Y, Huang R. Learning entity-oriented representation for biomedical relation extraction. J Biomed Inform 2023; 147:104527. [PMID: 37852347 DOI: 10.1016/j.jbi.2023.104527] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2023] [Revised: 10/11/2023] [Accepted: 10/15/2023] [Indexed: 10/20/2023]
Abstract
Biomedical Relation Extraction (BioRE) aims to automatically extract semantic relations for given entity pairs and is of great significance in biomedical research. Current popular methods often utilize pretrained language models to extract semantic features from individual input instances, which frequently suffer from overlapping semantics. Overlapping semantics refers to the situation in which a sentence contains multiple entity pairs that share the same context, leading to highly similar information between these entity pairs. In this study, we propose a model for learning Entity-oriented Representation (EoR) that aims to improve the performance of the model by enhancing the discriminability between entity pairs. It contains three modules: sentence representation, entity-oriented representation, and output. The first module learns the global semantic information of the input instance; the second module focuses on extracting the semantic information of the sentence from the target entities; and the third module enhances distinguishability among entity pairs and classifies the relation type. We evaluated our approach on four BioRE tasks with eight datasets, and the experiments showed that our EoR achieved state-of-the-art performance for PPI, DDI, CPI, and DPI tasks. Further analysis demonstrated the benefits of entity-oriented semantic information in handling multiple entity pairs in the BioRE task.
Affiliation(s)
- Ying Hu
  - Text Computing and Cognitive Intelligence Engineering Research Center of National Education Ministry, State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University, Guiyang, 550025, China.
- Yanping Chen
  - Text Computing and Cognitive Intelligence Engineering Research Center of National Education Ministry, State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University, Guiyang, 550025, China.
- Yongbin Qin
  - Text Computing and Cognitive Intelligence Engineering Research Center of National Education Ministry, State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University, Guiyang, 550025, China.
- Ruizhang Huang
  - Text Computing and Cognitive Intelligence Engineering Research Center of National Education Ministry, State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University, Guiyang, 550025, China.
8
Xiao Y, Ji Z, Li J, Zhu Q. CLART: A cascaded lattice-and-radical transformer network for Chinese medical named entity recognition. Heliyon 2023; 9:e20692. [PMID: 37876457 PMCID: PMC10590790 DOI: 10.1016/j.heliyon.2023.e20692] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/01/2023] [Revised: 10/01/2023] [Accepted: 10/04/2023] [Indexed: 10/26/2023] Open
Abstract
Chinese medical named entity recognition (NER) is a fundamental task in Chinese medical natural language processing, aiming to recognize Chinese medical entities within unstructured medical texts. However, it poses significant challenges mainly due to the extensive usage of medical terms in Chinese medical texts. Although previous studies have made attempts to incorporate lexical or radical knowledge in order to improve the comprehension of medical texts, these studies either focus solely on one of these aspects or utilize a basic concatenation operation to combine these features, which fails to fully utilize the potential of lexical and radical knowledge. In this paper, we propose a novel Cascaded LAttice-and-Radical Transformer (CLART) network to exploit both lexical and radical information for Chinese medical NER. Specifically, given a sentence, a medical lexicon, and a radical dictionary, we first construct a flat lattice (i.e., character-word sequence) for the sentence and radical components of each Chinese character through word matching and radical parsing, respectively. We then employ a lattice Transformer module to capture the dense interactions between characters and matched words, facilitating the enhanced utilization of lexical knowledge. Subsequently, we design a radical Transformer module to model the dense interactions between the lattice and radical features, facilitating better fusion of the lexical and radical knowledge. Finally, we feed the updated lattice-and-radical-aware character representations into a Conditional Random Fields (CRF) decoder to obtain the predicted labels. Experimental results conducted on two publicly available Chinese medical NER datasets show the effectiveness of the proposed method.
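The flat lattice mentioned in the abstract (a character-word sequence built by word matching) can be illustrated with a small sketch. It assumes a simple exhaustive substring match against a lexicon and records each token with its first and last character index; the function name, the `max_word_len` limit, and the two-word lexicon are hypothetical and are not taken from the paper.

```python
def build_flat_lattice(sentence, lexicon, max_word_len=4):
    """Return (token, head, tail) triples: every character, then every matched word.

    head and tail are the indices of the token's first and last character,
    so a single character has head == tail.
    """
    # Characters come first, each spanning exactly one position.
    lattice = [(ch, i, i) for i, ch in enumerate(sentence)]
    # Then append every multi-character substring found in the lexicon.
    for start in range(len(sentence)):
        for length in range(2, max_word_len + 1):
            end = start + length
            if end > len(sentence):
                break
            word = sentence[start:end]
            if word in lexicon:
                lattice.append((word, start, end - 1))
    return lattice


# Toy lexicon and sentence (illustrative only).
toy_lexicon = {"糖尿病", "患者"}
toy_lattice = build_flat_lattice("糖尿病患者", toy_lexicon)
```

The head and tail indices are what would let a Transformer layer relate each matched word back to the characters it covers.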
Affiliation(s)
- Yinlong Xiao
  - Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China
- Jianqiang Li
  - Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China
- Qing Zhu
  - Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China
9
Wang T, You J, Gong X, Yang S, Wang L, Chang Z. Probabilistic Bayesian Deep Learning Approach for Online Forecasting of Fed-Batch Fermentation. ACS OMEGA 2023; 8:25272-25278. [PMID: 37483241 PMCID: PMC10357427 DOI: 10.1021/acsomega.3c02387] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/09/2023] [Accepted: 06/22/2023] [Indexed: 07/25/2023]
Abstract
The microbial fermentation process often involves various biological metabolic reactions and chemical processes. The mixed bacterial culture process of 2-keto-l-gulonic acid has strong nonlinear and time-varying characteristics. In this study, a probabilistic Bayesian deep learning approach is proposed to obtain a highly accurate and robust prediction of product formation. The Bayesian optimized deep neural network (BODNN) is used as the basic prediction model, and its structural parameters are optimized. Then, the training datasets are classified into different categories according to the prior evaluation of prediction error. The final forecast is a weighted combination of BODNN models based on the Bayesian hybrid method. The weights can be interpreted as Bayesian posterior probabilities and are computed recursively. Validation is carried out on 95 industrial batches, and the average root mean square errors are 1.51 and 2.01% for 4 and 8 h ahead prediction, respectively. The results illustrate that the proposed approach can capture the dynamics of fermentation batches and is suitable for online process monitoring.
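The abstract states that the combination weights are Bayesian posterior probabilities computed recursively but gives no formula. A minimal sketch of one common form of such an update follows: each model's previous weight is multiplied by the likelihood of its latest prediction error and the result is renormalised. The Gaussian error likelihood, the `sigma` parameter, and all names are assumptions for illustration and are not the authors' stated method.

```python
import math


def update_weights(weights, errors, sigma=1.0):
    """One recursive Bayesian step: posterior is proportional to prior times likelihood.

    weights: current model weights (sum to 1).
    errors: each model's latest prediction error.
    sigma: assumed standard deviation of a zero-mean Gaussian error model.
    """
    likelihoods = [math.exp(-0.5 * (e / sigma) ** 2) for e in errors]
    unnormalised = [w * lik for w, lik in zip(weights, likelihoods)]
    total = sum(unnormalised)
    if total == 0.0:
        # Every likelihood underflowed; keep the previous weights unchanged.
        return list(weights)
    return [u / total for u in unnormalised]


def combine(predictions, weights):
    """Weighted combination of the individual model forecasts."""
    return sum(w * p for w, p in zip(weights, predictions))


# Two hypothetical models starting with equal weight.
prior = [0.5, 0.5]
posterior = update_weights(prior, errors=[0.0, 2.0])
forecast = combine([10.0, 20.0], prior)
```

Applying `update_weights` after every new measurement shifts weight towards whichever model has recently tracked the process best, which is the behaviour the recursive scheme is meant to provide.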
Affiliation(s)
- Tao Wang
  - School of Computer Science and Technology, Shandong University of Technology, Zibo 255000, China
- Jiebing You
  - Department of Neurology, Zibo Central Hospital, Zibo, Shandong 255036, China
- Xiugang Gong
  - School of Computer Science and Technology, Shandong University of Technology, Zibo 255000, China
- Shanliang Yang
  - School of Computer Science and Technology, Shandong University of Technology, Zibo 255000, China
- Lei Wang
  - School of Computer Science and Technology, Shandong University of Technology, Zibo 255000, China
- Zheng Chang
  - School of Computer Science and Technology, Shandong University of Technology, Zibo 255000, China
10
Deng H, Li Q, Liu Y, Zhu J. MTMG: A multi-task model with multi-granularity information for drug-drug interaction extraction. Heliyon 2023; 9:e16819. [PMID: 37484258 PMCID: PMC10360954 DOI: 10.1016/j.heliyon.2023.e16819] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/08/2023] [Revised: 05/29/2023] [Accepted: 05/30/2023] [Indexed: 07/25/2023] Open
Abstract
Drug-drug interaction (DDI) extraction involves identifying drug entities and interactions between drug pairs from the biomedical corpus. The discovery of potential DDIs aids in our understanding of the mechanisms underlying adverse reactions or combination therapy to improve patient safety. The manual extraction of DDIs is very time-consuming and expensive; therefore, computer-aided extraction of DDIs is vital. Many neural network-based methods have been proposed and have achieved good efficiency in the extraction of DDIs over the years. However, most studies improved the performance of DDI extraction with various external drug features while directly using gold-standard drug entities, leading to error propagation and low universality in practical application. In this paper, we propose a new multi-task framework called MTMG, which changes DDI extraction from a sentence-level classification task to a sequence labeling task named Drug-Specified Token Classification (DSTC). The proposed approach, MTMG, jointly trains DSTC with drug named entity recognition (DNER) and two sentence-level auxiliary tasks we designed. We aim to improve the performance of the entire DDI extraction pipeline by better using the correlation between entities and relationships and, to the extent possible, using the information of varying granularity implied in the dataset. Experimental results show that MTMG improves the accuracy of both DNER and DDI extraction and outperforms state-of-the-art techniques.
11
Xie W, Fan K, Zhang S, Li L. Multiple sampling schemes and deep learning improve active learning performance in drug-drug interaction information retrieval analysis from the literature. J Biomed Semantics 2023; 14:5. [PMID: 37248476 PMCID: PMC10228061 DOI: 10.1186/s13326-023-00287-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/09/2022] [Accepted: 04/29/2023] [Indexed: 05/31/2023] Open
Abstract
BACKGROUND: Drug-drug interaction (DDI) information retrieval (IR) from the PubMed literature is an important natural language processing (NLP) task. For the first time, active learning (AL) is studied in DDI IR analysis. DDI IR analysis from PubMed abstracts faces the challenge of relatively few positive DDI samples among an overwhelmingly large number of negative samples. Random negative sampling and positive sampling are purposely designed to improve the efficiency of AL analysis. The consistency of random negative sampling and positive sampling is shown in the paper. RESULTS: PubMed abstracts are divided into two pools. The screened pool contains all abstracts that pass the DDI keywords query in PubMed, while the unscreened pool includes all the other abstracts. At a prespecified recall rate of 0.95, DDI IR analysis precision is evaluated and compared. In the screened pool IR analysis using a support vector machine (SVM), similarity sampling plus uncertainty sampling improves the precision over uncertainty sampling alone, from 0.89 to 0.92. In the unscreened pool IR analysis, the integrated random negative sampling, positive sampling, and similarity sampling improve the precision over uncertainty sampling alone, from 0.72 to 0.81. When the SVM is replaced with a deep learning method, all sampling schemes consistently improve DDI AL analysis in both the screened pool and the unscreened pool. Deep learning significantly improves precision over SVM: 0.96 vs. 0.92 in the screened pool, and 0.90 vs. 0.81 in the unscreened pool. CONCLUSIONS: By integrating various sampling schemes and deep learning algorithms into AL, the DDI IR analysis from literature is significantly improved. Random negative sampling and positive sampling are highly effective methods for improving AL analysis where the positive and negative samples are extremely imbalanced.
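The evaluation metric used above, precision at a prespecified recall of 0.95, can be illustrated with a short sketch. It assumes the usual reading of the metric: rank documents by classifier score, lower the cutoff until the target recall is first reached, and report the precision at that cutoff. The function name and the toy scores and labels are hypothetical, and tied scores are not given any special treatment.

```python
def precision_at_recall(scores, labels, target_recall=0.95):
    """Precision at the first ranked cutoff whose recall reaches target_recall.

    scores: classifier scores, higher means more likely positive.
    labels: 1 for a true DDI-relevant abstract, 0 otherwise.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = 0
    for k, (_, label) in enumerate(ranked, start=1):
        true_pos += label
        if true_pos / total_pos >= target_recall:
            return true_pos / k
    # Reached only if target_recall is greater than 1.
    return true_pos / len(ranked)


# Toy example: three positives among five ranked abstracts.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
toy_labels = [1, 0, 1, 1, 0]
toy_precision = precision_at_recall(toy_scores, toy_labels, target_recall=0.95)
```

With only three positives, a recall of 0.95 requires retrieving all of them, so the cutoff falls after the fourth abstract and one false positive is included.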
Affiliation(s)
- Weixin Xie
  - Department of Biomedical Informatics, Ohio State University, Columbus, OH 43210 USA
- Kunjie Fan
  - Department of Biomedical Informatics, Ohio State University, Columbus, OH 43210 USA
- Shijun Zhang
  - Department of Biomedical Informatics, Ohio State University, Columbus, OH 43210 USA
- Lang Li
  - Department of Biomedical Informatics, Ohio State University, Columbus, OH 43210 USA
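The evaluation protocol in the abstract above (precision measured at a prespecified recall of 0.95) can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `precision_at_recall`, the toy scores, and the choice to cut the ranked list at the first position where recall reaches the target are all assumptions made for illustration.

```python
def precision_at_recall(scores, labels, target_recall=0.95):
    """Precision at the shortest ranked prefix whose recall reaches target_recall.

    scores: classifier scores, higher means more likely a DDI abstract.
    labels: 1 for a positive (DDI) abstract, 0 for a negative one.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank abstracts from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = 0
    for k, (_, label) in enumerate(ranked, start=1):
        true_pos += label
        if true_pos / total_pos >= target_recall:
            return true_pos / k  # precision over the top-k retrieved abstracts
    return true_pos / len(ranked)


# Toy example (hypothetical scores): both positives are found within the top 3.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
toy_labels = [1, 0, 1, 0, 0]
print(precision_at_recall(toy_scores, toy_labels, target_recall=0.95))
```

Fixing recall first and then reading off precision suits heavily imbalanced retrieval settings, where missing a positive abstract is costlier than reviewing an extra negative one.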
|
12
|
Shi B, Fan R, Zhang L, Huang J, Xiong N, Vasilakos A, Wan J, Zhang L. A Joint Extraction System Based on Conditional Layer Normalization for Health Monitoring. Sensors (Basel, Switzerland) 2023; 23:4812. [PMID: 37430725 DOI: 10.3390/s23104812] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/16/2023] [Revised: 05/10/2023] [Accepted: 05/11/2023] [Indexed: 07/12/2023]
Abstract
Natural language processing (NLP) technology has played a pivotal role in health monitoring as an important artificial intelligence method. As a key technology in NLP, relation triplet extraction is closely related to the performance of health monitoring. In this paper, a novel model is proposed for joint extraction of entities and relations, combining conditional layer normalization with the talking-head attention mechanism to strengthen the interaction between entity recognition and relation extraction. In addition, the proposed model utilizes position information to enhance the extraction accuracy of overlapping triplets. Experiments on the Baidu2019 and CHIP2020 datasets demonstrate that the proposed model can effectively extract overlapping triplets, which leads to significant performance improvements compared with baselines.
Affiliation(s)
- Binbin Shi
  - School of Biological and Chemical Engineering, Zhejiang University of Science and Technology, Hangzhou 310023, China
- Rongli Fan
  - School of Biological and Chemical Engineering, Zhejiang University of Science and Technology, Hangzhou 310023, China
- Lijuan Zhang
  - School of Information and Electronic Engineering, Zhejiang University of Science and Technology, Hangzhou 310023, China
- Jie Huang
  - School of Information and Electronic Engineering, Zhejiang University of Science and Technology, Hangzhou 310023, China
- Neal Xiong
  - Department of Computer Science and Mathematics, Sul Ross State University, Alpine, TX 79830, USA
- Jian Wan
  - School of Information and Electronic Engineering, Zhejiang University of Science and Technology, Hangzhou 310023, China
- Lei Zhang
  - School of Information and Electronic Engineering, Zhejiang University of Science and Technology, Hangzhou 310023, China
|
13
|
Al-Sabri R, Gao J, Chen J, Oloulade BM, Lyu T. Multi-View Graph Neural Architecture Search for Biomedical Entity and Relation Extraction. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:1221-1233. [PMID: 36074877 DOI: 10.1109/tcbb.2022.3205113] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
Recently, graph neural architecture search (GNAS) frameworks have been successfully used to automatically design the optimal neural architectures for many problems such as node classification and graph classification. In the existing GNAS frameworks, the designed graph neural network (GNN) architectures learn the representation of homogeneous graphs with one type of relationship connecting two nodes. However, multi-view graphs, where each view represents a type of relationship among nodes, are ubiquitous in the real world. The traditional GNAS frameworks learn the graph representation without considering the interactions between nodes and multiple relationships, so they fail to solve multi-view graph-based problems, such as multi-view graphs modelling the biomedical entity and relation extraction tasks. In this paper, we propose MVGNAS, a multi-view graph neural network automatic modelling framework for biomedical entity and relation extraction, to resolve this challenge. In MVGNAS, we propose automatic multi-view representation learning to learn low-dimensional representations of nodes that capture multiple relationships in a multi-view graph. This is the first research work in the literature to solve the problem of multi-view graph representation learning architecture search for biomedical entity and relation extraction tasks. The experimental results demonstrate that MVGNAS achieves the best performance in biomedical entity and relation extraction tasks against the state-of-the-art baseline methods.
|
14
|
Chasseray Y, Barthe-Delanoë AM, Négny S, Le Lann JM. Knowledge extraction from textual data and performance evaluation in an unsupervised context. Inf Sci (N Y) 2023. [DOI: 10.1016/j.ins.2023.01.150] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/09/2023]
|
15
|
Weber L, Sänger M, Garda S, Barth F, Alt C, Leser U. Chemical-protein relation extraction with ensembles of carefully tuned pretrained language models. Database (Oxford) 2022; 2022:6833204. [PMID: 36399413 PMCID: PMC9674024 DOI: 10.1093/database/baac098] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 03/10/2022] [Revised: 10/18/2022] [Accepted: 10/21/2022] [Indexed: 11/19/2022]
Abstract
The identification of chemical-protein interactions described in the literature is an important task with applications in drug design, precision medicine and biotechnology. Manual extraction of such relationships from the biomedical literature is costly and often prohibitively time-consuming. The BioCreative VII DrugProt shared task provides a benchmark for methods for the automated extraction of chemical-protein relations from scientific text. Here we describe our contribution to the shared task and report on the achieved results. We define the task as a relation classification problem, which we approach with pretrained transformer language models. Upon this basic architecture, we experiment with utilizing textual and embedded side information from knowledge bases as well as additional training data to improve extraction performance. We perform a comprehensive evaluation of the proposed model and the individual extensions including an extensive hyperparameter search leading to 2647 different runs. We find that ensembling and choosing the right pretrained language model are crucial for optimal performance, whereas adding additional data and embedded side information did not improve results. Our best model is based on an ensemble of 10 pretrained transformers and additional textual descriptions of chemicals taken from the Comparative Toxicogenomics Database. The model reaches an F1 score of 79.73% on the hidden DrugProt test set and achieves the first rank out of 107 submitted runs in the official evaluation. Database URL: https://github.com/leonweber/drugprot.
Affiliation(s)
- Leon Weber
  - *Corresponding authors: Tel: +49 30 209341293
- Mario Sänger
  - Computer Science, Humboldt-Universität zu Berlin, Unter den Linden 6, Berlin 10099, Germany
- Samuele Garda
  - Computer Science, Humboldt-Universität zu Berlin, Unter den Linden 6, Berlin 10099, Germany
- Fabio Barth
  - Computer Science, Humboldt-Universität zu Berlin, Unter den Linden 6, Berlin 10099, Germany
- Christoph Alt
  - Computer Science, Humboldt-Universität zu Berlin, Unter den Linden 6, Berlin 10099, Germany; Research Cluster of Excellence, Science of Intelligence, Marchstr. 23, Berlin 10587, Germany
- Ulf Leser
  - *Corresponding authors: Tel: +49 30 209341293
|
16
|
Zhang Z, Chen ALP. Biomedical named entity recognition with the combined feature attention and fully-shared multi-task learning. BMC Bioinformatics 2022; 23:458. [PMID: 36329384 PMCID: PMC9632084 DOI: 10.1186/s12859-022-04994-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/18/2022] [Accepted: 10/19/2022] [Indexed: 11/06/2022] Open
Abstract
Background Biomedical named entity recognition (BioNER) is a basic and important task for biomedical text mining with the purpose of automatically recognizing and classifying biomedical entities. The performance of BioNER systems directly impacts downstream applications. Recently, deep neural networks, especially pre-trained language models, have made great progress for BioNER. However, because of the lack of high-quality and large-scale annotated data and relevant external knowledge, the capability of the BioNER system remains limited. Results In this paper, we propose a novel fully-shared multi-task learning model based on the pre-trained language model in biomedical domain, namely BioBERT, with a new attention module to integrate the auto-processed syntactic information for the BioNER task. We have conducted numerous experiments on seven benchmark BioNER datasets. The proposed best multi-task model obtains F1 score improvements of 1.03% on BC2GM, 0.91% on NCBI-disease, 0.81% on Linnaeus, 1.26% on JNLPBA, 0.82% on BC5CDR-Chemical, 0.87% on BC5CDR-Disease, and 1.10% on Species-800 compared to the single-task BioBERT model. Conclusion The results demonstrate our model outperforms previous studies on all datasets. Further analysis and case studies are also provided to prove the importance of the proposed attention module and fully-shared multi-task learning method used in our model.
Affiliation(s)
- Zhiyu Zhang
  - Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan
- Arbee L P Chen
  - Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan; Department of Computer Science and Information Engineering, Asia University, Taichung, Taiwan
|
17
|
An automatic hypothesis generation for plausible linkage between xanthium and diabetes. Sci Rep 2022; 12:17547. [PMID: 36266295 PMCID: PMC9585073 DOI: 10.1038/s41598-022-20752-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/07/2021] [Accepted: 09/19/2022] [Indexed: 01/13/2023] Open
Abstract
There has been a significant increase in text mining implementation for biomedical literature in recent years. Previous studies introduced the implementation of text mining and literature-based discovery to generate hypotheses of potential candidates for drug development. By conducting a hypothesis-generation step and using evidence from published journal articles or proceedings, previous studies have managed to reduce experimental time and costs. First, we applied the closed discovery approach from Swanson's ABC model to collect publications related to 36 Xanthium compounds or diabetes. Second, we extracted biomedical entities and relations using a knowledge extraction engine, the Public Knowledge Discovery Engine for Java or PKDE4J. Third, we built a knowledge graph using the obtained bio entities and relations and then generated paths with Xanthium compounds as source nodes and diabetes as the target node. Lastly, we employed graph embeddings to rank each path and evaluated the results based on domain experts' opinions and literature. Among 36 Xanthium compounds, 35 had direct paths to five diabetes-related nodes. We ranked 2,740,314 paths in total between 35 Xanthium compounds and three diabetes-related phrases: type 1 diabetes, type 2 diabetes, and diabetes mellitus. Based on the top five percentile paths, we concluded that adenosine, choline, beta-sitosterol, rhamnose, and scopoletin were potential candidates for diabetes drug development using natural products. Our framework for hypothesis generation employs a closed discovery from Swanson's ABC model that has proven very helpful in discovering biological linkages between bio entities. The PKDE4J tools we used to capture bio entities from our document collection could label entities into five categories: genes, compounds, phenotypes, biological processes, and molecular functions. 
Using the BioPREP model, we managed to interpret the semantic relatedness between two nodes and provided paths containing valuable hypotheses. Lastly, using a graph-embedding algorithm in our path-ranking analysis, we exploited the semantic relatedness while preserving the graph structure properties.
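The path-generation step described in the abstract above (source compounds, intermediate entities, one target disease node) can be sketched in Python as a bounded enumeration of simple paths. This is an illustrative sketch only: the function `simple_paths`, the `max_edges` limit, and the placeholder node names are assumptions, and the graph-embedding ranking used in the study is not reproduced here.

```python
def simple_paths(graph, source, target, max_edges=3):
    """Enumerate cycle-free paths from source to target with at most max_edges edges.

    graph: dict mapping a node to the list of nodes it points to.
    """
    paths = []
    stack = [(source, [source])]
    while stack:
        node, path = stack.pop()
        if node == target:
            paths.append(path)
            continue
        if len(path) - 1 >= max_edges:
            continue  # path is already as long as allowed
        for neighbor in graph.get(node, ()):
            if neighbor not in path:  # keep paths simple (no repeated nodes)
                stack.append((neighbor, path + [neighbor]))
    # Shorter paths first, then alphabetical, for a deterministic listing.
    return sorted(paths, key=lambda p: (len(p), p))


# Placeholder knowledge graph; node names are hypothetical, not extracted relations.
toy_graph = {
    "compound_A": ["gene_B1", "gene_B2"],
    "gene_B1": ["diabetes"],
    "gene_B2": ["gene_B1"],
}
for path in simple_paths(toy_graph, "compound_A", "diabetes"):
    print(" -> ".join(path))
```

Bounding the path length keeps the enumeration tractable; a ranking step would then score each returned path.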
|
18
|
Turki H, Jemielniak D, Hadj Taieb MA, Labra Gayo JE, Ben Aouicha M, Banat M, Shafee T, Prud’hommeaux E, Lubiana T, Das D, Mietchen D. Using logical constraints to validate statistical information about disease outbreaks in collaborative knowledge graphs: the case of COVID-19 epidemiology in Wikidata. PeerJ Comput Sci 2022; 8:e1085. [PMID: 36262159 PMCID: PMC9575845 DOI: 10.7717/peerj-cs.1085] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/07/2022] [Accepted: 08/15/2022] [Indexed: 06/16/2023]
Abstract
Urgent global research demands real-time dissemination of precise data. Wikidata, a collaborative and openly licensed knowledge graph available in RDF format, provides an ideal forum for exchanging structured data that can be verified and consolidated using validation schemas and bot edits. In this research article, we catalog an automatable task set necessary to assess and validate the portion of Wikidata relating to the COVID-19 epidemiology. These tasks assess statistical data and are implemented in SPARQL, a query language for semantic databases. We demonstrate the efficiency of our methods for evaluating structured non-relational information on COVID-19 in Wikidata, and its applicability in collaborative ontologies and knowledge graphs more broadly. We show the advantages and limitations of our proposed approach by comparing it to the features of other methods for the validation of linked web data as revealed by previous research.
Affiliation(s)
- Houcemeddine Turki
  - Data Engineering and Semantics Research Unit, Faculty of Sciences of Sfax, University of Sfax, Sfax, Tunisia
- Dariusz Jemielniak
  - Department of Management in Networked and Digital Societies, Kozminski University, Warsaw, Masovia, Poland
- Mohamed A. Hadj Taieb
  - Data Engineering and Semantics Research Unit, Faculty of Sciences of Sfax, University of Sfax, Sfax, Tunisia
- Jose E. Labra Gayo
  - Web Semantics Oviedo (WESO) Research Group, University of Oviedo, Oviedo, Asturias, Spain
- Mohamed Ben Aouicha
  - Data Engineering and Semantics Research Unit, Faculty of Sciences of Sfax, University of Sfax, Sfax, Tunisia
- Mus’ab Banat
  - Faculty of Medicine, Hashemite University, Zarqa, Jordan
- Thomas Shafee
  - La Trobe University, Melbourne, Victoria, Australia
  - Swinburne University of Technology, Melbourne, Victoria, Australia
- Eric Prud’hommeaux
  - World Wide Web Consortium, Cambridge, Massachusetts, United States of America
- Tiago Lubiana
  - Computational Systems Biology Laboratory, University of São Paulo, São Paulo, Brazil
- Diptanshu Das
  - Institute of Child Health (ICH), Kolkata, West Bengal, India
  - Medica Superspecialty Hospital, Kolkata, West Bengal, India
- Daniel Mietchen
  - Ronin Institute, Montclair, New Jersey, United States of America
  - Department of Evolutionary and Integrative Ecology, Leibniz Institute of Freshwater Ecology and Inland Fisheries, Berlin, Germany
  - School of Data Science, University of Virginia, Charlottesville, Virginia, United States
  - Institute for Globally Distributed Open Research and Education (IGDORE), Jena, Germany
|
19
|
Chen J, Sun X, Jin X, Sutcliffe R. Extracting drug-drug interactions from no-blinding texts using key semantic sentences and GHM loss. J Biomed Inform 2022; 135:104192. [PMID: 36064114 DOI: 10.1016/j.jbi.2022.104192] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/15/2021] [Revised: 08/28/2022] [Accepted: 08/29/2022] [Indexed: 11/26/2022]
Abstract
The extraction of drug-drug interactions (DDIs) is an important task in the field of biomedical research, which can reduce unexpected health risks during patient treatment. Previous work indicates that methods using external drug information have a much higher performance than those methods not using it. However, the use of external drug information is time-consuming and resource-costly. In this work, we propose a novel method for extracting DDIs which does not use external drug information, but still achieves comparable performance. First, we no longer convert the drug name to standard tokens such as DRUG0, the method commonly used in previous research. Instead, full drug names with drug entity marking are input to BioBERT, allowing us to enhance the selected drug entity pair. Second, we adopt the Key Semantic Sentence approach to emphasize the words closely related to the DDI relation of the selected drug pair. After the above steps, the misclassification of similar instances which are created from the same sentence but corresponding to different pairs of drug entities can be significantly reduced. Then, we employ the Gradient Harmonizing Mechanism (GHM) loss to reduce the weight of mislabeled instances and easy-to-classify instances, both of which can lead to poor performance in DDI extraction. Overall, we demonstrate in this work that it is better not to use drug blinding with BioBERT, and show that GHM performs better than Cross-Entropy loss if the proportion of label noise is less than 30%. The proposed model achieves state-of-the-art results with an F1-score of 84.13% on the DDIExtraction 2013 corpus (a standard English DDI corpus), which fills the performance gap (4%) between methods that rely on and do not rely on external drug information.
Affiliation(s)
- Jiacheng Chen
  - School of Information Science and Technology, Northwest University, Xi'an, 710127, China
- Xia Sun
  - School of Information Science and Technology, Northwest University, Xi'an, 710127, China.
- Xin Jin
  - School of Information Science and Technology, Northwest University, Xi'an, 710127, China
- Richard Sutcliffe
  - School of Information Science and Technology, Northwest University, Xi'an, 710127, China; School of Computer Science and Electronic Engineering, University of Essex, Colchester, CO4 3SQ, UK.
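The contrast drawn in the abstract above, between drug blinding (replacing mentions with tokens such as DRUG0) and keeping full drug names with entity marking, can be illustrated with a short Python sketch. The marker strings `<e0>`/`</e0>`, the helper names, and the example sentence are hypothetical; the abstract does not specify the exact marking scheme, and spans are assumed not to overlap.

```python
def _rewrite(tokens, spans, make_replacement):
    """Replace each (start, end) token span; spans are assumed non-overlapping."""
    out = list(tokens)
    # Work right-to-left so that earlier span indices stay valid after edits.
    for idx, (start, end) in sorted(enumerate(spans), key=lambda p: p[1][0], reverse=True):
        out[start:end] = make_replacement(idx, out[start:end])
    return out


def blind_drugs(tokens, spans):
    """Drug blinding: each selected mention becomes a placeholder like DRUG0."""
    return _rewrite(tokens, spans, lambda i, mention: [f"DRUG{i}"])


def mark_drugs(tokens, spans):
    """Entity marking: keep the full drug name and wrap it in marker tokens."""
    return _rewrite(tokens, spans, lambda i, mention: [f"<e{i}>"] + mention + [f"</e{i}>"])


# Illustrative sentence only; spans are (start, end) token indices, end exclusive.
sentence = "Aspirin may increase the effect of warfarin sodium .".split()
pair = [(0, 1), (6, 8)]
print(" ".join(blind_drugs(sentence, pair)))
print(" ".join(mark_drugs(sentence, pair)))
```

With marking, the encoder still sees the drug names themselves, while the markers indicate which pair in the sentence is being classified.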
|
20
|
Luo L, Lai PT, Wei CH, Lu Z. A sequence labeling framework for extracting drug-protein relations from biomedical literature. Database (Oxford) 2022; 2022:baac058. [PMID: 35856889 PMCID: PMC9297941 DOI: 10.1093/database/baac058] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/30/2022] [Revised: 05/24/2022] [Accepted: 07/14/2022] [Indexed: 06/15/2023]
Abstract
Automatically extracting interactions between chemical compounds/drugs and genes/proteins is significantly beneficial to drug discovery, drug repurposing, drug design and biomedical knowledge graph construction. To promote the development of relation extraction between drugs and proteins, the BioCreative VII challenge organized the DrugProt track. This paper describes the approach we developed for this task. In addition to the conventional text classification framework that has been widely used in relation extraction tasks, we propose a sequence labeling framework for drug-protein relation extraction. We first comprehensively compared the cutting-edge biomedical pre-trained language models for both frameworks. Then, we explored several ensemble methods to further improve the final performance. In the evaluation of the challenge, our best submission (i.e. the ensemble of models in two frameworks via majority voting) achieved an F1-score of 0.795 on the official test set. Further, we found that the sequence labeling framework is more efficient and achieves better performance than the text classification framework. Finally, our ensemble of the sequence labeling models with majority voting achieves the best F1-score of 0.800 on the test set. DATABASE URL https://github.com/lingluodlut/BioCreativeVII_DrugProt.
Affiliation(s)
- Ling Luo
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, USA
- Po-Ting Lai
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, USA
- Chih-Hsuan Wei
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, USA
- Zhiyong Lu
  - *Corresponding author: Tel: 301 594 7089; Fax: 301 480 2288
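The majority-voting ensemble mentioned in the abstract above can be sketched in a few lines of Python. This is a generic illustration, not the authors' implementation: the function name, the relation labels, and the tie-breaking rule (the earliest model in the list wins a tie) are assumptions.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists into one list by majority vote.

    predictions: list of lists, one per model, all of the same length.
    Ties are broken in favour of the earliest model that predicted a tied label.
    """
    if not predictions:
        raise ValueError("need predictions from at least one model")
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("all models must predict the same number of instances")
    voted = []
    for labels in zip(*predictions):  # labels for one instance across models
        counts = Counter(labels)
        top = max(counts.values())
        voted.append(next(label for label in labels if counts[label] == top))
    return voted


# Hypothetical predictions from three models on three candidate pairs.
model_outputs = [
    ["INHIBITOR", "NONE", "ACTIVATOR"],
    ["INHIBITOR", "ACTIVATOR", "NONE"],
    ["NONE", "ACTIVATOR", "NONE"],
]
print(majority_vote(model_outputs))
```

Hard voting of this kind needs only the predicted labels, so it can combine models from different frameworks whose scores are not directly comparable.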
|
21
|
Liu X, Tan J, Fan J, Tan K, Hu J, Dong S. A Syntax-enhanced model based on category keywords for biomedical relation extraction. J Biomed Inform 2022; 132:104135. [PMID: 35842217 DOI: 10.1016/j.jbi.2022.104135] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/21/2021] [Revised: 05/10/2022] [Accepted: 07/05/2022] [Indexed: 10/17/2022]
Abstract
Certain categories in multi-category biomedical relationship extraction have linguistic similarities to some extent. Keywords related to categories and syntax structures of samples between these categories have some notable features, which are very useful in biomedical relation extraction. The pre-trained model has been widely used and has achieved great success in biomedical relationship extraction, but it is still incapable of mining this kind of information accurately. To solve the problem, we present a syntax-enhanced model based on category keywords. First, we prune syntactic dependency trees in terms of category keywords obtained by the chi-square test. It reduces noisy information caused by current syntactic parsing tools and retains useful information related to categories. Next, to encode category-related syntactic dependency trees, a syntactic transformer is presented, which enhances the ability of the pre-trained model to capture syntax structures and to distinguish multiple categories. We evaluate our method on three biomedical datasets. Compared with state-of-the-art models, our method performs better on these datasets. We conduct further analysis to verify the effectiveness of our method.
Affiliation(s)
- Xiaofeng Liu
  - School of Computer Science and Engineering, South China University of Technology, Guangzhou, China; Zhongshan Institute of Modern Industrial Technology, South China University of Technology, Zhongshan, China
- Jiajie Tan
  - School of Computer Science and Engineering, South China University of Technology, Guangzhou, China
- Jianye Fan
  - School of Computer Science and Engineering, South China University of Technology, Guangzhou, China
- Kaiwen Tan
  - Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, China
- Jinlong Hu
  - School of Computer Science and Engineering, South China University of Technology, Guangzhou, China; Zhongshan Institute of Modern Industrial Technology, South China University of Technology, Zhongshan, China
- Shoubin Dong
  - School of Computer Science and Engineering, South China University of Technology, Guangzhou, China; Zhongshan Institute of Modern Industrial Technology, South China University of Technology, Zhongshan, China.
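The abstract above obtains category keywords with a chi-square test. A minimal Python sketch of that idea follows, using the standard 2x2 chi-square statistic for a term and a category. The function names, the toy samples, and the filter that keeps only positively associated terms are assumptions for illustration, not details taken from the paper.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 term/category table.

    a: samples in the category containing the term
    b: samples outside the category containing the term
    c: samples in the category without the term
    d: samples outside the category without the term
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def category_keywords(samples, category, top_k=2):
    """Rank terms by chi-square with the category; samples are (tokens, label) pairs."""
    in_cat = sum(1 for _, label in samples if label == category)
    vocab = {tok for tokens, _ in samples for tok in tokens}
    scores = {}
    for term in vocab:
        a = sum(1 for toks, lab in samples if lab == category and term in toks)
        b = sum(1 for toks, lab in samples if lab != category and term in toks)
        c = in_cat - a
        d = (len(samples) - in_cat) - b
        if a * d > b * c:  # assumption: keep only positively associated terms
            scores[term] = chi_square(a, b, c, d)
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]


# Toy labelled samples (hypothetical sentences and relation categories).
toy = [
    (["drug", "inhibits", "enzyme"], "INHIBIT"),
    (["compound", "inhibits", "kinase"], "INHIBIT"),
    (["drug", "activates", "receptor"], "ACTIVATE"),
    (["compound", "activates", "enzyme"], "ACTIVATE"),
]
print(category_keywords(toy, "INHIBIT"))
```

The statistic is symmetric, so without the sign filter a term that strongly signals a different category would score just as highly as a true keyword.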
|
22
|
Feng B, Gao J. AnthraxKP: a knowledge graph-based, Anthrax Knowledge Portal mined from biomedical literature. Database (Oxford) 2022; 2022:6598946. [PMID: 35653350 PMCID: PMC9216567 DOI: 10.1093/database/baac037] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 01/17/2022] [Revised: 04/13/2022] [Accepted: 05/13/2022] [Indexed: 11/15/2022]
Abstract
Anthrax is a zoonotic infectious disease caused by Bacillus anthracis (anthrax bacterium) that affects not only domestic and wild animals worldwide but also human health. As research on anthrax deepens, a large number of related biomedical publications have emerged. Acquiring knowledge from the literature is essential for gaining insight into anthrax etiology, diagnosis, treatment and research. In this study, we used a set of text mining tools to identify nearly 14 000 entities of 29 categories, such as genes, diseases, chemicals, species, vaccines and proteins, from nearly 8000 anthrax-related biomedical publications and extracted 281 categories of association relationships among the entities. We curated an Anthrax-related Entities Dictionary and an Anthrax Ontology. We formed the Anthrax Knowledge Graph (AnthraxKG), containing more than 6000 nodes, 6000 edges and 32 000 properties. An interactive, visualized Anthrax Knowledge Portal (AnthraxKP) was also developed based on AnthraxKG by using Web technology. AnthraxKP provides rich and authentic relevant knowledge in many forms, which can help researchers carry out research more efficiently.
Database URL: AnthraxKP permits users to query and download data at http://139.224.212.120:18095/.
Affiliation(s)
- Baiyang Feng
  - College of Computer and Information Engineering, Inner Mongolia Agricultural University, Erdos East Street No. 29, Hohhot 010011, China
  - Inner Mongolia Autonomous Region Key Laboratory of Big Data Research and Application for Agriculture and Animal Husbandry, Zhaowuda Road No. 306, Hohhot 010018, China
- Jing Gao
  - College of Computer and Information Engineering, Inner Mongolia Agricultural University, Erdos East Street No. 29, Hohhot 010011, China
  - Inner Mongolia Autonomous Region Key Laboratory of Big Data Research and Application for Agriculture and Animal Husbandry, Zhaowuda Road No. 306, Hohhot 010018, China
  - Inner Mongolia Autonomous Region Big Data Center, Chilechuan Street No. 1, Hohhot 010091, China
|
23
|
Vo TH, Nguyen NTK, Kha QH, Le NQK. On the road to explainable AI in drug-drug interactions prediction: A systematic review. Comput Struct Biotechnol J 2022; 20:2112-2123. [PMID: 35832629 PMCID: PMC9092071 DOI: 10.1016/j.csbj.2022.04.021] [Citation(s) in RCA: 35] [Impact Index Per Article: 17.5] [Received: 03/12/2022] [Revised: 04/15/2022] [Accepted: 04/15/2022] [Indexed: 12/26/2022] Open
Abstract
Over the past decade, polypharmacy has been common in the treatment of multiple diseases. However, unwanted drug-drug interactions (DDIs) that might cause unexpected adverse drug events (ADEs) in multi-regimen therapy remain a significant issue. Since artificial intelligence (AI) is ubiquitous today, many AI prediction models have been developed to predict DDIs to support clinicians in pharmacotherapy-related decisions. However, even though DDI prediction models have great potential for assisting physicians in polypharmacy decisions, there are still concerns regarding the reliability of AI models due to their black-box nature. Building AI models with explainable mechanisms can augment their transparency to address this issue. Explainable AI (XAI) promotes safety and clarity by showing how decisions are made in AI models, especially in critical tasks like DDI prediction. This review provides a comprehensive overview of AI-based DDI prediction, including the publicly available sources for AI-DDI studies, the methods used in data manipulation and feature preprocessing, the modeling methods, and the XAI mechanisms that promote trust in AI. Limitations and future directions of XAI in DDIs are also discussed.
Affiliation(s)
- Thanh Hoa Vo
  - Master Program in Clinical Genomics and Proteomics, College of Pharmacy, Taipei Medical University, Taipei 110, Taiwan
- Ngan Thi Kim Nguyen
  - School of Nutrition and Health Sciences, College of Nutrition, Taipei Medical University, Taipei 11031, Taiwan
- Quang Hien Kha
  - International Master/Ph.D. Program in Medicine, College of Medicine, Taipei Medical University, Taipei 110, Taiwan
- Nguyen Quoc Khanh Le
  - Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei 106, Taiwan
  - Research Center for Artificial Intelligence in Medicine, Taipei Medical University, Taipei 106, Taiwan
  - Translational Imaging Research Center, Taipei Medical University Hospital, Taipei 110, Taiwan
|
24
|
Xiong Y, Peng H, Xiang Y, Wong KC, Chen Q, Yan J, Tang B. Leveraging Multi-source Knowledge for Chinese Clinical Named Entity Recognition via Relational Graph Convolutional Network. J Biomed Inform 2022; 128:104035. [PMID: 35217186 DOI: 10.1016/j.jbi.2022.104035] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 11/22/2021] [Revised: 02/04/2022] [Accepted: 02/18/2022] [Indexed: 11/29/2022]
Abstract
OBJECTIVE External knowledge, such as a lexicon of Chinese words and a domain knowledge graph (KG) of concepts, has recently been adopted to improve the performance of machine learning methods for named entity recognition (NER), as it can provide additional information beyond context. However, most existing studies consider knowledge from only one source (i.e., either a lexicon or a knowledge graph), each in a different way, and consider lexicon words or KG concepts independently of their boundaries. In this paper, we focus on leveraging multi-source knowledge in a unified manner where lexicon words or KG concepts are well combined with their boundaries for Chinese Clinical NER (CNER). MATERIAL AND METHODS We propose a novel method based on a relational graph convolutional network (RGCN), called MKRGCN, to utilize multi-source knowledge in a unified manner for CNER. For any sentence, a relational graph based on words or concepts in each knowledge source is constructed, where lexicon words or KG concepts appearing in the sentence are linked to the containing tokens with the boundary information of the lexicon words or KG concepts. RGCN is used to model all relational graphs constructed from multi-source knowledge, and the representations of tokens from multi-source knowledge are integrated into the context representations of tokens via an attention mechanism. Based on the knowledge-enhanced representations of tokens, we deploy a conditional random field (CRF) layer for named entity label prediction. In this study, a lexicon of words and a medical knowledge graph are used as knowledge sources for CNER. RESULTS Our proposed method achieves the best performance on CCKS2017 and CCKS2018 in Chinese with F1-scores of 91.88% and 89.91%, respectively, significantly outperforming existing methods. The extended experiments on NCBI-Disease and BC2GM in English also prove the effectiveness of our method when only considering one knowledge source via RGCN. CONCLUSION The MKRGCN model can effectively integrate knowledge from the external lexicon and knowledge graph for CNER and has the potential to be applied to English NER.
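The graph construction step described in the abstract above (lexicon words linked to their containing tokens together with boundary information) can be illustrated with a minimal sketch. The relation labels `B`, `I`, `E` and `S`, the function name `build_lexicon_graph`, and the toy lexicon are assumptions made for illustration; the abstract does not state which edge types MKRGCN actually uses, and the RGCN, attention and CRF layers are not shown.

```python
from typing import List, Set, Tuple


def build_lexicon_graph(
    tokens: List[str], lexicon: Set[str], max_len: int = 4
) -> List[Tuple[str, str, int]]:
    """Return edges (word, relation, token_index) for every lexicon match.

    relation encodes where the token sits inside the matched word:
    "B" (begin), "I" (inside), "E" (end), or "S" (single-token word).
    These labels are a hypothetical encoding of "boundary information".
    """
    edges = []
    n = len(tokens)
    for start in range(n):
        # Try every candidate word of up to max_len tokens starting here.
        for end in range(start + 1, min(n, start + max_len) + 1):
            word = "".join(tokens[start:end])
            if word not in lexicon:
                continue
            for i in range(start, end):
                if end - start == 1:
                    rel = "S"
                elif i == start:
                    rel = "B"
                elif i == end - 1:
                    rel = "E"
                else:
                    rel = "I"
                edges.append((word, rel, i))
    return edges


if __name__ == "__main__":
    # Character-level tokens, as is common for Chinese NER.
    # Toy sentence: "diabetes patient"; toy lexicon: "diabetes", "patient".
    for edge in build_lexicon_graph(list("糖尿病患者"), {"糖尿病", "患者"}):
        print(edge)
```

A second knowledge source (for example, KG concepts) would yield a second edge list of the same shape, so each source can be treated as its own relational graph over the same token nodes.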
Affiliation(s)
- Ying Xiong
- Department of Computer Science, Harbin Institute of Technology, Shenzhen, China; Peng Cheng Laboratory, Shenzhen, China
| | - Hao Peng
- Department of Computer Science, Harbin Institute of Technology, Shenzhen, China
| | | | - Ka-Chun Wong
- Department of Computer Science, City University of Hong Kong, Hong Kong, China
| | - Qingcai Chen
- Department of Computer Science, Harbin Institute of Technology, Shenzhen, China; Peng Cheng Laboratory, Shenzhen, China
| | - Jun Yan
- Yidu Cloud (Beijing) Technology Co., Ltd, Beijing, China
| | - Buzhou Tang
- Department of Computer Science, Harbin Institute of Technology, Shenzhen, China; Peng Cheng Laboratory, Shenzhen, China
|
25
|
Pourreza Shahri M, Kahanda I. Deep semi-supervised learning ensemble framework for classifying co-mentions of human proteins and phenotypes. BMC Bioinformatics 2021; 22:500. [PMID: 34656098 PMCID: PMC8520253 DOI: 10.1186/s12859-021-04421-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/14/2021] [Accepted: 10/04/2021] [Indexed: 11/13/2022] Open
Abstract
Background Identifying human protein-phenotype relationships has attracted researchers in bioinformatics and biomedical natural language processing due to its importance in uncovering rare and complex diseases. Since experimental validation of protein-phenotype associations is prohibitive, automated tools capable of accurately extracting these associations from biomedical text are in high demand. However, while the manual annotation of protein-phenotype co-mentions required for training such models is highly resource-consuming, extracting millions of unlabeled co-mentions is straightforward. Results In this study, we propose a novel deep semi-supervised ensemble framework that combines deep neural networks, semi-supervised learning, and ensemble learning for classifying human protein-phenotype co-mentions with the help of unlabeled data. This framework can incorporate an extensive collection of unlabeled sentence-level co-mentions of human proteins and phenotypes with a small labeled dataset to enhance overall performance. We develop PPPredSS, a prototype of our proposed semi-supervised framework that combines sophisticated language models, convolutional networks, and recurrent networks. Our experimental results demonstrate that the proposed approach provides new state-of-the-art performance in classifying human protein-phenotype co-mentions, outperforming other supervised and semi-supervised counterparts. Furthermore, we highlight the utility of PPPredSS in powering a curation assistant system through case studies involving a group of biologists. Conclusions This article presents a novel approach for human protein-phenotype co-mention classification based on deep, semi-supervised, and ensemble learning. The insights and findings from this work have implications for biomedical researchers, biocurators, and the text mining community working on biomedical relationship extraction.
Affiliation(s)
| | - Indika Kahanda
- School of Computing, University of North Florida, Jacksonville, USA.
|
26
|
Protein-protein interaction relation extraction based on multigranularity semantic fusion. J Biomed Inform 2021; 123:103931. [PMID: 34628063 DOI: 10.1016/j.jbi.2021.103931] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 06/09/2021] [Revised: 09/12/2021] [Accepted: 10/04/2021] [Indexed: 01/02/2023]
Abstract
Extracting semantic relationships between biomedical entities in a sentence is a typical task in biomedical information extraction. Because a sentence usually contains several named entities, it is important to learn the global semantics of the sentence to support relation extraction. In related work, many strategies have been proposed to encode a sentence representation relevant to the named entities under consideration. Despite this success, the semantics of words are expressed at multiple levels of granularity and also depend heavily on the local semantics of a sentence. In this paper, we propose a multigranularity semantic fusion method to support biomedical relation extraction. In this method, a Transformer is adopted to embed the words of a sentence into distributed representations, which is effective for encoding the global semantics of a sentence. Meanwhile, a multichannel strategy is applied to encode the local semantics of words, which enables the same word to have different representations in a sentence. Both global and local semantic representations are fused to enhance the discriminability of the neural network. To evaluate our method, experiments are conducted on five standard PPI corpora (AImed, BioInfer, IEPA, HPRD50, and LLL), achieving F1-scores of 83.4%, 89.9%, 81.2%, 84.5%, and 92.5%, respectively. The results show that multigranular semantic fusion helps protein-protein interaction relation extraction.
|
27
|
Huang K, Xiao C, Glass LM, Critchlow CW, Gibson G, Sun J. Machine learning applications for therapeutic tasks with genomics data. Patterns (New York, N.Y.) 2021; 2:100328. [PMID: 34693370 PMCID: PMC8515011 DOI: 10.1016/j.patter.2021.100328] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 12/29/2022]
Abstract
Thanks to the increasing availability of genomics and other biomedical data, many machine learning algorithms have been proposed for a wide range of therapeutic discovery and development tasks. In this survey, we review the literature on machine learning applications for genomics through the lens of therapeutic development. We investigate the interplay among genomics, compounds, proteins, electronic health records, cellular images, and clinical texts. We identify 22 machine learning in genomics applications that span the whole therapeutics pipeline, from discovering novel targets, personalizing medicine, developing gene-editing tools, all the way to facilitating clinical trials and post-market studies. We also pinpoint seven key challenges in this field with potentials for expansion and impact. This survey examines recent research at the intersection of machine learning, genomics, and therapeutic development.
Affiliation(s)
- Kexin Huang
- Department of Computer Science, Stanford University, Stanford, CA 94305, USA
| | - Cao Xiao
- Amplitude, San Francisco, CA 94105, USA
| | - Lucas M. Glass
- Analytics Center of Excellence, IQVIA, Cambridge, MA 02139, USA
| | | | - Greg Gibson
- Center for Integrative Genomics, Georgia Institute of Technology, Atlanta, GA 30332, USA
| | - Jimeng Sun
- Computer Science Department and Carle's Illinois College of Medicine, University of Illinois at Urbana-Champaign, Urbana, IL 61820, USA
|
28
|
Conceição SIR, Couto FM. Text Mining for Building Biomedical Networks Using Cancer as a Case Study. Biomolecules 2021; 11:biom11101430. [PMID: 34680062 PMCID: PMC8533101 DOI: 10.3390/biom11101430] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 09/09/2021] [Revised: 09/24/2021] [Accepted: 09/27/2021] [Indexed: 12/15/2022] Open
Abstract
In the assembly of biological networks it is important to provide reliable interactions in order to obtain the most accurate possible representation of real-life systems. Commonly, the data used to build a network comes from diverse high-throughput assays; however, most of the interaction data is available through the scientific literature. This has become a challenge with the notable increase in scientific literature being published, as it is hard for human curators to track all recent discoveries without efficient tools that help them identify these interactions automatically. This challenge can be overcome by using text mining approaches, which are capable of extracting knowledge from scientific documents. One of the most important tasks in text mining for biological network building is relation extraction, which identifies relations between the entities of interest. Many interaction databases already use text mining systems, and the development of these tools will lead to more reliable networks, as well as the possibility of personalizing the networks by selecting the desired relations. This review focuses on different approaches to automatic information extraction from biomedical text that can be used to enhance existing networks or create new ones, such as state-of-the-art deep learning approaches, using cancer as a case study.
|
29
|
Kanjirangat V, Rinaldi F. Enhancing Biomedical Relation Extraction with Transformer Models using Shortest Dependency Path Features and Triplet Information. J Biomed Inform 2021; 122:103893. [PMID: 34481058 DOI: 10.1016/j.jbi.2021.103893] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/11/2020] [Revised: 08/17/2021] [Accepted: 08/22/2021] [Indexed: 10/20/2022]
Abstract
Entity relation extraction plays an important role in the biomedical, healthcare, and clinical research areas. Recently, pre-trained models based on transformer architectures and their variants have shown remarkable performances in various natural language processing tasks. Most of these variants were based on slight modifications in the architectural components, representation schemes and augmenting data using distant supervision methods. In distantly supervised methods, one of the main challenges is pruning out noisy samples. A similar situation can arise when the training samples are not directly available but need to be constructed from the given dataset. The BioCreative V Chemical Disease Relation (CDR) task provides a dataset that does not explicitly offer mention-level gold annotations and hence replicates the above scenario. Selecting the representative sentences from the given abstract or document text that could convey a potential entity relationship becomes essential. Most of the existing methods in literature propose to either consider the entire text or all the sentences which contain the entity mentions. This could be a computationally expensive and time consuming approach. This paper presents a novel approach to handle such scenarios, specifically in biomedical relation extraction. We propose utilizing the Shortest Dependency Path (SDP) features for constructing data samples by pruning out noisy information and selecting the most representative samples for model learning. We also utilize triplet information in model learning using the biomedical variant of BERT, viz., BioBERT. The problem is represented as a sentence pair classification task using the sentence and the entity-relation pair as input. We analyze the approach on both intra-sentential and inter-sentential relations in the CDR dataset. The proposed approach that utilizes the SDP and triplet features presents promising results, specifically on the inter-sentential relation extraction task. 
We make the code used for this work publicly available on GitHub.
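To make the Shortest Dependency Path (SDP) idea in the abstract above concrete, the following is a minimal sketch that finds the SDP between two entity tokens with a breadth-first search over an undirected view of the dependency tree. The head-index encoding, the function name `shortest_dependency_path`, and the hand-written parse are assumptions for illustration; this is not the authors' released code, and a real pipeline would obtain the heads from a dependency parser.

```python
from collections import deque
from typing import List


def shortest_dependency_path(heads: List[int], source: int, target: int) -> List[int]:
    """Return token indices on the shortest path from source to target.

    heads[i] is the index of token i's syntactic head, or -1 for the root.
    Dependency arcs are treated as undirected edges.
    """
    adjacency = {i: set() for i in range(len(heads))}
    for child, head in enumerate(heads):
        if head >= 0:
            adjacency[child].add(head)
            adjacency[head].add(child)

    # Breadth-first search, remembering each node's predecessor.
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            break
        for neighbour in sorted(adjacency[node]):
            if neighbour not in parent:
                parent[neighbour] = node
                queue.append(neighbour)

    if target not in parent:
        return []  # no path, e.g. the two tokens lie in different trees

    path = []
    node = target
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]


if __name__ == "__main__":
    # Hypothetical, hand-written parse of a toy sentence.
    tokens = ["Aspirin", "may", "induce", "severe", "asthma"]
    heads = [2, 2, -1, 4, 2]  # "induce" is the root
    path = shortest_dependency_path(heads, 0, 4)
    print([tokens[i] for i in path])
```

Keeping only the tokens on this path is one way to prune words that are unlikely to bear on the candidate chemical-disease pair.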
Affiliation(s)
- Vani Kanjirangat
- Istituto Dalle Molle di Studi sull'Intelligenza Artificiale USI/SUPSI, Lugano, Switzerland; Swiss Institute of Bioinformatics, Lausanne, Switzerland.
| | - Fabio Rinaldi
- Istituto Dalle Molle di Studi sull'Intelligenza Artificiale USI/SUPSI, Lugano, Switzerland; Department of Quantitative Biomedicine, University of Zurich, Zurich, Switzerland; Swiss Institute of Bioinformatics, Lausanne, Switzerland.
|
30
|
Hong G, Kim Y, Choi Y, Song M. BioPREP: Deep learning-based predicate classification with SemMedDB. J Biomed Inform 2021; 122:103888. [PMID: 34411707 DOI: 10.1016/j.jbi.2021.103888] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/04/2021] [Revised: 06/03/2021] [Accepted: 08/13/2021] [Indexed: 11/16/2022]
Abstract
When it comes to inferring relations between entities in biomedical texts, Relation Extraction (RE) has become key to biomedical information extraction. Although previous studies focused on rule-based and machine learning-based approaches, these methods were inefficient because of the demanding amount of feature processing and yielded relatively low accuracy. Some existing biomedical relation extraction tools are based on neural networks. Nonetheless, they rarely analyze possible causes of the difference in accuracy among predicates. Also, there have not been enough biomedical datasets structured for predicate classification. Accordingly, we set our research goals as follows: constructing a large-scale training dataset, namely Biomedical Predicate Relation-extraction with Entity-filtering by PKDE4J (BioPREP), based on SemMedDB with PKDE4J as an entity-filtering tool, and evaluating the performance of each neural network-based algorithm on the structured dataset. We then analyzed our model's performance in depth by grouping predicates into semantic clusters. Comprehensive experiments showed that the BioBERT-based model outperformed other models for predicate classification. The suggested model achieved an F1-score of 0.846 when BioBERT was loaded as the pre-trained model and 0.840 when SciBERT weights were loaded. Moreover, the semantic cluster analysis showed that sentences containing key phrases, such as comparison verb + 'than', were classified better.
Affiliation(s)
- Gibong Hong
- Department of Digital Analytics, Yonsei University, 50 Yonsei-ro, Seodaemun-gu, Seoul 03722, Republic of Korea; Department of Library and Information Science, Yonsei University, 50 Yonsei-ro, Seodaemun-gu, Seoul 03722, Republic of Korea
| | - Yuheun Kim
- Department of Digital Analytics, Yonsei University, 50 Yonsei-ro, Seodaemun-gu, Seoul 03722, Republic of Korea; Department of Library and Information Science, Yonsei University, 50 Yonsei-ro, Seodaemun-gu, Seoul 03722, Republic of Korea
| | - YeonJung Choi
- Department of Digital Analytics, Yonsei University, 50 Yonsei-ro, Seodaemun-gu, Seoul 03722, Republic of Korea; Department of Library and Information Science, Yonsei University, 50 Yonsei-ro, Seodaemun-gu, Seoul 03722, Republic of Korea
| | - Min Song
- Department of Digital Analytics, Yonsei University, 50 Yonsei-ro, Seodaemun-gu, Seoul 03722, Republic of Korea; Department of Library and Information Science, Yonsei University, 50 Yonsei-ro, Seodaemun-gu, Seoul 03722, Republic of Korea.
|
31
|
Yang X, Wu C, Nenadic G, Wang W, Lu K. Mining a stroke knowledge graph from literature. BMC Bioinformatics 2021; 22:387. [PMID: 34325669 PMCID: PMC8319697 DOI: 10.1186/s12859-021-04292-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 06/13/2021] [Accepted: 07/06/2021] [Indexed: 01/01/2023] Open
Abstract
BACKGROUND Stroke has an acute onset and a high mortality rate, making it one of the most fatal diseases worldwide. Its underlying biology and treatments have been widely studied both in "Western" biomedicine and in Traditional Chinese Medicine (TCM). However, these two approaches are often studied and reported in isolation, both in the literature and in associated databases. RESULTS To aid research in finding effective prevention methods and treatments, we integrated knowledge from the literature and a number of databases (e.g. CID, TCMID, ETCM). We employed a suite of biomedical text mining (i.e. named-entity) approaches to identify mentions of genes, diseases, drugs, chemicals, symptoms, Chinese herbs and patent medicines, etc. in a large set of stroke papers from both biomedical and TCM domains. Then, using a combination of a rule-based approach with a pre-trained BioBERT model, we extracted and classified links and relationships among stroke-related entities as expressed in the literature. We construct StrokeKG, a knowledge graph that includes almost 46 k nodes of nine types and 157 k links of 30 types, connecting diseases, genes, symptoms, drugs, pathways, herbs, chemicals, ingredients and patent medicines. CONCLUSIONS StrokeKG can provide practical and reliable stroke-related knowledge to support stroke research, such as exploring new research directions and ideas for drug repurposing and discovery. We make StrokeKG freely available at http://114.115.208.144:7474/browser/ (Please click "Connect" directly) and the source structured data for stroke at https://github.com/yangxi1016/Stroke.
Affiliation(s)
- Xi Yang
- College of Computer, National University of Defence Technology, Changsha, 410073 China
- State Key Laboratory of High-Performance Computing, National University of Defence Technology, Changsha, 410073 China
- Department of Computer Science, University of Manchester, Manchester, M13 9PL UK
| | - Chengkun Wu
- State Key Laboratory of High-Performance Computing, National University of Defence Technology, Changsha, 410073 China
| | - Goran Nenadic
- Department of Computer Science, University of Manchester, Manchester, M13 9PL UK
| | - Wei Wang
- College of Computer, National University of Defence Technology, Changsha, 410073 China
| | - Kai Lu
- College of Computer, National University of Defence Technology, Changsha, 410073 China
|
32
|
Fei H, Zhang Y, Ren Y, Ji D. A span-graph neural model for overlapping entity relation extraction in biomedical texts. Bioinformatics 2021; 37:1581-1589. [PMID: 33245108 DOI: 10.1093/bioinformatics/btaa993] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 03/13/2020] [Revised: 10/25/2020] [Accepted: 11/17/2020] [Indexed: 01/14/2023] Open
Abstract
MOTIVATION Entity relation extraction is one of the fundamental tasks in biomedical text mining and is usually solved with models from natural language processing. Compared with traditional pipeline methods, joint methods can avoid error propagation from entity to relation, giving better performance. However, the existing joint models are built upon a sequential scheme and fail to detect overlapping entities and relations, which are ubiquitous in biomedical texts. The main reason is that sequential models have relatively weaker power in capturing long-range dependencies, which results in lower performance in encoding longer sentences. In this article, we propose a novel span-graph neural model for jointly extracting overlapping entity relations in biomedical texts. Our model treats the task as relation triplet prediction, and builds the entity-graph by enumerating possible candidate entity spans. The proposed model captures the relationship between the correlated entities via a span scorer and a relation scorer, respectively, and finally outputs all valid relational triplets. RESULTS Experimental results on two biomedical entity relation extraction tasks, including drug-drug interaction detection and protein-protein interaction detection, show that the proposed method outperforms previous models by a substantial margin, demonstrating the effectiveness of the span-graph-based method for overlapping relation extraction in biomedical texts. Further in-depth analysis shows that our model is more effective in capturing long-range dependencies for relation extraction than the sequential models. AVAILABILITY AND IMPLEMENTATION Related codes are made publicly available at http://github.com/Baxelyne/SpanBioER.
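The span enumeration step mentioned in the abstract above can be sketched in a few lines. This is only an illustration of why span-based candidates can represent overlapping entities; the names `enumerate_spans` and `overlaps` and the `max_width` cap are assumptions, and the span scorer and relation scorer of the actual model are not reproduced here.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open token range [start, end)


def enumerate_spans(tokens: List[str], max_width: int = 3) -> List[Span]:
    """List every contiguous token span of at most max_width tokens."""
    spans = []
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_width, len(tokens)) + 1):
            spans.append((start, end))
    return spans


def overlaps(a: Span, b: Span) -> bool:
    """True when two half-open spans share at least one token."""
    return a[0] < b[1] and b[0] < a[1]


if __name__ == "__main__":
    tokens = ["acute", "kidney", "injury", "marker"]
    candidates = enumerate_spans(tokens, max_width=2)
    print(candidates)
    # Unlike one-label-per-token tagging, overlapping candidates coexist.
    print(overlaps((0, 2), (1, 3)))
```

Because every span is a separate candidate node, two entities that share tokens can both be kept and later linked by relations, which a strictly sequential tagging scheme cannot express.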
Affiliation(s)
- Hao Fei
- School of Cyber Science and Engineering, Wuhan University, Wuhan 430072, China
| | - Yue Zhang
- School of Engineering, Westlake University, Hangzhou 310024, China
| | - Yafeng Ren
- Laboratory of Language and Artificial Intelligence, Guangdong University of Foreign Studies, Guangzhou 510420, China
| | - Donghong Ji
- School of Cyber Science and Engineering, Wuhan University, Wuhan 430072, China
|
33
|
Zhao D, Wang J, Lin H, Wang X, Yang Z, Zhang Y. Biomedical cross-sentence relation extraction via multihead attention and graph convolutional networks. Appl Soft Comput 2021. [DOI: 10.1016/j.asoc.2021.107230] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Indexed: 11/25/2022]
|
34
|
Azer K, Kaddi CD, Barrett JS, Bai JPF, McQuade ST, Merrill NJ, Piccoli B, Neves-Zaph S, Marchetti L, Lombardo R, Parolo S, Immanuel SRC, Baliga NS. History and Future Perspectives on the Discipline of Quantitative Systems Pharmacology Modeling and Its Applications. Front Physiol 2021; 12:637999. [PMID: 33841175 PMCID: PMC8027332 DOI: 10.3389/fphys.2021.637999] [Citation(s) in RCA: 46] [Impact Index Per Article: 15.3] [Received: 12/07/2020] [Accepted: 01/25/2021] [Indexed: 12/24/2022] Open
Abstract
Mathematical biology and pharmacology models have a long and rich history in the fields of medicine and physiology, impacting our understanding of disease mechanisms and the development of novel therapeutics. With an increased focus on the pharmacology application of system models and the advances in data science spanning mechanistic and empirical approaches, there is a significant opportunity and promise to leverage these advancements to enhance the development and application of the systems pharmacology field. In this paper, we will review milestones in the evolution of mathematical biology and pharmacology models, highlight some of the gaps and challenges in developing and applying systems pharmacology models, and provide a vision for an integrated strategy that leverages advances in adjacent fields to overcome these challenges.
Affiliation(s)
- Karim Azer
- Quantitative Sciences, Bill and Melinda Gates Medical Research Institute, Cambridge, MA, United States
| | - Chanchala D. Kaddi
- Quantitative Sciences, Bill and Melinda Gates Medical Research Institute, Cambridge, MA, United States
| | | | - Jane P. F. Bai
- Office of Clinical Pharmacology, Center for Drug Evaluation and Research, U.S. Food and Drug Administration, Silver Spring, MD, United States
| | - Sean T. McQuade
- Center for Computational and Integrative Biology, Rutgers University, Camden, NJ, United States
| | - Nathaniel J. Merrill
- Center for Computational and Integrative Biology, Rutgers University, Camden, NJ, United States
| | - Benedetto Piccoli
- Department of Mathematical Sciences and Center for Computational and Integrative Biology, Rutgers University, Camden, NJ, United States
| | - Susana Neves-Zaph
- Translational Disease Modeling, Data and Data Science, Sanofi, Bridgewater, NJ, United States
| | - Luca Marchetti
- Fondazione the Microsoft Research – University of Trento Centre for Computational and Systems Biology (COSBI), Rovereto, Italy
| | - Rosario Lombardo
- Fondazione the Microsoft Research – University of Trento Centre for Computational and Systems Biology (COSBI), Rovereto, Italy
| | - Silvia Parolo
- Fondazione the Microsoft Research – University of Trento Centre for Computational and Systems Biology (COSBI), Rovereto, Italy
|
35
|
Lee CY, Chen YP. Descriptive prediction of drug side‐effects using a hybrid deep learning model. Int J Intell Syst 2021. [DOI: 10.1002/int.22389] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Indexed: 11/12/2022]
Affiliation(s)
- Chun Yen Lee
- Department of Computer Science and Information Technology La Trobe University Melbourne Australia
| | - Yi‐Ping Phoebe Chen
- Department of Computer Science and Information Technology La Trobe University Melbourne Australia
|
36
|
Drug-Drug interaction extraction using a position and similarity fusion-based attention mechanism. J Biomed Inform 2021; 115:103707. [PMID: 33571676 DOI: 10.1016/j.jbi.2021.103707] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 06/28/2020] [Revised: 12/21/2020] [Accepted: 02/03/2021] [Indexed: 11/20/2022]
Abstract
Taking multiple drugs at the same time can increase or decrease each drug's effectiveness or cause side effects. These drug-drug interactions (DDIs) may lead to an increase in the cost of medical care or even threaten patients' health and life. Thus, automatic extraction of DDIs is an important research field to improve patient safety. In this work, a deep neural network model is presented for extracting DDIs from medical texts. This model utilizes a novel attention mechanism for improving the discrimination of important words from others, based on the word similarities and their relative position with respect to candidate drugs. This approach is applied for calculating the attention weights for the outputs of a bi-directional long short-term memory (Bi-LSTM) model in the deep network structure before detecting the type of DDIs. The proposed method was tested on the standard DDI Extraction 2013 dataset and according to experimental results was able to achieve an F1-Score of 78.30 which is comparable to the best results reported for the state-of-the-art methods. A detailed study of the proposed method and its components is also provided.
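As a rough illustration of attention weights that depend on both word similarity and relative position with respect to the candidate drugs, consider the sketch below. The abstract does not give the fusion formula, so everything here is an assumption: the score is taken as the sum of a cosine-similarity term and an inverse-distance term, followed by a softmax, and the toy vectors stand in for Bi-LSTM outputs.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fused_attention(vectors: List[Sequence[float]], drug_positions: List[int]) -> List[float]:
    """Attention weight per token from a hypothetical similarity + position score."""
    scores = []
    for i, vec in enumerate(vectors):
        # Similarity term: closest match to either candidate drug vector.
        similarity = max(cosine(vec, vectors[p]) for p in drug_positions)
        # Position term: decays with distance to the nearest candidate drug.
        position = 1.0 / (1 + min(abs(i - p) for p in drug_positions))
        scores.append(similarity + position)
    return softmax(scores)


if __name__ == "__main__":
    toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 1.0]]
    print(fused_attention(toy_vectors, drug_positions=[0, 3]))
```

With this scoring, tokens that are both near a candidate drug and similar to it receive the largest weights; the weighted sum of the encoder outputs would then feed the DDI type classifier.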
|
37
|
Chasseray Y, Barthe-Delanoë AM, Négny S, Le Lann JM. A generic metamodel for data extraction and generic ontology population. J Inf Sci 2021. [DOI: 10.1177/0165551521989641] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 11/17/2022]
Abstract
As the next step in the development of intelligent computing systems is the addition of human expertise and knowledge, it is a priority to build strong, computable and well-documented knowledge bases. Ontologies partially respond to this challenge by providing formalisms for knowledge representation. However, one major remaining task is the population of these ontologies with concrete application. Based on Model-Driven Engineering principles, a generic metamodel for the extraction of heterogeneous data is presented in this article. The metamodel has been designed with two objectives, namely (1) the need for genericity regarding the source of collected pieces of knowledge and (2) the intent to stick to a structure close to an ontological structure. In addition, an example instantiation of the metamodel for textual data in the chemistry domain is given, along with insight into how this metamodel could be integrated into a larger automated, domain-independent ontology population framework.
Affiliation(s)
- Yohann Chasseray
- Laboratoire de Génie Chimique, Université de Toulouse, CNRS, INPT, UPS, Toulouse, France
| | | | - Stéphane Négny
- Laboratoire de Génie Chimique, Université de Toulouse, CNRS, INPT, UPS, Toulouse, France
| | - Jean-Marc Le Lann
- Laboratoire de Génie Chimique, Université de Toulouse, CNRS, INPT, UPS, Toulouse, France
|
38
|
Zhang L, Hu J, Xu Q, Li F, Rao G, Tao C. A semantic relationship mining method among disorders, genes, and drugs from different biomedical datasets. BMC Med Inform Decis Mak 2020; 20:283. [PMID: 33317518 PMCID: PMC7734713 DOI: 10.1186/s12911-020-01274-z] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 09/21/2020] [Accepted: 09/22/2020] [Indexed: 12/16/2022] Open
Abstract
BACKGROUND Semantic web technology has been applied widely in the biomedical informatics field. Large numbers of biomedical datasets are available online in the resource description framework (RDF) format. Semantic relationship mining among genes, disorders, and drugs is widely used in, for example, precision medicine and drug repositioning. However, most existing studies have focused on a single dataset. It is not easy to find the most current disorder-gene-drug relationships because they are distributed across heterogeneous datasets. How to mine their semantic relationships from different biomedical datasets is an important issue. METHODS First, a variety of biomedical datasets were converted into RDF triple data; then, multisource biomedical datasets were integrated into a storage system using a data integration algorithm. Second, nine query patterns among genes, disorders, and drugs from different biomedical datasets were designed. Third, the gene-disorder-drug semantic relationship mining algorithm is presented. This algorithm can query the relationships among various entities from different datasets. RESULTS AND CONCLUSIONS We focused on mining the putative and the most current disorder-gene-drug relationships about Parkinson's disease (PD). The results demonstrate that our method has significant advantages in mining and integrating multisource heterogeneous biomedical datasets. Twenty-five new relationships among the genes, disorders, and drugs were mined from four different datasets. The query results showed that most of them came from different datasets. The precision of the method increased by 2.51% compared to that of the multisource linked open data fusion method presented in the 4th International Workshop on Semantics-Powered Data Mining and Analytics (SEPDA 2019). Moreover, the number of query results increased by 7.7%, and the number of correct queries increased by 9.5%.
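The three steps in METHODS above (convert to triples, integrate, query by pattern) can be sketched with plain Python tuples standing in for RDF triples. The predicate names `associatedWith` and `targets`, the placeholder identifiers, and the single join shown are hypothetical; the paper's nine query patterns and its integration algorithm are not reproduced, and a real system would use an RDF store queried with SPARQL.

```python
from typing import Iterable, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def integrate(*datasets: Iterable[Triple]) -> Set[Triple]:
    """Merge several triple collections, dropping exact duplicates."""
    merged: Set[Triple] = set()
    for dataset in datasets:
        merged.update(dataset)
    return merged


def match(triples: Iterable[Triple], pattern: Tuple[Optional[str], ...]) -> List[Triple]:
    """Return triples matching the pattern; None acts as a wildcard."""
    s, p, o = pattern
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def disorder_gene_drug(triples: Iterable[Triple], disorder: str) -> List[Triple]:
    """Join gene-disorder and drug-gene triples into (disorder, gene, drug) rows."""
    triples = list(triples)
    rows = set()
    for gene, _, _ in match(triples, (None, "associatedWith", disorder)):
        for drug, _, _ in match(triples, (None, "targets", gene)):
            rows.add((disorder, gene, drug))
    return sorted(rows)


if __name__ == "__main__":
    # Placeholder identifiers only; no real biomedical facts are implied.
    gene_dataset = [("GENE_A", "associatedWith", "DISORDER_1"),
                    ("GENE_B", "associatedWith", "DISORDER_1")]
    drug_dataset = [("DRUG_X", "targets", "GENE_A")]
    store = integrate(gene_dataset, drug_dataset)
    print(disorder_gene_drug(store, "DISORDER_1"))
```

The join only succeeds because the two datasets were merged first, which mirrors the abstract's point that such relationships are spread across separate sources.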
Affiliation(s)
- Li Zhang
- School of Economics and Management, Tianjin University of Science and Technology, Tianjin, 300457 China
| | - Jiamei Hu
- School of Economics and Management, Tianjin University of Science and Technology, Tianjin, 300457 China
| | - Qianzhi Xu
- School of Economics and Management, Tianjin University of Science and Technology, Tianjin, 300457 China
| | - Fang Li
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, 7000 Fannin St Suite 600, Houston, TX 77030 USA
| | - Guozheng Rao
- College of Intelligence and Computing, Tianjin University, Tianjin, 300350 China
- Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin, 300350 China
| | - Cui Tao
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, 7000 Fannin St Suite 600, Houston, TX 77030 USA
|
39
|
DECAB-LSTM: Deep Contextualized Attentional Bidirectional LSTM for cancer hallmark classification. Knowl Based Syst 2020. [DOI: 10.1016/j.knosys.2020.106486] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Indexed: 01/28/2023]
|
40
|
Wang J, Li M, Diao Q, Lin H, Yang Z, Zhang Y. Biomedical document triage using a hierarchical attention-based capsule network. BMC Bioinformatics 2020; 21:380. [PMID: 32938366 PMCID: PMC7495737 DOI: 10.1186/s12859-020-03673-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Biomedical document triage is the foundation of biomedical information extraction, which is important to precision medicine. Recently, some neural network-based methods have been proposed to classify biomedical documents automatically. In the biomedical domain, documents are often very long and contain very complicated sentences. However, current methods still find it difficult to capture important features across sentences. RESULTS In this paper, we propose a hierarchical attention-based capsule model for biomedical document triage. The proposed model employs a hierarchical attention mechanism and capsule networks to capture valuable features across sentences and construct a final latent feature representation for a document. We evaluated our model on three public corpora. CONCLUSIONS Experimental results showed that both the hierarchical attention mechanism and capsule networks are helpful in the biomedical document triage task. Our method proved highly competitive with or superior to other state-of-the-art methods.
Affiliation(s)
- Jian Wang
- Dalian University of Technology, The School of Computer Science and Technology, Dalian, 116024 China
| | - Mengying Li
- Dalian University of Technology, The School of Computer Science and Technology, Dalian, 116024 China
| | - Qishuai Diao
- Dalian University of Technology, The School of Computer Science and Technology, Dalian, 116024 China
| | - Hongfei Lin
- Dalian University of Technology, The School of Computer Science and Technology, Dalian, 116024 China
| | - Zhihao Yang
- Dalian University of Technology, The School of Computer Science and Technology, Dalian, 116024 China
| | - YiJia Zhang
- Dalian University of Technology, The School of Computer Science and Technology, Dalian, 116024 China
| |
|
41
|
Perera N, Dehmer M, Emmert-Streib F. Named Entity Recognition and Relation Detection for Biomedical Information Extraction. Front Cell Dev Biol 2020; 8:673. [PMID: 32984300 PMCID: PMC7485218 DOI: 10.3389/fcell.2020.00673] [Citation(s) in RCA: 40] [Impact Index Per Article: 10.0] [Received: 12/12/2019] [Accepted: 07/02/2020] [Indexed: 12/29/2022] Open
Abstract
The number of scientific publications in the literature is steadily growing, containing our knowledge in the biomedical, health, and clinical sciences. Since there is currently no automatic archiving of the obtained results, much of this information remains buried in textual details not readily available for further usage or analysis. For this reason, natural language processing (NLP) and text mining methods are used for information extraction from such publications. In this paper, we review practices for Named Entity Recognition (NER) and Relation Detection (RD), which make it possible, e.g., to identify interactions between proteins and drugs or genes and diseases. This information can be integrated into networks to summarize large-scale details on a particular biomedical or clinical problem, which is then amenable to easy data management and further analysis. Furthermore, we survey novel deep learning methods that have recently been introduced for such tasks.
Affiliation(s)
- Nadeesha Perera
- Predictive Society and Data Analytics Lab, Faculty of Information Technology and Communication Sciences, Tampere University, Tampere, Finland
| | - Matthias Dehmer
- Department of Mechatronics and Biomedical Computer Science, University for Health Sciences, Medical Informatics and Technology (UMIT), Hall in Tirol, Austria
- College of Artificial Intelligence, Nankai University, Tianjin, China
| | - Frank Emmert-Streib
- Predictive Society and Data Analytics Lab, Faculty of Information Technology and Communication Sciences, Tampere University, Tampere, Finland
- Faculty of Medicine and Health Technology, Institute of Biosciences and Medical Technology, Tampere University, Tampere, Finland
| |
|
42
|
Xie J, Jiang J, Wang Y, Guan Y, Guo X. Learning an expandable EMR-based medical knowledge network to enhance clinical diagnosis. Artif Intell Med 2020; 107:101927. [PMID: 32828460 DOI: 10.1016/j.artmed.2020.101927] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 10/23/2018] [Revised: 10/04/2019] [Accepted: 07/02/2020] [Indexed: 01/10/2023]
Abstract
Electronic medical records (EMRs) contain a wealth of knowledge that can be used to assist doctors in making clinical decisions such as disease diagnosis. Constructing a medical knowledge network (MKN) to link medical concepts in EMRs is an effective way to manage this knowledge. The quality of the diagnostic result made by an MKN-based clinical decision support system depends on the accuracy of the medical knowledge and the completeness of the network. However, collecting knowledge is a long-lasting and cumulative process, which means it is hard to construct a complete MKN with limited data. This study was conducted with the objective of developing an expandable EMR-based MKN to enhance capabilities in making an initial clinical diagnosis. A network of symptom-indicate-disease knowledge in 992 Chinese EMRs (CEMRs) was manually constructed as Original-MKN, and an incremental expansion framework was applied to it to obtain an expandable MKN based on new CEMRs. The framework comprised: (1) integrating external knowledge extracted from medical information websites and (2) mining potential knowledge from new EMRs. The framework also adopts a diagnosis-driven learning method to estimate the effectiveness of each piece of knowledge in clinical practice. Experimental results indicate that our expanded MKN achieves a precision of 0.837 at a recall of 0.719 in clinical diagnosis, which outperforms Original-MKN and four classical machine learning methods. Furthermore, both external medical knowledge and potential medical knowledge benefit MKN expansion and disease diagnosis. The proposed incremental expansion framework allows the MKN to keep learning new knowledge.
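The abstract reports precision and recall but no combined score. As a small worked example, the harmonic mean (F1) of the two reported values can be computed as below; the resulting number is derived here from the reported precision and recall and is not a figure stated in the abstract.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract for the expanded MKN.
f1 = f1_score(0.837, 0.719)
print(f"{f1:.3f}")  # prints 0.774
```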
Affiliation(s)
- Jing Xie
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
| | - Jingchi Jiang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
| | - Yehan Wang
- Unisound AI Technology Co., Ltd, Beijing 100096, China
| | - Yi Guan
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China.
| | - Xitong Guo
- School of Management, Harbin Institute of Technology, Harbin 150001, China
| |
|
44
|
A novel machine learning framework for automated biomedical relation extraction from large-scale literature repositories. Nat Mach Intell 2020. [DOI: 10.1038/s42256-020-0189-y] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Indexed: 12/11/2022]
|
45
|
Bio-semantic relation extraction with attention-based external knowledge reinforcement. BMC Bioinformatics 2020; 21:213. [PMID: 32448122 PMCID: PMC7245897 DOI: 10.1186/s12859-020-3540-8] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 05/20/2019] [Accepted: 05/07/2020] [Indexed: 12/13/2022] Open
Abstract
Background Semantic resources such as knowledge bases contain high-quality structured knowledge and therefore require significant effort from domain experts. Using these resources to reinforce information retrieval from unstructured text may further exploit the potential of both the unstructured text resources and the curated knowledge. Results The paper proposes a novel method that uses a deep neural network model adopting prior knowledge to improve performance in the automated extraction of biological semantic relations from the scientific literature. The model is based on a recurrent neural network combining the attention mechanism with the semantic resources, i.e., UniProt and BioModels. Our method is evaluated on the BioNLP and BioCreative corpora, sets of manually annotated biological text. The experiments demonstrate that the method outperforms the current state-of-the-art models, and that the structured semantic information can improve the results of bio-text-mining. Conclusion The experimental results show that our approach can effectively make use of external prior knowledge and improve performance in the protein-protein interaction extraction task. The method should generalize to other types of data, although it has been validated only on biomedical texts.
|
46
|
Lee CY, Chen YPP. Prediction of drug adverse events using deep learning in pharmaceutical discovery. Brief Bioinform 2020; 22:1884-1901. [PMID: 32349125 DOI: 10.1093/bib/bbaa040] [Citation(s) in RCA: 29] [Impact Index Per Article: 7.3] [Received: 11/10/2019] [Revised: 02/08/2020] [Accepted: 02/25/2020] [Indexed: 01/11/2023] Open
Abstract
Traditional machine learning methods used to detect the side effects of drugs pose significant challenges as feature engineering processes are labor-intensive, expert-dependent, time-consuming and cost-ineffective. Moreover, these methods only focus on detecting the association between drugs and their side effects or classifying drug-drug interaction. Motivated by technological advancements and the availability of big data, we provide a review on the detection and classification of side effects using deep learning approaches. It is shown that the effective integration of heterogeneous, multidimensional drug data sources, together with the innovative deployment of deep learning approaches, helps reduce or prevent the occurrence of adverse drug reactions (ADRs). Deep learning approaches can also be exploited to find replacements for drugs which have side effects or help to diversify the utilization of drugs through drug repurposing.
Affiliation(s)
- Chun Yen Lee
- Department of Computer Science and Information Technology, La Trobe University
| | - Yi-Ping Phoebe Chen
- Department of Computer Science and Information Technology, La Trobe University
| |
|
47
|
Wu H, Xing Y, Ge W, Liu X, Zou J, Zhou C, Liao J. Drug-drug interaction extraction via hybrid neural networks on biomedical literature. J Biomed Inform 2020; 106:103432. [PMID: 32335223 DOI: 10.1016/j.jbi.2020.103432] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 11/15/2019] [Revised: 04/15/2020] [Accepted: 04/20/2020] [Indexed: 01/16/2023]
Abstract
Adverse events caused by drug-drug interactions (DDIs) not only pose a serious threat to health but also increase medical care expenditure. However, despite the emergence of many excellent text mining-based DDI classification methods, the balance between method simplicity and model performance is still unsatisfactory. In this article, we present a deep learning model, the stacked bidirectional Gated Recurrent Unit (GRU)-convolutional neural network (SGRU-CNN), which applies a stacked bidirectional GRU (BiGRU) network and a convolutional neural network (CNN) to lexical information and entity position information, respectively, to perform the DDI extraction task. Furthermore, the SGRU-CNN model assigns a weight to each word feature through one attentive pooling layer to improve performance. With its other metrics not inferior to those of other algorithms, experimental results on the DDI Extraction 2013 corpus show that our model achieves a 1.54% improvement in recall. The proposed SGRU-CNN model also reaches strong performance (F1-score: 0.75) with the fewest features, indicating an excellent balance between avoiding redundant preprocessing and achieving higher accuracy in relation extraction from biomedical literature.
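The attentive pooling step mentioned in this abstract, in which each word feature receives a weight before the features are combined, can be sketched as follows. This is a generic illustration under stated assumptions: the dot-product scoring against a `query` vector and the softmax normalization are common choices assumed here, and the abstract does not say which scoring function the SGRU-CNN model uses.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_pooling(
    word_features: List[List[float]], query: List[float]
) -> Tuple[List[float], List[float]]:
    """Weight each word vector by its softmax-normalized score, then sum."""
    # Assumed scoring: dot product between each word vector and the query.
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in word_features]
    weights = softmax(scores)
    dim = len(word_features[0])
    pooled = [
        sum(w * feat[d] for w, feat in zip(weights, word_features))
        for d in range(dim)
    ]
    return pooled, weights


# Toy word features and a hypothetical query vector (learned in a real model).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = attentive_pooling(features, query=[0.5, 0.5])
print(weights, pooled)
```

In the toy input, the third word scores highest against the query, so it contributes most to the pooled sentence vector, while the first two words receive equal, smaller weights.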
Affiliation(s)
- Hong Wu
- School of science, China Pharmaceutical University, Nanjing, China
| | - Yan Xing
- School of science, China Pharmaceutical University, Nanjing, China
| | - Weihong Ge
- Department of Pharmacy, Nanjing Drum Tower Hospital, Nanjing, China; School of Basic Medicine and Clinical Pharmacy, China Pharmaceutical University, Nanjing, China
| | - Xiaoquan Liu
- School of Pharmacy, China Pharmaceutical University, Nanjing, China
| | - Jianjun Zou
- School of Basic Medicine and Clinical Pharmacy, China Pharmaceutical University, Nanjing, China; Department of Clinical Pharmacology, Nanjing First Hospital, Nanjing Medical University, Nanjing, China
| | - Changjiang Zhou
- School of science, China Pharmaceutical University, Nanjing, China
| | - Jun Liao
- School of science, China Pharmaceutical University, Nanjing, China.
| |
|
48
|
Döring K, Qaseem A, Becer M, Li J, Mishra P, Gao M, Kirchner P, Sauter F, Telukunta KK, Moumbock AFA, Thomas P, Günther S. Automated recognition of functional compound-protein relationships in literature. PLoS One 2020; 15:e0220925. [PMID: 32126064 PMCID: PMC7053725 DOI: 10.1371/journal.pone.0220925] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/24/2019] [Accepted: 01/29/2020] [Indexed: 11/18/2022] Open
Abstract
MOTIVATION Much effort has been invested in the identification of protein-protein interactions using text mining and machine learning methods. The extraction of functional relationships between chemical compounds and proteins from literature has received much less attention, and no ready-to-use open-source software is so far available for this task. METHOD We created a new benchmark dataset of 2,613 sentences from abstracts containing annotations of proteins, small molecules, and their relationships. Two kernel methods were applied to classify these relationships as functional or non-functional, named shallow linguistic and all-paths graph kernel. Furthermore, the benefit of interaction verbs in sentences was evaluated. RESULTS The cross-validation of the all-paths graph kernel (AUC value: 84.6%, F1 score: 79.0%) shows slightly better results than the shallow linguistic kernel (AUC value: 82.5%, F1 score: 77.2%) on our benchmark dataset. Both models achieve state-of-the-art performance in the research area of relation extraction. Furthermore, the combination of shallow linguistic and all-paths graph kernel could further increase the overall performance slightly. We used each of the two kernels to identify functional relationships in all PubMed abstracts (29 million) and provide the results, including recorded processing time. AVAILABILITY The software for the tested kernels, the benchmark, the processed 29 million PubMed abstracts, all evaluation scripts, as well as the scripts for processing the complete PubMed database are freely available at https://github.com/KerstenDoering/CPI-Pipeline.
Affiliation(s)
- Kersten Döring
- Institute of Pharmaceutical Sciences, Albert-Ludwigs-Universität Freiburg, Freiburg, Germany
| | - Ammar Qaseem
- Institute of Pharmaceutical Sciences, Albert-Ludwigs-Universität Freiburg, Freiburg, Germany
| | - Michael Becer
- Institute of Pharmaceutical Sciences, Albert-Ludwigs-Universität Freiburg, Freiburg, Germany
| | - Jianyu Li
- Institute of Pharmaceutical Sciences, Albert-Ludwigs-Universität Freiburg, Freiburg, Germany
| | - Pankaj Mishra
- Institute of Pharmaceutical Sciences, Albert-Ludwigs-Universität Freiburg, Freiburg, Germany
| | - Mingjie Gao
- Institute of Pharmaceutical Sciences, Albert-Ludwigs-Universität Freiburg, Freiburg, Germany
| | - Pascal Kirchner
- Institute of Pharmaceutical Sciences, Albert-Ludwigs-Universität Freiburg, Freiburg, Germany
| | - Florian Sauter
- Institute of Pharmaceutical Sciences, Albert-Ludwigs-Universität Freiburg, Freiburg, Germany
| | - Kiran K. Telukunta
- Institute of Pharmaceutical Sciences, Albert-Ludwigs-Universität Freiburg, Freiburg, Germany
| | - Aurélien F. A. Moumbock
- Institute of Pharmaceutical Sciences, Albert-Ludwigs-Universität Freiburg, Freiburg, Germany
| | | | - Stefan Günther
- Institute of Pharmaceutical Sciences, Albert-Ludwigs-Universität Freiburg, Freiburg, Germany
| |
|
49
|
Zhang Y, Lin H, Yang Z, Wang J, Sun Y. Chemical-protein interaction extraction via contextualized word representations and multihead attention. Database: The Journal of Biological Databases and Curation 2020; 2019:5498050. [PMID: 31125403 PMCID: PMC6534182 DOI: 10.1093/database/baz054] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 12/12/2018] [Revised: 03/16/2019] [Accepted: 04/02/2019] [Indexed: 12/17/2022]
Abstract
A rich source of chemical–protein interactions (CPIs) is locked in the exponentially growing biomedical literature. Automatic extraction of CPIs is a crucial task in biomedical natural language processing (NLP), which has great benefits for pharmacological and clinical research. Deep context representation and multihead attention are recent developments in deep learning and have shown their potential in some NLP tasks. Unlike traditional word embedding, deep context representation can generate a comprehensive sentence representation based on the sentence context. The multihead attention mechanism can effectively learn the important features from different heads and emphasize the relatively important features. Integrating deep context representation and multihead attention with a neural network-based model may improve CPI extraction. We present a deep neural model for CPI extraction based on deep context representation and multihead attention. Our model mainly consists of the following three parts: a deep context representation layer, a bidirectional long short-term memory networks (Bi-LSTMs) layer and a multihead attention layer. The deep context representation is employed to provide more comprehensive feature input for the Bi-LSTMs. The multihead attention can effectively emphasize the important parts of the Bi-LSTMs output. We evaluated our method on the public ChemProt corpus. The experimental results show that both deep context representation and multihead attention are helpful in CPI extraction. Our method can compete with other state-of-the-art methods on the ChemProt corpus.
Affiliation(s)
- Yijia Zhang
- College of Computer Science and Technology, Dalian University of Technology, Dalian, China
| | - Hongfei Lin
- College of Computer Science and Technology, Dalian University of Technology, Dalian, China
| | - Zhihao Yang
- College of Computer Science and Technology, Dalian University of Technology, Dalian, China
| | - Jian Wang
- College of Computer Science and Technology, Dalian University of Technology, Dalian, China
| | - Yuanyuan Sun
- College of Computer Science and Technology, Dalian University of Technology, Dalian, China
| |
|
50
|
Jettakul A, Wichadakul D, Vateekul P. Relation extraction between bacteria and biotopes from biomedical texts with attention mechanisms and domain-specific contextual representations. BMC Bioinformatics 2019; 20:627. [PMID: 31795930 PMCID: PMC6889521 DOI: 10.1186/s12859-019-3217-3] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 04/28/2019] [Accepted: 11/12/2019] [Indexed: 12/21/2022] Open
Abstract
BACKGROUND The Bacteria Biotope (BB) task is a biomedical relation extraction (RE) task that aims to study the interaction between bacteria and their locations. This task is considered to pertain to fundamental knowledge in applied microbiology. Some previous investigations applied feature-based models; others presented deep-learning-based models such as convolutional and recurrent neural networks used with the shortest dependency paths (SDPs). Although SDPs contain valuable and concise information, some crucial information required to define bacterial location relationships is often neglected. Moreover, the traditional word embeddings used in previous studies may suffer from word ambiguity across linguistic contexts. RESULTS Here, we present a deep learning model for biomedical RE. The model incorporates feature combinations of SDPs and full sentences with various attention mechanisms. We also used pre-trained contextual representations based on domain-specific vocabularies. To assess the model's robustness, we introduced a mean F1 score computed over many models trained with different random seeds. The experiments were conducted on the standard BB corpus in BioNLP-ST'16. Our experimental results revealed that the model performed better (in terms of both maximum and average F1 scores; 60.77% and 57.63%, respectively) compared with other existing models. CONCLUSIONS We demonstrated that our proposed contributions to this task can be used to extract rich lexical, syntactic, and semantic features that effectively boost the model's performance. Moreover, we analyzed the trade-off between precision and recall to choose the proper cut-off to use in real-world applications.
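The robustness measure described in this abstract, a mean F1 score over models trained with different random seeds reported alongside the maximum, can be sketched as below. The per-seed scores are made-up placeholder values for illustration only, not results from the paper, and the `summarize_f1` helper is a hypothetical name.

```python
from statistics import mean, stdev
from typing import Dict


def summarize_f1(f1_by_seed: Dict[int, float]) -> Dict[str, float]:
    """Summarize F1 scores from runs that differ only in their random seed."""
    scores = list(f1_by_seed.values())
    return {
        "max": max(scores),      # best single run
        "mean": mean(scores),    # average over seeds, the robustness measure
        "std": stdev(scores) if len(scores) > 1 else 0.0,
    }


# Placeholder F1 scores keyed by random seed (not values from the paper).
summary = summarize_f1({0: 0.55, 1: 0.60, 2: 0.56})
print(summary)
```

Reporting the mean next to the maximum shows how much of the best score depends on a favorable seed, which is why the abstract gives both figures.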
Affiliation(s)
- Amarin Jettakul
- Chulalongkorn University Big Data Analytics and IoT Center (CUBIC), Department of Computer Engineering, Faculty of Engineering, Chulalongkorn University, Bangkok, Thailand
| | - Duangdao Wichadakul
- Chulalongkorn University Big Data Analytics and IoT Center (CUBIC), Department of Computer Engineering, Faculty of Engineering, Chulalongkorn University, Bangkok, Thailand
| | - Peerapon Vateekul
- Chulalongkorn University Big Data Analytics and IoT Center (CUBIC), Department of Computer Engineering, Faculty of Engineering, Chulalongkorn University, Bangkok, Thailand.
| |
|