1
|
Zhang Y, Yang Z, Yang Y, Lin H, Wang J. Location-enhanced syntactic knowledge for biomedical relation extraction. J Biomed Inform 2024; 156:104676. [PMID: 38876451 DOI: 10.1016/j.jbi.2024.104676] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/05/2024] [Revised: 06/08/2024] [Accepted: 06/10/2024] [Indexed: 06/16/2024]
Abstract
Biomedical relation extraction has long been considered a challenging task due to the specialization and complexity of biomedical texts. Syntactic knowledge has been widely employed in existing research to enhance relation extraction, providing guidance for the semantic understanding and text representation of models. However, the utilization of syntactic knowledge in most studies is not exhaustive, and there is often a lack of fine-grained noise reduction, leading to confusion in relation classification. In this paper, we propose an attention generator that comprehensively considers both syntactic dependency type information and syntactic position information to distinguish the importance of different dependency connections. Additionally, we integrate positional information, dependency type information, and word representations together to introduce location-enhanced syntactic knowledge for guiding our biomedical relation extraction. Experimental results on three widely used English benchmark datasets in the biomedical domain consistently outperform a range of baseline models, demonstrating that our approach not only makes full use of syntactic knowledge but also effectively reduces the impact of noisy words.
Collapse
Affiliation(s)
- Yan Zhang
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China.
| | - Zhihao Yang
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China.
| | - Yumeng Yang
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China.
| | - Hongfei Lin
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China.
| | - Jian Wang
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China.
| |
Collapse
|
2
|
Das Baksi K, Pokhrel V, Pudavar AE, Mande SS, Kuntal BK. BactInt: A domain driven transfer learning approach for extracting inter-bacterial associations from biomedical text. Comput Biol Chem 2024; 109:108012. [PMID: 38198963 DOI: 10.1016/j.compbiolchem.2023.108012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/04/2023] [Revised: 12/15/2023] [Accepted: 12/30/2023] [Indexed: 01/12/2024]
Abstract
BACKGROUND The healthy as well as dysbiotic state of an ecosystem like human body is known to be influenced not only by the presence of the bacterial groups in it, but also with respect to the associations within themselves. Evidence reported in biomedical text serves as a reliable source for identifying and ascertaining such inter bacterial associations. However, the complexity of the reported text as well as the ever-increasing volume of information necessitates development of methods for automated and accurate extraction of such knowledge. METHODS A BioBERT (biomedical domain specific language model) based information extraction model for bacterial associations is presented that utilizes learning patterns from other publicly available datasets. Additionally, a specialized sentence corpus has been developed to significantly improve the prediction accuracy of the 'transfer learned' model using a fine-tuning approach. RESULTS The final model was seen to outperform all other variations (non-transfer learned and non-fine-tuned models) as well as models trained on BioGPT (a domain trained Generative Pre-trained Transformer). To further demonstrate the utility, a case study was performed using bacterial association network data obtained from experimental studies. CONCLUSION This study attempts to demonstrate the applicability of transfer learning in a niche field of life sciences where understanding of inter bacterial relationships is crucial to obtain meaningful insights in comprehending microbial community structures across different ecosystems. The study further discusses how such a model can be further improved by fine tuning using limited training data. The results presented and the datasets made available are expected to be a valuable addition in the field of medical informatics and bioinformatics.
Collapse
Affiliation(s)
| | - Vatsala Pokhrel
- TCS Research, Tata Consultancy Services Ltd, Pune 411057, India
| | | | | | - Bhusan K Kuntal
- TCS Research, Tata Consultancy Services Ltd, Pune 411057, India.
| |
Collapse
|
3
|
Hu Y, Chen Y, Qin Y, Huang R. Learning entity-oriented representation for biomedical relation extraction. J Biomed Inform 2023; 147:104527. [PMID: 37852347 DOI: 10.1016/j.jbi.2023.104527] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/21/2023] [Revised: 10/11/2023] [Accepted: 10/15/2023] [Indexed: 10/20/2023]
Abstract
Biomedical Relation Extraction (BioRE) aims to automatically extract semantic relations for given entity pairs and is of great significance in biomedical research. Current popular methods often utilize pretrained language models to extract semantic features from individual input instances, which frequently suffer from overlapping semantics. Overlapping semantics refers to the situation in which a sentence contains multiple entity pairs that share the same context, leading to highly similar information between these entity pairs. In this study, we propose a model for learning Entity-oriented Representation (EoR) that aims to improve the performance of the model by enhancing the discriminability between entity pairs. It contains three modules: sentence representation, entity-oriented representation, and output. The first module learns the global semantic information of the input instance; the second module focuses on extracting the semantic information of the sentence from the target entities; and the third module enhances distinguishability among entity pairs and classifies the relation type. We evaluated our approach on four BioRE tasks with eight datasets, and the experiments showed that our EoR achieved state-of-the-art performance for PPI, DDI, CPI, and DPI tasks. Further analysis demonstrated the benefits of entity-oriented semantic information in handling multiple entity pairs in the BioRE task.
Collapse
Affiliation(s)
- Ying Hu
- Text Computing and Cognitive Intelligence Engineering Research Center of National Education Ministry, State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University, Guiyang, 550025, China.
| | - Yanping Chen
- Text Computing and Cognitive Intelligence Engineering Research Center of National Education Ministry, State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University, Guiyang, 550025, China.
| | - Yongbin Qin
- Text Computing and Cognitive Intelligence Engineering Research Center of National Education Ministry, State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University, Guiyang, 550025, China.
| | - Ruizhang Huang
- Text Computing and Cognitive Intelligence Engineering Research Center of National Education Ministry, State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University, Guiyang, 550025, China.
| |
Collapse
|
4
|
Lai PT, Wei CH, Luo L, Chen Q, Lu Z. BioREx: Improving biomedical relation extraction by leveraging heterogeneous datasets. J Biomed Inform 2023; 146:104487. [PMID: 37673376 DOI: 10.1016/j.jbi.2023.104487] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/07/2023] [Revised: 08/18/2023] [Accepted: 09/02/2023] [Indexed: 09/08/2023]
Abstract
Biomedical relation extraction (RE) is the task of automatically identifying and characterizing relations between biomedical concepts from free text. RE is a central task in biomedical natural language processing (NLP) research and plays a critical role in many downstream applications, such as literature-based discovery and knowledge graph construction. State-of-the-art methods were used primarily to train machine learning models on individual RE datasets, such as protein-protein interaction and chemical-induced disease relation. Manual dataset annotation, however, is highly expensive and time-consuming, as it requires domain knowledge. Existing RE datasets are usually domain-specific or small, which limits the development of generalized and high-performing RE models. In this work, we present a novel framework for systematically addressing the data heterogeneity of individual datasets and combining them into a large dataset. Based on the framework and dataset, we report on BioREx, a data-centric approach for extracting relations. Our evaluation shows that BioREx achieves significantly higher performance than the benchmark system trained on the individual dataset, setting a new SOTA from 74.4% to 79.6% in F-1 measure on the recently released BioRED corpus. We further demonstrate that the combined dataset can improve performance for five different RE tasks. In addition, we show that on average BioREx compares favorably to current best-performing methods such as transfer learning and multi-task learning. Finally, we demonstrate BioREx's robustness and generalizability in two independent RE tasks not previously seen in training data: drug-drug N-ary combination and document-level gene-disease RE. The integrated dataset and optimized method have been packaged as a stand-alone tool available at https://github.com/ncbi/BioREx.
Collapse
Affiliation(s)
- Po-Ting Lai
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), MD, 20894 Bethesda, USA
| | - Chih-Hsuan Wei
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), MD, 20894 Bethesda, USA
| | - Ling Luo
- School of Computer Science and Technology, Dalian University of Technology, 116024 Dalian, China
| | - Qingyu Chen
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), MD, 20894 Bethesda, USA
| | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), MD, 20894 Bethesda, USA.
| |
Collapse
|
5
|
Liu D, Zhang Y, Yang M, Yuan J, Qu W. Extracting Mutant-Affected Protein-Protein Interactions via Gaussian-Enhanced Representation and Contrastive Learning. J Comput Biol 2023; 30:972-984. [PMID: 37682321 DOI: 10.1089/cmb.2023.0080] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 09/09/2023] Open
Abstract
Genetic mutations can impact protein-protein interactions (PPIs) in biomedical literature. Automated extraction of PPIs affected by gene mutations from biomedical literature can aid in evaluating the clinical importance of gene variations, which is crucial for the advancement of precision medicine. In this study, a new model called the Gaussian-enhanced representation model (GRM) is introduced for PPI extraction. The model utilizes the Gaussian probability distribution to produce a target entity representation based on the BioBERT pretraining model. The GRM assigns more weight to target protein entities and their adjacent entities, resolving the problem of lengthy input text and scattered distribution of target entities in the PPI extraction task. Additionally, the model introduces a supervised contrast learning approach to enhance its effectiveness and robustness. Experiments on the BioCreative VI data set demonstrate that our proposed GRM model has achieved state-of-the-art performance.
Collapse
Affiliation(s)
- Da Liu
- School of Information Science and Technology, Dalian Maritime University, Dalian, Liaoning, China
| | - Yijia Zhang
- School of Information Science and Technology, Dalian Maritime University, Dalian, Liaoning, China
| | - Ming Yang
- School of Information Science and Technology, Dalian Maritime University, Dalian, Liaoning, China
| | - Jianyuan Yuan
- School of Information Science and Technology, Dalian Maritime University, Dalian, Liaoning, China
| | - Wen Qu
- School of Information Science and Technology, Dalian Maritime University, Dalian, Liaoning, China
| |
Collapse
|
6
|
Lee M. Recent Advances in Deep Learning for Protein-Protein Interaction Analysis: A Comprehensive Review. Molecules 2023; 28:5169. [PMID: 37446831 DOI: 10.3390/molecules28135169] [Citation(s) in RCA: 12] [Impact Index Per Article: 12.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/30/2023] [Revised: 06/30/2023] [Accepted: 06/30/2023] [Indexed: 07/15/2023] Open
Abstract
Deep learning, a potent branch of artificial intelligence, is steadily leaving its transformative imprint across multiple disciplines. Within computational biology, it is expediting progress in the understanding of Protein-Protein Interactions (PPIs), key components governing a wide array of biological functionalities. Hence, an in-depth exploration of PPIs is crucial for decoding the intricate biological system dynamics and unveiling potential avenues for therapeutic interventions. As the deployment of deep learning techniques in PPI analysis proliferates at an accelerated pace, there exists an immediate demand for an exhaustive review that encapsulates and critically assesses these novel developments. Addressing this requirement, this review offers a detailed analysis of the literature from 2021 to 2023, highlighting the cutting-edge deep learning methodologies harnessed for PPI analysis. Thus, this review stands as a crucial reference for researchers in the discipline, presenting an overview of the recent studies in the field. This consolidation helps elucidate the dynamic paradigm of PPI analysis, the evolution of deep learning techniques, and their interdependent dynamics. This scrutiny is expected to serve as a vital aid for researchers, both well-established and newcomers, assisting them in maneuvering the rapidly shifting terrain of deep learning applications in PPI analysis.
Collapse
Affiliation(s)
- Minhyeok Lee
- School of Electrical and Electronics Engineering, Chung-Ang University, Seoul 06974, Republic of Korea
| |
Collapse
|
7
|
Luo L, Lai PT, Wei CH, Arighi CN, Lu Z. BioRED: a rich biomedical relation extraction dataset. Brief Bioinform 2022; 23:6645993. [PMID: 35849818 PMCID: PMC9487702 DOI: 10.1093/bib/bbac282] [Citation(s) in RCA: 43] [Impact Index Per Article: 21.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/17/2022] [Revised: 06/02/2022] [Accepted: 06/19/2022] [Indexed: 11/13/2022] Open
Abstract
Automated relation extraction (RE) from biomedical literature is critical for many downstream text mining applications in both research and real-world settings. However, most existing benchmarking datasets for biomedical RE only focus on relations of a single type (e.g. protein-protein interactions) at the sentence level, greatly limiting the development of RE systems in biomedicine. In this work, we first review commonly used named entity recognition (NER) and RE datasets. Then, we present a first-of-its-kind biomedical relation extraction dataset (BioRED) with multiple entity types (e.g. gene/protein, disease, chemical) and relation pairs (e.g. gene-disease; chemical-chemical) at the document level, on a set of 600 PubMed abstracts. Furthermore, we label each relation as describing either a novel finding or previously known background knowledge, enabling automated algorithms to differentiate between novel and background information. We assess the utility of BioRED by benchmarking several existing state-of-the-art methods, including Bidirectional Encoder Representations from Transformers (BERT)-based models, on the NER and RE tasks. Our results show that while existing approaches can reach high performance on the NER task (F-score of 89.3%), there is much room for improvement for the RE task, especially when extracting novel relations (F-score of 47.7%). Our experiments also demonstrate that such a rich dataset can successfully facilitate the development of more accurate, efficient and robust RE systems for biomedicine. Availability: The BioRED dataset and annotation guidelines are freely available at https://ftp.ncbi.nlm.nih.gov/pub/lu/BioRED/.
Collapse
Affiliation(s)
- Ling Luo
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
| | - Po-Ting Lai
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
| | - Chih-Hsuan Wei
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
| | | | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
| |
Collapse
|