1
He J, Li F, Li J, Hu X, Nian Y, Xiang Y, Wang J, Wei Q, Li Y, Xu H, Tao C. Prompt Tuning in Biomedical Relation Extraction. Journal of Healthcare Informatics Research 2024; 8:206-224. [PMID: 38681754 PMCID: PMC11052745 DOI: 10.1007/s41666-024-00162-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/01/2022] [Revised: 02/09/2024] [Accepted: 02/19/2024] [Indexed: 05/01/2024]
Abstract
Biomedical relation extraction (RE) is critical in constructing high-quality knowledge graphs and databases as well as supporting many downstream text mining applications. This paper explores prompt tuning on biomedical RE and its few-shot scenarios, aiming to propose a simple yet effective model for this specific task. Prompt tuning reformulates natural language processing (NLP) downstream tasks into masked language modeling problems by embedding specific text prompts into the original input, facilitating the adaptation of pre-trained language models (PLMs) to better address these tasks. This study presents a customized prompt tuning model designed explicitly for biomedical RE, including its applicability in few-shot learning contexts. The model's performance was rigorously assessed using the chemical-protein relation (CHEMPROT) dataset from BioCreative VI and the drug-drug interaction (DDI) dataset from SemEval-2013, showcasing its superior performance over conventional fine-tuned PLMs across both datasets, encompassing few-shot scenarios. This observation underscores the effectiveness of prompt tuning in enhancing the capabilities of conventional PLMs, though the extent of enhancement may vary by specific model. Additionally, the model demonstrated a harmonious balance between simplicity and efficiency, matching state-of-the-art performance without needing external knowledge or extra computational resources. The pivotal contribution of our study is the development of a suitably designed prompt tuning model, highlighting prompt tuning's effectiveness in biomedical RE. It offers a robust, efficient approach to the field's challenges and represents a significant advancement in extracting complex relations from biomedical texts. Supplementary Information: The online version contains supplementary material available at 10.1007/s41666-024-00162-9.
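The reformulation described in this abstract, turning relation classification into a masked-token prediction by adding a text prompt to the input, can be sketched as follows. This is a minimal illustration only, not the authors' model: the template wording, the `VERBALIZER` label words, and the function names are all hypothetical, and the scores passed to `predict_relation` stand in for whatever a pre-trained language model would assign to candidate words at the mask position.

```python
from typing import Dict

MASK = "[MASK]"

# Hypothetical verbalizer: one label word per relation class (an assumption,
# not taken from the paper).
VERBALIZER: Dict[str, str] = {
    "inhibitor": "inhibits",
    "activator": "activates",
    "none": "unrelated",
}


def build_prompt(sentence: str, head: str, tail: str) -> str:
    """Append a cloze-style template with a mask slot to the input sentence."""
    return f"{sentence} {head} {MASK} {tail}."


def predict_relation(mask_scores: Dict[str, float]) -> str:
    """Return the relation whose label word scores highest at the mask slot.

    `mask_scores` maps candidate words to scores; in practice these would
    come from a masked language model, which is not included here.
    """
    return max(
        VERBALIZER,
        key=lambda rel: mask_scores.get(VERBALIZER[rel], float("-inf")),
    )


if __name__ == "__main__":
    prompt = build_prompt("Aspirin blocks COX-1.", "Aspirin", "COX-1")
    print(prompt)
    print(predict_relation({"inhibits": 0.7, "activates": 0.2, "unrelated": 0.1}))
```

The point of the design is that the classifier head is replaced by the language model's own vocabulary prediction, so the only task-specific pieces are the template and the mapping from label words to relation classes.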
Affiliation(s)
- Jianping He
  - McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX USA
- Fang Li
  - McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX USA
  - Department of Artificial Intelligence and Informatics, Mayo Clinic, Jacksonville, FL USA
- Jianfu Li
  - McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX USA
  - Department of Artificial Intelligence and Informatics, Mayo Clinic, Jacksonville, FL USA
- Xinyue Hu
  - McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX USA
  - Department of Artificial Intelligence and Informatics, Mayo Clinic, Jacksonville, FL USA
- Yi Nian
  - McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX USA
- Yang Xiang
  - McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX USA
- Jingqi Wang
  - McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX USA
- Qiang Wei
  - McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX USA
- Yiming Li
  - McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX USA
- Hua Xu
  - Department of Bioinformatics and Data Science, Yale School of Medicine, New Haven, CT USA
- Cui Tao
  - McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX USA
  - Department of Artificial Intelligence and Informatics, Mayo Clinic, Jacksonville, FL USA
2
Zheng X, Wang X, Luo X, Tong F, Zhao D. BioEGRE: a linguistic topology enhanced method for biomedical relation extraction based on BioELECTRA and graph pointer neural network. BMC Bioinformatics 2023; 24:486. [PMID: 38114906 PMCID: PMC10731880 DOI: 10.1186/s12859-023-05601-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/04/2023] [Accepted: 12/04/2023] [Indexed: 12/21/2023] [Open Access]
Abstract
BACKGROUND: Automatic and accurate extraction of diverse biomedical relations from literature is a crucial component of biomedical text mining. Currently, stacking various classification networks on pre-trained language models and fine-tuning them is a common framework for solving the biomedical relation extraction (BioRE) problem end-to-end. However, sequence-based pre-trained language models underutilize the graphical topology of language to some extent. In addition, sequence-oriented deep neural networks have limitations in processing graphical features. RESULTS: In this paper, we propose a novel method for the sentence-level BioRE task, BioEGRE (BioELECTRA and Graph pointer neural network for Relation Extraction), aimed at leveraging linguistic topological features. First, the biomedical literature is preprocessed to retain sentences involving pre-defined entity pairs. Secondly, SciSpaCy is employed to conduct dependency parsing; sentences are modeled as graphs based on the parsing results; BioELECTRA is utilized to generate token-level representations, which are modeled as attributes of nodes in the sentence graphs; a graph pointer neural network layer is employed to select the most relevant multi-hop neighbors to optimize representations; a fully-connected neural network layer is employed to generate the sentence-level representation. Finally, the Softmax function is employed to calculate the probabilities. Our proposed method is evaluated on three BioRE tasks: a multi-class task (CHEMPROT) and two binary tasks (GAD and EU-ADR). The results show that our method achieves F1-scores of 79.97% (CHEMPROT), 83.31% (GAD), and 83.51% (EU-ADR), surpassing the performance of existing state-of-the-art models. CONCLUSION: The experimental results on 3 biomedical benchmark datasets demonstrate the effectiveness and generalization of BioEGRE, which indicates that linguistic topology and a graph pointer neural network layer explicitly improve performance for BioRE tasks.
Affiliation(s)
- Xiangwen Zheng
  - Academy of Military Medical Sciences, Beijing, 100039, China
- Xuanze Wang
  - Academy of Military Medical Sciences, Beijing, 100039, China
- Xiaowei Luo
  - Academy of Military Medical Sciences, Beijing, 100039, China
- Fan Tong
  - Academy of Military Medical Sciences, Beijing, 100039, China
- Dongsheng Zhao
  - Academy of Military Medical Sciences, Beijing, 100039, China
3
Yang C, Deng J, Chen X, An Y. SPBERE: Boosting span-based pipeline biomedical entity and relation extraction via entity information. J Biomed Inform 2023; 145:104456. [PMID: 37482171 DOI: 10.1016/j.jbi.2023.104456] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/30/2022] [Revised: 05/03/2023] [Accepted: 07/18/2023] [Indexed: 07/25/2023]
Abstract
Triplet extraction is one of the fundamental tasks in biomedical text mining. Compared with traditional pipeline approaches, joint methods can alleviate the error propagation problem from entity recognition to relation classification. However, existing methods face challenges in detecting overlapping entities and overlapping relations, which are ubiquitous in biomedical texts. In this work, we propose a novel pipeline method for end-to-end biomedical triplet extraction. In particular, a span-based detection strategy is used to detect overlapping triplets by enumerating possible candidate spans and entity pairs. The strategy is further used to capture different contextualized representations via an entity model and a relation model, respectively. Furthermore, to enhance the interrelation between spans, entity information from the output of the entity model is used to construct the input for the relation model without utilizing any external knowledge. Our approach is evaluated on the drug-drug interaction (DDI) and chemical-protein interaction (CHEMPROT) datasets, exhibiting an improvement of the absolute F1-score in relation extraction of 3.5%-3.7% compared with prior work. The experimental results highlight the importance of overlapping triplet detection using the span-based approach, acquisition of various contextualized representations via different in-domain pre-trained language models, and early fusion of entity information in the relation model.
Affiliation(s)
- Chenglin Yang
  - Big Data Institute, Central South University, Changsha, 410083, China; School of Life Sciences, Central South University, Changsha, 410083, China
- Jiamei Deng
  - Big Data Institute, Central South University, Changsha, 410083, China
- Xianlai Chen
  - Big Data Institute, Central South University, Changsha, 410083, China; Key Laboratory of Medical Information Research, Central South University, Changsha, 410083, China
- Ying An
  - Big Data Institute, Central South University, Changsha, 410083, China
4
Duan B, Peng J, Zhang Y. IMSE: interaction information attention and molecular structure based drug drug interaction extraction. BMC Bioinformatics 2022; 23:338. [PMID: 35965308 PMCID: PMC9375903 DOI: 10.1186/s12859-022-04876-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/28/2022] [Accepted: 08/03/2022] [Indexed: 11/10/2022] [Open Access]
Abstract
Background: Extraction of drug-drug interactions from biomedical literature and other textual data is an important component of drug-safety monitoring and has attracted the attention of many researchers in healthcare. Existing work centers on relation extraction using bidirectional long short-term memory networks (BiLSTM) and the BERT model, which do not attain the best feature representations. Results: Our proposed DDI (drug-drug interaction) prediction model provides multiple advantages: (1) a newly proposed attention vector is added to better deal with the problem of overlapping relations, (2) the molecular structure information of drugs is integrated into the model to better express the functional group structure of drugs, (3) text features that combine the T-distribution and chi-square distribution are added to make the model more focused on drug entities, and (4) it achieves similar or better prediction performance (F-scores up to 85.16%) compared to state-of-the-art DDI models when tested on benchmark datasets. Conclusions: Our model, which leverages a state-of-the-art transformer architecture in conjunction with multiple features, can bolster performance on drug-drug interaction tasks in the biomedical domain. In particular, we believe our research would be helpful in the identification of potential adverse drug reactions.
5
McInnes BT, Downie JS, Hao Y, Jett J, Keating K, Nakum G, Ranjan S, Rodriguez NE, Tang J, Xiang D, Young EM, Nguyen MH. Discovering Content through Text Mining for a Synthetic Biology Knowledge System. ACS Synth Biol 2022; 11:2043-2054. [PMID: 35671034 DOI: 10.1021/acssynbio.1c00611] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/29/2022]
Abstract
Scientific articles contain a wealth of information about experimental methods and results describing biological designs. Due to its unstructured nature and multiple sources of ambiguity and variability, extracting this information from text is a difficult task. In this paper, we describe the development of the synthetic biology knowledge system (SBKS) text processing pipeline. The pipeline uses natural language processing techniques to extract and correlate information from the literature for synthetic biology researchers. Specifically, we apply named entity recognition, relation extraction, concept grounding, and topic modeling to extract information from published literature to link articles to elements within our knowledge system. Our results show the efficacy of each of the components on synthetic biology literature and provide future directions for further advancement of the pipeline.
Affiliation(s)
- Bridget T McInnes
  - Virginia Commonwealth University, Richmond, Virginia 23284, United States
- J Stephen Downie
  - University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States
- Yikai Hao
  - University of California San Diego, La Jolla, California 92093, United States
- Jacob Jett
  - University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States
- Kevin Keating
  - Worcester Polytechnic Institute, Worcester, Massachusetts 01609, United States
- Gaurav Nakum
  - University of California San Diego, La Jolla, California 92093, United States
- Sudhanshu Ranjan
  - University of California San Diego, La Jolla, California 92093, United States
- Jiawei Tang
  - University of California San Diego, La Jolla, California 92093, United States
- Du Xiang
  - University of California San Diego, La Jolla, California 92093, United States
- Eric M Young
  - Worcester Polytechnic Institute, Worcester, Massachusetts 01609, United States
- Mai H Nguyen
  - University of California San Diego, La Jolla, California 92093, United States
6
Sun C, Yang Z, Wang L, Zhang Y, Lin H, Wang J. MRC4BioER: Joint extraction of biomedical entities and relations in the machine reading comprehension framework. J Biomed Inform 2021; 125:103956. [PMID: 34848329 DOI: 10.1016/j.jbi.2021.103956] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 06/11/2021] [Revised: 11/09/2021] [Accepted: 11/16/2021] [Indexed: 10/19/2022]
Abstract
Extracting entities and their relations from unstructured literature to form structured triplets is essential for biomedical knowledge extraction. Because sentences in biomedical datasets usually have many special overlapping triplets, it is difficult to use previous work to extract these triplets effectively. In this work, we propose a novel tagging strategy to achieve joint extraction in the machine reading comprehension framework. On the one hand, our method uses Query in the machine reading comprehension framework to introduce the information of the specific relation. On the other hand, our method introduces a tagging strategy for overlapping triplets in the biomedical domain. We use CHEMPROT and DDIExtraction2013 datasets to evaluate our method. The experimental results demonstrate that our proposed method can enhance the model's ability to deal with overlapping triplets, improving extraction performance.
Affiliation(s)
- Cong Sun
  - School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
- Zhihao Yang
  - School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
- Lei Wang
  - Beijing Institute of Health Administration and Medical Information, Beijing 100850, China
- Yin Zhang
  - Beijing Institute of Health Administration and Medical Information, Beijing 100850, China
- Hongfei Lin
  - School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
- Jian Wang
  - School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
7
Liu X, Tan K, Dong S. Multi-granularity sequential neural network for document-level biomedical relation extraction. Inf Process Manag 2021. [DOI: 10.1016/j.ipm.2021.102718] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
8
Li Z, Chen H, Qi R, Lin H, Chen H. DocR-BERT: Document-level R-BERT for Chemical-induced Disease Relation Extraction via Gaussian Probability Distribution. IEEE J Biomed Health Inform 2021; 26:1341-1352. [PMID: 34591774 DOI: 10.1109/jbhi.2021.3116769] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/10/2022]
Abstract
Chemical-induced disease (CID) relation extraction from biomedical articles plays an important role in disease treatment and drug development. Existing methods are insufficient for capturing complete document level semantic information due to ignoring semantic information of entities in different sentences. In this work, we proposed an effective document-level relation extraction model to automatically extract intra-/inter-sentential CID relations from articles. Firstly, our model employed BERT to generate contextual semantic representations of the title, abstract and shortest dependency paths (SDPs). Secondly, to enhance the semantic representation of the whole document, cross attention with self-attention (named cross2self-attention) between abstract, title and SDPs was proposed to learn the mutual semantic information. Thirdly, to distinguish the importance of the target entity in different sentences, the Gaussian probability distribution was utilized to compute the weights of the co-occurrence sentence and its adjacent entity sentences. More complete semantic information of the target entity is collected from all entities occurring in the document via our presented document-level R-BERT (DocR-BERT). Finally, the related representations were concatenated and fed into the softmax function to extract CIDs. We evaluated the model on the CDR corpus provided by BioCreative V. The proposed model without external resources is superior in performance as compared with other state-of-the-art models (our model achieves 53.5%, 70%, and 63.7% of the F1-score on inter-/intra-sentential and overall CDR dataset). The experimental results indicate that cross2self-attention, the Gaussian probability distribution and DocR-BERT can effectively improve the CID extraction performance. 
Furthermore, the mutual semantic information learned by the cross self-attention from abstract towards title can significantly influence the extraction performance of document-level biomedical relation extraction tasks.
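The Gaussian weighting step mentioned in this abstract can be illustrated with a small sketch. The abstract does not give the exact formula, so the following is an assumption: each sentence is weighted by a Gaussian of its distance from the co-occurrence sentence, and the weights are normalized to sum to one. The function name, the `sigma` parameter, and the normalization are hypothetical choices for illustration, not the DocR-BERT implementation.

```python
import math
from typing import List


def gaussian_sentence_weights(
    num_sentences: int, center: int, sigma: float = 1.0
) -> List[float]:
    """Weight sentences by a Gaussian of their distance from `center`.

    `center` is the index of the sentence where the target entities co-occur
    (an assumed reading of the abstract); nearby sentences get larger weights.
    """
    if num_sentences <= 0:
        raise ValueError("num_sentences must be positive")
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    raw = [
        math.exp(-((i - center) ** 2) / (2.0 * sigma ** 2))
        for i in range(num_sentences)
    ]
    total = sum(raw)
    # Normalize so the weights form a distribution over sentences.
    return [w / total for w in raw]


if __name__ == "__main__":
    # Five sentences, co-occurrence in the third one (index 2).
    for index, weight in enumerate(gaussian_sentence_weights(5, center=2)):
        print(index, round(weight, 4))
```

A smaller `sigma` concentrates almost all weight on the co-occurrence sentence, while a larger one spreads it over adjacent sentences; which setting suits a given corpus is an empirical question.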
9
Mante J, Hao Y, Jett J, Joshi U, Keating K, Lu X, Nakum G, Rodriguez NE, Tang J, Terry L, Wu X, Yu E, Downie JS, McInnes BT, Nguyen MH, Sepulvado B, Young EM, Myers CJ. Synthetic Biology Knowledge System. ACS Synth Biol 2021; 10:2276-2285. [PMID: 34387462 DOI: 10.1021/acssynbio.1c00188] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Indexed: 02/07/2023]
Abstract
The Synthetic Biology Knowledge System (SBKS) is an instance of the SynBioHub repository that includes text and data information that has been mined from papers published in ACS Synthetic Biology. This paper describes the SBKS curation framework that is being developed to construct the knowledge stored in this repository. The text mining pipeline performs automatic annotation of the articles using natural language processing techniques to identify salient content such as key terms, relationships between terms, and main topics. The data mining pipeline performs automatic annotation of the sequences extracted from the supplemental documents with the genetic parts used in them. Together these two pipelines link genetic parts to papers describing the context in which they are used. Ultimately, SBKS will reduce the time necessary for synthetic biologists to find the information necessary to complete their designs.
Affiliation(s)
- Jeanet Mante
  - University of Colorado Boulder, Boulder, Colorado 80309, United States
- Yikai Hao
  - University of California San Diego, La Jolla, California 92093, United States
- Jacob Jett
  - University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States
- Udayan Joshi
  - University of California San Diego, La Jolla, California 92093, United States
- Kevin Keating
  - Worcester Polytechnic Institute, Worcester, Massachusetts 01609, United States
- Xiang Lu
  - University of California San Diego, La Jolla, California 92093, United States
- Gaurav Nakum
  - University of California San Diego, La Jolla, California 92093, United States
- Jiawei Tang
  - University of California San Diego, La Jolla, California 92093, United States
- Logan Terry
  - University of Utah, Salt Lake City, Utah 84112, United States
- Xuanyu Wu
  - University of California San Diego, La Jolla, California 92093, United States
- Eric Yu
  - University of Utah, Salt Lake City, Utah 84112, United States
- J. Stephen Downie
  - University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States
- Bridget T. McInnes
  - Virginia Commonwealth University, Richmond, Virginia 23284, United States
- Mai H. Nguyen
  - University of California San Diego, La Jolla, California 92093, United States
- Brandon Sepulvado
  - NORC at the University of Chicago Bethesda, Chicago, Illinois 60637, United States
- Eric M. Young
  - Worcester Polytechnic Institute, Worcester, Massachusetts 01609, United States
- Chris J. Myers
  - University of Colorado Boulder, Boulder, Colorado 80309, United States
10
Zheng T, Xu Z, Li Y, Zhao Y, Wang B, Yang X. A Novel Conditional Knowledge Graph Representation and Construction. Artif Intell 2021. [DOI: 10.1007/978-3-030-93049-3_32] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2022]
11
Wang W, Yang X, Wu C, Yang C. CGINet: graph convolutional network-based model for identifying chemical-gene interaction in an integrated multi-relational graph. BMC Bioinformatics 2020; 21:544. [PMID: 33243142 PMCID: PMC7689985 DOI: 10.1186/s12859-020-03899-3] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 07/23/2020] [Accepted: 11/19/2020] [Indexed: 11/19/2022] [Open Access]
Abstract
Background: Elucidating the interactive relations between chemicals and genes is of key relevance not only for discovering new drug leads in drug development but also for repositioning existing drugs to novel therapeutic targets. Recently, biological network-based approaches have proven effective in predicting chemical-gene interactions.
Results: We present CGINet, a graph convolutional network-based method for identifying chemical-gene interactions in an integrated multi-relational graph containing three types of nodes: chemicals, genes, and pathways. We investigate two different perspectives on learning node embeddings. One is to view the graph as a whole; the other is to adopt a subgraph view in which initial node embeddings are learned from the binary association subgraphs and then transferred to the multi-interaction subgraph for more focused learning of higher-level target node representations. In addition, we reconstruct the topological structures of target nodes with the latent links captured by the designed substructures. CGINet is trained end-to-end, with the encoder and the decoder trained jointly on known chemical-gene interactions. We aim to predict unknown but potential associations between chemicals and genes as well as their interaction types. Conclusions: We study three model implementations, CGINet-1/2/3, with various components and compare them with baseline approaches. As the experimental results suggest, our models exhibit competitive performance in identifying chemical-gene interactions. In addition, the subgraph perspective and the latent links both play positive roles in learning more informative node embeddings and can lead to improved prediction.
Affiliation(s)
- Wei Wang
  - College of Computer, National University of Defense Technology, Changsha, 410073, China
- Xi Yang
  - College of Computer, National University of Defense Technology, Changsha, 410073, China
- Chengkun Wu
  - College of Computer, National University of Defense Technology, Changsha, 410073, China; State Key Laboratory of High-Performance Computing, National University of Defense Technology, Changsha, 410073, China
- Canqun Yang
  - College of Computer, National University of Defense Technology, Changsha, 410073, China